The Datafication of Early Modern Ordinances
Author(s): | Romein, C A; de Gruijter, M; Veldhoen, S F |
---|---|
Document type: | Article |
Publication date: | 2020 |
Series/Periodical: | Romein, C A, de Gruijter, M & Veldhoen, S F 2020, 'The Datafication of Early Modern Ordinances', DH Benelux Journal, vol. 2. <http://journal.dhbenelux.org/journal/issues/002/article-23-romein/article-23-romein.pdf> |
Keywords: | Early Modern Printed Ordinances / Text recognition / Text segmentation / Dutch Gothic Print / Transkribus / Annif / Machine Learning / Categorisation |
Language: | English |
Permalink: | https://search.fid-benelux.de/Record/base-28995502 |
Data source: | BASE; original catalogue |
Powered By: | BASE |
Link(s) : | https://pure.knaw.nl/portal/en/publications/97a53282-c1e9-448c-8565-bb28c8ef27fe |
The project Entangled Histories used early modern printed normative texts. Most of these sources are set in Dutch Gothic print, which optical character recognition (OCR) software long struggled to read. Using the Handwritten Text Recognition suite Transkribus (v.1.07-v.1.10), we reprocessed the original scans that had poor-quality OCR and obtained a Character Error Rate (CER) much lower than our initial expectation of <5%. This significant improvement makes it possible to search 75,000 pages of printed normative texts from the Seventeen Provinces, also known as the Low Countries. The books of ordinances are compilations; thus, segmentation is essential to retrace the individual norms. We applied and compared four different methods: ABBYY, P2PaLA, NLE Document Recognition and a custom rule-based tool that combines lexical features with font recognition. Each text (norm) in the books concerns one or more topics or categories. A selection of normative texts was manually labelled with internationally used (hierarchical) categories. Using Annif, a tool for automatic subject indexing, we trained a model to assign these categories automatically. Automatically generated metadata makes it easier to find relevant texts and allows further analysis. Together, text recognition, segmentation and categorisation of norms constitute the datafication of the Early Modern Ordinances. Our experiments in automating these steps have resulted in a provisional process for the datafication of this and similar collections.
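The Character Error Rate mentioned in the abstract is conventionally defined as the character-level edit distance between a recognised text and its ground-truth transcription, divided by the length of the ground truth. The sketch below illustrates that standard definition only; it is not the Transkribus implementation, and the function names and the sample strings are assumptions made for illustration.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and recognised output, not taken from the corpus.
    ground_truth = "placcaet"
    recognised = "placcaat"
    print(f"CER: {cer(ground_truth, recognised):.3f}")  # one substitution in 8 characters: 0.125
```

Under this definition, a CER below 5% means fewer than five character edits per hundred ground-truth characters.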