6000 ground truth of VOC and notarial deeds 3.000.000 HTR of VOC, WIC and notarial deeds

Author:	Liesbeth Keijser
Document type:	other
Publication date:	2024
Publisher:	Zenodo
Keywords:	Transcriptions / Verenigde Oost-Indische Compagnie / West-Indische Compagnie / Notarial deeds / Nationaal Archief / Noord-Hollands Archief / Transkribus
Language:	unknown
Permalink:	https://search.fid-benelux.de/Record/base-29506496
Data source:	BASE; original catalog
Link(s):	https://doi.org/10.5281/zenodo.11126658

The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to semi-automatically transcribe 2 million pages of old Dutch texts. The transcribed archives are 17th- and 18th-century documents from the Dutch East India Company (VOC) and 19th-century notarial deeds from Noord-Hollands Archief and other archives in the provinces.
To train the HTR software, a team produced transcriptions of approximately 6000 scans, randomly selected from the dataset. These transcriptions were used to train a model that recognizes more than 90% of the characters correctly. Transkribus then transcribed the 2 million scans automatically using the trained model. The Transkribus HTR+ model trained for the text recognition is called "IJsberg". More information about the model can be found here (see the chapter "Dutch Handwriting"). However, the Transkribus team retrained the model with PyLaia technology, which improved on the HTR+ model. This PyLaia model is not publicly available.
Later on, 1 million extra scans concerning the West India Company (WIC) were transcribed automatically without adding extra ground truth or training. These archives are from the 17th and 18th century.
The Loghi Handwritten Text Recognition Toolkit has been added to the arsenal of the National Archives of the Netherlands. 1.05.11.14, Notarissen Suriname tot 1828 [digitaal duplicaat], has been processed with this tooling.
The datasets published in Zenodo contain the ground truth (scans in JPG, transcriptions in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview in the Zenodo record; the files themselves can be downloaded at the bottom of that page. For more information on how the National Archives of the Netherlands innovates on digital accessibility, click here. For open data access to scans and inventories of the National Archives, click here.
Disclaimer: due to the variety of languages used and the poor state of the documents, the HTR results of ...
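
The abstract states that the trained model recognizes more than 90% of the characters correctly and that both the ground truth and the HTR results are published as PAGE XML. The record does not say how that figure was computed. The following is a minimal sketch under the assumption that character accuracy is defined as 1 minus the character error rate (edit distance divided by the length of the ground-truth text), which is a common convention. The function names are hypothetical and not part of any published tooling. The only assumption about the files is that each PAGE XML `TextLine` carries its text in a `TextEquiv`/`Unicode` child.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def page_xml_lines(xml_text: str) -> list[str]:
    """Return the line-level transcription of one PAGE XML document."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Use only the TextEquiv that is a direct child of the TextLine,
        # so any word-level TextEquiv elements are not counted twice.
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            for grandchild in child:
                if local_name(grandchild.tag) == "Unicode":
                    lines.append(grandchild.text or "")
    return lines


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (edit distance / reference length)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)
```

To compare one page, read the ground-truth file and the HTR result for the same scan, join the lines returned by `page_xml_lines` with newlines, and pass both strings to `character_accuracy`. Because the denominator is the ground-truth length, a hypothesis with many spurious insertions can yield a negative value.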