An Optical Character Recognition Software Benchmark for Old Dutch Texts on the EYRA Platform

Authors: Cuper, Mirjam
Mendrik, Adriënne M.
van Meersbergen, Maarten
Klaver, Tom
Pawar, Pushpanjali
Langedijk, Annette
Wilms, Lotte
Document type: Article
Publication date: 2020
Publisher: Zenodo
Keywords: OCR / historical texts / benchmark / quality / evaluation / replication / performance metrics
Language: English
Permalink: https://search.fid-benelux.de/Record/base-29049908
Data source: BASE; original catalog
Link(s): https://doi.org/10.5281/zenodo.3872918

Digitized collections of printed historical texts are important for research in Digital Humanities. However, obtaining high-quality machine-readable text with currently available Optical Character Recognition (OCR) methods remains a challenge. OCR quality is affected by old fonts, old printing techniques, ink bleed-through, paper quality, old spelling, multi-column layouts, and other factors. It is unclear which OCR methods perform best. We are therefore setting up a benchmark for evaluating the performance of OCR software on old Dutch texts. The benchmark is being set up on the EYRA benchmark platform (eyrabenchmark.net), developed by the Netherlands eScience Center and SURF.