Is your OCR good enough? Probably so. An assessment of the impact of OCR quality on downstream tasks for Dutch texts

Authors: Todorov, Konstantin
Cuper, Mirjam
Colavizza, Giovanni
Document type: Conference paper
Publication date: 2021
Keywords: OCR / Dutch / Natural Language Processing / Machine Learning / Text Analysis
Language: English
Permalink: https://search.fid-benelux.de/Record/base-27078324
Data source: BASE; original catalog
Link: https://zenodo.org/record/4843629

We conduct an assessment of the impact of OCR quality in collections in Dutch, considering two tasks: document classification and document clustering via topic modelling. We find that for both topic modelling (using LDA) and document classification (using a variety of methods, including deep neural networks), working with an OCRed version of a corpus does not in general compromise results. On the contrary, it may sometimes lead to better results. While more work is needed, including on evaluating different datasets and methods, our results further confirm previous work in suggesting that the quality of existing OCR is often sufficient to apply machine learning techniques.