Is your OCR good enough? Probably so. An assessment of the impact of OCR quality on downstream tasks for Dutch texts
Author: | |
---|---|
Document type: | conferencePaper |
Publication date: | 2021 |
Publisher/Ed.: | Zenodo |
Keywords: | OCR / Dutch / Natural Language Processing / Machine Learning / Text Analysis |
Language: | English |
Permalink: | https://search.fid-benelux.de/Record/base-29049991 |
Data source: | BASE; original catalogue |
Link(s): | https://doi.org/10.5281/zenodo.4843629 |
We conduct an assessment of the impact of OCR quality in collections in Dutch, considering two tasks: document classification and document clustering via topic modelling. We find that for both topic modelling (using LDA) and document classification (using a variety of methods, including deep neural networks), working with an OCRed version of a corpus does not in general compromise results. On the contrary, it may sometimes lead to better results. While more work is needed, including on evaluating different datasets and methods, our results further confirm previous work in suggesting that the quality of existing OCR is often sufficient to apply machine learning techniques.
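The abstract does not say how OCR quality was quantified. As a minimal sketch, assuming quality is measured as character error rate (CER, the edit distance between a ground-truth transcription and its OCR output, divided by the ground-truth length), the comparison could start from something like the following. The function names and the example sentence are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # drop a character from a
                current[j - 1] + 1,      # add a character from b
                previous[j - 1] + cost,  # substitute (or match at no cost)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_text: str) -> float:
    """CER: edit distance normalised by ground-truth length (assumed metric)."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


# Hypothetical example: one misread character ("e" read as "c").
truth = "de kat zit op de mat"
ocr = "de kat zit op dc mat"
cer = character_error_rate(truth, ocr)
```

With a score like this per document, one could group a corpus by OCR quality and check whether classification or topic-modelling results differ between groups, which is the kind of comparison the abstract describes.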