Evaluating the performance and usability of a Tesseract-based OCR workflow on French-Dutch bilingual historical sources

Author: Van den broeck, Alec
Dejaeghere, Tess
Foket, Lise
Ducatteeuw, Vincent
Landuyt, Julie
Birkholz, Julie
Verbruggen, Christophe
Lamsens, Frederic
Chambers, Sally
Document type: lecture
Publication date: 2022
Language: English
Permalink: https://search.fid-benelux.de/Record/base-27078559
Data source: BASE; original catalogue
Link(s): https://zenodo.org/record/6602981

The study of texts using a qualitative approach remains the dominant modus operandi in humanities research (D. Nguyen et al., 2020). While most humanities researchers emphasize the critical examination of texts, digital research methodologies are gradually being adopted as complementary options (Levenberg et al., 2018). These computational practices allow researchers to process, aggregate, and analyze large quantities of text. Analytical techniques can help humanities scholars uncover principles and patterns that were previously hidden, or identify salient sources for further qualitative research (Bod, 2013; Aiello & Simeone, 2019). However, to support these and more advanced use cases such as Natural Language Processing (NLP), sources must be digitized and transformed into a machine-readable format through Optical Character Recognition (OCR) (Lopresti, 2009). Although OCR software is frequently used to convert analogue sources into digital texts, off-the-shelf OCR tools are usually poorly adapted to historical sources, leading to errors in the transcribed text (Martínek et al., 2020; Nguyen et al., 2021; Smith & Cordell, 2018). Another disadvantage of these models is their high susceptibility to noise, which results in relatively low text detection accuracy. Methods of digital text analysis have the potential to further expand the humanities (Blevins & Robichaud, 2011; Kuhn, 2019; Nguyen et al., 2021). However, as OCR quality has a profound impact on these methods, it is important that OCR-generated text is as accurate as possible to avoid bias (Traub et al., 2015; van Strien et al., 2020). Adapting OCR systems to distinct historical sources is not only expensive and time-consuming, but the technical knowledge required to (re)train OCR models is often perceived as a hurdle by humanists (Nguyen et al., 2021; Smith & Cordell, 2018). Consequently, research efforts are often geared towards improving the output of off-the-shelf OCR tools through a process of error analysis and ...
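
To make the workflow concrete, the following is a minimal Python sketch of the kind of pipeline the lecture describes: running Tesseract over a scanned page with both the French and Dutch language models enabled, then quantifying transcription quality against a hand-corrected transcript via the character error rate (CER). This is an illustration under stated assumptions, not the authors' actual implementation: it assumes the pytesseract wrapper and the fra/nld traineddata files are installed, and the file names are placeholders.

from PIL import Image
import pytesseract

def ocr_page(image_path):
    # 'fra+nld' tells Tesseract to load both the French and Dutch models,
    # letting it recognize bilingual pages in a single pass
    return pytesseract.image_to_string(Image.open(image_path), lang="fra+nld")

def cer(reference, hypothesis):
    # Character error rate: Levenshtein edit distance / reference length,
    # computed with a standard dynamic-programming table (one row at a time)
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

if __name__ == "__main__":
    text = ocr_page("scan_0001.png")  # hypothetical scan of a bilingual page
    with open("scan_0001.gt.txt", encoding="utf-8") as f:
        gold = f.read()  # hand-corrected ground-truth transcription
    print(f"CER: {cer(gold, text):.3f}")

A CER of this kind, averaged over a sample of ground-truthed pages, is the usual basis for the error analysis mentioned above: it allows off-the-shelf and adapted OCR models to be compared on the same historical material.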