OCR error detection and post-correction with Word2vec and BERTje on Dutch historical data ...

High-quality OCR (Optical Character Recognition) output has many benefits: documents become more accessible to readers, and NLP tasks can make better use of the data. However, for reasons such as the physical condition of the documents, the OCR output of historical documents contains a significant number of errors. This study focuses on detecting and correcting these errors after the OCR process has taken place, specifically for Dutch historical data. It compares the performance of two methods often used for this task in recent years: word2vec and BERT. While BERT has bee...

Authors:	Nynke van 't Hof; Vera Provatorova; Mirjam Cuper; Evangelos Kanoulas
Document type:	Scholarly article
Publication date:	2022
Publisher:	Zenodo
Keywords:	OCR post-correction / Natural Language Processing / Word Embedding Models / historical data / digital heritage
Language:	English
Permalink:	https://search.fid-benelux.de/Record/base-28981451
Data source:	BASE; original catalog
Link(s):	https://dx.doi.org/10.5281/zenodo.6574957
