Content providers, Researchers, Technology and the Crowd: Discovering the Best Possible Collaborative Strategies for Datafication and Publication of a Dutch Historical Newspaper Corpus

Authors:	Romein, C.A.; van Zundert, Joris J.; Depuydt, Katrien; van der Sijs, Nicoline; de Does, Jesse; de Jong, Ruud; de Bonth, Roland; Fannee, Mathieu
Document type:	conferenceObject
Publication date:	2023
Keywords:	HTR / Newspapers
Language:	English
Permalink:	https://search.fid-benelux.de/Record/base-27026358
Data source:	BASE; original catalogue
Link(s):	https://pure.knaw.nl/portal/en/publications/433ac99d-3a87-41d7-bc82-c2e921450e46

The National Library of the Netherlands (KB) makes over 130 million pages of newspapers, journals and books accessible via Delpher, and the collection is still growing. Delpher lets users search the texts and create selections based on metadata, and it highlights results in the images (which is helpful because newspaper pages, for instance, can be a complex mix of articles). The digitisation and datafication are outsourced, and omnifont OCR[1] is used. The advantage of this strategy is that large amounts of material can be made available quickly and at low cost. The quality of the OCR for challenging material, however, is very poor and unsuitable for most types of research. Moreover, the data is not structured in a way that suits the type of retrieval and research done in digital humanities and corpus linguistics. A good example of challenging material is the collection of 17th-century newspapers. They are an important source for historical (linguistic) research. However, the poor quality of the OCRed text, the inadequate structuring, and the available metadata in the KB version make such research virtually impossible. Van der Sijs therefore proposed to work with volunteers[2] on a better datafication of the text and on improved structuring and metadata. The idea was not only to deliver a better version to the KB for Delpher, but also to make the newspapers accessible in an alternative environment, suitable for both digital humanities researchers and corpus linguists. The work was executed in two stages. A thorough evaluation of the workflow led to an entirely new approach for the second stage. In our contribution, we will present both approaches and discuss the problems we encountered and the solutions we devised. For both stages the starting point was a collection of images, the available OCR in ALTO format, and associated metadata. In the ALTO files, region segmentation had already been corrected and article segmentation had been applied.
The quality of the OCR was too low to serve as a basis ...