Parsing migration paths from administrative language. The case of Luxembourg files of migration administration

Author: Bunout, Estelle
Document type: conferencePaper
Publication date: 2024
Publisher: Zenodo
Language: English
Permalink: https://search.fid-benelux.de/Record/base-28688987
Data source: BASE; original catalogue
Link(s): https://doi.org/10.5281/zenodo.10573580

Administrative archives produce serial documents that lend themselves to data extraction because of their regular format. Progress in both OCR/HTR and layout detection opens up these large collections to digitisation pipelines that turn analogue forms into computer-readable data. However, these forms also contain unstructured text that needs specific processing to be parsed. We propose to discuss our experience in extracting information from administrative forms in which civil servants transcribed the previous ten years of migration of declarants arriving in Luxembourg in the 20th century. These paragraphs contain precious information and follow a regular structure and vocabulary, referring to information located either in another part of the form or in other forms of the same archival fonds.
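As an illustration of how a paragraph with a regular structure and vocabulary can be turned into records, here is a minimal sketch. It is not the method used in the paper: the entry format (`<start year>-<end year> <place>`, entries separated by semicolons), the sample text, and the names `Stay` and `parse_stays` are all assumptions made for this example. The real forms may follow a different layout and language.

```python
import re
from dataclasses import dataclass

# Hypothetical entry format: "<start year>-<end year> <place>",
# with entries separated by ";". This is an assumption, not the
# layout of the actual Luxembourg forms.
STAY_PATTERN = re.compile(
    r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})\s+(?P<place>[^;]+)"
)


@dataclass
class Stay:
    start: int
    end: int
    place: str


def parse_stays(paragraph: str) -> list[Stay]:
    """Extract every (start, end, place) entry found in the paragraph."""
    stays = []
    for match in STAY_PATTERN.finditer(paragraph):
        # Drop surrounding whitespace and a trailing full stop from the place.
        place = match["place"].strip().rstrip(".")
        stays.append(Stay(int(match["start"]), int(match["end"]), place))
    return stays


# Invented sample text, for illustration only.
sample = "1921-1925 Metz; 1925-1930 Trier; 1930-1931 Esch-sur-Alzette."
stays = parse_stays(sample)
for stay in stays:
    print(stay)
```

A pattern-based approach like this only works while the clerks' wording stays regular. Cross-references to other parts of the form or to other forms in the same fonds would need a separate lookup step.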