Automatically Interpreting Dutch Tombstone Inscriptions

Authors: Bos, Johan
Marocico, Cristian
Tatar, A. Emin
Mzayek, Yasmin
Document type: Article
Publication date: 2022
Series/Periodical: Bos, J, Marocico, C, Tatar, A E & Mzayek, Y 2022, 'Automatically Interpreting Dutch Tombstone Inscriptions', Computational Linguistics in the Netherlands Journal, vol. 12, pp. 253–267.
Language: English
Permalink: https://search.fid-benelux.de/Record/base-27445351
Data source: BASE; original catalog
Link(s) : https://hdl.handle.net/11370/015958d5-d27d-4e15-b571-2f40e7cecab7

Digital preservation of tombstones is important for cultural heritage, but it is a costly process. We propose a way of automatically reading tombstone inscriptions with the aim of assisting human annotators and data curators. Our method comprises a pipeline of dedicated components where the input is an image of a tombstone, and the output is an interpretation (represented as a directed acyclic graph) comprising the names of the deceased, dates of birth and death, places of birth and death, and biblical references. The three main components in the pipeline are (1) Label Detection, (2) Optical Character Recognition (OCR) and (3) Semantic Interpretation. The Label Detection component uses an off-the-shelf deep learning algorithm trained on tombstone images to detect the bounding boxes and labels for the entities mentioned above (names, locations, and dates). The OCR component then takes each of the detected labelled regions and recognizes the text it contains. Finally, the interpretation component performs post-processing, normalizes dates and places, and puts all the information together into a meaning representation coded as a directed acyclic graph. Several challenges need to be addressed, such as correcting OCR errors, interpreting unusual segmentation of words, recognizing abbreviations, dealing with multiple languages, and accommodating different notational variants of dates. The system, the first of its kind, is developed and evaluated with the help of an annotated corpus of 1,100 tombstone inscriptions. Evaluation is carried out by calculating the graph overlap between the system output and the gold standard. The results are encouraging: the system reaches an F-score of 67%, compared with 40% for a random baseline.
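
The graph-overlap evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the authors match graphs, so this sketch assumes that each interpretation is flattened into a set of (source, relation, target) edge triples and that only exact matches count. The function name `graph_overlap_f1`, the node identifiers, the relation names, and the example values are all hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# An edge of the interpretation graph: (source node, relation, target node).
# This triple encoding is an assumption made for illustration only.
Triple = Tuple[str, str, str]


def graph_overlap_f1(system: Set[Triple], gold: Set[Triple]) -> float:
    """Return the F-score of exactly matching edges between two graphs."""
    matched = len(system & gold)
    if matched == 0:
        # Also covers the case where either graph is empty.
        return 0.0
    precision = matched / len(system)  # share of system edges that are correct
    recall = matched / len(gold)       # share of gold edges that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold-standard interpretation of one tombstone.
gold = {
    ("person1", "name", "Jan Jansen"),
    ("person1", "born_on", "1850-03-12"),
    ("person1", "died_on", "1921-11-02"),
    ("person1", "died_in", "Groningen"),
}

# Hypothetical system output with one wrongly normalized death date.
system = {
    ("person1", "name", "Jan Jansen"),
    ("person1", "born_on", "1850-03-12"),
    ("person1", "died_on", "1921-11-07"),
    ("person1", "died_in", "Groningen"),
}

score = graph_overlap_f1(system, gold)
print(f"F-score: {score:.2f}")
```

In this example three of the four edges agree, so precision and recall are both 0.75. Exact matching is the strictest possible choice: a single misread digit in a date makes the whole edge count as wrong, which is one reason OCR correction and date normalization matter for the final score.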