Dutch Historical Spelling Normalization for Parsing and Coreference Resolution

Authors: Postma, Priscilla
Donker, Rina
Stam, Ruth
Roorda, Athalia
van Cranenburgh, Andreas
van Noord, Gertjan
Document type: conferenceObject
Publication date: 2022
Language: English
Permalink: https://search.fid-benelux.de/Record/base-27057986
Data source: BASE; original catalog
Link(s) : http://hdl.handle.net/11370/0b2d486f-fcf7-4fa3-b7ce-c8cbb2103d16

Non-canonical language can be handled in an NLP pipeline either by normalizing the input (e.g., MoNoise; van der Goot & van Noord, CLIN Journal 2017) or by adapting the pipeline to the domain (e.g., Hupkes & Bod, LREC 2016); we focus on the former. MoNoise shows that normalization is effective for social media language. We consider a different domain: Dutch literature from Project Gutenberg. We work with the 9 fragments that make up the OpenBoek corpus (van den Berg et al., CLIN 2021). The fragments consist of 10,000+ tokens from texts first published 1860-1920, both translated and originally Dutch.

MoNoise consists of several modules: a lookup table, automatic spelling correction (aspell), and word embeddings; we aim to explore these techniques on our data in future work. Here we report results of a rule-based approach, implemented as a sed script (i.e., regular expressions), for normalizing frequently occurring non-standard spellings. The output consists of instructions to the Alpino parser (van Noord, TALN 2006) to treat words with non-canonical orthography as if they occurred with modern spelling. The advantage of this approach is that the resulting parse trees contain the original tokens, so existing annotation layers (such as coreference) do not have to be re-aligned.

Consider the following sentence from Couperus, Eline Vere (ch. I, § II):

    18-1|- Is het [ @alt zo zoo ] goed ? vroeg zij met bevende stem , [ @alt ene eene ] , van te voren bestudeerde poze aannemende .

Here [ @alt zo zoo ] indicates that the original token zoo should be treated as zo. Besides doubled vowels, other frequent spelling normalizations are de/den, zei/zeide, and mensen/menschen. When multiple alternatives are given, the parser treats the input as a lattice and uses the sequence of tokens that generates the most likely parse.
Parse trees for the above sentence show that the automatic spelling normalization is not perfect (the correct normalization of eene is een with POS lid rather than ene), but it does lead to a correct bracketing ...
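
The annotation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' sed script: the rule table `RULES` and the helper names `normalize` and `annotate` are hypothetical, the rules cover only the spelling patterns named in the abstract, and the `[ @alt modern original ]` layout is copied from the example sentence.

```python
import re

# Hypothetical rule table, NOT the authors' sed script. Each entry is a
# regular expression applied to a lowercased token, plus its replacement.
RULES = [
    (re.compile(r"oo$"), "o"),                        # zoo -> zo
    (re.compile(r"(ee|oo)(?=[^aeiou][aeiou])"),       # eene -> ene
     lambda m: m.group(1)[0]),
    (re.compile(r"sch(?=e[nr]?$)"), "s"),             # menschen -> mensen
    (re.compile(r"^zeide$"), "zei"),                  # zeide -> zei
    (re.compile(r"^den$"), "de"),                     # den -> de
]


def normalize(token):
    """Return a modernized spelling, or the token itself if no rule applies."""
    modern = token.lower()
    for pattern, repl in RULES:
        modern = pattern.sub(repl, modern)
    if modern == token.lower():
        return token  # no rule fired; keep the token untouched
    if token[:1].isupper():
        modern = modern[:1].upper() + modern[1:]  # restore initial capital
    return modern


def annotate(line):
    """Wrap normalized tokens as '[ @alt modern original ]', keeping the key."""
    key, sep, sentence = line.partition("|")
    if not sep:  # no 'key|' prefix present
        key, sentence = "", line
    out = []
    for tok in sentence.split():
        modern = normalize(tok)
        out.append(f"[ @alt {modern} {tok} ]" if modern != tok else tok)
    return key + sep + " ".join(out)


raw = ("18-1|- Is het zoo goed ? vroeg zij met bevende stem , eene , "
       "van te voren bestudeerde poze aannemende .")
print(annotate(raw))
```

Because the original token stays inside the bracket, token offsets in the parser output still line up with existing annotation layers. Like the script reported in the abstract, this sketch rewrites eene to ene rather than to the correct een; choosing between such candidates would require context that purely token-level rules do not have.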