Challenges in tagging and parsing spoken dialects of Dutch

Authors: Farasyn, Melissa
Ghyselen, Anne-Sophie
Van Keymeulen, Jacques
Breitbarth, Anne
Document type: Article
Publication date: 2022
Publisher: University of Konstanz
Keywords: tagging / parsing / dialects / Dutch / corpus / spoken dialects
Language: English
Permalink: https://search.fid-benelux.de/Record/base-28993976
Data source: BASE; original catalogue
Link(s): https://ojs.ub.uni-konstanz.de/hs/index.php/hs/article/view/92

This paper reports on the construction of a tagged and parsed pilot corpus of the southern Dutch dialects. The corpus aims to facilitate diachronic research into the syntax of Dutch: its dialects have retained many interesting (morpho)syntactic features, which can often be traced back either to changes that began in older stages of Dutch or to characteristics preserved from those stages. The discussion focuses mainly on initial test results obtained by applying existing NLP tools that were developed or optimised for POS tagging and parsing standard Dutch. We report on initial tests on our data with Frog, TreeTagger and Alpino. We discuss both the general challenges of working with spoken, unstandardised language and the specific (morpho)syntactic problems that arise in POS tagging and parsing the southern Dutch dialects. The challenges and solutions presented in this pilot study will inform our choice of the NLP tools we will use or adapt for the development of a more extensive annotated corpus.