Weigh your words--memory-based lemmatization for Middle Dutch

Authors: Kestemont, Mike; Daelemans, Walter; De Pauw, Guy
Document type: Text
Publication date: 2010
Publisher: Oxford University Press
Keywords: Original Articles
Language: English
Permalink: https://search.fid-benelux.de/Record/base-27024335
Data source: BASE; original catalog
Link(s): http://llc.oxfordjournals.org/cgi/content/short/25/3/287

This article deals with the lemmatization of Middle Dutch literature. Like any other medieval corpus, this text collection is characterized by enormous spelling variation, which makes computational analysis of the data difficult. Lemmatization is therefore an essential preprocessing step in many applications, since it abstracts away from superficial textual variation, for instance in spelling. The data we work with is the Corpus-Gysseling, which contains all surviving Middle Dutch literary manuscripts dated before 1300 AD. We present a language-independent system that can ‘learn’ intra-lemma spelling variation. We describe a series of experiments with this system using Memory-Based Machine Learning, and propose two solutions for the lemmatization of our data: the first attempts to generate new spelling variants, while the second seeks to implement a novel string distance metric to better detect spelling variants. The latter system reranks candidates suggested by a classic Levenshtein distance, leading to a substantial gain in lemmatization accuracy. This result is encouraging and represents a substantial step forward in the computational study of Middle Dutch literature. Because of their language-independent nature, our techniques may be of interest to other research domains as well.
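
To make the baseline concrete, the following is a minimal Python sketch of candidate retrieval with a classic Levenshtein distance, the step whose output the abstract says is later reranked. It is not the authors' system: the function names, the `k` parameter, and the toy lexicon are assumptions made for illustration, the spellings are hand-made examples rather than entries from the Corpus-Gysseling, and the novel distance metric used for reranking is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca by cb, or match
            ))
        prev = curr
    return prev[-1]


# Hypothetical toy lexicon mapping attested spellings to lemmas
# (illustrative only, not taken from the Corpus-Gysseling).
LEXICON = {
    "coninc": "koning",
    "coninghinne": "koningin",
    "ridder": "ridder",
    "riddere": "ridder",
    "lant": "land",
}


def candidates(token: str, lexicon: dict = LEXICON, k: int = 3) -> list:
    """Return the k known spellings closest to token as (distance, form, lemma)."""
    scored = sorted(
        (levenshtein(token, form), form, lemma)
        for form, lemma in lexicon.items()
    )
    return scored[:k]


if __name__ == "__main__":
    # An unseen spelling variant is mapped to the lemma of its nearest known form.
    for distance, form, lemma in candidates("conync"):
        print(distance, form, lemma)
```

In this sketch an unseen token simply inherits the lemma of its nearest known spelling. A reranking stage of the kind the abstract describes would take the list returned by `candidates` and reorder it with a different scoring function before choosing a lemma.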