Weigh your words--memory-based lemmatization for Middle Dutch
Author: | |
---|---|
Document type: | TEXT |
Publication date: | 2010 |
Publisher: | Oxford University Press |
Keywords: | Original Articles |
Language: | English |
Permalink: | https://search.fid-benelux.de/Record/base-28992641 |
Data source: | BASE; original catalogue |
Powered by: | BASE |
Link(s): | http://llc.oxfordjournals.org/cgi/content/short/25/3/287 |
This article deals with the lemmatization of Middle Dutch literature. This text collection, like any other medieval corpus, is characterized by enormous spelling variation, which makes computational analysis of this kind of data difficult. Lemmatization is therefore an essential preprocessing step in many applications, since it allows abstraction from superficial textual variation, for instance in spelling. The data we work with is the Corpus Gysseling, which contains all surviving Middle Dutch literary manuscripts dated before 1300 AD. In this article we present a language-independent system that can ‘learn’ intra-lemma spelling variation. We describe a series of experiments with this system, using Memory-Based Machine Learning, and propose two solutions for the lemmatization of our data: the first procedure attempts to generate new spelling variants, while the second implements a novel string distance metric to better detect spelling variants. The latter system reranks candidates suggested by a classic Levenshtein distance, leading to a substantial gain in lemmatization accuracy. This result is encouraging and represents a significant step forward in the computational study of Middle Dutch literature. Because of their language-independent nature, our techniques may be of interest to other research domains as well.
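To make the abstract's core idea concrete, the sketch below illustrates the general strategy of retrieving lemma candidates by Levenshtein distance and then reranking them. It is not the authors' system: the toy lexicon, the shared-prefix reranking heuristic, and the `lemmatize` function are hypothetical stand-ins, assuming only a plain edit-distance lookup over known spelling-to-lemma pairs.

```python
# Illustrative sketch only: a toy lemmatizer that retrieves lemma candidates
# for an unseen spelling via Levenshtein distance and reranks them with a
# simple, invented heuristic (shared prefix length). The lexicon is a
# hypothetical stand-in, not data from the Corpus Gysseling.
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def shared_prefix(a: str, b: str) -> int:
    """Length of the common prefix, used here as a crude reranking signal."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


# Hypothetical lexicon: attested spellings mapped to their lemmas.
LEXICON = {
    "coninc": "koning",
    "conincs": "koning",
    "ridder": "ridder",
    "riddere": "ridder",
    "vrouwe": "vrouwe",
}


def lemmatize(token: str, max_dist: int = 2) -> Optional[str]:
    """Return the lemma of the closest attested spelling, or None if no
    candidate lies within max_dist edits."""
    scored = [(levenshtein(token, form), form, lemma)
              for form, lemma in LEXICON.items()]
    candidates = [c for c in scored if c[0] <= max_dist]
    if not candidates:
        return None
    # Rerank: smallest edit distance first, then longest shared prefix.
    candidates.sort(key=lambda c: (c[0], -shared_prefix(token, c[1])))
    return candidates[0][2]


if __name__ == "__main__":
    print(lemmatize("coninck"))   # -> 'koning' (1 edit from 'coninc')
    print(lemmatize("ridderen"))  # -> 'ridder' (1 edit from 'riddere')
```

In the approach described in the abstract, the reranking signal is learned from the data rather than hand-coded; the prefix heuristic above merely stands in for such a learned string distance.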