Bag & Tag'em - A new Dutch stemmer

We propose a novel stemming algorithm that is both robust and accurate compared to state-of-the-art solutions, yet addresses several of the problems that current stemmers face in the Dutch language. The main issue is that most current stemmers cannot handle 3 rd person singular forms of verbs and many irregular words and conjugations, unless a (nearly) brute-force approach is used. Our algorithm combines a new tagging module with a stemmer that uses tag-specific sets of rigid rules: the Bag & Tag'em (BT) algorithm. The tagging module is developed and evaluated using three algorithms: Multi... Mehr ...

Verfasser: Jonker, Anne
de Ruijt, Corné
de Gruijl, Jornt R.
Dokumenttyp: contributionToPeriodical
Erscheinungsdatum: 2020
Verlag/Hrsg.: European Language Resources Association (ELRA)
Schlagwörter: Dutch / PoS tagging / Stemming
Sprache: Englisch
Permalink: https://search.fid-benelux.de/Record/base-27074557
Datenquelle: BASE; Originalkatalog
Powered By: BASE
Link(s) : https://research.vu.nl/en/publications/341e40c7-466a-473c-8436-96cd71ae7eb3

We propose a novel stemming algorithm that is both robust and accurate compared to state-of-the-art solutions, yet addresses several of the problems that current stemmers face in the Dutch language. The main issue is that most current stemmers cannot handle 3 rd person singular forms of verbs and many irregular words and conjugations, unless a (nearly) brute-force approach is used. Our algorithm combines a new tagging module with a stemmer that uses tag-specific sets of rigid rules: the Bag & Tag'em (BT) algorithm. The tagging module is developed and evaluated using three algorithms: Multinomial Logistic Regression (MLR), Neural Network (NN) and Extreme Gradient Boosting (XGB). The stemming module's performance is compared with that of current state-of-the-art stemming algorithms for the Dutch Language. Even though there is still room for improvement, the new BT algorithm performs well in the sense that it is more accurate than the current stemmers and faster than brute-force-like algorithms. The code and data used for this paper can be found at: https://github.com/Anne-Jonker/Bag-Tag-em.