An exploration of language identification techniques for the Dutch Folktale Database

Authors: Trieschnigg, Dolf
Hiemstra, Djoerd
Theune, Mariët
Jong, Franciska de
Meder, Theo
Document type: article in monograph or in proceedings
Publication date: 2012
Publisher: LREC organization
Language: unknown
Permalink: https://search.fid-benelux.de/Record/base-27453834
Data source: BASE; original catalog
Link(s) : http://purl.utwente.nl/publications/82013

The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages, including (Middle and 17th-century) Dutch, Frisian, and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that, in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low. The studied dataset, consisting of over 39,000 documents in 16 languages and dialects, is available on request for follow-up research.