NLP De Luxe - Challenges for Natural Language Processing in Luxembourg

Author:	LOTHRITZ, Cedric
Document type:	doctoral thesis
Publication date:	2023
Publisher/Ed.:	Unilu - University of Luxembourg
Keywords:	NLP / natural language processing / linguistics / luxembourg / luxembourgish / multilingualism / fintech / language modeling / bert / named entity recognition / de-identification / anonymisation / chatbot / conversational ai / luxembert / low-resource / data augmentation / pre-training / Engineering / computing & technology / Computer science / Ingénierie / informatique & technologie / Sciences informatiques
Language:	English
Permalink:	https://search.fid-benelux.de/Record/base-29109237
Data source:	BASE; original catalogue
Link(s):	https://orbilu.uni.lu/handle/10993/54910

The Grand Duchy of Luxembourg is a small country in Western Europe that, despite its size, is an important global financial centre. Because its population is highly multilingual and one of its national languages, Luxembourgish, is regarded as a low-resource language, the country offers a wide variety of research opportunities in Natural Language Processing (NLP). This thesis discusses and addresses challenges in domain-specific and language-specific NLP, using the unique linguistic situation in Luxembourg as an extended case study. We focus on three main topics: (I) NLP challenges in the financial domain, specifically handling personal names in sensitive documents, (II) NLP challenges related to multilingualism, and (III) NLP challenges for low-resource languages, with Luxembourgish as the language of interest.
With regard to NLP challenges in the financial domain, we address the problem of finding and anonymising names in documents. First, we present an empirical study of the usefulness of Transformer-based deep learning models for Fine-Grained Named Entity Recognition. The study covers a wide array of domains, including the financial domain. We show that Transformer-based models, and BERT models in particular, yield the best performance on this task. We also show that performance depends strongly on the domain itself, regardless of the choice of model. Automatically detecting names in text documents in turn facilitates the anonymisation of those documents. However, anonymisation can distort data and degrade models that are built on that data. We therefore investigate how anonymising personal names affects the performance of deep learning models trained on a large number of NLP tasks. Based on our experiments, we establish which anonymisation strategy should be used to guarantee accurate NLP models.
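To make the idea of an anonymisation strategy concrete, the following is a minimal Python sketch, not the method used in the thesis. It assumes the name spans have already been detected (for example by an NER model) and are given as character offsets. The function name `anonymise` and the two strategy labels, `"placeholder"` and `"pseudonym"`, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of a detected name


def anonymise(text: str, spans: List[Span], strategy: str = "placeholder") -> str:
    """Replace each detected name span in `text` (illustrative sketch only).

    "placeholder": every name becomes the same generic token.
    "pseudonym":   each distinct name gets its own stable substitute, so
                   repeated mentions of one person stay linked.
    """
    pseudonyms: Dict[str, str] = {}
    pieces: List[str] = []
    cursor = 0
    for start, end in sorted(spans):
        name = text[start:end]
        if strategy == "placeholder":
            replacement = "[NAME]"
        elif strategy == "pseudonym":
            # Reuse the substitute if this name was seen before.
            replacement = pseudonyms.setdefault(name, f"Person{len(pseudonyms) + 1}")
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        pieces.append(text[cursor:start])  # text before the name, unchanged
        pieces.append(replacement)
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last name
    return "".join(pieces)


if __name__ == "__main__":
    sentence = "Anna met Marc. Anna left."
    name_spans = [(0, 4), (9, 13), (15, 19)]  # assumed output of a name detector
    print(anonymise(sentence, name_spans, "placeholder"))
    print(anonymise(sentence, name_spans, "pseudonym"))
```

The two strategies distort the text differently: the placeholder variant erases the distinction between people, whereas the pseudonym variant preserves who is mentioned where. This is the kind of trade-off that can influence models trained on the anonymised data.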
Regarding NLP challenges related to multilingualism, ...