Comparing Pre-Training Schemes for Luxembourgish BERT Models

Author(s): Lothritz, Cedric
Ezzini, Saad
Purschke, Christoph
Bissyandé, Tegawendé François D'Assise
Klein, Jacques
Olariu, Isabella
Boytsov, Andrey
Lefebvre, Clement
Goujon, Anne
Document type: conference paper
Publication date: 2023
Keywords: natural language processing / Luxembourgish / NLP / BERT / pre-training / language model / computational linguistics / datasets / low-resource language / LuxemBERT / Engineering / computing & technology / Computer science
Language: English
Permalink: https://search.fid-benelux.de/Record/base-27134188
Data source: BASE; original catalogue
Link(s): https://orbilu.uni.lu/handle/10993/55778

Abstract (peer reviewed): Despite the widespread use of pre-trained models in NLP, well-performing pre-trained models for low-resource languages are scarce. To address this issue, we propose two novel BERT models for the Luxembourgish language that improve on the state of the art. We also present an empirical study of both the performance and robustness of the investigated BERT models. We compare the models on a set of downstream NLP tasks and evaluate their robustness against different types of data perturbations. Additionally, we provide novel datasets to evaluate the performance of Luxembourgish language models. Our findings reveal that continuing to pre-train an already pre-trained (pre-loaded) model has a positive effect on both the performance and robustness of the fine-tuned models, and that initialising from the German GottBERT model yields higher performance, while the multilingual mBERT yields a more robust model. This study provides valuable insights for researchers and practitioners working with low-resource languages and highlights the importance of considering pre-training strategies when building language models.
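
Illustrative sketch (not part of the catalogue record or the paper itself): the pre-training scheme described in the abstract amounts to continuing masked-language-model pre-training from a pre-loaded checkpoint, such as the German GottBERT or multilingual mBERT, on Luxembourgish text. The snippet below is a minimal sketch of such continued pre-training with Hugging Face Transformers; the corpus file, checkpoint identifiers, and hyperparameters are illustrative assumptions, not the configuration used by the authors.

# Minimal sketch: continued masked-language-model (MLM) pre-training of a
# pre-loaded checkpoint on a Luxembourgish corpus, using Hugging Face Transformers.
# Checkpoint IDs, corpus path, and hyperparameters below are assumptions for
# illustration only, not the authors' actual setup.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumed checkpoint: German GottBERT; swap in "bert-base-multilingual-cased"
# for the mBERT-initialised variant.
checkpoint = "uklfr/gottbert-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical local corpus: one Luxembourgish sentence or paragraph per line.
raw = load_dataset("text", data_files={"train": "luxembourgish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masking: 15% of tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="luxembert-continued",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

The resulting checkpoint can then be fine-tuned on the downstream Luxembourgish tasks in the usual way; the choice of the pre-loaded starting model (German vs. multilingual) is what the abstract reports as trading off raw performance against robustness.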