Benchmarking zero-shot text classification for Dutch

Authors: De Langhe, Loic
Maladry, Aaron
Vanroy, Bram
De Bruyne, Luna
Singh, Pranaydeep
Lefever, Els
De Clercq, Orphée
Document type: Article
Publication date: 2024
Keywords: Computer. Automation / Linguistics
Language: English
Permalink: https://search.fid-benelux.de/Record/base-27449070
Data source: BASE; original catalogue
Link(s): https://hdl.handle.net/10067/2053620151162165141

Abstract: The advent and popularisation of Large Language Models (LLMs) have given rise to prompt-based Natural Language Processing (NLP) techniques which eliminate the need for large manually annotated corpora and for computationally expensive supervised training or fine-tuning. Zero-shot learning in particular presents itself as an attractive alternative to the classical train-development-test paradigm for many downstream tasks, as it provides a quick and inexpensive way of directly leveraging the knowledge implicitly encoded in LLMs. Despite the large interest in zero-shot applications within NLP as a whole, there is often no consensus on the methodology, analysis and evaluation of zero-shot pipelines. As a tentative step towards such a consensus, this work provides a detailed overview of available methods, resources, and caveats for zero-shot prompting within the Dutch language domain. At the same time, we present centralised zero-shot benchmark results on a large variety of Dutch NLP tasks using a series of standardised datasets. These tasks vary in subjectivity and domain, ranging from more social information extraction tasks (sentiment, emotion and irony detection for social media) to factual tasks (news topic classification and event coreference resolution). To ensure that the benchmark results are representative, we investigated a selection of zero-shot methodologies for a variety of state-of-the-art Dutch Natural Language Inference (NLI) models, Masked Language Models (MLM), and autoregressive language models. The output on each test set was compared to the best performance achieved with supervised methods. Our findings indicate that task-specific fine-tuning delivers superior performance on all but one task (emotion detection). In the zero-shot setting, large generative models used through prompting appear to outperform NLI models, which in turn perform better than the MLM approach.
Finally, we note several caveats and challenges tied to using zero-shot ...
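The NLI-based zero-shot approach the abstract benchmarks can be sketched as follows: each candidate label is rewritten as a hypothesis (e.g. "Deze tekst gaat over {label}."), the text serves as the premise, and the label whose hypothesis receives the highest entailment score wins. The sketch below is an illustration only: `entailment_score` is a toy word-overlap stand-in, and the Dutch template is an assumed example, not the paper's actual prompt; a real setup would query a Dutch or multilingual NLI model instead.

```python
# Sketch of NLI-style zero-shot classification: one hypothesis per
# candidate label, scored for entailment against the input text.
# NOTE: entailment_score is a TOY stand-in for a real NLI model.

def _tokens(s: str) -> list[str]:
    # Lowercase and strip trailing punctuation from each word.
    return [w.strip(".,;:!?") for w in s.lower().split()]

def entailment_score(premise: str, hypothesis: str) -> float:
    # Toy proxy for an NLI model's entailment probability:
    # overlap between hypothesis content words and the premise.
    prem = set(_tokens(premise))
    hyp = [w for w in _tokens(hypothesis) if len(w) > 3]
    return sum(w in prem for w in hyp) / max(len(hyp), 1)

def zero_shot_classify(text: str, labels: list[str],
                       template: str = "Deze tekst gaat over {}."):
    # Map each label to a hypothesis; the best-entailed label wins.
    scores = {lab: entailment_score(text, template.format(lab))
              for lab in labels}
    return max(scores, key=scores.get), scores
```

In practice, the stand-in scorer would be replaced by a model call (for instance, a Hugging Face zero-shot-classification pipeline backed by a multilingual XNLI-style model), but the label-to-hypothesis mapping shown here is the core mechanism of the NLI approach.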