Benchmarking the Simplification of Dutch Municipal Text

Authors: Vlantis, Daniel; Gornishka, Iva; Wang, Shuai
Document type: Contribution to periodical
Publication year: 2024
Publisher: European Language Resources Association (ELRA)
Keywords: Dutch municipal text / GPT / Large language models / Pivot-based text simplification
Language: English
Permalink: https://search.fid-benelux.de/Record/base-28637052
Data source: BASE; original catalog
Link: https://research.vu.nl/en/publications/14adedd4-bb0e-49ed-83fc-0db8edf6a34d

Text simplification (TS) makes written information more accessible to all people, especially those with cognitive or language impairments. Despite much progress in TS due to advances in NLP technology, the lack of data for low-resource languages remains a bottleneck. Dutch is one such language: it lacks a monolingual simplification corpus. In this paper, we use English as a pivot language for the simplification of Dutch medical and municipal text. We experiment with training-data augmentation and corpus choice for this pivot-based approach. We compare the results to a baseline and an end-to-end LLM approach using the GPT-3.5 Turbo model. Our evaluation shows that, while we can substantially improve the results of the pivot pipeline, the zero-shot end-to-end GPT-based simplification performs better on all metrics. Our work shows how an existing pivot-based pipeline can be improved for simplifying Dutch medical text. Moreover, we provide baselines for comparison in the domain of Dutch municipal text and make our corresponding evaluation dataset publicly available.
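
The pivot idea named in the abstract can be illustrated with a minimal sketch: translate the Dutch text into English, simplify it there, and translate the result back. This is not the authors' pipeline. The function name pivot_simplify, the lookup helper, and the toy dictionaries are hypothetical stand-ins for real translation and simplification models, chosen only so that the example runs without external dependencies.

```python
from typing import Callable, Dict

# Hypothetical alias: any function that maps one text to another.
TextFn = Callable[[str], str]


def pivot_simplify(text_nl: str, nl_to_en: TextFn,
                   simplify_en: TextFn, en_to_nl: TextFn) -> str:
    """Simplify Dutch text by routing it through English as a pivot."""
    english = nl_to_en(text_nl)             # 1. Dutch -> English
    simple_english = simplify_en(english)   # 2. simplify in English
    return en_to_nl(simple_english)         # 3. English -> Dutch


def lookup(table: Dict[str, str]) -> TextFn:
    """Toy stand-in for a model: look the text up, else return it unchanged."""
    return lambda text: table.get(text, text)


# Toy data (assumed for illustration only, not taken from the paper).
NL_EN = {"de gemeente verstrekt een vergunning":
         "the municipality issues a permit"}
SIMPLE_EN = {"the municipality issues a permit":
             "the city gives a permit"}
EN_NL = {"the city gives a permit":
         "de stad geeft een vergunning"}

result = pivot_simplify("de gemeente verstrekt een vergunning",
                        lookup(NL_EN), lookup(SIMPLE_EN), lookup(EN_NL))
print(result)
```

Passing the three stages as functions keeps the sketch independent of any particular model, so each stage could be swapped for a real system. An end-to-end approach like the one the abstract compares against would replace all three stages with a single call.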