The RTBF Corpus: a dataset of 750,000 Belgian French news articles published between 2008 and 2021

Authors: Escouflaire, Louis
Bogaert, Jérémie
Descampe, Antonin
Fairon, Cédrick
International Conference on Corpus Linguistics (JLC)
Document type: conferenceObject
Publication date: 2023
Keywords: corpus / RTBF / journalism / press corpus / French
Language: English
Permalink: https://search.fid-benelux.de/Record/base-28876868
Data source: BASE; original catalog
Link(s) : http://hdl.handle.net/2078.1/276580

In this paper, we introduce the RTBF Corpus, a large diachronic corpus of 767,204 Belgian French news articles published between 2008 and 2021 by the Belgian public service media RTBF. We present the contents and structure of the corpus, along with the different layers of metadata available for each text. We also describe the three different versions of the articles available in the corpus (depending on the cleaning and preprocessing steps applied to the text). The RTBF corpus is freely available online in CSV format (https://dataverse.uclouvain.be/dataset.xhtml?persistentId=doi:10.14428/DVN/PEVSSI), for research and teaching purposes only.