The RTBF Corpus: a dataset of 750,000 Belgian French news articles published between 2008 and 2021
Author: | |
---|---|
Document type: | conferenceObject |
Publication date: | 2023 |
Keywords: | corpus / RTBF / journalism / press corpus / French |
Language: | English |
Permalink: | https://search.fid-benelux.de/Record/base-28876868 |
Data source: | BASE; original catalogue |
Link(s) : | http://hdl.handle.net/2078.1/276580 |
In this paper, we introduce the RTBF Corpus, a large diachronic corpus of 767,204 Belgian French news articles published between 2008 and 2021 by the Belgian public service media organisation RTBF. We present the contents and structure of the corpus, along with the layers of metadata available for each text. We also describe the three versions of each article available in the corpus, which differ in the cleaning and preprocessing steps applied to the text. The RTBF Corpus is freely available online in CSV format (https://dataverse.uclouvain.be/dataset.xhtml?persistentId=doi:10.14428/DVN/PEVSSI), for research and teaching purposes only.
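
Since the corpus is distributed as CSV and is described as diachronic, a first step for many users would probably be to count articles per publication year. The following is a minimal sketch only: the column names (`id`, `date`, `title`, `text`), the ISO date format, and the three sample rows are assumptions made for illustration and are not taken from the actual corpus files, whose header should be checked before adapting the code.

```python
import csv
import io
from collections import Counter

# Illustrative stand-in for the real file. The column names and rows
# below are hypothetical; the actual CSV header may differ.
SAMPLE_CSV = """id,date,title,text
1,2008-03-14,Titre A,Texte A
2,2015-07-01,Titre B,Texte B
3,2015-11-23,Titre C,Texte C
"""


def articles_per_year(csv_file, date_column="date"):
    """Count rows per year, assuming dates start with a four-digit year."""
    reader = csv.DictReader(csv_file)
    return Counter(row[date_column][:4] for row in reader)


# Any text file object works here; an in-memory buffer keeps the sketch
# self-contained.
counts = articles_per_year(io.StringIO(SAMPLE_CSV))
for year, n in sorted(counts.items()):
    print(year, n)
```

For a downloaded file, the in-memory buffer could be replaced with something like `open(path, newline="", encoding="utf-8")`, where both the path and the UTF-8 encoding are again assumptions to verify against the dataset documentation.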