Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0

Authors: Kuzman, Taja
Ljubešić, Nikola
Erjavec, Tomaž
Kopp, Matyáš
Ogrodniczuk, Maciej
Osenova, Petya
Fišer, Darja
Pirker, Hannes
Wissik, Tanja
Schopper, Daniel
Kirnbauer, Martin
Mochtak, Michal
Rupnik, Peter
Pol, Henk van der
Depoorter, Griet
de Does, Jesse
Simov, Kiril
Grigorova, Vladislava
Grigorov, Ilko
Jongejan, Bart
Haltrup Hansen, Dorte
Navarretta, Costanza
Mölder, Martin
Kahusk, Neeme
Vider, Kadri
Bel, Nuria
Antiba-Cartazo, Iván
Pisani, Marilina
Zevallos, Rodolfo
Regueira, Xosé Luís
Vladu, Adina Ioana
Magariños, Carmen
Bardanca, Daniel
Barcala, Mario
Garcia, Marcos
Pérez Lago, María
García Louzao, Pedro
Vivel Couso, Ainhoa
Vázquez Abuín, Marta
García Díaz, Noelia
Vidal Miguéns, Adrián
Fernández Rei, Elisa
Diwersy, Sascha
Luxardo, Giancarlo
Coole, Matthew
Rayson, Paul
Nwadukwe, Amanda
Gkoumas, Dimitris
Papavassiliou, Vassilis
Prokopidis, Prokopis
Gavriilidou, Maria
Piperidis, Stelios
Ligeti-Nagy, Noémi
Jelencsik-Mátyus, Kinga
Varga, Zsófia
Dodé, Réka
Barkarson, Starkaður
Agnoloni, Tommaso
Bartolini, Roberto
Frontini, Francesca
Montemagni, Simonetta
Quochi, Valeria
Venturi, Giulia
Ruisi, Manuela
Marchetti, Carlo
Battistoni, Roberto
Darģis, Roberts
van Heusden, Ruben
Marx, Maarten
Depuydt, Katrien
Tungland, Lars Magne
Rudolf, Michał
Nitoń, Bartłomiej
Aires, José
Mendes, Amália
Cardoso, Aida
Pereira, Rui
Yrjänäinen, Väinö
Norén, Fredrik Mohammadi
Magnusson, Måns
Jarlbrink, Johan
Meden, Katja
Pančur, Andrej
Ojsteršek, Mihael
Çöltekin, Çağrı
Kryvenko, Anna
Document type: corpus
Publication date: 2023
Publisher: CLARIN ERIC
Keywords: Parla-CLARIN / parliamentary debates / COVID-19 / TEI / Bulgarian Parliament / Croatian Parliament / Polish Parliament / Slovenian Parliament / Czech Parliament / Icelandic Parliament / Belgian Parliament / Danish Parliament / Dutch Parliament / Turkish Parliament / Italian Parliament / Hungarian Parliament / Latvian Parliament / French Parliament / Bosnian Parliament / Catalonian Parliament / Galician Parliament / Greek Parliament / Norwegian Parliament / Portuguese Parliament / Serbian Parliament / Swedish Parliament / Ukrainian Parliament / Austrian Parliament / Estonian Parliament
Language: English
Permalink: https://search.fid-benelux.de/Record/base-26509046
Data source: BASE; original catalogue
Link(s): http://hdl.handle.net/11356/1810

ParlaMint-en 3.0 comprises the linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 (http://hdl.handle.net/11356/1488), machine-translated into English, with the translation linguistically annotated. Except for the translation into English, small changes in the metadata and the absence of the British parliament corpus, the corpora included in this entry are in all respects identical to the source language corpora, i.e. the entry comprises the same 26 European parliamentary corpora, which together contain over 1.1 billion words.
The translation into English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done at the sentence level and covers both speeches and transcriber notes, including headings. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation, was done with Stanza (https://stanfordnlp.github.io/stanza/), using the English language model. For NER, the conll03 model with 4 NE classes was used.
Note that the automatically produced translation into English contains errors typical of neural machine translation, including factual errors even where a high level of fluency is achieved, and any manual or automatic use of this corpus should take the limitations of machine translation into account. Note also that some metadata errors were noticed after the source 3.0 corpora were released and were corrected for the machine-translated corpus, so there are slight differences in the metadata between the two.
The files associated with this entry include the linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata.
In contrast to the source language corpora, the CoNLL-U ...
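As an illustration of how the CoNLL-U files could be consumed, the following is a minimal sketch of a reader that uses only the Python standard library. It assumes the standard ten tab-separated CoNLL-U columns and blank-line sentence boundaries; the function name read_conllu and the sample sentence are invented for this example and are not taken from the corpus or its tooling.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        else:
            cols = line.split("\t")
            if len(cols) == len(CONLLU_FIELDS):
                sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented sample for demonstration only (hypothetical, not corpus data).
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = We thank you.\n"
    "1\tWe\twe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tthank\tthank\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tyou\tyou\tPRON\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

To read an actual file, an open file handle can be passed in place of SAMPLE.splitlines(), since the function only iterates over lines. Given that the translations are machine-produced, any counts derived this way inherit the translation errors noted in the description.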