Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.0

Authors: Kuzman, Taja
Ljubešić, Nikola
Erjavec, Tomaž
Kopp, Matyáš
Ogrodniczuk, Maciej
Osenova, Petya
Rayson, Paul
Vidler, John
Agerri, Rodrigo
Agirrezabal, Manex
Agnoloni, Tommaso
Aires, José
Albini, Monica
Alkorta, Jon
Antiba-Cartazo, Iván
Arrieta, Ekain
Barcala, Mario
Bardanca, Daniel
Barkarson, Starkaður
Bartolini, Roberto
Battistoni, Roberto
Bel, Nuria
Bonet Ramos, Maria del Mar
Calzada Pérez, María
Cardoso, Aida
Çöltekin, Çağrı
Coole, Matthew
Darģis, Roberts
de Does, Jesse
de Libano, Ruben
Depoorter, Griet
Depuydt, Katrien
Diwersy, Sascha
Dodé, Réka
Fernandez, Kike
Fernández Rei, Elisa
Frontini, Francesca
Garcia, Marcos
García Díaz, Noelia
García Louzao, Pedro
Gavriilidou, Maria
Gkoumas, Dimitris
Grigorov, Ilko
Grigorova, Vladislava
Haltrup Hansen, Dorte
Iruskieta, Mikel
Jarlbrink, Johan
Jelencsik-Mátyus, Kinga
Jongejan, Bart
Kahusk, Neeme
Kirnbauer, Martin
Kryvenko, Anna
Ligeti-Nagy, Noémi
Luxardo, Giancarlo
Magariños, Carmen
Magnusson, Måns
Marchetti, Carlo
Marx, Maarten
Meden, Katja
Mendes, Amália
Mochtak, Michal
Mölder, Martin
Montemagni, Simonetta
Navarretta, Costanza
Nitoń, Bartłomiej
Norén, Fredrik Mohammadi
Nwadukwe, Amanda
Ojsteršek, Mihael
Pančur, Andrej
Papavassiliou, Vassilis
Pereira, Rui
Pérez Lago, María
Piperidis, Stelios
Pirker, Hannes
Pisani, Marilina
Pol, Henk van der
Prokopidis, Prokopis
Quochi, Valeria
Regueira, Xosé Luís
Rudolf, Michał
Ruisi, Manuela
Rupnik, Peter
Schopper, Daniel
Simov, Kiril
Sinikallio, Laura
Skubic, Jure
Tamper, Minna
Tungland, Lars Magne
Tuominen, Jouni
van Heusden, Ruben
Varga, Zsófia
Vázquez Abuín, Marta
Venturi, Giulia
Vidal Miguéns, Adrián
Vider, Kadri
Vivel Couso, Ainhoa
Vladu, Adina Ioana
Wissik, Tanja
Yrjänäinen, Väinö
Zevallos, Rodolfo
Fišer, Darja
Document type: corpus
Publication date: 2023
Publisher: CLARIN ERIC
Keywords: Parla-CLARIN / parliamentary debates / COVID-19 / TEI / Bulgarian Parliament / Croatian Parliament / Polish Parliament / Slovenian Parliament / Czech Parliament / Icelandic Parliament / Belgian Parliament / Danish Parliament / Dutch Parliament / Turkish Parliament / Italian Parliament / Hungarian Parliament / Latvian Parliament / French Parliament / Bosnian Parliament / Catalonian Parliament / Galician Parliament / Greek Parliament / Norwegian Parliament / Portuguese Parliament / Serbian Parliament / Swedish Parliament / Ukrainian Parliament / Austrian Parliament / Estonian Parliament / Spanish Parliament / Finnish Parliament / Basque Parliament / British Parliament
Language: English
Permalink: https://search.fid-benelux.de/Record/base-26501832
Data source: BASE; original catalog
Link(s): http://hdl.handle.net/11356/1864

ParlaMint-en.ana 4.0 is the English machine translation of the ParlaMint.ana 4.0 (http://hdl.handle.net/11356/1860) set of corpora of parliamentary debates across Europe. The translation is linguistically annotated in a similar way to the original-language corpora, but without UD syntax and with the addition of USAS semantic tags (https://ucrel.lancs.ac.uk/usas/). Because semantic tags were added, the UK corpus (ParlaMint-GB) is also included.

The translation into English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was performed at the sentence level and covers both speeches and transcriber notes, including headings. Note that corpus metadata is mostly available both in the source language and in English.

The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation, was done with Stanza (https://stanfordnlp.github.io/stanza/), using the conll03 model (4 classes) for NER. The annotation of MWEs (phrases) and tokens with USAS tags was done with pyMusas (https://github.com/ucrel/pymusas).

Note that the English in the corpora contains typical NMT errors, including factual errors even when high fluency is achieved. Any use of this corpus should take these machine translation limitations into account.

The files associated with this entry include the machine-translated and linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, NoSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. The CoNLL-U files include MT-generated word alignment and pyMusas USAS tags, as well as the tags and lemmas produced by spaCy (https://spacy.io/) for the purposes of semantic tagging, where these differ from the default annotations.
Also included is the 4.0 release of the sample data ...
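
As an illustration of how the CoNLL-U files described above might be read, the following minimal Python sketch parses CoNLL-U token lines using only the standard library and splits the MISC column into key-value pairs. The sample sentence is invented for this example and is not taken from the corpus. The MISC keys `SEM` and `NER` are hypothetical placeholders, since this record does not state the key names under which the USAS tags, word alignment, or spaCy annotations are stored; check the corpus documentation for the actual names. Because the corpus carries no UD syntax, the HEAD and DEPREL columns are shown as empty (`_`).

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a MISC value such as 'A=1|B=2' into a dict; '_' means empty."""
    if misc == "_":
        return {}
    pairs = (item.split("=", 1) for item in misc.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def read_sentences(lines: Iterable[str]) -> Iterator[List[dict]]:
    """Yield one list of token dicts per sentence (blank line = boundary)."""
    sentence: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as '# text = ...'
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue  # skip lines that are not well-formed token lines
        token = dict(zip(CONLLU_FIELDS, cols))
        token["misc"] = parse_misc(token["misc"])
        sentence.append(token)
    if sentence:
        yield sentence


# Invented sample; the MISC keys (NER, SEM) are assumptions for illustration.
SAMPLE = "\n".join([
    "# sent_id = example.1",
    "# text = Thank you.",
    "1\tThank\tthank\tVERB\t_\t_\t_\t_\t_\tNER=O|SEM=S1.2.4+",
    "2\tyou\tyou\tPRON\t_\t_\t_\t_\t_\tNER=O|SEM=Z8|SpaceAfter=No",
    "3\t.\t.\tPUNCT\t_\t_\t_\t_\t_\tNER=O",
    "",
])

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        for tok in sent:
            print(tok["form"], tok["upos"], tok["misc"].get("SEM", "-"))
```

To process a real file, the same function could be given an open file handle instead of `SAMPLE.splitlines()`, with the `SEM` lookup replaced by whichever key the corpus actually uses.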