Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 4.0

Authors:	Kuzman, Taja; Ljubešić, Nikola; Erjavec, Tomaž; Kopp, Matyáš; Ogrodniczuk, Maciej; Osenova, Petya; Rayson, Paul; Vidler, John; Agerri, Rodrigo; Agirrezabal, Manex; Agnoloni, Tommaso; Aires, José; Albini, Monica; Alkorta, Jon; Antiba-Cartazo, Iván; Arrieta, Ekain; Barcala, Mario; Bardanca, Daniel; Barkarson, Starkaður; Bartolini, Roberto; Battistoni, Roberto; Bel, Nuria; Bonet Ramos, Maria del Mar; Calzada Pérez, María; Cardoso, Aida; Çöltekin, Çağrı; Coole, Matthew; Darģis, Roberts; de Does, Jesse; de Libano, Ruben; Depoorter, Griet; Depuydt, Katrien; Diwersy, Sascha; Dodé, Réka; Fernandez, Kike; Fernández Rei, Elisa; Frontini, Francesca; Garcia, Marcos; García Díaz, Noelia; García Louzao, Pedro; Gavriilidou, Maria; Gkoumas, Dimitris; Grigorov, Ilko; Grigorova, Vladislava; Haltrup Hansen, Dorte; Iruskieta, Mikel; Jarlbrink, Johan; Jelencsik-Mátyus, Kinga; Jongejan, Bart; Kahusk, Neeme; Kirnbauer, Martin; Kryvenko, Anna; Ligeti-Nagy, Noémi; Luxardo, Giancarlo; Magariños, Carmen; Magnusson, Måns; Marchetti, Carlo; Marx, Maarten; Meden, Katja; Mendes, Amália; Mochtak, Michal; Mölder, Martin; Montemagni, Simonetta; Navarretta, Costanza; Nitoń, Bartłomiej; Norén, Fredrik Mohammadi; Nwadukwe, Amanda; Ojsteršek, Mihael; Pančur, Andrej; Papavassiliou, Vassilis; Pereira, Rui; Pérez Lago, María; Piperidis, Stelios; Pirker, Hannes; Pisani, Marilina; Pol, Henk van der; Prokopidis, Prokopis; Quochi, Valeria; Regueira, Xosé Luís; Rudolf, Michał; Ruisi, Manuela; Rupnik, Peter; Schopper, Daniel; Simov, Kiril; Sinikallio, Laura; Skubic, Jure; Tamper, Minna; Tungland, Lars Magne; Tuominen, Jouni; van Heusden, Ruben; Varga, Zsófia; Vázquez Abuín, Marta; Venturi, Giulia; Vidal Miguéns, Adrián; Vider, Kadri; Vivel Couso, Ainhoa; Vladu, Adina Ioana; Wissik, Tanja; Yrjänäinen, Väinö; Zevallos, Rodolfo; Fišer, Darja
Document type:	corpus
Publication date:	2023
Publisher/Editor:	CLARIN ERIC
Keywords:	Parla-CLARIN / parliamentary debates / COVID-19 / TEI / Bulgarian Parliament / Croatian Parliament / Polish Parliament / Slovenian Parliament / Czech Parliament / Icelandic Parliament / Belgian Parliament / Danish Parliament / Dutch Parliament / Turkish Parliament / Italian Parliament / Hungarian Parliament / Latvian Parliament / French Parliament / Bosnian Parliament / Catalonian Parliament / Galician Parliament / Greek Parliament / Norwegian Parliament / Portuguese Parliament / Serbian Parliament / Swedish Parliament / Ukrainian Parliament / Austrian Parliament / Estonian Parliament / Spanish Parliament / Finnish Parliament / Basque Parliament / British Parliament
Language:	English
Permalink:	https://search.fid-benelux.de/Record/base-29298162
Data source:	BASE; original catalogue
Link(s):	http://hdl.handle.net/11356/1864

ParlaMint-en.ana 4.0 is the English machine translation of ParlaMint.ana 4.0 (http://hdl.handle.net/11356/1860), a set of corpora of parliamentary debates across Europe. The translation is linguistically annotated similarly to the original-language corpora, but without UD syntax and with the addition of USAS semantic tags (https://ucrel.lancs.ac.uk/usas/). Because semantic tags were added, the UK corpus (ParlaMint-GB) is also included.

The translation into English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done at the sentence level and covers both speeches and transcriber notes, including headings. Note that corpus metadata is mostly available both in the source language and in English.

The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation, was done with Stanza (https://stanfordnlp.github.io/stanza/) using the conll03 model (4 classes). The annotation of MWEs (phrases) and tokens with USAS tags was done with pyMusas (https://github.com/ucrel/pymusas).

Note that the English in the corpora contains typical NMT errors, including factual errors even where fluency is high, and any use of this corpus should take the limitations of machine translation into account.

The files associated with this entry include the machine-translated and linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, NoSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. The CoNLL-U files include MT-generated word alignment and pyMusas USAS tags, as well as the tags and lemmas produced by spaCy (https://spacy.io/) for the purposes of semantic tagging, where these differ from the default annotations.
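As a rough illustration of how the CoNLL-U files might be read, the following Python sketch parses the ten standard CoNLL-U columns using only the standard library. It is not taken from the ParlaMint distribution: the sample sentence, the sentence identifier, and in particular the `SEM` key in the MISC column are assumptions made for the example, and the actual attribute names used for USAS tags and word alignment should be checked against the corpus documentation. Leaving the HEAD, DEPREL and DEPS columns as `_` is likewise only a guess at how the absence of UD syntax is encoded.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_key_values(cell: str) -> Dict[str, str]:
    """Split a FEATS or MISC cell such as 'A=1|B=2' into a dict ('_' is empty)."""
    if not cell or cell == "_":
        return {}
    pairs: Dict[str, str] = {}
    for item in cell.split("|"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def read_conllu(lines: Iterable[str]) -> Iterator[dict]:
    """Yield sentences as {'comments': [...], 'tokens': [...]} dictionaries."""
    comments: List[str] = []
    tokens: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():                 # a blank line ends a sentence
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = [], []
            continue
        if line.startswith("#"):             # sentence-level comment line
            comments.append(line[1:].strip())
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):  # skip malformed lines
            continue
        token = dict(zip(CONLLU_FIELDS, cols))
        token["feats"] = parse_key_values(token["feats"])
        token["misc"] = parse_key_values(token["misc"])
        tokens.append(token)
    if tokens:                               # file may lack a final blank line
        yield {"comments": comments, "tokens": tokens}


# Invented sample; the 'SEM' MISC key and the sent_id are hypothetical.
SAMPLE = (
    "# sent_id = example.seg1.1\n"
    "# text = We thank the minister.\n"
    "1\tWe\twe\tPRON\t_\tCase=Nom|Number=Plur\t_\t_\t_\tSEM=Z8\n"
    "2\tthank\tthank\tVERB\t_\tTense=Pres\t_\t_\t_\t_\n"
    "3\tthe\tthe\tDET\t_\tDefinite=Def\t_\t_\t_\t_\n"
    "4\tminister\tminister\tNOUN\t_\tNumber=Sing\t_\t_\t_\tSpaceAfter=No|SEM=G1.1\n"
    "5\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for tok in sentences[0]["tokens"]:
    print(tok["form"], tok["upos"], tok["misc"].get("SEM", "-"))
```

To read a real file, the same function can be given an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the CoNLL-U files from the distribution.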
Also included is the 4.0 release of the sample data ...