Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. In addition to its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision...

Authors: Reynaert, Martin
Oostdijk, Nelleke
De Clercq, Orphée
Heuvel, Henk van den
Jong, Franciska de
Document type: article in monograph or in proceedings
Publication date: 2010
Publisher/Ed.: European Language Resources Association (ELRA)
Language: unknown
Permalink: https://search.fid-benelux.de/Record/base-26678434
Data source: BASE; original catalog
Link(s): http://purl.utwente.nl/publications/72111