Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus
In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. In addition to this balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision...
Author: | |
---|---|
Document type: | article in monograph or in proceedings |
Publication date: | 2010 |
Publisher/Editor: | European Language Resources Association (ELRA) |
Language: | unknown |
Permalink: | https://search.fid-benelux.de/Record/base-29036274 |
Data source: | BASE; original catalog |
Powered By: | BASE |
Link(s): | http://purl.utwente.nl/publications/72111 |