Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Beside its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision...

Authors:	Reynaert, Martin; Oostdijk, Nelleke; De Clercq, Orphée; Heuvel, Henk van den; Jong, Franciska de
Document type:	article in monograph or in proceedings
Publication date:	2010
Publisher/Editor:	European Language Resources Association (ELRA)
Language:	unknown
Permalink:	https://search.fid-benelux.de/Record/base-29036274
Data source:	BASE; original catalog
Link(s) :	http://purl.utwente.nl/publications/72111
