Psycholinguistic LIWC and n-gram counts in a corpus of 1145 English and Dutch novels ...
Author: | |
---|---|
Document type: | dataset |
Publication date: | 2020 |
Publisher: | Mendeley |
Keywords: | Literature / Computational Linguistics / Psycholinguistics |
Language: | unknown |
Permalink: | https://search.fid-benelux.de/Record/base-28979471 |
Data source: | BASE; original catalog |
Link(s) : | https://dx.doi.org/10.17632/tmp32v54ss.2 |
This dataset consists of CSV files with word counts in several corpora:

- 694 English-language novels by male and female authors, classified by the authors' sexual orientation (heterosexual, bisexual, homosexual)
- 401 bestselling Dutch-language novels
- 50 novels nominated for Dutch literary prizes

Each corpus comes with:

- LIWC counts; this file also includes the available metadata for each novel. The English data was created with LIWC 2015. The Dutch data was created with the validated translation of LIWC 2001.
- Word counts (unigrams) and bigram counts per novel. All text has been converted to lowercase. Contractions are tokenized into separate tokens, e.g., can't => ca n't. Two restrictions are applied:
  - only unigrams or bigrams that occur in at least 10 texts are retained
  - only the 100k most frequent are retained
- Overall word counts and bigram counts, i.e., the sum across all novels.

All files are encoded in UTF-8. The word counts were extracted with the countngrams.py script. ...
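
The counting procedure described above can be illustrated with a minimal Python sketch. This is not the countngrams.py script itself, whose contents are not shown here. The names `tokenize` and `count_ngrams`, the regex-based tokenizer, and the order in which the two restrictions are applied (document-frequency filter first, then the frequency cutoff) are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into tokens (hypothetical tokenizer, not the original)."""
    text = text.lower()
    # Split contractions into separate tokens, e.g. can't => ca n't
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    return re.findall(r"n't|\w+|[^\w\s]", text)


def count_ngrams(texts, n, min_texts=10, max_types=100_000):
    """Count n-grams per text and overall, applying the two restrictions.

    Assumed order: keep n-grams occurring in at least `min_texts` texts,
    then keep the `max_types` most frequent of those.
    """
    per_text = []
    total = Counter()     # sum across all texts
    doc_freq = Counter()  # number of texts each n-gram occurs in
    for text in texts:
        tokens = tokenize(text)
        grams = Counter(zip(*(tokens[i:] for i in range(n))))
        per_text.append(grams)
        total.update(grams)
        doc_freq.update(grams.keys())

    frequent_enough = [g for g, _ in total.most_common() if doc_freq[g] >= min_texts]
    kept = set(frequent_enough[:max_types])

    per_text_kept = [{g: c for g, c in grams.items() if g in kept} for grams in per_text]
    total_kept = {g: total[g] for g in kept}
    return per_text_kept, total_kept
```

Calling `count_ngrams(texts, 1)` yields unigram counts and `count_ngrams(texts, 2)` yields bigram counts; each n-gram is represented as a tuple of tokens. Whether the published overall counts are summed before or after filtering is not stated in the description, so the sketch simply reports totals for the retained n-grams.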