Non-word Attributes’ Efficiency in Text Mining Authorship Prediction

Author:	Mustafa Tareef Kamil
Document type:	Article
Publication date:	2019
Series/Journal:	Journal of Intelligent Systems, Vol 29, Iss 1, Pp 1408-1415 (2019)
Publisher:	De Gruyter
Keywords:	machine learning / stylometric / authorship attribution / saba / non-word attribute / Science / Q / Electronic computers. Computer science / QA75.5-76.95
Language:	English
Permalink:	https://search.fid-benelux.de/Record/base-27281264
Data source:	BASE; original catalog
Link(s) :	https://doi.org/10.1515/jisys-2019-0068

Literary texts can be compared to paintings, both artistically and in terms of financial value, since the value of a text rises and falls with its author's popularity. An author's texts exhibit a specific writing style that can be measured and compared using a field of text mining called stylometry. Stylometric analysis relies on features called authorship attributes, which are fed into dedicated algorithms and methods to predict who wrote a text. Generally, each method used in stylometry draws on a variety of attributes to reach higher prediction accuracy. The aim of this research is to improve the accuracy of authorship prediction for literary works based on the authors' artistic writing style. To achieve that, a new set of attributes is used with the Stylometric Authorship Balanced Attribution method, which was chosen for this research over several other machine learning methods because of its precision in authorship prediction tasks. The attributes used by most researchers have been word frequencies (single words, word pairs, or word triples), which led to some prediction mistakes. In this research, a new set of attributes is used to reduce these mistakes. The proposed non-word attributes are sentence length, special characters, and punctuation symbols. The results obtained with these proposed attributes were excellent.
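
As an illustration of the three non-word attributes named above, the following minimal Python sketch extracts sentence length, punctuation counts, and special-character counts from a text. It is not the article's implementation: the function name `non_word_features`, the split between `PUNCTUATION` and `SPECIAL` characters, the regex-based sentence splitter, and the use of a mean as the sentence-length statistic are all assumptions made for this example.

```python
import re
import string
from collections import Counter

# Assumption: which symbols count as "punctuation" vs. "special characters"
# is not specified in the abstract; this split is purely illustrative.
PUNCTUATION = set(".,;:!?'\"-()")
SPECIAL = set(string.punctuation) - PUNCTUATION


def non_word_features(text):
    """Return simple non-word stylometric features for one text."""
    # Crude sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Sentence length measured in word tokens (runs of word characters).
    lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return {
        "sentence_count": len(sentences),
        "mean_sentence_length": mean_length,
        "punctuation": Counter(ch for ch in text if ch in PUNCTUATION),
        "special": Counter(ch for ch in text if ch in SPECIAL),
    }


if __name__ == "__main__":
    sample = "Hello, world! How are you? Fine & well."
    features = non_word_features(sample)
    print(features["sentence_count"], features["mean_sentence_length"])
    print(dict(features["punctuation"]), dict(features["special"]))
```

Feature dictionaries like this one could then be turned into numeric vectors, one per text, and passed to whichever attribution method is being evaluated. Normalizing the counts by text length would likely be needed when comparing texts of different sizes.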