Abstract
© 2017, Budapest Tech Polytechnical Institution. All rights reserved. For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.
Original language | English |
---|---|
Pages (from-to) | 59-78 |
Journal | Acta Polytechnica Hungarica |
Volume | 14 |
Issue number | 3 |
Publication status | Published - 2017 |
Externally published | Yes |
Funding
This work was supported by the Mexican Government (Conacyt projects 240844 and 20161958, SIP-IPN 20171813, 20171344, and 20172008, SNI, COFAA-IPN) and by the Portuguese Government, through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013. The authors express their gratitude to
Funders | Funder number |
---|---|
Mexican Government | |
Fundação para a Ciência e a Tecnologia | UID/CEC/50021/2013 |
Consejo Nacional de Ciencia y Tecnología | 20171344, SIP-IPN 20171813, 20172008, 20161958, 240844 |
Sistema Nacional de Investigadores | |