Authorship attribution in Portuguese using character N-grams

I. Markov, J. Baptista, O. Pichardo-Lagunas

Research output: Contribution to Journal, Article, Academic, peer-review

Abstract

© 2017, Budapest Tech Polytechnical Institution. All rights reserved.

Character n-grams are considered among the best predictive features for the Authorship Attribution (AA) task. For English, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams and of various combinations of them. It also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches evaluated on the same corpus.
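To make the notion of a character n-gram feature concrete, the following is a minimal sketch in Python. It is not the method used in the paper: the function names, the default n = 3, and the nearest-profile cosine comparison are assumptions made purely for illustration, whereas the paper evaluates typed character n-grams with machine-learning classifiers.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n=3):
    """Map each character n-gram of the lowercased text to its relative frequency."""
    counts = Counter(char_ngrams(text.lower(), n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def attribute(unknown_text, known_texts, n=3):
    """Return the candidate author whose n-gram profile is closest to unknown_text.

    known_texts is a hypothetical dict mapping an author name to sample text.
    This nearest-profile rule is an illustrative stand-in, not the paper's classifier.
    """
    target = ngram_profile(unknown_text, n)
    return max(
        known_texts,
        key=lambda author: cosine(target, ngram_profile(known_texts[author], n)),
    )
```

In such a sketch, changing `n` or restricting which n-grams enter the profile corresponds loosely to the selection of n-gram length and type that the abstract describes.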
Original language: English
Pages (from-to): 59-78
Journal: Acta Polytechnica Hungarica
Volume: 14
Issue number: 3
Publication status: Published - 2017
Externally published: Yes

Funding

This work was supported by the Mexican Government (Conacyt projects 240844 and 20161958, SIP-IPN 20171813, 20171344, and 20172008, SNI, COFAA-IPN) and by the Portuguese Government, through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013. The authors express their gratitude to

Funders (funder number)
Fundação para a Ciência e a Tecnologia: UID/CEC/50021/2013
Mexican Government
Consejo Nacional de Ciencia y Tecnología: 20171344, SIP-IPN 20171813, 20172008, 20161958, 240844
Sistema Nacional de Investigadores
