Abstract
We present a method for gender and language variety identification using a convolutional neural network (CNN). We compare its performance with that of a traditional machine learning algorithm, support vector machines (SVM), trained on character n-grams (n = 3-8), lexical features (word unigrams and bigrams), and their combinations. We use a single multi-labeled corpus of news articles in different varieties of Spanish, developed specifically for these tasks. The CNN architecture is trained on word- and sentence-level embeddings and can be successfully applied to gender and language variety identification on a relatively small corpus (fewer than 10,000 documents). Our experiments show that the deep learning approach outperforms the traditional machine learning approach on both tasks when named entities are present in the corpus. However, when all named entities are reduced to a single symbol, NE, to avoid topic-dependent features, the drop in accuracy is larger for the deep learning approach.
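The baseline features named in the abstract (character n-grams with n = 3-8, word unigrams and bigrams, and named-entity masking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and the per-token entity flags are all assumptions made for illustration, and the sketch stops at feature counting without training an SVM.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=8):
    """Count every character n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # range() is empty when the text is shorter than n
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_max=2):
    """Count word n-grams for n = 1..n_max (whitespace tokenization is an assumption)."""
    tokens = text.split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def mask_named_entities(tokens, entity_flags, symbol="NE"):
    """Replace each token flagged as a named entity with one symbol.

    entity_flags is a hypothetical list of booleans, one per token,
    assumed to come from some external named-entity recognizer.
    """
    return [symbol if flag else tok for tok, flag in zip(tokens, entity_flags)]
```

In a setup like the one described, counts of this kind would typically be turned into feature vectors for the SVM, and the masking step would run before feature extraction so that topic-dependent names do not reach the classifier.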
| Original language | English |
|---|---|
| Pages (from-to) | 4845-4855 |
| Number of pages | 11 |
| Journal | Journal of Intelligent & Fuzzy Systems : Applications in Engineering and Technology |
| Volume | 36 |
| Issue number | 5 |
| Early online date | 14 May 2019 |
| Publication status | Published - May 2019 |
| Externally published | Yes |
Bibliographical note
© 2019-IOS Press and the authors.
Funding
The work was done with partial support of the Mexican Government via the CONACYT project 240844 and Instituto Politécnico Nacional grants SIP-20181849, SIP-20171813, and SIP-20181792. The work was done when A. Gelbukh was visiting the Research Institute for Information and Language Processing, University of Wolverhampton, on a grant from the Sabbatical Year Program of the CONACYT, Mexico.
| Funders | Funder number |
|---|---|
| Mexican Government | |
| Research Institute for Information and Language Processing, University of Wolverhampton | |
| Instituto Politécnico Nacional | SIP-20171813, SIP-20181792, SIP-20181849 |
| Consejo Nacional de Ciencia y Tecnología | 240844 |