Abstract
In recent years, the amount of online hate speech has grown steadily. Among the many approaches to automatically detecting hateful content online, ensemble learning is considered one of the best strategies, as shown by several studies on English and other languages. In this paper, we evaluate state-of-the-art approaches to Dutch hate speech detection under both in-domain and cross-domain conditions, and introduce a new ensemble approach with additional features for detecting hateful content in Dutch social media. The ensemble consists of a gradient boosting classifier that incorporates state-of-the-art transformer-based pre-trained language models for Dutch (i.e., BERTje and RobBERT), a robust SVM approach, and additional input information such as the number of emotion-conveying and hateful words, the number of personal pronouns, and the length of the message. The ensemble significantly outperforms all the individual models in both the in-domain and cross-domain hate speech detection settings. We perform an in-depth error analysis focusing on explicit and implicit hate speech instances, providing insights into open challenges in Dutch hate speech detection and directions for future research.
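As a rough illustration of the ensemble input described in the abstract, the sketch below shows one way the base-model scores and the additional count features could be concatenated into a single vector for a gradient boosting classifier. It is a minimal sketch, not the authors' implementation: the function names `extra_features` and `meta_features`, the toy word lists, the regex tokenizer, and the choice to measure message length in tokens are all assumptions made for illustration.

```python
import re

# Hypothetical toy word lists; the lexicons actually used in the paper
# are not given in this record.
EMOTION_WORDS = {"boos", "bang", "blij"}
HATEFUL_WORDS = {"idioot", "tuig"}
PERSONAL_PRONOUNS = {"ik", "jij", "je", "u", "hij", "zij", "ze",
                     "wij", "we", "jullie", "hen", "hun"}

# Assumed tokenizer: lowercase the message and keep runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def extra_features(message):
    """Return [emotion-word count, hateful-word count, pronoun count, length].

    Length is measured in tokens here, which is an assumption.
    """
    tokens = TOKEN_RE.findall(message.lower())
    return [
        sum(t in EMOTION_WORDS for t in tokens),
        sum(t in HATEFUL_WORDS for t in tokens),
        sum(t in PERSONAL_PRONOUNS for t in tokens),
        len(tokens),
    ]


def meta_features(message, base_scores):
    """Concatenate base-model scores with the count features.

    base_scores stands in for the outputs of the individual models
    (e.g. BERTje, RobBERT, SVM); the values are supplied by the caller.
    The resulting vector is what a gradient boosting classifier would
    be trained on in this sketch.
    """
    return list(base_scores) + extra_features(message)
```

For example, `meta_features("Jullie zijn tuig en ik ben boos", [0.9, 0.8, 0.7])` yields the three placeholder scores followed by one emotion word, one hateful word, two personal pronouns, and a length of seven tokens.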
Original language | English |
---|---|
Title of host publication | Natural Language Processing and Information Systems |
Subtitle of host publication | 27th International Conference on Applications of Natural Language to Information Systems, NLDB 2022, Valencia, Spain, June 15–17, 2022, Proceedings |
Editors | Paolo Rosso, Valerio Basile, Raquel Martínez, Elisabeth Métais, Farid Meziane |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 3-15 |
Number of pages | 13 |
ISBN (Electronic) | 9783031084737 |
ISBN (Print) | 9783031084720 |
Publication status | Published - 2022 |
Event | 27th International Conference on Applications of Natural Language to Information Systems, NLDB 2022 - Valencia, Spain; Duration: 15 Jun 2022 → 17 Jun 2022 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 13286 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Conference
Conference | 27th International Conference on Applications of Natural Language to Information Systems, NLDB 2022 |
---|---|
Country/Territory | Spain |
City | Valencia |
Period | 15/06/22 → 17/06/22 |
Bibliographical note
Funding Information: This research has been supported by the Flemish Research Foundation through the bilateral research project FWO G070619N “The linguistic landscape of hate speech on social media”. The research also received funding from the Flemish Government (AI Research Program).
Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Keywords
- Hate speech
- Dutch
- Cross-domain
- Ensemble