DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?

Urja Khurana, Eric Nalisnick, Antske Fokkens

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

When building a predictive model, it is often difficult to ensure that application-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what counts as hate speech, but building a model that accurately reflects their view requires preserving those ideals throughout the workflow of dataset construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the application specification and the model's actual behavior upon deployment. To address this issue for hate speech detection, we propose DEFVERIFY: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. We apply DEFVERIFY to six popular hate speech benchmark datasets and use it to find gaps between each dataset's definition and the behavior of the resulting model.
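
To make the three steps concrete, below is a minimal, hypothetical Python sketch of the procedure as described in the abstract. The names (HateSpeechDefinition, measure_definition_fit, locate_failure), the probe-example interface, and the 0.8 threshold are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical sketch of the three DefVerify steps from the abstract.
# All interfaces here are assumptions for illustration only.

@dataclass
class HateSpeechDefinition:
    """Step (i): encode a user-specified definition as explicit
    components, e.g. which groups are targeted and which types of
    attack count as hate speech."""
    target_groups: Sequence[str]
    attack_types: Sequence[str]

def measure_definition_fit(
    model: Callable[[str], int],
    probe_examples: dict[str, Sequence[tuple[str, int]]],
) -> dict[str, float]:
    """Step (ii): quantify how well the model's predictions match the
    definition, using labelled probe examples per definition component."""
    scores = {}
    for component, examples in probe_examples.items():
        correct = sum(model(text) == label for text, label in examples)
        scores[component] = correct / len(examples)
    return scores

def locate_failure(
    scores: dict[str, float], threshold: float = 0.8
) -> list[str]:
    """Step (iii): flag definition components the model fails to
    reflect, indicating where in the workflow to look for the gap
    (e.g. sampling bias or annotation bias for that component)."""
    return [c for c, s in scores.items() if s < threshold]
```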

Original language: English
Title of host publication: Proceedings of the 31st International Conference on Computational Linguistics
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Publisher: Association for Computational Linguistics (ACL)
Pages: 4341-4358
Number of pages: 18
ISBN (Electronic): 9798891761964
Publication status: Published - 2025
Event: 31st International Conference on Computational Linguistics, COLING 2025 - Abu Dhabi, United Arab Emirates
Duration: 19 Jan 2025 - 24 Jan 2025

Publication series

Name: Proceedings - International Conference on Computational Linguistics, COLING
Publisher: ACL
Volume: Part F206484-1
ISSN (Print): 2951-2093

Conference

Conference: 31st International Conference on Computational Linguistics, COLING 2025
Country/Territory: United Arab Emirates
City: Abu Dhabi
Period: 19/01/25 - 24/01/25

Bibliographical note

Publisher Copyright:
© 2025 Association for Computational Linguistics.
