Capturing Contentiousness: Constructing the Contentious Terms in Context Corpus

Ryan Brate, Andrei Nesterov, Valentin Vogelmann, Jacco Van Ossenbruggen, Laura Hollink, Marieke Van Erp

Research output: Chapter in Book / Report / Conference proceeding; Conference contribution; Academic; peer-review

Abstract

Recent initiatives by cultural heritage institutions to address outdated and offensive language in their collections demonstrate the need for a better understanding of when terms are problematic or contentious. This paper presents an annotated dataset of 2,715 unique samples of terms in context, drawn from a historical newspaper archive, collating 21,800 annotations of contentiousness from experts and crowd workers. We describe the contents of the corpus by analysing inter-rater agreement and differences between experts and crowd workers. In addition, we demonstrate the potential of the corpus for automated detection of contentiousness. We show that a simple classifier applied to the embedding representation of a target word provides better-than-baseline performance in predicting contentiousness. We find that both the term itself and its context play a role in whether a term is considered contentious.

Original language: English
Title of host publication: K-CAP '21
Subtitle of host publication: Proceedings of the 11th on Knowledge Capture Conference
Publisher: Association for Computing Machinery, Inc
Pages: 17-24
Number of pages: 8
ISBN (Electronic): 9781450384575
Publication status: Published - Dec 2021
Event: 11th ACM International Conference on Knowledge Capture, K-CAP 2021 - Virtual, Online, United States
Duration: 2 Dec 2021 to 3 Dec 2021

Conference

Conference: 11th ACM International Conference on Knowledge Capture, K-CAP 2021
Country/Territory: United States
City: Virtual, Online
Period: 2/12/21 to 3/12/21

Bibliographical note

Funding Information:
This work was funded by the EuropeanaTech Challenge for Europeana Artificial Intelligence and Machine Learning datasets, ‘Culturally Aware AI’ funded by NWO, and SABIO funded by the Dutch Digital Heritage Network. The authors would like to thank the Cultural AI Lab and KNAW HuC colleagues for their comments and annotations and the anonymous Prolific annotators. Special thanks to Mirjam Cuper (National Library of the Netherlands) for guiding KB and Europeana procedures, Lynda Hardman (CWI) for the suggestions on the article editing, and the anonymous reviewers for their constructive feedback.

Publisher Copyright:
© 2021 ACM.

Keywords

  • bias
  • crowdsourcing
  • datasets
  • knowledge capture
