Querylog-based assessment of retrievability bias in a large newspaper corpus

Myriam C. Traub, Thaer Samar, Jacco Van Ossenbruggen, Jiyin He, Arjen de Vries, Lynda Hardman

Research output: Chapter in Book / Report / Conference proceeding, Conference contribution, Academic, peer-review

Abstract

Bias in the retrieval of documents can directly influence the information access of a digital library. In the worst case, systematic favoritism for a certain type of document can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the retrievability for all documents in a collection. Previous evaluations have been performed on TREC collections using simulated query sets. The question remains, however, how representative this approach is of more realistic settings. To address this question, we investigate the effectiveness of the retrievability measure using a large digitized newspaper corpus, featuring two characteristics that distinguish our experiments from previous studies: (1) compared to TREC collections, our collection contains noise originating from OCR processing, historical spelling and use of language; and (2) instead of simulated queries, the collection comes with real user query logs including click data. First, we assess the retrievability bias imposed on the newspaper collection by different IR models. We assess the retrievability measure and confirm its ability to capture the retrievability bias in our setup. Second, we show how simulated queries differ from real user queries regarding term frequency and prevalence of named entities, and how this affects the retrievability results.

Original language: English
Title of host publication: JCDL 2016 - Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries
Publisher: Institute of Electrical and Electronics Engineers, Inc.
Pages: 7-16
Number of pages: 10
Volume: 2016-September
ISBN (Electronic): 9781450342292
Publication status: Published - 1 Sept 2016
Event: 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016 - Newark, United States
Duration: 19 Jun 2016 to 23 Jun 2016

Conference

Conference: 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016
Country/Territory: United States
City: Newark
Period: 19/06/16 to 23/06/16

Keywords

  • Digital Humanities
  • Digital Library
  • Retrievability Bias
  • User Query Logs
