Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Myriam C. Traub, Thaer Samar, Jacco Van Ossenbruggen, Lynda Hardman

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review


Abstract

Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
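The two quantities the abstract refers to can be illustrated with a minimal sketch. It assumes the cumulative form of the retrievability score (a document earns one point each time it appears within a rank cutoff for a query) and uses the Gini coefficient as the bias measure, which is the common choice in the retrievability literature. The abstract does not state the cutoff, the retrieval model, or the exact bias measure used, so the function names, the cutoff value, and the toy data below are illustrative assumptions only.

```python
def retrievability_scores(ranked_results, doc_ids, cutoff=10):
    """Count, per document, how many queries return it within the top `cutoff` ranks.

    ranked_results: dict mapping a query string to a ranked list of document ids.
    doc_ids: all documents in the collection, so unretrieved ones get a score of 0.
    cutoff: assumed rank cutoff (hypothetical default, not taken from the paper).
    """
    scores = {d: 0 for d in doc_ids}
    for ranking in ranked_results.values():
        for d in ranking[:cutoff]:
            if d in scores:
                scores[d] += 1
    return scores


def gini(values):
    """Gini coefficient of non-negative scores: 0 means all documents are equally retrievable."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula over the sorted values with 1-based ranks.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Toy example (made-up data): three documents, two queries, cutoff of 1.
results = {"q1": ["d1", "d2"], "q2": ["d1"]}
scores = retrievability_scores(results, ["d1", "d2", "d3"], cutoff=1)
bias = gini(scores.values())
```

Under these assumptions, comparing the sum of the scores and the Gini value before and after OCR correction would correspond to the "total sum" and "overall retrievability bias" comparisons the abstract mentions.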

Original language: English
Title of host publication: JCDL '18: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 29-36
Number of pages: 8
ISBN (Electronic): 9781450351782
Publication status: Published - May 2018
Event: 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018 - Fort Worth, United States
Duration: 3 Jun 2018 to 7 Jun 2018

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
ISSN (Print): 1552-5996

Conference

Conference: 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018
Country/Territory: United States
City: Fort Worth
Period: 3/06/18 to 7/06/18

Funding

Funder: Horizon 2020 Framework Programme
Funder number: 676247

Keywords

• data quality
• digital library
• ocr
• retrievability bias
