Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Myriam C. Traub, Thaer Samar, Jacco Van Ossenbruggen, Lynda Hardman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
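The abstract does not say how the retrievability scores or the bias were computed. As a minimal sketch, assuming the commonly used cumulative retrievability measure (a document scores one point for each query that returns it within a rank cutoff) and the Gini coefficient as the bias summary, the comparison between uncorrected and corrected text could look like the following. The function names, the cutoff, and the toy rankings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def retrievability_scores(ranked_results, doc_ids, cutoff=10):
    """Count, per document, how many queries return it within the top `cutoff` ranks.

    ranked_results maps a query string to its ranked list of document ids.
    Documents that are never retrieved get a score of 0.
    """
    counts = Counter()
    for ranking in ranked_results.values():
        for doc_id in ranking[:cutoff]:
            counts[doc_id] += 1
    return {doc_id: counts[doc_id] for doc_id in doc_ids}


def gini(values):
    """Gini coefficient of non-negative scores: 0 means all scores are equal,
    larger values mean retrieval is concentrated on fewer documents."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy data (invented for illustration): d4 is unreachable with noisy OCR text
# and becomes retrievable once its text is corrected.
doc_ids = ["d1", "d2", "d3", "d4"]
uncorrected = {"q1": ["d1", "d2"], "q2": ["d1"], "q3": ["d1", "d3"]}
corrected = {"q1": ["d1", "d2", "d4"], "q2": ["d1", "d4"], "q3": ["d3", "d1"]}

scores_before = retrievability_scores(uncorrected, doc_ids)
scores_after = retrievability_scores(corrected, doc_ids)
bias_before = gini(scores_before.values())
bias_after = gini(scores_after.values())
print(f"sum of scores: {sum(scores_before.values())} -> {sum(scores_after.values())}")
print(f"Gini: {bias_before:.2f} -> {bias_after:.2f}")
```

In this toy setup the total sum of scores rises and the Gini coefficient falls after correction, which mirrors the direction of the effect the abstract reports; the numbers themselves carry no meaning beyond the example.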

Original language: English
Title of host publication: JCDL 2018 - Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 29-36
Number of pages: 8
ISBN (Electronic): 9781450351782
DOIs: https://doi.org/10.1145/3197026.3197046
Publication status: Published - 23 May 2018
Event: 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018 - Fort Worth, United States
Duration: 3 Jun 2018 → 7 Jun 2018

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
ISSN (Print): 1552-5996

Conference

Conference: 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018
Country: United States
City: Fort Worth
Period: 3/06/18 → 7/06/18

Fingerprint

Optical character recognition
Digital libraries

Keywords

  • data quality
  • digital library
  • ocr
  • retrievability bias

Cite this

Traub, M. C., Samar, T., Van Ossenbruggen, J., & Hardman, L. (2018). Impact of Crowdsourcing OCR Improvements on Retrievability Bias. In JCDL 2018 - Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries (pp. 29-36). (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3197026.3197046
Traub, Myriam C. ; Samar, Thaer ; Van Ossenbruggen, Jacco ; Hardman, Lynda. / Impact of Crowdsourcing OCR Improvements on Retrievability Bias. JCDL 2018 - Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries. Institute of Electrical and Electronics Engineers Inc., 2018. pp. 29-36 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries).
@inproceedings{3e107d5ad9b84dec8b7f98a665a3848d,
title = "Impact of Crowdsourcing OCR Improvements on Retrievability Bias",
abstract = "Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.",
keywords = "data quality, digital library, ocr, retrievability bias",
author = "Traub, {Myriam C.} and Samar, Thaer and {Van Ossenbruggen}, Jacco and Hardman, Lynda",
year = "2018",
month = "5",
day = "23",
doi = "10.1145/3197026.3197046",
language = "English",
series = "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "29--36",
booktitle = "JCDL 2018 - Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries",
address = "United States",
}

Traub, MC, Samar, T, Van Ossenbruggen, J & Hardman, L 2018, Impact of Crowdsourcing OCR Improvements on Retrievability Bias. in JCDL 2018 - Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Institute of Electrical and Electronics Engineers Inc., pp. 29-36, 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018, Fort Worth, United States, 3/06/18. https://doi.org/10.1145/3197026.3197046

TY - GEN
T1 - Impact of Crowdsourcing OCR Improvements on Retrievability Bias
AU - Traub, Myriam C.
AU - Samar, Thaer
AU - Van Ossenbruggen, Jacco
AU - Hardman, Lynda
PY - 2018/5/23
Y1 - 2018/5/23
AB - Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.

KW - data quality
KW - digital library
KW - ocr
KW - retrievability bias
UR - http://www.scopus.com/inward/record.url?scp=85048741675&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85048741675&partnerID=8YFLogxK
U2 - 10.1145/3197026.3197046
DO - 10.1145/3197026.3197046
M3 - Conference contribution
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 29
EP - 36
BT - JCDL 2018 - Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries
PB - Institute of Electrical and Electronics Engineers Inc.
ER -

Traub MC, Samar T, Van Ossenbruggen J, Hardman L. Impact of Crowdsourcing OCR Improvements on Retrievability Bias. In JCDL 2018 - Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries. Institute of Electrical and Electronics Engineers Inc. 2018. p. 29-36. (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). https://doi.org/10.1145/3197026.3197046