TY - GEN
T1 - Impact of Crowdsourcing OCR Improvements on Retrievability Bias
AU - Traub, Myriam C.
AU - Samar, Thaer
AU - Van Ossenbruggen, Jacco
AU - Hardman, Lynda
PY - 2018
Y1 - 2018/05//
AB - Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
KW - data quality
KW - digital library
KW - ocr
KW - retrievability bias
UR - http://www.scopus.com/inward/record.url?scp=85048741675&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85048741675&partnerID=8YFLogxK
U2 - 10.1145/3197026.3197046
DO - 10.1145/3197026.3197046
M3 - Conference contribution
AN - SCOPUS:85048741675
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 29
EP - 36
BT - JCDL '18: Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2018
Y2 - 3 June 2018 through 7 June 2018
ER -