TY - CONF
T1 - Impact analysis of OCR quality on research tasks in digital archives
AU - Traub, Myriam C.
AU - Van Ossenbruggen, Jacco
AU - Hardman, Lynda
PY - 2015
Y1 - 2015
N2 - Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. This, however, would be important to assess whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that gives account of their susceptibility to specific OCR-induced biases and the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge situation on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved.
KW - Digital humanities
KW - Digital libraries
KW - OCR quality
UR - http://www.scopus.com/inward/record.url?scp=84944722905&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84944722905&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-24592-8_19
DO - 10.1007/978-3-319-24592-8_19
M3 - Conference contribution
AN - SCOPUS:84944722905
SN - 9783319245911
VL - 9316
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 252
EP - 263
BT - Research and Advanced Technology for Digital Libraries - 19th International Conference on Theory and Practice of Digital Libraries, TPDL 2015, Proceedings
PB - Springer-Verlag
T2 - 19th International Conference on Theory and Practice of Digital Libraries, TPDL 2015
Y2 - 14 September 2015 through 18 September 2015
ER -