Quantifying retrieval bias in Web archive search

Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Lynda Hardman, Arjen P. de Vries

Research output: Contribution to journal › Article

Abstract

A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. 
The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
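
The measurement described in the abstract can be illustrated with a minimal sketch. It assumes the cumulative form of retrievability (a document scores one point for every query that returns it within a rank cutoff) and summarises the resulting distribution with the Gini coefficient, a common choice in the retrievability literature; the abstract itself does not name the statistic. The function names, the `(url, year)` version identifiers and the toy rankings are hypothetical and are not taken from the article.

```python
from collections import Counter


def retrievability(results, cutoff, key=lambda version: version):
    """Count, per unit, the queries that return it within the top `cutoff`.

    `results` maps each query to its ranked list of document versions.
    `key` maps a version to the unit being scored; passing the URL collapses
    all versions of a document, keeping only the best-ranked one per query.
    """
    scores = Counter()
    for ranking in results.values():
        seen = []  # distinct units in rank order, after collapsing
        for version in ranking:
            if len(seen) >= cutoff:
                break
            unit = key(version)
            if unit not in seen:
                seen.append(unit)
        scores.update(seen)
    return scores


def gini(values):
    """Gini coefficient: 0 means equal scores, values near 1 mean strong bias."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical toy archive: each version is a (url, crawl_year) pair.
collection = [("a.nl", 2009), ("a.nl", 2010), ("b.nl", 2009),
              ("c.nl", 2011), ("d.nl", 2012)]
results = {
    "q1": [("a.nl", 2009), ("a.nl", 2010), ("b.nl", 2009)],
    "q2": [("a.nl", 2010), ("c.nl", 2011), ("a.nl", 2009)],
}

# Every version scored as an independent document.
per_version = retrievability(results, cutoff=2)
# Versions collapsed by URL before the cutoff is applied.
per_url = retrievability(results, cutoff=2, key=lambda version: version[0])

# Items that are never retrieved count as zeros in the distribution.
gini_versions = gini([per_version[v] for v in collection])
gini_urls = gini([per_url[u] for u in sorted({url for url, _ in collection})])
print(gini_versions, gini_urls)
```

In this toy data, collapsing by URL frees a slot in the top two results of `q1`, so `b.nl` becomes retrievable and the Gini coefficient changes. This is only meant to show how the aggregation level can affect the measured bias, not to reproduce the article's results.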

Language: English
Pages: 57-75
Number of pages: 19
Journal: International Journal on Digital Libraries
Volume: 19
Issue number: 1
Early online date: 14 Nov 2017
DOI: 10.1007/s00799-017-0215-9
State: Published - 1 Mar 2018

Keywords

  • Evaluation
  • Retrieval bias
  • Web archive

Cite this

Samar, Thaer; Traub, Myriam C.; van Ossenbruggen, Jacco; Hardman, Lynda; de Vries, Arjen P. / Quantifying retrieval bias in Web archive search. In: International Journal on Digital Libraries. 2018; Vol. 19, No. 1. pp. 57-75.

@article{e73a51ced69f4726bafeeed8216f6914,
title = "Quantifying retrieval bias in Web archive search",
abstract = "A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. 
The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.",
keywords = "Evaluation, Retrieval bias, Web archive",
author = "Samar, Thaer and Traub, {Myriam C.} and {van Ossenbruggen}, Jacco and Hardman, Lynda and {de Vries}, {Arjen P.}",
year = "2018",
month = "3",
day = "1",
doi = "10.1007/s00799-017-0215-9",
language = "English",
volume = "19",
pages = "57--75",
journal = "International Journal on Digital Libraries",
issn = "1432-5012",
publisher = "Springer",
number = "1",
}

Quantifying retrieval bias in Web archive search. / Samar, Thaer; Traub, Myriam C.; van Ossenbruggen, Jacco; Hardman, Lynda; de Vries, Arjen P.

In: International Journal on Digital Libraries, Vol. 19, No. 1, 01.03.2018, p. 57-75.

TY - JOUR

T1 - Quantifying retrieval bias in Web archive search

AU - Samar, Thaer

AU - Traub, Myriam C.

AU - van Ossenbruggen, Jacco

AU - Hardman, Lynda

AU - de Vries, Arjen P.

PY - 2018/3/1

Y1 - 2018/3/1

N2 - A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. 
The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.

KW - Evaluation

KW - Retrieval bias

KW - Web archive

UR - http://www.scopus.com/inward/record.url?scp=85017663462&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85017663462&partnerID=8YFLogxK

U2 - 10.1007/s00799-017-0215-9

DO - 10.1007/s00799-017-0215-9

M3 - Article

VL - 19

SP - 57

EP - 75

JO - International Journal on Digital Libraries

T2 - International Journal on Digital Libraries

JF - International Journal on Digital Libraries

SN - 1432-5012

IS - 1

ER -

Samar T, Traub MC, van Ossenbruggen J, Hardman L, de Vries AP. Quantifying retrieval bias in Web archive search. International Journal on Digital Libraries. 2018 Mar 1;19(1):57-75. doi: 10.1007/s00799-017-0215-9