Comparing topic coverage in breadth-first and depth-first crawls using anchor texts

Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries

Research output: Chapter in Book / Report / Conference proceeding > Conference contribution > Academic > peer-review

Abstract

Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy influences which data is archived. We use the link anchor text of two Web crawls created with different crawling strategies to compare their coverage of past popular topics. One crawl was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web; this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web, both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls also differ in size and in the number of included websites and domains. To allow a fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.

Original language: English
Title of host publication: Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings
Publisher: Springer/Verlag
Pages: 133-146
Number of pages: 14
Volume: 9819
ISBN (Print): 9783319439969
DOI: https://doi.org/10.1007/978-3-319-43997-6_11
Publication status: Published - 2016
Event: 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016 - Hannover, Germany
Duration: 5 Sep 2016 to 9 Sep 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9819
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016
Country: Germany
City: Hannover
Period: 5/09/16 to 9/09/16

Fingerprint

Breadth
Anchors
Websites
Coverage
Entire
World Wide Web
Seed
Cover
String Matching
Text
Strategy
Query

Cite this

Samar, T., Traub, M. C., van Ossenbruggen, J., & de Vries, A. P. (2016). Comparing topic coverage in breadth-first and depth-first crawls using anchor texts. In Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings (Vol. 9819, pp. 133-146). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9819). Springer/Verlag. https://doi.org/10.1007/978-3-319-43997-6_11
Samar, Thaer; Traub, Myriam C.; van Ossenbruggen, Jacco; de Vries, Arjen P. / Comparing topic coverage in breadth-first and depth-first crawls using anchor texts. Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings. Vol. 9819. Springer/Verlag, 2016. pp. 133-146 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{c8dee605b6fc4c1c89b12b793a81e1a9,
title = "Comparing topic coverage in breadth-first and depth-first crawls using anchor texts",
abstract = "Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy influences which data is archived. We use the link anchor text of two Web crawls created with different crawling strategies to compare their coverage of past popular topics. One crawl was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web; this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web, both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls also differ in size and in the number of included websites and domains. To allow a fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.",
author = "Samar, Thaer and Traub, {Myriam C.} and {van Ossenbruggen}, Jacco and {de Vries}, {Arjen P.}",
year = "2016",
doi = "10.1007/978-3-319-43997-6_11",
language = "English",
isbn = "9783319439969",
volume = "9819",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer/Verlag",
pages = "133--146",
booktitle = "Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings",

}

Samar, T, Traub, MC, van Ossenbruggen, J & de Vries, AP 2016, Comparing topic coverage in breadth-first and depth-first crawls using anchor texts. in Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings. vol. 9819, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9819, Springer/Verlag, pp. 133-146, 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Hannover, Germany, 5/09/16. https://doi.org/10.1007/978-3-319-43997-6_11

Comparing topic coverage in breadth-first and depth-first crawls using anchor texts. / Samar, Thaer; Traub, Myriam C.; van Ossenbruggen, Jacco; de Vries, Arjen P.

Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings. Vol. 9819 Springer/Verlag, 2016. p. 133-146 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9819).

TY - GEN

T1 - Comparing topic coverage in breadth-first and depth-first crawls using anchor texts

AU - Samar, Thaer

AU - Traub, Myriam C.

AU - van Ossenbruggen, Jacco

AU - de Vries, Arjen P.

PY - 2016

Y1 - 2016

N2 - Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy influences which data is archived. We use the link anchor text of two Web crawls created with different crawling strategies to compare their coverage of past popular topics. One crawl was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web; this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web, both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls also differ in size and in the number of included websites and domains. To allow a fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.

UR - http://www.scopus.com/inward/record.url?scp=84984813513&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84984813513&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-43997-6_11

DO - 10.1007/978-3-319-43997-6_11

M3 - Conference contribution

SN - 9783319439969

VL - 9819

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 133

EP - 146

BT - Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings

PB - Springer/Verlag

ER -

Samar T, Traub MC, van Ossenbruggen J, de Vries AP. Comparing topic coverage in breadth-first and depth-first crawls using anchor texts. In Research and Advanced Technology for Digital Libraries - 20th International Conference on Theory and Practice of Digital Libraries, TPDL 2016, Proceedings. Vol. 9819. Springer/Verlag. 2016. p. 133-146. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-319-43997-6_11