Capturing the ineffable: Collecting, analysing, and automating web document quality assessments

Davide Ceolin, Julia Noordegraaf, Lora Aroyo

Research output: Chapter in Book / Report / Conference proceeding > Conference contribution > Academic > peer-review

Abstract

Automatic estimation of the quality of Web documents is a challenging task, especially because the definition of quality heavily depends on the individuals who define it, on the context where it applies, and on the nature of the tasks at hand. Our long-term goal is to allow automatic assessment of Web document quality tailored to specific user requirements and context. This process relies on the possibility to identify document characteristics that indicate their quality. In this paper, we investigate these characteristics as follows: (1) we define features of Web documents that may be indicators of quality; (2) we design a procedure for automatically extracting those features; (3) we develop a Web application to present these results to niche users to check the relevance of these features as quality indicators and collect quality assessments; (4) we analyse users’ qualitative assessments of Web documents to refine our definition of the features that determine quality, and establish their relevant weight in the overall quality, i.e., in the summarizing score users attribute to a document, determining whether it meets their standards or not. Hence, our contribution is threefold: a Web application for nichesourcing quality assessments; a curated dataset of Web document assessments; and a thorough analysis of the quality assessments collected by means of two case studies involving experts (journalists and media scholars). The dataset obtained is limited in size but highly valuable because of the quality of the experts that provided it. Our analyses show that: (1) it is possible to automate the process of Web document quality estimation to a level of high accuracy; (2) document features shown in isolation are poorly informative to users; and (3) related to the tasks we propose (i.e., choosing Web documents to use as a source for writing an article on the vaccination debate), the most important quality dimensions are accuracy, trustworthiness, and precision.

Original language: English
Title of host publication: Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings
Publisher: Springer/Verlag
Pages: 83-97
Number of pages: 15
Volume: 10024 LNAI
ISBN (Print): 9783319490038
DOIs: https://doi.org/10.1007/978-3-319-49004-5_6
Publication status: Published - 2016
Event: 20th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2016 - Bologna, Italy
Duration: 19 Nov 2016 - 23 Nov 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10024 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 20th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2016
Country: Italy
City: Bologna
Period: 19/11/16 - 23/11/16

Fingerprint

Quality Assessment
Web Application
Trustworthiness
Vaccination
Niche
Threefolds
Isolation
High Accuracy
Attribute

Cite this

Ceolin, D., Noordegraaf, J., & Aroyo, L. (2016). Capturing the ineffable: Collecting, analysing, and automating web document quality assessments. In Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings (Vol. 10024 LNAI, pp. 83-97). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10024 LNAI). Springer/Verlag. https://doi.org/10.1007/978-3-319-49004-5_6
Ceolin, Davide; Noordegraaf, Julia; Aroyo, Lora. / Capturing the ineffable: Collecting, analysing, and automating web document quality assessments. Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings. Vol. 10024 LNAI Springer/Verlag, 2016. pp. 83-97 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{a22305fce4a14fa9b4e62d546837a8e7,
title = "Capturing the ineffable: Collecting, analysing, and automating web document quality assessments",
abstract = "Automatic estimation of the quality of Web documents is a challenging task, especially because the definition of quality heavily depends on the individuals who define it, on the context where it applies, and on the nature of the tasks at hand. Our long-term goal is to allow automatic assessment of Web document quality tailored to specific user requirements and context. This process relies on the possibility to identify document characteristics that indicate their quality. In this paper, we investigate these characteristics as follows: (1) we define features of Web documents that may be indicators of quality; (2) we design a procedure for automatically extracting those features; (3) we develop a Web application to present these results to niche users to check the relevance of these features as quality indicators and collect quality assessments; (4) we analyse users’ qualitative assessments of Web documents to refine our definition of the features that determine quality, and establish their relevant weight in the overall quality, i.e., in the summarizing score users attribute to a document, determining whether it meets their standards or not. Hence, our contribution is threefold: a Web application for nichesourcing quality assessments; a curated dataset of Web document assessments; and a thorough analysis of the quality assessments collected by means of two case studies involving experts (journalists and media scholars). The dataset obtained is limited in size but highly valuable because of the quality of the experts that provided it. Our analyses show that: (1) it is possible to automate the process of Web document quality estimation to a level of high accuracy; (2) document features shown in isolation are poorly informative to users; and (3) related to the tasks we propose (i.e., choosing Web documents to use as a source for writing an article on the vaccination debate), the most important quality dimensions are accuracy, trustworthiness, and precision.",
author = "Davide Ceolin and Julia Noordegraaf and Lora Aroyo",
year = "2016",
doi = "10.1007/978-3-319-49004-5_6",
language = "English",
isbn = "9783319490038",
volume = "10024 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer/Verlag",
pages = "83--97",
booktitle = "Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings",

}

Ceolin, D, Noordegraaf, J & Aroyo, L 2016, Capturing the ineffable: Collecting, analysing, and automating web document quality assessments. in Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings. vol. 10024 LNAI, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10024 LNAI, Springer/Verlag, pp. 83-97, 20th International Conference on Knowledge Engineering and Knowledge Management, EKAW 2016, Bologna, Italy, 19/11/16. https://doi.org/10.1007/978-3-319-49004-5_6

Capturing the ineffable: Collecting, analysing, and automating web document quality assessments. / Ceolin, Davide; Noordegraaf, Julia; Aroyo, Lora.

Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings. Vol. 10024 LNAI Springer/Verlag, 2016. p. 83-97 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10024 LNAI).

TY - GEN

T1 - Capturing the ineffable

T2 - Collecting, analysing, and automating web document quality assessments

AU - Ceolin, Davide

AU - Noordegraaf, Julia

AU - Aroyo, Lora

PY - 2016

Y1 - 2016

AB - Automatic estimation of the quality of Web documents is a challenging task, especially because the definition of quality heavily depends on the individuals who define it, on the context where it applies, and on the nature of the tasks at hand. Our long-term goal is to allow automatic assessment of Web document quality tailored to specific user requirements and context. This process relies on the possibility to identify document characteristics that indicate their quality. In this paper, we investigate these characteristics as follows: (1) we define features of Web documents that may be indicators of quality; (2) we design a procedure for automatically extracting those features; (3) we develop a Web application to present these results to niche users to check the relevance of these features as quality indicators and collect quality assessments; (4) we analyse users’ qualitative assessments of Web documents to refine our definition of the features that determine quality, and establish their relevant weight in the overall quality, i.e., in the summarizing score users attribute to a document, determining whether it meets their standards or not. Hence, our contribution is threefold: a Web application for nichesourcing quality assessments; a curated dataset of Web document assessments; and a thorough analysis of the quality assessments collected by means of two case studies involving experts (journalists and media scholars). The dataset obtained is limited in size but highly valuable because of the quality of the experts that provided it. Our analyses show that: (1) it is possible to automate the process of Web document quality estimation to a level of high accuracy; (2) document features shown in isolation are poorly informative to users; and (3) related to the tasks we propose (i.e., choosing Web documents to use as a source for writing an article on the vaccination debate), the most important quality dimensions are accuracy, trustworthiness, and precision.

UR - http://www.scopus.com/inward/record.url?scp=84997119426&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84997119426&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-49004-5_6

DO - 10.1007/978-3-319-49004-5_6

M3 - Conference contribution

SN - 9783319490038

VL - 10024 LNAI

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 83

EP - 97

BT - Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings

PB - Springer/Verlag

ER -

Ceolin D, Noordegraaf J, Aroyo L. Capturing the ineffable: Collecting, analysing, and automating web document quality assessments. In Knowledge Engineering and Knowledge Management - 20th International Conference, EKAW 2016, Proceedings. Vol. 10024 LNAI. Springer/Verlag. 2016. p. 83-97. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-319-49004-5_6