Literally better: Analyzing and improving the quality of literals

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We present a toolchain that builds on the LOD Laundromat data cleaning and republishing infrastructure and that allows us to analyze the quality of literals on a very large scale, using a collection of quality criteria we specify in a systematic way. We illustrate the viability of our approach by lifting out two particular aspects in which the current LOD Cloud can be immediately improved by automated means: value canonization and language tagging. Since not all quality aspects can be addressed algorithmically, we also give an overview of other problems that can be used to guide future endeavors in tooling, training, and best practice formulation.

Language: English
Pages: 131-150
Number of pages: 20
Journal: Semantic Web
Volume: 9
Issue number: 1
DOIs: 10.3233/SW-170288
State: Published - 2018

Fingerprint

Linguistics
Cleaning
Semantics

Keywords

  • data observatory
  • Data quality
  • linked data
  • quality assessment
  • quality improvement

Cite this

@article{7fcc8392689e4c6f8bc5c8ae55eb89a2,
title = "Literally better: Analyzing and improving the quality of literals",
abstract = "Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We present a toolchain that builds on the LOD Laundromat data cleaning and republishing infrastructure and that allows us to analyze the quality of literals on a very large scale, using a collection of quality criteria we specify in a systematic way. We illustrate the viability of our approach by lifting out two particular aspects in which the current LOD Cloud can be immediately improved by automated means: value canonization and language tagging. Since not all quality aspects can be addressed algorithmically, we also give an overview of other problems that can be used to guide future endeavors in tooling, training, and best practice formulation.",
keywords = "data observatory, Data quality, linked data, quality assessment, quality improvement",
author = "Wouter Beek and Filip Ilievski and Jeremy Debattista and Stefan Schlobach and Jan Wielemaker",
year = "2018",
doi = "10.3233/SW-170288",
language = "English",
volume = "9",
pages = "131--150",
journal = "Semantic Web",
issn = "1570-0844",
publisher = "IOS Press",
number = "1",
}

Literally better: Analyzing and improving the quality of literals. / Beek, Wouter; Ilievski, Filip; Debattista, Jeremy; Schlobach, Stefan; Wielemaker, Jan.

In: Semantic Web, Vol. 9, No. 1, 2018, p. 131-150.

TY  - JOUR
T1  - Literally better
T2  - Semantic Web
AU  - Beek, Wouter
AU  - Ilievski, Filip
AU  - Debattista, Jeremy
AU  - Schlobach, Stefan
AU  - Wielemaker, Jan
PY  - 2018
Y1  - 2018
AB  - Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We present a toolchain that builds on the LOD Laundromat data cleaning and republishing infrastructure and that allows us to analyze the quality of literals on a very large scale, using a collection of quality criteria we specify in a systematic way. We illustrate the viability of our approach by lifting out two particular aspects in which the current LOD Cloud can be immediately improved by automated means: value canonization and language tagging. Since not all quality aspects can be addressed algorithmically, we also give an overview of other problems that can be used to guide future endeavors in tooling, training, and best practice formulation.
KW  - data observatory
KW  - Data quality
KW  - linked data
KW  - quality assessment
KW  - quality improvement
UR  - http://www.scopus.com/inward/record.url?scp=85037706022&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85037706022&partnerID=8YFLogxK
U2  - 10.3233/SW-170288
DO  - 10.3233/SW-170288
M3  - Article
VL  - 9
SP  - 131
EP  - 150
JO  - Semantic Web
JF  - Semantic Web
SN  - 1570-0844
IS  - 1
ER  -