Network metrics for assessing the quality of entity resolution between multiple datasets

Research output: Chapter in Book / Report / Conference proceeding > Conference contribution > Academic > peer-review

Abstract

Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.
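
As a rough illustration of the idea in the abstract, the sketch below treats one group of linked entities as an undirected network and computes its density (observed links divided by possible links). The abstract does not say which metrics the paper uses, so density is an assumption chosen only because it is simple; the function names and the entity identifiers are hypothetical. The helper `dataset_pairs` also shows why evaluation gets harder as datasets are added: k datasets give k(k-1)/2 dataset pairs that may need links.

```python
from math import comb


def dataset_pairs(k):
    """Number of distinct dataset pairs among k datasets: k * (k - 1) / 2."""
    return comb(k, 2)


def link_network_metrics(links):
    """Compute simple metrics for one cluster of entity links.

    `links` is an iterable of (entity_a, entity_b) pairs. Links are treated
    as undirected, and self-links and repeated links are ignored.
    Density is an illustrative metric only (an assumption, not necessarily
    the metric used in the paper).
    """
    edges = {frozenset(pair) for pair in links if pair[0] != pair[1]}
    nodes = set().union(*edges) if edges else set()
    n = len(nodes)
    possible = n * (n - 1) // 2
    density = len(edges) / possible if possible else 0.0
    return {
        "nodes": n,
        "links": len(edges),
        "possible_links": possible,
        "density": density,
    }


# Hypothetical identifiers of the form "dataset:entity".
triangle = [("a:1", "b:1"), ("b:1", "c:1"), ("a:1", "c:1")]
chain = [("a:2", "b:2"), ("b:2", "c:2"), ("c:2", "d:2")]

print(link_network_metrics(triangle))
print(link_network_metrics(chain))
print(dataset_pairs(6))
```

In this sketch a fully connected cluster (every entity linked to every other) gets density 1, while a chain of four entities has only 3 of its 6 possible links. For the six datasets mentioned in the abstract, `dataset_pairs(6)` gives 15 dataset pairs.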

Original language: English
Title of host publication: Knowledge Engineering and Knowledge Management
Subtitle of host publication: 21st International Conference, EKAW 2018, Nancy, France, November 12-16, Proceedings
Editors: Amedeo Napoli, Chiara Ghidini, Yannick Toussaint, Catherine Faron Zucker
Place of Publication: Basel
Publisher: Springer Nature Switzerland AG
Pages: 147-162
Number of pages: 16
ISBN (Electronic): 9783030036676
ISBN (Print): 9783030036669
DOIs: https://doi.org/10.1007/978-3-030-03667-6_10
Publication status: Published - 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11313
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • Network metrics
  • Entity resolution
  • Data integration

VU Research Profile

  • Connected World

Cite this

Idrissou, O. A. K., van Harmelen, F., & van den Besselaar, P. A. A. (2018). Network metrics for assessing the quality of entity resolution between multiple datasets. In A. Napoli, C. Ghidini, Y. Toussaint, & C. Faron Zucker (Eds.), Knowledge Engineering and Knowledge Management: 21st International Conference, EKAW 2018, Nancy, France, November 12-16, Proceedings (pp. 147-162). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11313). Basel: Springer Nature Switzerland AG. https://doi.org/10.1007/978-3-030-03667-6_10
Idrissou, O.A.K. ; van Harmelen, Frank ; van den Besselaar, P.A.A. / Network metrics for assessing the quality of entity resolution between multiple datasets. Knowledge Engineering and Knowledge Management: 21st International Conference, EKAW 2018, Nancy, France, November 12-16, Proceedings. editor / Amedeo Napoli ; Chiara Ghidini ; Yannick Toussaint ; Catherine Faron Zucker. Basel : Springer Nature Switzerland AG, 2018. pp. 147-162 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{ae824c9530eb43e2934e6c5ddcc57acf,
title = "Network metrics for assessing the quality of entity resolution between multiple datasets",
abstract = "Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.",
keywords = "Network metrics, Entity resolution, Data integration",
author = "Idrissou, O.A.K. and {van Harmelen}, Frank and {van den Besselaar}, P.A.A.",
year = "2018",
doi = "10.1007/978-3-030-03667-6_10",
language = "English",
isbn = "9783030036669",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Nature Switzerland AG",
pages = "147--162",
editor = "Amedeo Napoli and Chiara Ghidini and Yannick Toussaint and {Faron Zucker}, Catherine",
booktitle = "Knowledge Engineering and Knowledge Management",
volume = "11313",
address = "Basel",
}

Idrissou, OAK, van Harmelen, F & van den Besselaar, PAA 2018, Network metrics for assessing the quality of entity resolution between multiple datasets. in A Napoli, C Ghidini, Y Toussaint & C Faron Zucker (eds), Knowledge Engineering and Knowledge Management: 21st International Conference, EKAW 2018, Nancy, France, November 12-16, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11313, Springer Nature Switzerland AG, Basel, pp. 147-162. https://doi.org/10.1007/978-3-030-03667-6_10

Network metrics for assessing the quality of entity resolution between multiple datasets. / Idrissou, O.A.K.; van Harmelen, Frank; van den Besselaar, P.A.A.

Knowledge Engineering and Knowledge Management: 21st International Conference, EKAW 2018, Nancy, France, November 12-16, Proceedings. ed. / Amedeo Napoli; Chiara Ghidini; Yannick Toussaint; Catherine Faron Zucker. Basel : Springer Nature Switzerland AG, 2018. p. 147-162 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11313).


TY  - GEN
T1  - Network metrics for assessing the quality of entity resolution between multiple datasets
AU  - Idrissou, O.A.K.
AU  - van Harmelen, Frank
AU  - van den Besselaar, P.A.A.
PY  - 2018
Y1  - 2018
AB  - Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.
KW  - Network metrics
KW  - Entity resolution
KW  - Data integration
UR  - http://www.scopus.com/inward/record.url?scp=85067557290&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85067557290&partnerID=8YFLogxK
U2  - 10.1007/978-3-030-03667-6_10
DO  - 10.1007/978-3-030-03667-6_10
M3  - Conference contribution
SN  - 9783030036669
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 147
EP  - 162
BT  - Knowledge Engineering and Knowledge Management
A2  - Napoli, Amedeo
A2  - Ghidini, Chiara
A2  - Toussaint, Yannick
A2  - Faron Zucker, Catherine
PB  - Springer Nature Switzerland AG
CY  - Basel
ER  -

Idrissou OAK, van Harmelen F, van den Besselaar PAA. Network metrics for assessing the quality of entity resolution between multiple datasets. In Napoli A, Ghidini C, Toussaint Y, Faron Zucker C, editors, Knowledge Engineering and Knowledge Management: 21st International Conference, EKAW 2018, Nancy, France, November 12-16, Proceedings. Basel: Springer Nature Switzerland AG. 2018. p. 147-162. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-030-03667-6_10