Evaluating Similarity Measures for Dataset Search

Xu Wang*, Zhisheng Huang, Frank van Harmelen

*Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceeding; Conference contribution; Academic; peer-review



Dataset search engines help scientists find research datasets for scientific experiments. Current dataset search engines are query-driven, so their usefulness is limited by how well users can specify search queries. An alternative is to adopt a recommendation paradigm (“if you like this dataset, you’ll also like...”). Such a recommendation service requires an appropriate similarity metric between datasets. Various similarity measures have been proposed in computational linguistics and information retrieval. The goal of this paper is to determine which similarity measure is suitable for a dataset search engine. We report our experiments with different similarity measures over datasets, and we evaluate these measures against the gold standards developed for Elsevier DataSearch, a commercial dataset search engine. Using the F-measure and nDCG as evaluation measures, we find that Wu-Palmer Similarity, a similarity measure based on hierarchical terminologies, scores quite well in our benchmarks.
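The measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy hierarchy `PARENT`, the helper names, and the depth convention (root at depth 1) are all assumptions made for illustration, and the paper may use a different nDCG variant or a different terminology.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "entity": None,
    "organism": "entity",
    "animal": "organism",
    "plant": "organism",
    "dog": "animal",
    "cat": "animal",
}


def path_to_root(term):
    """Return [term, parent, ..., root] for a term in PARENT."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Depth of a term, counting the root as depth 1 (an assumed convention)."""
    return len(path_to_root(term))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first ancestor of a that also subsumes b is the least common subsumer.
    lcs = next(t for t in path_to_root(a) if t in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def dcg(relevances):
    """Discounted cumulative gain of graded relevances in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def f_measure(retrieved, relevant):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

In this toy setting, `wu_palmer("dog", "cat")` is higher than `wu_palmer("dog", "plant")` because the first pair shares a deeper common ancestor, which is the intuition behind using a hierarchy-based measure to rank similar datasets.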

Original language: English
Title of host publication: Web Information Systems Engineering – WISE 2020
Subtitle of host publication: 21st International Conference, Amsterdam, The Netherlands, October 20–24, 2020, Proceedings, Part II
Editors: Zhisheng Huang, Wouter Beek, Hua Wang, Yanchun Zhang, Rui Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 14
ISBN (Electronic): 9783030620080
ISBN (Print): 9783030620073
Publication status: Published - 2020
Event: 21st International Conference on Web Information Systems Engineering, WISE 2020 - Amsterdam, Netherlands
Duration: 20 Oct 2020 to 24 Oct 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12343 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 21st International Conference on Web Information Systems Engineering, WISE 2020


  • Data science
  • Dataset search
  • Google Distance
  • Ontology-based similarity
  • Semantic similarity

