Evaluating Similarity Measures for Dataset Search

Xu Wang*, Zhisheng Huang, Frank van Harmelen

*Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Dataset search engines help scientists find research datasets for scientific experiments. Current dataset search engines are query-driven, so their usefulness is limited by how well the search query is specified. An alternative would be to adopt a recommendation paradigm (“if you like this dataset, you’ll also like ...”). Such a recommendation service requires an appropriate similarity metric between datasets. Various similarity measures have been proposed in computational linguistics and information retrieval. The goal of this paper is to determine which similarity measure is suitable for a dataset search engine. We report our experiments on different similarity measures over datasets. We evaluate these similarity measures against the gold standards developed for Elsevier DataSearch, a commercial dataset search engine. Using the F-measure and nDCG evaluation measures, we find that Wu-Palmer Similarity, a similarity measure based on hierarchical terminologies, scores quite well in our benchmarks.
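
As a rough illustration of the kind of hierarchy-based measure the abstract refers to, the sketch below computes Wu-Palmer similarity over a small made-up taxonomy. The taxonomy, the names PARENT, path_to_root, depth and wu_palmer, and the convention that the root has depth 1 are all assumptions for this example; it is not the authors' implementation and does not use their terminologies or data.

```python
# Hypothetical toy hierarchy for illustration only: child -> parent.
PARENT = {
    "entity": None,
    "organism": "entity",
    "animal": "organism",
    "plant": "organism",
    "dog": "animal",
    "cat": "animal",
}


def path_to_root(term):
    """Return the list [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def depth(term):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(path_to_root(term))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b)),
    where lcs is the deepest ancestor shared by a and b."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first shared term is the deepest common ancestor.
    lcs = next(t for t in path_to_root(b) if t in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    print(wu_palmer("dog", "cat"))    # siblings under "animal"
    print(wu_palmer("dog", "plant"))  # only share "organism"
```

Terms that share a deep common ancestor score closer to 1, and identical terms score exactly 1, which is what makes the measure usable for ranking candidate datasets by the terms that describe them.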

Original language: English
Title of host publication: Web Information Systems Engineering – WISE 2020
Subtitle of host publication: 21st International Conference, Amsterdam, The Netherlands, October 20–24, 2020, Proceedings, Part II
Editors: Zhisheng Huang, Wouter Beek, Hua Wang, Yanchun Zhang, Rui Zhou
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 38-51
Number of pages: 14
Volume: 2
ISBN (Electronic): 9783030620080
ISBN (Print): 9783030620073
DOIs
Publication status: Published - 2020
Event: 21st International Conference on Web Information Systems Engineering, WISE 2020 - Amsterdam, Netherlands
Duration: 20 Oct 2020 to 24 Oct 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12343 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Web Information Systems Engineering, WISE 2020
Country/Territory: Netherlands
City: Amsterdam
Period: 20/10/20 to 24/10/20

Funding

Acknowledgements. This work has been funded by the Netherlands Science Foundation NWO, grant nr. 652.001.002, and co-funded by Elsevier B.V., with funding for the first author from the China Scholarship Council (CSC), grant number 201807730060. We are grateful to our colleagues at Elsevier for sharing their dataset, and to all of our colleagues in the Data Search project for their valuable input.

Funders                               Funder number
Netherlands Science Foundation NWO
China Scholarship Council             201807730060
Not added                             652.001.002

    Keywords

    • Data science
    • Dataset search
    • Google Distance
    • Ontology-based similarity
    • Semantic similarity
