Ontology-based methods for classifying scientific datasets into research domains: Much harder than expected

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Scientific datasets are increasingly stored, published, and re-used online. This has prompted major search engines to start services dedicated to finding research datasets online. However, to date such services are limited to keyword search, and provide little or no semantic guidance. Determining the scientific domain of a given dataset ("Which research domain does this dataset belong to?") is a crucial part of dataset recommendation and search. In this paper we investigate and compare a number of novel ontology-based methods to answer that question, using the distance between a domain ontology and a dataset as an estimator for the domain(s) into which the dataset should be classified. We also define a simple keyword-based classifier based on the Normalized Google Distance, and we evaluate all classifiers on a hand-constructed gold standard. Our two main findings are that the seemingly simple task of determining the domain(s) of a dataset is much harder than expected (even when performed under highly simplified circumstances), and that, equally surprisingly, the use of ontologies seems to be of little help in this task, with the simple keyword-based classifier outperforming every ontology-based classifier. We constructed a gold-standard benchmark for our experiments, which we make available online for others to use.
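The abstract names the Normalized Google Distance (NGD) but does not say how the keyword-based classifier applies it. As a rough illustration only, the sketch below uses the standard NGD definition over document hit counts and assigns a dataset to the domain label with the smallest average distance to its keywords. The function names (`ngd`, `make_counter`, `classify`), the toy corpus standing in for search-engine hit counts, and the averaging rule are all assumptions made for this example, not the authors' method.

```python
import math


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Standard Normalized Google Distance computed from hit counts.

    fx, fy: number of documents containing term x / term y
    fxy:    number of documents containing both terms
    n:      total number of documents in the index
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur together are treated as infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        # Both terms occur in every document, so they are indistinguishable.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denominator


def make_counter(docs):
    """Return a hit-count function over a toy corpus (assumption: sets of words)."""
    def count(*terms):
        return sum(1 for doc in docs if all(term in doc for term in terms))
    return count


def classify(keywords, domains, count, n):
    """Pick the domain label with the smallest mean NGD to the dataset keywords."""
    def mean_distance(domain):
        distances = [
            ngd(count(k), count(domain), count(k, domain), n) for k in keywords
        ]
        return sum(distances) / len(distances)
    return min(domains, key=mean_distance)


# Hypothetical toy corpus replacing real search-engine hit counts.
docs = [
    {"genome", "biology", "protein"},
    {"genome", "biology"},
    {"protein", "biology", "cell"},
    {"galaxy", "astronomy", "telescope"},
    {"galaxy", "astronomy"},
    {"telescope", "astronomy"},
]
count = make_counter(docs)
predicted = classify(["genome", "protein"], ["biology", "astronomy"], count, len(docs))
print(predicted)
```

In a realistic setting the counts would come from a search index or a large text corpus, and a dataset could plausibly be assigned to several domains by thresholding the distance rather than taking the single minimum.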

Original language: English
Title of host publication: Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR
Subtitle of host publication: Volume 1: KDIR
Editors: Ana Fred, Joaquim Filipe
Publisher: SciTePress
Pages: 153-160
Number of pages: 8
Volume: 1
ISBN (Electronic): 9789897584749
ISBN (Print): 9789897584749
Publication status: Published - Nov 2020
Event: 12th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2020 - Part of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2020 - Virtual, Online
Duration: 2 Nov 2020 to 4 Nov 2020

Publication series

Name: Proceedings of the International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management

Conference

Conference: 12th International Conference on Knowledge Discovery and Information Retrieval, KDIR 2020 - Part of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2020
City: Virtual, Online
Period: 2/11/20 to 4/11/20

Bibliographical note

Funding Information:
This work has been funded by the Netherlands Science Foundation NWO grant nr. 652.001.002, which is also partially funded by Elsevier. The first author is funded by the China Scholarship Council (CSC) under grant number 201807730060.

Publisher Copyright:
Copyright © 2020 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.

Keywords

  • Data science
  • Domain classification
  • Google distance
  • Ontology classification
  • Semantic similarity
