Copyright © 2021 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved.
Dataset search is a specialized application of information retrieval that aims to help scientists find the datasets they need. Current dataset search engines are query-driven, so their results are limited by the user's ability to formulate an appropriate query. In this paper we address this limitation by framing dataset search as a recommendation task: given a dataset supplied by the user, the search engine recommends similar datasets. We approach this dataset recommendation task using similarity measures, and we provide a simple benchmark task for evaluating different approaches to it. We evaluate the task in the biomedical domain, benchmarking 8 different similarity metrics between datasets, including both ontology-based techniques and machine-learning techniques. Our results show that recommending scientific datasets based on metadata as it occurs in realistic dataset collections is a hard task. None of the ontology-based methods performs well on this task, and they are outscored by the majority of the machine-learning methods. Of these machine-learning methods, only one performs reasonably well, and even it reaches only 70% accuracy.
|Title of host publication||Proceedings of the 10th International Conference on Data Science, Technology and Applications, DATA 2021|
|Editors||C. Quix, S. Hammoudi, W. van der Aalst|
|Publication status||Published - 2021|
|Event||10th International Conference on Data Science, Technology and Applications, DATA 2021 - Virtual, Online (6 Jul 2021 → 8 Jul 2021)|