A high-performance approach to string similarity using most frequent K characters

Andre Valdestilhas, Tommaso Soru, Axel-Cyrille Ngonga Ngomo

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

Abstract

The amount of available data has been growing significantly over the last decades. Thus, linking entries across heterogeneous data sources such as databases or knowledge bases, becomes an increasingly difficult problem, in particular w.r.t. the runtime of these tasks. Consequently, it is of utmost importance to provide time-efficient approaches for similarity joins in the Web of Data. While a number of scalable approaches have been developed for various measures, the Most Frequent k Characters (MFKC) measure has not been tackled in previous works. We hence present a sequence of filters that allow discarding comparisons when executing bounded similarity computations without losing recall. Therewith, we can reduce the runtime of bounded similarity computations by approximately 70%. Our experiments with a single-threaded, a parallel and a GPU implementation of our filters suggest that our approach scales well even when dealing with millions of potential comparisons.
Original languageEnglish
Title of host publicationOM 2017 - Proceedings of the 12th International Workshop on Ontology Matching, co-located with the 16th International Semantic Web Conference, ISWC 2017
EditorsJ. Euzenat, P. Shvaiko, E. Jimenez-Ruiz, O. Hassanzadeh, M. Cheatham
PublisherCEUR-WS
Pages1-12
Volume2032
Publication statusPublished - 2017
Externally publishedYes
Event12th International Workshop on Ontology Matching, OM 2017 - Vienna, Austria
Duration: 21 Oct 2017 → …

Publication series

NameCEUR Workshop Proceedings
ISSN (Print)1613-0073

Conference

Conference12th International Workshop on Ontology Matching, OM 2017
Country/TerritoryAustria
CityVienna
Period21/10/17 → …

Fingerprint

Dive into the research topics of 'A high-performance approach to string similarity using most frequent K characters'. Together they form a unique fingerprint.

Cite this