TY - GEN
T1 - A high-performance approach to string similarity using most frequent K characters
AU - Valdestilhas, Andre
AU - Soru, Tommaso
AU - Ngomo, Axel-Cyrille Ngonga
PY - 2017
Y1 - 2017
N2 - The amount of available data has been growing significantly over the last decades. Thus, linking entries across heterogeneous data sources such as databases or knowledge bases, becomes an increasingly difficult problem, in particular w.r.t. the runtime of these tasks. Consequently, it is of utmost importance to provide time-efficient approaches for similarity joins in the Web of Data. While a number of scalable approaches have been developed for various measures, the Most Frequent k Characters (MFKC) measure has not been tackled in previous works. We hence present a sequence of filters that allow discarding comparisons when executing bounded similarity computations without losing recall. Therewith, we can reduce the runtime of bounded similarity computations by approximately 70%. Our experiments with a single-threaded, a parallel and a GPU implementation of our filters suggest that our approach scales well even when dealing with millions of potential comparisons.
AB - The amount of available data has been growing significantly over the last decades. Thus, linking entries across heterogeneous data sources such as databases or knowledge bases, becomes an increasingly difficult problem, in particular w.r.t. the runtime of these tasks. Consequently, it is of utmost importance to provide time-efficient approaches for similarity joins in the Web of Data. While a number of scalable approaches have been developed for various measures, the Most Frequent k Characters (MFKC) measure has not been tackled in previous works. We hence present a sequence of filters that allow discarding comparisons when executing bounded similarity computations without losing recall. Therewith, we can reduce the runtime of bounded similarity computations by approximately 70%. Our experiments with a single-threaded, a parallel and a GPU implementation of our filters suggest that our approach scales well even when dealing with millions of potential comparisons.
UR - http://www.scopus.com/inward/record.url?scp=85040376018&partnerID=8YFLogxK
M3 - Conference contribution
VL - 2032
T3 - CEUR Workshop Proceedings
SP - 1
EP - 12
BT - OM 2017 - Proceedings of the 12th International Workshop on Ontology Matching, co-located with the 16th International Semantic Web Conference, ISWC 2017
A2 - Euzenat, J.
A2 - Shvaiko, P.
A2 - Jimenez-Ruiz, E.
A2 - Hassanzadeh, O.
A2 - Cheatham, M.
PB - CEUR-WS
T2 - 12th International Workshop on Ontology Matching, OM 2017
Y2 - 21 October 2017
ER -