Bidirectional string anchors: A new string sampling mechanism

Grigorios Loukides*, Solon P. Pissis

*Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

Abstract

The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers w and k, it selects the lexicographically smallest length-k substring in every fragment of w consecutive length-k substrings (in every sliding window of length w + k − 1). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Although they do not have good worst-case guarantees on their size, they are often small in practice. They thus have been successfully employed in several string processing applications. Two main disadvantages of minimizers sampling mechanisms are: first, they also do not have good guarantees on the expected size of their samples for every combination of w and k; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches. To alleviate these disadvantages, we introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive integer ℓ, our mechanism selects the lexicographically smallest rotation in every length-ℓ fragment (in every sliding window of length ℓ). We show that bd-anchors samples are also approximately uniform, locally consistent, and computable in linear time. In addition, our experiments using several datasets demonstrate that the bd-anchors sample sizes decrease proportionally to ℓ; and that these sizes are competitive to or smaller than the minimizers sample sizes using the analogous sampling parameters. We provide theoretical justification for these results by analyzing the expected size of bd-anchors samples. We also show that by using any bd-anchors sample, we can construct, in near-linear time, an index which requires linear (extra) space in the size of the sample and answers on-line pattern searches in near-optimal time. We further show, using several datasets, that a simple implementation of our index is consistently faster for on-line pattern searches than an analogous implementation of a minimizers-based index [Grabowski and Raniszewski, Softw. Pract. Exp. 2017]. Finally, we highlight the applicability of bd-anchors by developing an efficient and effective heuristic for top-K similarity search under edit distance. We show, using synthetic datasets, that our heuristic is more accurate and more than one order of magnitude faster in top-K similarity searches than the state-of-the-art tool for the same purpose [Zhang and Zhang, KDD 2020].

Original languageEnglish
Title of host publication29th Annual European Symposium on Algorithms (ESA 2021)
EditorsPetra Mutzel, Rasmus Pagh, Grzegorz Herman
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
Pages1-21
Number of pages21
ISBN (Electronic)9783959772044
DOIs
Publication statusPublished - 31 Aug 2021
Event29th Annual European Symposium on Algorithms, ESA 2021 - Vitual, Lisbon, Portugal
Duration: 6 Sep 20218 Sep 2021

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume204
ISSN (Print)1868-8969

Conference

Conference29th Annual European Symposium on Algorithms, ESA 2021
Country/TerritoryPortugal
CityVitual, Lisbon
Period6/09/218/09/21

Bibliographical note

Funding Information:
Grigorios Loukides: This paper is part of the Leverhulme Trust RPG-2019-399 project. Solon P. Pissis: This paper is part of the PANGAIA project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk?odowska-Curie grant agreement No 872539. This paper is also part of the ALPACA project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk?odowska-Curie grant agreement No 956229.

Funding Information:
Funding Grigorios Loukides: This paper is part of the Leverhulme Trust RPG-2019-399 project. Solon P. Pissis: This paper is part of the PANGAIA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 872539. This paper is also part of the ALPACA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 956229.

Publisher Copyright:
© Grigorios Loukides and Solon P. Pissis; licensed under Creative Commons License CC-BY 4.0

Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.

Keywords

  • String algorithms
  • String sampling
  • Text indexing
  • Top-K similarity search

Fingerprint

Dive into the research topics of 'Bidirectional string anchors: A new string sampling mechanism'. Together they form a unique fingerprint.

Cite this