Suffix-Prefix Queries on a Dictionary

Grigorios Loukides*, Solon P. Pissis*, Sharma V. Thankachan*, Wiktor Zuba*

*Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

Abstract

In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S1, . . ., Sk, of total length n, and we are asked to find the length SPLi,j of the longest string that is both a suffix of Si and a prefix of Sj, for all i, j ∈ [1, k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ nO(1), APSP can be solved in the optimal O(n + k2) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992]. In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k2 usually dominates n, and thus the k2 factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries: One-to-One(i, j): output SPLi,j. One-to-All(i): output SPLi,j for every j ∈ [1, k]. Report(i, ℓ): output all distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Count(i, ℓ): output the number of distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Top(i, K): output K distinct j ∈ [1, k] with the highest values of SPLi,j breaking ties arbitrarily. We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ nO(1). We show the following upper bounds: Query Space (words) Query time Note One-to-One(i, j) O(n) O(log log k) Theorem 11 One-to-All(i) O(n) O(k) Theorem 14 Report(i, ℓ) O(n) O(log n/log log n + output) Theorem 19(i) Count(i, ℓ) O(n) O(log n/log log n) Theorem 19(ii) Top(i, K) O(n) O(log2 n/log log n + K) Theorem 22 We also present efficient algorithms for constructing these data structures.

Original languageEnglish
Title of host publication34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Subtitle of host publication[Proceedings]
EditorsLaurent Bulteau, Zsuzsanna Liptak
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
Pages1-20
Number of pages20
ISBN (Print)9783959772761
DOIs
Publication statusPublished - 2023
Event34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 - Marne-la-Vallee, France
Duration: 26 Jun 202328 Jun 2023

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume259
ISSN (Print)1868-8969

Conference

Conference34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023
Country/TerritoryFrance
CityMarne-la-Vallee
Period26/06/2328/06/23

Bibliographical note

Funding Information:
Funding This work is supported in part by the Royal Society grant IES\R3\193209. Solon P. Pissis: Supported in part by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively. Sharma V. Thankachan: Supported by the U.S. National Science Foundation (NSF) grant CCF-2146003. Wiktor Zuba: Supported by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003.

Publisher Copyright:
© 2023 Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. All rights reserved.

Keywords

  • all-pairs suffix-prefix
  • internal pattern matching
  • suffix-prefix queries

Fingerprint

Dive into the research topics of 'Suffix-Prefix Queries on a Dictionary'. Together they form a unique fingerprint.

Cite this