Suffix-Prefix Queries on a Dictionary

Grigorios Loukides*, Solon P. Pissis*, Sharma V. Thankachan*, Wiktor Zuba*

*Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

Abstract

In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S1, . . ., Sk, of total length n, and we are asked to find the length SPLi,j of the longest string that is both a suffix of Si and a prefix of Sj, for all i, j ∈ [1, k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ nO(1), APSP can be solved in the optimal O(n + k2) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992]. In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k2 usually dominates n, and thus the k2 factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries: One-to-One(i, j): output SPLi,j. One-to-All(i): output SPLi,j for every j ∈ [1, k]. Report(i, ℓ): output all distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Count(i, ℓ): output the number of distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Top(i, K): output K distinct j ∈ [1, k] with the highest values of SPLi,j breaking ties arbitrarily. We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ nO(1). We show the following upper bounds: Query Space (words) Query time Note One-to-One(i, j) O(n) O(log log k) Theorem 11 One-to-All(i) O(n) O(k) Theorem 14 Report(i, ℓ) O(n) O(log n/log log n + output) Theorem 19(i) Count(i, ℓ) O(n) O(log n/log log n) Theorem 19(ii) Top(i, K) O(n) O(log2 n/log log n + K) Theorem 22 We also present efficient algorithms for constructing these data structures.

Original language English 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023) [Proceedings] Laurent Bulteau, Zsuzsanna Liptak Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing 1-20 20 9783959772761 https://doi.org/10.4230/LIPIcs.CPM.2023.21 Published - 2023 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 - Marne-la-Vallee, FranceDuration: 26 Jun 2023 → 28 Jun 2023

Publication series

Name Leibniz International Proceedings in Informatics, LIPIcs 259 1868-8969

Conference

Conference 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 France Marne-la-Vallee 26/06/23 → 28/06/23

Bibliographical note

Funding Information:
Funding This work is supported in part by the Royal Society grant IES\R3\193209. Solon P. Pissis: Supported in part by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively. Sharma V. Thankachan: Supported by the U.S. National Science Foundation (NSF) grant CCF-2146003. Wiktor Zuba: Supported by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003.