Abstract
In the all-pairs suffix-prefix (APSP) problem, we are given a dictionary R of k strings, S1, . . ., Sk, of total length n, and we are asked to find the length SPLi,j of the longest string that is both a suffix of Si and a prefix of Sj, for all i, j ∈ [1, k]. APSP is a classic problem in string algorithms with many applications in bioinformatics. When all strings of the dictionary are over an integer alphabet of size σ ≤ nO(1), APSP can be solved in the optimal O(n + k2) time with the use of the generalized suffix tree of the dictionary [Gusfield et al., Inf. Process. Lett. 1992]. In many bioinformatics applications, such as in sequence assembly, the size k of dictionary R is very large. In particular, k2 usually dominates n, and thus the k2 factor is the bottleneck both in the time and in the space complexity of such applications. We thus initiate a holistic study on several data structure variants of APSP. In particular, we consider the following types of queries: One-to-One(i, j): output SPLi,j. One-to-All(i): output SPLi,j for every j ∈ [1, k]. Report(i, ℓ): output all distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Count(i, ℓ): output the number of distinct j ∈ [1, k] such that SPLi,j ≥ ℓ, where ℓ ≥ 0 is an integer. Top(i, K): output K distinct j ∈ [1, k] with the highest values of SPLi,j breaking ties arbitrarily. We assume the standard word RAM model of computation with word size w = Ω(log n) and an integer alphabet of size σ ≤ nO(1). We show the following upper bounds: Query Space (words) Query time Note One-to-One(i, j) O(n) O(log log k) Theorem 11 One-to-All(i) O(n) O(k) Theorem 14 Report(i, ℓ) O(n) O(log n/log log n + output) Theorem 19(i) Count(i, ℓ) O(n) O(log n/log log n) Theorem 19(ii) Top(i, K) O(n) O(log2 n/log log n + K) Theorem 22 We also present efficient algorithms for constructing these data structures.
Original language | English |
---|---|
Title of host publication | 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023) |
Subtitle of host publication | [Proceedings] |
Editors | Laurent Bulteau, Zsuzsanna Liptak |
Publisher | Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing |
Pages | 1-20 |
Number of pages | 20 |
ISBN (Print) | 9783959772761 |
DOIs | |
Publication status | Published - 2023 |
Event | 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 - Marne-la-Vallee, France Duration: 26 Jun 2023 → 28 Jun 2023 |
Publication series
Name | Leibniz International Proceedings in Informatics, LIPIcs |
---|---|
Volume | 259 |
ISSN (Print) | 1868-8969 |
Conference
Conference | 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 |
---|---|
Country/Territory | France |
City | Marne-la-Vallee |
Period | 26/06/23 → 28/06/23 |
Bibliographical note
Funding Information:Funding This work is supported in part by the Royal Society grant IES\R3\193209. Solon P. Pissis: Supported in part by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively. Sharma V. Thankachan: Supported by the U.S. National Science Foundation (NSF) grant CCF-2146003. Wiktor Zuba: Supported by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003.
Publisher Copyright:
© 2023 Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. All rights reserved.
Keywords
- all-pairs suffix-prefix
- internal pattern matching
- suffix-prefix queries