Abstract
An elastic-degenerate (ED) string T is a sequence of n sets T[1], . . ., T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as L(T) = {S1 · · · Sn : Si ∈ T[i] for all i ∈ [1, n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T1 and T2 of lengths n1 and n2, cardinalities m1 and m2, and sizes N1 and N2, respectively, we show the following: There is no O((N1N2)1−ϵ)-time algorithm, thus no O ((N1m2 + N2m1)1−ϵ)-time algorithm and no O ((N1n2 + N2n1)1−ϵ)-time algorithm, for any constant ϵ > 0, for EDSI even when T1 and T2 are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false. There is no combinatorial O((N1 + N2)1.2−ϵf(n1, n2))-time algorithm, for any constant ϵ > 0 and any function f, for EDSI even when T1 and T2 are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false. An O(N1 log N1 log n1 + N2 log N2 log n2)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T1 and T2 are given in a compact representation, we show that the problem is NP-complete. An O(N1m2 + N2m1)-time algorithm for EDSI. An Õ(N1ω−1n2 + N2ω−1n1)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size. We also show that the techniques we develop have applications outside of ED string comparison.
Original language | English |
---|---|
Title of host publication | 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023) |
Subtitle of host publication | [proceedings] |
Editors | Laurent Bulteau, Zsuzsanna Liptak |
Publisher | Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing |
Pages | 1-20 |
Number of pages | 20 |
ISBN (Print) | 9783959772761 |
DOIs | |
Publication status | Published - 2023 |
Event | 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 - Marne-la-Vallee, France Duration: 26 Jun 2023 → 28 Jun 2023 |
Publication series
Name | Leibniz International Proceedings in Informatics, LIPIcs |
---|---|
Volume | 259 |
ISSN (Print) | 1868-8969 |
Conference
Conference | 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 |
---|---|
Country/Territory | France |
City | Marne-la-Vallee |
Period | 26/06/23 → 28/06/23 |
Bibliographical note
Funding Information:Funding The work in this paper is supported in part: by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003; and by PNRR ECS00000017 Tuscany Health Ecosystem Spoke 6 “Precision medicine & personalized healthcare”, funded by the European Commission under the NextGeneration EU programme. Jakub Radoszewski: Supported by the Polish National Science Center, grant no. 2018/31/D/ ST6/03991.
Publisher Copyright:
© 2023 Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. All rights reserved.
Funding
Funding The work in this paper is supported in part: by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003; and by PNRR ECS00000017 Tuscany Health Ecosystem Spoke 6 “Precision medicine & personalized healthcare”, funded by the European Commission under the NextGeneration EU programme. Jakub Radoszewski: Supported by the Polish National Science Center, grant no. 2018/31/D/ ST6/03991.
Keywords
- acronym identification
- elastic-degenerate string
- languages intersection
- pangenome
- sequence comparison