Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications

Esteban Gabory*, Nadia Pisanti*, Jakub Radoszewski*, Wiktor Zuba*, Moses Njagi Mwaniki*, Solon P. Pissis*, Michelle Sweering*

*Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

Abstract

An elastic-degenerate (ED) string T is a sequence of n sets T[1], . . ., T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as L(T) = {S1 · · · Sn : Si ∈ T[i] for all i ∈ [1, n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T1 and T2 of lengths n1 and n2, cardinalities m1 and m2, and sizes N1 and N2, respectively, we show the following: There is no O((N1N2)1−ϵ)-time algorithm, thus no O ((N1m2 + N2m1)1−ϵ)-time algorithm and no O ((N1n2 + N2n1)1−ϵ)-time algorithm, for any constant ϵ > 0, for EDSI even when T1 and T2 are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false. There is no combinatorial O((N1 + N2)1.2−ϵf(n1, n2))-time algorithm, for any constant ϵ > 0 and any function f, for EDSI even when T1 and T2 are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false. An O(N1 log N1 log n1 + N2 log N2 log n2)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T1 and T2 are given in a compact representation, we show that the problem is NP-complete. An O(N1m2 + N2m1)-time algorithm for EDSI. An Õ(N1ω−1n2 + N2ω−1n1)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size. We also show that the techniques we develop have applications outside of ED string comparison.

Original languageEnglish
Title of host publication34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Subtitle of host publication[proceedings]
EditorsLaurent Bulteau, Zsuzsanna Liptak
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
Pages1-20
Number of pages20
ISBN (Print)9783959772761
DOIs
Publication statusPublished - 2023
Event34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 - Marne-la-Vallee, France
Duration: 26 Jun 202328 Jun 2023

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume259
ISSN (Print)1868-8969

Conference

Conference34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023
Country/TerritoryFrance
CityMarne-la-Vallee
Period26/06/2328/06/23

Bibliographical note

Funding Information:
Funding The work in this paper is supported in part: by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003; and by PNRR ECS00000017 Tuscany Health Ecosystem Spoke 6 “Precision medicine & personalized healthcare”, funded by the European Commission under the NextGeneration EU programme. Jakub Radoszewski: Supported by the Polish National Science Center, grant no. 2018/31/D/ ST6/03991.

Publisher Copyright:
© 2023 Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. All rights reserved.

Funding

Funding The work in this paper is supported in part: by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; by the Netherlands Organisation for Scientific Research (NWO) through Gravitation-grant NETWORKS-024.002.003; and by PNRR ECS00000017 Tuscany Health Ecosystem Spoke 6 “Precision medicine & personalized healthcare”, funded by the European Commission under the NextGeneration EU programme. Jakub Radoszewski: Supported by the Polish National Science Center, grant no. 2018/31/D/ ST6/03991.

Keywords

  • acronym identification
  • elastic-degenerate string
  • languages intersection
  • pangenome
  • sequence comparison

Fingerprint

Dive into the research topics of 'Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications'. Together they form a unique fingerprint.

Cite this