Abstract
An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an O (nm15 √log m+N)-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that N is substantially larger than both n and m, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016]. Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction, we show that a combinatorial algorithm solving the EDSM problem in \scrO (nm1.5 - \epsilon + N) time, for any \epsilon > 0, refutes this conjecture. Our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication. String periodicity and fast Fourier transform are two standard tools in string algorithms. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a noncombinatorial \scrO \~(nm\omega -1 + N)-time algorithm for EDSM, where \omega denotes the matrix multiplication exponent and the \scrO \~(\cdot ) notation suppresses polylog factors. To the best of our knowledge, we are the first to combine these tools. In particular, using the fact that \omega < 2.373 [Alman and Williams, SODA 2021; Le Gall, ISSAC 2014; Williams, STOC 2012], we obtain an \scrO (nm1.373 + N)-time algorithm for EDSM. An important building block in our solution that might find applications in other problems is a method of selecting a small set of length-\ell substrings of the pattern, called anchors, so that any occurrence of a string from an ED text set contains at least one but not too many (on average) such anchors inside.
Original language | English |
---|---|
Pages (from-to) | 549-576 |
Number of pages | 28 |
Journal | SIAM Journal on Computing |
Volume | 51 |
Issue number | 3 |
Early online date | 19 May 2022 |
DOIs | |
Publication status | Published - Jun 2022 |
Bibliographical note
Funding Information:The first author is supported by the Netherlands Organisation for Scientific Research (NWO) project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL).” The second author is partially supported by the Bekker programme of the Polish National Agency for Academic Exchange (PPN/BEK/2020/1/00444). The third and fifth authors are supported by the University of Pisa under the “PRA - Progetti di Ricerca di Ateneo” (Institutional Research Grants) - Project PRA 20202021 26 “Metodi Informatici Integrati per la Biomedica” and partially supported by MIUR-SIR project CMACBioSeq “Combinatorial methods for analysis and compression of biological sequences” grant RBSI146R5L. The third and fourth authors are partially supported by the ALPACA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sk̷lodowska-Curie grant agreement 956229. The fourth author is partially supported by the PANGAIA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sk̷lodowska-Curie grant agreement 872539.
Funding Information:
\ast Received by the editors September 21, 2020; accepted for publication (in revised form) November 8, 2021; published electronically May 19, 2022. A preliminary version appeared in Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019), pp. 21:1--21:15. https://doi.org/10.1137/20M1368033 Funding: The first author is supported by the Netherlands Organisation for Scientific Research (NWO) project OCENW.GROOT.2019.015 ``Optimization for and with Machine Learning (OPTIMAL)."" The second author is partially supported by the Bekker programme of the Polish National Agency for Academic Exchange (PPN/BEK/2020/1/00444). The third and fifth authors are supported by the University of Pisa under the ``PRA --Progetti di Ricerca di Ateneo"" (Institutional Research Grants) - Project PRA 20202021 26 ``Metodi Informatici Integrati per la Biomedica"" and partially supported by MIUR-SIR project CMACBioSeq ``Combinatorial methods for analysis and compression of biological sequences"" grant RBSI146R5L. The third and fourth authors are partially supported by the ALPACA project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk\lodowska-Curie grant agreement 956229. The fourth author is partially supported by the PANGAIA project that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk\lodowska-Curie grant agreement 872539. \dagger CWI, 1090 GB, Amsterdam, the Netherlands ([email protected]). \ddagger University of Wroc\law, 50-370 Wroc\law, Poland ([email protected]). \S University of Pisa, 56126 Pisa, Italy ([email protected], [email protected]).
Publisher Copyright:
© 2022 Society for Industrial and Applied Mathematics
Keywords
- elastic-degenerate string
- fast Fourier transform
- matrix multiplication
- pattern matching
- string algorithms