Abstract
Minimizers sampling is one of the most widely used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let S = S[1] ... S[n] be a string over a totally ordered alphabet Σ. Further let w ≥ 2 and k ≥ 1 be two integers. The minimizer of S[i .. i + w + k − 2] is the smallest position in [i, i + w − 1] where the lexicographically smallest length-k substring of S[i .. i + w + k − 2] starts. The set of minimizers over all i ∈ [1, n − w − k + 2] is the set M_{w,k}(S) of the minimizers of S. We consider the following basic problem: Given S, w, and k, can we efficiently compute a total order on Σ that minimizes |M_{w,k}(S)|? We show that this is unlikely by proving that the problem is NP-hard for any w ≥ 3 and k ≥ 1. Our result provides theoretical justification as to why there exist no exact algorithms for minimizing the minimizers samples, while there exists a plethora of heuristics for the same purpose.
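To make the definition concrete, the following is a minimal Python sketch of the minimizer set and of the ordering problem stated above. It is an illustration only, not the authors' method: the function names `minimizers` and `best_order_bruteforce` are hypothetical, and the exhaustive search over all |Σ|! orders is feasible only for very small alphabets, which is in line with the hardness result.

```python
from itertools import permutations


def minimizers(s, w, k, order=None):
    """Return the set of minimizer positions of s (1-indexed).

    `order` lists the alphabet from smallest to largest letter;
    if omitted, the natural character order is assumed.
    """
    if order is None:
        order = sorted(set(s))
    rank = {c: r for r, c in enumerate(order)}
    # Rank tuple of every length-k substring, compared lexicographically.
    kmer_rank = [tuple(rank[c] for c in s[j:j + k])
                 for j in range(len(s) - k + 1)]
    positions = set()
    # One window per i in [1, n - w - k + 2]; here i is 0-indexed.
    for i in range(len(s) - w - k + 2):
        # Smallest k-mer among the w candidates; ties go to the leftmost.
        j = min(range(i, i + w), key=lambda p: (kmer_rank[p], p))
        positions.add(j + 1)
    return positions


def best_order_bruteforce(s, w, k):
    """Try every total order on the letters of s (exponential time)."""
    alphabet = sorted(set(s))
    return min(permutations(alphabet),
               key=lambda o: len(minimizers(s, w, k, o)))


print(minimizers("abbabb", 3, 1, "ab"))    # {1, 4}
print(minimizers("abbabb", 3, 1, "ba"))    # {2, 3, 5}
print(best_order_bruteforce("abbabb", 3, 1))  # ('a', 'b')
```

For S = abbabb with w = 3 and k = 1, the order a < b yields two minimizers while b < a yields three, so the choice of order already changes the sample size on this small input.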
Original language | English |
---|---|
Title of host publication | 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024) |
Subtitle of host publication | [Proceedings] |
Editors | Shunsuke Inenaga, Simon J. Puglisi |
Publisher | Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing |
Pages | 1-13 |
Number of pages | 13 |
ISBN (Electronic) | 9783959773263 |
DOIs | |
Publication status | Published - 2024 |
Event | 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024 - Fukuoka, Japan; Duration: 25 Jun 2024 → 27 Jun 2024 |
Publication series
Name | Leibniz International Proceedings in Informatics, LIPIcs |
---|---|
Volume | 296 |
ISSN (Print) | 1868-8969 |
Conference
Conference | 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024 |
---|---|
Country/Territory | Japan |
City | Fukuoka |
Period | 25/06/24 → 27/06/24 |
Bibliographical note
Publisher Copyright: © Hilde Verbeek, Lorraine A.K. Ayad, Grigorios Loukides, and Solon P. Pissis.
Keywords
- alphabet reordering
- feedback arc set
- minimizers
- sequence analysis