Efficient Computation of Sequence Mappability

Panagiotis Charalampopoulos, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Juliusz Straszyński*

*Corresponding author for this work

Research output: Contribution to JournalArticleAcademicpeer-review

Abstract

Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices j≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of k= 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for k= O(1) , works in O(n) space and, with high probability, in O(n· min { mk, log kn}) time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop O(n2) -time algorithms to compute all (k, m)-mappability tables for a fixed m and all k∈ { 0 , … , m} or a fixed k and all m∈ { k, … , n}. Finally, we show that, for k, m= Θ (log n) , the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.

Original languageEnglish
Pages (from-to)1418-1440
Number of pages23
JournalAlgorithmica
Volume84
Issue number5
Early online date2 Feb 2022
DOIs
Publication statusPublished - May 2022

Bibliographical note

Funding Information:
Panagiotis Charalampopoulos is supported by the Israel Science Foundation Grant 592/17. Tomasz Kociumaka is partly supported by NSF 1652303, 1909046, and HDR TRIPODS 1934846 grants, and an Alfred P. Sloan Fellowship. Jakub Radoszewski and Juliusz Straszyński are supported by the ”Algorithms for text processing with errors and uncertainties”project carried out within the HOMING program of the Foundation for Polish Science co-financed by the European Union under the European Regional Development Fund, Project No. POIR.04.04.00-00-24BA/16, and by the Polish National Science Center, Grant Number 2018/31/D/ST6/03991.

Funding Information:
This paper is part of the PANGAIA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 872539. This paper is also part of the ALPACA project that has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 956229.

Publisher Copyright:
© 2022, The Author(s).

Keywords

  • Hamming distance
  • k-errata tree
  • Sequence mappability

Fingerprint

Dive into the research topics of 'Efficient Computation of Sequence Mappability'. Together they form a unique fingerprint.

Cite this