TY - GEN

T1 - Efficient computation of sequence mappability

AU - Alzamel, Mai

AU - Charalampopoulos, Panagiotis

AU - Iliopoulos, Costas S.

AU - Kociumaka, Tomasz

AU - Pissis, Solon P.

AU - Radoszewski, Jakub

AU - Straszyński, Juliusz

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices j≠i such that length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(n min {mk,log k+1 n}) time and O(n) space for k=O(1). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n2) -time algorithms to compute all results for a fixed m and all k=0,…,m or a fixed k and all m=k,…,n-1. Finally we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for k,m = Θ (log n) unless the Strong Exponential Time Hypothesis fails.

AB - Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices j≠i such that length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(n min {mk,log k+1 n}) time and O(n) space for k=O(1). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n2) -time algorithms to compute all results for a fixed m and all k=0,…,m or a fixed k and all m=k,…,n-1. Finally we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for k,m = Θ (log n) unless the Strong Exponential Time Hypothesis fails.

KW - Genome sequencing

KW - Hamming distance

KW - Longest common substring with k mismatches

KW - Sequence mappability

KW - Suffix tree

UR - http://www.scopus.com/inward/record.url?scp=85054834341&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054834341&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-00479-8_2

DO - 10.1007/978-3-030-00479-8_2

M3 - Conference contribution

AN - SCOPUS:85054834341

SN - 9783030004781

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 12

EP - 26

BT - String Processing and Information Retrieval - 25th International Symposium, SPIRE 2018, Proceedings

A2 - Gagie, Travis

A2 - Moffat, Alistair

A2 - Navarro, Gonzalo

A2 - Cuadros-Vargas, Ernesto

PB - Springer Verlag

T2 - 25th International Symposium on String Processing and Information Retrieval, SPIRE 2018

Y2 - 9 October 2018 through 11 October 2018

ER -