TY - GEN
T1 - Efficient computation of sequence mappability
AU - Alzamel, Mai
AU - Charalampopoulos, Panagiotis
AU - Iliopoulos, Costas S.
AU - Kociumaka, Tomasz
AU - Pissis, Solon P.
AU - Radoszewski, Jakub
AU - Straszyński, Juliusz
PY - 2018/1/1
Y1 - 2018/1/1
N2 - Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices j≠i such that length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(n min {mk,log k+1 n}) time and O(n) space for k=O(1). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n2) -time algorithms to compute all results for a fixed m and all k=0,…,m or a fixed k and all m=k,…,n-1. Finally we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for k,m = Θ (log n) unless the Strong Exponential Time Hypothesis fails.
AB - Sequence mappability is an important task in genome re-sequencing. In the (k, m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose ith entry is the number of indices j≠i such that length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(n min {mk,log k+1 n}) time and O(n) space for k=O(1). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n2) -time algorithms to compute all results for a fixed m and all k=0,…,m or a fixed k and all m=k,…,n-1. Finally we show that the (k, m)-mappability problem cannot be solved in strongly subquadratic time for k,m = Θ (log n) unless the Strong Exponential Time Hypothesis fails.
KW - Genome sequencing
KW - Hamming distance
KW - Longest common substring with k mismatches
KW - Sequence mappability
KW - Suffix tree
UR - http://www.scopus.com/inward/record.url?scp=85054834341&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054834341&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-00479-8_2
DO - 10.1007/978-3-030-00479-8_2
M3 - Conference contribution
AN - SCOPUS:85054834341
SN - 9783030004781
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 12
EP - 26
BT - String Processing and Information Retrieval - 25th International Symposium, SPIRE 2018, Proceedings
A2 - Gagie, Travis
A2 - Moffat, Alistair
A2 - Navarro, Gonzalo
A2 - Cuadros-Vargas, Ernesto
PB - Springer Verlag
T2 - 25th International Symposium on String Processing and Information Retrieval, SPIRE 2018
Y2 - 9 October 2018 through 11 October 2018
ER -