TY - GEN
T1 - Dictionary matching in elastic-degenerate texts with applications in searching VCF files on-line
AU - Pissis, Solon P.
AU - Retha, Ahmad
PY - 2018/6/1
Y1 - 2018/6/1
N2 - An elastic-degenerate string is a sequence of n sets of strings of total length N. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm2 + N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N · mw ) with pre-processing time and space O(m·mw ), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N · Mw ) with pre-processing time and space O(M · Mw ), which is prohibitive in practice. We present a new on-line O(N · Mw )-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population’s variants are considered.
AB - An elastic-degenerate string is a sequence of n sets of strings of total length N. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm2 + N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N · mw ) with pre-processing time and space O(m·mw ), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N · Mw ) with pre-processing time and space O(M · Mw ), which is prohibitive in practice. We present a new on-line O(N · Mw )-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population’s variants are considered.
KW - Algorithms on strings
KW - Dictionary matching
KW - Elastic-degenerate string
KW - Phrases on-line algorithms
KW - Variant Call Format
UR - http://www.scopus.com/inward/record.url?scp=85062522834&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062522834&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.SEA.2018.16
DO - 10.4230/LIPIcs.SEA.2018.16
M3 - Conference contribution
AN - SCOPUS:85062522834
T3 - Leibniz International Proceedings in Informatics, LIPIcs
SP - 16:1-16:14
BT - 17th Symposium on Experimental Algorithms, SEA 2018
A2 - D'Angelo, Gianlorenzo
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 17th Symposium on Experimental Algorithms, SEA 2018
Y2 - 27 June 2018 through 29 June 2018
ER -