TY - GEN
T1 - Efficient index for weighted sequences
AU - Barton, Carl
AU - Kociumaka, Tomasz
AU - Pissis, Solon P.
AU - Radoszewski, Jakub
PY - 2016/6/1
Y1 - 2016/6/1
N2 - The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold 1/z, we say that a pattern string P matches a weighted text at starting position i if the product of probabilities of the letters of P at positions i, . . . , i + |P| - 1 in the text is at least 1/z. In this article, we present an O(nz)-time construction of an O(nz)-sized index that can answer pattern matching queries in a weighted text over a constant-sized alphabet in optimal time. This improves upon the state of the art by a factor of z log z in construction time and space. Other applications of this data structure include an O(nz)-time construction of the weighted prefix table and an O(nz)-time computation of all covers of a weighted sequence, which improve upon the time complexity of the state of the art by the same factor.
AB - The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold 1/z, we say that a pattern string P matches a weighted text at starting position i if the product of probabilities of the letters of P at positions i, . . . , i + |P| - 1 in the text is at least 1/z. In this article, we present an O(nz)-time construction of an O(nz)-sized index that can answer pattern matching queries in a weighted text over a constant-sized alphabet in optimal time. This improves upon the state of the art by a factor of z log z in construction time and space. Other applications of this data structure include an O(nz)-time construction of the weighted prefix table and an O(nz)-time computation of all covers of a weighted sequence, which improve upon the time complexity of the state of the art by the same factor.
KW - Indexing
KW - Position weight matrix
KW - Weighted sequence
KW - Weighted suffix tree
UR - http://www.scopus.com/inward/record.url?scp=85012022747&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85012022747&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.CPM.2016.4
DO - 10.4230/LIPIcs.CPM.2016.4
M3 - Conference contribution
AN - SCOPUS:85012022747
T3 - Leibniz International Proceedings in Informatics, LIPIcs
SP - 4.1-4.13
BT - 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016
A2 - Grossi, Roberto
A2 - Lewenstein, Moshe
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016
Y2 - 27 June 2016 through 29 June 2016
ER -