Efficient index for weighted sequences

Carl Barton, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

Abstract

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold 1/z, we say that a pattern string P matches a weighted text at starting position i if the product of probabilities of the letters of P at positions i, . . . , i + |P| - 1 in the text is at least 1/z. In this article, we present an O(nz)-time construction of an O(nz)-sized index that can answer pattern matching queries in a weighted text over a constant-sized alphabet in optimal time. This improves upon the state of the art by a factor of z log z in construction time and space. Other applications of this data structure include an O(nz)-time construction of the weighted prefix table and an O(nz)-time computation of all covers of a weighted sequence, which improve upon the time complexity of the state of the art by the same factor.

Original languageEnglish
Title of host publication27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016
EditorsRoberto Grossi, Moshe Lewenstein
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
Pages4.1-4.13
ISBN (Electronic)9783959770125
DOIs
Publication statusPublished - 1 Jun 2016
Externally publishedYes
Event27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016 - Tel Aviv, Israel
Duration: 27 Jun 201629 Jun 2016

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume54
ISSN (Print)1868-8969

Conference

Conference27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016
CountryIsrael
CityTel Aviv
Period27/06/1629/06/16

Keywords

  • Indexing
  • Position weight matrix
  • Weighted sequence
  • Weighted suffix tree

Fingerprint

Dive into the research topics of 'Efficient index for weighted sequences'. Together they form a unique fingerprint.

Cite this