TY - GEN

T1 - Longest common prefixes with k-errors and applications

AU - Ayad, Lorraine A.K.

AU - Barton, Carl

AU - Charalampopoulos, Panagiotis

AU - Iliopoulos, Costas S.

AU - Pissis, Solon P.

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Although real-world text datasets, such as DNA sequences, are far from being uniformly random, string searching average-case algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in (Formula presented) time on average, where c>1 is a constant, using O(n) space. Finally, we show that our technique is applicable to several algorithmic problems found in computational biology and elsewhere. The importance of our technique lies on the fact that it is the first one achieving this bound for non-constant k and using O(n) space.

AB - Although real-world text datasets, such as DNA sequences, are far from being uniformly random, string searching average-case algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in (Formula presented) time on average, where c>1 is a constant, using O(n) space. Finally, we show that our technique is applicable to several algorithmic problems found in computational biology and elsewhere. The importance of our technique lies on the fact that it is the first one achieving this bound for non-constant k and using O(n) space.

KW - k-errors

KW - k-mismatches

KW - Longest common factor

KW - Longest common prefix

KW - Longest common substring

UR - http://www.scopus.com/inward/record.url?scp=85054783468&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054783468&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-00479-8_3

DO - 10.1007/978-3-030-00479-8_3

M3 - Conference contribution

AN - SCOPUS:85054783468

SN - 9783030004781

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 27

EP - 41

BT - String Processing and Information Retrieval - 25th International Symposium, SPIRE 2018, Proceedings

A2 - Gagie, Travis

A2 - Moffat, Alistair

A2 - Navarro, Gonzalo

A2 - Cuadros-Vargas, Ernesto

PB - Springer Verlag

T2 - 25th International Symposium on String Processing and Information Retrieval, SPIRE 2018

Y2 - 9 October 2018 through 11 October 2018

ER -