Constructing Antidictionaries in Output-Sensitive Space

Lorraine A.K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Heliou, Solon P. Pissis

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

Abstract

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y-1, y-2,...,y-k over an alphabet Σ, we are asked to compute the set M^ℓ-y-1#...#y-k of minimal absent words of length at most ℓ of word y=y-1#y-2#...#y-k, NotElementΣ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ-y-1#...#y-N || =o(n), for all N [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12-y-1#...#y-k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ-y-1,...,M^ℓ-y-1#...#y-k can be computed in O(kn+k-N=1||M^ℓ-y-1#...#y-N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y-1,...,y-k and MaxOut=max{||M^ℓ-y-1#...#y-N||:N [1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

Original languageEnglish
Title of host publicationProceedings - DCC 2019
Subtitle of host publication2019 Data Compression Conference
EditorsJames A. Storer, Ali Bilgin, Joan Serra-Sagrista, Michael W. Marcellin
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages538-547
Number of pages10
ISBN (Electronic)9781728106571
DOIs
Publication statusPublished - 10 May 2019
Externally publishedYes
Event2019 Data Compression Conference, DCC 2019 - Snowbird, United States
Duration: 26 Mar 201929 Mar 2019

Publication series

NameData Compression Conference Proceedings
Volume2019-March
ISSN (Print)1068-0314

Conference

Conference2019 Data Compression Conference, DCC 2019
CountryUnited States
CitySnowbird
Period26/03/1929/03/19

Keywords

  • Absent words
  • Antidictionaries
  • Data compression
  • Output sensitive algorithms
  • String algorithms

Fingerprint Dive into the research topics of 'Constructing Antidictionaries in Output-Sensitive Space'. Together they form a unique fingerprint.

  • Cite this

    Ayad, L. A. K., Badkobeh, G., Fici, G., Heliou, A., & Pissis, S. P. (2019). Constructing Antidictionaries in Output-Sensitive Space. In J. A. Storer, A. Bilgin, J. Serra-Sagrista, & M. W. Marcellin (Eds.), Proceedings - DCC 2019: 2019 Data Compression Conference (pp. 538-547). [8712742] (Data Compression Conference Proceedings; Vol. 2019-March). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/DCC.2019.00062