TY - GEN
T1 - Constructing Antidictionaries in Output-Sensitive Space
AU - Ayad, Lorraine A.K.
AU - Badkobeh, Golnaz
AU - Fici, Gabriele
AU - Heliou, Alice
AU - Pissis, Solon P.
PY - 2019/5/10
Y1 - 2019/5/10
N2 - A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y-1, y-2,...,y-k over an alphabet Σ, we are asked to compute the set M^ℓ-y-1#...#y-k of minimal absent words of length at most ℓ of word y=y-1#y-2#...#y-k, NotElementΣ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ-y-1#...#y-N || =o(n), for all N [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12-y-1#...#y-k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ-y-1,...,M^ℓ-y-1#...#y-k can be computed in O(kn+k-N=1||M^ℓ-y-1#...#y-N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y-1,...,y-k and MaxOut=max{||M^ℓ-y-1#...#y-N||:N [1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
AB - A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y-1, y-2,...,y-k over an alphabet Σ, we are asked to compute the set M^ℓ-y-1#...#y-k of minimal absent words of length at most ℓ of word y=y-1#y-2#...#y-k, NotElementΣ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ-y-1#...#y-N || =o(n), for all N [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12-y-1#...#y-k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ-y-1,...,M^ℓ-y-1#...#y-k can be computed in O(kn+k-N=1||M^ℓ-y-1#...#y-N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y-1,...,y-k and MaxOut=max{||M^ℓ-y-1#...#y-N||:N [1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
KW - Absent words
KW - Antidictionaries
KW - Data compression
KW - Output sensitive algorithms
KW - String algorithms
UR - http://www.scopus.com/inward/record.url?scp=85066302019&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066302019&partnerID=8YFLogxK
U2 - 10.1109/DCC.2019.00062
DO - 10.1109/DCC.2019.00062
M3 - Conference contribution
AN - SCOPUS:85066302019
T3 - Data Compression Conference Proceedings
SP - 538
EP - 547
BT - Proceedings - DCC 2019
A2 - Storer, James A.
A2 - Bilgin, Ali
A2 - Serra-Sagrista, Joan
A2 - Marcellin, Michael W.
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 Data Compression Conference, DCC 2019
Y2 - 26 March 2019 through 29 March 2019
ER -