Skip to main navigation Skip to search Skip to main content

Frequency-Constrained Substring Complexity

  • Solon P. Pissis*
  • , Michael Shekelyan
  • , Chang Liu
  • , Grigorios Loukides
  • *Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

10 Downloads (Pure)

Abstract

We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] in $$\tau $$ intervals, $$\mathcal {I}=I_1,\ldots,I_\tau $$, the frequency-constrained substring complexity of x is the function $$f:{x,\mathcal {I}}(i,j)$$ that maps i, j to the number of distinct substrings of length i of x occurring at least $$\alpha _j$$ and at most $$\beta _j$$ times in x, where $$I:j=[\alpha _j,\beta _j]$$. We extend this notion as follows. For a string x, a dictionary $$\mathcal {D}$$ of d strings (documents), and a partition of [d] in $$\tau $$ intervals $$I:1,\ldots,I_\tau $$, we define a 2D array $$S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]$$ as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least $$\alpha _j$$ and at most $$\beta _j$$ documents, where $$I:j=[\alpha _j,\beta _j]$$. Array S can thus be seen as the distribution of the substring complexity of x into $$\tau $$ document frequency classes. We show that after a linear-time preprocessing of $$\mathcal {D}$$, for any x and any partition of [d] in $$\tau $$ intervals given online, array S can be computed in near-optimal $$\mathcal {O}(|x| \tau \log \log d)$$ time.

Original languageEnglish
Title of host publicationString Processing and Information Retrieval
Subtitle of host publication30th International Symposium, SPIRE 2023, Pisa, Italy, September 26–28, 2023, Proceedings
EditorsFranco Maria Nardini, Nadia Pisanti, Rossano Venturini
PublisherSpringer Science and Business Media Deutschland GmbH
Pages345-352
Number of pages8
ISBN (Electronic)9783031439797
ISBN (Print)9783031439797
DOIs
Publication statusPublished - 2023
Event30th International Symposium on String Processing and Information Retrieval, SPIRE 2023 - Pisa, Italy
Duration: 26 Sept 202328 Sept 2023

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume14240 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349

Conference

Conference30th International Symposium on String Processing and Information Retrieval, SPIRE 2023
Country/TerritoryItaly
CityPisa
Period26/09/2328/09/23

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.

Keywords

  • Predecessor search
  • Substring complexity
  • Suffix tree

Fingerprint

Dive into the research topics of 'Frequency-Constrained Substring Complexity'. Together they form a unique fingerprint.

Cite this