Abstract
We introduce the notion of frequency-constrained substring complexity. For any finite string, it counts the distinct substrings of the string per length and frequency class. For a string x of length n and a partition of [n] in $$\tau $$ intervals, $$\mathcal {I}=I_1,\ldots,I_\tau $$, the frequency-constrained substring complexity of x is the function $$f:{x,\mathcal {I}}(i,j)$$ that maps i, j to the number of distinct substrings of length i of x occurring at least $$\alpha _j$$ and at most $$\beta _j$$ times in x, where $$I:j=[\alpha _j,\beta _j]$$. We extend this notion as follows. For a string x, a dictionary $$\mathcal {D}$$ of d strings (documents), and a partition of [d] in $$\tau $$ intervals $$I:1,\ldots,I_\tau $$, we define a 2D array $$S=S[1\mathinner {.\,.}|x|,1\mathinner {.\,.}\tau ]$$ as follows: S[i, j] is the number of distinct substrings of length i of x occurring in at least $$\alpha _j$$ and at most $$\beta _j$$ documents, where $$I:j=[\alpha _j,\beta _j]$$. Array S can thus be seen as the distribution of the substring complexity of x into $$\tau $$ document frequency classes. We show that after a linear-time preprocessing of $$\mathcal {D}$$, for any x and any partition of [d] in $$\tau $$ intervals given online, array S can be computed in near-optimal $$\mathcal {O}(|x| \tau \log \log d)$$ time.
| Original language | English |
|---|---|
| Title of host publication | String Processing and Information Retrieval |
| Subtitle of host publication | 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26–28, 2023, Proceedings |
| Editors | Franco Maria Nardini, Nadia Pisanti, Rossano Venturini |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 345-352 |
| Number of pages | 8 |
| ISBN (Electronic) | 9783031439797 |
| ISBN (Print) | 9783031439797 |
| DOIs | |
| Publication status | Published - 2023 |
| Event | 30th International Symposium on String Processing and Information Retrieval, SPIRE 2023 - Pisa, Italy Duration: 26 Sept 2023 → 28 Sept 2023 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 14240 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 30th International Symposium on String Processing and Information Retrieval, SPIRE 2023 |
|---|---|
| Country/Territory | Italy |
| City | Pisa |
| Period | 26/09/23 → 28/09/23 |
Bibliographical note
Publisher Copyright:© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023.
Keywords
- Predecessor search
- Substring complexity
- Suffix tree
Fingerprint
Dive into the research topics of 'Frequency-Constrained Substring Complexity'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver