Variable strings for pangenomes: Matching, Comparison, Indexing

Estéban Arthur Denis Gabory

Research output: PhD ThesisPhD-Thesis - Research and graduation internal

34 Downloads (Pure)

Abstract

This dissertation investigates the computational foundations of sequence analysis in the context of pangenomics, a rapidly evolving field in computational biology. Classical string algorithms, originally developed for linear DNA sequences, encounter new challenges when extended to nonlinear, graph-like pangenome representations. To address this, we study variable strings—generalized models for representing sets of similar sequences compactly, including elastic-degenerate strings (ED strings), founder graphs, and weighted sequences. The thesis makes three core contributions. First, it explores exact and approximate pattern matching algorithms for variable strings, establishing tight upper and lower bounds. Second, it introduces novel methods for comparing pangenomic data structures, including algorithms for intersection detection, matching statistics, and distance-based comparisons. Third, it proposes space-efficient indexing strategies for weighted sequences, enabling probabilistic pattern queries under uncertainty.
Original languageEnglish
QualificationPhD
Awarding Institution
  • Vrije Universiteit Amsterdam
Supervisors/Advisors
  • Pissis, Solon, Supervisor
  • Stougie, Leen, Supervisor
Award date21 May 2025
DOIs
Publication statusPublished - 21 May 2025

Keywords

  • Pangenomics
  • Pattern matching
  • String comparison
  • Variable strings
  • Elastic-degenerate strings
  • Founder graphs
  • Edit distance
  • Indexing
  • Computational biology
  • Sequence analysis

Fingerprint

Dive into the research topics of 'Variable strings for pangenomes: Matching, Comparison, Indexing'. Together they form a unique fingerprint.

Cite this