Abstract
Analysis of single-cell transcriptomics often relies on clustering cells and then performing differential gene expression (DGE) to identify genes that vary between these clusters. These discrete analyses successfully determine cell types and markers; however, continuous variation within and between cell types may not be detected. We propose three topologically motivated mathematical methods for unsupervised feature selection that consider discrete and continuous transcriptional patterns on an equal footing across multiple scales simultaneously. Eigenscores rank signals or genes based on their correspondence to low-frequency intrinsic patterning in the data using the spectral decomposition of the graph Laplacian. The multiscale Laplacian score (MLS) is an unsupervised method for locating relevant scales in data and selecting the genes that are coherently expressed at these respective scales. The persistent Rayleigh quotient (PRQ) takes data equipped with a filtration, allowing the separation of genes with different roles in a bifurcation process (e.g., pseudo-time). We demonstrate the utility of these techniques by applying them to published single-cell transcriptomics data sets. The methods validate previously identified genes and detect additional biologically meaningful genes with coherent expression patterns. By studying the interaction between gene signals and the geometry of the underlying space, the three methods give multidimensional rankings of the genes and visualisation of relationships between them.
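The idea shared by the three methods, scoring a gene by how smoothly its expression varies over a graph of cells, can be illustrated with a plain Rayleigh quotient of the graph Laplacian. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses the unnormalised Laplacian on a toy path graph, and the function names, gene names and data are hypothetical. The eigenscores, MLS and PRQ of the paper add spectral, multiscale and filtration structure that this sketch does not attempt to reproduce.

```python
def laplacian_quadratic_form(edges, f):
    """Return f^T L f for the unnormalised graph Laplacian L.

    For a weighted graph this equals the sum over edges (i, j, w)
    of w * (f[i] - f[j])**2.
    """
    return sum(w * (f[i] - f[j]) ** 2 for i, j, w in edges)


def rayleigh_quotient(edges, f):
    """Rayleigh quotient f^T L f / f^T f of the mean-centred signal f.

    Low values mean the signal changes little across edges,
    i.e. it follows low-frequency structure of the graph.
    """
    mean = sum(f) / len(f)
    centred = [x - mean for x in f]
    norm_sq = sum(x * x for x in centred)
    if norm_sq == 0:
        raise ValueError("constant signal has no variation to score")
    return laplacian_quadratic_form(edges, centred) / norm_sq


# Toy data (hypothetical): six cells joined in a path, unit edge weights.
edges = [(i, i + 1, 1.0) for i in range(5)]

# Two illustrative "genes" with identical values in a different order.
signals = {
    "smooth_gene": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "noisy_gene": [0.0, 5.0, 1.0, 4.0, 2.0, 3.0],
}

scores = {name: rayleigh_quotient(edges, f) for name, f in signals.items()}

# Rank genes from most to least coherent with the graph structure.
ranking = sorted(scores, key=scores.get)
```

Both toy signals have the same variance, so the difference in score comes only from how the values are arranged along the graph; the gene that varies gradually along the path ranks first.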
Original language | English |
---|---|
Article number | 1116 |
Pages (from-to) | 1-29 |
Number of pages | 29 |
Journal | Entropy |
Volume | 24 |
Issue number | 8 |
Early online date | 13 Aug 2022 |
DOIs | |
Publication status | Published - Aug 2022 |
Bibliographical note
Special Issue: Applications of Topological Data Analysis in the Life Sciences.
Funding Information:
The authors thank Mariano Beguerisse, Carla Groenewegen, Joe Kaplinsky, and Vidit Nanda for helpful discussions. We thank Renaud Lambiotte and Michael Schaub for reading an earlier version of this manuscript. HAH gratefully acknowledges funding from EPSRC EP/R018472/1, EP/R005125/1 and EP/T001968/1, the Royal Society RGF\EA\201074 and UF150238. RSH, HAH and HMB acknowledge funding from the Emerson Collective. This research was funded in part by EPSRC EP/R018472/1. LM, OS, TMC, XL and HMB are funded by the Ludwig Institute for Cancer Research Ltd. TMC gratefully acknowledges scholarship support from the Rhodes Trust. For the purpose of Open Access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript (AAM) version arising from this submission.
Publisher Copyright:
© 2022 by the authors.
Keywords
- feature selection
- multiscale data analysis
- persistent Laplacian
- single cell transcriptomics
- topological signal processing