Tree-based methods for omics-informed cancer prediction

  • Jeroen Michiel Goedhart

    Research output: PhD Thesis - Research and graduation internal

    Abstract

    Cancer is a complex disease driven by numerous (bio)molecular mechanisms and clinical risk factors. Prediction models aim to capture such cancer drivers to accurately predict patient-specific cancer outcomes. Omics technologies, which analyze the human genome, enable the integration of molecular data into prediction models. However, these data are often high-dimensional, as they usually comprise over 10,000 variables (e.g., gene expression measurements) for only ~100 patients. This high dimensionality makes it difficult to find the relevant variables in a large pool of candidates. This thesis introduces three methods that address specific challenges in high-dimensional omics-based cancer prediction.

    Several prediction models can predict cancer outcomes from high-dimensional data. The canonical learner for these applications is penalized regression, which assumes a linear relationship between the variables and the outcome. Tree-based methods offer a promising alternative class of learners. This thesis focuses on advancing tree-based learning approaches to improve the accuracy of cancer outcome prediction.

    A key aspect of comparing learners is assessing their predictive performance, particularly in the small-sample settings that are common in omics research. Chapter 2 presents a novel framework to estimate this performance when no independent test dataset is available. In such a scenario, one usually employs resampling techniques such as cross-validation and bootstrapping. A problem with such techniques is deciding how to allocate samples between model fitting and performance estimation. The proposed framework addresses this by fitting a monotonic curve that models predictive performance as a function of sample size. This approach enables estimation of a more informative lower confidence bound for the performance and facilitates a comprehensive comparison between learners.
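
The monotonic-curve idea can be illustrated with a small sketch. The abstract does not specify the curve family or the fitting procedure, so the Python example below swaps in isotonic regression (pool-adjacent-violators) as one generic way to obtain a non-decreasing fit. The function name `fit_monotone_curve`, the sample sizes, and the performance values are hypothetical toy inputs, not results from the thesis.

```python
def fit_monotone_curve(y, weights=None):
    """Least-squares non-decreasing fit to y via pool-adjacent-violators.

    Assumption: isotonic regression stands in for the (unspecified)
    monotonic curve model described in the abstract.
    """
    if weights is None:
        weights = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, weights):
        blocks.append([float(yi), float(wi), 1])
        # Merge with the previous block while the order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand the blocks back to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical toy input: training-set sizes and noisy performance
# estimates (e.g., AUC from repeated subsampling at each size).
sizes = [20, 40, 60, 80, 100]
perf = [0.60, 0.66, 0.64, 0.71, 0.70]

curve = fit_monotone_curve(perf)
for n, raw, fit in zip(sizes, perf, curve):
    print(f"n={n:3d}  raw={raw:.3f}  monotone fit={fit:.3f}")
```

In this toy input the dips at n = 60 and n = 100 are pooled with their left neighbours, so the fitted curve never decreases with sample size.
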
    Additionally, the curve provides insight into how different models behave as the sample size varies, thus guiding model selection and data collection strategies in future studies.

    Chapter 3 presents a method to incorporate external information on the covariates into Bayesian additive regression trees (BART), a commonly used tree-based method. The Bayesian specification of BART allows for flexible incorporation of external information into the model fitting process. To do so, an empirical Bayes estimator is developed for the prior variable weights of BART. This estimator gauges how informative the external information is for the application at hand. If deemed informative, the estimator helps prioritize predictive variables from the large candidate pool, thereby improving both predictive performance and variable selection.

    Chapter 4 presents an easy-to-use software package for the methodology developed in Chapter 3. It also shows how the empirical Bayes estimator may be used for hyperparameter tuning of BART; tuning by empirical Bayes is a computationally efficient alternative to the widely used cross-validation. Finally, it illustrates how modern variable selection techniques for BART may be integrated with the prior variable weight estimator developed in Chapter 3.

    Chapter 5 presents a novel cancer prediction model designed to handle data with both low-dimensional clinical risk factors and high-dimensional omics features. The set of clinical risk factors usually contains a disease progression measure such as tumor stage. Omics profiles are expected to differ across stage categories, which potentially favors a prediction model with stage-omics interactions. To incorporate interactions, we combine a regression tree built using the clinical risk factors with omics-based regressions fitted within the leaf nodes of the tree.
    To fit the model, we present a novel penalized likelihood estimator that adapts to (1) the difference in predictive signal between clinical and omics variables and (2) the interaction strength between stage and omics variables through a fusion penalty. This fusion penalty enables information exchange across leaf nodes. Importantly, the proposed model also facilitates straightforward evaluation of the added predictive effect of omics variables in different patient subgroups defined by the fitted tree.
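
To make the role of a fusion penalty concrete, here is a minimal sketch. The abstract does not give the penalty's exact form, so this assumes a ridge-type penalty on each leaf's omics coefficients plus squared differences between every pair of leaves. The names `fused_penalty`, `lam`, and `alpha` are hypothetical and chosen only for illustration.

```python
def fused_penalty(leaf_betas, lam, alpha):
    """Penalty on leaf-specific omics coefficients (assumed form).

    leaf_betas: one coefficient list per leaf node of the clinical tree.
    lam:   weight of the ridge term that shrinks all omics coefficients.
    alpha: weight of the fusion term that shrinks differences between
           leaves, letting leaves share information.
    """
    # Ridge term: sum of squared coefficients over all leaves.
    ridge = sum(b * b for beta in leaf_betas for b in beta)
    # Fusion term: squared distance between every pair of leaves.
    fusion = 0.0
    for i in range(len(leaf_betas)):
        for j in range(i + 1, len(leaf_betas)):
            fusion += sum(
                (a - b) ** 2 for a, b in zip(leaf_betas[i], leaf_betas[j])
            )
    return lam * ridge + alpha * fusion


# Hypothetical two-leaf example (e.g., early vs. late tumor stage).
same_effects = [[1.0, -2.0], [1.0, -2.0]]   # no stage-omics interaction
diff_effects = [[1.0, -2.0], [0.0, 0.0]]    # omics effect in one leaf only

penalty_same = fused_penalty(same_effects, lam=0.5, alpha=2.0)
penalty_diff = fused_penalty(diff_effects, lam=0.5, alpha=2.0)
print(penalty_same, penalty_diff)
```

Under this assumed form, a large `alpha` pulls the leaf-specific coefficient vectors toward each other, which corresponds to weak stage-omics interaction, while `alpha = 0` penalizes each leaf independently.
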
    Original language: English
    Qualification: PhD
    Awarding Institution
    • Vrije Universiteit Amsterdam
    Supervisors/Advisors
    • van de Wiel, Mark, Supervisor
    • Klausch, T., Co-supervisor
    Award date: 13 Mar 2026
    Publication status: Published - 13 Mar 2026

    Keywords

    • Statistics
    • Machine learning
    • High-dimensional data
    • Omics data
    • Tree-based methods
    • Bayesian methods
