Chronological age prediction based on DNA methylation: Massive parallel sequencing and random forest regression

Jana Naue*, Huub C.J. Hoefsloot, Olaf R.F. Mook, Laura Rijlaarsdam-Hoekstra, Marloes C.H. van der Zwalm, Peter Henneman, Ate D. Kloosterman, Pernette J. Verschure

*Corresponding author for this work

Research output: Contribution to Journal › Article › Academic › peer-review


The use of DNA methylation (DNAm) to obtain additional information in forensic investigations has proven to be a promising and growing field of interest. Prediction of chronological age based on age-dependent changes in the DNAm of specific CpG sites within the genome is one such potential application. Here we present an age-prediction tool for whole blood based on massive parallel sequencing (MPS) and a random forest machine learning algorithm. MPS allows accurate determination of DNAm at pre-selected markers and neighboring CpG sites, making it possible to identify the best age-predictive markers for the tool. Fifteen age-dependent markers at different loci were initially chosen based on publicly available 450K microarray data, and 13 were finally selected for the age tool based on MPS (DDO, ELOVL2, F5, GRM2, HOXC4, KLF14, LDB2, MEIS1-AS3, NKIRAS2, RPA2, SAMD10, TRIM59, ZYG11A). Whole blood samples of 208 individuals were used to train the algorithm, and samples of a further 104 individuals were used for model evaluation (age 18–69). For KLF14, LDB2, SAMD10, and GRM2, neighboring CpG sites rather than the initial 450K sites were chosen for the final model. Cross-validation on the training set yielded a mean absolute deviation (MAD) of 3.21 years and a root-mean-square error (RMSE) of 3.97 years. Evaluation of model performance using the test set showed a comparable result (MAD 3.16 years, RMSE 3.93 years). A reduced model based on only the top 4 markers (ELOVL2, F5, KLF14, and TRIM59) resulted in an RMSE of 4.19 years and a MAD of 3.24 years for the test set (cross-validation of the training set: RMSE 4.63 years, MAD 3.64 years). In cases of an aberrant DNAm result, the amplified region was additionally investigated for the occurrence of SNPs, which in some cases can be an indication for a deviation in DNAm.
Our approach uncovered well-known age-dependent DNAm markers, as well as additional new age-dependent sites for improvement of the model, and allowed the creation of a reliable and accurate epigenetic tool for age prediction without restriction to a linear change in DNAm with age.
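The two error measures reported above, MAD and RMSE, can be illustrated with a minimal sketch. It assumes the usual definitions (mean of the absolute differences, and square root of the mean squared differences, between predicted and chronological age); the abstract does not spell out its formulas. The function names and the four age values are hypothetical and are not data from the study.

```python
import math


def mean_absolute_deviation(true_ages, predicted_ages):
    """Mean of |predicted - true| over all samples, in years."""
    errors = [abs(p - t) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)


def root_mean_square_error(true_ages, predicted_ages):
    """Square root of the mean squared prediction error, in years."""
    squared = [(p - t) ** 2 for t, p in zip(true_ages, predicted_ages)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values for illustration only; not data from the study.
true_ages = [20, 35, 50, 65]
predicted_ages = [23, 33, 54, 64]

mad = mean_absolute_deviation(true_ages, predicted_ages)
rmse = root_mean_square_error(true_ages, predicted_ages)
print(f"MAD = {mad:.2f} years, RMSE = {rmse:.2f} years")
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAD on the same predictions, which is consistent with each MAD/RMSE pair reported in the abstract.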

Original language: English
Pages (from-to): 19-28
Number of pages: 10
Journal: Forensic Science International: Genetics
Publication status: Published - 1 Nov 2017
Externally published: Yes


  • Age prediction
  • DNA methylation
  • Machine learning
  • Massive parallel sequencing

