Predictive modeling of colorectal cancer using a dedicated pre-processing pipeline on routine electronic medical records

Reinier Kop, Mark Hoogendoorn, Annette ten Teije, Frederike L. Büchner, Pauline Slottje, Leon M G Moons, Mattijs E. Numans

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

Over the past years, research utilizing routine care data extracted from Electronic Medical Records (EMRs) has increased tremendously. Yet there are no straightforward, standardized strategies for pre-processing these data. We propose a dedicated medical pre-processing pipeline that addresses many of the problems and opportunities inherent in EMR data, such as their temporal, inaccurate and incomplete nature. The pipeline is demonstrated on a dataset of routinely recorded data in general practice EMRs of over 260,000 patients, in which the occurrence of colorectal cancer (CRC) is predicted using various machine learning techniques (i.e., CART, LR, RF) and subsets of the data. CRC is a common type of cancer for which early detection has proven important yet challenging. The results are threefold. First, the predictive models generated using our pipeline reconfirmed known predictors and identified new, medically plausible predictors derived from the cardiovascular and metabolic disease domain, validating the pipeline's effectiveness. Second, the difference between the best model generated by the data-driven subset (AUC 0.891) and the best model generated by the current state-of-the-art hypothesis-driven subset (AUC 0.864) is statistically significant at the 95% confidence level. Third, the pipeline itself is highly generic and independent of the specific disease targeted and the EMR used. In conclusion, the application of established machine learning techniques in combination with the proposed pipeline on EMRs has great potential to enhance disease prediction, and hence early detection and intervention in medical practice.

Original language: English
Pages (from-to): 30-38
Number of pages: 9
Journal: Computers in Biology and Medicine
Volume: 76
DOI: 10.1016/j.compbiomed.2016.06.019
Publication status: Published - 1 Sep 2016

Fingerprint

Electronic medical equipment
Electronic Health Records
Colorectal Neoplasms
Pipelines
Processing
Area Under Curve
Learning systems
Metabolic Diseases
Early Detection of Cancer
General Practice
Cardiovascular Diseases
Confidence Intervals
Research

Keywords

  • Colorectal cancer
  • Data mining
  • Data processing
  • Electronic medical records
  • Machine learning

Cite this

Kop, Reinier ; Hoogendoorn, Mark ; ten Teije, Annette ; Büchner, Frederike L. ; Slottje, Pauline ; Moons, Leon M G ; Numans, Mattijs E. / Predictive modeling of colorectal cancer using a dedicated pre-processing pipeline on routine electronic medical records. In: Computers in Biology and Medicine. 2016 ; Vol. 76. pp. 30-38.
@article{8fcbb631b38f4a52999735c46e47434a,
title = "Predictive modeling of colorectal cancer using a dedicated pre-processing pipeline on routine electronic medical records",
abstract = "Over the past years, research utilizing routine care data extracted from Electronic Medical Records (EMRs) has increased tremendously. Yet there are no straightforward, standardized strategies for pre-processing these data. We propose a dedicated medical pre-processing pipeline aimed at taking on many problems and opportunities contained within EMR data, such as their temporal, inaccurate and incomplete nature. The pipeline is demonstrated on a dataset of routinely recorded data in general practice EMRs of over 260,000 patients, in which the occurrence of colorectal cancer (CRC) is predicted using various machine learning techniques (i.e., CART, LR, RF) and subsets of the data. CRC is a common type of cancer, of which early detection has proven to be important yet challenging. The results are threefold. First, the predictive models generated using our pipeline reconfirmed known predictors and identified new, medically plausible, predictors derived from the cardiovascular and metabolic disease domain, validating the pipeline's effectiveness. Second, the difference between the best model generated by the data-driven subset (AUC 0.891) and the best model generated by the current state of the art hypothesis-driven subset (AUC 0.864) is statistically significant at the 95{\%} confidence interval level. Third, the pipeline itself is highly generic and independent of the specific disease targeted and the EMR used. In conclusion, the application of established machine learning techniques in combination with the proposed pipeline on EMRs has great potential to enhance disease prediction, and hence early detection and intervention in medical practice.",
keywords = "Colorectal cancer, Data mining, Data processing, Electronic medical records, Machine learning",
author = "Reinier Kop and Mark Hoogendoorn and {ten Teije}, Annette and B{\"u}chner, {Frederike L.} and Pauline Slottje and Moons, {Leon M G} and Numans, {Mattijs E.}",
year = "2016",
month = "9",
day = "1",
doi = "10.1016/j.compbiomed.2016.06.019",
language = "English",
volume = "76",
pages = "30--38",
journal = "Computers in Biology and Medicine",
issn = "0010-4825",
publisher = "Elsevier Limited",
}


TY - JOUR

T1 - Predictive modeling of colorectal cancer using a dedicated pre-processing pipeline on routine electronic medical records

AU - Kop, Reinier

AU - Hoogendoorn, Mark

AU - ten Teije, Annette

AU - Büchner, Frederike L.

AU - Slottje, Pauline

AU - Moons, Leon M G

AU - Numans, Mattijs E.

PY - 2016/9/1

Y1 - 2016/9/1

N2 - Over the past years, research utilizing routine care data extracted from Electronic Medical Records (EMRs) has increased tremendously. Yet there are no straightforward, standardized strategies for pre-processing these data. We propose a dedicated medical pre-processing pipeline aimed at taking on many problems and opportunities contained within EMR data, such as their temporal, inaccurate and incomplete nature. The pipeline is demonstrated on a dataset of routinely recorded data in general practice EMRs of over 260,000 patients, in which the occurrence of colorectal cancer (CRC) is predicted using various machine learning techniques (i.e., CART, LR, RF) and subsets of the data. CRC is a common type of cancer, of which early detection has proven to be important yet challenging. The results are threefold. First, the predictive models generated using our pipeline reconfirmed known predictors and identified new, medically plausible, predictors derived from the cardiovascular and metabolic disease domain, validating the pipeline's effectiveness. Second, the difference between the best model generated by the data-driven subset (AUC 0.891) and the best model generated by the current state of the art hypothesis-driven subset (AUC 0.864) is statistically significant at the 95% confidence interval level. Third, the pipeline itself is highly generic and independent of the specific disease targeted and the EMR used. In conclusion, the application of established machine learning techniques in combination with the proposed pipeline on EMRs has great potential to enhance disease prediction, and hence early detection and intervention in medical practice.

KW - Colorectal cancer

KW - Data mining

KW - Data processing

KW - Electronic medical records

KW - Machine learning

UR - http://www.scopus.com/inward/record.url?scp=84977074295&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84977074295&partnerID=8YFLogxK

U2 - 10.1016/j.compbiomed.2016.06.019

DO - 10.1016/j.compbiomed.2016.06.019

M3 - Article

VL - 76

SP - 30

EP - 38

JO - Computers in Biology and Medicine

JF - Computers in Biology and Medicine

SN - 0010-4825

ER -