TY - JOUR
T1 - Performance of variable selection methods for assessing the health effects of correlated exposures in case-control studies
AU - Lenters, Virissa
AU - Vermeulen, Roel
AU - Portengen, Lützen
PY - 2018
Y1 - 2018
N2 - Objectives There is growing recognition that simultaneously assessing multiple exposures may reduce false positive discoveries and improve epidemiological effect estimates. We evaluated the performance of statistical methods for identifying exposure-outcome associations across various data structures typical of environmental and occupational epidemiology analyses. Methods We simulated a case-control study, generating 100 data sets for each of 270 different simulation scenarios; varying the number of exposure variables, the correlation between exposures, sample size, the number of effective exposures and the magnitude of effect estimates. We compared conventional analytical approaches, that is, univariable (with and without multiplicity adjustment), multivariable and stepwise logistic regression, with variable selection methods: sparse partial least squares discriminant analysis, boosting, and frequentist and Bayesian penalised regression approaches. Results The variable selection methods consistently yielded more precise effect estimates and generally improved selection accuracy compared with conventional logistic regression methods, especially for scenarios with higher correlation levels. Penalised lasso and elastic net regression both seemed to perform particularly well, specifically when statistical inference based on a balanced weighting of high sensitivity and a low proportion of false discoveries is sought. Conclusions In this extensive simulation study with multicollinear data, we found that most variable selection methods consistently outperformed conventional approaches, and demonstrated how performance is influenced by the structure of the data and underlying model.
AB - Objectives There is growing recognition that simultaneously assessing multiple exposures may reduce false positive discoveries and improve epidemiological effect estimates. We evaluated the performance of statistical methods for identifying exposure-outcome associations across various data structures typical of environmental and occupational epidemiology analyses. Methods We simulated a case-control study, generating 100 data sets for each of 270 different simulation scenarios; varying the number of exposure variables, the correlation between exposures, sample size, the number of effective exposures and the magnitude of effect estimates. We compared conventional analytical approaches, that is, univariable (with and without multiplicity adjustment), multivariable and stepwise logistic regression, with variable selection methods: sparse partial least squares discriminant analysis, boosting, and frequentist and Bayesian penalised regression approaches. Results The variable selection methods consistently yielded more precise effect estimates and generally improved selection accuracy compared with conventional logistic regression methods, especially for scenarios with higher correlation levels. Penalised lasso and elastic net regression both seemed to perform particularly well, specifically when statistical inference based on a balanced weighting of high sensitivity and a low proportion of false discoveries is sought. Conclusions In this extensive simulation study with multicollinear data, we found that most variable selection methods consistently outperformed conventional approaches, and demonstrated how performance is influenced by the structure of the data and underlying model.
UR - http://www.scopus.com/inward/record.url?scp=85058508545&partnerID=8YFLogxK
U2 - 10.1136/oemed-2016-104231
DO - 10.1136/oemed-2016-104231
M3 - Article
SN - 1351-0711
VL - 75
SP - 522
EP - 529
JO - Occupational and Environmental Medicine
JF - Occupational and Environmental Medicine
IS - 7
ER -