Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations.

A. May, S. Abeln, W. Crielaard, J. Heringa, B.W. Brandt

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

Motivation: 16S rDNA pyrosequencing is a powerful approach that requires extensive usage of computational methods for delineating microbial compositions. Previously, it was shown that outcomes of studies relying on this approach vastly depend on the choice of pre-processing and clustering algorithms used. However, obtaining insights into the effects and accuracy of these algorithms is challenging due to difficulties in generating samples of known composition with high enough diversity. Here, we use in silico microbial datasets to better understand how the experimental data are transformed into taxonomic clusters by computational methods.

Results: We were able to qualitatively replicate the raw experimental pyrosequencing data after rigorously adjusting existing simulation software. This allowed us to simulate datasets of real-life complexity, which we used to assess the influence and performance of two widely used pre-processing methods along with 11 clustering algorithms. We show that the choice, order and mode of the pre-processing methods have a larger impact on the accuracy of the clustering pipeline than the clustering methods themselves. Without pre-processing, the difference between the performances of clustering methods is large. Depending on the clustering algorithm, the most optimal analysis pipeline resulted in significant underestimations of the expected number of clusters (minimum: 3.4%; maximum: 13.6%), allowing us to make quantitative estimations of the bacterial complexity of real microbiome samples. © The Author 2014.
Original language: English
Pages (from-to): 1530-1538
Journal: Bioinformatics
Volume: 30
DOI: 10.1093/bioinformatics/btu085
Publication status: Published - 2014

Fingerprint

Taxonomies
Taxonomy
Ribosomal DNA
Cluster Analysis
Preprocessing
Clustering algorithms
Clustering Algorithm
Processing
Computational methods
Clustering Methods
Computational Methods
Simulation
Pipelines
Experimental Data
Number of Clusters
Simulation Software
Chemical analysis
Microbiota
Clustering
Computer Simulation

Cite this

@article{c98c35e30b8f457aa875153cfe3a0dae,
title = "Unraveling the outcome of {16S rDNA}-based taxonomy analysis through mock data and simulations",
abstract = "Motivation: 16S rDNA pyrosequencing is a powerful approach that requires extensive usage of computational methods for delineating microbial compositions. Previously, it was shown that outcomes of studies relying on this approach vastly depend on the choice of pre-processing and clustering algorithms used. However, obtaining insights into the effects and accuracy of these algorithms is challenging due to difficulties in generating samples of known composition with high enough diversity. Here, we use in silico microbial datasets to better understand how the experimental data are transformed into taxonomic clusters by computational methods. Results: We were able to qualitatively replicate the raw experimental pyrosequencing data after rigorously adjusting existing simulation software. This allowed us to simulate datasets of real-life complexity, which we used to assess the influence and performance of two widely used pre-processing methods along with 11 clustering algorithms. We show that the choice, order and mode of the pre-processing methods have a larger impact on the accuracy of the clustering pipeline than the clustering methods themselves. Without pre-processing, the difference between the performances of clustering methods is large. Depending on the clustering algorithm, the most optimal analysis pipeline resulted in significant underestimations of the expected number of clusters (minimum: 3.4{\%}; maximum: 13.6{\%}), allowing us to make quantitative estimations of the bacterial complexity of real microbiome samples. {\textcopyright} The Author 2014.",
author = "A. May and S. Abeln and W. Crielaard and J. Heringa and B.W. Brandt",
year = "2014",
doi = "10.1093/bioinformatics/btu085",
language = "English",
volume = "30",
pages = "1530--1538",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
}

Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations. / May, A.; Abeln, S.; Crielaard, W.; Heringa, J.; Brandt, B.W.

In: Bioinformatics, Vol. 30, 2014, p. 1530-1538.


TY  - JOUR
T1  - Unraveling the outcome of 16S rDNA-based taxonomy analysis through mock data and simulations.
AU  - May, A.
AU  - Abeln, S.
AU  - Crielaard, W.
AU  - Heringa, J.
AU  - Brandt, B.W.
PY  - 2014
Y1  - 2014
AB  - Motivation: 16S rDNA pyrosequencing is a powerful approach that requires extensive usage of computational methods for delineating microbial compositions. Previously, it was shown that outcomes of studies relying on this approach vastly depend on the choice of pre-processing and clustering algorithms used. However, obtaining insights into the effects and accuracy of these algorithms is challenging due to difficulties in generating samples of known composition with high enough diversity. Here, we use in silico microbial datasets to better understand how the experimental data are transformed into taxonomic clusters by computational methods. Results: We were able to qualitatively replicate the raw experimental pyrosequencing data after rigorously adjusting existing simulation software. This allowed us to simulate datasets of real-life complexity, which we used to assess the influence and performance of two widely used pre-processing methods along with 11 clustering algorithms. We show that the choice, order and mode of the pre-processing methods have a larger impact on the accuracy of the clustering pipeline than the clustering methods themselves. Without pre-processing, the difference between the performances of clustering methods is large. Depending on the clustering algorithm, the most optimal analysis pipeline resulted in significant underestimations of the expected number of clusters (minimum: 3.4%; maximum: 13.6%), allowing us to make quantitative estimations of the bacterial complexity of real microbiome samples. © The Author 2014.
U2  - 10.1093/bioinformatics/btu085
DO  - 10.1093/bioinformatics/btu085
M3  - Article
VL  - 30
SP  - 1530
EP  - 1538
JO  - Bioinformatics
JF  - Bioinformatics
SN  - 1367-4803
ER  -