Comparing clustering and pre-processing in taxonomy analysis.

M.J. Bonder, S. Abeln, E. Zaura, B.W. Brandt

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

Motivation: Massively parallel sequencing allows for rapid sequencing of large numbers of sequences in just a single run. Thus, 16S ribosomal RNA (rRNA) amplicon sequencing of complex microbial communities has become possible. The sequenced 16S rRNA fragments (reads) are clustered into operational taxonomic units and taxonomic categories are assigned. Recent reports suggest that data pre-processing should be performed before clustering. We assessed combinations of data pre-processing steps and clustering algorithms on cluster accuracy for oral microbial sequence data.

Results: The number of clusters varied up to two orders of magnitude depending on pre-processing. Pre-processing using both denoising and chimera checking resulted in a number of clusters that was closest to the number of species in the mock dataset (25 versus 15). Based on run time, purity and normalized mutual information, we could not identify a single best clustering algorithm. The differences in clustering accuracy among the algorithms after the same pre-processing were minor compared with the differences in accuracy among different pre-processing steps. © 2012 The Author.
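The abstract names purity and normalized mutual information (NMI) as accuracy measures but does not define them. The following is a minimal sketch of how these two scores are commonly computed when each read has a cluster assignment and a known reference label (for example, the species of origin in a mock community). The function names, the toy data, and the choice of normalizing by the geometric mean of the two entropies are assumptions made for illustration; the abstract does not state which NMI variant the authors used.

```python
from collections import Counter
from math import log, sqrt


def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster."""
    per_cluster = {}
    for c, l in zip(clusters, labels):
        per_cluster.setdefault(c, Counter())[l] += 1
    majority = sum(max(counts.values()) for counts in per_cluster.values())
    return majority / len(labels)


def entropy(assignment):
    """Shannon entropy (natural log) of a partition given as a label list."""
    n = len(assignment)
    return -sum((k / n) * log(k / n) for k in Counter(assignment).values())


def nmi(clusters, labels):
    """Mutual information divided by sqrt(H(clusters) * H(labels)).

    Assumption: geometric-mean normalization; other variants divide by
    the arithmetic mean or the maximum of the two entropies.
    """
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    c_sizes = Counter(clusters)
    l_sizes = Counter(labels)
    mi = sum(
        (k / n) * log(k * n / (c_sizes[c] * l_sizes[l]))
        for (c, l), k in joint.items()
    )
    h_c, h_l = entropy(clusters), entropy(labels)
    if h_c == 0 or h_l == 0:
        return 0.0  # convention chosen here for a degenerate partition
    return mi / sqrt(h_c * h_l)


# Hypothetical toy data: six reads, two clusters, two reference species.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
print(purity(toy_clusters, toy_labels))
print(nmi(toy_clusters, toy_labels))
```

Purity rewards clusters dominated by a single reference label but does not penalize splitting one species over many clusters, whereas NMI also accounts for the number and size of clusters, which is one reason the two are often reported together.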
Original language: English
Pages (from-to): 2891-2897
Journal: Bioinformatics
Volume: 28
Issue number: 22
DOI: 10.1093/bioinformatics/bts552
Publication status: Published - 2012

Fingerprint

Taxonomies
Taxonomy
Cluster Analysis
Preprocessing
Clustering
Sequencing
Data Preprocessing
Number of Clusters
Processing
Clustering Algorithm
16S Ribosomal RNA
Clustering algorithms
High-Throughput Nucleotide Sequencing
Denoising
Mutual Information
Minor
Fragment
Unit

Cite this

Bonder, M.J. ; Abeln, S. ; Zaura, E. ; Brandt, B.W. / Comparing clustering and pre-processing in taxonomy analysis. In: Bioinformatics. 2012 ; Vol. 28, No. 22. pp. 2891-2897.
@article{8581d3d3c73041a895c285d6e628493b,
title = "Comparing clustering and pre-processing in taxonomy analysis.",
abstract = "Motivation: Massively parallel sequencing allows for rapid sequencing of large numbers of sequences in just a single run. Thus, 16S ribosomal RNA (rRNA) amplicon sequencing of complex microbial communities has become possible. The sequenced 16S rRNA fragments (reads) are clustered into operational taxonomic units and taxonomic categories are assigned. Recent reports suggest that data pre-processing should be performed before clustering. We assessed combinations of data pre-processing steps and clustering algorithms on cluster accuracy for oral microbial sequence data. Results: The number of clusters varied up to two orders of magnitude depending on pre-processing. Pre-processing using both denoising and chimera checking resulted in a number of clusters that was closest to the number of species in the mock dataset (25 versus 15). Based on run time, purity and normalized mutual information, we could not identify a single best clustering algorithm. The differences in clustering accuracy among the algorithms after the same pre-processing were minor compared with the differences in accuracy among different pre-processing steps. {\copyright} 2012 The Author.",
author = "M.J. Bonder and S. Abeln and E. Zaura and B.W. Brandt",
year = "2012",
doi = "10.1093/bioinformatics/bts552",
language = "English",
volume = "28",
pages = "2891--2897",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "22",

}

TY  - JOUR
T1  - Comparing clustering and pre-processing in taxonomy analysis.
AU  - Bonder, M.J.
AU  - Abeln, S.
AU  - Zaura, E.
AU  - Brandt, B.W.
PY  - 2012
Y1  - 2012
AB  - Motivation: Massively parallel sequencing allows for rapid sequencing of large numbers of sequences in just a single run. Thus, 16S ribosomal RNA (rRNA) amplicon sequencing of complex microbial communities has become possible. The sequenced 16S rRNA fragments (reads) are clustered into operational taxonomic units and taxonomic categories are assigned. Recent reports suggest that data pre-processing should be performed before clustering. We assessed combinations of data pre-processing steps and clustering algorithms on cluster accuracy for oral microbial sequence data. Results: The number of clusters varied up to two orders of magnitude depending on pre-processing. Pre-processing using both denoising and chimera checking resulted in a number of clusters that was closest to the number of species in the mock dataset (25 versus 15). Based on run time, purity and normalized mutual information, we could not identify a single best clustering algorithm. The differences in clustering accuracy among the algorithms after the same pre-processing were minor compared with the differences in accuracy among different pre-processing steps. © 2012 The Author.
U2  - 10.1093/bioinformatics/bts552
DO  - 10.1093/bioinformatics/bts552
M3  - Article
VL  - 28
SP  - 2891
EP  - 2897
JO  - Bioinformatics
JF  - Bioinformatics
SN  - 1367-4803
IS  - 22
ER  -