Using phylogeny to improve genome-wide distant homology recognition

Sanne Abeln, Carlo Teubner, Charlotte M Deane

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

The gap between the number of known protein sequences and structures continues to widen, particularly as a result of sequencing projects for entire genomes. Recently there have been many attempts to generate structural assignments for all genes in sets of completed genomes using fold-recognition methods. We developed a method that detects false positives made by these genome-wide structural assignment experiments by identifying isolated occurrences. The method was tested using two sets of assignments, generated by SUPERFAMILY and PSI-BLAST, on 150 completed genomes. A phylogeny of these genomes was built and a parsimony algorithm was used to identify isolated occurrences by detecting occurrences that cause a gain at leaf level. Isolated occurrences tend to have high e-values, and in both sets of assignments a sudden increase in isolated occurrences is observed: for e-values > 10^-8 for SUPERFAMILY and > 10^-4 for PSI-BLAST. Conditions to predict false positives are based on these results. Independent tests confirm that the predicted false positives are indeed more likely to be incorrectly assigned. Evaluation of the predicted false positives also showed that the accuracy of profile-based fold-recognition methods might depend on secondary structure content and sequence length. We show that false positives generated by fold-recognition methods can be identified by considering structural occurrence patterns on completed genomes; occurrences that are isolated within the phylogeny tend to be less reliable. The method provides a new independent way to examine the quality of fold assignments and may be used to improve the output of any genome-wide fold assignment method.
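The idea of an isolated occurrence can be illustrated with a minimal sketch. It assumes a strictly binary genome tree written as nested tuples, Fitch-style parsimony over presence/absence states, and a preference for absence wherever the reconstruction is ambiguous. The abstract does not specify the parsimony variant or the exact false-positive conditions, so the function names (`fitch_sets`, `isolated_leaves`, `flag_suspect`), the toy tree, and the rule "isolated and e-value above the threshold" are all illustrative assumptions rather than the authors' implementation.

```python
def fitch_sets(node, present):
    """Bottom-up Fitch pass. Returns (state_set, leaf_name_or_None, children)."""
    if isinstance(node, str):  # leaf = genome name
        return ({1} if node in present else {0}, node, [])
    assert len(node) == 2, "this sketch assumes a strictly binary tree"
    kids = [fitch_sets(child, present) for child in node]
    common = kids[0][0] & kids[1][0]
    states = common if common else kids[0][0] | kids[1][0]
    return (states, None, kids)


def isolated_leaves(tree, present):
    """Leaves where the fold is present but the reconstructed parent lacks it,
    i.e. occurrences that cost a gain on the branch leading to the leaf."""
    isolated = []

    def assign(annot, parent_state):
        states, _, kids = annot
        # Keep the parent's state when allowed; otherwise the set is a singleton.
        state = parent_state if parent_state in states else next(iter(states))
        for kid in kids:
            kid_states, kid_name, _ = kid
            if kid_name is not None and kid_states == {1} and state == 0:
                isolated.append(kid_name)
            assign(kid, state)

    # Passing 0 at the root prefers "absent" when the root is ambiguous (assumption).
    assign(fitch_sets(tree, present), 0)
    return isolated


def flag_suspect(isolated, evalues, threshold):
    """Hypothetical rule: an isolated occurrence is suspect if its e-value
    exceeds the threshold (e.g. 1e-8 for SUPERFAMILY, 1e-4 for PSI-BLAST)."""
    return sorted(g for g in isolated if evalues.get(g, 0.0) > threshold)


# Toy data: eight hypothetical genomes A..H; the fold is assigned in A, B and G.
tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
present = {"A", "B", "G"}
evalues = {"A": 1e-30, "B": 1e-25, "G": 1e-5}

isolated = isolated_leaves(tree, present)
suspects = flag_suspect(isolated, evalues, threshold=1e-8)
print(isolated, suspects)
```

In this toy example the occurrences in A and B support each other as sister genomes, so the gain is placed on their shared ancestor, whereas the occurrence in G has no phylogenetic neighbour with the same fold and is flagged. A real analysis would need a multifurcating tree, per-domain e-values, and the conditions given in the full paper.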

Original language: English
Pages (from-to): 0073-0083
Number of pages: 11
Journal: PLoS Computational Biology
Volume: 3
Issue number: 1
DOI: 10.1371/journal.pcbi.0030003
Publication status: Published - 19 Jan 2007

Fingerprint

Phylogeny
Homology
Genome
Genes
False Positive
Assignment
Fold
Methodology
Parsimony
Secondary Structure
Protein Structure
Protein Sequence

Keywords

  • Algorithms
  • Base Sequence
  • Chromosome Mapping
  • Evolution, Molecular
  • Genome
  • Linkage Disequilibrium
  • Molecular Sequence Data
  • Phylogeny
  • Sequence Alignment
  • Sequence Analysis, DNA
  • Sequence Homology, Nucleic Acid
  • Journal Article
  • Research Support, Non-U.S. Gov't

Cite this

Abeln, Sanne ; Teubner, Carlo ; Deane, Charlotte M. / Using phylogeny to improve genome-wide distant homology recognition. In: PLoS Computational Biology. 2007 ; Vol. 3, No. 1. pp. 0073-0083.
@article{923a3d6f8ff84689890ff34e4b764936,
title = "Using phylogeny to improve genome-wide distant homology recognition",
abstract = "The gap between the number of known protein sequences and structures continues to widen, particularly as a result of sequencing projects for entire genomes. Recently there have been many attempts to generate structural assignments for all genes in sets of completed genomes using fold-recognition methods. We developed a method that detects false positives made by these genome-wide structural assignment experiments by identifying isolated occurrences. The method was tested using two sets of assignments, generated by SUPERFAMILY and PSI-BLAST, on 150 completed genomes. A phylogeny of these genomes was built and a parsimony algorithm was used to identify isolated occurrences by detecting occurrences that cause a gain at leaf level. Isolated occurrences tend to have high e-values, and in both sets of assignments a sudden increase in isolated occurrences is observed: for e-values > $10^{-8}$ for SUPERFAMILY and > $10^{-4}$ for PSI-BLAST. Conditions to predict false positives are based on these results. Independent tests confirm that the predicted false positives are indeed more likely to be incorrectly assigned. Evaluation of the predicted false positives also showed that the accuracy of profile-based fold-recognition methods might depend on secondary structure content and sequence length. We show that false positives generated by fold-recognition methods can be identified by considering structural occurrence patterns on completed genomes; occurrences that are isolated within the phylogeny tend to be less reliable. The method provides a new independent way to examine the quality of fold assignments and may be used to improve the output of any genome-wide fold assignment method.",
keywords = "Algorithms, Base Sequence, Chromosome Mapping, Evolution, Molecular, Genome, Linkage Disequilibrium, Molecular Sequence Data, Phylogeny, Sequence Alignment, Sequence Analysis, DNA, Sequence Homology, Nucleic Acid, Journal Article, Research Support, Non-U.S. Gov't",
author = "Sanne Abeln and Carlo Teubner and Deane, {Charlotte M}",
year = "2007",
month = "1",
day = "19",
doi = "10.1371/journal.pcbi.0030003",
language = "English",
volume = "3",
pages = "0073--0083",
journal = "PLoS Computational Biology",
issn = "1553-734X",
publisher = "Public Library of Science",
number = "1",

}
