Seeing the trees through the forest: sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest

Qingzhen Hou, Paul F.G. De Geest, Wim F. Vranken, Jaap Heringa, K. Anton Feenstra

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

Motivation: Genome sequencing is producing an ever-increasing amount of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains the prediction of the amino acids in a given protein sequence that are involved in protein-protein interactions. Such predictions are typically based on machine learning methods that take advantage of the properties and sequence positions of amino acids that are known to be involved in interaction. In this paper, we evaluate the importance of various features using Random Forest (RF), and include as a novel feature backbone flexibility predicted from sequences to further optimise protein interface prediction.

Results: We observe that there is no single sequence feature that enables pinpointing interacting sites in our Random Forest models. However, combining different properties does increase the performance of interface prediction. Our homomeric-trained RF interface predictor is able to distinguish interface from non-interface residues with an area under the ROC curve of 0.72 in a homomeric test-set. The heteromeric-trained RF interface predictor performs better than existing predictors on an independent heteromeric test-set. We trained a more general predictor on the combined homomeric and heteromeric dataset, and show that in addition to predicting homomeric interfaces, it is also able to pinpoint interface residues in heterodimers. This suggests that our random forest model and the features included capture common properties of both homodimer and heterodimer interfaces.
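
As an illustration of the metric quoted above, the area under the ROC curve can be read as the probability that a randomly chosen interface residue receives a higher score than a randomly chosen non-interface residue. The following is a minimal sketch of that pairwise definition, not the authors' evaluation code; the function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the paper's datasets.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    labels: 1 for an interface residue, 0 for a non-interface residue.
    scores: predicted interface propensity per residue (higher = more likely).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one residue of each class")
    # Count positive/negative pairs in which the interface residue is
    # scored higher; a tie counts as half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for a ten-residue toy sequence.
toy_labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.5, 0.3, 0.8, 0.35]
toy_auc = roc_auc(toy_labels, toy_scores)
```

On this toy input, 22 of the 24 interface/non-interface pairs are ranked correctly, so `toy_auc` is 22/24. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported 0.72 in context.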

Availability and Implementation: The predictors and test datasets used in our analyses are freely available (http://www.ibi.vu.nl/downloads/RF_PPI/).

Contact: k.a.feenstra@vu.nl.

Supplementary information: Supplementary data are available at Bioinformatics online.

Original language: English
Pages (from-to): 1479-1487
Number of pages: 9
Journal: Bioinformatics
Volume: 33
Issue number: 10
DOI: 10.1093/bioinformatics/btx005
Publication status: Published - 15 May 2017

Fingerprint

Random Forest
Protein-protein Interaction
Proteins
Prediction
Predictors
Amino acids
Protein Sequence
Test Set
Annotation
Bioinformatics
Learning systems
Computational Biology
ROC Curve
Genes
Area Under Curve
Forests
Availability
Amino Acid Sequence
Receiver Operating Characteristic Curve

Cite this

@article{e1c176c8e75e46daaee7815a8d481259,
title = "Seeing the trees through the forest: sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest",
abstract = "Motivation: Genome sequencing is producing an ever-increasing amount of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains the prediction of the amino acids in a given protein sequence that are involved in protein-protein interactions. Such predictions are typically based on machine learning methods that take advantage of the properties and sequence positions of amino acids that are known to be involved in interaction. In this paper, we evaluate the importance of various features using Random Forest (RF), and include as a novel feature backbone flexibility predicted from sequences to further optimise protein interface prediction. Results: We observe that there is no single sequence feature that enables pinpointing interacting sites in our Random Forest models. However, combining different properties does increase the performance of interface prediction. Our homomeric-trained RF interface predictor is able to distinguish interface from non-interface residues with an area under the ROC curve of 0.72 in a homomeric test-set. The heteromeric-trained RF interface predictor performs better than existing predictors on an independent heteromeric test-set. We trained a more general predictor on the combined homomeric and heteromeric dataset, and show that in addition to predicting homomeric interfaces, it is also able to pinpoint interface residues in heterodimers. This suggests that our random forest model and the features included capture common properties of both homodimer and heterodimer interfaces. Availability and Implementation: The predictors and test datasets used in our analyses are freely available (http://www.ibi.vu.nl/downloads/RF_PPI/). Contact: k.a.feenstra@vu.nl. Supplementary information: Supplementary data are available at Bioinformatics online.",
author = "Qingzhen Hou and {De Geest}, {Paul F.G.} and Vranken, {Wim F.} and Jaap Heringa and Feenstra, {K. Anton}",
year = "2017",
month = "5",
day = "15",
doi = "10.1093/bioinformatics/btx005",
language = "English",
volume = "33",
pages = "1479--1487",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "10",

}

Seeing the trees through the forest: sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest. / Hou, Qingzhen; De Geest, Paul F.G.; Vranken, Wim F.; Heringa, Jaap; Feenstra, K. Anton.

In: Bioinformatics, Vol. 33, No. 10, 15.05.2017, p. 1479-1487.

TY - JOUR

T1 - Seeing the trees through the forest

T2 - sequence-based homo- and heteromeric protein-protein interaction sites prediction using random forest

AU - Hou, Qingzhen

AU - De Geest, Paul F.G.

AU - Vranken, Wim F.

AU - Heringa, Jaap

AU - Feenstra, K. Anton

PY - 2017/5/15

Y1 - 2017/5/15

N2 - Motivation: Genome sequencing is producing an ever-increasing amount of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains the prediction of the amino acids in a given protein sequence that are involved in protein-protein interactions. Such predictions are typically based on machine learning methods that take advantage of the properties and sequence positions of amino acids that are known to be involved in interaction. In this paper, we evaluate the importance of various features using Random Forest (RF), and include as a novel feature backbone flexibility predicted from sequences to further optimise protein interface prediction. Results: We observe that there is no single sequence feature that enables pinpointing interacting sites in our Random Forest models. However, combining different properties does increase the performance of interface prediction. Our homomeric-trained RF interface predictor is able to distinguish interface from non-interface residues with an area under the ROC curve of 0.72 in a homomeric test-set. The heteromeric-trained RF interface predictor performs better than existing predictors on an independent heteromeric test-set. We trained a more general predictor on the combined homomeric and heteromeric dataset, and show that in addition to predicting homomeric interfaces, it is also able to pinpoint interface residues in heterodimers. This suggests that our random forest model and the features included capture common properties of both homodimer and heterodimer interfaces. Availability and Implementation: The predictors and test datasets used in our analyses are freely available (http://www.ibi.vu.nl/downloads/RF_PPI/). Contact: k.a.feenstra@vu.nl. Supplementary information: Supplementary data are available at Bioinformatics online.

AB - Motivation: Genome sequencing is producing an ever-increasing amount of associated protein sequences. Few of these sequences have experimentally validated annotations, however, and computational predictions are becoming increasingly successful in producing such annotations. One key challenge remains the prediction of the amino acids in a given protein sequence that are involved in protein-protein interactions. Such predictions are typically based on machine learning methods that take advantage of the properties and sequence positions of amino acids that are known to be involved in interaction. In this paper, we evaluate the importance of various features using Random Forest (RF), and include as a novel feature backbone flexibility predicted from sequences to further optimise protein interface prediction. Results: We observe that there is no single sequence feature that enables pinpointing interacting sites in our Random Forest models. However, combining different properties does increase the performance of interface prediction. Our homomeric-trained RF interface predictor is able to distinguish interface from non-interface residues with an area under the ROC curve of 0.72 in a homomeric test-set. The heteromeric-trained RF interface predictor performs better than existing predictors on an independent heteromeric test-set. We trained a more general predictor on the combined homomeric and heteromeric dataset, and show that in addition to predicting homomeric interfaces, it is also able to pinpoint interface residues in heterodimers. This suggests that our random forest model and the features included capture common properties of both homodimer and heterodimer interfaces. Availability and Implementation: The predictors and test datasets used in our analyses are freely available (http://www.ibi.vu.nl/downloads/RF_PPI/). Contact: k.a.feenstra@vu.nl. Supplementary information: Supplementary data are available at Bioinformatics online.

UR - http://www.scopus.com/inward/record.url?scp=85020312275&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85020312275&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btx005

DO - 10.1093/bioinformatics/btx005

M3 - Article

VL - 33

SP - 1479

EP - 1487

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 10

ER -