ProteinGLUE multi-task benchmark suite for self-supervised protein modeling

Henriette Capel, Robin Weiler, Maurits Dijkstra, Reinier Vleugels, Peter Bloem, K. Anton Feenstra*

*Corresponding author for this work

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

Self-supervised language modeling is a rapidly developing approach for the analysis of protein sequence data. However, work in this area is heterogeneous, making comparison of models and methods difficult. Moreover, models are often evaluated on only one or two downstream tasks, making it unclear whether they capture generally useful properties. We introduce the ProteinGLUE benchmark: a set of seven per-amino-acid tasks for evaluating learned protein representations. We also offer reference code, and we provide two baseline models with hyperparameters specifically trained for these benchmarks. Pre-training was done on two tasks: masked symbol prediction and next sentence prediction. We show that pre-training yields higher performance on a variety of downstream tasks, such as secondary structure and protein interaction interface prediction, compared to no pre-training. However, the larger base model does not outperform the smaller medium model. We expect the ProteinGLUE benchmark dataset introduced here, together with the two baseline pre-trained models and their performance evaluations, to be of great value to the field of protein sequence-based property prediction. Code and datasets are available at https://github.com/ibivu/protein-glue.
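
The masked symbol prediction pre-training task named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository above for that): the function name `mask_sequence`, the `#` mask symbol, and the 15% masking rate are assumptions chosen for illustration, and the abstract does not state the rate used. The sketch hides a random subset of residues in a protein sequence and records the originals as prediction targets.

```python
import random

MASK = "#"  # hypothetical mask symbol; not taken from the ProteinGLUE code


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Mask a random subset of residues in an amino-acid sequence.

    Returns (masked_seq, targets), where targets maps each masked
    position to the original residue a model would have to predict.
    The 15% default rate is an assumption, not a value from the paper.
    """
    rng = rng or random.Random()
    # Mask at least one residue, but never more than the sequence holds.
    n_mask = min(len(seq), max(1, round(len(seq) * mask_rate)))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]  # remember the true residue
        masked[pos] = MASK       # hide it from the model
    return "".join(masked), targets


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
    masked_seq, targets = mask_sequence(example, rng=random.Random(0))
    print(masked_seq)
    print(targets)
```

During pre-training, a model would receive the masked sequence as input and be scored on how well it recovers the residues stored in `targets`.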

Original language: English
Article number: 16047
Pages (from-to): 1-14
Number of pages: 14
Journal: Scientific Reports
Volume: 12
Publication status: Published - 26 Sept 2022

Bibliographical note

Funding Information:
H.C. and R.W. would like to thank The Network Institute VU for funding this project. This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative and on the Distributed ASCI Supercomputer.

Publisher Copyright:
© 2022, The Author(s).

Keywords

  • Amino Acid Sequence
  • Amino Acids/chemistry
  • Benchmarking
  • Natural Language Processing
  • Proteins
