
Protein language models enable accurate viral host range prediction

  • Jorge F. Beltrán*
  • Lisandra Herrera Belén
  • Fernanda Parraguez-Contreras
  • Alejandro J. Yañez

*Corresponding author for this work

Research output: Contribution to journal, Article, Academic, peer-reviewed

Abstract

Understanding and predicting viral host range is a fundamental challenge in virology, with direct implications for emerging pathogen surveillance and pandemic preparedness. Traditional molecular descriptors such as PAAC and DPC capture only local physicochemical properties, limiting their ability to generalize across diverse viral taxa. In this work, we developed VirHostPRED, a novel computational framework based on protein language models (PLMs) that leverages embeddings derived from ESM-2 to predict the likelihood of human infectivity from individual viral protein sequences. Among nine machine learning algorithms evaluated, SVM-RBF achieved the best performance, reaching 0.852 accuracy and 0.914 AUC on the hold-out test set using ESM2-t48-15B embeddings. The progressive scaling of ESM-2 from 8 million to 15 billion parameters resulted in consistent gains in discriminative capability, while t-SNE projections revealed enhanced class separability with larger models, confirming that ESM-2 embeddings encode biologically meaningful structure. Comparative benchmarks with ensemble and linear classifiers further demonstrated that nonlinear models effectively capture the high-dimensional relationships within PLM representations. Our web server, VirHostPRED, enables rapid in silico prediction of human infectivity risk from viral protein sequences without requiring extensive experimental characterization, providing an efficient computational triage system to support early warning, prioritization, and resource allocation in viral surveillance pipelines. The VirHostPRED server is freely available at https://www.biochemintelli.com/virhostpred/.
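As a rough illustration of two ingredients named in the abstract, the sketch below shows the RBF kernel that an SVM-RBF classifier uses to compare embedding vectors, and how accuracy and AUC can be computed from predicted scores. It is not the VirHostPRED implementation: the three-dimensional vectors stand in for real ESM-2 embeddings, and the labels, scores, `gamma` value, 0.5 threshold, and all function names are hypothetical choices made only for this example.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel between two vectors: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy stand-ins for protein embeddings (real ESM-2 vectors are far larger).
emb_a = [0.1, 0.2, 0.3]
emb_b = [0.1, 0.2, 0.3]   # identical to emb_a
emb_c = [1.0, -1.0, 0.5]  # far from emb_a

sim_same = rbf_kernel(emb_a, emb_b, gamma=1.0)
sim_far = rbf_kernel(emb_a, emb_c, gamma=1.0)

# Toy labels (1 = human-infecting, 0 = not) and made-up classifier scores.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]

toy_accuracy = accuracy(toy_labels, toy_scores)
toy_auc = roc_auc(toy_labels, toy_scores)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, which is what lets a kernel classifier draw nonlinear boundaries in embedding space. AUC is threshold-free, so it can differ from accuracy on the same scores, as it does on this toy data.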

Original language: English
Article number: 7606
Pages (from-to): 1-12
Number of pages: 12
Journal: Scientific Reports
Volume: 16
Issue number: 1
Early online date: 25 Feb 2026
Publication status: Published - 2026

Bibliographical note

Publisher Copyright: © The Author(s) 2026.

Keywords

  • Bioinformatics
  • ESM-2 embeddings
  • Machine learning
  • Protein language models
  • SVM-RBF
  • Viral host prediction
