Abstract
Understanding and predicting viral host range is a fundamental challenge in virology, with direct implications for emerging pathogen surveillance and pandemic preparedness. Traditional molecular descriptors such as PAAC and DPC capture only local physicochemical properties, limiting their ability to generalize across diverse viral taxa. In this work, we developed VirHostPRED, a novel computational framework based on protein language models (PLMs) that leverages embeddings derived from ESM-2 to predict the likelihood of human infectivity from individual viral protein sequences. Among nine machine learning algorithms evaluated, SVM-RBF achieved the best performance, reaching 0.852 accuracy and 0.914 AUC on the hold-out test set using ESM2-t48-15B embeddings. The progressive scaling of ESM-2 from 8 million to 15 billion parameters resulted in consistent gains in discriminative capability, while t-SNE projections revealed enhanced class separability with larger models, confirming that ESM-2 embeddings encode biologically meaningful structure. Comparative benchmarks with ensemble and linear classifiers further demonstrated that nonlinear models effectively capture the high-dimensional relationships within PLM representations. Our web server, VirHostPRED, enables rapid in silico prediction of human infectivity risk from viral protein sequences without requiring extensive experimental characterization, providing an efficient computational triage system to support early warning, prioritization, and resource allocation in viral surveillance pipelines. The VirHostPRED server is freely available at https://www.biochemintelli.com/virhostpred/.
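The abstract summarizes performance with two metrics, accuracy and AUC. As a minimal sketch of how they are commonly computed from classifier scores, the snippet below uses only the Python standard library. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions; they are not taken from the study or from the VirHostPRED code.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of sequences whose thresholded score matches the label.

    The 0.5 threshold is an assumption made for this illustration.
    """
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (rank-based formulation)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = human-infecting, 0 = not human-infecting.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

print(accuracy(labels, scores))  # 0.75
print(roc_auc(labels, scores))   # 0.875
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank positives against negatives, which is why the two figures can differ for the same classifier.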
| Original language | English |
|---|---|
| Article number | 7606 |
| Pages (from-to) | 1-12 |
| Number of pages | 12 |
| Journal | Scientific Reports |
| Volume | 16 |
| Issue number | 1 |
| Early online date | 25 Feb 2026 |
| Publication status | Published - 2026 |
Bibliographical note
Publisher Copyright: © The Author(s) 2026.
Keywords
- Bioinformatics
- ESM-2 embeddings
- Machine learning
- Protein language models
- SVM-RBF
- Viral host prediction
Fingerprint
Dive into the research topics of 'Protein language models enable accurate viral host range prediction'. Together they form a unique fingerprint.