ViroHyena is a Hyena-based nucleotide language model pretrained on the ViroBlend corpus, a 216 Mbp mixed pretraining dataset. ViroBlend uses source-wise stratified sampling to balance three sources: the human reference, multi-species genomes, and in-domain viral sequences.
The model was introduced as a baseline in the paper ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks.
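To make the idea of source-wise stratified sampling concrete, here is a minimal sketch. It is not the ViroBlend pipeline: the function name `stratified_sample`, the toy sequences, and the equal per-source weights are all assumptions for illustration, since the actual mixing proportions are not stated here. The sketch first picks a source according to its weight, then picks a sequence uniformly within that source, so a small source is not drowned out by a large one.

```python
import random
from typing import Dict, List, Sequence, Tuple


def stratified_sample(
    sources: Dict[str, Sequence[str]],
    weights: Dict[str, float],
    n_samples: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Draw (source_name, sequence) pairs.

    The source is chosen in proportion to its weight, and the sequence is then
    chosen uniformly within that source, independent of the source's size.
    """
    rng = random.Random(seed)
    names = sorted(sources)  # fixed order keeps runs reproducible
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(n_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        batch.append((name, rng.choice(sources[name])))
    return batch


# Toy corpora with very different sizes (hypothetical data).
toy_sources = {
    "human_reference": ["ACGT" * 4] * 1000,
    "multi_species": ["GGCA" * 4] * 100,
    "viral": ["TTAG" * 4] * 10,
}
# Equal weights are an assumption, not the published mixing ratio.
toy_weights = {"human_reference": 1.0, "multi_species": 1.0, "viral": 1.0}

batch = stratified_sample(toy_sources, toy_weights, n_samples=3000)
```

With equal weights, each source contributes roughly a third of the batch even though the toy viral set is 100 times smaller than the human set.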
| Model | Params | d_model | Layers |
|---|---|---|---|
| ViroHyena-436K | 0.436M | 128 | 2 |
| ViroHyena-1.6M | 1.6M | 256 | 2 |
| ViroHyena-6.6M | 6.6M | 256 | 8 |
| ViroHyena-253M | 253M | 1024 | 20 |
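The table above can also be captured as a small lookup structure, for example to pick a variant that fits a parameter budget. This is only a sketch: the class `ViroHyenaVariant` and the helper `largest_under_budget` are hypothetical names, not part of any released API, and the parameter counts are the rounded values from the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ViroHyenaVariant:
    """One row of the model table (hypothetical container, not an official API)."""
    name: str
    params: int    # rounded parameter count as reported in the table
    d_model: int
    n_layers: int


VARIANTS: Tuple[ViroHyenaVariant, ...] = (
    ViroHyenaVariant("ViroHyena-436K", 436_000, 128, 2),
    ViroHyenaVariant("ViroHyena-1.6M", 1_600_000, 256, 2),
    ViroHyenaVariant("ViroHyena-6.6M", 6_600_000, 256, 8),
    ViroHyenaVariant("ViroHyena-253M", 253_000_000, 1024, 20),
)


def largest_under_budget(max_params: int) -> Optional[ViroHyenaVariant]:
    """Return the largest variant with at most max_params parameters, or None."""
    fitting = [v for v in VARIANTS if v.params <= max_params]
    return max(fitting, key=lambda v: v.params) if fitting else None
```

For instance, `largest_under_budget(10_000_000)` selects the 6.6M variant, while a budget below 436K parameters returns `None`.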
ViroBench is a comprehensive diagnostic benchmark for viral sequence modeling, designed to evaluate models across two critical dimensions:
@article{ye2026virobench,
title={ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks},
author={Ye, Dongxin and Hu, Fang and Hu, Han and Hu, Shu and Tan, Yang and Ouyang, Wanli and Li, Stan Z and Cui, Jie and Dong, Nanqing},
journal={arXiv preprint arXiv:2605.25388},
year={2026}
}