ViroDNABERT2

ViroDNABERT2 is a DNABERT-2-based nucleotide language model pretrained on the ViroBlend (ViroBland) corpus, a small (~216 Mbp) mixed pretraining dataset that uses source-wise stratified sampling to balance human reference sequences, multi-species genomes, and in-domain viral sequences.
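
Source-wise stratified sampling can be illustrated with a minimal sketch: pick the source first, then pick a sequence within that source, so a small source such as the viral set is not swamped by larger ones. The pool contents, the equal weights, and the function name `stratified_sample` are all hypothetical; the actual ViroBlend mixing ratios and pipeline are not described here.

```python
import random

def stratified_sample(pools, weights, n, seed=0):
    """Draw n sequences, choosing the source first and then a sequence within it."""
    rng = random.Random(seed)
    sources = list(pools)
    probs = [weights[s] for s in sources]
    batch = []
    for _ in range(n):
        # Step 1: pick a source according to the mixing weights.
        source = rng.choices(sources, weights=probs, k=1)[0]
        # Step 2: pick a sequence uniformly from that source's pool.
        batch.append((source, rng.choice(pools[source])))
    return batch

# Toy pools with very different sizes (hypothetical data, not ViroBlend itself).
pools = {
    "human_reference": ["ACGT" * 8] * 1000,
    "multi_species": ["GGCA" * 8] * 200,
    "viral": ["TTAG" * 8] * 20,
}
# Equal weights are an assumption; the real mixing ratios are not stated.
weights = {"human_reference": 1.0, "multi_species": 1.0, "viral": 1.0}

batch = stratified_sample(pools, weights, n=3000, seed=0)
counts = {s: sum(1 for src, _ in batch if src == s) for s in pools}
print(counts)
```

With equal weights, each source contributes roughly a third of the batch even though the viral pool is fifty times smaller than the human one. Sampling sequences uniformly from the concatenated pools would instead yield almost no viral sequences.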

It is released as part of the ViroBench benchmark for evaluating viral nucleotide foundation models.

Model details

Item	Value
Architecture	DNABERT-2-117M (BERT-style, BPE tokenizer)
Pretraining data	ViroBlend (~216 Mbp)

Quick start

Install dependencies:

pip install torch transformers

Extract an embedding for a random DNA sequence:

python get_embedding.py
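
For orientation, a script of this kind might look roughly like the sketch below. It is not the contents of get_embedding.py: the helper names `random_dna` and `embed_sequence` are hypothetical, mean pooling over tokens is an assumed pooling choice, and as written it embeds with the base DNABERT-2 weights rather than the ViroDNABERT2 checkpoint.

```python
import random

def random_dna(length, seed=None):
    """Return a random DNA string over the alphabet A, C, G, T."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))

def embed_sequence(sequence, model_name="zhihan1996/DNABERT-2-117M"):
    """Mean-pool the last hidden states of one sequence into a single vector."""
    # Heavy dependencies are imported here so random_dna works without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    model.eval()

    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        hidden_states = model(input_ids)[0]  # shape: [1, num_tokens, hidden_size]
    # Average over the token axis to get one fixed-size vector per sequence.
    return hidden_states[0].mean(dim=0)      # shape: [hidden_size]
```

Calling `embed_sequence(random_dna(200, seed=0))` downloads the base model on first use and returns a one-dimensional tensor whose length is the model's hidden size.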

Or load in Python (base model + local pytorch_model.bin):

import torch
from transformers import AutoModel, AutoTokenizer

BASE = "zhihan1996/DNABERT-2-117M"
# REPO = "YDXX/ViroDNABERT2"  # after uploading to Hugging Face

tokenizer = AutoTokenizer.from_pretrained(BASE, trust_remote_code=True)
model = AutoModel.from_pretrained(BASE, trust_remote_code=True)
# load ViroDNABERT2 weights from pytorch_model.bin if needed (see get_embedding.py)
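
The weight-loading step left as a comment above could be sketched as follows. This assumes pytorch_model.bin is a plain state dict saved with `torch.save`, and that its keys either match the base model or differ only by a leading prefix; the helper names, the default path, and the prefix handling are hypothetical, and get_embedding.py may do this differently.

```python
def strip_prefix(state_dict, prefix):
    """Drop a leading prefix (for example "bert.") from keys that carry it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

def load_local_weights(model, path="pytorch_model.bin", prefix=""):
    """Load a local checkpoint into an already constructed model."""
    import torch  # imported lazily; only needed when a checkpoint is read

    state_dict = torch.load(path, map_location="cpu")
    if prefix:
        state_dict = strip_prefix(state_dict, prefix)
    # strict=False tolerates parameters that exist in only one of the two models.
    result = model.load_state_dict(state_dict, strict=False)
    # Inspect these counts: many missing keys usually means a key-name mismatch.
    print("missing keys:", len(result.missing_keys))
    print("unexpected keys:", len(result.unexpected_keys))
    return model
```

Continuing from the snippet above, `model = load_local_weights(model)` would overwrite the base parameters in place. If nearly every key is reported as missing, the weights were not actually applied and the key names in the checkpoint should be checked before trusting any embeddings.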

Files

config.json — training export config
pytorch_model.bin — fine-tuned backbone weights
tokenizer.json / tokenizer_config.json — tokenizer files
get_embedding.py — minimal embedding demo

Citation

@article{ye2026virobench,
  title={ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks},
  author={Ye, Dongxin and Hu, Fang and Hu, Han and Hu, Shu and Tan, Yang and Ouyang, Wanli and Li, Stan Z and Cui, Jie and Dong, Nanqing},
  journal={arXiv preprint arXiv:2605.25388},
  year={2026}
}
