---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- virology
- dnabert
- foundation-model
- hvilm
- pathogenicity
- transmissibility
- host-tropism
- viral-genomics
datasets:
- VIRION
- BV-BRC
- VHDB
- duttaprat/HVUE
pipeline_tag: feature-extraction
widget:
- text: "ATGCGTACGTTAGCCGATCG"
  example_title: "Viral Sequence Example"
---

# HViLM-base: A Foundation Model for Viral Genomics

## Model Description

**HViLM (Human Virome Language Model)** is the first foundation model specifically designed for comprehensive viral risk assessment through multi-task prediction of pathogenicity, host tropism, and transmissibility. Built through continued pre-training of [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) on 5 million viral genome sequences from the [VIRION database](https://virion.verena.org), HViLM captures universal viral genomic patterns relevant for human disease risk assessment.

**Paper**: *HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism* (RECOMB 2026)

**Authors**: Pratik Dutta, Jack Vaska, Pallavi Surana, Rekha Sathian, Max Chao, Zhihan Zhou, Han Liu, and Ramana V. Davuluri

**Code & Benchmarks**: [GitHub Repository](https://github.com/duttaprat/HViLM)

---

## Key Features

- 🦠 **Viral-specialized pre-training** on 5M sequences from 10.8M genomes spanning 45+ viral families
- 🎯 **Multi-task predictions** across 3 epidemiologically critical tasks:
  - **Pathogenicity classification**: 95.32% average accuracy
  - **Host tropism prediction**: 96.25% accuracy
  - **Transmissibility assessment**: 97.36% average accuracy
- 📊 **[HVUE Benchmark](https://huggingface.co/datasets/duttaprat/HVUE)**: 7 curated datasets totaling 60K+ viral sequences
- 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
- ⚡ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
- 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB

---

## Model Architecture

HViLM is built upon **DNABERT-2** (117M parameters), which uses the MosaicBERT architecture with:
- **Tokenization**: Byte Pair Encoding (BPE) with vocabulary size 4,096
- **Max sequence length**: 1,000 base pairs
- **Hidden size**: 768
- **Attention heads**: 12
- **Layers**: 12
- **Positional encoding**: Attention with Linear Biases (ALiBi)

**Continued pre-training**:
- **Objective**: Masked Language Modeling (MLM)
- **Training data**: 5M viral sequence chunks (non-overlapping, 1000 bp)
- **Data source**: VIRION database (clustered at 80% identity with MMseqs2)
- **Training**: 10 epochs, AdamW optimizer, learning rate 5e-5
- **Hardware**: 4x NVIDIA A100 GPUs (72 hours)
- **Performance**: 94.2% MLM accuracy on validation set

---

## Installation

```bash
pip install transformers torch
```

---

## Quick Start

### Basic Usage: Extract Sequence Embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True  # Required for custom architecture
)
model = AutoModel.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True
)

# Example: Get embeddings for a viral sequence
viral_sequence = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"

# Tokenize
inputs = tokenizer(
    viral_sequence,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True
)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # [batch_size, seq_len, 768]

print(f"Sequence embeddings shape: {embeddings.shape}")

# Mean pooling for sequence-level representation
attention_mask = inputs['attention_mask']
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
mean_embeddings = sum_embeddings / sum_mask

print(f"Mean sequence embedding shape: {mean_embeddings.shape}")  # [batch_size, 768]
```

### Fine-tuning on Your Own Task

To fine-tune HViLM on custom viral classification tasks, see the [GitHub repository](https://github.com/duttaprat/HViLM) for complete training scripts and examples.

```python
# Example fine-tuning setup (see GitHub for complete code)
# Requires the `peft` package in addition to the packages listed under Installation
from transformers import AutoModel, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                # rank
    lora_alpha=16,                      # scaling factor
    # Names must match modules in the loaded model; list them with
    # model.named_modules() if PEFT reports that no target modules were found
    target_modules=["query", "value"],  # attention layers
    lora_dropout=0.1,
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Add classification head and train (see GitHub for details)
```
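
As a rough check on the ~0.3M figure quoted under Key Features, the arithmetic below counts the LoRA weights implied by the configuration above. It is a sketch under stated assumptions, not the authors' accounting: each adapted projection is taken to be a 768×768 linear layer, LoRA is taken to add two low-rank matrices per adapted module (`A` of shape r × d_in and `B` of shape d_out × r), both target modules are assumed to be adapted in all 12 layers, and the classification head is not counted. `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights LoRA adds to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * d_in + d_out * rank


HIDDEN_SIZE = 768      # from the Model Architecture section
NUM_LAYERS = 12        # from the Model Architecture section
RANK = 8               # r in the LoraConfig above
ADAPTED_PER_LAYER = 2  # assumption: one query and one value projection per layer

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_module * ADAPTED_PER_LAYER * NUM_LAYERS

print(f"LoRA parameters per adapted module: {per_module:,}")
print(f"LoRA parameters in total: {total:,}")
```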

---

## Performance on HVUE Benchmark

### Pathogenicity Classification

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| CINI | 159 | **87.74%** | 86.98 | 74.48 |
| BVBRC-CoV | 18,066 | **98.26%** | 98.26 | 96.52 |
| BVBRC-Calici | 31,089 | **99.95%** | 99.93 | 99.90 |
| **Average** | **49,314** | **95.32%** | **95.06** | **90.30** |

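
The Average row of the pathogenicity table appears to be an unweighted mean over the three datasets, while the sequence count is a sum. That reading is an inference from the numbers rather than something the table states. A quick check with the values copied from the table (`unweighted_mean` is a hypothetical helper):

```python
# Values copied from the pathogenicity table: (sequences, accuracy, F1, MCC)
results = {
    "CINI": (159, 87.74, 86.98, 74.48),
    "BVBRC-CoV": (18_066, 98.26, 98.26, 96.52),
    "BVBRC-Calici": (31_089, 99.95, 99.93, 99.90),
}


def unweighted_mean(values):
    """Average that gives every dataset equal weight regardless of its size."""
    values = list(values)
    return sum(values) / len(values)


total_sequences = sum(row[0] for row in results.values())
mean_accuracy = round(unweighted_mean(row[1] for row in results.values()), 2)
mean_f1 = round(unweighted_mean(row[2] for row in results.values()), 2)
mean_mcc = round(unweighted_mean(row[3] for row in results.values()), 2)

print(total_sequences, mean_accuracy, mean_f1, mean_mcc)
```

Because CINI is tiny compared with the two BV-BRC sets, an unweighted mean lets it pull the average down much more than a sequence-weighted mean would.
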

### Host Tropism Prediction

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| VHDB | 9,428 | **96.25%** | 91.34 | 91.24 |

### Transmissibility Assessment (R₀-based Classification)

| Viral Family | Sequences | Accuracy | F1-Score | MCC |
|--------------|-----------|----------|----------|-----|
| Coronaviridae | ~3,000 | **97.45%** | 97.37 | 93.43 |
| Orthomyxoviridae | ~2,500 | **95.62%** | 95.44 | 91.07 |
| Caliciviridae | ~1,800 | **99.95%** | 99.95 | 99.90 |
| **Average** | **~7,300** | **97.36%** | **97.59** | **94.80** |

**Comparison with baselines**: HViLM consistently outperforms Nucleotide Transformer 500M-1000g, GENA-LM, and DNABERT-MB across all tasks.
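
For readers comparing these tables against their own runs, the sketch below computes accuracy, F1, and the Matthews correlation coefficient (MCC) from binary confusion-matrix counts using the standard definitions. It assumes binary labels and that the tables report F1 and MCC multiplied by 100; the averaging scheme used in the paper is not stated in this card. `binary_metrics` is a hypothetical helper name, and the counts are made up for illustration. MCC uses all four cells of the confusion matrix, which keeps it informative on imbalanced sets such as VHDB.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, F1 and MCC for a binary classifier, as fractions (not percentages)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # MCC is defined as 0 here when any marginal total is empty
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Made-up counts for illustration only (not taken from HVUE)
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
print({name: round(100 * value, 2) for name, value in metrics.items()})
```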

---

## Interpretability: Transcription Factor Mimicry

HViLM's attention mechanisms reveal biologically meaningful pathogenicity determinants through **molecular mimicry of host regulatory elements**:

- **42 conserved motifs** identified in high-attention regions of pathogenic coronaviruses
- **10 vertebrate transcription factors** targeted, including:
  - **Irf1** (Interferon Regulatory Factor 1): 8 convergent motifs for immune evasion
  - **Foxq1**: Multiple motifs for epithelial cell tropism
  - **ZNF354A**: 6 motifs for chromatin regulation

This demonstrates that HViLM captures genuine biological mechanisms rather than spurious correlations.

---

## Training Data

### Pre-training Corpus

- **Source**: [VIRION database](https://virion.verena.org) (476,242 virus-host associations)
- **Genomes**: 10,817,265 unique NCBI accession numbers
- **Processing**:
  - Segmented into non-overlapping 1000 bp chunks
  - Clustered with MMseqs2 at 80% identity threshold
- **Final dataset**: 5 million unique sequences
- **Coverage**: 45+ viral families across all Baltimore classification groups
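
The segmentation step in the list above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: `chunk_genome` is a hypothetical helper, and dropping a trailing fragment shorter than the chunk size is an assumption, since the card does not say how remainders were handled. Clustering with MMseqs2 is a separate step and is not shown.

```python
from typing import Iterator


def chunk_genome(sequence: str, chunk_size: int = 1000, keep_remainder: bool = False) -> Iterator[str]:
    """Yield consecutive, non-overlapping chunks of `chunk_size` bases."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        # Assumption: short trailing fragments are skipped unless requested
        if len(chunk) == chunk_size or keep_remainder:
            yield chunk


# Toy 2,500 bp "genome" built by repeating a short motif
genome = "ACGTTGCA" * 312 + "ACGT"
chunks = list(chunk_genome(genome))
print(len(genome), len(chunks), [len(c) for c in chunks])
```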

---

## HVUE Benchmark Datasets

The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 curated datasets:

### Pathogenicity Prediction (3 datasets)
- **CINI**: 159 sequences, 4 viral families, manual literature curation
- **BVBRC-CoV**: 18,066 coronaviruses
- **BVBRC-Calici**: 31,089 caliciviruses

### Host Tropism Prediction (1 dataset)
- **VHDB**: 9,428 sequences, 30 viral families
  - Binary classification: human-tropic (13.1%) vs non-human-tropic (86.9%)

### Transmissibility Prediction (3 datasets)
- **Coronaviridae**: R₀-based classification (R₀<1 vs R₀≥1)
- **Orthomyxoviridae**: R₀-based classification
- **Caliciviridae**: R₀-based classification
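
The R₀ threshold behind the transmissibility labels can be written as a one-line rule. The function name and label strings below are hypothetical; only the threshold (R₀ < 1 vs R₀ ≥ 1) comes from the benchmark description, and how R₀ values were assigned to sequences is not covered here.

```python
def r0_label(r0: float, threshold: float = 1.0) -> str:
    """Map a basic reproduction number to a binary transmissibility class."""
    if r0 < 0:
        raise ValueError("R0 cannot be negative")
    # R0 >= 1: each case causes at least one new case on average
    return "R0>=1" if r0 >= threshold else "R0<1"


# Illustrative values only (not HVUE entries)
for value in (0.7, 1.0, 2.5):
    print(value, r0_label(value))
```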

All datasets available at: **[🤗 duttaprat/HVUE](https://huggingface.co/datasets/duttaprat/HVUE)**

### Download and Use
```python
from datasets import load_dataset

# Load specific task
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")

# Load specific split
train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv")
```

---

## Reproducing Paper Results

### Step 1: Download HVUE Benchmark
```python
from datasets import load_dataset

# Download all datasets
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
```

### Step 2: Fine-tune and Evaluate

To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:

```bash
# Clone repository
git clone https://github.com/duttaprat/HViLM.git
cd HViLM

# Install dependencies
pip install -r requirements.txt

# Reproduce pathogenicity results on CINI dataset
cd finetune
bash scripts/run_patho_cini.sh

# Reproduce host tropism results
bash scripts/run_tropism_vhdb.sh

# Reproduce transmissibility results
bash scripts/run_r0_coronaviridae.sh
```

For detailed instructions, see the [GitHub repository](https://github.com/duttaprat/HViLM).

---

## Citation

If you use HViLM in your research, please cite our paper:

```bibtex
@article{dutta2025hvilm,
  title={HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism},
  author={Dutta, Pratik and Vaska, Jack and Surana, Pallavi and Sathian, Rekha and Chao, Max and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V.},
  journal={Submitted to RECOMB},
  year={2025},
  note={Under review}
}
```

If you use DNABERT-2 (the base model), please also cite:

```bibtex
@article{zhou2023dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and Davuluri, Ramana and Liu, Han},
  journal={ICLR},
  year={2024}
}
```

---

## Model Card Authors

- **Pratik Dutta** (Senior Research Scientist, Stony Brook University)
- **Ramana V. Davuluri** (Professor, Stony Brook University)

---

## Contact

- **Email**: pratik.dutta@stonybrook.edu
- **Lab**: [Davuluri Lab, Stony Brook University](https://davulurilab.github.io/)
- **GitHub Issues**: [Report bugs or request features](https://github.com/duttaprat/HViLM/issues)

---

## Acknowledgments

This work builds upon [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) by Zhou et al. The pre-training data come from the [VIRION database](https://virion.verena.org), which is maintained by the Viral Emergence Research Initiative (Verena).

---

## License

This model is released under the **Apache License 2.0**.

---

## Disclaimer

HViLM is a research tool for computational biology and should not be used as the sole basis for clinical or public health decisions. Predictions should be validated through experimental methods and expert analysis.