---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- virology
- dnabert
- foundation-model
- hvilm
- pathogenicity
- transmissibility
- host-tropism
- viral-genomics
datasets:
- VIRION
- BV-BRC
- VHDB
- duttaprat/HVUE
pipeline_tag: feature-extraction
widget:
- text: "ATGCGTACGTTAGCCGATCG"
  example_title: "Viral Sequence Example"
---

# HViLM-base: A Foundation Model for Viral Genomics

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-RECOMB%202026-blue)](https://github.com/duttaprat/HViLM)
[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/duttaprat/HViLM)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
[![Hugging Face](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-HViLM--base-yellow)](https://huggingface.co/duttaprat/HViLM-base)

</div>

## Model Description

**HViLM (Human Virome Language Model)** is the first foundation model specifically designed for comprehensive viral risk assessment through multi-task prediction of pathogenicity, host tropism, and transmissibility. Built through continued pre-training of [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) on 5 million viral genome sequences from the [VIRION database](https://virion.verena.org), HViLM captures universal viral genomic patterns relevant for human disease risk assessment.

**Paper**: *HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism* (RECOMB 2026)

**Authors**: Pratik Dutta, Jack Vaska, Pallavi Surana, Rekha Sathian, Max Chao, Zhihan Zhou, Han Liu, and Ramana V. Davuluri

**Code & Benchmarks**: [GitHub Repository](https://github.com/duttaprat/HViLM)

---

## Key Features

- 🦠 **Viral-specialized pre-training** on 5M sequences from 10.8M genomes spanning 45+ viral families
- 🎯 **Multi-task predictions** across 3 epidemiologically critical tasks:
  - **Pathogenicity classification**: 95.32% average accuracy
  - **Host tropism prediction**: 96.25% accuracy
  - **Transmissibility assessment**: 97.36% average accuracy
- πŸ“Š **[HVUE Benchmark](https://huggingface.co/datasets/duttaprat/HVUE)**: 7 curated datasets totaling 60K+ viral sequences
- πŸ” **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
- ⚑ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
- πŸš€ **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB

---

## Model Architecture

HViLM is built upon **DNABERT-2** (117M parameters), which uses the MosaicBERT architecture with:
- **Tokenization**: Byte Pair Encoding (BPE) with vocabulary size 4,096
- **Max sequence length**: 1,000 base pairs
- **Hidden size**: 768
- **Attention heads**: 12
- **Layers**: 12
- **Positional encoding**: Attention with Linear Biases (ALiBi)
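
These hyperparameters can be checked directly from the released config. A minimal sketch, assuming the checkpoint exposes the standard BERT-style config fields:

```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters listed above
config = AutoConfig.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)
print(config.vocab_size)           # expected: 4096 (BPE vocabulary)
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
print(config.num_hidden_layers)    # expected: 12
```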

**Continued pre-training**:
- **Objective**: Masked Language Modeling (MLM)
- **Training data**: 5M viral sequence chunks (non-overlapping, 1000 bp)
- **Data source**: VIRION database (clustered at 80% identity with MMseqs2)
- **Training**: 10 epochs, AdamW optimizer, learning rate 5e-5
- **Hardware**: 4x NVIDIA A100 GPUs (72 hours)
- **Performance**: 94.2% MLM accuracy on validation set

---

## Installation

```bash
pip install transformers torch
```

---

## Quick Start

### Basic Usage: Extract Sequence Embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True  # Required for custom architecture
)
model = AutoModel.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True
)

# Example: Get embeddings for a viral sequence
viral_sequence = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"

# Tokenize
inputs = tokenizer(
    viral_sequence,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True
)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # [batch_size, seq_len, 768]

print(f"Sequence embeddings shape: {embeddings.shape}")

# Mean pooling for sequence-level representation
attention_mask = inputs['attention_mask']
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
mean_embeddings = sum_embeddings / sum_mask

print(f"Mean sequence embedding shape: {mean_embeddings.shape}")  # [batch_size, 768]
```
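
The mean-pooled vectors can be used directly as features, for example to compare two sequences. A small sketch reusing the `tokenizer` and `model` loaded above (the sequences are toy examples):

```python
import torch
import torch.nn.functional as F

def embed(sequences):
    """Mean-pooled HViLM embeddings for a list of DNA sequences."""
    inputs = tokenizer(sequences, return_tensors="pt", truncation=True,
                       max_length=512, padding=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

emb = embed(["ATGCGTACGTTAGCCGATCG", "ATGCGTACGTTAGCAGATCG"])
print(f"Cosine similarity: {F.cosine_similarity(emb[0:1], emb[1:2]).item():.4f}")
```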

### Fine-tuning on Your Own Task

For fine-tuning HViLM on custom viral classification tasks, please refer to the [GitHub repository](https://github.com/duttaprat/HViLM) for complete training scripts and examples.

```python
# Example fine-tuning setup (see GitHub for complete code)
from transformers import AutoModel, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                    # rank
    lora_alpha=16,         # scaling factor
    target_modules=["query", "value"],  # attention layers
    lora_dropout=0.1,
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Add classification head and train (see GitHub for details)
```
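
The complete training loop lives in the repository; the sketch below only illustrates what "adding a classification head" can look like (a linear layer over mean-pooled embeddings; the class name and dimensions are our assumptions, not the repository's code):

```python
import torch.nn as nn

class HViLMClassifier(nn.Module):
    """Illustrative head: mean-pooled HViLM embeddings -> linear classifier."""
    def __init__(self, backbone, hidden_size=768, num_labels=2):
        super().__init__()
        self.backbone = backbone  # the LoRA-wrapped model from above
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.head(pooled)  # logits: [batch_size, num_labels]

classifier = HViLMClassifier(model)  # train with Trainer or a custom loop
```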

---

## Performance on HVUE Benchmark

### Pathogenicity Classification

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| CINI | 159 | **87.74%** | 86.98 | 74.48 |
| BVBRC-CoV | 18,066 | **98.26%** | 98.26 | 96.52 |
| BVBRC-Calici | 31,089 | **99.95%** | 99.93 | 99.90 |
| **Average** | **49,314** | **95.32%** | **95.06** | **90.30** |

### Host Tropism Prediction

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| VHDB | 9,428 | **96.25%** | 91.34 | 91.24 |

### Transmissibility Assessment (Rβ‚€-based Classification)

| Viral Family | Sequences | Accuracy | F1-Score | MCC |
|--------------|-----------|----------|----------|-----|
| Coronaviridae | ~3,000 | **97.45%** | 97.37 | 93.43 |
| Orthomyxoviridae | ~2,500 | **95.62%** | 95.44 | 91.07 |
| Caliciviridae | ~1,800 | **99.95%** | 99.95 | 99.90 |
| **Average** | **~7,300** | **97.36%** | **97.59** | **94.80** |

**Comparison with baselines**: HViLM consistently outperforms Nucleotide Transformer 500M-1000g, GENA-LM, and DNABERT-MB across all tasks.

---

## Interpretability: Transcription Factor Mimicry

HViLM's attention mechanisms reveal biologically meaningful pathogenicity determinants through **molecular mimicry of host regulatory elements**:

- **42 conserved motifs** identified in high-attention regions of pathogenic coronaviruses
- **10 vertebrate transcription factors** targeted, including:
  - **Irf1** (Interferon Regulatory Factor 1): 8 convergent motifs for immune evasion
  - **Foxq1**: Multiple motifs for epithelial cell tropism
  - **ZNF354A**: 6 motifs for chromatin regulation
  
This demonstrates that HViLM captures genuine biological mechanisms rather than spurious correlations.
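
The motif discovery pipeline itself is in the repository; the sketch below only shows a first step one might take (per-token attention as a crude saliency signal), assuming the remote code supports `output_attentions=True` like a standard BERT encoder. Note that optimized attention backends may not return weights:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

inputs = tokenizer("ATGCGTACGTTAGCCGATCGATTACGCGTACG", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# attentions: one [batch, heads, seq_len, seq_len] tensor per layer
attn = outputs.attentions[-1].mean(dim=1)  # average over heads
received = attn.sum(dim=1).squeeze(0)      # total attention each token receives
top = torch.topk(received, k=3).indices
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0][top].tolist()))
```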

---

## Training Data

### Pre-training Corpus

- **Source**: [VIRION database](https://virion.verena.org) (476,242 virus-host associations)
- **Genomes**: 10,817,265 unique NCBI accession numbers
- **Processing**: 
  - Segmented into non-overlapping 1000 bp chunks (see the sketch after this list)
  - Clustered with MMseqs2 at 80% identity threshold
- **Final dataset**: 5 million unique sequences
- **Coverage**: 45+ viral families across all Baltimore classification groups
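
A minimal sketch of the segmentation step (window splitting only; clustering was done separately with MMseqs2, and dropping trailing fragments shorter than 1000 bp is our assumption):

```python
def chunk_genome(sequence: str, window: int = 1000) -> list[str]:
    """Split a genome into non-overlapping, full-length windows."""
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, window)]

chunks = chunk_genome("ACGT" * 700)  # toy 2,800 bp genome
print(len(chunks), len(chunks[0]))   # -> 2 1000
```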


---

## HVUE Benchmark Datasets

The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 curated datasets:

### Pathogenicity Prediction (3 datasets)
- **CINI**: 159 sequences, 4 viral families, manual literature curation
- **BVBRC-CoV**: 18,066 coronaviruses
- **BVBRC-Calici**: 31,089 caliciviruses

### Host Tropism Prediction (1 dataset)
- **VHDB**: 9,428 sequences, 30 viral families
- Binary classification: human-tropic (13.1%) vs non-human-tropic (86.9%)

### Transmissibility Prediction (3 datasets)
- **Coronaviridae**: Rβ‚€-based classification (Rβ‚€<1 vs Rβ‚€β‰₯1); a labeling sketch follows this list
- **Orthomyxoviridae**: Rβ‚€-based classification
- **Caliciviridae**: Rβ‚€-based classification
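
As a toy illustration of the Rβ‚€ binarization behind these labels (the function name is ours, not the dataset's schema):

```python
def transmissibility_label(r0: float) -> int:
    """1 if the virus can sustain transmission (R0 >= 1), else 0."""
    return int(r0 >= 1.0)

print([transmissibility_label(r) for r in (0.4, 1.0, 2.5)])  # [0, 1, 1]
```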

All datasets available at: **[πŸ€— duttaprat/HVUE](https://huggingface.co/datasets/duttaprat/HVUE)**

### Download and Use
```python
from datasets import load_dataset

# Load specific task
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")

# Load specific split
train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv")
```

---

## Reproducing Paper Results

### Step 1: Download HVUE Benchmark
```python
from datasets import load_dataset

# Download all datasets
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
```

### Step 2: Fine-tune and Evaluate

To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:



```bash
# Clone repository
git clone https://github.com/duttaprat/HViLM.git
cd HViLM

# Install dependencies
pip install -r requirements.txt

# Reproduce pathogenicity results on CINI dataset
cd finetune
bash scripts/run_patho_cini.sh

# Reproduce host tropism results
bash scripts/run_tropism_vhdb.sh

# Reproduce transmissibility results
bash scripts/run_r0_coronaviridae.sh
```

For detailed instructions, see the [GitHub repository](https://github.com/duttaprat/HViLM).

---

## Citation



If you use HViLM in your research, please cite our paper:

```bibtex
@article{dutta2025hvilm,
  title={HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism},
  author={Dutta, Pratik and Vaska, Jack and Surana, Pallavi and Sathian, Rekha and Chao, Max and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V.},
  journal={Submitted to RECOMB},
  year={2025},
  note={Under review}
}
```

If you use DNABERT-2 (the base model), please also cite:

```bibtex
@inproceedings{zhou2023dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and Davuluri, Ramana and Liu, Han},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
```

---

## Model Card Authors

- **Pratik Dutta** (Senior Research Scientist, Stony Brook University)
- **Ramana V. Davuluri** (Professor, Stony Brook University)

---

## Contact

- **Email**: pratik.dutta@stonybrook.edu
- **Lab**: [Davuluri Lab, Stony Brook University](https://davulurilab.github.io/)
- **GitHub Issues**: [Report bugs or request features](https://github.com/duttaprat/HViLM/issues)

---

## Acknowledgments

This work builds upon [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) by Zhou et al. Pre-training data from the [VIRION database](https://virion.verena.org) maintained by the Viral Emergence Research Initiative (Verena).



---

## License

This model is released under the **Apache License 2.0**.

---

## Disclaimer

HViLM is a research tool for computational biology and should not be used as the sole basis for clinical or public health decisions. Predictions should be validated through experimental methods and expert analysis.