| language: | |
| - en | |
| license: apache-2.0 | |
| library_name: transformers | |
| tags: | |
| - genomics | |
| - virology | |
| - dnabert | |
| - foundation-model | |
| - hvilm | |
| - pathogenicity | |
| - transmissibility | |
| - host-tropism | |
| - viral-genomics | |
| datasets: | |
| - VIRION | |
| - BV-BRC | |
| - VHDB | |
| - duttaprat/HVUE | |
| pipeline_tag: feature-extraction | |
| widget: | |
| - text: "ATGCGTACGTTAGCCGATCG" | |
| example_title: "Viral Sequence Example" | |
| # HViLM-base: A Foundation Model for Viral Genomics | |
| ## Model Description | |
| **HViLM (Human Virome Language Model)** is the first foundation model specifically designed for comprehensive viral risk assessment through multi-task prediction of pathogenicity, host tropism, and transmissibility. Built through continued pre-training of [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) on 5 million viral genome sequences from the [VIRION database](https://virion.verena.org), HViLM captures universal viral genomic patterns relevant for human disease risk assessment. | |
| **Paper**: *HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism* (RECOMB 2026) | |
| **Authors**: Pratik Dutta, Jack Vaska, Pallavi Surana, Rekha Sathian, Max Chao, Zhihan Zhou, Han Liu, and Ramana V. Davuluri | |
| **Code & Benchmarks**: [GitHub Repository](https://github.com/duttaprat/HViLM) | |
| --- | |
| ## Key Features | |
| - 🦠 **Viral-specialized pre-training** on 5M sequences from 10.8M genomes spanning 45+ viral families | |
| - 🎯 **Multi-task predictions** across 3 epidemiologically critical tasks: | |
| - **Pathogenicity classification**: 95.32% average accuracy | |
| - **Host tropism prediction**: 96.25% accuracy | |
| - **Transmissibility assessment**: 97.36% average accuracy | |
| - 📊 **[HVUE Benchmark](https://huggingface.co/datasets/duttaprat/HVUE)**: 7 curated datasets totaling 60K+ viral sequences | |
| - 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs) | |
| - ⚡ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task) | |
| - 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB | |
| --- | |
| ## Model Architecture | |
| HViLM is built upon **DNABERT-2** (117M parameters), which uses the MosaicBERT architecture with: | |
| - **Tokenization**: Byte Pair Encoding (BPE) with vocabulary size 4,096 | |
| - **Max sequence length**: 1,000 base pairs | |
| - **Hidden size**: 768 | |
| - **Attention heads**: 12 | |
| - **Layers**: 12 | |
| - **Positional encoding**: Attention with Linear Biases (ALiBi) | |
| **Continued pre-training**: | |
| - **Objective**: Masked Language Modeling (MLM) | |
| - **Training data**: 5M viral sequence chunks (non-overlapping, 1000 bp) | |
| - **Data source**: VIRION database (clustered at 80% identity with MMseqs2) | |
| - **Training**: 10 epochs, AdamW optimizer, learning rate 5e-5 | |
| - **Hardware**: 4x NVIDIA A100 GPUs (72 hours) | |
| - **Performance**: 94.2% MLM accuracy on validation set | |
| --- | |
| ## Installation | |
| ```bash | |
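| pip install transformers torch | |
| # Optional: needed only for the LoRA fine-tuning and HVUE dataset examples | |
| pip install peft datasets | |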
| ``` | |
| --- | |
| ## Quick Start | |
| ### Basic Usage: Extract Sequence Embeddings | |
| ```python | |
| import torch | |
| from transformers import AutoTokenizer, AutoModel | |
| # Load model and tokenizer | |
| tokenizer = AutoTokenizer.from_pretrained( | |
|     "duttaprat/HViLM-base", | |
|     trust_remote_code=True,  # Required for custom architecture | |
| ) | |
| model = AutoModel.from_pretrained( | |
|     "duttaprat/HViLM-base", | |
|     trust_remote_code=True, | |
| ) | |
| model.eval()  # Disable dropout for deterministic embeddings | |
| # Example: Get embeddings for a viral sequence | |
| viral_sequence = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT" | |
| # Tokenize (max_length is counted in BPE tokens, not base pairs) | |
| inputs = tokenizer( | |
|     viral_sequence, | |
|     return_tensors="pt", | |
|     truncation=True, | |
|     max_length=512, | |
|     padding=True, | |
| ) | |
| # Generate embeddings without tracking gradients | |
| with torch.no_grad(): | |
|     outputs = model(**inputs) | |
| # Index 0 holds the token-level hidden states whether the model returns a tuple or a ModelOutput | |
| embeddings = outputs[0]  # [batch_size, seq_len, 768] | |
| print(f"Sequence embeddings shape: {embeddings.shape}") | |
| # Mean pooling over non-padding tokens for a sequence-level representation | |
| attention_mask = inputs["attention_mask"] | |
| mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float() | |
| sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1) | |
| sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)  # Avoid division by zero | |
| mean_embeddings = sum_embeddings / sum_mask | |
| print(f"Mean sequence embedding shape: {mean_embeddings.shape}")  # [batch_size, 768] | |
| ``` | |
| ### Fine-tuning on Your Own Task | |
| For fine-tuning HViLM on custom viral classification tasks, please refer to the [GitHub repository](https://github.com/duttaprat/HViLM) for complete training scripts and examples. | |
| ```python | |
| # Example fine-tuning setup (see GitHub for complete code) | |
| from transformers import AutoModel, TrainingArguments, Trainer | |
| from peft import LoraConfig, get_peft_model | |
| # Load base model | |
| model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True) | |
| # Configure LoRA for parameter-efficient fine-tuning | |
| # target_modules must match submodule names in the loaded model; | |
| # inspect model.named_modules() if PEFT reports that no target modules were found | |
| lora_config = LoraConfig( | |
|     r=8,  # rank | |
|     lora_alpha=16,  # scaling factor | |
|     target_modules=["query", "value"],  # attention layers | |
|     lora_dropout=0.1, | |
|     bias="none", | |
| ) | |
| # Apply LoRA: only the adapter weights remain trainable | |
| model = get_peft_model(model, lora_config) | |
| # Add classification head and train (see GitHub for details) | |
| ``` | |
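As a rough sanity check on the trainable-parameter budget quoted in Key Features (~0.3M per task), the sketch below counts the parameters that a LoRA configuration of this shape would add. It assumes, without confirmation from this card, that each adapted module is a 768×768 linear projection, that two modules are adapted in each of the 12 layers, and that the classification head is not counted. `lora_param_count` is a hypothetical helper written for this illustration, not a PEFT function.

```python
def lora_param_count(d_in, d_out, rank, n_layers, modules_per_layer):
    """Count LoRA parameters.

    Each adapted linear layer gains two low-rank matrices:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    per_module = rank * d_in + d_out * rank
    return per_module * modules_per_layer * n_layers


# Assumed shapes: hidden size 768, 12 layers, query and value adapted, rank 8
total = lora_param_count(d_in=768, d_out=768, rank=8, n_layers=12, modules_per_layer=2)
print(f"{total:,} trainable LoRA parameters")  # 294,912
```

Under these assumptions the count is about 0.29M, which is in line with the ~0.3M figure; a task-specific classification head would add a small number of parameters on top.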
| --- | |
| ## Performance on HVUE Benchmark | |
| ### Pathogenicity Classification | |
| | Dataset | Sequences | Accuracy | F1-Score | MCC | | |
| |---------|-----------|----------|----------|-----| | |
| | CINI | 159 | **87.74%** | 86.98 | 74.48 | | |
| | BVBRC-CoV | 18,066 | **98.26%** | 98.26 | 96.52 | | |
| | BVBRC-Calici | 31,089 | **99.95%** | 99.93 | 99.90 | | |
| | **Average** | **49,314** | **95.32%** | **95.06** | **90.30** | | |
| ### Host Tropism Prediction | |
| | Dataset | Sequences | Accuracy | F1-Score | MCC | | |
| |---------|-----------|----------|----------|-----| | |
| | VHDB | 9,428 | **96.25%** | 91.34 | 91.24 | | |
| ### Transmissibility Assessment (R₀-based Classification) | |
| | Viral Family | Sequences | Accuracy | F1-Score | MCC | | |
| |--------------|-----------|----------|----------|-----| | |
| | Coronaviridae | ~3,000 | **97.45%** | 97.37 | 93.43 | | |
| | Orthomyxoviridae | ~2,500 | **95.62%** | 95.44 | 91.07 | | |
| | Caliciviridae | ~1,800 | **99.95%** | 99.95 | 99.90 | | |
| | **Average** | **~7,300** | **97.36%** | **97.59** | **94.80** | | |
| **Comparison with baselines**: HViLM consistently outperforms Nucleotide Transformer 500M-1000g, GENA-LM, and DNABERT-MB across all tasks. | |
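For readers less familiar with the three reported metrics, the following minimal sketch computes accuracy, F1-score, and Matthews correlation coefficient (MCC) from a binary confusion matrix. It is an illustration only: it assumes binary labels, F1 computed on the positive class, and values scaled by 100 as in the tables above; the evaluation code in the GitHub repository may differ (for example, in F1 averaging). The confusion-matrix counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, F1 (positive class), and MCC, each scaled by 100."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC is defined as 0 when any marginal count is zero
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": round(100 * accuracy, 2),
        "f1": round(100 * f1, 2),
        "mcc": round(100 * mcc, 2),
    }


# Hypothetical counts for a balanced 100-sequence test set
metrics = binary_metrics(tp=45, fp=5, fn=5, tn=45)
print(metrics)  # {'accuracy': 90.0, 'f1': 90.0, 'mcc': 80.0}
```

MCC uses all four cells of the confusion matrix, so it is a useful complement to accuracy on imbalanced datasets such as VHDB.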
| --- | |
| ## Interpretability: Transcription Factor Mimicry | |
| HViLM's attention mechanisms reveal biologically meaningful pathogenicity determinants through **molecular mimicry of host regulatory elements**: | |
| - **42 conserved motifs** identified in high-attention regions of pathogenic coronaviruses | |
| - **10 vertebrate transcription factors** targeted, including: | |
| - **Irf1** (Interferon Regulatory Factor 1): 8 convergent motifs for immune evasion | |
| - **Foxq1**: Multiple motifs for epithelial cell tropism | |
| - **ZNF354A**: 6 motifs for chromatin regulation | |
| This demonstrates that HViLM captures genuine biological mechanisms rather than spurious correlations. | |
| --- | |
| ## Training Data | |
| ### Pre-training Corpus | |
| - **Source**: [VIRION database](https://virion.verena.org) (476,242 virus-host associations) | |
| - **Genomes**: 10,817,265 unique NCBI accession numbers | |
| - **Processing**: | |
| - Segmented into non-overlapping 1000 bp chunks | |
| - Clustered with MMseqs2 at 80% identity threshold | |
| - **Final dataset**: 5 million unique sequences | |
| - **Coverage**: 45+ viral families across all Baltimore classification groups | |
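The chunking step described above can be sketched as follows. This is not the authors' preprocessing code: `chunk_genome` is a hypothetical helper, and dropping a trailing fragment shorter than 1000 bp is an assumption, since the card does not say how remainders were handled. The MMseqs2 clustering step is not shown.

```python
def chunk_genome(sequence, chunk_size=1000, keep_remainder=False):
    """Split a genome into consecutive, non-overlapping chunks.

    By default a final fragment shorter than chunk_size is discarded
    (an assumption made for this sketch).
    """
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    if chunks and not keep_remainder and len(chunks[-1]) < chunk_size:
        chunks.pop()
    return chunks


# Toy 2,500 bp sequence: two full chunks plus a 500 bp remainder
genome = "ACGT" * 625
chunks = chunk_genome(genome)
print(len(chunks), [len(c) for c in chunks])  # 2 [1000, 1000]
```

Passing `keep_remainder=True` retains the short final fragment, which may be preferable for genomes shorter than one chunk.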
| --- | |
| ## HVUE Benchmark Datasets | |
| The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 curated datasets: | |
| ### Pathogenicity Prediction (3 datasets) | |
| - **CINI**: 159 sequences, 4 viral families, manual literature curation | |
| - **BVBRC-CoV**: 18,066 coronaviruses | |
| - **BVBRC-Calici**: 31,089 caliciviruses | |
| ### Host Tropism Prediction (1 dataset) | |
| - **VHDB**: 9,428 sequences, 30 viral families | |
| - Binary classification: human-tropic (13.1%) vs non-human-tropic (86.9%) | |
| ### Transmissibility Prediction (3 datasets) | |
| - **Coronaviridae**: R₀-based classification (R₀<1 vs R₀≥1) | |
| - **Orthomyxoviridae**: R₀-based classification | |
| - **Caliciviridae**: R₀-based classification | |
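The R₀-based labeling rule stated for Coronaviridae (R₀<1 vs R₀≥1) can be expressed as a simple threshold. The sketch below assumes the same threshold applies to the other two families and that the classes are encoded as 0 and 1; both are assumptions, and the virus names and R₀ values are hypothetical.

```python
def r0_label(r0, threshold=1.0):
    """Return 1 if R0 >= threshold, else 0 (label encoding is assumed)."""
    if r0 < 0:
        raise ValueError("R0 must be non-negative")
    return int(r0 >= threshold)


# Hypothetical R0 estimates, for illustration only
examples = {"virus_a": 0.7, "virus_b": 1.0, "virus_c": 2.5}
labels = {name: r0_label(value) for name, value in examples.items()}
print(labels)  # {'virus_a': 0, 'virus_b': 1, 'virus_c': 1}
```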
| All datasets available at: **[🤗 duttaprat/HVUE](https://huggingface.co/datasets/duttaprat/HVUE)** | |
| ### Download and Use | |
| ```python | |
| from datasets import load_dataset | |
| # Load specific task | |
| host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism") | |
| pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity") | |
| transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility") | |
| # Load specific split | |
| train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv") | |
| ``` | |
| --- | |
| ## Reproducing Paper Results | |
| ### Step 1: Download HVUE Benchmark | |
| ```python | |
| from datasets import load_dataset | |
| # Download all datasets | |
| host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism") | |
| pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity") | |
| transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility") | |
| ``` | |
| ### Step 2: Fine-tune and Evaluate | |
| To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions: | |
| ```bash | |
| # Clone repository | |
| git clone https://github.com/duttaprat/HViLM.git | |
| cd HViLM | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| # Reproduce pathogenicity results on CINI dataset | |
| cd finetune | |
| bash scripts/run_patho_cini.sh | |
| # Reproduce host tropism results | |
| bash scripts/run_tropism_vhdb.sh | |
| # Reproduce transmissibility results | |
| bash scripts/run_r0_coronaviridae.sh | |
| ``` | |
| For detailed instructions, see the [GitHub repository](https://github.com/duttaprat/HViLM). | |
| --- | |
| ## Citation | |
| If you use HViLM in your research, please cite our paper: | |
| ```bibtex | |
| @article{dutta2025hvilm, | |
| title={HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism}, | |
| author={Dutta, Pratik and Vaska, Jack and Surana, Pallavi and Sathian, Rekha and Chao, Max and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V.}, | |
| journal={Submitted to RECOMB}, | |
| year={2025}, | |
| note={Under review} | |
| } | |
| ``` | |
| Because HViLM is built on DNABERT-2 (the base model), please also cite: | |
| ```bibtex | |
| @article{zhou2023dnabert2, | |
| title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome}, | |
| author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and Davuluri, Ramana and Liu, Han}, | |
| journal={ICLR}, | |
| year={2024} | |
| } | |
| ``` | |
| --- | |
| ## Model Card Authors | |
| - **Pratik Dutta** (Senior Research Scientist, Stony Brook University) | |
| - **Ramana V. Davuluri** (Professor, Stony Brook University) | |
| --- | |
| ## Contact | |
| - **Email**: pratik.dutta@stonybrook.edu | |
| - **Lab**: [Davuluri Lab, Stony Brook University](https://davulurilab.github.io/) | |
| - **GitHub Issues**: [Report bugs or request features](https://github.com/duttaprat/HViLM/issues) | |
| --- | |
| ## Acknowledgments | |
| This work builds upon [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) by Zhou et al. The pre-training data come from the [VIRION database](https://virion.verena.org), maintained by the Viral Emergence Research Initiative (Verena). | |
| --- | |
| ## License | |
| This model is released under the **Apache License 2.0**. | |
| --- | |
| ## Disclaimer | |
| HViLM is a research tool for computational biology and should not be used as the sole basis for clinical or public health decisions. Predictions should be validated through experimental methods and expert analysis. |