duttaprat committed · verified
Commit b851978 · Parent: ddf07ae

Update README.md

Files changed (1): README.md +300 -23
README.md CHANGED
@@ -1,52 +1,329 @@
  ---
  license: apache-2.0
  tags:
  - genomics
- - dnabert
  - virology
  - foundation-model
  - hvilm
  ---

  # HViLM-base: A Foundation Model for Viral Genomics

- This is the base pre-trained model for **HViLM**, as described in the paper:
- **"HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism"**
-
- - **Paper:** [Link to your arXiv paper will go here]
- - **Fine-tuned Models:**
-   - `duttaprat/HViLM-finetuned-pathogenicity` (coming soon)
-   - `duttaprat/HViLM-finetuned-host-tropism` (coming soon)
-   - `duttaprat/HViLM-finetuned-transmissibility-R0` (coming soon)

  ## Model Description

- (Paste your abstract here)

- ## How to Use
-
- This model requires trusting remote code because it uses custom architecture files (`bert_layers.py`, etc.).

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch

- repo_id = "duttaprat/HViLM-base"
-
- # This will download the files you just uploaded
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
  model = AutoModel.from_pretrained(
-     repo_id,
-     trust_remote_code=True  # <-- This is ESSENTIAL
  )

- print("Model and tokenizer loaded successfully!")
-
- # Example: Get embeddings for a sequence
- sequence = "ATGCGTACGT..."
- inputs = tokenizer(sequence, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)
-     embeddings = outputs.last_hidden_state
-
- print(embeddings.shape)
  ```
 
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- virology
- dnabert
- foundation-model
- hvilm
- pathogenicity
- transmissibility
- host-tropism
- viral-genomics
datasets:
- VIRION
- BV-BRC
- VHDB
pipeline_tag: feature-extraction
widget:
- text: "ATGCGTACGTTAGCCGATCG"
  example_title: "Viral Sequence Example"
---

# HViLM-base: A Foundation Model for Viral Genomics

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-RECOMB%202026-blue)](https://github.com/duttaprat/HViLM)
[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/duttaprat/HViLM)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-HViLM--base-yellow)](https://huggingface.co/duttaprat/HViLM-base)

</div>

## Model Description

**HViLM (Human Virome Language Model)** is the first foundation model specifically designed for comprehensive viral risk assessment through multi-task prediction of pathogenicity, host tropism, and transmissibility. Built through continued pre-training of [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) on 5 million viral genome sequences from the [VIRION database](https://virion.verena.org), HViLM captures universal viral genomic patterns relevant to human disease risk assessment.

**Paper**: *HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism* (RECOMB 2026)

**Authors**: Pratik Dutta, Ramana V. Davuluri (Stony Brook University)

**Code & Benchmarks**: [GitHub Repository](https://github.com/duttaprat/HViLM)

---
## Key Features

- 🦠 **Viral-specialized pre-training** on 5M sequences from 10.8M genomes spanning 45+ viral families
- 🎯 **Multi-task predictions** across 3 epidemiologically critical tasks:
  - **Pathogenicity classification**: 95.32% average accuracy
  - **Host tropism prediction**: 96.25% accuracy
  - **Transmissibility assessment**: 97.36% average accuracy
- 📊 **HVUE Benchmark**: 7 curated datasets totaling 60K+ viral sequences
- 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
- ⚡ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
- 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB

---

## Model Architecture

HViLM is built upon **DNABERT-2** (117M parameters), which uses the MosaicBERT architecture with:

- **Tokenization**: Byte Pair Encoding (BPE) with vocabulary size 4,096
- **Max sequence length**: 1,000 base pairs
- **Hidden size**: 768
- **Attention heads**: 12
- **Layers**: 12
- **Positional encoding**: Attention with Linear Biases (ALiBi)

**Continued pre-training**:

- **Objective**: Masked Language Modeling (MLM)
- **Training data**: 5M viral sequence chunks (non-overlapping, 1,000 bp)
- **Data source**: VIRION database (clustered at 80% identity with MMseqs2)
- **Training**: 10 epochs, AdamW optimizer, learning rate 5e-5
- **Hardware**: 4x NVIDIA A100 GPUs (72 hours)
- **Performance**: 94.2% MLM accuracy on validation set
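
As a quick sanity check, the key hyperparameters above can be read back from the released configuration. A minimal sketch, assuming the repo's `config.json` exposes the standard BERT-style field names:

```python
from transformers import AutoConfig

# Load the model configuration (custom architecture => trust_remote_code)
config = AutoConfig.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

print(config.vocab_size)           # expected: 4096 (BPE vocabulary)
print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 12
print(config.num_attention_heads)  # expected: 12
```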

---

## Installation

```bash
pip install transformers torch
# Optional: 'peft' is needed only for the LoRA fine-tuning example below
pip install peft
```

---

## Quick Start

### Basic Usage: Extract Sequence Embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True  # Required for custom architecture
)
model = AutoModel.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True
)

# Example: Get embeddings for a viral sequence
viral_sequence = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"

# Tokenize
inputs = tokenizer(
    viral_sequence,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True
)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # [batch_size, seq_len, 768]

print(f"Sequence embeddings shape: {embeddings.shape}")

# Mean pooling for sequence-level representation
attention_mask = inputs["attention_mask"]
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
mean_embeddings = sum_embeddings / sum_mask

print(f"Mean sequence embedding shape: {mean_embeddings.shape}")  # [batch_size, 768]
```
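
The same pipeline extends to batches. A minimal sketch reusing the `tokenizer` and `model` loaded above; the attention mask handles padding, and mean pooling proceeds exactly as before:

```python
# Embed several sequences at once: pass a list and let the tokenizer pad
sequences = [
    "ATGCGTACGTTAGCCGATCG",
    "ATGAAACCCGGGTTTACGTAGCTAG",
]
batch = tokenizer(
    sequences,
    return_tensors="pt",
    padding=True,       # pad to the longest sequence in the batch
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    batch_embeddings = model(**batch).last_hidden_state

print(batch_embeddings.shape)  # [2, longest_seq_len_in_batch, 768]
```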

### Fine-tuning on Your Own Task

For fine-tuning HViLM on custom viral classification tasks, please refer to the [GitHub repository](https://github.com/duttaprat/HViLM) for complete training scripts and examples.

```python
# Example fine-tuning setup (see GitHub for complete code)
from transformers import AutoModel, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                # rank
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # attention layers
    lora_dropout=0.1,
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Add classification head and train with Trainer/TrainingArguments
# (see GitHub for details)
```
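
The LoRA wrapper also makes it easy to check the parameter-efficiency figure quoted above (~0.3M trainable parameters per task). Continuing from the snippet above with PEFT's built-in helper; the exact count depends on the `target_modules` chosen:

```python
# Report how many parameters will actually train under this LoRA config
# ('model' is the PeftModel returned by get_peft_model above)
model.print_trainable_parameters()
# Prints a summary of the form:
#   trainable params: ... || all params: ... || trainable%: ...
```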

---

## Performance on HVUE Benchmark

### Pathogenicity Classification

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| CINI | 159 | **87.74%** | 86.98 | 74.48 |
| BVBRC-CoV | 18,066 | **98.26%** | 98.26 | 96.52 |
| BVBRC-Calici | 31,089 | **99.95%** | 99.93 | 99.90 |
| **Average** | **49,314** | **95.32%** | **95.06** | **90.30** |

### Host Tropism Prediction

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| VHDB | 9,428 | **96.25%** | 91.34 | 91.24 |

### Transmissibility Assessment (R₀-based Classification)

| Viral Family | Sequences | Accuracy | F1-Score | MCC |
|--------------|-----------|----------|----------|-----|
| Coronaviridae | ~3,000 | **97.45%** | 97.37 | 93.43 |
| Orthomyxoviridae | ~2,500 | **95.62%** | 95.44 | 91.07 |
| Caliciviridae | ~1,800 | **99.95%** | 99.95 | 99.90 |
| **Average** | **~7,300** | **97.36%** | **97.59** | **94.80** |

**Comparison with baselines**: HViLM consistently outperforms Nucleotide Transformer 500M-1000g, GENA-LM, and DNABERT-MB across all tasks.
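
For reference, MCC in the tables above is the Matthews correlation coefficient (values appear scaled by 100). For binary classification it is computed from the confusion-matrix counts as:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
$$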

---

## Interpretability: Transcription Factor Mimicry

HViLM's attention mechanisms reveal biologically meaningful pathogenicity determinants through **molecular mimicry of host regulatory elements**:

- **42 conserved motifs** identified in high-attention regions of pathogenic coronaviruses
- **10 vertebrate transcription factors** targeted, including:
  - **Irf1** (Interferon Regulatory Factor 1): 8 convergent motifs for immune evasion
  - **Foxq1**: Multiple motifs for epithelial cell tropism
  - **ZNF354A**: 6 motifs for chromatin regulation

This demonstrates that HViLM captures genuine biological mechanisms rather than spurious correlations.

---

## Training Data

### Pre-training Corpus

- **Source**: [VIRION database](https://virion.verena.org) (476,242 virus-host associations)
- **Genomes**: 10,817,265 unique NCBI accession numbers
- **Processing**:
  - Segmented into non-overlapping 1,000 bp chunks (see the sketch after this list)
  - Clustered with MMseqs2 at 80% identity threshold
- **Final dataset**: 5 million unique sequences
- **Coverage**: 45+ viral families across all Baltimore classification groups
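
The segmentation step is simple enough to sketch. A toy illustration, not code from the HViLM repo; the text states only that chunks are non-overlapping and 1,000 bp long, so dropping the short trailing fragment is an assumption:

```python
def segment_genome(sequence: str, chunk_size: int = 1000) -> list[str]:
    """Split a genome into non-overlapping, fixed-length chunks."""
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    # Assumption: discard a trailing fragment shorter than chunk_size
    return [c for c in chunks if len(c) == chunk_size]

genome = "ACGT" * 700  # toy 2,800 bp "genome"
print([len(c) for c in segment_genome(genome)])  # [1000, 1000]
```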

### Data Leakage Prevention

A systematic overlap analysis was performed between the pre-training corpus and the HVUE benchmark datasets (a toy sketch of the accession-matching step follows the list):

- **Method**: Accession ID matching + MMseqs2 similarity (>95% identity)
- **Removed**: 186 overlapping sequences from the pre-training corpus
- **Result**: Clean separation between pre-training and evaluation data
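
The accession-ID half of that check reduces to a set intersection. A minimal sketch with hypothetical toy IDs; the MMseqs2 similarity pass is a separate external step:

```python
# Toy accession IDs; a real run would collect these from corpus metadata
pretrain_ids = {"ACC_0001", "ACC_0002", "ACC_0003"}
benchmark_ids = {"ACC_0002", "ACC_0099"}

# Sequences flagged for removal from the pre-training corpus
overlap = pretrain_ids & benchmark_ids
print(sorted(overlap))  # ['ACC_0002']
```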

---

## HVUE Benchmark Datasets

The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 curated datasets:

### Pathogenicity Prediction (3 datasets)
- **CINI**: 159 sequences, 4 viral families, manual literature curation
- **BVBRC-CoV**: 18,066 coronaviruses
- **BVBRC-Calici**: 31,089 caliciviruses

### Host Tropism Prediction (1 dataset)
- **VHDB**: 9,428 sequences, 30 viral families
  - Binary classification: human-tropic (13.1%) vs non-human-tropic (86.9%)

### Transmissibility Prediction (3 datasets)
- **Coronaviridae**: R₀-based classification (R₀<1 vs R₀≥1; see the toy labeling sketch below)
- **Orthomyxoviridae**: R₀-based classification
- **Caliciviridae**: R₀-based classification

All datasets available at: [GitHub - HVUE Benchmark](https://github.com/duttaprat/HViLM)
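
The R₀ threshold above maps directly to binary labels. A toy sketch of that labeling rule, for illustration only (the released datasets presumably ship with labels already assigned):

```python
def r0_label(r0: float) -> int:
    """Binary transmissibility label: 0 if R0 < 1, 1 if R0 >= 1."""
    return int(r0 >= 1.0)

print(r0_label(0.9), r0_label(2.5))  # 0 1
```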

---

## Reproducing Paper Results

To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:

```bash
# Clone repository
git clone https://github.com/duttaprat/HViLM.git
cd HViLM

# Install dependencies
pip install -r requirements.txt

# Reproduce pathogenicity results on CINI dataset
cd finetune
bash scripts/run_patho_cini.sh

# Reproduce host tropism results
bash scripts/run_tropism_vhdb.sh

# Reproduce transmissibility results
bash scripts/run_r0_coronaviridae.sh
```

For detailed instructions, see the [GitHub repository](https://github.com/duttaprat/HViLM).

---

## Citation

If you use DNABERT-2 (the base model), please also cite:

```bibtex
@inproceedings{zhou2023dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and Davuluri, Ramana and Liu, Han},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
```

---

## Model Card Authors

- **Pratik Dutta** (Senior Research Scientist, Stony Brook University)
- **Ramana V. Davuluri** (Professor, Stony Brook University)

---

## Contact

- **Email**: pratik.dutta@stonybrook.edu
- **Lab**: [Davuluri Lab, Stony Brook University](https://davulurilab.github.io/)
- **GitHub Issues**: [Report bugs or request features](https://github.com/duttaprat/HViLM/issues)

---

## Acknowledgments

This work builds upon [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) by Zhou et al. Pre-training data come from the [VIRION database](https://virion.verena.org), maintained by the Viral Emergence Research Initiative (Verena).

---

## License

This model is released under the **Apache License 2.0**.

---

## Disclaimer

HViLM is a research tool for computational biology and should not be used as the sole basis for clinical or public health decisions. Predictions should be validated through experimental methods and expert analysis.