duttaprat committed · verified
Commit b851978 · Parent: ddf07ae

Update README.md

Files changed (1): README.md +300 -23
README.md CHANGED
@@ -1,52 +1,329 @@
  ---
  license: apache-2.0
  tags:
  - genomics
- - dnabert
  - virology
  - foundation-model
  - hvilm
  ---

  # HViLM-base: A Foundation Model for Viral Genomics

- This is the base pre-trained model for **HViLM**, as described in the paper:
- **"HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism"**
-
- - **Paper:** [Link to your arXiv paper will go here]
- - **Fine-tuned Models:**
-   - `duttaprat/HViLM-finetuned-pathogenicity` (coming soon)
-   - `duttaprat/HViLM-finetuned-host-tropism` (coming soon)
-   - `duttaprat/HViLM-finetuned-transmissibility-R0` (coming soon)

  ## Model Description

- (Paste your abstract here)

- ## How to Use
-
- This model requires trusting remote code because it uses custom architecture files (`bert_layers.py`, etc.).

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch

- repo_id = "duttaprat/HViLM-base"
-
- # This will download the files you just uploaded
- tokenizer = AutoTokenizer.from_pretrained(repo_id)
  model = AutoModel.from_pretrained(
-     repo_id,
-     trust_remote_code=True  # <-- This is ESSENTIAL
  )

- print("Model and tokenizer loaded successfully!")
-
- # Example: Get embeddings for a sequence
- sequence = "ATGCGTACGT..."
- inputs = tokenizer(sequence, return_tensors="pt")
  with torch.no_grad():
      outputs = model(**inputs)
-     embeddings = outputs.last_hidden_state
-
- print(embeddings.shape)
  ```
 
---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- genomics
- virology
- dnabert
- foundation-model
- hvilm
- pathogenicity
- transmissibility
- host-tropism
- viral-genomics
datasets:
- VIRION
- BV-BRC
- VHDB
pipeline_tag: feature-extraction
widget:
- text: "ATGCGTACGTTAGCCGATCG"
  example_title: "Viral Sequence Example"
---

# HViLM-base: A Foundation Model for Viral Genomics

<div align="center">

[![Paper](https://img.shields.io/badge/Paper-RECOMB%202026-blue)](https://github.com/duttaprat/HViLM)
[![GitHub](https://img.shields.io/badge/Code-GitHub-black)](https://github.com/duttaprat/HViLM)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-HViLM--base-yellow)](https://huggingface.co/duttaprat/HViLM-base)

</div>

## Model Description

**HViLM (Human Virome Language Model)** is the first foundation model specifically designed for comprehensive viral risk assessment through multi-task prediction of pathogenicity, host tropism, and transmissibility. Built through continued pre-training of [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) on 5 million viral genome sequences from the [VIRION database](https://virion.verena.org), HViLM captures universal viral genomic patterns relevant to human disease risk assessment.

**Paper**: *HViLM: A Foundation Model for Viral Genomics Enables Multi-Task Prediction of Pathogenicity, Transmissibility, and Host Tropism* (RECOMB 2026)

**Authors**: Pratik Dutta, Ramana V. Davuluri (Stony Brook University)

**Code & Benchmarks**: [GitHub Repository](https://github.com/duttaprat/HViLM)

---
## Key Features

- 🦠 **Viral-specialized pre-training** on 5M sequences from 10.8M genomes spanning 45+ viral families
- 🎯 **Multi-task predictions** across 3 epidemiologically critical tasks:
  - **Pathogenicity classification**: 95.32% average accuracy
  - **Host tropism prediction**: 96.25% accuracy
  - **Transmissibility assessment**: 97.36% average accuracy
- 📊 **HVUE Benchmark**: 7 curated datasets totaling 60K+ viral sequences
- 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
- ⚡ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
- 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB

---

## Model Architecture

HViLM is built upon **DNABERT-2** (117M parameters), which uses the MosaicBERT architecture with:

- **Tokenization**: Byte Pair Encoding (BPE) with vocabulary size 4,096
- **Max sequence length**: 1,000 base pairs
- **Hidden size**: 768
- **Attention heads**: 12
- **Layers**: 12
- **Positional encoding**: Attention with Linear Biases (ALiBi)

**Continued pre-training**:

- **Objective**: Masked Language Modeling (MLM)
- **Training data**: 5M viral sequence chunks (non-overlapping, 1,000 bp)
- **Data source**: VIRION database (clustered at 80% identity with MMseqs2)
- **Training**: 10 epochs, AdamW optimizer, learning rate 5e-5
- **Hardware**: 4x NVIDIA A100 GPUs (72 hours)
- **Performance**: 94.2% MLM accuracy on validation set
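
As a quick sanity check, the key hyperparameters above can be read back from the released configuration. A minimal sketch, assuming the repo's `config.json` exposes the standard BERT-style field names:

```python
from transformers import AutoConfig

# Load the model configuration (custom architecture => trust_remote_code)
config = AutoConfig.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

print(config.vocab_size)           # expected: 4096 (BPE vocabulary)
print(config.hidden_size)          # expected: 768
print(config.num_hidden_layers)    # expected: 12
print(config.num_attention_heads)  # expected: 12
```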

---

## Installation

```bash
pip install transformers torch
# Optional: 'peft' is needed only for the LoRA fine-tuning example below
pip install peft
```

---

## Quick Start

### Basic Usage: Extract Sequence Embeddings

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True  # Required for custom architecture
)
model = AutoModel.from_pretrained(
    "duttaprat/HViLM-base",
    trust_remote_code=True
)

# Example: Get embeddings for a viral sequence
viral_sequence = "ATGCGTACGTTAGCCGATCGATTACGCGTACGTAGCTAGCTAGCT"

# Tokenize
inputs = tokenizer(
    viral_sequence,
    return_tensors="pt",
    truncation=True,
    max_length=512,
    padding=True
)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # [batch_size, seq_len, 768]

print(f"Sequence embeddings shape: {embeddings.shape}")

# Mean pooling for sequence-level representation
attention_mask = inputs["attention_mask"]
mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
sum_embeddings = torch.sum(embeddings * mask_expanded, dim=1)
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
mean_embeddings = sum_embeddings / sum_mask

print(f"Mean sequence embedding shape: {mean_embeddings.shape}")  # [batch_size, 768]
```
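
The same pipeline extends to batches. A minimal sketch reusing the `tokenizer` and `model` loaded above; the attention mask handles padding, and mean pooling proceeds exactly as before:

```python
# Embed several sequences at once: pass a list and let the tokenizer pad
sequences = [
    "ATGCGTACGTTAGCCGATCG",
    "ATGAAACCCGGGTTTACGTAGCTAG",
]
batch = tokenizer(
    sequences,
    return_tensors="pt",
    padding=True,       # pad to the longest sequence in the batch
    truncation=True,
    max_length=512,
)

with torch.no_grad():
    batch_embeddings = model(**batch).last_hidden_state

print(batch_embeddings.shape)  # [2, longest_seq_len_in_batch, 768]
```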

### Fine-tuning on Your Own Task

For fine-tuning HViLM on custom viral classification tasks, please refer to the [GitHub repository](https://github.com/duttaprat/HViLM) for complete training scripts and examples.

```python
# Example fine-tuning setup (see GitHub for complete code)
from transformers import AutoModel, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

# Load base model
model = AutoModel.from_pretrained("duttaprat/HViLM-base", trust_remote_code=True)

# Configure LoRA for parameter-efficient fine-tuning
lora_config = LoraConfig(
    r=8,                                # rank
    lora_alpha=16,                      # scaling factor
    target_modules=["query", "value"],  # attention layers
    lora_dropout=0.1,
    bias="none"
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Add classification head and train with Trainer/TrainingArguments
# (see GitHub for details)
```
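
The LoRA wrapper also makes it easy to check the parameter-efficiency figure quoted above (~0.3M trainable parameters per task). Continuing from the snippet above with PEFT's built-in helper; the exact count depends on the `target_modules` chosen:

```python
# Report how many parameters will actually train under this LoRA config
# ('model' is the PeftModel returned by get_peft_model above)
model.print_trainable_parameters()
# Prints a summary of the form:
#   trainable params: ... || all params: ... || trainable%: ...
```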

---

## Performance on HVUE Benchmark

### Pathogenicity Classification

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| CINI | 159 | **87.74%** | 86.98 | 74.48 |
| BVBRC-CoV | 18,066 | **98.26%** | 98.26 | 96.52 |
| BVBRC-Calici | 31,089 | **99.95%** | 99.93 | 99.90 |
| **Average** | **49,314** | **95.32%** | **95.06** | **90.30** |

### Host Tropism Prediction

| Dataset | Sequences | Accuracy | F1-Score | MCC |
|---------|-----------|----------|----------|-----|
| VHDB | 9,428 | **96.25%** | 91.34 | 91.24 |

### Transmissibility Assessment (R₀-based Classification)

| Viral Family | Sequences | Accuracy | F1-Score | MCC |
|--------------|-----------|----------|----------|-----|
| Coronaviridae | ~3,000 | **97.45%** | 97.37 | 93.43 |
| Orthomyxoviridae | ~2,500 | **95.62%** | 95.44 | 91.07 |
| Caliciviridae | ~1,800 | **99.95%** | 99.95 | 99.90 |
| **Average** | **~7,300** | **97.36%** | **97.59** | **94.80** |

**Comparison with baselines**: HViLM consistently outperforms Nucleotide Transformer 500M-1000g, GENA-LM, and DNABERT-MB across all tasks.
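
For reference, MCC in the tables above is the Matthews correlation coefficient (values appear scaled by 100). For binary classification it is computed from the confusion-matrix counts as:

$$
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)\,(TP + FN)\,(TN + FP)\,(TN + FN)}}
$$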

---

## Interpretability: Transcription Factor Mimicry

HViLM's attention mechanisms reveal biologically meaningful pathogenicity determinants through **molecular mimicry of host regulatory elements**:

- **42 conserved motifs** identified in high-attention regions of pathogenic coronaviruses
- **10 vertebrate transcription factors** targeted, including:
  - **Irf1** (Interferon Regulatory Factor 1): 8 convergent motifs for immune evasion
  - **Foxq1**: Multiple motifs for epithelial cell tropism
  - **ZNF354A**: 6 motifs for chromatin regulation

This demonstrates that HViLM captures genuine biological mechanisms rather than spurious correlations.

---

## Training Data

### Pre-training Corpus

- **Source**: [VIRION database](https://virion.verena.org) (476,242 virus-host associations)
- **Genomes**: 10,817,265 unique NCBI accession numbers
- **Processing**:
  - Segmented into non-overlapping 1,000 bp chunks (see the sketch after this list)
  - Clustered with MMseqs2 at 80% identity threshold
- **Final dataset**: 5 million unique sequences
- **Coverage**: 45+ viral families across all Baltimore classification groups
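
The segmentation step is simple enough to sketch. A toy illustration, not code from the HViLM repo; the text states only that chunks are non-overlapping and 1,000 bp long, so dropping the short trailing fragment is an assumption:

```python
def segment_genome(sequence: str, chunk_size: int = 1000) -> list[str]:
    """Split a genome into non-overlapping, fixed-length chunks."""
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    # Assumption: discard a trailing fragment shorter than chunk_size
    return [c for c in chunks if len(c) == chunk_size]

genome = "ACGT" * 700  # toy 2,800 bp "genome"
print([len(c) for c in segment_genome(genome)])  # [1000, 1000]
```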

### Data Leakage Prevention

A systematic overlap analysis was performed between the pre-training corpus and the HVUE benchmark datasets (a toy sketch of the accession-matching step follows the list):

- **Method**: Accession ID matching + MMseqs2 similarity (>95% identity)
- **Removed**: 186 overlapping sequences from the pre-training corpus
- **Result**: Clean separation between pre-training and evaluation data
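
The accession-ID half of that check reduces to a set intersection. A minimal sketch with hypothetical toy IDs; the MMseqs2 similarity pass is a separate external step:

```python
# Toy accession IDs; a real run would collect these from corpus metadata
pretrain_ids = {"ACC_0001", "ACC_0002", "ACC_0003"}
benchmark_ids = {"ACC_0002", "ACC_0099"}

# Sequences flagged for removal from the pre-training corpus
overlap = pretrain_ids & benchmark_ids
print(sorted(overlap))  # ['ACC_0002']
```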

---

## HVUE Benchmark Datasets

The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 curated datasets:

### Pathogenicity Prediction (3 datasets)
- **CINI**: 159 sequences, 4 viral families, manual literature curation
- **BVBRC-CoV**: 18,066 coronaviruses
- **BVBRC-Calici**: 31,089 caliciviruses

### Host Tropism Prediction (1 dataset)
- **VHDB**: 9,428 sequences, 30 viral families
  - Binary classification: human-tropic (13.1%) vs non-human-tropic (86.9%)

### Transmissibility Prediction (3 datasets)
- **Coronaviridae**: R₀-based classification (R₀<1 vs R₀≥1; see the toy labeling sketch below)
- **Orthomyxoviridae**: R₀-based classification
- **Caliciviridae**: R₀-based classification

All datasets available at: [GitHub - HVUE Benchmark](https://github.com/duttaprat/HViLM)
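
The R₀ threshold above maps directly to binary labels. A toy sketch of that labeling rule, for illustration only (the released datasets presumably ship with labels already assigned):

```python
def r0_label(r0: float) -> int:
    """Binary transmissibility label: 0 if R0 < 1, 1 if R0 >= 1."""
    return int(r0 >= 1.0)

print(r0_label(0.9), r0_label(2.5))  # 0 1
```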

---

## Reproducing Paper Results

To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:

```bash
# Clone repository
git clone https://github.com/duttaprat/HViLM.git
cd HViLM

# Install dependencies
pip install -r requirements.txt

# Reproduce pathogenicity results on CINI dataset
cd finetune
bash scripts/run_patho_cini.sh

# Reproduce host tropism results
bash scripts/run_tropism_vhdb.sh

# Reproduce transmissibility results
bash scripts/run_r0_coronaviridae.sh
```

For detailed instructions, see the [GitHub repository](https://github.com/duttaprat/HViLM).

---

## Citation

If you use DNABERT-2 (the base model), please also cite:

```bibtex
@inproceedings{zhou2023dnabert2,
  title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome},
  author={Zhou, Zhihan and Ji, Yanrong and Li, Weijian and Dutta, Pratik and Davuluri, Ramana and Liu, Han},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
```

---

## Model Card Authors

- **Pratik Dutta** (Senior Research Scientist, Stony Brook University)
- **Ramana V. Davuluri** (Professor, Stony Brook University)

---

## Contact

- **Email**: pratik.dutta@stonybrook.edu
- **Lab**: [Davuluri Lab, Stony Brook University](https://davulurilab.github.io/)
- **GitHub Issues**: [Report bugs or request features](https://github.com/duttaprat/HViLM/issues)

---

## Acknowledgments

This work builds upon [DNABERT-2](https://github.com/MAGICS-LAB/DNABERT_2) by Zhou et al. Pre-training data come from the [VIRION database](https://virion.verena.org), maintained by the Viral Emergence Research Initiative (Verena).

---

## License

This model is released under the **Apache License 2.0**.

---

## Disclaimer

HViLM is a research tool for computational biology and should not be used as the sole basis for clinical or public health decisions. Predictions should be validated through experimental methods and expert analysis.