Update README.md
README.md
CHANGED
|
@@ -17,6 +17,7 @@ datasets:
|
|
| 17 |
- VIRION
|
| 18 |
- BV-BRC
|
| 19 |
- VHDB
|
|
|
|
| 20 |
pipeline_tag: feature-extraction
|
| 21 |
widget:
|
| 22 |
- text: "ATGCGTACGTTAGCCGATCG"
|
|
@@ -53,7 +54,7 @@ widget:
|
|
| 53 |
- **Pathogenicity classification**: 95.32% average accuracy
|
| 54 |
- **Host tropism prediction**: 96.25% accuracy
|
| 55 |
- **Transmissibility assessment**: 97.36% average accuracy
|
| 56 |
-
- **HVUE Benchmark**: 7 curated datasets totaling 60K+ viral sequences
|
| 57 |
- **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
|
| 58 |
- **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
|
| 59 |
- **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB
|
|
@@ -220,12 +221,6 @@ This demonstrates that HViLM captures genuine biological mechanisms rather than
|
|
| 220 |
- **Final dataset**: 5 million unique sequences
|
| 221 |
- **Coverage**: 45+ viral families across all Baltimore classification groups
|
| 222 |
|
| 223 |
-
### Data Leakage Prevention
|
| 224 |
-
|
| 225 |
-
Systematic overlap analysis performed between pre-training corpus and HVUE benchmark datasets:
|
| 226 |
-
- **Method**: Accession ID matching + MMseqs2 similarity (>95% identity)
|
| 227 |
-
- **Removed**: 186 overlapping sequences from pre-training corpus
|
| 228 |
-
- **Result**: Clean separation between pre-training and evaluation data
|
| 229 |
|
| 230 |
---
|
| 231 |
|
|
@@ -247,14 +242,41 @@ The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 cur
|
|
| 247 |
- **Orthomyxoviridae**: R₀-based classification
|
| 248 |
- **Caliciviridae**: R₀-based classification
|
| 249 |
|
| 250 |
-
All datasets available at: [
|
| 251 |
|
| 252 |
---
|
| 253 |
|
| 254 |
## Reproducing Paper Results
|
| 255 |
|
| 256 |
To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:
|
| 257 |
|
| 258 |
```bash
|
| 259 |
# Clone repository
|
| 260 |
git clone https://github.com/duttaprat/HViLM.git
|
|
|
|
| 17 |
- VIRION
|
| 18 |
- BV-BRC
|
| 19 |
- VHDB
|
| 20 |
+
- duttaprat/HVUE
|
| 21 |
pipeline_tag: feature-extraction
|
| 22 |
widget:
|
| 23 |
- text: "ATGCGTACGTTAGCCGATCG"
|
|
|
|
| 54 |
- **Pathogenicity classification**: 95.32% average accuracy
|
| 55 |
- **Host tropism prediction**: 96.25% accuracy
|
| 56 |
- **Transmissibility assessment**: 97.36% average accuracy
|
| 57 |
+
- **[HVUE Benchmark](https://huggingface.co/datasets/duttaprat/HVUE)**: 7 curated datasets totaling 60K+ viral sequences
|
| 58 |
- **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
|
| 59 |
- **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
|
| 60 |
- **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB
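
To give a feel for where a figure like "~0.3M trainable parameters" can come from, here is a back-of-the-envelope sketch of LoRA parameter counting. The hidden size, layer count, rank, and number of adapted matrices below are illustrative assumptions and are not stated in this README; the actual HViLM configuration may differ.

```python
def lora_trainable_params(d_in, d_out, rank, n_matrices):
    """Count LoRA parameters.

    Each adapted weight of shape (d_out, d_in) gains two low-rank factors:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return n_matrices * rank * (d_in + d_out)


# Hypothetical configuration (assumed for illustration, not taken from the README)
hidden = 768            # assumed hidden size
layers = 12             # assumed number of transformer layers
rank = 8                # assumed LoRA rank
adapted_per_layer = 2   # assumed: two square projection matrices per layer

total = lora_trainable_params(hidden, hidden, rank, layers * adapted_per_layer)
print(f"{total:,} trainable LoRA parameters")  # 294,912
```

Under these assumed values the count lands close to 0.3M, but other combinations of rank and target modules give similar totals, so this should not be read as the configuration used in the paper.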
|
|
|
|
| 221 |
- **Final dataset**: 5 million unique sequences
|
| 222 |
- **Coverage**: 45+ viral families across all Baltimore classification groups
|
| 223 |
|
| 224 |
|
| 225 |
---
|
| 226 |
|
|
|
|
| 242 |
- **Orthomyxoviridae**: R₀-based classification
|
| 243 |
- **Caliciviridae**: R₀-based classification
|
| 244 |
|
| 245 |
+
All datasets available at: **[duttaprat/HVUE](https://huggingface.co/datasets/duttaprat/HVUE)**
|
| 246 |
+
|
| 247 |
+
### Download and Use
|
| 248 |
+
```python
|
| 249 |
+
from datasets import load_dataset
|
| 250 |
+
|
| 251 |
+
# Load specific task
|
| 252 |
+
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
|
| 253 |
+
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
|
| 254 |
+
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
|
| 255 |
+
|
| 256 |
+
# Load specific split
|
| 257 |
+
train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv")
|
| 258 |
+
```
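
When more than one split is needed, `load_dataset` also accepts `data_files` as a mapping from split name to file path. The helper below is a minimal sketch of building that mapping; `hvue_data_files` is a hypothetical name, and only `Host_Tropism/train.csv` is confirmed by the snippet above, so any other split file names would need to be checked against the dataset repository.

```python
def hvue_data_files(task_dir, splits=("train",)):
    """Map split names to CSV paths inside one HVUE task directory.

    Assumes each split is stored as <task_dir>/<split>.csv, which the
    README only confirms for the train split.
    """
    return {split: f"{task_dir}/{split}.csv" for split in splits}


files = hvue_data_files("Host_Tropism")
print(files)

# Passing the mapping to load_dataset (requires network access):
# from datasets import load_dataset
# ds = load_dataset("duttaprat/HVUE", data_files=files)
# train = ds["train"]  # without a split argument, load_dataset returns a DatasetDict keyed by split
```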
|
| 259 |
|
| 260 |
---
|
| 261 |
|
| 262 |
## Reproducing Paper Results
|
| 263 |
|
| 264 |
+
### Step 1: Download HVUE Benchmark
|
| 265 |
+
```python
|
| 266 |
+
from datasets import load_dataset
|
| 267 |
+
|
| 268 |
+
# Download all datasets
|
| 269 |
+
host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
|
| 270 |
+
pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
|
| 271 |
+
transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
|
| 272 |
+
```
|
| 273 |
+
|
| 274 |
+
### Step 2: Fine-tune and Evaluate
|
| 275 |
+
|
| 276 |
To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:
|
| 277 |
|
| 278 |
+
|
| 279 |
+
|
| 280 |
```bash
|
| 281 |
# Clone repository
|
| 282 |
git clone https://github.com/duttaprat/HViLM.git
|