duttaprat/HViLM-base

@@ -17,6 +17,7 @@ datasets:
 - VIRION
 - BV-BRC
 - VHDB
 pipeline_tag: feature-extraction
 widget:
 - text: "ATGCGTACGTTAGCCGATCG"
@@ -53,7 +54,7 @@ widget:
   - **Pathogenicity classification**: 95.32% average accuracy
   - **Host tropism prediction**: 96.25% accuracy
   - **Transmissibility assessment**: 97.36% average accuracy
-- 📊 **HVUE Benchmark**: 7 curated datasets totaling 60K+ viral sequences
 - 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
 - ⚡ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
 - 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB
@@ -220,12 +221,6 @@ This demonstrates that HViLM captures genuine biological mechanisms rather than
 - **Final dataset**: 5 million unique sequences
 - **Coverage**: 45+ viral families across all Baltimore classification groups
-### Data Leakage Prevention
-Systematic overlap analysis performed between pre-training corpus and HVUE benchmark datasets:
-- **Method**: Accession ID matching + MMseqs2 similarity (>95% identity)
-- **Removed**: 186 overlapping sequences from pre-training corpus
-- **Result**: Clean separation between pre-training and evaluation data
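The accession-ID half of the overlap check listed above can be sketched as a simple set comparison. This is a minimal illustration, not the authors' pipeline: the function names, the record layout (dicts with an `accession` key), and the version-suffix normalization are assumptions, and the MMseqs2 similarity step (>95% identity) is not reproduced here.

```python
def normalize_accession(accession):
    """Strip whitespace and a trailing version suffix (assumed convention),
    so that 'AB000001.2' and 'ab000001.1' compare as equal."""
    return accession.strip().split(".")[0].upper()


def remove_overlaps(pretrain_records, benchmark_accessions):
    """Split pre-training records into (kept, removed) by accession-ID match.

    pretrain_records: iterable of dicts with an 'accession' key (hypothetical layout).
    benchmark_accessions: iterable of accession strings from the evaluation sets.
    """
    benchmark_ids = {normalize_accession(a) for a in benchmark_accessions}
    kept, removed = [], []
    for record in pretrain_records:
        if normalize_accession(record["accession"]) in benchmark_ids:
            removed.append(record)  # overlaps with the benchmark: drop from pre-training
        else:
            kept.append(record)
    return kept, removed
```

Sequences that share no accession ID but are near-identical would still pass this filter, which is why the list above pairs it with a sequence-similarity search.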
 ---
@@ -247,14 +242,41 @@ The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 cur
 - **Orthomyxoviridae**: R₀-based classification
 - **Caliciviridae**: R₀-based classification
-All datasets available at: [GitHub - HVUE Benchmark](https://github.com/duttaprat/HViLM)
 ---
 ## Reproducing Paper Results
 To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:
 ```bash
 # Clone repository
 git clone https://github.com/duttaprat/HViLM.git

 - VIRION
 - BV-BRC
 - VHDB
+- duttaprat/HVUE
 pipeline_tag: feature-extraction
 widget:
 - text: "ATGCGTACGTTAGCCGATCG"
   - **Pathogenicity classification**: 95.32% average accuracy
   - **Host tropism prediction**: 96.25% accuracy
   - **Transmissibility assessment**: 97.36% average accuracy
+- 📊 **[HVUE Benchmark](https://huggingface.co/datasets/duttaprat/HVUE)**: 7 curated datasets totaling 60K+ viral sequences
 - 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
 - ⚡ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
 - 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB
 - **Final dataset**: 5 million unique sequences
 - **Coverage**: 45+ viral families across all Baltimore classification groups
 ---
 - **Orthomyxoviridae**: R₀-based classification
 - **Caliciviridae**: R₀-based classification
+All datasets available at: **[🤗 duttaprat/HVUE](https://huggingface.co/datasets/duttaprat/HVUE)**
+### Download and Use
+```python
+from datasets import load_dataset
+# Load all splits for a specific task
+host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
+pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
+transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
+# Load a single file; it is returned as the "train" split of a DatasetDict
+train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv")
+```
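The per-task calls above follow one pattern, so their arguments can be generated from a small lookup table. The sketch below is a minimal illustration under stated assumptions: the helper name `hvue_load_kwargs` is hypothetical, the directory names are copied from the example above (including the `Pathogenecity` spelling, which is a path and not prose), and only `Host_Tropism/train.csv` is confirmed as a file name. The helper only builds keyword arguments, so it runs without network access.

```python
# Directory names exactly as they appear in the loading example above.
# "Pathogenecity" is kept verbatim because it is a path in the dataset repo.
HVUE_TASK_DIRS = {
    "host_tropism": "Host_Tropism",
    "pathogenicity": "Pathogenecity",
    "transmissibility": "Transmissibility",
}


def hvue_load_kwargs(task, split_file=None):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper).

    With split_file=None the whole task directory is selected via data_dir;
    otherwise a single file inside it is selected via data_files.
    """
    if task not in HVUE_TASK_DIRS:
        raise KeyError(f"unknown HVUE task: {task!r}")
    task_dir = HVUE_TASK_DIRS[task]
    if split_file is None:
        return {"path": "duttaprat/HVUE", "data_dir": task_dir}
    return {"path": "duttaprat/HVUE", "data_files": f"{task_dir}/{split_file}"}
```

With the `datasets` library installed, `load_dataset(**hvue_load_kwargs("host_tropism"))` would be equivalent to the first call in the example above.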
 ---
 ## Reproducing Paper Results
+### Step 1: Download HVUE Benchmark
+```python
+from datasets import load_dataset
+# Download all datasets
+host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
+pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
+transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
+```
+### Step 2: Fine-tune and Evaluate
 To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:
 ```bash
 # Clone repository
 git clone https://github.com/duttaprat/HViLM.git