duttaprat committed on
Commit 3fe8f58 · verified · 1 Parent(s): 4c74e24

Update README.md

Files changed (1)
  1. README.md +30 -8
README.md CHANGED
@@ -17,6 +17,7 @@ datasets:
17
  - VIRION
18
  - BV-BRC
19
  - VHDB
 
20
  pipeline_tag: feature-extraction
21
  widget:
22
  - text: "ATGCGTACGTTAGCCGATCG"
@@ -53,7 +54,7 @@ widget:
53
  - **Pathogenicity classification**: 95.32% average accuracy
54
  - **Host tropism prediction**: 96.25% accuracy
55
  - **Transmissibility assessment**: 97.36% average accuracy
56
- - 📊 **HVUE Benchmark**: 7 curated datasets totaling 60K+ viral sequences
57
  - 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
58
  - ⚑ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
59
  - 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB
@@ -220,12 +221,6 @@ This demonstrates that HViLM captures genuine biological mechanisms rather than
220
  - **Final dataset**: 5 million unique sequences
221
  - **Coverage**: 45+ viral families across all Baltimore classification groups
222
 
223
- ### Data Leakage Prevention
224
-
225
- Systematic overlap analysis performed between pre-training corpus and HVUE benchmark datasets:
226
- - **Method**: Accession ID matching + MMseqs2 similarity (>95% identity)
227
- - **Removed**: 186 overlapping sequences from pre-training corpus
228
- - **Result**: Clean separation between pre-training and evaluation data
229
 
230
  ---
231
 
@@ -247,14 +242,41 @@ The **Human Virome Understanding Evaluation (HVUE)** benchmark consists of 7 cur
247
  - **Orthomyxoviridae**: R₀-based classification
248
  - **Caliciviridae**: R₀-based classification
249
 
250
- All datasets available at: [GitHub - HVUE Benchmark](https://github.com/duttaprat/HViLM)
251
 
252
  ---
253
 
254
  ## Reproducing Paper Results
255
 
256
  To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:
257
 
 
 
258
  ```bash
259
  # Clone repository
260
  git clone https://github.com/duttaprat/HViLM.git
 
17
  - VIRION
18
  - BV-BRC
19
  - VHDB
20
+ - duttaprat/HVUE
21
  pipeline_tag: feature-extraction
22
  widget:
23
  - text: "ATGCGTACGTTAGCCGATCG"
 
54
  - **Pathogenicity classification**: 95.32% average accuracy
55
  - **Host tropism prediction**: 96.25% accuracy
56
  - **Transmissibility assessment**: 97.36% average accuracy
57
+ - 📊 **[HVUE Benchmark](https://huggingface.co/datasets/duttaprat/HVUE)**: 7 curated datasets totaling 60K+ viral sequences
58
  - 🔍 **Mechanistic interpretability**: Identifies transcription factor binding site mimicry (42 conserved motifs)
59
  - ⚑ **Parameter-efficient fine-tuning**: LoRA adaptation (~0.3M trainable parameters per task)
60
  - 🚀 **State-of-the-art performance**: Outperforms Nucleotide Transformer, GENA-LM, and DNABERT-MB
 
221
  - **Final dataset**: 5 million unique sequences
222
  - **Coverage**: 45+ viral families across all Baltimore classification groups
223
 
224
 
225
  ---
226
 
 
242
  - **Orthomyxoviridae**: R₀-based classification
243
  - **Caliciviridae**: R₀-based classification
244
 
245
+ All datasets available at: **[🤗 duttaprat/HVUE](https://huggingface.co/datasets/duttaprat/HVUE)**
246
+
247
+ ### Download and Use
248
+ ```python
249
+ from datasets import load_dataset
250
+
251
+ # Load specific task
252
+ host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
253
+ pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
254
+ transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
255
+
256
+ # Load specific split
257
+ train_data = load_dataset("duttaprat/HVUE", data_files="Host_Tropism/train.csv")
258
+ ```
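
The added snippet locates data in two ways: by task directory (`data_dir`) and by an explicit CSV path (`data_files`). As a minimal sketch, the path construction could be wrapped so that a misspelled task directory fails before any download starts. The helper names `hvue_data_file` and `load_hvue_split` are hypothetical, and only `Host_Tropism/train.csv` is confirmed by the README; other split file names are an assumption.

```python
# Task directory names exactly as used in the README's load_dataset calls.
# Note that the README spells the directory "Pathogenecity".
HVUE_TASK_DIRS = ("Host_Tropism", "Pathogenecity", "Transmissibility")


def hvue_data_file(task_dir: str, split: str = "train") -> str:
    """Build the relative CSV path for one task directory and split.

    Assumption: every task stores its splits as "<task_dir>/<split>.csv".
    The README only confirms "Host_Tropism/train.csv".
    """
    if task_dir not in HVUE_TASK_DIRS:
        raise ValueError(
            f"Unknown task directory {task_dir!r}; expected one of {HVUE_TASK_DIRS}"
        )
    return f"{task_dir}/{split}.csv"


def load_hvue_split(task_dir: str, split: str = "train"):
    """Load one CSV from the duttaprat/HVUE dataset repository (needs network)."""
    # Imported lazily so the path helper works without `datasets` installed.
    from datasets import load_dataset

    # A single data_files string is assigned to the "train" split by default,
    # so split="train" selects it regardless of the CSV's file name.
    return load_dataset(
        "duttaprat/HVUE",
        data_files=hvue_data_file(task_dir, split),
        split="train",
    )


if __name__ == "__main__":
    print(hvue_data_file("Host_Tropism"))
```

Passing `split="train"` makes `load_dataset` return a single `Dataset` rather than a `DatasetDict`, which saves one indexing step when only one file is needed.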
259
 
260
  ---
261
 
262
  ## Reproducing Paper Results
263
 
264
+ ### Step 1: Download HVUE Benchmark
265
+ ```python
266
+ from datasets import load_dataset
267
+
268
+ # Download all datasets
269
+ host_tropism = load_dataset("duttaprat/HVUE", data_dir="Host_Tropism")
270
+ pathogenicity = load_dataset("duttaprat/HVUE", data_dir="Pathogenecity")
271
+ transmissibility = load_dataset("duttaprat/HVUE", data_dir="Transmissibility")
272
+ ```
273
+
274
+ ### Step 2: Fine-tune and Evaluate
275
+
276
  To reproduce the results reported in the paper, clone the repository and follow the fine-tuning instructions:
277
 
278
+
279
+
280
  ```bash
281
  # Clone repository
282
  git clone https://github.com/duttaprat/HViLM.git