VParka committed
Commit 474e3ec · verified · 1 Parent(s): a3637b6

Upload README.md via DNA Console (Portable Version)

Files changed (1)
  1. README.md +60 -22
README.md CHANGED
@@ -4,28 +4,61 @@ tags:
4
  - genomics
5
  - classification
6
  - sklearn
 
 
 
7
  library_name: sklearn
 
 
 
8
  ---
9
 
10
- # BioGuard DNA Classifier Ensemble (Portable v1.4)
11
 
12
  This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
13
 
14
  > [!NOTE]
15
- > **Version 1.4 Update**: Fixed critical overfitting issue in ViralBoost model. Added class weighting for imbalanced training data and confidence thresholds for rare class predictions.
16
 
17
  ## 🔄 Changelog
18
 
19
  ### v1.4 (2026-01-08)
 
20
  - **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
21
  - **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
22
  - **ADDED**: Confidence thresholds in `inference.py`:
23
  - General classes: 55% minimum confidence
24
  - Rare classes (Influenza B, etc.): 65% minimum confidence
25
  - Below threshold → classified as "Unknown"
26
- - **IMPROVED**: Reduced overfitting with `max_depth=3`, `min_samples_leaf=10`
27
 
28
  ### v1.3
 
29
  - Refactored to remove custom class dependencies from `.joblib` files
30
 
31
  ## 🧬 Models Included
@@ -33,28 +66,30 @@ This repository contains a dual-model ensemble for DNA sequence analysis and vir
33
  This repository hosts two distinct models specialized for different aspects of genomic analysis:
34
 
35
  ### 1. **GenetiForest** (RandomForestClassifier)
36
- * **File**: `dna_classifier.joblib`
37
- * **Purpose**: General-purpose synthetic vs. biological sequence classification.
38
- * **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
39
- * **Performance (Test Set)**:
40
- * **Accuracy**: 89.4%
41
- * **F1 Score**: 89.9%
 
42
 
43
  ### 2. **ViralBoost** (GradientBoostingClassifier)
44
- * **File**: `sequence_model.joblib`
45
- * **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
46
- * **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
47
- * **Performance (Test Set)**:
48
- * **Accuracy**: 98.1%
49
- * **F1 Score**: 98.2%
50
- * **Classes**: Other, Influenza A, Norovirus, Influenza B
 
51
 
52
  ## 🚀 Usage
53
 
54
  Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.
55
 
56
- 1. Download all files (`.joblib` and `inference.py`).
57
- 2. Use the `inference.py` script:
58
 
59
  ```python
60
  from inference import predict_dna
@@ -66,12 +101,13 @@ print(f"Genetic Type: {results['classification']}")
66
  print(f"Virus Identity: {results['virus_identity']}")
67
  print(f"Confidence: {results['virus_confidence']*100:.1f}%")
68
 
69
- # New in v1.4: Check if prediction was below threshold
70
  if results['virus_identity'] == 'Unknown':
71
      print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
72
  ```
73
 
74
  Alternatively, you can load components manually:
 
75
  ```python
76
  import joblib
77
  classifier = joblib.load("dna_classifier.joblib")
@@ -80,6 +116,8 @@ scaler = joblib.load("scaler_rf.joblib")
80
  ```
81
 
82
  ## 📊 Training Meta
83
- * **Generated By**: DNA Governance Console (vparka)
84
- * **Framework**: scikit-learn
85
- * **Class Weighting**: Balanced (inverse frequency)
 
 
 
4
  - genomics
5
  - classification
6
  - sklearn
7
+ - virus-detection
8
+ - influenza
9
+ - norovirus
10
  library_name: sklearn
11
+ datasets:
12
+ - ncbi-influenza
13
+ - ncbi-norovirus
14
  ---
15
 
16
+ # BioGuard DNA Classifier Ensemble (Portable v1.5)
17
 
18
  This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
19
 
20
  > [!NOTE]
21
+ > **Version 1.5 Update**: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.
22
+
23
+ ## 📊 Training Dataset
24
+
25
+ | Metric | Value |
26
+ |--------|-------|
27
+ | **Total Sequences** | 176,407 |
28
+ | **Unique Virus Types** | 40+ |
29
+ | **Data Source** | NCBI Influenza & Norovirus Database |
30
+ | **Last Updated** | 2026-01-08 |
31
+
32
+ ### Top Virus Classes
33
+
34
+ | Virus Type | Samples |
35
+ |------------|---------|
36
+ | Influenza A H1N1 | 11,251 |
37
+ | Influenza A H3N2 | 9,326 |
38
+ | Influenza A H5N1 | 6,675 |
39
+ | Norovirus GII | 5,488 |
40
+ | Influenza A H9N2 | 1,929 |
41
+ | Influenza A H1N2 | 1,349 |
42
 
43
  ## 🔄 Changelog
44
 
45
+ ### v1.5 (2026-01-08)
46
+
47
+ - **TRAINED**: 176,407 sequences (5x increase from v1.4)
48
+ - **EXPANDED**: 40+ virus subtypes now supported
49
+ - **IMPROVED**: Better generalization with larger dataset
50
+
51
  ### v1.4 (2026-01-08)
52
+
53
  - **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
54
  - **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
55
  - **ADDED**: Confidence thresholds in `inference.py`:
56
  - General classes: 55% minimum confidence
57
  - Rare classes (Influenza B, etc.): 65% minimum confidence
58
  - Below threshold → classified as "Unknown"
 
59
 
60
  ### v1.3
61
+
62
  - Refactored to remove custom class dependencies from `.joblib` files
63
 
64
  ## 🧬 Models Included
 
66
  This repository hosts two distinct models specialized for different aspects of genomic analysis:
67
 
68
  ### 1. **GenetiForest** (RandomForestClassifier)
69
+
70
+ - **File**: `dna_classifier.joblib`
71
+ - **Purpose**: General-purpose synthetic vs. biological sequence classification.
72
+ - **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
73
+ - **Performance (Test Set)**:
74
+ - **Accuracy**: 89.1%
75
+ - **F1 Score**: 88.4%
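
The feature extraction itself lives in `inference.py` and is not reproduced in this README. As a rough illustration of the two features named above, the sketch below computes GC content and normalized k-mer frequencies in plain Python. The function names `gc_content` and `kmer_frequencies`, the default `k=3`, and the handling of ambiguous bases are assumptions made for this example, not the project's actual API.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer in seq.

    Assumption for this sketch: k-mers containing anything other than
    A, C, G, or T (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    kmers = [kmer for kmer in kmers if set(kmer) <= set("ACGT")]
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

A fixed-length feature vector could then be built by looking up every possible k-mer in a fixed order and using 0.0 for those that do not occur.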
76
 
77
  ### 2. **ViralBoost** (GradientBoostingClassifier)
78
+
79
+ - **File**: `sequence_model.joblib`
80
+ - **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
81
+ - **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
82
+ - **Performance (Test Set)**:
83
+ - **Accuracy**: 96.7%
84
+ - **F1 Score**: 97.0%
85
+ - **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
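
The README does not say how the multi-class F1 score is averaged. The sketch below shows one common choice, a support-weighted average of per-class F1 (the same definition scikit-learn uses for `f1_score(..., average="weighted")`), next to plain accuracy. The function names are hypothetical and the averaging mode is an assumption, so the numbers reported above may have been computed differently.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)
```

With imbalanced classes, a weighted F1 can sit above or below accuracy, which is why the two figures are reported separately.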
86
 
87
  ## 🚀 Usage
88
 
89
  Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.
90
 
91
+ 1. Download all files (`.joblib` and `inference.py`).
92
+ 2. Use the `inference.py` script:
93
 
94
  ```python
95
  from inference import predict_dna
 
101
  print(f"Virus Identity: {results['virus_identity']}")
102
  print(f"Confidence: {results['virus_confidence']*100:.1f}%")
103
 
104
+ # Check if prediction was below threshold
105
  if results['virus_identity'] == 'Unknown':
106
      print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
107
  ```
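
The thresholding logic inside `inference.py` is not shown in this README. A minimal sketch of how the 55% / 65% rule from the v1.4 changelog could be applied is given below. The helper `apply_threshold`, the constant names, and the contents of `RARE_CLASSES` are assumptions for illustration; only the result keys mirror the usage example above.

```python
# Threshold values are taken from the v1.4 changelog; everything else
# in this block is a hypothetical illustration, not the project's code.
GENERAL_THRESHOLD = 0.55
RARE_THRESHOLD = 0.65
RARE_CLASSES = {"Influenza B"}  # assumption: the changelog says "Influenza B, etc."


def apply_threshold(probabilities: dict[str, float]) -> dict:
    """Pick the most probable class, or "Unknown" if it is below its threshold."""
    raw_prediction = max(probabilities, key=probabilities.get)
    raw_confidence = probabilities[raw_prediction]
    threshold = RARE_THRESHOLD if raw_prediction in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = raw_confidence >= threshold
    return {
        "virus_identity": raw_prediction if accepted else "Unknown",
        "virus_confidence": raw_confidence,  # assumption: unchanged when rejected
        "raw_prediction": raw_prediction,
        "raw_confidence": raw_confidence,
    }
```

For example, a 60% "Influenza A" prediction would pass the general threshold, while a 60% "Influenza B" prediction would be reported as "Unknown" with the raw prediction preserved.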
108
 
109
  Alternatively, you can load components manually:
110
+
111
  ```python
112
  import joblib
113
  classifier = joblib.load("dna_classifier.joblib")
 
116
  ```
117
 
118
  ## 📊 Training Meta
119
+
120
+ - **Generated By**: DNA Governance Console (vparka)
121
+ - **Framework**: scikit-learn
122
+ - **Class Weighting**: Balanced (inverse frequency)
123
+ - **Train/Test Split**: 80/20
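
For reference, the "balanced" heuristic in scikit-learn's `compute_class_weight` assigns each class the weight `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python so the effect on rare classes is easy to see. The helper names are hypothetical, and how the training script passes the weights to the models is not shown in this README; since `GradientBoostingClassifier` has no `class_weight` parameter, per-sample weights passed to `fit(..., sample_weight=...)` would be one option.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count(class))."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


def sample_weights(labels):
    """Expand the class weights to one weight per training sample."""
    weights = balanced_class_weights(labels)
    return [weights[label] for label in labels]
```

With three samples of one class and one of another, the rare class receives a weight of 2.0 and the common class about 0.67, so each class contributes equally to the total weight.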