VParka committed on
Commit
f158e96
·
verified ·
1 Parent(s): c6c590f

Upload README.md via DNA Console (Portable Version)

Files changed (1)
  1. README.md +28 -8
README.md CHANGED
@@ -7,12 +7,26 @@ tags:
7
  library_name: sklearn
8
  ---
9
 
10
- # BioGuard DNA Classifier Ensemble (Portable v1.3)
11
 
12
  This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
13
 
14
  > [!NOTE]
15
- > **Version 1.3 Update**: This version has been fully refactored to remove all custom class dependencies from `.joblib` files. Feature extraction is now strictly handled via source code.
16
 
17
  ## 🧬 Models Included
18
 
@@ -23,17 +37,17 @@ This repository hosts two distinct models specialized for different aspects of g
23
  * **Purpose**: General-purpose synthetic vs. biological sequence classification.
24
  * **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
25
  * **Performance (Test Set)**:
26
- * **Accuracy**: 88.9%
27
- * **F1 Score**: 89.2%
28
 
29
  ### 2. **ViralBoost** (GradientBoostingClassifier)
30
  * **File**: `sequence_model.joblib`
31
  * **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
32
- * **Architecture**: Gradient Boosting (sklearn) trained on real-world viral sequences.
33
  * **Performance (Test Set)**:
34
- * **Accuracy**: 99.3%
35
- * **F1 Score**: 99.3%
36
- * **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
37
 
38
  ## 🚀 Usage
39
 
@@ -50,6 +64,11 @@ results = predict_dna(sequence)
50
 
51
  print(f"Genetic Type: {results['classification']}")
52
  print(f"Virus Identity: {results['virus_identity']}")
53
  ```
54
 
55
  Alternatively, you can load components manually:
@@ -63,3 +82,4 @@ scaler = joblib.load("scaler_rf.joblib")
63
  ## 📊 Training Meta
64
  * **Generated By**: DNA Governance Console (vparka)
65
  * **Framework**: scikit-learn
 
 
7
  library_name: sklearn
8
  ---
9
 
10
+ # BioGuard DNA Classifier Ensemble (Portable v1.4)
11
 
12
  This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
13
 
14
  > [!NOTE]
15
+ > **Version 1.4 Update**: Fixed critical overfitting issue in ViralBoost model. Added class weighting for imbalanced training data and confidence thresholds for rare class predictions.
16
+
17
+ ## 🔄 Changelog
18
+
19
+ ### v1.4 (2026-01-08)
20
+ - **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
21
+ - **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
22
+ - **ADDED**: Confidence thresholds in `inference.py`:
23
+   - General classes: 55% minimum confidence
24
+   - Rare classes (Influenza B, etc.): 65% minimum confidence
25
+   - Below threshold → classified as "Unknown"
26
+ - **IMPROVED**: Reduced overfitting with `max_depth=3`, `min_samples_leaf=10`
27
+
28
+ ### v1.3
29
+ - Refactored to remove custom class dependencies from `.joblib` files
30
 
31
  ## 🧬 Models Included
32
 
 
37
  * **Purpose**: General-purpose synthetic vs. biological sequence classification.
38
  * **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
39
  * **Performance (Test Set)**:
40
+ * **Accuracy**: 89.4%
41
+ * **F1 Score**: 89.9%
42
 
43
  ### 2. **ViralBoost** (GradientBoostingClassifier)
44
  * **File**: `sequence_model.joblib`
45
  * **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
46
+ * **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
47
  * **Performance (Test Set)**:
48
+ * **Accuracy**: 98.1%
49
+ * **F1 Score**: 98.2%
50
+ * **Classes**: Other, Influenza A, Norovirus, Influenza B
51
 
52
  ## 🚀 Usage
53
 
 
64
 
65
  print(f"Genetic Type: {results['classification']}")
66
  print(f"Virus Identity: {results['virus_identity']}")
67
+ print(f"Confidence: {results['virus_confidence']*100:.1f}%")
68
+
69
+ # New in v1.4: Check if prediction was below threshold
70
+ if results['virus_identity'] == 'Unknown':
71
+ print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
72
  ```
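
The v1.4 threshold rule could be applied along the following lines. This is only a sketch, not the contents of `inference.py`: `apply_threshold`, `RARE_CLASSES`, and the probability dictionary passed in are hypothetical names, and both the inclusive comparison and the value reported as `virus_confidence` for an "Unknown" result are assumptions. The 55% and 65% cut-offs and the result keys are taken from this README.

```python
# Hypothetical sketch of the v1.4 threshold rule; not the actual inference.py.
GENERAL_THRESHOLD = 0.55  # minimum confidence for general classes (from the changelog)
RARE_THRESHOLD = 0.65     # minimum confidence for rare classes (from the changelog)
RARE_CLASSES = {"Influenza B"}  # assumption: only the rare class named in the changelog


def apply_threshold(probabilities):
    """Turn a {class_name: probability} mapping into the virus-related result keys."""
    # The most probable class is the raw prediction.
    raw_prediction = max(probabilities, key=probabilities.get)
    raw_confidence = probabilities[raw_prediction]

    # Rare classes must clear a stricter threshold than general classes.
    threshold = RARE_THRESHOLD if raw_prediction in RARE_CLASSES else GENERAL_THRESHOLD
    # Assumption: a prediction exactly at the threshold is accepted.
    accepted = raw_confidence >= threshold

    return {
        "virus_identity": raw_prediction if accepted else "Unknown",
        "virus_confidence": raw_confidence,  # assumption: unchanged when "Unknown"
        "raw_prediction": raw_prediction,
        "raw_confidence": raw_confidence,
    }


# Made-up probabilities: Influenza B leads, but 0.60 is below the 0.65 rare-class cut-off.
example = apply_threshold(
    {"Other": 0.10, "Influenza A": 0.20, "Norovirus": 0.10, "Influenza B": 0.60}
)
print(example["virus_identity"], example["raw_prediction"])
```

Keeping the raw prediction alongside the thresholded one matches the usage example above, where `raw_prediction` and `raw_confidence` are printed whenever the identity is "Unknown".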
73
 
74
  Alternatively, you can load components manually:
 
82
  ## 📊 Training Meta
83
  * **Generated By**: DNA Governance Console (vparka)
84
  * **Framework**: scikit-learn
85
+ * **Class Weighting**: Balanced (inverse frequency)
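
To illustrate what "balanced (inverse frequency)" means, the sketch below reproduces in plain Python the formula that scikit-learn's `compute_class_weight('balanced')` uses, `n_samples / (n_classes * class_count)`. The helper names and the label counts are hypothetical, and how the training script actually feeds these weights to the model is not shown in this README. Since `GradientBoostingClassifier` has no `class_weight` constructor argument, one common route is to expand the class weights to per-sample weights and pass them as `sample_weight` to `fit`; that is an assumption about the approach here, not a confirmed detail.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def per_sample_weights(labels):
    """Expand class weights to one weight per training sample."""
    weights = balanced_class_weights(labels)
    return [weights[label] for label in labels]


# Made-up, deliberately imbalanced label counts for illustration only.
labels = ["Other"] * 6 + ["Influenza A"] * 3 + ["Influenza B"] * 1
class_weights = balanced_class_weights(labels)
sample_weights = per_sample_weights(labels)

# The rarest class receives the largest weight.
print(max(class_weights, key=class_weights.get))
```

With these weights every class contributes the same total weight to the loss, so a rare class such as Influenza B is no longer drowned out by the majority class.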