VParka/BioGuard-DNA-Classifier-Ensemble

@@ -7,12 +7,26 @@ tags:
 library_name: sklearn
 ---
-# BioGuard DNA Classifier Ensemble (Portable v1.3)
 This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
 > [!NOTE]
-> **Version 1.3 Update**: This version has been fully refactored to remove all custom class dependencies from `.joblib` files. Feature extraction is now strictly handled via source code.
 ## 🧬 Models Included
@@ -23,17 +37,17 @@ This repository hosts two distinct models specialized for different aspects of g
 *   **Purpose**: General-purpose synthetic vs. biological sequence classification.
 *   **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
 *   **Performance (Test Set)**:
-    *   **Accuracy**: 88.9%
-    *   **F1 Score**: 89.2%
 ### 2. **ViralBoost** (GradientBoostingClassifier)
 *   **File**: `sequence_model.joblib`
 *   **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
-*   **Architecture**: Gradient Boosting (sklearn) trained on real-world viral sequences.
 *   **Performance (Test Set)**:
-    *   **Accuracy**: 99.3%
-    *   **F1 Score**: 99.3%
-*   **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
 ## 🚀 Usage
@@ -50,6 +64,11 @@ results = predict_dna(sequence)
 print(f"Genetic Type: {results['classification']}")
 print(f"Virus Identity: {results['virus_identity']}")
 ```
 Alternatively, you can load components manually:
@@ -63,3 +82,4 @@ scaler = joblib.load("scaler_rf.joblib")
 ## 📊 Training Meta
 *   **Generated By**: DNA Governance Console (vparka)
 *   **Framework**: scikit-learn

 library_name: sklearn
 ---
+# BioGuard DNA Classifier Ensemble (Portable v1.4)
 This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
 > [!NOTE]
+> **Version 1.4 Update**: Fixed critical overfitting issue in ViralBoost model. Added class weighting for imbalanced training data and confidence thresholds for rare class predictions.
+## 🔄 Changelog
+### v1.4 (2026-01-08)
+- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
+- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
+- **ADDED**: Confidence thresholds in `inference.py`:
+  - General classes: 55% minimum confidence
+  - Rare classes (Influenza B, etc.): 65% minimum confidence
+  - Below threshold → classified as "Unknown"
+- **IMPROVED**: Reduced overfitting with `max_depth=3`, `min_samples_leaf=10`
+### v1.3
+- Refactored to remove custom class dependencies from `.joblib` files
 ## 🧬 Models Included
 *   **Purpose**: General-purpose synthetic vs. biological sequence classification.
 *   **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
 *   **Performance (Test Set)**:
+    *   **Accuracy**: 89.4%
+    *   **F1 Score**: 89.9%
 ### 2. **ViralBoost** (GradientBoostingClassifier)
 *   **File**: `sequence_model.joblib`
 *   **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
+*   **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
 *   **Performance (Test Set)**:
+    *   **Accuracy**: 98.1%
+    *   **F1 Score**: 98.2%
+*   **Classes**: Other, Influenza A, Norovirus, Influenza B
 ## 🚀 Usage
 print(f"Genetic Type: {results['classification']}")
 print(f"Virus Identity: {results['virus_identity']}")
+print(f"Confidence: {results['virus_confidence']*100:.1f}%")
+# New in v1.4: Check if prediction was below threshold
+if results['virus_identity'] == 'Unknown':
+    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
 ```
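The `Unknown` fallback in the snippet above follows from the v1.4 confidence thresholds (55% for general classes, 65% for rare ones). The repository's `inference.py` is not shown in this diff, so the following is only a minimal sketch of how such a rule could work. The function name `apply_confidence_threshold`, the `RARE_CLASSES` set, and the choice to report the raw confidence as `virus_confidence` are assumptions for illustration; only the threshold values and the result keys come from the README.

```python
# Threshold values are quoted from the v1.4 changelog.
GENERAL_THRESHOLD = 0.55
RARE_THRESHOLD = 0.65

# Assumption: the changelog names only Influenza B explicitly ("Influenza B, etc.").
RARE_CLASSES = {"Influenza B"}


def apply_confidence_threshold(probabilities):
    """Turn class probabilities into a thresholded prediction (hypothetical helper).

    `probabilities` maps class label -> probability, for example built from
    zip(model.classes_, model.predict_proba(features)[0]).
    """
    # The raw prediction is the class with the highest probability.
    raw_prediction = max(probabilities, key=probabilities.get)
    raw_confidence = probabilities[raw_prediction]

    # Rare classes must clear a stricter bar than general classes.
    threshold = RARE_THRESHOLD if raw_prediction in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = raw_confidence >= threshold

    return {
        "virus_identity": raw_prediction if accepted else "Unknown",
        # Assumption: the reported confidence is the raw top-class probability.
        "virus_confidence": raw_confidence,
        "raw_prediction": raw_prediction,
        "raw_confidence": raw_confidence,
    }
```

With this rule, a 60% top score is accepted for Influenza A but falls back to `Unknown` for Influenza B, while the raw label and score stay available for inspection.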
 Alternatively, you can load components manually:
 ## 📊 Training Meta
 *   **Generated By**: DNA Governance Console (vparka)
 *   **Framework**: scikit-learn
+*   **Class Weighting**: Balanced (inverse frequency)
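
For reference, "Balanced (inverse frequency)" weighting assigns each class the weight `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The training script is not part of this diff, so the sketch below only illustrates the arithmetic in plain Python. The helper names are hypothetical and the label counts in the usage note are made up, not taken from the real training data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def balanced_sample_weights(labels):
    """Expand per-class weights into one weight per training sample."""
    weights = balanced_class_weights(labels)
    return [weights[label] for label in labels]
```

As a usage example with illustrative counts, 6 "Other", 3 "Influenza A", and 1 "Influenza B" sample give weights of roughly 0.56, 1.11, and 3.33, so the rarest class counts the most per sample. Since `GradientBoostingClassifier` has no `class_weight` constructor argument, per-sample weights like these would typically be passed as `sample_weight` to `fit`; whether ViralBoost was trained exactly this way is an assumption.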