VParka
/

BioGuard-DNA-Classifier-Ensemble

@@ -4,28 +4,61 @@ tags:
 - genomics
 - classification
 - sklearn
 library_name: sklearn
 ---
-# BioGuard DNA Classifier Ensemble (Portable v1.4)
 This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
 > [!NOTE]
-> **Version 1.4 Update**: Fixed critical overfitting issue in ViralBoost model. Added class weighting for imbalanced training data and confidence thresholds for rare class predictions.
 ## 🔄 Changelog
 ### v1.4 (2026-01-08)
 - **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
 - **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
 - **ADDED**: Confidence thresholds in `inference.py`:
   - General classes: 55% minimum confidence
   - Rare classes (Influenza B, etc.): 65% minimum confidence
   - Below threshold → classified as "Unknown"
-- **IMPROVED**: Reduced overfitting with `max_depth=3`, `min_samples_leaf=10`
 ### v1.3
 - Refactored to remove custom class dependencies from `.joblib` files
 ## 🧬 Models Included
@@ -33,28 +66,30 @@ This repository contains a dual-model ensemble for DNA sequence analysis and vir
 This repository hosts two distinct models specialized for different aspects of genomic analysis:
 ### 1. **GenetiForest** (RandomForestClassifier)
-*   **File**: `dna_classifier.joblib`
-*   **Purpose**: General-purpose synthetic vs. biological sequence classification.
-*   **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
-*   **Performance (Test Set)**:
-    *   **Accuracy**: 89.4%
-    *   **F1 Score**: 89.9%
 ### 2. **ViralBoost** (GradientBoostingClassifier)
-*   **File**: `sequence_model.joblib`
-*   **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
-*   **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
-*   **Performance (Test Set)**:
-    *   **Accuracy**: 98.1%
-    *   **F1 Score**: 98.2%
-*   **Classes**: Other, Influenza A, Norovirus, Influenza B
 ## 🚀 Usage
 Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.
-1.  Download all files (`.joblib` and `inference.py`).
-2.  Use the `inference.py` script:
 ```python
 from inference import predict_dna
@@ -66,12 +101,13 @@ print(f"Genetic Type: {results['classification']}")
 print(f"Virus Identity: {results['virus_identity']}")
 print(f"Confidence: {results['virus_confidence']*100:.1f}%")
-# New in v1.4: Check if prediction was below threshold
 if results['virus_identity'] == 'Unknown':
     print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
 ```
 Alternatively, you can load components manually:
 ```python
 import joblib
 classifier = joblib.load("dna_classifier.joblib")
@@ -80,6 +116,8 @@ scaler = joblib.load("scaler_rf.joblib")
 ```
 ## 📊 Training Meta
-*   **Generated By**: DNA Governance Console (vparka)
-*   **Framework**: scikit-learn
-*   **Class Weighting**: Balanced (inverse frequency)

 - genomics
 - classification
 - sklearn
+- virus-detection
+- influenza
+- norovirus
 library_name: sklearn
+datasets:
+- ncbi-influenza
+- ncbi-norovirus
 ---
+# BioGuard DNA Classifier Ensemble (Portable v1.5)
 This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
 > [!NOTE]
+> **Version 1.5 Update**: Trained on 176,000+ sequences from NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.
+## 📊 Training Dataset
+| Metric | Value |
+|--------|-------|
+| **Total Sequences** | 176,407 |
+| **Unique Virus Types** | 40+ |
+| **Data Source** | NCBI Influenza & Norovirus Database |
+| **Last Updated** | 2026-01-08 |
+### Top Virus Classes
+| Virus Type | Samples |
+|------------|---------|
+| Influenza A H1N1 | 11,251 |
+| Influenza A H3N2 | 9,326 |
+| Influenza A H5N1 | 6,675 |
+| Norovirus GII | 5,488 |
+| Influenza A H9N2 | 1,929 |
+| Influenza A H1N2 | 1,349 |
 ## 🔄 Changelog
+### v1.5 (2026-01-08)
+- **TRAINED**: 176,407 sequences (5x increase from v1.4)
+- **EXPANDED**: 40+ virus subtypes now supported
+- **IMPROVED**: Better generalization with larger dataset
 ### v1.4 (2026-01-08)
 - **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
 - **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
 - **ADDED**: Confidence thresholds in `inference.py`:
   - General classes: 55% minimum confidence
   - Rare classes (Influenza B, etc.): 65% minimum confidence
   - Below threshold → classified as "Unknown"
 ### v1.3
 - Refactored to remove custom class dependencies from `.joblib` files
 ## 🧬 Models Included
 This repository hosts two distinct models specialized for different aspects of genomic analysis:
 ### 1. **GenetiForest** (RandomForestClassifier)
+- **File**: `dna_classifier.joblib`
+- **Purpose**: General-purpose synthetic vs. biological sequence classification.
+- **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
+- **Performance (Test Set)**:
+  - **Accuracy**: 89.1%
+  - **F1 Score**: 88.4%
 ### 2. **ViralBoost** (GradientBoostingClassifier)
+- **File**: `sequence_model.joblib`
+- **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
+- **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
+- **Performance (Test Set)**:
+  - **Accuracy**: 96.7%
+  - **F1 Score**: 97.0%
+- **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
 ## 🚀 Usage
 Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.
+1. Download all files (`.joblib` and `inference.py`).
+2. Use the `inference.py` script:
 ```python
 from inference import predict_dna
 print(f"Virus Identity: {results['virus_identity']}")
 print(f"Confidence: {results['virus_confidence']*100:.1f}%")
+# Check if prediction was below threshold
 if results['virus_identity'] == 'Unknown':
     print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
 ```
 Alternatively, you can load components manually:
 ```python
 import joblib
 classifier = joblib.load("dna_classifier.joblib")
 ```
 ## 📊 Training Meta
+- **Generated By**: DNA Governance Console (vparka)
+- **Framework**: scikit-learn
+- **Class Weighting**: Balanced (inverse frequency)
+- **Train/Test Split**: 80/20