Upload README.md via DNA Console (Portable Version)
README.md
CHANGED
@@ -7,12 +7,26 @@ tags:
 library_name: sklearn
 ---

-# BioGuard DNA Classifier Ensemble (Portable v1.

 This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.

 > [!NOTE]
-> **Version 1.

 ## 🧬 Models Included

@@ -23,17 +37,17 @@ This repository hosts two distinct models specialized for different aspects of g
 * **Purpose**: General-purpose synthetic vs. biological sequence classification.
 * **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
 * **Performance (Test Set)**:
-  * **Accuracy**:
-  * **F1 Score**: 89.

 ### 2. **ViralBoost** (GradientBoostingClassifier)
 * **File**: `sequence_model.joblib`
 * **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
-* **Architecture**: Gradient Boosting (sklearn)
 * **Performance (Test Set)**:
-  * **Accuracy**:
-  * **F1 Score**:
-* **Classes**: Other, Influenza A,

 ## Usage

@@ -50,6 +64,11 @@ results = predict_dna(sequence)

 print(f"Genetic Type: {results['classification']}")
 print(f"Virus Identity: {results['virus_identity']}")
 ```

 Alternatively, you can load components manually:
@@ -63,3 +82,4 @@ scaler = joblib.load("scaler_rf.joblib")
 ## Training Meta
 * **Generated By**: DNA Governance Console (vparka)
 * **Framework**: scikit-learn

@@ -7,12 +7,26 @@ tags:
 library_name: sklearn
 ---

+# BioGuard DNA Classifier Ensemble (Portable v1.4)

 This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.

 > [!NOTE]
+> **Version 1.4 Update**: Fixed critical overfitting issue in ViralBoost model. Added class weighting for imbalanced training data and confidence thresholds for rare class predictions.
+
+## Changelog
+
+### v1.4 (2026-01-08)
+- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
+- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
+- **ADDED**: Confidence thresholds in `inference.py`:
+  - General classes: 55% minimum confidence
+  - Rare classes (Influenza B, etc.): 65% minimum confidence
+  - Below threshold → classified as "Unknown"
+- **IMPROVED**: Reduced overfitting with `max_depth=3`, `min_samples_leaf=10`
+
+### v1.3
+- Refactored to remove custom class dependencies from `.joblib` files
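
The confidence thresholds listed under v1.4 can be illustrated with a minimal sketch. This is not the code in `inference.py`: the function name `apply_confidence_threshold`, the `RARE_CLASSES` set (only Influenza B is named in the changelog), and the choice to report the raw confidence under `virus_confidence` are assumptions made for illustration. The result keys mirror the ones used in the usage snippet.

```python
# Thresholds taken from the v1.4 changelog entry.
GENERAL_THRESHOLD = 0.55
RARE_THRESHOLD = 0.65
# Assumption: only Influenza B is treated as rare here; the changelog says "etc.".
RARE_CLASSES = {"Influenza B"}


def apply_confidence_threshold(classes, probabilities):
    """Pick the most probable class and fall back to "Unknown" if it is too uncertain.

    `classes` and `probabilities` are parallel sequences, e.g. a classifier's
    class labels and one row of its predicted probabilities.
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    raw_prediction = classes[best]
    raw_confidence = float(probabilities[best])

    # Rare classes must clear a higher bar than general classes.
    threshold = RARE_THRESHOLD if raw_prediction in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = raw_confidence >= threshold

    return {
        "virus_identity": raw_prediction if accepted else "Unknown",
        "virus_confidence": raw_confidence,  # assumption: raw value is reported either way
        "raw_prediction": raw_prediction,
        "raw_confidence": raw_confidence,
    }
```

With these settings, a 60% Influenza B prediction would be reported as "Unknown" while a 60% Influenza A prediction would be accepted, since only the rare class is held to the 65% bar.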

 ## 🧬 Models Included

@@ -23,17 +37,17 @@ This repository hosts two distinct models specialized for different aspects of g
 * **Purpose**: General-purpose synthetic vs. biological sequence classification.
 * **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
 * **Performance (Test Set)**:
+  * **Accuracy**: 89.4%
+  * **F1 Score**: 89.9%
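
The "biological feature extraction (k-mers, GC content, etc.)" mentioned above is not spelled out in this diff. As a rough sketch of what such features can look like, the following computes GC content and normalized k-mer frequencies in plain Python. The function names, the choice of `k=3`, and the feature ordering are all assumptions, not the repository's actual extractor.

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence, k=3):
    """Relative frequency of every A/C/G/T k-mer; windows with other symbols are ignored."""
    seq = sequence.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


def extract_features(sequence, k=3):
    """Fixed-length feature vector: GC content followed by k-mer frequencies in sorted order."""
    freqs = kmer_frequencies(sequence, k)
    return [gc_content(sequence)] + [freqs[kmer] for kmer in sorted(freqs)]
```

A fixed vocabulary keeps the vector length constant (1 + 4^k entries) regardless of the input sequence, which is what a scikit-learn estimator expects.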

 ### 2. **ViralBoost** (GradientBoostingClassifier)
 * **File**: `sequence_model.joblib`
 * **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
+* **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
 * **Performance (Test Set)**:
+  * **Accuracy**: 98.1%
+  * **F1 Score**: 98.2%
+* **Classes**: Other, Influenza A, Norovirus, Influenza B

 ## Usage

@@ -50,6 +64,11 @@ results = predict_dna(sequence)

 print(f"Genetic Type: {results['classification']}")
 print(f"Virus Identity: {results['virus_identity']}")
+print(f"Confidence: {results['virus_confidence']*100:.1f}%")
+
+# New in v1.4: Check if prediction was below threshold
+if results['virus_identity'] == 'Unknown':
+    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
 ```

 Alternatively, you can load components manually:
@@ -63,3 +82,4 @@ scaler = joblib.load("scaler_rf.joblib")
 ## Training Meta
 * **Generated By**: DNA Governance Console (vparka)
 * **Framework**: scikit-learn
+* **Class Weighting**: Balanced (inverse frequency)
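
To make "Balanced (inverse frequency)" concrete, here is a small pure-Python sketch of the weighting formula that scikit-learn documents for `compute_class_weight('balanced')`: `n_samples / (n_classes * count_of_class)`. The helper names and the toy label list are made up for illustration, and the actual training script is not shown in this commit. Because `GradientBoostingClassifier` has no `class_weight` parameter, one plausible way to use such weights is to expand them per sample and pass them as `sample_weight` to `fit`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def per_sample_weights(labels):
    """Expand class weights to one weight per training example."""
    weights = balanced_class_weights(labels)
    return [weights[label] for label in labels]


# Toy, made-up label distribution with a rare class.
labels = ["Other"] * 6 + ["Influenza A"] * 3 + ["Influenza B"] * 1
class_weights = balanced_class_weights(labels)
sample_weights = per_sample_weights(labels)

# Hypothetical use with scikit-learn (not executed here):
#   model = GradientBoostingClassifier(max_depth=3, min_samples_leaf=10)
#   model.fit(X, labels, sample_weight=sample_weights)
```

Under this scheme the rarest class receives the largest weight, and the weights sum to the number of samples, so the overall scale of the loss is unchanged.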