Upload README.md via DNA Console (Portable Version)
README.md
CHANGED
|
@@ -4,28 +4,61 @@ tags:
|
|
| 4 |
- genomics
|
| 5 |
- classification
|
| 6 |
- sklearn
| 7 |
library_name: sklearn
| 8 |
---
|
| 9 |
|
| 10 |
-
# BioGuard DNA Classifier Ensemble (Portable v1.
|
| 11 |
|
| 12 |
This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
|
| 13 |
|
| 14 |
> [!NOTE]
|
| 15 |
-
> **Version 1.
| 16 |
|
| 17 |
## Changelog
|
| 18 |
| 19 |
### v1.4 (2026-01-08)
|
|
|
|
| 20 |
- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
|
| 21 |
- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
|
| 22 |
- **ADDED**: Confidence thresholds in `inference.py`:
|
| 23 |
- General classes: 55% minimum confidence
|
| 24 |
- Rare classes (Influenza B, etc.): 65% minimum confidence
|
| 25 |
- Below threshold → classified as "Unknown"
|
| 26 |
-
- **IMPROVED**: Reduced overfitting with `max_depth=3`, `min_samples_leaf=10`
|
| 27 |
|
| 28 |
### v1.3
|
|
|
|
| 29 |
- Refactored to remove custom class dependencies from `.joblib` files
|
| 30 |
|
| 31 |
## 🧬 Models Included
|
|
@@ -33,28 +66,30 @@ This repository contains a dual-model ensemble for DNA sequence analysis and vir
|
|
| 33 |
This repository hosts two distinct models specialized for different aspects of genomic analysis:
|
| 34 |
|
| 35 |
### 1. **GenetiForest** (RandomForestClassifier)
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
|
|
|
| 42 |
|
| 43 |
### 2. **ViralBoost** (GradientBoostingClassifier)
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
|
|
|
| 51 |
|
| 52 |
## Usage
|
| 53 |
|
| 54 |
Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.
|
| 55 |
|
| 56 |
-
1.
|
| 57 |
-
2.
|
| 58 |
|
| 59 |
```python
|
| 60 |
from inference import predict_dna
|
|
@@ -66,12 +101,13 @@ print(f"Genetic Type: {results['classification']}")
|
|
| 66 |
print(f"Virus Identity: {results['virus_identity']}")
|
| 67 |
print(f"Confidence: {results['virus_confidence']*100:.1f}%")
|
| 68 |
|
| 69 |
-
#
|
| 70 |
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
|
| 72 |
```
|
| 73 |
|
| 74 |
Alternatively, you can load components manually:
|
|
|
|
| 75 |
```python
|
| 76 |
import joblib
|
| 77 |
classifier = joblib.load("dna_classifier.joblib")
|
|
@@ -80,6 +116,8 @@ scaler = joblib.load("scaler_rf.joblib")
|
|
| 80 |
```
|
| 81 |
|
| 82 |
## Training Meta
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
|
|
|
|
|
|
|
|
| 4 |
- genomics
|
| 5 |
- classification
|
| 6 |
- sklearn
|
| 7 |
+
- virus-detection
|
| 8 |
+
- influenza
|
| 9 |
+
- norovirus
|
| 10 |
library_name: sklearn
|
| 11 |
+
datasets:
|
| 12 |
+
- ncbi-influenza
|
| 13 |
+
- ncbi-norovirus
|
| 14 |
---
|
| 15 |
|
| 16 |
+
# BioGuard DNA Classifier Ensemble (Portable v1.5)
|
| 17 |
|
| 18 |
This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
|
| 19 |
|
| 20 |
> [!NOTE]
|
| 21 |
+
> **Version 1.5 Update**: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.
|
| 22 |
+
|
| 23 |
+
## Training Dataset
|
| 24 |
+
|
| 25 |
+
| Metric | Value |
|--------|-------|
| **Total Sequences** | 176,407 |
| **Unique Virus Types** | 40+ |
| **Data Source** | NCBI Influenza & Norovirus Database |
| **Last Updated** | 2026-01-08 |
|
| 31 |
+
|
| 32 |
+
### Top Virus Classes
|
| 33 |
+
|
| 34 |
+
| Virus Type | Samples |
|------------|---------|
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |
|
| 42 |
|
| 43 |
## Changelog
|
| 44 |
|
| 45 |
+
### v1.5 (2026-01-08)
|
| 46 |
+
|
| 47 |
+
- **TRAINED**: 176,407 sequences (5x increase from v1.4)
|
| 48 |
+
- **EXPANDED**: 40+ virus subtypes now supported
|
| 49 |
+
- **IMPROVED**: Better generalization with larger dataset
|
| 50 |
+
|
| 51 |
### v1.4 (2026-01-08)
|
| 52 |
+
|
| 53 |
- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
- **ADDED**: Confidence thresholds in `inference.py`:
  - General classes: 55% minimum confidence
  - Rare classes (Influenza B, etc.): 65% minimum confidence
  - Below threshold → classified as "Unknown"
|
|
|
|
| 59 |
|
| 60 |
### v1.3
|
| 61 |
+
|
| 62 |
- Refactored to remove custom class dependencies from `.joblib` files
|
| 63 |
|
| 64 |
## 🧬 Models Included
|
|
|
|
| 66 |
This repository hosts two distinct models specialized for different aspects of genomic analysis:
|
| 67 |
|
| 68 |
### 1. **GenetiForest** (RandomForestClassifier)
|
| 69 |
+
|
| 70 |
+
- **File**: `dna_classifier.joblib`
- **Purpose**: General-purpose synthetic vs. biological sequence classification.
- **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- **Performance (Test Set)**:
  - **Accuracy**: 89.1%
  - **F1 Score**: 88.4%
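
The feature extraction named above (k-mers, GC content) can be sketched as follows. This is a minimal illustration, not the code shipped in `inference.py`: the function names, the choice of `k`, and the feature ordering are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the alphabet ACGT.

    Windows containing other symbols (for example N) are skipped.
    """
    seq = seq.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


def extract_features(seq: str, k: int = 3) -> list[float]:
    """Fixed-length vector: GC content followed by k-mer frequencies (sorted by k-mer)."""
    freqs = kmer_frequencies(seq, k)
    return [gc_content(seq)] + [freqs[kmer] for kmer in sorted(freqs)]


# Hypothetical usage: 1 GC value plus 4**2 = 16 dinucleotide frequencies.
features = extract_features("ATGCGCGTATGC", k=2)
print(len(features))
```

A fixed vocabulary keeps the vector length constant across sequences, which is what a scikit-learn estimator expects as input.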
|
| 76 |
|
| 77 |
### 2. **ViralBoost** (GradientBoostingClassifier)
|
| 78 |
+
|
| 79 |
+
- **File**: `sequence_model.joblib`
- **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
- **Performance (Test Set)**:
  - **Accuracy**: 96.7%
  - **F1 Score**: 97.0%
- **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
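
For readers who want to reproduce metrics of this kind on their own test set, the sketch below computes accuracy and a per-class F1 score in plain Python. The README does not say which F1 averaging was used, so the macro average here is an assumption, and the labels are toy data rather than real predictions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for each label seen in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only.
y_true = ["Influenza A", "Influenza A", "Norovirus", "Other", "Other", "Influenza B"]
y_pred = ["Influenza A", "Influenza A", "Norovirus", "Other", "Influenza A", "Other"]

print(f"Accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
```

With imbalanced classes, a macro average exposes weak performance on rare classes that overall accuracy can hide.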
|
| 86 |
|
| 87 |
## Usage
|
| 88 |
|
| 89 |
Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.
|
| 90 |
|
| 91 |
+
1. Download all files (`.joblib` and `inference.py`).
|
| 92 |
+
2. Use the `inference.py` script:
|
| 93 |
|
| 94 |
```python
|
| 95 |
from inference import predict_dna
|
|
|
|
| 101 |
print(f"Virus Identity: {results['virus_identity']}")
|
| 102 |
print(f"Confidence: {results['virus_confidence']*100:.1f}%")
|
| 103 |
|
| 104 |
+
# Check if prediction was below threshold
|
| 105 |
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
|
| 107 |
```
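
The `Unknown` fallback checked in the snippet above follows from the confidence thresholds listed in the v1.4 changelog (55% for general classes, 65% for rare ones). A minimal sketch of that rule is shown here. The result keys come from the usage example, but the function name, the contents of `RARE_CLASSES`, and the choice to keep the raw confidence in `virus_confidence` are assumptions, not the actual `inference.py` logic.

```python
GENERAL_THRESHOLD = 0.55  # minimum confidence for general classes
RARE_THRESHOLD = 0.65     # minimum confidence for rare classes

# Assumption: the README only says "Influenza B, etc.", so this set is illustrative.
RARE_CLASSES = {"Influenza B"}


def apply_threshold(label: str, confidence: float) -> dict:
    """Return the label if it clears its class threshold, otherwise 'Unknown'.

    The raw prediction is always kept so callers can inspect rejected results.
    """
    threshold = RARE_THRESHOLD if label in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = confidence >= threshold
    return {
        "virus_identity": label if accepted else "Unknown",
        "virus_confidence": confidence,
        "raw_prediction": label,
        "raw_confidence": confidence,
    }


# The same 60% confidence is accepted for a general class but rejected for a rare one.
print(apply_threshold("Influenza A", 0.60)["virus_identity"])
print(apply_threshold("Influenza B", 0.60)["virus_identity"])
```

A stricter threshold for rare classes makes the model less likely to assign a rare label on weak evidence, which is the failure the v1.4 changelog describes.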
|
| 108 |
|
| 109 |
Alternatively, you can load components manually:
|
| 110 |
+
|
| 111 |
```python
|
| 112 |
import joblib
|
| 113 |
classifier = joblib.load("dna_classifier.joblib")
|
|
|
|
| 116 |
```
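
When loading components by hand, the scaler and classifier presumably need to be applied in the same order as during training. The sketch below shows that round trip with toy stand-ins. It does not use the repository's files: the file names, the random features, and the scale-then-classify order are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy feature matrix and alternating labels (stand-ins for real extracted features).
rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = np.array([0, 1] * 20)

# Fit a scaler and a classifier, then persist both as separate .joblib files.
scaler = StandardScaler().fit(X)
classifier = RandomForestClassifier(n_estimators=10, random_state=0)
classifier.fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as tmp:
    joblib.dump(classifier, Path(tmp) / "toy_classifier.joblib")
    joblib.dump(scaler, Path(tmp) / "toy_scaler.joblib")

    # Loading needs only joblib and scikit-learn, with no custom classes involved.
    loaded_classifier = joblib.load(Path(tmp) / "toy_classifier.joblib")
    loaded_scaler = joblib.load(Path(tmp) / "toy_scaler.joblib")

# Apply the scaler first, then ask the classifier for class probabilities.
probabilities = loaded_classifier.predict_proba(loaded_scaler.transform(X[:3]))
print(probabilities.shape)
```

Only load `.joblib` files from sources you trust, since unpickling can execute arbitrary code.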
|
| 117 |
|
| 118 |
## Training Meta
|
| 119 |
+
|
| 120 |
+
- **Generated By**: DNA Governance Console (vparka)
|
| 121 |
+
- **Framework**: scikit-learn
|
| 122 |
+
- **Class Weighting**: Balanced (inverse frequency)
|
| 123 |
+
- **Train/Test Split**: 80/20
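
"Balanced (inverse frequency)" weighting can be made concrete with the formula behind scikit-learn's `balanced` heuristic: `n_samples / (n_classes * class_count)`. The sketch below applies it in plain Python to the six counts from the "Top Virus Classes" table. That is an illustration only: the real weights depend on the full label distribution used in training, and the helper names are hypothetical.

```python
def balanced_class_weights(class_counts: dict) -> dict:
    """Weight per class: n_samples / (n_classes * count), so rare classes weigh more."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


def sample_weights(labels: list, class_weights: dict) -> list:
    """Expand per-class weights to one weight per training sample."""
    return [class_weights[label] for label in labels]


# Counts copied from the "Top Virus Classes" table (a subset of all classes).
top_class_counts = {
    "Influenza A H1N1": 11251,
    "Influenza A H3N2": 9326,
    "Influenza A H5N1": 6675,
    "Norovirus GII": 5488,
    "Influenza A H9N2": 1929,
    "Influenza A H1N2": 1349,
}

weights = balanced_class_weights(top_class_counts)
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.2f}")
```

`GradientBoostingClassifier` has no `class_weight` parameter, so weights like these would typically be expanded per sample and passed as `sample_weight` to `fit`. Whether the training pipeline does exactly that is not stated here.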
|