---
tags:
- biology
- genomics
- classification
- sklearn
- virus-detection
- influenza
- norovirus
library_name: sklearn
datasets:
- ncbi-influenza
- ncbi-norovirus
---

# BioGuard DNA Classifier Ensemble (Portable v1.5)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**: the `.joblib` files load without any custom class definitions.

> [!NOTE]
> **Version 1.5 Update**: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.

## Training Dataset

| Metric | Value |
|--------|-------|
| **Total Sequences** | 176,407 |
| **Unique Virus Types** | 40+ |
| **Data Source** | NCBI Influenza & Norovirus Database |
| **Last Updated** | 2026-01-08 |

### Top Virus Classes

| Virus Type | Samples |
|------------|---------|
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |

## Changelog

### v1.5 (2026-01-08)

- **TRAINED**: 176,407 sequences (5x increase from v1.4)
- **EXPANDED**: 40+ virus subtypes now supported
- **IMPROVED**: Better generalization with larger dataset

### v1.4 (2026-01-08)

- **FIXED**: The ViralBoost model no longer misclassifies random sequences as "Influenza B" with 100% confidence
- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
- **ADDED**: Confidence thresholds in `inference.py`:
  - General classes: 55% minimum confidence
  - Rare classes (Influenza B, etc.): 65% minimum confidence
  - Predictions below the threshold are classified as "Unknown"
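
The thresholding rule above can be sketched in a few lines. This is a minimal illustration, not the code shipped in `inference.py`: the function name `apply_confidence_threshold`, the `RARE_CLASSES` set, and the shape of the returned dictionary are assumptions chosen to mirror the keys used in the usage example of this README.

```python
# Hypothetical sketch of the v1.4 thresholds; names are illustrative only.
GENERAL_THRESHOLD = 0.55  # minimum confidence for general classes
RARE_THRESHOLD = 0.65     # minimum confidence for rare classes
RARE_CLASSES = {"Influenza B"}  # assumption: only the class named in the changelog


def apply_confidence_threshold(label, confidence):
    """Keep the predicted label only if it meets its class threshold."""
    threshold = RARE_THRESHOLD if label in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = confidence >= threshold
    return {
        "virus_identity": label if accepted else "Unknown",
        "virus_confidence": confidence,
        # The raw values are kept so callers can inspect rejected predictions.
        "raw_prediction": label,
        "raw_confidence": confidence,
    }


print(apply_confidence_threshold("Influenza B", 0.60)["virus_identity"])  # Unknown
print(apply_confidence_threshold("Influenza A", 0.60)["virus_identity"])  # Influenza A
```

Using a stricter cutoff for rare classes trades some recall on those classes for fewer confident false positives, which matches the fix described in this release.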

### v1.3

- Refactored to remove custom class dependencies from `.joblib` files

## Models Included

This repository hosts two distinct models specialized for different aspects of genomic analysis:

### 1. **GenetiForest** (RandomForestClassifier)

- **File**: `dna_classifier.joblib`
- **Purpose**: General-purpose synthetic vs. biological sequence classification.
- **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- **Performance (Test Set)**:
  - **Accuracy**: 89.1%
  - **F1 Score**: 88.4%
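
To make "k-mers, GC content" concrete, here is a minimal sketch of how such features can be computed from a raw sequence. It is only an illustration: the function names, the choice of `k=3`, and the feature ordering are assumptions, and the actual feature set used by the models is defined in `inference.py`.

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence, k=3):
    """Relative frequency of every A/C/G/T k-mer, in a fixed lexicographic order.

    Windows containing other symbols (e.g. N) count toward the total
    but have no slot in the output vector.
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


def extract_features(sequence, k=3):
    """Concatenate GC content with the k-mer frequency vector."""
    return [gc_content(sequence)] + kmer_frequencies(sequence, k)


features = extract_features("ATGCATGC")
print(len(features))  # 65: one GC value plus 4**3 k-mer frequencies
```

A fixed k-mer vocabulary keeps the vector length constant across sequences, which tree-based models such as a random forest require.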

### 2. **ViralBoost** (GradientBoostingClassifier)

- **File**: `sequence_model.joblib`
- **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
- **Performance (Test Set)**:
  - **Accuracy**: 96.7%
  - **F1 Score**: 97.0%
- **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

## Usage

Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.

1. Download all files (`.joblib` and `inference.py`).
2. Use the `inference.py` script:

```python
from inference import predict_dna

sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")

# Check if prediction was below threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
```

Alternatively, you can load components manually:

```python
import joblib

classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for Feature Extraction logic)
```
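
Once the components are loaded, a prediction step might look like the sketch below. It assumes the scaler is applied to the feature vector before the classifier sees it, and that `features` was produced by the feature extraction logic in `inference.py`. The helper name `classify_features` is hypothetical. Only standard scikit-learn estimator methods (`transform`, `predict_proba`, `classes_`) are used.

```python
def classify_features(classifier, scaler, features):
    """Scale one feature vector and return (label, probability) for the top class.

    Assumption: `scaler` and `classifier` are the fitted scikit-learn objects
    loaded from the .joblib files, and `features` matches their training layout.
    """
    scaled = scaler.transform([features])  # one row: shape (1, n_features)
    probabilities = list(classifier.predict_proba(scaled)[0])
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classifier.classes_[best], probabilities[best]
```

Returning the probability alongside the label lets the caller apply its own confidence cutoff instead of trusting every prediction.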

## Training Meta

- **Generated By**: DNA Governance Console (vparka)
- **Framework**: scikit-learn
- **Class Weighting**: Balanced (inverse frequency)
- **Train/Test Split**: 80/20
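
For reference, "balanced (inverse frequency)" weighting follows the heuristic documented for scikit-learn's `compute_class_weight('balanced')`: each class gets `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on toy labels. The label counts are invented for illustration and the helper name is hypothetical. Since `GradientBoostingClassifier` has no `class_weight` parameter, one common route is to expand the class weights into per-sample weights and pass them to `fit(X, y, sample_weight=...)`; whether the training pipeline did exactly this is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c)), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy labels for illustration only, not the real training distribution.
labels = ["Influenza A"] * 8 + ["Norovirus"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'Influenza A': 0.625, 'Norovirus': 2.5}

# Per-sample weights, e.g. for an estimator's fit(..., sample_weight=...)
sample_weights = [weights[label] for label in labels]
```

The minority class receives the larger weight, so errors on rare viruses cost more during training.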