---
tags:
- biology
- genomics
- classification
- sklearn
- virus-detection
- influenza
- norovirus
library_name: sklearn
datasets:
- ncbi-influenza
- ncbi-norovirus
---
# BioGuard DNA Classifier Ensemble (Portable v1.5)
This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.
> [!NOTE]
> **Version 1.5 Update**: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.
## Training Dataset
| Metric | Value |
|--------|-------|
| **Total Sequences** | 176,407 |
| **Unique Virus Types** | 40+ |
| **Data Source** | NCBI Influenza & Norovirus Database |
| **Last Updated** | 2026-01-08 |
### Top Virus Classes
| Virus Type | Samples |
|------------|---------|
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |
## Changelog
### v1.5 (2026-01-08)
- **TRAINED**: 176,407 sequences (5x increase from v1.4)
- **EXPANDED**: 40+ virus subtypes now supported
- **IMPROVED**: Better generalization with larger dataset
### v1.4 (2026-01-08)
- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
- **ADDED**: Confidence thresholds in `inference.py`:
  - General classes: 55% minimum confidence
  - Rare classes (Influenza B, etc.): 65% minimum confidence
  - Below threshold → classified as "Unknown"
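
The thresholding rule described for v1.4 can be sketched as follows. This is a minimal illustration, not the code in `inference.py`: the constant and function names are hypothetical, and the set of rare classes is an assumption, since the changelog names only Influenza B explicitly.

```python
# Thresholds come from the v1.4 changelog; all names here are hypothetical.
GENERAL_THRESHOLD = 0.55
RARE_THRESHOLD = 0.65
# Assumption: only Influenza B is listed, because the changelog names no other rare class.
RARE_CLASSES = {"Influenza B"}


def apply_threshold(label: str, confidence: float) -> str:
    """Return the label if its confidence meets the minimum, else "Unknown"."""
    threshold = RARE_THRESHOLD if label in RARE_CLASSES else GENERAL_THRESHOLD
    return label if confidence >= threshold else "Unknown"


print(apply_threshold("Influenza A", 0.60))  # meets the general minimum
print(apply_threshold("Influenza B", 0.60))  # below the rare-class minimum
```

Using a stricter minimum for rare classes trades some recall on those classes for fewer false positives, which matches the over-classification fix described above.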
### v1.3
- Refactored to remove custom class dependencies from `.joblib` files
## Models Included
This repository hosts two distinct models specialized for different aspects of genomic analysis:
### 1. **GenetiForest** (RandomForestClassifier)
- **File**: `dna_classifier.joblib`
- **Purpose**: General-purpose synthetic vs. biological sequence classification.
- **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- **Performance (Test Set)**:
  - **Accuracy**: 89.1%
  - **F1 Score**: 88.4%
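
To make the feature types named above concrete, here is a minimal sketch of GC content and k-mer frequency extraction. It is an illustration under assumptions, not the extraction logic in `inference.py`: the function names, the choice of `k = 3`, and the handling of ambiguous bases are all hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every A/C/G/T k-mer, in a fixed order.

    Windows containing other symbols (e.g. N) are counted in the total
    but do not appear as keys; that handling is an assumption of this sketch.
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in kmers}


features = kmer_frequencies("ATGCATGC", k=2)
print(gc_content("ATGCATGC"), features["AT"])
```

Emitting all `4**k` k-mers in a fixed order keeps the feature vector the same length for every sequence, which a fitted scaler and classifier require.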
### 2. **ViralBoost** (GradientBoostingClassifier)
- **File**: `sequence_model.joblib`
- **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
- **Performance (Test Set)**:
  - **Accuracy**: 96.7%
  - **F1 Score**: 97.0%
- **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
## Usage
Because both models depend on biological feature extraction, a standalone `inference.py` script is provided to handle it for you.
1. Download all files (`.joblib` and `inference.py`).
2. Use the `inference.py` script:
```python
from inference import predict_dna
sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)
print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")
# Check if prediction was below threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
```
Alternatively, you can load components manually:
```python
import joblib
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for Feature Extraction logic)
```
## Training Meta
- **Generated By**: DNA Governance Console (vparka)
- **Framework**: scikit-learn
- **Class Weighting**: Balanced (inverse frequency)
- **Train/Test Split**: 80/20
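
As an illustration of the "Balanced (inverse frequency)" entry, the sketch below computes weights with the formula `n_samples / (n_classes * class_count)`, which is what scikit-learn's `compute_class_weight('balanced')` uses. The helper name and the toy label counts are hypothetical and are not taken from the training data.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels for illustration only (not the real class distribution).
labels = ["Influenza A"] * 8 + ["Norovirus"] * 2
weights = balanced_class_weights(labels)
print(weights)  # the minority class receives the larger weight
```

With these weights, each class contributes the same total weight to the training loss regardless of how many samples it has.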