---
tags:
- biology
- genomics
- classification
- sklearn
- virus-detection
- influenza
- norovirus
library_name: sklearn
datasets:
- ncbi-influenza
- ncbi-norovirus
---
BioGuard DNA Classifier Ensemble (Portable v1.5)
This repository contains a dual-model ensemble for DNA sequence analysis and virus classification. It is optimized for portability: the `.joblib` files load without any custom class dependencies.
Version 1.5 Update: Trained on 176,407 sequences from the NCBI database. Covers 40+ virus subtypes, including Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.
Training Dataset
| Metric | Value |
|---|---|
| Total Sequences | 176,407 |
| Unique Virus Types | 40+ |
| Data Source | NCBI Influenza & Norovirus Database |
| Last Updated | 2026-01-08 |
Top Virus Classes
| Virus Type | Samples |
|---|---|
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |
Changelog
v1.5 (2026-01-08)
- TRAINED: 176,407 sequences (5x increase from v1.4)
- EXPANDED: 40+ virus subtypes now supported
- IMPROVED: Better generalization with larger dataset
v1.4 (2026-01-08)
- FIXED: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
- ADDED: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
- ADDED: Confidence thresholds in `inference.py`:
  - General classes: 55% minimum confidence
  - Rare classes (Influenza B, etc.): 65% minimum confidence
  - Below threshold → classified as "Unknown"
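The thresholding rule described in this entry can be sketched as follows. This is a minimal illustration, not the code in `inference.py`: the function name `apply_confidence_threshold` and the `RARE_CLASSES` set are assumptions (only Influenza B is named as a rare class here, and the "etc." is left unfilled), and the dictionary keys are borrowed from the usage example in this README.

```python
# Thresholds quoted in the v1.4 changelog entry.
GENERAL_THRESHOLD = 0.55
RARE_THRESHOLD = 0.65

# Assumption: only "Influenza B" is named as a rare class in this README.
RARE_CLASSES = {"Influenza B"}


def apply_confidence_threshold(label, confidence):
    """Return the label if it clears its threshold, otherwise "Unknown"."""
    threshold = RARE_THRESHOLD if label in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = confidence >= threshold
    return {
        "virus_identity": label if accepted else "Unknown",
        "virus_confidence": confidence,
        # The raw values are kept so a rejected prediction can still be inspected.
        "raw_prediction": label,
        "raw_confidence": confidence,
    }
```

With these values, a 60% "Influenza A" prediction would be kept, while a 60% "Influenza B" prediction would be reported as "Unknown" with the raw label preserved.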
v1.3
- Refactored to remove custom class dependencies from `.joblib` files
Models Included
This repository hosts two distinct models specialized for different aspects of genomic analysis:
1. GenetiForest (RandomForestClassifier)
- File: `dna_classifier.joblib`
- Purpose: General-purpose synthetic vs. biological sequence classification.
- Architecture: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- Performance (Test Set):
  - Accuracy: 89.1%
  - F1 Score: 88.4%
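The feature types named above (k-mers and GC content) can be illustrated with a short sketch. The actual feature set and its ordering live in `inference.py` and are not reproduced here; the function names `gc_content` and `kmer_frequencies` and the default `k=3` are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of G and C bases among the A/C/G/T bases in the sequence."""
    seq = sequence.upper()
    valid = sum(seq.count(base) for base in "ACGT")
    if valid == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / valid


def kmer_frequencies(sequence, k=3):
    """Normalized counts of every possible k-mer over the ACGT alphabet."""
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows containing ambiguous bases such as N.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Hypothetical usage: a fixed-length vector built from one sequence.
features = kmer_frequencies("ATGCGC", k=2)
feature_vector = [gc_content("ATGCGC")] + [features[key] for key in sorted(features)]
```

Enumerating all possible k-mers, rather than only those observed, keeps the vector length fixed across sequences, which a fitted scikit-learn model requires.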
2. ViralBoost (GradientBoostingClassifier)
- File: `sequence_model.joblib`
- Purpose: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- Architecture: Gradient Boosting (sklearn) with class-weighted training.
- Performance (Test Set):
  - Accuracy: 96.7%
  - F1 Score: 97.0%
- Classes: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
Usage
Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.
- Download all files (`.joblib` and `inference.py`).
- Use the `inference.py` script:
```python
from inference import predict_dna

sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")

# Check if prediction was below threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
```
Alternatively, you can load components manually:
```python
import joblib

classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for Feature Extraction logic)
```
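Once a feature vector has been produced by the extraction logic in `inference.py`, the loaded components could be combined roughly as below. This sketch assumes that `scaler_rf.joblib` holds a fitted scikit-learn scaler paired with `dna_classifier.joblib`; the helper name `predict_with_components` and its `feature_vector` argument are hypothetical. It relies only on the standard scikit-learn estimator interface (`transform`, `predict_proba`, and the `classes_` attribute).

```python
def predict_with_components(classifier, scaler, feature_vector):
    """Scale one feature vector and return (predicted label, probability).

    Assumption: feature_vector matches the features the models were
    trained on, in the same order as produced by inference.py.
    """
    # scikit-learn expects a 2D input: one row per sample.
    scaled = scaler.transform([feature_vector])
    probabilities = classifier.predict_proba(scaled)[0]
    # Index of the highest-probability class.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classifier.classes_[best], float(probabilities[best])
```

Returning the probability alongside the label makes it possible to apply the same confidence thresholds that `inference.py` uses.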
Training Meta
- Generated By: DNA Governance Console (vparka)
- Framework: scikit-learn
- Class Weighting: Balanced (inverse frequency)
- Train/Test Split: 80/20
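The "balanced (inverse frequency)" weighting corresponds to the heuristic scikit-learn documents for `class_weight='balanced'`: each class receives the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python so the effect is easy to see; the function name `balanced_class_weights` and the toy labels are illustrative, and this is not the project's training code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy example: a 4:1 imbalance between two classes.
labels = ["Influenza A"] * 8 + ["Influenza B"] * 2
weights = balanced_class_weights(labels)
```

The rarer class receives the larger weight, so errors on it contribute more during training.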