---
tags:
- biology
- genomics
- classification
- sklearn
- virus-detection
- influenza
- norovirus
library_name: sklearn
datasets:
- ncbi-influenza
- ncbi-norovirus
---

# BioGuard DNA Classifier Ensemble (Portable v1.5)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability**: the `.joblib` files load without any custom class dependencies.

> [!NOTE]
> **Version 1.5 Update**: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.

## 📊 Training Dataset

| Metric | Value |
|--------|-------|
| **Total Sequences** | 176,407 |
| **Unique Virus Types** | 40+ |
| **Data Source** | NCBI Influenza & Norovirus Database |
| **Last Updated** | 2026-01-08 |

### Top Virus Classes

| Virus Type | Samples |
|------------|---------|
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |

## 🔄 Changelog

### v1.5 (2026-01-08)

- **TRAINED**: 176,407 sequences (5x increase from v1.4)
- **EXPANDED**: 40+ virus subtypes now supported
- **IMPROVED**: Better generalization with the larger dataset

### v1.4 (2026-01-08)

- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
- **ADDED**: Confidence thresholds in `inference.py`:
  - General classes: 55% minimum confidence
  - Rare classes (Influenza B, etc.): 65% minimum confidence
  - Below threshold → classified as "Unknown"

### v1.3

- Refactored to remove custom class dependencies from `.joblib` files

## 🧬 Models Included

This repository hosts two models, each specialized for a different aspect of genomic analysis:

### 1. **GenetiForest** (RandomForestClassifier)

- **File**: `dna_classifier.joblib`
- **Purpose**: General-purpose synthetic vs. biological sequence classification.
- **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- **Performance (Test Set)**:
  - **Accuracy**: 89.1%
  - **F1 Score**: 88.4%

### 2. **ViralBoost** (GradientBoostingClassifier)

- **File**: `sequence_model.joblib`
- **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
- **Performance (Test Set)**:
  - **Accuracy**: 96.7%
  - **F1 Score**: 97.0%
- **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

## 🚀 Usage

Both models expect features extracted from the raw sequence, so a standalone `inference.py` script is provided to handle that step.

1. Download all `.joblib` files and `inference.py`.
2. Use the `inference.py` script:

```python
from inference import predict_dna

sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")

# Check whether the prediction fell below the confidence threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
```

Alternatively, you can load the components manually:

```python
import joblib

# GenetiForest model and its feature scaler
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")

# See inference.py for the feature extraction logic that must run before prediction.
```

## 📊 Training Meta

- **Generated By**: DNA Governance Console (vparka)
- **Framework**: scikit-learn
- **Class Weighting**: Balanced (inverse frequency)
- **Train/Test Split**: 80/20
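
To make "Balanced (inverse frequency)" concrete, here is a minimal sketch of the formula that scikit-learn's `compute_class_weight('balanced')` applies: `n_samples / (n_classes * class_count)`. The counts are copied from the Top Virus Classes table purely for illustration. They cover only the six listed classes, which are not necessarily the labels used to train either model, so the real training weights will differ. The helper name `balanced_class_weights` is hypothetical and is not part of `inference.py`.

```python
# Illustrative only: sample counts copied from the "Top Virus Classes" table.
# They cover six classes, not the full training set, so these are NOT the
# weights actually used to train the models.
top_class_counts = {
    "Influenza A H1N1": 11251,
    "Influenza A H3N2": 9326,
    "Influenza A H5N1": 6675,
    "Norovirus GII": 5488,
    "Influenza A H9N2": 1929,
    "Influenza A H1N2": 1349,
}


def balanced_class_weights(counts):
    """Hypothetical helper: weight[c] = n_samples / (n_classes * count[c]).

    This is the same formula scikit-learn uses for class_weight='balanced'.
    """
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(top_class_counts)

# Print classes from most to least frequent; rarer classes get larger weights.
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label:<18} count={top_class_counts[label]:>6}  weight={weight:.3f}")
```

Note that `GradientBoostingClassifier` has no `class_weight` parameter, so weights like these are typically expanded to one value per training sample and passed as `sample_weight` to `fit`. Whether ViralBoost was trained that way is an assumption; the source states only that class weighting was used.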