---
tags:
- biology
- genomics
- classification
- sklearn
- virus-detection
- influenza
- norovirus
library_name: sklearn
datasets:
- ncbi-influenza
- ncbi-norovirus
---

# BioGuard DNA Classifier Ensemble (Portable v1.5)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability**: the `.joblib` files load without any custom class definitions (scikit-learn and joblib are still required).

> [!NOTE]
> **Version 1.5 Update**: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.

## 📊 Training Dataset

| Metric | Value |
|--------|-------|
| **Total Sequences** | 176,407 |
| **Unique Virus Types** | 40+ |
| **Data Source** | NCBI Influenza & Norovirus Database |
| **Last Updated** | 2026-01-08 |

### Top Virus Classes

| Virus Type | Samples |
|------------|---------|
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |

## 🔄 Changelog

### v1.5 (2026-01-08)

- **TRAINED**: 176,407 sequences (5x increase from v1.4)
- **EXPANDED**: 40+ virus subtypes now supported
- **IMPROVED**: Better generalization with larger dataset

### v1.4 (2026-01-08)

- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
- **ADDED**: Confidence thresholds in `inference.py`:
  - General classes: 55% minimum confidence
  - Rare classes (Influenza B, etc.): 65% minimum confidence
  - Below threshold → classified as "Unknown"
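The threshold rule above can be sketched roughly as follows. This is a minimal illustration, not the code in `inference.py`: the function name `apply_threshold`, the constant names, and the contents of `RARE_CLASSES` are assumptions (the changelog only names Influenza B explicitly), and treating the threshold as inclusive (`>=`) is a guess based on the phrase "minimum confidence". The result keys mirror those used in the usage example further down.

```python
# Hypothetical sketch of the v1.4 confidence gate; the real logic lives in
# inference.py and may differ in naming and details.
GENERAL_THRESHOLD = 0.55          # general classes: 55% minimum confidence
RARE_THRESHOLD = 0.65             # rare classes: 65% minimum confidence
RARE_CLASSES = {"Influenza B"}    # assumption: only Influenza B is named explicitly


def apply_threshold(label, confidence):
    """Keep the predicted label only if it clears its class threshold."""
    threshold = RARE_THRESHOLD if label in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = confidence >= threshold
    return {
        "virus_identity": label if accepted else "Unknown",
        "virus_confidence": confidence,
        # The raw values stay available so callers can inspect rejected predictions.
        "raw_prediction": label,
        "raw_confidence": confidence,
    }
```

With these assumed values, a 60% "Influenza A" prediction is kept, while a 60% "Influenza B" prediction is reported as "Unknown" with the raw label preserved.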

### v1.3

- Refactored to remove custom class dependencies from `.joblib` files

## 🧬 Models Included

This repository hosts two distinct models specialized for different aspects of genomic analysis:

### 1. **GenetiForest** (RandomForestClassifier)

- **File**: `dna_classifier.joblib`
- **Purpose**: General-purpose synthetic vs. biological sequence classification.
- **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- **Performance (Test Set)**:
  - **Accuracy**: 89.1%
  - **F1 Score**: 88.4%
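To give a sense of what "k-mers, GC content" features might look like, here is a minimal pure-Python sketch. It is an illustration only: the function names, the choice of `k=3`, and the feature ordering are assumptions, and the actual feature set used by GenetiForest is defined in `inference.py`.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq, k=3):
    """Relative frequency of every A/C/G/T k-mer, in a fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def extract_features(seq, k=3):
    """Concatenate GC content with the k-mer frequency vector."""
    return [gc_content(seq)] + kmer_frequencies(seq, k)
```

For `k=3` this yields a fixed-length vector of 65 values (1 GC fraction plus 64 k-mer frequencies) regardless of sequence length, which is the kind of input a Random Forest expects.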

### 2. **ViralBoost** (GradientBoostingClassifier)

- **File**: `sequence_model.joblib`
- **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
- **Performance (Test Set)**:
  - **Accuracy**: 96.7%
  - **F1 Score**: 97.0%
- **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

## 🚀 Usage

Because the models expect features extracted from the raw sequence rather than the sequence itself, a standalone `inference.py` script is provided that handles feature extraction and prediction.

1. Download all files (`.joblib` and `inference.py`).
2. Use the `inference.py` script:

```python
from inference import predict_dna

sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")

# Check if prediction was below threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
```

Alternatively, you can load components manually:

```python
import joblib

# Load the GenetiForest classifier and the scaler file (scaler_rf.joblib).
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")

# Feature extraction is not shown here; refer to inference.py for that logic.
```

## 📊 Training Metadata

- **Generated By**: DNA Governance Console (vparka)
- **Framework**: scikit-learn
- **Class Weighting**: Balanced (inverse frequency)
- **Train/Test Split**: 80/20
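
As a rough illustration of "Balanced (inverse frequency)" weighting, the sketch below reproduces the formula scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count(class))`, in plain Python. The helper names are hypothetical, and how the weights were actually fed into training is not stated in this card; passing per-sample weights to `fit(..., sample_weight=...)` is one common route for estimators that take no `class_weight` argument.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def sample_weights(labels):
    """Expand class weights into one weight per training example."""
    weights = balanced_class_weights(labels)
    return [weights[y] for y in labels]
```

With 8 samples of one class and 2 of another, the weights come out to 0.625 and 2.5, so each class contributes the same total weight to the loss.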