---
tags:
- biology
- genomics
- classification
- sklearn
- virus-detection
- influenza
- norovirus
library_name: sklearn
datasets:
- ncbi-influenza
- ncbi-norovirus
---

# BioGuard DNA Classifier Ensemble (Portable v1.5)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.

> [!NOTE]
> **Version 1.5 Update**: Trained on 176,000+ sequences from the NCBI database. Covers 40+ virus types, including Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.

## 📊 Training Dataset

| Metric | Value |
|--------|-------|
| **Total Sequences** | 176,407 |
| **Unique Virus Types** | 40+ |
| **Data Source** | NCBI Influenza & Norovirus Database |
| **Last Updated** | 2026-01-08 |

### Top Virus Classes

| Virus Type | Samples |
|------------|---------|
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |

## 🔄 Changelog

### v1.5 (2026-01-08)

- **TRAINED**: 176,407 sequences (5x increase from v1.4)
- **EXPANDED**: 40+ virus subtypes now supported
- **IMPROVED**: Better generalization with larger dataset

### v1.4 (2026-01-08)

- **FIXED**: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
- **ADDED**: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
- **ADDED**: Confidence thresholds in `inference.py`:
  - General classes: 55% minimum confidence
  - Rare classes (Influenza B, etc.): 65% minimum confidence
  - Below threshold → classified as "Unknown"
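
The thresholding rule described in the v1.4 notes can be sketched as follows. This is a minimal illustration, not the code in `inference.py`: the function name `apply_threshold`, the constant names, and the contents of `RARE_CLASSES` are assumptions; only the 55% and 65% values come from the changelog.

```python
# Threshold values are taken from the changelog; all names here are hypothetical.
GENERAL_THRESHOLD = 0.55
RARE_THRESHOLD = 0.65
RARE_CLASSES = {"Influenza B"}  # assumption: the full rare-class list lives in inference.py


def apply_threshold(class_names, probabilities):
    """Return (label, confidence, raw_label) for one prediction.

    `probabilities` is a sequence of per-class probabilities aligned with
    `class_names`, e.g. one row of a classifier's `predict_proba` output.
    """
    # Pick the most probable class.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    raw_label = class_names[best]
    confidence = float(probabilities[best])

    # Rare classes must clear a stricter bar than general classes.
    threshold = RARE_THRESHOLD if raw_label in RARE_CLASSES else GENERAL_THRESHOLD

    # Anything below its threshold is reported as "Unknown".
    label = raw_label if confidence >= threshold else "Unknown"
    return label, confidence, raw_label
```

Under these assumptions, a 60% "Influenza B" prediction would be reported as "Unknown", while a 60% "Influenza A" prediction would be kept.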

### v1.3

- Refactored to remove custom class dependencies from `.joblib` files

## 🧬 Models Included

This repository hosts two distinct models specialized for different aspects of genomic analysis:

### 1. **GenetiForest** (RandomForestClassifier)

- **File**: `dna_classifier.joblib`
- **Purpose**: General-purpose synthetic vs. biological sequence classification.
- **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- **Performance (Test Set)**:
  - **Accuracy**: 89.1%
  - **F1 Score**: 88.4%
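
To give a sense of what "biological feature extraction" can look like, here is a minimal sketch that computes GC content and k-mer frequencies. The function names, the choice of `k=3`, and the feature ordering are assumptions for illustration; the features the models were actually trained on are defined in `inference.py` and may differ.

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence, k=3):
    """Relative frequency of every A/C/G/T k-mer, in lexicographic order.

    Windows containing other symbols (e.g. N) count toward the total
    but have no slot in the returned vector.
    """
    seq = sequence.upper()
    total = max(len(seq) - k + 1, 0)  # number of sliding windows
    counts = Counter(seq[i:i + k] for i in range(total))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def extract_features(sequence, k=3):
    """Concatenate GC content with the k-mer frequency vector."""
    return [gc_content(sequence)] + kmer_frequencies(sequence, k)
```

With `k=3` this yields a 65-dimensional vector (1 GC value plus 64 trinucleotide frequencies), which could then be scaled and passed to a classifier.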

### 2. **ViralBoost** (GradientBoostingClassifier)

- **File**: `sequence_model.joblib`
- **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- **Architecture**: Gradient Boosting (sklearn) with class-weighted training.
- **Performance (Test Set)**:
  - **Accuracy**: 96.7%
  - **F1 Score**: 97.0%
- **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

## 🚀 Usage

Since these models use biological feature extraction, we provide a standalone `inference.py` script for easy usage.

1. Download all files (`.joblib` and `inference.py`).
2. Import `predict_dna` from `inference.py` and call it on a sequence:

```python
from inference import predict_dna

sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")

# Check if prediction was below threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
```

Alternatively, you can load components manually:

```python
import joblib

# Load the GenetiForest classifier and its accompanying scaler.
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# Refer to inference.py for the feature extraction logic.
```

## 📊 Training Meta

- **Generated By**: DNA Governance Console (vparka)
- **Framework**: scikit-learn
- **Class Weighting**: Balanced (inverse frequency)
- **Train/Test Split**: 80/20
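
For readers unfamiliar with balanced class weighting, the sketch below reproduces the inverse-frequency formula that scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * count(class))`. It is written in plain Python for clarity; the function name `balanced_class_weights` and the toy labels are illustrative assumptions, not the training code or the real class distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy, made-up label distribution used only to show the effect.
toy_labels = ["Influenza A"] * 6 + ["Norovirus"] * 3 + ["Influenza B"] * 1
weights = balanced_class_weights(toy_labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

The rarest class receives the largest weight, so its errors count more during training, which is the mechanism the v1.4 changelog relies on to counter class imbalance.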