---
tags:
- biology
- genomics
- classification
- sklearn
library_name: sklearn
---

# BioGuard DNA Classifier Ensemble (Portable v1.1)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**: the model files can be loaded without any custom Python classes.

> [!NOTE]
> **Version 1.1 Update**: This version has been refactored to decouple the models from custom feature extraction classes. It uses a raw scikit-learn format for maximum compatibility.

## 🧬 Models Included

This repository hosts two distinct models specialized for different aspects of genomic analysis:

### 1. **GenetiForest** (RandomForestClassifier)
*   **File**: `dna_classifier.joblib`
*   **Purpose**: General-purpose synthetic vs. biological sequence classification.
*   **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
*   **Performance (Test Set)**:
    *   **Accuracy**: 89.4%
    *   **F1 Score**: 89.4%
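
The features named above (k-mers, GC content) can be illustrated with a minimal sketch. This is not the repository's actual extraction code, which lives in `inference.py`; the function names, the choice of `k`, and the feature ordering below are all assumptions made for illustration, so vectors produced this way should not be fed to the shipped models.

```python
from collections import Counter
from itertools import product


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the alphabet ACGT."""
    seq = sequence.upper()
    # Slide a window of width k across the sequence.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows containing ambiguous bases such as N.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def extract_features(sequence: str, k: int = 3) -> list[float]:
    """Hypothetical feature vector: GC content, then k-mer frequencies in sorted order."""
    freqs = kmer_frequencies(sequence, k)
    return [gc_content(sequence)] + [freqs[kmer] for kmer in sorted(freqs)]
```

With `k=3` this yields a vector of 65 values (1 GC fraction plus 4³ k-mer frequencies). A fixed, sorted k-mer order matters because tree ensembles identify features by column position.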

### 2. **ViralBoost** (GradientBoostingClassifier)
*   **File**: `sequence_model.joblib`
*   **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
*   **Architecture**: Gradient Boosting (sklearn) trained on real-world viral sequences.
*   **Performance (Test Set)**:
    *   **Accuracy**: 99.4%
    *   **F1 Score**: 99.4%
*   **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

## 🚀 Usage

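Because these models expect extracted biological features rather than raw sequence strings, the repository includes a standalone `inference.py` script that handles feature extraction and prediction.

1.  Download all files (`.joblib` and `inference.py`) into the same directory.
2.  Import `predict_dna` from `inference.py` and pass it a sequence: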

```python
from inference import predict_dna

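# Placeholder input; replace with your own nucleotide sequence
sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

# The result is indexed by key: one entry per model
print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")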
```

Alternatively, you can load components manually:
```python
import joblib

# Load the GenetiForest classifier and the scaler file shipped alongside it
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for the feature extraction logic)
```
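
Once the components are loaded, the manual path presumably follows the usual scikit-learn order: extract features, scale them, then predict. The sketch below shows that order only. `classify_with_components` and its `extract` argument are hypothetical names, not part of `inference.py`; `extract` stands in for the real feature extraction function, and the sketch assumes the loaded scaler exposes `transform` and the classifier exposes `predict`, as standard scikit-learn estimators do.

```python
def classify_with_components(sequence, extract, scaler, classifier):
    """Run one sequence through extract -> scale -> predict.

    extract:    callable mapping a sequence string to a flat list of floats
                (hypothetical; use the logic from inference.py in practice)
    scaler:     object with a scikit-learn style transform(X) method
    classifier: object with a scikit-learn style predict(X) method
    """
    # scikit-learn expects a 2D input: one row per sample.
    features = [extract(sequence)]
    scaled = scaler.transform(features)
    # predict returns one label per row; a single row was passed in.
    return classifier.predict(scaled)[0]
```

The feature vector must match the layout used at training time (same features, same column order), which is why reusing the extraction code from `inference.py` is safer than reimplementing it.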

## 📊 Training Metadata
*   **Generated By**: DNA Governance Console (vparka)
*   **Framework**: scikit-learn