---
tags:
- biology
- genomics
- classification
- sklearn
library_name: sklearn
---

# BioGuard DNA Classifier Ensemble (Portable v1.1)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for **portability and zero-dependency loading**.

> [!NOTE]
> **Version 1.1 Update**: This version has been refactored to decouple the models from custom feature extraction classes. The models are now stored as plain scikit-learn estimators for maximum compatibility.

## 🧬 Models Included

This repository hosts two distinct models, each specialized for a different aspect of genomic analysis:

### 1. **GenetiForest** (RandomForestClassifier)

* **File**: `dna_classifier.joblib`
* **Purpose**: General-purpose synthetic vs. biological sequence classification.
* **Architecture**: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
* **Performance (Test Set)**:
    * **Accuracy**: 89.4%
    * **F1 Score**: 89.4%

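To make the feature types named above more concrete, here is a minimal sketch of GC content and k-mer frequency extraction. It is not the logic shipped in `inference.py`: the function names, the choice of `k = 3`, and the feature ordering are all assumptions made for illustration, and the real models expect whatever features `inference.py` produces.

```python
from collections import Counter
from itertools import product


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence: str, k: int = 3) -> list:
    """Relative frequency of every k-mer over A/C/G/T, in a fixed order.

    Windows containing other symbols (e.g. N) count toward the total
    but have no slot in the output vector.
    """
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    vocabulary = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


def extract_features(sequence: str, k: int = 3) -> list:
    """Hypothetical feature vector: GC content, then k-mer frequencies."""
    return [gc_content(sequence)] + kmer_frequencies(sequence, k)


features = extract_features("ATGCTAGCTAGCTAG")
print(len(features))  # 1 GC value + 4**3 k-mer frequencies
```

A fixed k-mer ordering matters because tree-based models such as Random Forests identify features by column position, so training and inference must agree on it.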
### 2. **ViralBoost** (GradientBoostingClassifier)

* **File**: `sequence_model.joblib`
* **Purpose**: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
* **Architecture**: Gradient Boosting (sklearn) trained on real-world viral sequences.
* **Performance (Test Set)**:
    * **Accuracy**: 99.4%
    * **F1 Score**: 99.4%
* **Classes**: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

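For a multi-class model like this one, it can help to see every class probability rather than only the top prediction. The sketch below assumes the object loaded from `sequence_model.joblib` is a fitted scikit-learn classifier exposing `classes_` and `predict_proba` (as `GradientBoostingClassifier` does), and that the input row has already been featurized the way the model expects. The helper name `rank_classes` is hypothetical, and whether `classes_` holds the names listed above or integer codes depends on how the model was trained.

```python
def rank_classes(classifier, feature_row):
    """Return (label, probability) pairs, most likely class first.

    `classifier` is any fitted scikit-learn style classifier with
    `classes_` and `predict_proba`; `feature_row` is one feature vector.
    """
    # predict_proba expects a 2D input, so wrap the single row in a list.
    probabilities = classifier.predict_proba([feature_row])[0]
    ranked = sorted(
        zip(classifier.classes_, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [(label, float(p)) for label, p in ranked]
```

With the real model this could be called as `rank_classes(joblib.load("sequence_model.joblib"), row)`, where `row` comes from the feature extraction in `inference.py`.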
## Usage

Because both models expect features extracted from the raw sequence, the repository includes a standalone `inference.py` script that handles this step.

1. Download all files from this repository (the `.joblib` files and `inference.py`).
2. Use the `inference.py` script:

```python
from inference import predict_dna

# Replace with your own nucleotide sequence.
sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

# Print the output of each model.
print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
```

Alternatively, you can load the components manually:

```python
import joblib

classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for the feature extraction logic)
```

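Once the components are loaded, a prediction would typically chain them together. The sketch below assumes the scaler is applied to the feature vector before the classifier sees it, which the presence of `scaler_rf.joblib` suggests but this README does not state. The helper name `predict_with_components` is hypothetical, and `features` stands for the vector produced by the feature extraction logic in `inference.py`.

```python
def predict_with_components(classifier, scaler, features):
    """Scale one feature vector, then classify it.

    Assumes scikit-learn style objects: `scaler.transform` and
    `classifier.predict` both take a 2D array of shape (n_samples, n_features).
    """
    scaled = scaler.transform([features])  # wrap the single sample in a list
    return classifier.predict(scaled)[0]   # unwrap the single prediction


# Example wiring with the files from this repository (not run here):
# import joblib
# classifier = joblib.load("dna_classifier.joblib")
# scaler = joblib.load("scaler_rf.joblib")
# label = predict_with_components(classifier, scaler, features)
```

If the feature vector is built differently from what the scaler was fitted on (different length or column order), scikit-learn will either raise an error or silently produce meaningless predictions, so reusing the extraction code from `inference.py` is the safer route.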
## Training Meta

* **Generated By**: DNA Governance Console (vparka)
* **Framework**: scikit-learn