metadata
tags:
- biology
- genomics
- classification
- sklearn
library_name: sklearn
BioGuard DNA Classifier Ensemble (Portable v1.1)
This repository contains a dual-model ensemble for DNA sequence analysis and virus classification. It is optimized for portability: loading the models requires only scikit-learn and joblib, with no custom classes.
Version 1.1 Update: This version has been refactored to decouple the models from custom feature extraction classes. The models are now stored as plain scikit-learn estimators for maximum compatibility.
🧬 Models Included
This repository hosts two distinct models specialized for different aspects of genomic analysis:
1. GenetiForest (RandomForestClassifier)
- File: dna_classifier.joblib
- Purpose: General-purpose synthetic vs. biological sequence classification.
- Architecture: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- Performance (Test Set):
  - Accuracy: 89.4%
  - F1 Score: 89.4%
2. ViralBoost (GradientBoostingClassifier)
- File: sequence_model.joblib
- Purpose: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- Architecture: Gradient Boosting (sklearn) trained on real-world viral sequences.
- Performance (Test Set):
  - Accuracy: 99.4%
  - F1 Score: 99.4%
- Classes: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
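Both models are described as relying on biological features such as k-mers and GC content. The snippet below is a minimal sketch of how such features can be computed from a raw sequence. The function names `gc_content` and `kmer_frequencies`, the choice of `k`, and the feature ordering are illustrative assumptions; the features the shipped models actually expect are defined in inference.py.

```python
from collections import Counter
from itertools import product


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the alphabet ACGT."""
    seq = sequence.upper()
    # Slide a window of width k across the sequence.
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    # Enumerate k-mers in a fixed order so the vector layout is stable.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (for example N) are left out of the total.
    total = sum(counts[km] for km in all_kmers)
    return {km: (counts[km] / total if total else 0.0) for km in all_kmers}


# Illustrative feature vector: GC content followed by 2-mer frequencies.
# This layout is an assumption, not the layout used by the shipped models.
features = [gc_content("ATGCGC")] + list(kmer_frequencies("ATGCGC", k=2).values())
```

Enumerating all 4^k k-mers in a fixed order keeps the vector length constant regardless of which k-mers appear in a given sequence, which is what a scikit-learn estimator needs.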
Usage
Because both models operate on extracted biological features rather than raw sequences, we provide a standalone inference.py script that handles feature extraction and prediction.
- Download all files (the .joblib files and inference.py).
- Use the inference.py script:
from inference import predict_dna
sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)
print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
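Sequences are often stored as FASTA records rather than bare strings. The helper below is a hedged sketch of turning FASTA text into a single sequence string that could then be passed to `predict_dna`. The name `sequence_from_fasta` is hypothetical, and whether `predict_dna` requires uppercase input is an assumption; check inference.py before relying on it.

```python
def sequence_from_fasta(text: str) -> str:
    """Return the first record's sequence as one uppercase string."""
    parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header or parts:
                break  # a second header means the first record has ended
            seen_header = True
            continue
        parts.append(line.upper())
    return "".join(parts)
```

With inference.py available, the result could be used as `predict_dna(sequence_from_fasta(fasta_text))`.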
Alternatively, you can load components manually:
import joblib
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for Feature Extraction logic)
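Once the components are loaded, a manual prediction would typically scale the feature vector and pass it to the classifier. The sketch below assumes that scaler_rf.joblib is applied before dna_classifier.joblib and that the feature vector was produced by the extraction logic in inference.py. The helper names `load_components` and `predict_with_components` are hypothetical.

```python
def load_components(model_path="dna_classifier.joblib", scaler_path="scaler_rf.joblib"):
    """Load the classifier and its scaler from disk."""
    import joblib  # imported here so the helper below has no hard dependency

    return joblib.load(model_path), joblib.load(scaler_path)


def predict_with_components(classifier, scaler, feature_vector):
    """Scale one feature vector and return the classifier's predicted label."""
    # scikit-learn estimators expect 2-D input: one row per sample.
    row = [list(feature_vector)]
    scaled = scaler.transform(row)
    return classifier.predict(scaled)[0]
```

The feature vector must have the same length and ordering used at training time; a mismatch will either raise an error in scikit-learn or silently produce meaningless predictions.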
Training Meta
- Generated By: DNA Governance Console (vparka)
- Framework: scikit-learn