metadata
tags:
- biology
- genomics
- classification
- sklearn
library_name: sklearn
BioGuard DNA Classifier Ensemble (Portable v1.3)
This repository contains a dual-model ensemble for DNA sequence analysis and virus classification, optimized for portability and zero-dependency loading.
Version 1.3 Update: This version has been fully refactored to remove all custom class dependencies from the .joblib files. Feature extraction is now handled strictly in source code.
Models Included
This repository hosts two distinct models specialized for different aspects of genomic analysis:
1. GenetiForest (RandomForestClassifier)
- File: dna_classifier.joblib
- Purpose: General-purpose synthetic vs. biological sequence classification.
- Architecture: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
- Performance (Test Set):
  - Accuracy: 88.9%
  - F1 Score: 89.2%
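The architecture note above names k-mers and GC content as features. As a rough illustration of what such features can look like, the sketch below computes GC content and normalized k-mer frequencies in plain Python. The function names (`gc_content`, `kmer_frequencies`) and the default `k=3` are assumptions made for this example; the feature set the models were actually trained on is defined in inference.py and may differ.

```python
from collections import Counter
from itertools import product


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence: str, k: int = 3) -> dict:
    """Normalized counts of every A/C/G/T k-mer, in a fixed sorted order.

    Windows containing other characters (for example N) are skipped.
    """
    seq = sequence.upper()
    # Fixed vocabulary so every sequence maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


if __name__ == "__main__":
    example = "ATGCGC"
    print(gc_content(example))
    print(kmer_frequencies(example, k=2)["GC"])
```

Using a fixed vocabulary keeps the feature vector the same length for every input, which is what a scikit-learn estimator needs at prediction time.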
2. ViralBoost (GradientBoostingClassifier)
- File: sequence_model.joblib
- Purpose: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
- Architecture: Gradient Boosting (sklearn) trained on real-world viral sequences.
- Performance (Test Set):
  - Accuracy: 99.3%
  - F1 Score: 99.3%
- Classes: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B
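When a single predicted label is not enough, per-class probabilities can be inspected. scikit-learn classifiers such as GradientBoostingClassifier expose `predict_proba` and a `classes_` attribute; the helper below (`rank_classes`, a hypothetical name) pairs the two and sorts them. It is only a sketch: whether the saved model's `classes_` holds the class names listed above or encoded integers is not stated here, and `feature_row` must already be in the format produced by inference.py.

```python
def rank_classes(classifier, feature_row):
    """Return (label, probability) pairs sorted from most to least likely.

    `classifier` is any fitted estimator exposing predict_proba and classes_;
    `feature_row` is one already-extracted (and, if needed, scaled) feature vector.
    """
    # predict_proba expects a 2-D input, so wrap the single row in a list.
    probabilities = classifier.predict_proba([feature_row])[0]
    pairs = ((label, float(p)) for label, p in zip(classifier.classes_, probabilities))
    return sorted(pairs, key=lambda item: item[1], reverse=True)
```

The first element of the returned list corresponds to the most probable class, and the remaining entries show how close the alternatives were.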
Usage
Since these models use biological feature extraction, we provide a standalone inference.py script for easy usage.
- Download all files (.joblib and inference.py).
- Use the inference.py script:
from inference import predict_dna
sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)
print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
Alternatively, you can load components manually:
import joblib
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for Feature Extraction logic)
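Once loaded, the components still have to be chained in the right order: extract features, scale them, then predict. The sketch below shows one plausible way to wire that up. `predict_with_components` and its `extract_features` argument are hypothetical names introduced for illustration; the real feature extraction lives in inference.py, and the feature order must match what the scaler and classifier were fitted on.

```python
from typing import Callable, Sequence


def predict_with_components(
    sequence: str,
    extract_features: Callable[[str], Sequence[float]],
    scaler,
    classifier,
):
    """Run one sequence through feature extraction, scaling, and prediction."""
    features = list(extract_features(sequence))
    # scikit-learn transformers and estimators expect 2-D input: one row per sample.
    scaled = scaler.transform([features])
    # predict returns one label per row; there is a single row here.
    return classifier.predict(scaled)[0]
```

With the real files, `scaler` and `classifier` would be the objects returned by `joblib.load` above, and `extract_features` would be whichever function inference.py uses to build the feature vector.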
Training Meta
- Generated By: DNA Governance Console (vparka)
- Framework: scikit-learn