tags:
  - biology
  - genomics
  - classification
  - sklearn
library_name: sklearn

BioGuard DNA Classifier Ensemble (Portable v1.4)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification. The models are packaged for portability: the .joblib files load without custom class dependencies (scikit-learn and joblib are still required).

Version 1.4 Update: Fixed a critical overfitting issue in the ViralBoost model. Added class weighting for imbalanced training data and minimum-confidence thresholds for predictions, with a stricter threshold for rare classes.

🔄 Changelog

v1.4 (2026-01-08)

FIXED: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
ADDED: Class weighting (compute_class_weight('balanced')) to handle severe class imbalance
ADDED: Confidence thresholds in inference.py:
- General classes: 55% minimum confidence
- Rare classes (Influenza B, etc.): 65% minimum confidence
- Below threshold → classified as "Unknown"
IMPROVED: Reduced overfitting with max_depth=3, min_samples_leaf=10
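
The thresholding rule listed above can be sketched as follows. This is a minimal illustration, not the code in inference.py: the function name apply_confidence_threshold, the RARE_CLASSES set (the changelog only says "Influenza B, etc."), and the input format (a mapping from class name to probability) are all assumptions. The returned keys mirror the ones shown in the Usage section.

```python
from typing import Dict, Mapping

GENERAL_THRESHOLD = 0.55  # minimum confidence for general classes (from the changelog)
RARE_THRESHOLD = 0.65     # minimum confidence for rare classes (from the changelog)
RARE_CLASSES = {"Influenza B"}  # assumption: the full list of rare classes is not given


def apply_confidence_threshold(probabilities: Mapping[str, float]) -> Dict[str, object]:
    """Pick the most probable class, or "Unknown" if it falls below its threshold."""
    raw_label = max(probabilities, key=probabilities.get)
    raw_confidence = probabilities[raw_label]
    # Rare classes must clear the stricter threshold
    threshold = RARE_THRESHOLD if raw_label in RARE_CLASSES else GENERAL_THRESHOLD
    accepted = raw_confidence >= threshold
    return {
        "virus_identity": raw_label if accepted else "Unknown",
        "virus_confidence": raw_confidence,
        "raw_prediction": raw_label,
        "raw_confidence": raw_confidence,
    }
```

With these values, an "Influenza B" prediction at 60% confidence would be reported as "Unknown", while an "Influenza A" prediction at the same confidence would be accepted.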

v1.3

Refactored to remove custom class dependencies from .joblib files

🧬 Models Included

This repository hosts two distinct models specialized for different aspects of genomic analysis:

1. GenetiForest (RandomForestClassifier)

File: dna_classifier.joblib
Purpose: General-purpose synthetic vs. biological sequence classification.
Architecture: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
Performance (Test Set):
- Accuracy: 89.4%
- F1 Score: 89.9%
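
To illustrate the kind of features named above, here is a minimal sketch of GC content and k-mer frequency extraction. It is not the feature set used by inference.py: the function names, the choice of k, and the feature ordering are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from typing import Dict, List


def gc_content(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence: str, k: int = 3) -> Dict[str, float]:
    """Relative frequency of every possible A/C/G/T k-mer, in a fixed order."""
    seq = sequence.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(windows)
    total = max(len(windows), 1)  # avoid division by zero on short input
    # Enumerate all 4**k k-mers so the feature vector always has the same length.
    # Windows containing other symbols (e.g. N) count toward the total but get no slot.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {kmer: counts[kmer] / total for kmer in kmers}


def extract_features(sequence: str, k: int = 3) -> List[float]:
    """Concatenate GC content with the k-mer frequency vector."""
    return [gc_content(sequence), *kmer_frequencies(sequence, k).values()]
```

A fixed-length vector like this is what a scikit-learn classifier expects as one input row. The real feature extraction must match what the models were trained on, so use inference.py for actual predictions.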

2. ViralBoost (GradientBoostingClassifier)

File: sequence_model.joblib
Purpose: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
Architecture: Gradient Boosting (sklearn) with class-weighted training.
Performance (Test Set):
- Accuracy: 98.1%
- F1 Score: 98.2%
Classes: Other, Influenza A, Norovirus, Influenza B

🚀 Usage

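Both models operate on features extracted from the raw sequence rather than on the sequence itself, so the repository includes a standalone inference.py script that handles feature extraction and prediction.

1. Download all files (the .joblib files and inference.py).
2. Import predict_dna from inference.py and pass it a sequence: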

from inference import predict_dna

sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")

# New in v1.4: Check if prediction was below threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")

Alternatively, you can load components manually:

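import joblib

# Load the GenetiForest classifier and the scaler stored in scaler_rf.joblib
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# Features must be extracted as in inference.py before scaling and predicting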

📊 Training Meta

Generated By: DNA Governance Console (vparka)
Framework: scikit-learn
Class Weighting: Balanced (inverse frequency)
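
For reference, "balanced" weighting in scikit-learn assigns each class the weight n_samples / (n_classes * class_count), so rarer classes receive larger weights. The sketch below reproduces that formula in plain Python to show the effect. The helper names and the example label counts are hypothetical and are not taken from the training data.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Inverse-frequency weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def per_sample_weights(labels: Sequence[Hashable]) -> List[float]:
    """Expand class weights to one weight per training sample."""
    weights = balanced_class_weights(labels)
    return [weights[label] for label in labels]


# Hypothetical, heavily imbalanced label set for illustration
labels = ["Other"] * 6 + ["Influenza A"] * 3 + ["Influenza B"] * 1
weights = balanced_class_weights(labels)
```

Because GradientBoostingClassifier has no class_weight parameter, one common way to apply such weights is to pass the per-sample list as sample_weight to fit. Whether the training pipeline here does exactly that is not stated in this README.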