metadata
tags:
  - biology
  - genomics
  - classification
  - sklearn
  - virus-detection
  - influenza
  - norovirus
library_name: sklearn
datasets:
  - ncbi-influenza
  - ncbi-norovirus

BioGuard DNA Classifier Ensemble (Portable v1.5)

This repository contains a dual-model ensemble for DNA sequence analysis and virus classification. It is optimized for portability: the .joblib files load without any custom class dependencies.

Version 1.5 Update: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.

📊 Training Dataset

| Metric | Value |
| --- | --- |
| Total Sequences | 176,407 |
| Unique Virus Types | 40+ |
| Data Source | NCBI Influenza & Norovirus Database |
| Last Updated | 2026-01-08 |

Top Virus Classes

| Virus Type | Samples |
| --- | --- |
| Influenza A H1N1 | 11,251 |
| Influenza A H3N2 | 9,326 |
| Influenza A H5N1 | 6,675 |
| Norovirus GII | 5,488 |
| Influenza A H9N2 | 1,929 |
| Influenza A H1N2 | 1,349 |

🔄 Changelog

v1.5 (2026-01-08)

  • TRAINED: 176,407 sequences (5x increase from v1.4)
  • EXPANDED: 40+ virus subtypes now supported
  • IMPROVED: Better generalization with larger dataset

v1.4 (2026-01-08)

  • FIXED: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
  • ADDED: Class weighting (compute_class_weight('balanced')) to handle severe class imbalance
  • ADDED: Confidence thresholds in inference.py:
    • General classes: 55% minimum confidence
    • Rare classes (Influenza B, etc.): 65% minimum confidence
    • Below threshold → classified as "Unknown"
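
The threshold rule listed above can be sketched as follows. This is a minimal illustration, not the code in inference.py: the function name `apply_confidence_threshold`, the constants, and the `RARE_CLASSES` set are hypothetical, and treating a confidence exactly at the threshold as passing is an assumption. The returned keys mirror the ones used in the usage example in this README.

```python
# Hypothetical constants; the changelog states 55% and 65% minimums.
GENERAL_THRESHOLD = 0.55
RARE_THRESHOLD = 0.65
# Assumption: only "Influenza B" is named explicitly; other rare classes are not listed.
RARE_CLASSES = {"Influenza B"}


def apply_confidence_threshold(label, confidence, rare_classes=RARE_CLASSES):
    """Return the label, or "Unknown" when confidence is below its class threshold."""
    threshold = RARE_THRESHOLD if label in rare_classes else GENERAL_THRESHOLD
    accepted = confidence >= threshold  # assumption: the boundary value passes
    return {
        "virus_identity": label if accepted else "Unknown",
        "raw_prediction": label,
        "raw_confidence": confidence,
    }


result = apply_confidence_threshold("Influenza B", 0.60)
print(result["virus_identity"])
```

With these assumed values, a 60% "Influenza B" prediction falls below the rare-class minimum and is reported as "Unknown", while the same confidence for a general class is accepted.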

v1.3

  • Refactored to remove custom class dependencies from .joblib files

🧬 Models Included

This repository hosts two distinct models specialized for different aspects of genomic analysis:

1. GenetiForest (RandomForestClassifier)

  • File: dna_classifier.joblib
  • Purpose: General-purpose synthetic vs. biological sequence classification.
  • Architecture: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
  • Performance (Test Set):
    • Accuracy: 89.1%
    • F1 Score: 88.4%
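
For readers unfamiliar with the feature types named above, here is a rough sketch of GC content and k-mer frequency extraction. It is an illustration only: the actual feature set, the value of k, and the handling of ambiguous bases are defined in inference.py and may differ. The function names `gc_content` and `kmer_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(sequence, k=3):
    """Relative frequency of every A/C/G/T k-mer, as a fixed-size dict.

    Assumption: windows containing other symbols (e.g. N) are ignored.
    """
    seq = sequence.upper()
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


features = kmer_frequencies("ATGCAT", k=2)
print(gc_content("ATGCAT"), features["AT"])
```

Using a fixed vocabulary of 4^k k-mers keeps the feature vector the same length for every sequence, which is what a scikit-learn estimator expects.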

2. ViralBoost (GradientBoostingClassifier)

  • File: sequence_model.joblib
  • Purpose: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
  • Architecture: Gradient Boosting (sklearn) with class-weighted training.
  • Performance (Test Set):
    • Accuracy: 96.7%
    • F1 Score: 97.0%
  • Classes: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

🚀 Usage

Because both models rely on biological feature extraction, we provide a standalone inference.py script that handles it for you.

  1. Download all files (.joblib and inference.py).
  2. Use the inference.py script:
from inference import predict_dna

sequence = "ATGCTAGCTAGCTAG..."
results = predict_dna(sequence)

print(f"Genetic Type: {results['classification']}")
print(f"Virus Identity: {results['virus_identity']}")
print(f"Confidence: {results['virus_confidence']*100:.1f}%")

# Check if prediction was below threshold
if results['virus_identity'] == 'Unknown':
    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")

Alternatively, you can load components manually:

import joblib
classifier = joblib.load("dna_classifier.joblib")
scaler = joblib.load("scaler_rf.joblib")
# (Refer to inference.py for Feature Extraction logic)

📊 Training Meta

  • Generated By: DNA Governance Console (vparka)
  • Framework: scikit-learn
  • Class Weighting: Balanced (inverse frequency)
  • Train/Test Split: 80/20
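
As a rough illustration of "balanced (inverse frequency)" weighting, the sketch below reproduces the formula scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * class_count)`, in plain Python. The labels are toy data, not the real training set, and the helper name `balanced_class_weights` is hypothetical. Since scikit-learn's GradientBoostingClassifier has no `class_weight` parameter, one way to apply such weights is to expand them per sample and pass them as `sample_weight` to `fit`; whether the training pipeline does exactly this is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of each class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy, imbalanced label list for illustration only.
labels = ["Influenza A"] * 6 + ["Norovirus"] * 3 + ["Influenza B"] * 1
weights = balanced_class_weights(labels)

# One weight per training sample, suitable for fit(X, y, sample_weight=...).
sample_weight = [weights[label] for label in labels]
print(weights)
```

Rare classes receive proportionally larger weights, so each class contributes the same total weight to the training loss.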