	---
	tags:
	- biology
	- genomics
	- classification
	- sklearn
	- virus-detection
	- influenza
	- norovirus
	library_name: sklearn
	datasets:
	- ncbi-influenza
	- ncbi-norovirus
	---

	# BioGuard DNA Classifier Ensemble (Portable v1.5)

	This repository contains a dual-model ensemble for DNA sequence analysis and virus classification. It is optimized for portability: the `.joblib` files load without custom class dependencies.

	> [!NOTE]
	> Version 1.5 Update: Trained on 176,000+ sequences from the NCBI database. Covers 40+ Influenza A subtypes (H1N1, H3N2, H5N1, etc.) and Norovirus GII variants.

	## 📊 Training Dataset

	| Metric | Value |
	|--------|-------|
	| Total Sequences | 176,407 |
	| Unique Virus Types | 40+ |
	| Data Source | NCBI Influenza & Norovirus Database |
	| Last Updated | 2026-01-08 |

	### Top Virus Classes

	| Virus Type | Samples |
	|------------|---------|
	| Influenza A H1N1 | 11,251 |
	| Influenza A H3N2 | 9,326 |
	| Influenza A H5N1 | 6,675 |
	| Norovirus GII | 5,488 |
	| Influenza A H9N2 | 1,929 |
	| Influenza A H1N2 | 1,349 |

	## 🔄 Changelog

	### v1.5 (2026-01-08)

	- TRAINED: 176,407 sequences (5x increase from v1.4)
	- EXPANDED: 40+ virus subtypes now supported
	- IMPROVED: Better generalization with larger dataset

	### v1.4 (2026-01-08)

	- FIXED: ViralBoost model no longer over-classifies random sequences as "Influenza B (100%)"
	- ADDED: Class weighting (`compute_class_weight('balanced')`) to handle severe class imbalance
	- ADDED: Confidence thresholds in `inference.py`:
	  - General classes: 55% minimum confidence
	  - Rare classes (Influenza B, etc.): 65% minimum confidence
	  - Below threshold → classified as "Unknown"
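The thresholding rule described in the v1.4 notes can be sketched roughly as follows. This is a minimal illustration, not the code in `inference.py`: the function name `apply_threshold`, the constant names, and the contents of `RARE_CLASSES` are assumptions (the changelog names only Influenza B explicitly).

```python
# Hypothetical sketch; the real logic lives in inference.py and may differ.
GENERAL_THRESHOLD = 0.55  # minimum confidence for general classes
RARE_THRESHOLD = 0.65     # minimum confidence for rare classes
RARE_CLASSES = {"Influenza B"}  # assumption: only the class named in the changelog


def apply_threshold(label: str, confidence: float) -> str:
    """Return the label if its confidence meets the class threshold, else 'Unknown'."""
    threshold = RARE_THRESHOLD if label in RARE_CLASSES else GENERAL_THRESHOLD
    return label if confidence >= threshold else "Unknown"


if __name__ == "__main__":
    # 60% meets the general threshold but not the rare-class threshold.
    print(apply_threshold("Influenza A", 0.60))
    print(apply_threshold("Influenza B", 0.60))
```

Using a stricter cutoff for rare classes trades some recall on those classes for fewer confident false positives, which matches the over-classification problem the v1.4 fix targets.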

	### v1.3

	- Refactored to remove custom class dependencies from `.joblib` files

	## 🧬 Models Included

	This repository hosts two distinct models specialized for different aspects of genomic analysis:

	### 1. GenetiForest (RandomForestClassifier)

	- File: `dna_classifier.joblib`
	- Purpose: General-purpose synthetic vs. biological sequence classification.
	- Architecture: Random Forest (sklearn) with biological feature extraction (k-mers, GC content, etc.).
	- Performance (Test Set):
	  - Accuracy: 89.1%
	  - F1 Score: 88.4%
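To make "biological feature extraction (k-mers, GC content, etc.)" concrete, here is a minimal sketch of those two features. It is an illustration only: the function names `gc_content` and `kmer_frequencies`, the choice of `k = 3`, and the handling of ambiguous bases are assumptions, and the actual features computed in `inference.py` may differ.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every A/C/G/T k-mer, in a fixed order.

    k-mers containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocab)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocab}


if __name__ == "__main__":
    example = "ATGCGATACG"  # toy sequence, not taken from the training data
    print(f"GC content: {gc_content(example):.2f}")
    print(f"Number of 3-mer features: {len(kmer_frequencies(example))}")
```

A fixed k-mer vocabulary keeps the feature vector the same length for every sequence, which a tree-based classifier such as a Random Forest requires.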

	### 2. ViralBoost (GradientBoostingClassifier)

	- File: `sequence_model.joblib`
	- Purpose: Specific virus type identification (Influenza A, Norovirus, etc.) based on sequence signatures.
	- Architecture: Gradient Boosting (sklearn) with class-weighted training.
	- Performance (Test Set):
	  - Accuracy: 96.7%
	  - F1 Score: 97.0%
	- Classes: Other, Influenza A, Chicken anemia virus, Norovirus, Influenza B

	## 🚀 Usage

	Because both models expect features extracted from the raw sequence rather than the sequence itself, we provide a standalone `inference.py` script that handles feature extraction and prediction.

	1. Download all files (`.joblib` and `inference.py`).
	2. Use the `inference.py` script:

	```python
	from inference import predict_dna

	sequence = "ATGCTAGCTAGCTAG..."
	results = predict_dna(sequence)

	print(f"Genetic Type: {results['classification']}")
	print(f"Virus Identity: {results['virus_identity']}")
	print(f"Confidence: {results['virus_confidence']*100:.1f}%")

	# If the prediction fell below the confidence threshold, show the raw result
	if results['virus_identity'] == 'Unknown':
	    print(f"Raw prediction was: {results['raw_prediction']} ({results['raw_confidence']*100:.1f}%)")
	```

	Alternatively, you can load components manually:

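	```python
	import joblib

	# GenetiForest: the RandomForestClassifier described above
	classifier = joblib.load("dna_classifier.joblib")
	# Scaler stored alongside the classifier
	scaler = joblib.load("scaler_rf.joblib")
	# Refer to inference.py for the feature extraction logic
	```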

	## 📊 Training Metadata

	- Generated By: DNA Governance Console (vparka)
	- Framework: scikit-learn
	- Class Weighting: Balanced (inverse frequency)
	- Train/Test Split: 80/20
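"Balanced (inverse frequency)" weighting assigns each class the weight `n_samples / (n_classes * class_count)`, which is the formula scikit-learn uses for `compute_class_weight('balanced')`. The sketch below reproduces that formula in plain Python on toy labels. The helper name `balanced_class_weights` and the label counts are illustrative assumptions, not the training script or the real class distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


if __name__ == "__main__":
    # Toy labels only; these are not the real class counts.
    toy_labels = ["Influenza A"] * 8 + ["Influenza B"] * 2
    weights = balanced_class_weights(toy_labels)
    print(weights)

    # Expand to one weight per sample, e.g. for use as sample weights.
    sample_weight = [weights[label] for label in toy_labels]
    print(sample_weight[:3], sample_weight[-2:])
```

Per-sample weights like these can be passed as `sample_weight` to `fit`, which scikit-learn's `GradientBoostingClassifier` accepts. Whether the training pipeline applies the weights this way is not stated here.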