"""Validation and Optimization Module README
This directory contains the complete 9-phase optimization and validation framework
for the MEDCARE-DDI AI system.
FILES
=====
Master Orchestration:
- run_complete_workflow.py
Execute all 9 phases or a subset
Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]
Phase 1: Dataset Audit
- dataset_audit.py
Detect duplicates, conflicts, class imbalance, data quality issues
Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv
Phase 2: Embedding Benchmarks
- embedding_benchmark.py
Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
Output: embedding_benchmark_results.csv, embedding_ablation_report.md
Phase 3: Hyperparameter Optimization
- optuna_hyperparameter_tune.py
Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
Usage: python optuna_hyperparameter_tune.py --n-trials 50
Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md
Phase 4: Ensemble Ablation
- ensemble_ablation_study.py
Compare voting, blending, stacking strategies
Output: ensemble_benchmark.csv, ensemble_ablation.md
Phase 5: Healthcare Safety Tuning
- healthcare_safety_tuning.py
Analyze false negatives, optimize severe thresholds
Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json
Phase 6: Explainability Validation
- explainability_validation.py
Feature importance, SHAP, example explanations
Output: explainability_examples.md, feature_importance.csv
Phase 7: Comprehensive Benchmarks
- comprehensive_benchmark.py
Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
Output: final_benchmark_report.md, benchmark_metrics.json
Phase 8: Production Validation
- production_validation.py
FastAPI compatibility, CPU/GPU inference, latency, readiness
Output: production_validation_report.md, production_readiness_report.md, final_model_card.md
Existing Tests:
- test_multimodal_components.py
Unit tests for calibration, ensemble, embeddings, molecular features
Usage: python -m unittest test_multimodal_components -v
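The Phase 1 checks listed above (duplicates, conflicting labels, class imbalance) can be illustrated on toy records. This is a minimal sketch, not the logic in dataset_audit.py: audit_pairs and the (drug_a, drug_b, label) record layout are hypothetical, and treating (a, b) and (b, a) as the same pair is an assumption.

```python
from collections import Counter, defaultdict

def audit_pairs(records):
    # records: (drug_a, drug_b, label) tuples; this schema is an assumption.
    records = list(records)
    seen = Counter()
    labels = defaultdict(set)
    for drug_a, drug_b, label in records:
        # Sort the names so (a, b) and (b, a) map to the same key.
        pair = tuple(sorted((drug_a.lower(), drug_b.lower())))
        seen[(pair, label)] += 1
        labels[pair].add(label)
    # Every repeat of an identical (pair, label) row counts as one duplicate.
    duplicates = sum(count - 1 for count in seen.values())
    # A conflict is a pair that appears with more than one label.
    conflicts = sorted(pair for pair, found in labels.items() if len(found) > 1)
    class_counts = Counter(label for _, _, label in records)
    return {
        "duplicates": duplicates,
        "conflicts": conflicts,
        "class_counts": dict(class_counts),
    }

# Made-up records for illustration only.
toy_records = [
    ("aspirin", "warfarin", "severe"),
    ("Warfarin", "aspirin", "severe"),
    ("aspirin", "ibuprofen", "moderate"),
    ("ibuprofen", "aspirin", "mild"),
]
report = audit_pairs(toy_records)
print(report)
```

Here the second row is a duplicate of the first, and the last two rows disagree on the label for the same pair.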
QUICK START
===========
Run all 9 phases:
```bash
cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py
```
Run specific phases:
```bash
python src/validation/run_complete_workflow.py --phases 1 3 7
```
Run an individual phase:
```bash
python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
```
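The --phases and --skip-phases flags could be resolved into a run list roughly as follows. This is a minimal sketch, not the code in run_complete_workflow.py: select_phases is a hypothetical helper, and applying the skip list after the include list is an assumption.

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1 through 9

def select_phases(argv=None):
    parser = argparse.ArgumentParser(description="Select workflow phases (sketch)")
    parser.add_argument("--phases", type=int, nargs="+", default=ALL_PHASES)
    parser.add_argument("--skip-phases", type=int, nargs="*", default=[])
    args = parser.parse_args(argv)
    # Reject phase numbers outside 1..9 with a normal argparse error.
    for phase in list(args.phases) + list(args.skip_phases):
        if phase not in ALL_PHASES:
            parser.error(f"unknown phase: {phase}")
    skipped = set(args.skip_phases)
    # Run the requested phases in ascending order, minus the skipped ones.
    return [phase for phase in sorted(set(args.phases)) if phase not in skipped]

print(select_phases(["--phases", "1", "3", "7"]))
print(select_phases(["--skip-phases", "4"]))
```

Passing an explicit argument list keeps the helper easy to test; with no argument it falls back to the process command line.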
OUTPUT REPORTS
==============
All reports saved to: MEDCARE-DDI-AI/models/reports/
Key Reports (in reading order):
1. dataset_audit_report.json
   → Data quality overview
2. embedding_benchmark_results.csv
   → Best embedding model to use
3. optuna_best_params.json
   → Optimal hyperparameters (use for training)
4. ensemble_benchmark.csv
   → Best ensemble strategy (voting/blending/stacking)
5. safety_analysis_report.md
   → False negatives and recommended thresholds
6. final_benchmark_report.md
   → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
7. production_readiness_report.md
   → Deployment checklist and instructions
8. final_model_card.md
   → Production model specification
TARGET METRICS
==============
Success criteria (all must be met):
✓ Accuracy ≥ 85%
✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
✓ AUROC ≥ 0.90
✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
✓ p99 Latency < 200ms
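Because every criterion must hold, the targets can be checked mechanically. The sketch below is illustrative only: the metric key names and the use of fractions rather than percentages are assumptions, not the confirmed schema of benchmark_metrics.json.

```python
# Thresholds copied from the list above; key names are hypothetical.
TARGETS = {
    "accuracy": (">=", 0.85),
    "severe_recall": (">=", 0.90),
    "auroc": (">=", 0.90),
    "ece": ("<", 0.05),
    "p99_latency_ms": ("<", 200.0),
}

def failed_targets(metrics):
    # Return the names of all criteria that are not met.
    failures = []
    for name, (op, bound) in TARGETS.items():
        value = metrics[name]
        ok = value >= bound if op == ">=" else value < bound
        if not ok:
            failures.append(name)
    return failures

# Made-up numbers for illustration only.
example_metrics = {
    "accuracy": 0.88,
    "severe_recall": 0.87,
    "auroc": 0.93,
    "ece": 0.04,
    "p99_latency_ms": 150.0,
}
print(failed_targets(example_metrics))
```

In this made-up example only severe recall misses its target, so the run as a whole would not pass.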
WORKFLOW DIAGRAM
================
Data
  ↓
Phase 1: Dataset Audit
  ├─→ data quality issues detected?
  └─→ conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├─→ Compare 4 embedding models
  ├─→ Choose best (by severe recall)
  └─→ Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├─→ Optuna: 50 trials with healthcare objective
  ├─→ Find: best LR, dropout, hidden_dim, batch_size
  └─→ Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├─→ Compare: voting, blending, stacking
  ├─→ Choose best ensemble strategy
  └─→ Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├─→ Analyze false negatives
  ├─→ Optimize severe escalation threshold
  └─→ Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├─→ Feature importance ranking
  ├─→ Example predictions + rationales
  └─→ Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├─→ Final metrics on all dimensions
  ├─→ Confusion matrices
  └─→ Check: all targets met?
  ↓
Phase 8: Production Validation
  ├─→ FastAPI compatibility
  ├─→ Latency benchmarks (p99 < 200ms?)
  └─→ Deployment readiness
  ↓
Phase 9: Model Selection
  └─→ Choose production model + deployment instructions
TROUBLESHOOTING
===============
If Severe Recall < 90%:
1. Check safety_analysis_report.md for FN breakdown
2. Re-run Phase 5 with lower threshold (0.30-0.35)
3. Re-run Phase 3 with higher focal_gamma
4. Increase data collection (Phase 1 may show missing classes)
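Step 2 above relies on the fact that lowering the severe escalation threshold can only keep or raise severe recall, at the cost of more false alarms. A minimal sketch on made-up probabilities follows; the function names and the rule "flag as severe when P(severe) >= threshold" are assumptions, not the logic in healthcare_safety_tuning.py.

```python
def severe_recall(probs, is_severe, threshold):
    # Share of truly severe cases whose P(severe) reaches the threshold.
    hits = sum(1 for p, y in zip(probs, is_severe) if y and p >= threshold)
    total = sum(1 for y in is_severe if y)
    return hits / total if total else 0.0

def false_alarms(probs, is_severe, threshold):
    # Non-severe cases that would still be escalated.
    return sum(1 for p, y in zip(probs, is_severe) if not y and p >= threshold)

# Made-up scores for illustration only.
probs = [0.92, 0.55, 0.41, 0.33, 0.38, 0.10]
is_severe = [True, True, True, True, False, False]

for threshold in (0.50, 0.35, 0.30):
    recall = severe_recall(probs, is_severe, threshold)
    alarms = false_alarms(probs, is_severe, threshold)
    print(f"threshold={threshold:.2f} recall={recall:.2f} false_alarms={alarms}")
```

On this toy data, moving the threshold from 0.50 to 0.30 lifts recall from 0.50 to 1.00 and adds one false alarm, which is the trade-off to review in safety_analysis_report.md.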
If ECE > 0.05:
1. Use calibrated voting (Phase 4)
2. Apply temperature scaling (Phase 5)
3. Fit the temperature on held-out validation data
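Temperature scaling divides the logits by a single scalar T fitted on validation data, which softens overconfident probabilities without changing the predicted class. The sketch below uses a plain grid search over T to minimize negative log-likelihood; the helper names, the grid, and the toy logits are assumptions, and the project scripts may fit T differently.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    # Mean negative log-likelihood of the true labels.
    loss = 0.0
    for logits, label in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[label])
    return loss / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    # Grid search over T in 0.50 .. 5.00; a real fit might use an optimizer.
    grid = grid or [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Made-up validation logits: confident everywhere, but one of four is wrong.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
val_labels = [0, 1, 1, 0]

best_t = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {best_t:.2f}")
print("top-class confidence before:", round(max(softmax(val_logits[0])), 3))
print("top-class confidence after: ", round(max(softmax(val_logits[0], best_t)), 3))
```

Because the toy model is right 3 times out of 4 but almost certain each time, the fitted T comes out above 1 and pulls the top-class confidence down toward the observed accuracy.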
If p99 Latency > 200ms:
1. Reduce hidden_dim in Phase 3
2. Fewer ensemble models in Phase 4
3. Deploy on GPU instead of CPU
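Before and after applying any of these changes, it helps to measure p50 and p99 the same way each time. The sketch below uses the nearest-rank percentile definition and times an arbitrary callable; dummy_predict is a placeholder workload, not the real model, and the project benchmarks may compute percentiles differently.

```python
import math
import time

def percentile(samples, q):
    # Nearest-rank percentile for an integer q between 1 and 100.
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def measure_latency_ms(predict, payload, repeats=200):
    # Time repeated calls to any callable and return milliseconds per call.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def dummy_predict(payload):
    # Placeholder workload standing in for model inference.
    return sum(len(value) for value in payload.values())

latencies = measure_latency_ms(dummy_predict, {"drug_a": "aspirin", "drug_b": "warfarin"})
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"p50={p50:.4f} ms p99={p99:.4f} ms within budget: {p99 < 200.0}")
```

A tail percentile such as p99 needs enough samples to be meaningful, so a few hundred calls per measurement is a reasonable lower bound.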
REPRODUCIBILITY
================
All scripts use fixed seed (2026) for reproducibility:
- Train/test splits deterministic
- Model initialization deterministic
- Optuna trials deterministic
- Results exactly reproducible on same hardware
For exact reproduction:
1. Use Python 3.12+
2. Use exact package versions from requirements.txt
3. Run on same hardware type (CPU vs GPU type matters)
4. Don't interrupt workflow
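Fixing the seed usually means seeding every random number source a script touches. The sketch below shows the idea with the standard library and, if installed, NumPy; seed_everything is a hypothetical helper and not necessarily how the scripts here do it. Frameworks need their own calls as well, for example torch.manual_seed(seed) for PyTorch and a seeded sampler such as optuna.samplers.TPESampler(seed=seed) for Optuna.

```python
import random

SEED = 2026  # the seed value stated above

def seed_everything(seed=SEED):
    # Seed Python's own generator.
    random.seed(seed)
    # Seed NumPy's global generator when NumPy is available.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

seed_everything()
first_run = [random.random() for _ in range(3)]
seed_everything()
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)
```

Re-seeding before each run yields the same pseudo-random sequence, which is what makes splits and initializations repeatable.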
MONITORING (POST-DEPLOYMENT)
=============================
After deploying production model, monitor:
Health Check:
  curl http://localhost:8000/health
  → Should return: "status": "healthy"
Latency:
  → Track p50, p99 over time (alert if p99 > 200ms)
Severe Recall:
  → If ground truth available, track %FN on severe class
Calibration:
  → Monitor drift: is confidence still matching accuracy?
Example Request:
  curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
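The calibration check above (is confidence still matching accuracy?) is what ECE measures: predictions are grouped into confidence bins, and the gap between average confidence and accuracy is averaged across bins, weighted by bin size. This is a minimal sketch with equal-width bins; the function name, the bin count, and the binning convention are assumptions and may differ from the project implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: top-class probabilities in [0, 1]; correct: matching booleans.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, hit in bucket if hit) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

# Made-up batch: 95% confident on every case, but only 3 of 4 are right.
overconfident = expected_calibration_error([0.95] * 4, [True, True, True, False])
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(f"overconfident batch ECE={overconfident:.2f}, calibrated batch ECE={calibrated:.2f}")
```

Computing this on recent labeled traffic and comparing it against the 0.05 target gives a simple drift signal.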
DOCUMENTATION
==============
For complete details, see:
- ../OPTIMIZATION_FRAMEWORK.py
  → Comprehensive framework documentation (read this first!)
- ../README.md
  → Quick start and deployment guide
- ../MEDCARE-DDI-AI/src/inference/app_production.py
  → FastAPI backend specification
QUESTIONS
=========
Q: Which phases are required?
A: All 9 for a first-time production deployment. Phase 7 (benchmarks) is the minimum validation.
Q: How long does full workflow take?
A: ~2-3 hours on modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).
Q: Can I run phases out of order?
A: Only partly, because some phases depend on earlier ones: Phase 1 informs the
validation sets used from Phase 3 onward, and Phase 2 results are used in Phase 3.
In general, run 1-9 in order.
Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md.
Most failures are due to missing dependencies or data files.
See requirements.txt and ensure DDInter data is in data/processed/.
Q: How do I select which model to deploy?
A: Use final_model_card.md which ranks models by:
severe_recall (40%) > calibration (20%) > auroc (20%) > stability (10%) > latency (10%)
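One way to read these weights is as a weighted sum over per-criterion scores. The sketch below assumes each criterion has already been normalized to the range 0 to 1 with higher meaning better; that normalization, the helper names, and the candidate scores are all hypothetical, since the model card's exact formula is not given here.

```python
# Weights as listed above.
WEIGHTS = {
    "severe_recall": 0.40,
    "calibration": 0.20,
    "auroc": 0.20,
    "stability": 0.10,
    "latency": 0.10,
}

def selection_score(scores):
    # Weighted sum of normalized scores (0 to 1, higher is better).
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

def rank_models(candidates):
    # Best candidate first.
    return sorted(candidates, key=lambda name: selection_score(candidates[name]), reverse=True)

# Made-up candidates for illustration only.
candidates = {
    "candidate_a": {"severe_recall": 0.95, "calibration": 0.80, "auroc": 0.92,
                    "stability": 0.90, "latency": 0.60},
    "candidate_b": {"severe_recall": 0.80, "calibration": 0.90, "auroc": 0.93,
                    "stability": 0.90, "latency": 0.90},
}
for name in rank_models(candidates):
    print(name, round(selection_score(candidates[name]), 3))
```

With severe recall carrying 40% of the weight, candidate_a ranks first here despite being slower and less well calibrated.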
Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments.
See --help for each script:
python src/validation/optuna_hyperparameter_tune.py --help
"""
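from pathlib import Path

if __name__ == '__main__':
    # Write the module docstring next to this script as README.md.
    output_path = Path(__file__).parent / 'README.md'
    # The docstring contains non-ASCII symbols, so set the encoding explicitly.
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')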