"""Validation and Optimization Module README

This directory contains the complete 9-phase optimization and validation
framework for the MEDCARE-DDI AI system.

FILES
=====

Master Orchestration:
- run_complete_workflow.py
  Execute all 9 phases or subset
  Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]

Phase 1: Dataset Audit
- dataset_audit.py
  Detect duplicates, conflicts, class imbalance, data quality issues
  Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv

Phase 2: Embedding Benchmarks
- embedding_benchmark.py
  Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
  Output: embedding_benchmark_results.csv, embedding_ablation_report.md

Phase 3: Hyperparameter Optimization
- optuna_hyperparameter_tune.py
  Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
  Usage: python optuna_hyperparameter_tune.py --n-trials 50
  Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md

Phase 4: Ensemble Ablation
- ensemble_ablation_study.py
  Compare voting, blending, stacking strategies
  Output: ensemble_benchmark.csv, ensemble_ablation.md

Phase 5: Healthcare Safety Tuning
- healthcare_safety_tuning.py
  Analyze false negatives, optimize severe thresholds
  Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json

Phase 6: Explainability Validation
- explainability_validation.py
  Feature importance, SHAP, example explanations
  Output: explainability_examples.md, feature_importance.csv

Phase 7: Comprehensive Benchmarks
- comprehensive_benchmark.py
  Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
  Output: final_benchmark_report.md, benchmark_metrics.json

Phase 8: Production Validation
- production_validation.py
  FastAPI compatibility, CPU/GPU inference, latency, readiness
  Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

Existing Tests:
- test_multimodal_components.py
  Unit tests for calibration, ensemble, embeddings, molecular features
  Usage: python -m unittest test_multimodal_components -v

QUICK START
===========

Run all 9 phases:

```bash
cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py
```

Run specific phases:

```bash
python src/validation/run_complete_workflow.py --phases 1 3 7
```

Run individual phase:

```bash
python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
```

OUTPUT REPORTS
==============

All reports saved to: MEDCARE-DDI-AI/models/reports/

Key Reports (in reading order):
1. dataset_audit_report.json → Data quality overview
2. embedding_benchmark_results.csv → Best embedding model to use
3. optuna_best_params.json → Optimal hyperparameters (use for training)
4. ensemble_benchmark.csv → Best ensemble strategy (voting/blending/stacking)
5. safety_analysis_report.md → False negatives and recommended thresholds
6. final_benchmark_report.md → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
7. production_readiness_report.md → Deployment checklist and instructions
8. final_model_card.md → Production model specification

TARGET METRICS
==============

Success criteria (all must be met):
✓ Accuracy ≥ 85%
✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
✓ AUROC ≥ 0.90
✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
✓ p99 Latency < 200ms

WORKFLOW DIAGRAM
================

Data
  ↓
Phase 1: Dataset Audit
  ├─→ data quality issues detected?
  └─→ conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├─→ Compare 4 embedding models
  ├─→ Choose best (by severe recall)
  └─→ Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├─→ Optuna: 50 trials with healthcare objective
  ├─→ Find: best LR, dropout, hidden_dim, batch_size
  └─→ Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├─→ Compare: voting, blending, stacking
  ├─→ Choose best ensemble strategy
  └─→ Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├─→ Analyze false negatives
  ├─→ Optimize severe escalation threshold
  └─→ Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├─→ Feature importance ranking
  ├─→ Example predictions + rationales
  └─→ Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├─→ Final metrics on all dimensions
  ├─→ Confusion matrices
  └─→ Check: all targets met?
  ↓
Phase 8: Production Validation
  ├─→ FastAPI compatibility
  ├─→ Latency benchmarks (p99 < 200ms?)
  └─→ Deployment readiness
  ↓
Phase 9: Model Selection
  └─→ Choose production model + deployment instructions

TROUBLESHOOTING
===============

If Severe Recall < 90%:
1. Check safety_analysis_report.md for the FN breakdown
2. Re-run Phase 5 with a lower threshold (0.30-0.35)
3. Re-run Phase 3 with a higher focal_gamma
4. Collect more data (Phase 1 may show missing classes)

If ECE ≥ 0.05:
1. Use calibrated voting (Phase 4)
2. Apply temperature scaling (Phase 5)
3. Fit the temperature on the validation set

If p99 Latency ≥ 200ms:
1. Reduce hidden_dim in Phase 3
2. Use fewer ensemble models in Phase 4
3. Deploy on GPU instead of CPU

REPRODUCIBILITY
===============

All scripts use a fixed seed (2026) for reproducibility:
- Train/test splits deterministic
- Model initialization deterministic
- Optuna trials deterministic
- Results exactly reproducible on the same hardware

For exact reproduction:
1. Use Python 3.12+
2. Use the exact package versions from requirements.txt
3. Run on the same hardware type (CPU vs GPU type matters)
4. Don't interrupt the workflow

MONITORING (POST-DEPLOYMENT)
============================

After deploying the production model, monitor:

Health Check:
  curl http://localhost:8000/health
  → Should return: "status": "healthy"

Latency:
  → Track p50 and p99 over time (alert if p99 > 200ms)

Severe Recall:
  → If ground truth is available, track %FN on the severe class

Calibration:
  → Monitor drift: is confidence still matching accuracy?

Example Request:
  curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'

DOCUMENTATION
=============

For complete details, see:
- ../OPTIMIZATION_FRAMEWORK.py
  → Comprehensive framework documentation (read this first!)
- ../README.md
  → Quick start and deployment guide
- ../MEDCARE-DDI-AI/src/inference/app_production.py
  → FastAPI backend specification

QUESTIONS
=========

Q: Which phases are required?
A: All 9 for first-time production. Phase 7 (benchmarks) is the minimum validation.

Q: How long does the full workflow take?
A: ~2-3 hours on a modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).

Q: Can I run phases out of order?
A: Some phases depend on earlier ones: Phase 1 informs the validation sets
   used from Phase 3 onward, and Phase 2 results are used in Phase 3.
   Generally: run 1-9 in order.

Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md.
   Most failures are due to missing dependencies or data files.
   See requirements.txt and ensure DDInter data is in data/processed/

Q: How do I select which model to deploy?
A: Use final_model_card.md, which ranks models by:
   severe_recall (40%) > calibration (20%) > auroc (20%) > stability (10%) > latency (10%)

Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments.
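   See --help for each script:
   python src/validation/optuna_hyperparameter_tune.py --help
"""

from pathlib import Path

if __name__ == '__main__':
    # Write this module's docstring out as README.md next to this file.
    output_path = Path(__file__).parent / 'README.md'
    # The docstring contains non-ASCII characters (arrows, check marks), so
    # set the encoding explicitly rather than rely on the platform default.
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')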