"""Validation and Optimization Module README

This directory contains the complete 9-phase optimization and validation framework
for the MEDCARE-DDI AI system.

FILES
=====

Master Orchestration:
- run_complete_workflow.py
  Execute all 9 phases or subset
  Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]

Phase 1: Dataset Audit
- dataset_audit.py
  Detect duplicates, conflicts, class imbalance, data quality issues
  Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv

Phase 2: Embedding Benchmarks
- embedding_benchmark.py
  Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
  Output: embedding_benchmark_results.csv, embedding_ablation_report.md

Phase 3: Hyperparameter Optimization
- optuna_hyperparameter_tune.py
  Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
  Usage: python optuna_hyperparameter_tune.py --n-trials 50
  Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md

Phase 4: Ensemble Ablation
- ensemble_ablation_study.py
  Compare voting, blending, stacking strategies
  Output: ensemble_benchmark.csv, ensemble_ablation.md

Phase 5: Healthcare Safety Tuning
- healthcare_safety_tuning.py
  Analyze false negatives, optimize severe thresholds
  Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json

Phase 6: Explainability Validation
- explainability_validation.py
  Feature importance, SHAP, example explanations
  Output: explainability_examples.md, feature_importance.csv

Phase 7: Comprehensive Benchmarks
- comprehensive_benchmark.py
  Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
  Output: final_benchmark_report.md, benchmark_metrics.json

Phase 8: Production Validation
- production_validation.py
  FastAPI compatibility, CPU/GPU inference, latency, readiness
  Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

Existing Tests:
- test_multimodal_components.py
  Unit tests for calibration, ensemble, embeddings, molecular features
  Usage: python -m unittest test_multimodal_components -v

QUICK START
===========

Run all 9 phases:
```bash
cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py
```

Run specific phases:
```bash
python src/validation/run_complete_workflow.py --phases 1 3 7
```

Run individual phase:
```bash
python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
```
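
The `--phases` and `--skip-phases` flags shown above could be parsed along these lines. This is a minimal sketch, not the actual code in run_complete_workflow.py: the function name `select_phases` and the rule that skipped phases are subtracted from the selected ones are assumptions made for illustration.

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1..9


def select_phases(argv=None):
    # Return the sorted list of phase numbers to run.
    # Hypothetical helper: selected phases minus skipped phases.
    parser = argparse.ArgumentParser(description="Run validation phases")
    parser.add_argument("--phases", type=int, nargs="+", choices=ALL_PHASES,
                        default=ALL_PHASES, help="phases to run (default: all)")
    parser.add_argument("--skip-phases", type=int, nargs="+", choices=ALL_PHASES,
                        default=[], help="phases to leave out")
    args = parser.parse_args(argv)
    skipped = set(args.skip_phases)
    return sorted(p for p in set(args.phases) if p not in skipped)


print(select_phases(["--phases", "1", "3", "7"]))  # [1, 3, 7]
print(select_phases(["--skip-phases", "4"]))       # [1, 2, 3, 5, 6, 7, 8, 9]
```

Passing an explicit list to `select_phases` keeps the logic testable without touching `sys.argv`.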

OUTPUT REPORTS
==============

All reports saved to: MEDCARE-DDI-AI/models/reports/

Key Reports (in reading order):
1. dataset_audit_report.json
   → Data quality overview
2. embedding_benchmark_results.csv
   → Best embedding model to use
3. optuna_best_params.json
   → Optimal hyperparameters (use for training)
4. ensemble_benchmark.csv
   → Best ensemble strategy (voting/blending/stacking)
5. safety_analysis_report.md
   → False negatives and recommended thresholds
6. final_benchmark_report.md
   → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
7. production_readiness_report.md
   → Deployment checklist and instructions
8. final_model_card.md
   → Production model specification

TARGET METRICS
==============

Success criteria (all must be met):
✓ Accuracy ≥ 85%
✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
✓ AUROC ≥ 0.90
✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
✓ p99 Latency < 200ms
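
The success criteria above can be checked mechanically once the metrics are available as a dictionary. The sketch below is illustrative only: the key names (`accuracy`, `severe_recall`, `auroc`, `ece`, `p99_latency_ms`) are hypothetical and may not match the fields in benchmark_metrics.json, and the example values are made up.

```python
import operator

# (metric key, comparison, threshold) mirroring the success criteria.
# The key names are hypothetical, not taken from benchmark_metrics.json.
TARGETS = [
    ("accuracy", operator.ge, 0.85),
    ("severe_recall", operator.ge, 0.90),
    ("auroc", operator.ge, 0.90),
    ("ece", operator.lt, 0.05),
    ("p99_latency_ms", operator.lt, 200.0),
]


def failed_targets(metrics):
    # Return the keys of all targets that are missing or not met.
    failures = []
    for key, compare, threshold in TARGETS:
        value = metrics.get(key)
        if value is None or not compare(value, threshold):
            failures.append(key)
    return failures


# Made-up numbers for illustration only.
example = {"accuracy": 0.88, "severe_recall": 0.87, "auroc": 0.93,
           "ece": 0.04, "p99_latency_ms": 150.0}
print(failed_targets(example))  # ['severe_recall']
```

Treating a missing metric as a failure is a deliberate choice here, since all criteria must be met.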

WORKFLOW DIAGRAM
================

Data
  ↓
Phase 1: Dataset Audit
  ├── data quality issues detected?
  └── conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├── Compare 4 embedding models
  ├── Choose best (by severe recall)
  └── Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├── Optuna: 50 trials with healthcare objective
  ├── Find: best LR, dropout, hidden_dim, batch_size
  └── Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├── Compare: voting, blending, stacking
  ├── Choose best ensemble strategy
  └── Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├── Analyze false negatives
  ├── Optimize severe escalation threshold
  └── Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├── Feature importance ranking
  ├── Example predictions + rationales
  └── Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├── Final metrics on all dimensions
  ├── Confusion matrices
  └── Check: all targets met?
  ↓
Phase 8: Production Validation
  ├── FastAPI compatibility
  ├── Latency benchmarks (p99 < 200ms?)
  └── Deployment readiness
  ↓
Phase 9: Model Selection
  └── Choose production model + deployment instructions

TROUBLESHOOTING
===============

If Severe Recall < 90%:
1. Check safety_analysis_report.md for the false-negative (FN) breakdown
2. Re-run Phase 5 with a lower severe threshold (0.30-0.35)
3. Re-run Phase 3 with a higher focal_gamma
4. Collect more data (the Phase 1 audit may show missing classes)
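
To see why lowering the severe threshold raises severe recall, the sketch below computes recall at a given threshold and searches for the highest threshold that still reaches the 90% target. It is a simplified illustration, not the logic of healthcare_safety_tuning.py: the function names, the 0.05-step candidate grid, and the toy probabilities are all assumptions.

```python
def severe_recall(p_severe, is_severe, threshold):
    # Recall on the severe class when "severe" is predicted if p >= threshold.
    tp = sum(1 for p, y in zip(p_severe, is_severe) if y and p >= threshold)
    fn = sum(1 for p, y in zip(p_severe, is_severe) if y and p < threshold)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def highest_threshold_meeting(p_severe, is_severe, target=0.90, candidates=None):
    # Highest candidate threshold whose severe recall reaches the target,
    # or None if no candidate does. The grid is an illustrative choice.
    candidates = candidates or [round(0.05 * k, 2) for k in range(1, 20)]
    for t in sorted(candidates, reverse=True):
        if severe_recall(p_severe, is_severe, t) >= target:
            return t
    return None


# Toy data: predicted probability of "severe" and the true label (1 = severe).
probs = [0.9, 0.8, 0.45, 0.33, 0.6, 0.2, 0.1, 0.4]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

print(severe_recall(probs, labels, 0.5))         # 0.6
print(highest_threshold_meeting(probs, labels))  # 0.3
```

Lower thresholds trade more false positives for fewer missed severe cases, so the chosen value should also be checked against the other target metrics.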

If ECE > 0.05:
1. Use calibrated voting (Phase 4)
2. Apply temperature scaling (Phase 5)
3. Fit the temperature on the validation set
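
Temperature scaling divides the logits by a single scalar T before the softmax and picks T to minimize the negative log-likelihood on validation data. The sketch below uses a plain grid search in pure Python to keep it dependency-free; the function names, the grid range, and the toy logits are assumptions, and the project's own fitting procedure may differ.

```python
import math


def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits / temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    # Mean negative log-likelihood of the true labels at this temperature.
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(val_logits, val_labels, grid=None):
    # Pick the grid temperature with the lowest validation NLL.
    grid = grid or [0.5 + 0.1 * k for k in range(46)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


# Toy validation set: very confident logits, but only 3 of 4 predictions correct.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
val_labels = [0, 0, 0, 1]

t = fit_temperature(val_logits, val_labels)
print(t, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, t))
```

For an overconfident model like this toy one, the fitted temperature comes out above 1, which softens the predicted probabilities without changing the predicted class.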

If p99 Latency > 200ms:
1. Reduce hidden_dim in Phase 3
2. Use fewer ensemble models in Phase 4
3. Deploy on GPU instead of CPU
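
When comparing latency before and after one of these changes, it helps to measure p99 the same way each time. The sketch below uses the nearest-rank percentile definition, which is an assumption (the benchmark scripts may use a different method), and `measure_latencies_ms` is a hypothetical helper that times any zero-argument callable.

```python
import math
import time


def percentile(values, pct):
    # Nearest-rank percentile: the smallest sample such that at least
    # pct percent of all samples are less than or equal to it.
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(fn, n_calls=200):
    # Call fn() n_calls times and return per-call wall-clock latencies in ms.
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


samples = list(range(1, 101))   # stand-in latencies: 1..100 ms
print(percentile(samples, 50))  # 50
print(percentile(samples, 99))  # 99
```

In practice `fn` would wrap a single model inference call, and the first few warm-up calls are usually discarded before computing percentiles.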

REPRODUCIBILITY
===============

All scripts use a fixed seed (2026) for reproducibility:
- Train/test splits are deterministic
- Model initialization is deterministic
- Optuna trials are deterministic
- Results are exactly reproducible on the same hardware

For exact reproduction:
1. Use Python 3.12+
2. Use the exact package versions from requirements.txt
3. Run on the same hardware type (CPU vs GPU type matters)
4. Don't interrupt the workflow
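
A common way to apply one fixed seed across libraries is a small helper called at the start of each script. The sketch below is an assumption about how this could be done, not a copy of the project's code: `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random

SEED = 2026  # the fixed seed stated above


def seed_everything(seed=SEED):
    # Seed the standard library RNG and, if installed, NumPy and PyTorch.
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything()
first = [random.random() for _ in range(3)]
seed_everything()
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

For Optuna, passing a seed to the sampler (for example `optuna.samplers.TPESampler(seed=2026)`) is the usual way to make trial suggestions repeatable; whether the scripts here do exactly that is not stated in this README.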

MONITORING (POST-DEPLOYMENT)
============================

After deploying the production model, monitor:

Health Check:
  curl http://localhost:8000/health
  → Should return: "status": "healthy"

Latency:
  → Track p50 and p99 over time (alert if p99 > 200ms)

Severe Recall:
  → If ground truth is available, track the FN rate on the severe class

Calibration:
  → Monitor drift: does confidence still match accuracy?

Example Request:
  curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
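
The same request can be issued from Python with only the standard library. This is a sketch under the assumptions visible in the curl example (host, port, `/predict` path, and the `drug_a`/`drug_b` fields); the response schema is not specified here, so `predict` simply returns whatever JSON the server sends back. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the curl examples


def build_predict_request(drug_a, drug_b, base_url=BASE_URL):
    # Build (but do not send) the POST request for the /predict endpoint.
    payload = json.dumps({"drug_a": drug_a, "drug_b": drug_b}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(drug_a, drug_b, timeout=5.0):
    # Send the request and return the decoded JSON body.
    # Requires the service to be running; the response fields are not assumed.
    request = build_predict_request(drug_a, drug_b)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_predict_request("aspirin", "warfarin")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/predict
```

Separating request construction from sending makes the payload easy to inspect or unit-test without a live server.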

DOCUMENTATION
=============

For complete details, see:
- ../OPTIMIZATION_FRAMEWORK.py
  → Comprehensive framework documentation (read this first!)
- ../README.md
  → Quick start and deployment guide
- ../MEDCARE-DDI-AI/src/inference/app_production.py
  → FastAPI backend specification

QUESTIONS
=========

Q: Which phases are required?
A: All 9 for a first-time production deployment. Phase 7 (benchmarks) is the minimum validation.

Q: How long does the full workflow take?
A: ~2-3 hours on a modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).

Q: Can I run phases out of order?
A: There are some dependencies: Phase 1 informs the validation sets used in Phase 3+,
   and Phase 2 results are used in Phase 3. In general, run 1-9 in order.

Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md.
   Most failures are due to missing dependencies or data files.
   See requirements.txt and ensure the DDInter data is in data/processed/.

Q: How do I select which model to deploy?
A: Use final_model_card.md, which ranks models by these criteria (weights in parentheses):
   severe_recall (40%) > calibration (20%) > auroc (20%) > stability (10%) > latency (10%)
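
One way to read the weights above is as a weighted sum over per-criterion scores. The sketch below assumes exactly that, and further assumes each criterion has already been normalized to [0, 1] with higher meaning better; neither assumption is confirmed by this README, and the model names and scores are invented for illustration.

```python
# Weights from the ranking rule above; they sum to 1.0.
WEIGHTS = {
    "severe_recall": 0.40,
    "calibration": 0.20,
    "auroc": 0.20,
    "stability": 0.10,
    "latency": 0.10,
}


def weighted_score(scores):
    # scores: criterion name -> value in [0, 1], higher is better (assumed).
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank_models(candidates):
    # Return model names ordered from best to worst weighted score.
    return sorted(candidates, key=lambda name: weighted_score(candidates[name]),
                  reverse=True)


# Hypothetical candidates with made-up normalized scores.
candidates = {
    "model_a": {"severe_recall": 0.95, "calibration": 0.80, "auroc": 0.90,
                "stability": 0.90, "latency": 0.50},
    "model_b": {"severe_recall": 0.85, "calibration": 0.90, "auroc": 0.95,
                "stability": 0.90, "latency": 0.90},
}
print(rank_models(candidates))  # ['model_b', 'model_a']
```

With a pure weighted sum, a model with lower severe recall can still rank first, so the hard success criteria should be checked before ranking.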

Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments.
   See --help for each script:
   python src/validation/optuna_hyperparameter_tune.py --help
"""
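from pathlib import Path

if __name__ == '__main__':
    # Write this module's docstring next to the script as README.md.
    output_path = Path(__file__).parent / 'README.md'
    # UTF-8 is set explicitly because the docstring contains non-ASCII symbols.
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')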