"""Validation and Optimization Module README
This directory contains the complete 9-phase optimization and validation framework for the MEDCARE-DDI AI system.
FILES
Master Orchestration:
- run_complete_workflow.py
  Execute all 9 phases or a subset
  Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]

Phase 1: Dataset Audit
- dataset_audit.py
  Detect duplicates, conflicts, class imbalance, data quality issues
  Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv

Phase 2: Embedding Benchmarks
- embedding_benchmark.py
  Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
  Output: embedding_benchmark_results.csv, embedding_ablation_report.md

Phase 3: Hyperparameter Optimization
- optuna_hyperparameter_tune.py
  Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
  Usage: python optuna_hyperparameter_tune.py --n-trials 50
  Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md

Phase 4: Ensemble Ablation
- ensemble_ablation_study.py
  Compare voting, blending, stacking strategies
  Output: ensemble_benchmark.csv, ensemble_ablation.md

Phase 5: Healthcare Safety Tuning
- healthcare_safety_tuning.py
  Analyze false negatives, optimize severe thresholds
  Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json

Phase 6: Explainability Validation
- explainability_validation.py
  Feature importance, SHAP, example explanations
  Output: explainability_examples.md, feature_importance.csv

Phase 7: Comprehensive Benchmarks
- comprehensive_benchmark.py
  Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
  Output: final_benchmark_report.md, benchmark_metrics.json

Phase 8: Production Validation
- production_validation.py
  FastAPI compatibility, CPU/GPU inference, latency, readiness
  Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

Existing Tests:
- test_multimodal_components.py
  Unit tests for calibration, ensemble, embeddings, molecular features
  Usage: python -m unittest test_multimodal_components -v

QUICK START
Run all 9 phases:
cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py
Run specific phases:
python src/validation/run_complete_workflow.py --phases 1 3 7
Run individual phase:
python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
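
The phase selection used by the commands above could be resolved roughly as follows. This is a minimal sketch, not the actual code of run_complete_workflow.py: the function name select_phases and the rule that --skip-phases is applied after --phases are assumptions.

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1..9


def select_phases(argv):
    # Parse --phases / --skip-phases and return the sorted list of phases to run.
    # Assumption: skipped phases are removed from whatever --phases selected.
    parser = argparse.ArgumentParser(description="Select workflow phases to run")
    parser.add_argument("--phases", type=int, nargs="+",
                        choices=ALL_PHASES, default=ALL_PHASES)
    parser.add_argument("--skip-phases", type=int, nargs="+",
                        choices=ALL_PHASES, default=[])
    args = parser.parse_args(argv)
    skipped = set(args.skip_phases)
    return [p for p in sorted(set(args.phases)) if p not in skipped]


print(select_phases(["--phases", "1", "3", "7"]))  # [1, 3, 7]
print(select_phases(["--skip-phases", "4"]))       # [1, 2, 3, 5, 6, 7, 8, 9]
```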
OUTPUT REPORTS
All reports saved to: MEDCARE-DDI-AI/models/reports/
Key Reports (in reading order):
dataset_audit_report.json → Data quality overview
embedding_benchmark_results.csv → Best embedding model to use
optuna_best_params.json → Optimal hyperparameters (use for training)
ensemble_benchmark.csv → Best ensemble strategy (voting/blending/stacking)
safety_analysis_report.md → False negatives and recommended thresholds
final_benchmark_report.md → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
production_readiness_report.md → Deployment checklist and instructions
final_model_card.md → Production model specification
TARGET METRICS
Success criteria (all must be met):
- Accuracy ≥ 85%
- Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
- AUROC ≥ 0.90
- ECE < 0.05 (CRITICAL: trustworthy confidence)
- p99 Latency < 200ms
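
These criteria can be checked mechanically once the final metrics are available. The sketch below is illustrative only: the metric key names (accuracy, severe_recall, auroc, ece, p99_latency_ms) and the function check_targets are assumptions, not the actual schema of benchmark_metrics.json, and the example values are made up.

```python
import operator

# Thresholds copied from the success criteria.
# Key names are hypothetical; adapt them to the real report schema.
TARGETS = {
    "accuracy": (operator.ge, 0.85),
    "severe_recall": (operator.ge, 0.90),
    "auroc": (operator.ge, 0.90),
    "ece": (operator.lt, 0.05),
    "p99_latency_ms": (operator.lt, 200.0),
}


def check_targets(metrics):
    # Return the names of all criteria that are missing or not met.
    failed = []
    for name, (compare, threshold) in TARGETS.items():
        value = metrics.get(name)
        if value is None or not compare(value, threshold):
            failed.append(name)
    return failed


# Made-up example values: everything passes except severe recall.
example = {"accuracy": 0.88, "severe_recall": 0.87, "auroc": 0.93,
           "ece": 0.04, "p99_latency_ms": 150.0}
print(check_targets(example))  # ['severe_recall']
```

An empty result means all targets are met; a missing metric is treated as a failure so that an incomplete report cannot pass silently.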
WORKFLOW DIAGRAM
Data
  ↓
Phase 1: Dataset Audit
  ├── data quality issues detected?
  └── conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├── Compare 4 embedding models
  ├── Choose best (by severe recall)
  └── Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├── Optuna: 50 trials with healthcare objective
  ├── Find: best LR, dropout, hidden_dim, batch_size
  └── Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├── Compare: voting, blending, stacking
  ├── Choose best ensemble strategy
  └── Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├── Analyze false negatives
  ├── Optimize severe escalation threshold
  └── Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├── Feature importance ranking
  ├── Example predictions + rationales
  └── Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├── Final metrics on all dimensions
  ├── Confusion matrices
  └── Check: all targets met?
  ↓
Phase 8: Production Validation
  ├── FastAPI compatibility
  ├── Latency benchmarks (p99 < 200ms?)
  └── Deployment readiness
  ↓
Phase 9: Model Selection
  └── Choose production model + deployment instructions
TROUBLESHOOTING
If Severe Recall < 90%:
- Check safety_analysis_report.md for FN breakdown
- Re-run Phase 5 with lower threshold (0.30-0.35)
- Re-run Phase 3 with higher focal_gamma
- Increase data collection (Phase 1 may show missing classes)
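
To see why a lower severe threshold raises severe recall, and what it costs in false alarms, a toy sweep can help. The probabilities and labels below are made up for illustration, and the helper names are hypothetical; Phase 5 may compute this differently.

```python
def severe_recall(probs, labels, threshold):
    # Recall on the severe class: flagged severe cases / all severe cases.
    severe = [p for p, y in zip(probs, labels) if y == 1]
    if not severe:
        return 0.0
    return sum(p >= threshold for p in severe) / len(severe)


def false_alarms(probs, labels, threshold):
    # Number of non-severe cases flagged as severe at this threshold.
    return sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)


# Toy data: predicted severe probability and true label (1 = severe).
probs = [0.92, 0.61, 0.42, 0.33, 0.28, 0.45, 0.10, 0.31]
labels = [1, 1, 1, 1, 1, 0, 0, 0]

for threshold in (0.50, 0.35, 0.30):
    print(threshold,
          severe_recall(probs, labels, threshold),
          false_alarms(probs, labels, threshold))
```

On this toy data, recall rises from 0.4 to 0.8 as the threshold drops from 0.50 to 0.30, while false alarms rise from 0 to 2.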
If ECE > 0.05:
- Use calibrated voting (Phase 4)
- Apply temperature scaling (Phase 5)
- Fit the temperature on the validation set
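
For reference, ECE is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below uses 10 equal-width bins, which is an assumption; the binning used by comprehensive_benchmark.py may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|.
    n = len(confidences)
    if n == 0:
        raise ValueError("no predictions")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Overconfident toy model: 95% confidence but only 3 of 4 predictions correct.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
```

The toy model above has an ECE of about 0.2 (confidence 0.95 against accuracy 0.75), well above the 0.05 target.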
If p99 Latency > 200ms:
- Reduce hidden_dim in Phase 3
- Use fewer ensemble models in Phase 4
- Deploy on GPU instead of CPU
REPRODUCIBILITY
All scripts use fixed seed (2026) for reproducibility:
- Train/test splits deterministic
- Model initialization deterministic
- Optuna trials deterministic
- Results exactly reproducible on same hardware
For exact reproduction:
- Use Python 3.12+
- Use exact package versions from requirements.txt
- Run on same hardware type (CPU vs GPU type matters)
- Don't interrupt workflow
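
As an illustration of what fixed seeding involves, the sketch below seeds Python's built-in generator and checks that two runs produce the same draws. The helper name seed_everything is hypothetical, and how the actual scripts apply the seed is an assumption.

```python
import random

SEED = 2026  # the seed value stated above


def seed_everything(seed=SEED):
    # Seed Python's built-in RNG. Other libraries keep their own generators
    # and need separate seeding, for example numpy.random.seed(seed),
    # torch.manual_seed(seed), or optuna.samplers.TPESampler(seed=seed).
    random.seed(seed)


seed_everything()
first_run = [random.random() for _ in range(3)]
seed_everything()
second_run = [random.random() for _ in range(3)]
print(first_run == second_run)  # True
```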
MONITORING (POST-DEPLOYMENT)
After deploying the production model, monitor:
Health Check:
  curl http://localhost:8000/health
  → Should return: "status": "healthy"
Latency:
  → Track p50 and p99 over time (alert if p99 > 200ms)
Severe Recall:
  → If ground truth is available, track the false-negative rate on the severe class
Calibration:
  → Monitor drift: does confidence still match accuracy?
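
For latency tracking, p50 and p99 can be computed from logged request times. This is a minimal sketch using the nearest-rank percentile definition; the sample data and alert limit constant are illustrative, and a production setup would more likely rely on its metrics stack.

```python
import math

P99_LIMIT_MS = 200  # latency target from TARGET METRICS


def percentile(samples, pct):
    # Nearest-rank percentile: the smallest sample such that at least
    # pct percent of all samples are less than or equal to it.
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p99, "ALERT" if p99 > P99_LIMIT_MS else "ok")  # 50 99 ok
```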
Example Request:
curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
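
The same request can be built from Python with the standard library. This sketch only constructs the request; the helper name build_predict_request is hypothetical, and the response format of /predict is not specified here, so sending and parsing are left as a comment.

```python
import json
import urllib.request


def build_predict_request(drug_a, drug_b, base_url="http://localhost:8000"):
    # Build a JSON POST request for the /predict endpoint shown above.
    payload = json.dumps({"drug_a": drug_a, "drug_b": drug_b}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request("aspirin", "warfarin")
print(request.get_method(), request.full_url)

# To send it against a running server:
#     with urllib.request.urlopen(request, timeout=5) as response:
#         print(json.load(response))
```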
DOCUMENTATION
For complete details, see:
- ../OPTIMIZATION_FRAMEWORK.py → Comprehensive framework documentation (read this first!)
- ../README.md → Quick start and deployment guide
- ../MEDCARE-DDI-AI/src/inference/app_production.py → FastAPI backend specification
QUESTIONS
Q: Which phases are required?
A: All 9 for first-time production. Phase 7 (benchmarks) is the minimum validation.
Q: How long does full workflow take?
A: 2-3 hours on modern CPU. Phase 3 (Optuna) dominates (1 hour for 50 trials).
Q: Can I run phases out of order?
A: There are some dependencies: Phase 1 informs the validation sets for Phase 3+, and Phase 2 results are used in Phase 3. In general, run phases 1-9 in order.
Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md. Most failures are due to missing dependencies or data files. See requirements.txt and ensure the DDInter data is in data/processed/
Q: How do I select which model to deploy?
A: Use final_model_card.md, which ranks models by: severe_recall (40%) > calibration (20%) > auroc (20%) > stability (10%) > latency (10%)
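
One plausible reading of these weights is a weighted sum over per-criterion scores. The sketch below assumes each component is already normalized to [0, 1] with higher meaning better (so a faster model has a higher latency score); how final_model_card.md actually normalizes and combines them is not specified here, and the candidate values are made up.

```python
# Weights from the ranking above.
WEIGHTS = {
    "severe_recall": 0.40,
    "calibration": 0.20,
    "auroc": 0.20,
    "stability": 0.10,
    "latency": 0.10,
}


def selection_score(scores):
    # Weighted sum of normalized component scores (assumed to lie in [0, 1]).
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Hypothetical candidates with made-up normalized scores.
candidates = {
    "voting": {"severe_recall": 0.92, "calibration": 0.90, "auroc": 0.93,
               "stability": 0.80, "latency": 0.90},
    "stacking": {"severe_recall": 0.95, "calibration": 0.85, "auroc": 0.94,
                 "stability": 0.75, "latency": 0.60},
}
best = max(candidates, key=lambda name: selection_score(candidates[name]))
print(best)  # voting
```

Here the stacking candidate has the better severe recall, but its weaker calibration, stability, and latency scores give it a lower total (0.873 against 0.904).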
Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments. See --help for each script:
python src/validation/optuna_hyperparameter_tune.py --help
"""
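from pathlib import Path

if __name__ == '__main__':
    # Write this module's docstring out as README.md next to the script.
    output_path = Path(__file__).parent / 'README.md'
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')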