"""Validation and Optimization Module README
This directory contains the complete 9-phase optimization and validation framework
for the MEDCARE-DDI AI system.
FILES
=====
Master Orchestration:
- run_complete_workflow.py
Execute all 9 phases or a subset
Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]
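How the two flags combine is not spelled out above. A minimal sketch, assuming --skip-phases is applied after --phases and that omitting --phases means all nine (the select_phases helper is hypothetical, not taken from run_complete_workflow.py):

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1..9


def select_phases(argv):
    # Hypothetical helper: resolve the CLI flags into an ordered phase list.
    parser = argparse.ArgumentParser()
    parser.add_argument('--phases', type=int, nargs='+', choices=ALL_PHASES)
    parser.add_argument('--skip-phases', type=int, nargs='+', default=[],
                        choices=ALL_PHASES)
    args = parser.parse_args(argv)
    requested = args.phases if args.phases else ALL_PHASES
    # Assumption: skips win over explicit requests; phases run in ascending order.
    return sorted(set(requested) - set(args.skip_phases))
```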
Phase 1: Dataset Audit
- dataset_audit.py
Detect duplicates, conflicts, class imbalance, data quality issues
Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv
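A minimal sketch of the duplicate and conflict checks, assuming each record is a (drug_a, drug_b, severity) tuple and that pair order does not matter. The actual schema used by dataset_audit.py may differ:

```python
from collections import defaultdict


def audit_pairs(records):
    # Group labels by unordered drug pair so (a, b) and (b, a) collide.
    labels_by_pair = defaultdict(list)
    for drug_a, drug_b, severity in records:
        key = tuple(sorted((drug_a.lower(), drug_b.lower())))
        labels_by_pair[key].append(severity)

    # A duplicate is any pair seen more than once.
    duplicates = {k for k, v in labels_by_pair.items() if len(v) > 1}
    # A conflict is a pair that appears with more than one distinct label.
    conflicts = {k for k, v in labels_by_pair.items() if len(set(v)) > 1}
    n_pairs = len(labels_by_pair)
    conflict_rate = len(conflicts) / n_pairs if n_pairs else 0.0
    return duplicates, conflicts, conflict_rate
```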
Phase 2: Embedding Benchmarks
- embedding_benchmark.py
Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
Output: embedding_benchmark_results.csv, embedding_ablation_report.md
Phase 3: Hyperparameter Optimization
- optuna_hyperparameter_tune.py
Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
Usage: python optuna_hyperparameter_tune.py --n-trials 50
Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md
Phase 4: Ensemble Ablation
- ensemble_ablation_study.py
Compare voting, blending, stacking strategies
Output: ensemble_benchmark.csv, ensemble_ablation.md
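For orientation, voting and blending differ only in how member probabilities are weighted. A sketch assuming each model emits a list of class probabilities (function names and any weights passed in are illustrative, not values from this project):

```python
def soft_vote(prob_lists):
    # Soft voting: unweighted mean of each class probability across models.
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]


def blend(prob_lists, weights):
    # Blending: weighted mean, with weights normalised to sum to 1.
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, col)) / total
        for col in zip(*prob_lists)
    ]
```

Stacking replaces the fixed weights with a meta-model trained on held-out predictions of the member models.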
Phase 5: Healthcare Safety Tuning
- healthcare_safety_tuning.py
Analyze false negatives, optimize severe thresholds
Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json
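One way to pick a severe-class threshold, sketched under the assumption that the model outputs a severe-class probability per example and that the goal is the highest threshold still meeting the recall target (function and argument names are hypothetical):

```python
def pick_severe_threshold(probs, is_severe, target_recall=0.90):
    # Try candidate thresholds from high to low; return the first one whose
    # recall on the severe class meets the target.
    n_severe = sum(1 for y in is_severe if y)
    if n_severe == 0:
        raise ValueError('no severe examples to evaluate')
    for t in sorted(set(probs), reverse=True):
        true_pos = sum(1 for p, y in zip(probs, is_severe) if y and p >= t)
        if true_pos / n_severe >= target_recall:
            return t
    # Only reached if target_recall is above 1.0.
    return 0.0
```

Choosing the highest qualifying threshold keeps false alarms as low as the recall target allows.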
Phase 6: Explainability Validation
- explainability_validation.py
Feature importance, SHAP, example explanations
Output: explainability_examples.md, feature_importance.csv
Phase 7: Comprehensive Benchmarks
- comprehensive_benchmark.py
Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
Output: final_benchmark_report.md, benchmark_metrics.json
Phase 8: Production Validation
- production_validation.py
FastAPI compatibility, CPU/GPU inference, latency, readiness
Output: production_validation_report.md, production_readiness_report.md, final_model_card.md
Existing Tests:
- test_multimodal_components.py
Unit tests for calibration, ensemble, embeddings, molecular features
Usage: python -m unittest test_multimodal_components -v
QUICK START
===========
Run all 9 phases:
```bash
cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py
```
Run specific phases:
```bash
python src/validation/run_complete_workflow.py --phases 1 3 7
```
Run individual phase:
```bash
python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
```
OUTPUT REPORTS
==============
All reports saved to: MEDCARE-DDI-AI/models/reports/
Key Reports (in reading order):
1. dataset_audit_report.json
→ Data quality overview
2. embedding_benchmark_results.csv
→ Best embedding model to use
3. optuna_best_params.json
→ Optimal hyperparameters (use for training)
4. ensemble_benchmark.csv
→ Best ensemble strategy (voting/blending/stacking)
5. safety_analysis_report.md
→ False negatives and recommended thresholds
6. final_benchmark_report.md
→ Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
7. production_readiness_report.md
→ Deployment checklist and instructions
8. final_model_card.md
→ Production model specification
TARGET METRICS
==============
Success criteria (all must be met):
✓ Accuracy ≥ 85%
✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
✓ AUROC ≥ 0.90
✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
✓ p99 Latency < 200ms
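These criteria can be checked mechanically. A sketch, assuming the metrics are available as a dict with the hypothetical keys shown (the real layout of benchmark_metrics.json is not documented here):

```python
import operator

# (metric key, comparison, threshold) for each success criterion.
# Key names are assumptions, not the confirmed schema.
TARGETS = [
    ('accuracy', operator.ge, 0.85),
    ('severe_recall', operator.ge, 0.90),
    ('auroc', operator.ge, 0.90),
    ('ece', operator.lt, 0.05),
    ('p99_latency_ms', operator.lt, 200.0),
]


def failed_targets(metrics):
    # Return the names of criteria that are missing or not met.
    return [
        key for key, compare, threshold in TARGETS
        if key not in metrics or not compare(metrics[key], threshold)
    ]
```

An empty list means every target is met. Note that ECE and latency use strict inequalities, so a value exactly at the limit fails.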
WORKFLOW DIAGRAM
================
Data
  ↓
Phase 1: Dataset Audit
  ├── data quality issues detected?
  └── conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├── Compare 4 embedding models
  ├── Choose best (by severe recall)
  └── Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├── Optuna: 50 trials with healthcare objective
  ├── Find: best LR, dropout, hidden_dim, batch_size
  └── Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├── Compare: voting, blending, stacking
  ├── Choose best ensemble strategy
  └── Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├── Analyze false negatives
  ├── Optimize severe escalation threshold
  └── Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├── Feature importance ranking
  ├── Example predictions + rationales
  └── Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├── Final metrics on all dimensions
  ├── Confusion matrices
  └── Check: all targets met?
  ↓
Phase 8: Production Validation
  ├── FastAPI compatibility
  ├── Latency benchmarks (p99 < 200ms?)
  └── Deployment readiness
  ↓
Phase 9: Model Selection
  └── Choose production model + deployment instructions
TROUBLESHOOTING
===============
If Severe Recall < 90%:
1. Check safety_analysis_report.md for FN breakdown
2. Re-run Phase 5 with a lower threshold (0.30-0.35)
3. Re-run Phase 3 with a higher focal_gamma
4. Increase data collection (Phase 1 may show missing classes)
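Step 3 relies on how focal loss reweights examples: the loss for a true class predicted with probability p is -(1 - p)^gamma * log(p), so a larger gamma shrinks the contribution of confidently correct predictions and leaves relatively more weight on hard ones such as missed severe cases. A small numeric sketch (without the class-weight alpha term, which the project's loss may or may not include):

```python
import math


def focal_loss(p_true, gamma):
    # Focal loss for one example, given the probability of its true class.
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Ratio of hard-example loss (p=0.2) to easy-example loss (p=0.9).
# The ratio grows with gamma; gamma=0 is plain cross-entropy.
for gamma in (0.0, 2.0, 4.0):
    ratio = focal_loss(0.2, gamma) / focal_loss(0.9, gamma)
    print(f'gamma={gamma}: hard/easy loss ratio = {ratio:.0f}')
```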
If ECE > 0.05:
1. Use calibrated voting (Phase 4)
2. Apply temperature scaling (Phase 5)
3. Use validation-based temperature fitting
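Steps 2 and 3 amount to fitting a single temperature T on validation data and dividing logits by it before the sigmoid or softmax. A binary-case sketch that fits T by grid search on validation negative log-likelihood and measures ECE with equal-width bins; the grid bounds and bin count are arbitrary assumptions, and the project's own calibration code may differ:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(logits, labels, temperature):
    # Mean negative log-likelihood of binary labels after scaling logits by 1/T.
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= math.log(p if y else 1.0 - p)
    return total / len(logits)


def fit_temperature(val_logits, val_labels):
    # Grid search over T in [0.5, 5.0]; the bounds are an arbitrary assumption.
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


def expected_calibration_error(probs, labels, n_bins=10):
    # ECE: bin-size-weighted gap between mean confidence and accuracy,
    # using the confidence of the predicted class.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p, 1.0 - p)
        correct = (p >= 0.5) == bool(y)
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / len(probs) * abs(mean_conf - accuracy)
    return ece
```

T above 1 softens overconfident predictions; T is fitted on validation data only and then applied unchanged at inference time.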
If p99 Latency > 200ms:
1. Reduce hidden_dim in Phase 3
2. Use fewer ensemble models in Phase 4
3. Deploy on GPU instead of CPU
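To confirm whether a change actually moves p99, latency can be measured directly. A sketch using only the standard library; predict stands in for whatever callable wraps the model, and the warm-up and sample counts are arbitrary:

```python
import statistics
import time


def latency_percentiles_ms(predict, payload, n_calls=200, warmup=10):
    # Time repeated calls and report p50 and p99 in milliseconds.
    for _ in range(warmup):
        predict(payload)
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples, n=100)
    return {'p50': cuts[49], 'p99': cuts[98]}
```

With few samples the p99 estimate is noisy, so a larger n_calls gives a more trustworthy tail figure.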
REPRODUCIBILITY
===============
All scripts use a fixed seed (2026) for reproducibility:
- Train/test splits deterministic
- Model initialization deterministic
- Optuna trials deterministic
- Results exactly reproducible on same hardware
For exact reproduction:
1. Use Python 3.12+
2. Use exact package versions from requirements.txt
3. Run on the same hardware type (CPU vs. GPU matters)
4. Do not interrupt the workflow
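A typical seeding helper looks like the sketch below. It is an assumption about how the scripts apply the seed, not a copy of their code; NumPy and PyTorch are seeded only if they are installed:

```python
import os
import random

SEED = 2026


def seed_everything(seed=SEED):
    # Seed Python's RNG, plus NumPy and PyTorch when available.
    random.seed(seed)
    # Only affects subprocesses; hash randomisation of the current
    # interpreter is fixed at startup.
    os.environ['PYTHONHASHSEED'] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Optuna is seeded separately, through its sampler, for example optuna.samplers.TPESampler(seed=2026).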
MONITORING (POST-DEPLOYMENT)
============================
After deploying the production model, monitor:
Health Check:
curl http://localhost:8000/health
→ Should return: "status": "healthy"
Latency:
→ Track p50 and p99 over time (alert if p99 > 200ms)
Severe Recall:
→ If ground truth is available, track the false-negative rate on the severe class
Calibration:
→ Monitor drift: does confidence still match accuracy?
Example Request:
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
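The same request from Python, using only the standard library. The response schema is not documented here, so the sketch returns the decoded JSON as-is:

```python
import json
import urllib.request


def build_predict_request(drug_a, drug_b, base_url='http://localhost:8000'):
    # Build the POST request; mirrors the curl example above.
    body = json.dumps({'drug_a': drug_a, 'drug_b': drug_b}).encode('utf-8')
    return urllib.request.Request(
        base_url + '/predict',
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


def predict(drug_a, drug_b, timeout=5.0):
    # Send the request and decode the JSON response.
    request = build_predict_request(drug_a, drug_b)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling predict('aspirin', 'warfarin') requires the service to be running on localhost:8000.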
DOCUMENTATION
=============
For complete details, see:
- ../OPTIMIZATION_FRAMEWORK.py
→ Comprehensive framework documentation (read this first!)
- ../README.md
→ Quick start and deployment guide
- ../MEDCARE-DDI-AI/src/inference/app_production.py
→ FastAPI backend specification
QUESTIONS
=========
Q: Which phases are required?
A: All 9 for a first-time production release. Phase 7 (benchmarks) is the minimum validation.
Q: How long does full workflow take?
A: ~2-3 hours on modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).
Q: Can I run phases out of order?
A: Not freely. Phase 1 informs the validation sets used from Phase 3 onward, and
Phase 2 results are used in Phase 3. In general, run 1-9 in order.
Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md.
Most failures are due to missing dependencies or data files.
See requirements.txt and ensure the DDInter data is in data/processed/
Q: How do I select which model to deploy?
A: Use final_model_card.md which ranks models by:
severe_recall (40%) > calibration (20%) > auroc (20%) > stability (10%) > latency (10%)
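The weighting can be read as a weighted sum. A sketch assuming every component has already been normalised to [0, 1] with higher being better, so calibration and latency would first need converting from ECE and milliseconds (how final_model_card.md does that is not stated here):

```python
WEIGHTS = {
    'severe_recall': 0.40,
    'calibration': 0.20,
    'auroc': 0.20,
    'stability': 0.10,
    'latency': 0.10,
}


def selection_score(scores):
    # Weighted sum of normalised component scores, each in [0, 1].
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def rank_models(candidates):
    # candidates maps model name -> component scores; best model first.
    return sorted(candidates, key=lambda m: selection_score(candidates[m]),
                  reverse=True)
```

Because severe recall carries 40% of the weight, a model with better recall can outrank one that is faster or slightly better calibrated.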
Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments.
See --help for each script:
python src/validation/optuna_hyperparameter_tune.py --help
"""
from pathlib import Path

if __name__ == '__main__':
    output_path = Path(__file__).parent / 'README.md'
    # The docstring contains non-ASCII characters, so write it as UTF-8 explicitly.
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')