"""Validation and Optimization Module README

This directory contains the complete 9-phase optimization and validation framework
for the MEDCARE-DDI AI system.

FILES
=====

Master Orchestration:
  - run_complete_workflow.py
    Execute all 9 phases or a subset
    Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]

Phase 1: Dataset Audit
  - dataset_audit.py
    Detect duplicates, conflicts, class imbalance, data quality issues
    Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv

Phase 2: Embedding Benchmarks
  - embedding_benchmark.py
    Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
    Output: embedding_benchmark_results.csv, embedding_ablation_report.md

Phase 3: Hyperparameter Optimization
  - optuna_hyperparameter_tune.py
    Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
    Usage: python optuna_hyperparameter_tune.py --n-trials 50
    Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md

Phase 4: Ensemble Ablation
  - ensemble_ablation_study.py
    Compare voting, blending, stacking strategies
    Output: ensemble_benchmark.csv, ensemble_ablation.md

Phase 5: Healthcare Safety Tuning
  - healthcare_safety_tuning.py
    Analyze false negatives, optimize severe thresholds
    Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json

Phase 6: Explainability Validation
  - explainability_validation.py
    Feature importance, SHAP, example explanations
    Output: explainability_examples.md, feature_importance.csv

Phase 7: Comprehensive Benchmarks
  - comprehensive_benchmark.py
    Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
    Output: final_benchmark_report.md, benchmark_metrics.json

Phase 8: Production Validation
  - production_validation.py
    FastAPI compatibility, CPU/GPU inference, latency, readiness
    Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

Existing Tests:
  - test_multimodal_components.py
    Unit tests for calibration, ensemble, embeddings, molecular features
    Usage: python -m unittest test_multimodal_components -v

QUICK START
===========

Run all 9 phases:
```bash
cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py
```

Run specific phases:
```bash
python src/validation/run_complete_workflow.py --phases 1 3 7
```
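
The phase flags shown above could be resolved into a run list roughly as follows.
This is a minimal sketch that assumes argparse and a hypothetical helper named
select_phases; it mirrors the documented flags and is not the actual code of
run_complete_workflow.py.

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1..9, as described in this README


def select_phases(argv):
    # Hypothetical parser: mirrors the documented --phases / --skip-phases flags.
    parser = argparse.ArgumentParser()
    parser.add_argument('--phases', type=int, nargs='+', default=ALL_PHASES)
    parser.add_argument('--skip-phases', type=int, nargs='+', default=[])
    args = parser.parse_args(argv)
    skipped = set(args.skip_phases)
    # Run in ascending order, dropping any phase that was explicitly skipped.
    return [p for p in sorted(set(args.phases)) if p not in skipped]


print(select_phases(['--phases', '1', '3', '7']))
print(select_phases(['--skip-phases', '4']))
```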

Run an individual phase:
```bash
python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
```

OUTPUT REPORTS
==============

All reports are saved to: MEDCARE-DDI-AI/models/reports/

Key Reports (in reading order):
  1. dataset_audit_report.json
     → Data quality overview

  2. embedding_benchmark_results.csv
     → Best embedding model to use

  3. optuna_best_params.json
     → Optimal hyperparameters (use for training)

  4. ensemble_benchmark.csv
     → Best ensemble strategy (voting/blending/stacking)

  5. safety_analysis_report.md
     → False negatives and recommended thresholds

  6. final_benchmark_report.md
     → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)

  7. production_readiness_report.md
     → Deployment checklist and instructions

  8. final_model_card.md
     → Production model specification

TARGET METRICS
==============

Success criteria (all must be met):
  ✓ Accuracy ≥ 85%
  ✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
  ✓ AUROC ≥ 0.90
  ✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
  ✓ p99 Latency < 200ms
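
The success criteria above can be checked mechanically. The sketch below is
illustrative only: the thresholds come from the list above, but the metric key
names and the failed_targets helper are assumptions, and benchmark_metrics.json
may use different keys.

```python
# Thresholds copied from the success criteria in this README.
# The metric key names are hypothetical.
def failed_targets(metrics):
    checks = {
        'accuracy': metrics['accuracy'] >= 0.85,
        'severe_recall': metrics['severe_recall'] >= 0.90,
        'auroc': metrics['auroc'] >= 0.90,
        'ece': metrics['ece'] < 0.05,
        'p99_latency_ms': metrics['p99_latency_ms'] < 200,
    }
    # Return the names of every criterion that is not met.
    return [name for name, ok in checks.items() if not ok]


example = {  # made-up numbers, for illustration only
    'accuracy': 0.88, 'severe_recall': 0.87, 'auroc': 0.93,
    'ece': 0.04, 'p99_latency_ms': 150,
}
print(failed_targets(example))  # ['severe_recall']
```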

WORKFLOW DIAGRAM
================

Data
  ↓
Phase 1: Dataset Audit
  ├─→ data quality issues detected?
  └─→ conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├─→ Compare 4 embedding models
  ├─→ Choose best (by severe recall)
  └─→ Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├─→ Optuna: 50 trials with healthcare objective
  ├─→ Find: best LR, dropout, hidden_dim, batch_size
  └─→ Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├─→ Compare: voting, blending, stacking
  ├─→ Choose best ensemble strategy
  └─→ Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├─→ Analyze false negatives
  ├─→ Optimize severe escalation threshold
  └─→ Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├─→ Feature importance ranking
  ├─→ Example predictions + rationales
  └─→ Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├─→ Final metrics on all dimensions
  ├─→ Confusion matrices
  └─→ Check: all targets met?
  ↓
Phase 8: Production Validation
  ├─→ FastAPI compatibility
  ├─→ Latency benchmarks (p99 < 200ms?)
  └─→ Deployment readiness
  ↓
Phase 9: Model Selection
  └─→ Choose production model + deployment instructions

TROUBLESHOOTING
===============

If Severe Recall < 90%:
  1. Check safety_analysis_report.md for FN breakdown
  2. Re-run Phase 5 with lower threshold (0.30-0.35)
  3. Re-run Phase 3 with higher focal_gamma
  4. Increase data collection (Phase 1 may show missing classes)
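
To see why step 2 helps, the sketch below computes severe recall at several
escalation thresholds. It assumes a case is escalated when its predicted severe
probability is at or above the threshold; the scores and labels are made up and
the helper is hypothetical, not part of healthcare_safety_tuning.py.

```python
def severe_recall(scores, labels, threshold):
    # scores: predicted probability of the severe class
    # labels: 1 = truly severe, 0 = not severe
    flagged = [s >= threshold for s in scores]
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    positives = sum(labels)
    return true_pos / positives if positives else 0.0


scores = [0.92, 0.48, 0.33, 0.61, 0.10, 0.31]  # made-up model outputs
labels = [1, 1, 1, 1, 0, 0]
for threshold in (0.50, 0.35, 0.30):
    print(threshold, severe_recall(scores, labels, threshold))
```

Lowering the threshold catches more severe cases but also escalates more
non-severe ones (here the 0.31 case at threshold 0.30), so precision should be
checked alongside recall.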

If ECE > 0.05:
  1. Use calibrated voting (Phase 4)
  2. Apply temperature scaling (Phase 5)
  3. Fit the temperature on a held-out validation set
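
For reference, ECE is commonly computed by binning predictions by confidence
and averaging the gap between confidence and accuracy in each bin, weighted by
bin size. The sketch below uses equal-width bins; the bin count and function
name are assumptions and may differ from what comprehensive_benchmark.py does.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: predicted probability of the chosen class, in [0, 1]
    # correct: 1 if that prediction was right, else 0
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Four predictions at 95% confidence, three of them right: overconfident.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 1, 0]))
```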

If p99 Latency > 200ms:
  1. Reduce hidden_dim in Phase 3
  2. Use fewer ensemble models in Phase 4
  3. Deploy on GPU instead of CPU

REPRODUCIBILITY
================

All scripts use a fixed seed (2026) for reproducibility:
  - Train/test splits deterministic
  - Model initialization deterministic
  - Optuna trials deterministic
  - Results exactly reproducible on same hardware

For exact reproduction:
  1. Use Python 3.12+
  2. Use exact package versions from requirements.txt
  3. Run on the same hardware type (CPU vs GPU type matters)
  4. Don't interrupt the workflow
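
A fixed seed is typically applied along these lines. This is a hedged sketch:
seed_everything is a hypothetical helper, and the NumPy and PyTorch calls are
included only on the assumption that those libraries are in use; the actual
scripts may seed differently.

```python
import os
import random

SEED = 2026  # the fixed seed named in this README


def seed_everything(seed=SEED):
    # Only affects hash randomization in subprocesses started afterwards.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    try:  # optional: only if NumPy is installed
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:  # optional: only if PyTorch is installed
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything()
first = [random.random() for _ in range(3)]
seed_everything()
second = [random.random() for _ in range(3)]
print(first == second)  # True
```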

MONITORING (POST-DEPLOYMENT)
=============================

After deploying the production model, monitor:

Health Check:
  curl http://localhost:8000/health
  → Should return: "status": "healthy"

Latency:
  → Track p50, p99 over time (alert if > 200ms)

Severe Recall:
  → If ground truth available, track %FN on severe class

Calibration:
  → Monitor drift: is confidence still matching accuracy?
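
The latency tracking described above could be sketched as follows. The
nearest-rank percentile method, the helper names, and the sample window are
assumptions for illustration; only the 200 ms limit comes from this README.

```python
import math


def percentile(samples, pct):
    # Nearest-rank percentile: smallest value with at least pct% of samples at or below it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(latencies_ms, limit_ms=200):
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {'p50_ms': p50, 'p99_ms': p99, 'alert': p99 > limit_ms}


window = [40] * 97 + [180, 250, 260]  # made-up latencies in milliseconds
print(latency_alert(window))
```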

Example Request:
  curl -X POST http://localhost:8000/predict \\
    -H "Content-Type: application/json" \\
    -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
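
The same request can be built from Python with the standard library. The
endpoint and payload are taken from the curl example above; the helper name is
hypothetical, and the response schema is not documented here, so the sketch
only builds the request and leaves sending it commented out.

```python
import json
import urllib.request


def build_predict_request(drug_a, drug_b, base_url='http://localhost:8000'):
    payload = json.dumps({'drug_a': drug_a, 'drug_b': drug_b}).encode('utf-8')
    return urllib.request.Request(
        base_url + '/predict',
        data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


request = build_predict_request('aspirin', 'warfarin')
print(request.get_method(), request.full_url)
# To actually send it (requires the service to be running):
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.load(response))
```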

DOCUMENTATION
==============

For complete details, see:
  - ../OPTIMIZATION_FRAMEWORK.py
    → Comprehensive framework documentation (read this first!)
  - ../README.md
    → Quick start and deployment guide
  - ../MEDCARE-DDI-AI/src/inference/app_production.py
    → FastAPI backend specification

QUESTIONS
=========

Q: Which phases are required?
A: All 9 for a first-time production deployment. Phase 7 (benchmarks) is the minimum validation.

Q: How long does the full workflow take?
A: ~2-3 hours on a modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).

Q: Can I run phases out of order?
A: Some phases depend on earlier ones: Phase 1 informs the validation sets used
   from Phase 3 onward, and Phase 2 results are used in Phase 3.
   In general, run phases 1-9 in order.

Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md.
   Most failures are due to missing dependencies or data files.
   See requirements.txt and ensure the DDInter data is in data/processed/

Q: How do I select which model to deploy?
A: Use final_model_card.md, which ranks models by these weighted criteria:
   severe_recall (40%), calibration (20%), auroc (20%), stability (10%), latency (10%)
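
   As an illustration, a weighted ranking of this kind could be computed as
   below. The weights come from the answer above; the candidate names, their
   scores, and the assumption that every criterion is already normalized to
   [0, 1] with higher being better are hypothetical.

```python
WEIGHTS = {
    'severe_recall': 0.40,
    'calibration': 0.20,
    'auroc': 0.20,
    'stability': 0.10,
    'latency': 0.10,
}


def ranking_score(scores):
    # scores: each criterion normalized to [0, 1], higher is better (assumed)
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


candidates = {  # made-up models and scores, for illustration only
    'model_a': {'severe_recall': 0.96, 'calibration': 0.80, 'auroc': 0.92,
                'stability': 0.90, 'latency': 0.70},
    'model_b': {'severe_recall': 0.85, 'calibration': 0.90, 'auroc': 0.93,
                'stability': 0.90, 'latency': 0.80},
}
best = max(candidates, key=lambda name: ranking_score(candidates[name]))
print(best, round(ranking_score(candidates[best]), 3))
```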

Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments.
   See --help for each script:
   python src/validation/optuna_hyperparameter_tune.py --help
"""

from pathlib import Path

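if __name__ == '__main__':
    # Write the module docstring next to this script as README.md.
    output_path = Path(__file__).parent / 'README.md'
    # The docstring contains non-ASCII characters, so set the encoding explicitly.
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')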