"""Validation and Optimization Module README

This directory contains the complete 9-phase optimization and validation framework for the MEDCARE-DDI AI system.

FILES

Master Orchestration:

  • run_complete_workflow.py
      Execute all 9 phases or a subset.
      Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]

Phase 1: Dataset Audit

  • dataset_audit.py
      Detect duplicates, conflicts, class imbalance, and data quality issues.
      Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv
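
As a rough illustration of the kind of checks such an audit performs, the sketch below counts redundant rows, flags label conflicts (the same drug pair with different labels), and tallies class counts. The record layout `(drug_a, drug_b, label)` and the function name `audit_pairs` are assumptions made for this example; they are not the actual interface of dataset_audit.py.

```python
from collections import Counter, defaultdict

def audit_pairs(records):
    # records: iterable of (drug_a, drug_b, label) tuples (hypothetical layout).
    labels_by_pair = defaultdict(list)
    for drug_a, drug_b, label in records:
        # Treat (a, b) and (b, a) as the same interaction pair.
        key = tuple(sorted((drug_a.lower(), drug_b.lower())))
        labels_by_pair[key].append(label)

    # Rows that repeat a (pair, label) combination already seen.
    duplicates = sum(len(v) - len(set(v)) for v in labels_by_pair.values())
    # Pairs that carry more than one distinct label.
    conflicts = sorted(k for k, v in labels_by_pair.items() if len(set(v)) > 1)
    class_counts = Counter(label for v in labels_by_pair.values() for label in v)
    return {
        "duplicates": duplicates,
        "conflicts": conflicts,
        "class_counts": dict(class_counts),
    }
```

Normalizing the pair order before grouping matters here: without it, a reversed pair with a different label would not be reported as a conflict.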

Phase 2: Embedding Benchmarks

  • embedding_benchmark.py
      Compare BioBERT, PubMedBERT, SapBERT, and ChemBERTa.
      Output: embedding_benchmark_results.csv, embedding_ablation_report.md

Phase 3: Hyperparameter Optimization

  • optuna_hyperparameter_tune.py
      Optuna-based tuning of LR, dropout, hidden_dim, batch_size, and focal_gamma.
      Usage: python optuna_hyperparameter_tune.py --n-trials 50
      Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md
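
To make the role of focal_gamma concrete, here is a minimal sketch of the standard focal loss for a single example. It assumes that focal_gamma is the focusing parameter of focal loss, which this README does not state explicitly; the function name `focal_loss` is illustrative only.

```python
import math

def focal_loss(p_true, gamma=2.0):
    # p_true: predicted probability of the correct class, in (0, 1].
    # gamma = 0 reduces to ordinary cross-entropy; larger gamma shrinks the
    # loss of well-classified (high p_true) examples more than hard ones.
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

With a larger gamma, training effort shifts toward examples the model still gets wrong, which is one reason a higher value can help recall on a rare class.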

Phase 4: Ensemble Ablation

  • ensemble_ablation_study.py
      Compare voting, blending, and stacking strategies.
      Output: ensemble_benchmark.csv, ensemble_ablation.md
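
For orientation, the simplest of these strategies can be sketched in a few lines. The function below averages per-class probabilities across models (soft voting); passing weights turns it into a basic weighted blend. The name `soft_vote` and its signature are assumptions for this example, and stacking, which trains a meta-model on the base predictions, is not shown.

```python
def soft_vote(prob_lists, weights=None):
    # prob_lists: one probability vector per model, all over the same classes.
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models  # plain voting: every model counts equally
    total = sum(weights)
    n_classes = len(prob_lists[0])
    # Weighted average of the models' probabilities, class by class.
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
```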

Phase 5: Healthcare Safety Tuning

  • healthcare_safety_tuning.py
      Analyze false negatives and optimize thresholds for the severe class.
      Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json

Phase 6: Explainability Validation

  • explainability_validation.py
      Feature importance, SHAP, and example explanations.
      Output: explainability_examples.md, feature_importance.csv

Phase 7: Comprehensive Benchmarks

  • comprehensive_benchmark.py
      Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency).
      Output: final_benchmark_report.md, benchmark_metrics.json

Phase 8: Production Validation

  • production_validation.py
      FastAPI compatibility, CPU/GPU inference, latency, and readiness checks.
      Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

Existing Tests:

  • test_multimodal_components.py
      Unit tests for calibration, ensemble, embeddings, and molecular features.
      Usage: python -m unittest test_multimodal_components -v

QUICK START

Run all 9 phases:

cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py

Run specific phases:

python src/validation/run_complete_workflow.py --phases 1 3 7

Run individual phases:

python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
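
The `--phases` and `--skip-phases` options shown above could be parsed along the following lines. This is a sketch of one plausible approach using argparse, not the actual code of run_complete_workflow.py; the names `ALL_PHASES` and `select_phases` are hypothetical.

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1..9

def select_phases(argv=None):
    parser = argparse.ArgumentParser(description="Phase selection sketch")
    parser.add_argument("--phases", type=int, nargs="+",
                        choices=ALL_PHASES, default=ALL_PHASES)
    parser.add_argument("--skip-phases", type=int, nargs="*",
                        choices=ALL_PHASES, default=[])
    args = parser.parse_args(argv)
    skipped = set(args.skip_phases)
    # Always run the selected phases in ascending order, minus skipped ones.
    return [p for p in sorted(set(args.phases)) if p not in skipped]
```

For example, `select_phases(["--phases", "1", "3", "7"])` yields `[1, 3, 7]`, and passing only `--skip-phases 4` yields every phase except 4.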

OUTPUT REPORTS

All reports saved to: MEDCARE-DDI-AI/models/reports/

Key Reports (in reading order):

  1. dataset_audit_report.json → Data quality overview
  2. embedding_benchmark_results.csv → Best embedding model to use
  3. optuna_best_params.json → Optimal hyperparameters (use for training)
  4. ensemble_benchmark.csv → Best ensemble strategy (voting/blending/stacking)
  5. safety_analysis_report.md → False negatives and recommended thresholds
  6. final_benchmark_report.md → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
  7. production_readiness_report.md → Deployment checklist and instructions
  8. final_model_card.md → Production model specification

TARGET METRICS

Success criteria (all must be met):

  ✓ Accuracy ≥ 85%
  ✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
  ✓ AUROC ≥ 0.90
  ✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
  ✓ p99 Latency < 200ms
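
These criteria translate directly into an automated gate. The sketch below checks a metrics dictionary against the thresholds listed above and returns the names of any targets that fail. The key names (for example `severe_recall`, `p99_latency_ms`) are assumptions; the actual schema of benchmark_metrics.json may differ.

```python
TARGETS = {
    "accuracy": (">=", 0.85),
    "severe_recall": (">=", 0.90),
    "auroc": (">=", 0.90),
    "ece": ("<", 0.05),
    "p99_latency_ms": ("<", 200.0),
}

def failed_targets(metrics):
    # Return the names of all targets that are not met (empty list = pass).
    failures = []
    for name, (op, bound) in TARGETS.items():
        value = metrics[name]
        ok = value >= bound if op == ">=" else value < bound
        if not ok:
            failures.append(name)
    return failures
```

Note that ECE and latency use strict inequalities, so a value exactly at the bound counts as a failure.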

WORKFLOW DIAGRAM

Data
  ↓
Phase 1: Dataset Audit
  ├─→ data quality issues detected?
  └─→ conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├─→ Compare 4 embedding models
  ├─→ Choose best (by severe recall)
  └─→ Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├─→ Optuna: 50 trials with healthcare objective
  ├─→ Find: best LR, dropout, hidden_dim, batch_size
  └─→ Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├─→ Compare: voting, blending, stacking
  ├─→ Choose best ensemble strategy
  └─→ Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├─→ Analyze false negatives
  ├─→ Optimize severe escalation threshold
  └─→ Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├─→ Feature importance ranking
  ├─→ Example predictions + rationales
  └─→ Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├─→ Final metrics on all dimensions
  ├─→ Confusion matrices
  └─→ Check: all targets met?
  ↓
Phase 8: Production Validation
  ├─→ FastAPI compatibility
  ├─→ Latency benchmarks (p99 < 200ms?)
  └─→ Deployment readiness
  ↓
Phase 9: Model Selection
  └─→ Choose production model + deployment instructions

TROUBLESHOOTING

If Severe Recall < 90%:

  1. Check safety_analysis_report.md for FN breakdown
  2. Re-run Phase 5 with lower threshold (0.30-0.35)
  3. Re-run Phase 3 with higher focal_gamma
  4. Collect more data (the Phase 1 audit may reveal missing classes)
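
Step 2 amounts to sweeping the severe-class threshold on held-out predictions and keeping the highest value that still meets the recall target, so that as few non-severe cases as possible are escalated. The sketch below shows that idea under assumed inputs (a list of severe-class scores and matching boolean ground truth); the function names are hypothetical and not taken from healthcare_safety_tuning.py.

```python
def severe_recall(scores, is_severe, threshold):
    # Fraction of truly severe cases whose score reaches the threshold.
    tp = sum(1 for s, y in zip(scores, is_severe) if y and s >= threshold)
    positives = sum(1 for y in is_severe if y)
    return tp / positives if positives else 0.0

def highest_threshold_meeting(scores, is_severe, target=0.90, candidates=None):
    if candidates is None:
        candidates = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95
    # Walk from strict to lenient and stop at the first threshold that passes.
    for t in sorted(candidates, reverse=True):
        if severe_recall(scores, is_severe, t) >= target:
            return t
    return None  # no candidate reaches the target recall
```

A return value of `None` signals that thresholding alone cannot reach the target, which points toward steps 3 and 4.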

If ECE > 0.05:

  1. Use calibrated voting (Phase 4)
  2. Apply temperature scaling (Phase 5)
  3. Fit the temperature on the validation set rather than the training set
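
Temperature scaling divides the logits by a single scalar T before the softmax, and T is chosen to minimize negative log-likelihood on validation data. The sketch below fits T with a simple grid search; the grid range and the function names are assumptions for illustration, and a real implementation might use a gradient-based optimizer instead.

```python
import math

def softmax_t(logits, temperature):
    # Numerically stable softmax of logits / temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fit_temperature(val_logits, val_labels, grid=None):
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0

    def nll(t):
        # Mean negative log-likelihood of the true labels at temperature t.
        losses = [-math.log(softmax_t(z, t)[y])
                  for z, y in zip(val_logits, val_labels)]
        return sum(losses) / len(losses)

    return min(grid, key=nll)
```

A fitted T above 1 softens overconfident predictions; the predicted class never changes, because dividing by a positive constant preserves the ordering of the logits.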

If p99 Latency > 200ms:

  1. Reduce hidden_dim in Phase 3
  2. Use fewer ensemble models in Phase 4
  3. Deploy on GPU instead of CPU

REPRODUCIBILITY

All scripts use a fixed seed (2026) for reproducibility:

  • Train/test splits are deterministic
  • Model initialization is deterministic
  • Optuna trials are deterministic
  • Results are exactly reproducible on the same hardware

For exact reproduction:

  1. Use Python 3.12+
  2. Use the exact package versions from requirements.txt
  3. Run on the same hardware type (CPU vs GPU type matters)
  4. Don't interrupt the workflow
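
A fixed seed is typically applied through a small helper called at the start of each script. The sketch below shows one common pattern; whether the validation scripts use NumPy or PyTorch, and how they apply the seed, is an assumption here, so both are seeded only if they can be imported.

```python
import os
import random

SEED = 2026

def seed_everything(seed=SEED):
    random.seed(seed)
    # Only affects Python subprocesses started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed: nothing to seed
```

Seeding alone does not guarantee bit-identical results across different hardware, which is why step 3 above still matters.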

MONITORING (POST-DEPLOYMENT)

After deploying the production model, monitor:

Health Check:
  curl http://localhost:8000/health
  → Should return: "status": "healthy"

Latency:
  → Track p50 and p99 over time (alert if p99 > 200ms)

Severe Recall:
  → If ground truth is available, track the false-negative rate on the severe class

Calibration:
  → Monitor drift: does confidence still match accuracy?
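
Both the latency and calibration checks reduce to small computations over logged predictions. The sketch below uses a nearest-rank percentile for p50/p99 and the usual equal-width binned definition of ECE; the function names and the choice of 10 bins are assumptions, not values taken from the production code.

```python
import math

def percentile(values, q):
    # Nearest-rank percentile, q in (0, 100].
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by confidence, then average |confidence - accuracy|
    # over the bins, weighted by how many predictions fall in each bin.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece
```

Recomputing ECE on a rolling window of labeled production traffic and comparing it with the 0.05 target is one way to detect calibration drift.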

Example Request:
  curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'

DOCUMENTATION

For complete details, see:

  • ../OPTIMIZATION_FRAMEWORK.py → Comprehensive framework documentation (read this first!)
  • ../README.md → Quick start and deployment guide
  • ../MEDCARE-DDI-AI/src/inference/app_production.py → FastAPI backend specification

QUESTIONS

Q: Which phases are required?
A: All 9 for a first-time production release. Phase 7 (benchmarks) is the minimum validation.

Q: How long does the full workflow take?
A: 2-3 hours on a modern CPU. Phase 3 (Optuna) dominates (1 hour for 50 trials).

Q: Can I run phases out of order?
A: Some phases depend on earlier ones: Phase 1 informs the validation sets used from Phase 3 onward, and Phase 2 results are used in Phase 3. In general, run 1-9 in order.

Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md. Most failures are due to missing dependencies or data files. See requirements.txt and ensure the DDInter data is in data/processed/

Q: How do I select which model to deploy?
A: Use final_model_card.md, which ranks models by: severe_recall (40%) > calibration (20%) > auroc (20%) > stability (10%) > latency (10%)
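
One way to read these weights is as a weighted sum over per-criterion scores. The sketch below assumes each criterion has already been normalized to the range [0, 1] with higher meaning better (so calibration and latency would need to be inverted first); that normalization and the function names are assumptions, since this README does not describe how the model card computes its ranking.

```python
WEIGHTS = {
    "severe_recall": 0.40,
    "calibration": 0.20,
    "auroc": 0.20,
    "stability": 0.10,
    "latency": 0.10,
}

def selection_score(scores):
    # scores: criterion name -> normalized value in [0, 1], higher is better.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

def rank_models(candidates):
    # candidates: model name -> scores dict; best model first.
    return sorted(candidates,
                  key=lambda name: selection_score(candidates[name]),
                  reverse=True)
```

Under this reading, a model with slightly lower severe recall can still rank first if it is clearly better calibrated and faster, so the severe-recall target should be enforced separately as a hard constraint.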

Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments. See --help for each script:
   python src/validation/optuna_hyperparameter_tune.py --help
"""

from pathlib import Path

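if __name__ == '__main__':
    # Write this module's docstring out as README.md next to this file.
    output_path = Path(__file__).parent / 'README.md'
    # The text contains non-ASCII symbols, so set the encoding explicitly.
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')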