r"""Validation and Optimization Module README

This directory contains the complete 9-phase optimization and validation framework
for the MEDCARE-DDI AI system.

FILES
=====

Master Orchestration:
  - run_complete_workflow.py
    Execute all 9 phases or a subset
    Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]

Phase 1: Dataset Audit
  - dataset_audit.py
    Detect duplicates, conflicts, class imbalance, data quality issues
    Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv

Phase 2: Embedding Benchmarks
  - embedding_benchmark.py
    Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
    Output: embedding_benchmark_results.csv, embedding_ablation_report.md

Phase 3: Hyperparameter Optimization
  - optuna_hyperparameter_tune.py
    Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
    Usage: python optuna_hyperparameter_tune.py --n-trials 50
    Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md

Phase 4: Ensemble Ablation
  - ensemble_ablation_study.py
    Compare voting, blending, stacking strategies
    Output: ensemble_benchmark.csv, ensemble_ablation.md

Phase 5: Healthcare Safety Tuning
  - healthcare_safety_tuning.py
    Analyze false negatives, optimize severe thresholds
    Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json

Phase 6: Explainability Validation
  - explainability_validation.py
    Feature importance, SHAP, example explanations
    Output: explainability_examples.md, feature_importance.csv

Phase 7: Comprehensive Benchmarks
  - comprehensive_benchmark.py
    Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
    Output: final_benchmark_report.md, benchmark_metrics.json

Phase 8: Production Validation
  - production_validation.py
    FastAPI compatibility, CPU/GPU inference, latency, readiness
    Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

Existing Tests:
  - test_multimodal_components.py
    Unit tests for calibration, ensemble, embeddings, molecular features
    Usage: python -m unittest test_multimodal_components -v

QUICK START
===========

Run all 9 phases:
```bash
cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py
```

Run specific phases:
```bash
python src/validation/run_complete_workflow.py --phases 1 3 7
```
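
The sketch below shows one way the --phases and --skip-phases flags could be
resolved into a run list. It is an illustration only: resolve_phases and
ALL_PHASES are hypothetical names, and the real parser in
run_complete_workflow.py may behave differently.

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1..9, as described in this README


def resolve_phases(argv):
    # Hypothetical parser; the real run_complete_workflow.py may differ.
    parser = argparse.ArgumentParser(description='Select workflow phases')
    parser.add_argument('--phases', nargs='+', type=int, choices=ALL_PHASES,
                        default=ALL_PHASES, help='phases to run')
    parser.add_argument('--skip-phases', nargs='+', type=int, choices=ALL_PHASES,
                        default=[], help='phases to leave out')
    args = parser.parse_args(argv)
    skipped = set(args.skip_phases)
    # Run in ascending order, since later phases consume earlier outputs.
    return sorted(p for p in set(args.phases) if p not in skipped)


print(resolve_phases(['--phases', '1', '3', '7']))
print(resolve_phases(['--skip-phases', '4']))
```

Sorting the result keeps the phase order stable even if the flags are given
out of order.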

Run individual phase:
```bash
python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc
```

OUTPUT REPORTS
==============

All reports are saved to: MEDCARE-DDI-AI/models/reports/

Key Reports (in reading order):
  1. dataset_audit_report.json
     → Data quality overview
  
  2. embedding_benchmark_results.csv
     → Best embedding model to use
  
  3. optuna_best_params.json
     → Optimal hyperparameters (use for training)
  
  4. ensemble_benchmark.csv
     → Best ensemble strategy (voting/blending/stacking)
  
  5. safety_analysis_report.md
     → False negatives and recommended thresholds
  
  6. final_benchmark_report.md
     → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
  
  7. production_readiness_report.md
     → Deployment checklist and instructions
  
  8. final_model_card.md
     → Production model specification

TARGET METRICS
==============

Success criteria (all must be met):
  ✓ Accuracy ≥ 85%
  ✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
  ✓ AUROC ≥ 0.90
  ✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
  ✓ p99 Latency < 200ms
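
A minimal sketch of how the criteria above could be checked in code. The
thresholds come from the list; the metric key names and the example values are
assumptions, not the actual schema of benchmark_metrics.json.

```python
import operator

# Thresholds copied from the list above; the metric key names are hypothetical.
TARGETS = {
    'accuracy': (operator.ge, 0.85),
    'severe_recall': (operator.ge, 0.90),
    'auroc': (operator.ge, 0.90),
    'ece': (operator.lt, 0.05),
    'p99_latency_ms': (operator.lt, 200.0),
}


def failed_targets(metrics):
    # Return the names of targets that are missing or not met.
    failures = []
    for name, (compare, threshold) in TARGETS.items():
        value = metrics.get(name)
        if value is None or not compare(value, threshold):
            failures.append(name)
    return failures


# Made-up numbers for illustration only.
example = {'accuracy': 0.88, 'severe_recall': 0.87, 'auroc': 0.93,
           'ece': 0.04, 'p99_latency_ms': 150.0}
print(failed_targets(example))
```

Treating a missing metric as a failure keeps an incomplete benchmark run from
passing by accident.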

WORKFLOW DIAGRAM
================

Data
  ↓
Phase 1: Dataset Audit
  ├─→ data quality issues detected?
  └─→ conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├─→ Compare 4 embedding models
  ├─→ Choose best (by severe recall)
  └─→ Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├─→ Optuna: 50 trials with healthcare objective
  ├─→ Find: best LR, dropout, hidden_dim, batch_size
  └─→ Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├─→ Compare: voting, blending, stacking
  ├─→ Choose best ensemble strategy
  └─→ Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├─→ Analyze false negatives
  ├─→ Optimize severe escalation threshold
  └─→ Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├─→ Feature importance ranking
  ├─→ Example predictions + rationales
  └─→ Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├─→ Final metrics on all dimensions
  ├─→ Confusion matrices
  └─→ Check: all targets met?
  ↓
Phase 8: Production Validation
  ├─→ FastAPI compatibility
  ├─→ Latency benchmarks (p99 < 200ms?)
  └─→ Deployment readiness
  ↓
Phase 9: Model Selection
  └─→ Choose production model + deployment instructions

TROUBLESHOOTING
===============

If Severe Recall < 90%:
  1. Check safety_analysis_report.md for FN breakdown
  2. Re-run Phase 5 with lower threshold (0.30-0.35)
  3. Re-run Phase 3 with higher focal_gamma
  4. Increase data collection (Phase 1 may show missing classes)
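
Step 2 can be pictured as a scan over candidate thresholds. The sketch below
finds the highest severe-class threshold that still reaches the recall target.
It assumes you have the predicted severe probability for each truly severe
pair; the function names and scores are illustrative, not part of
healthcare_safety_tuning.py.

```python
def recall_at(threshold, severe_scores):
    # Fraction of truly severe pairs whose severe probability clears the threshold.
    hits = sum(1 for s in severe_scores if s >= threshold)
    return hits / len(severe_scores)


def highest_threshold_for_recall(severe_scores, target=0.90):
    # Scan thresholds from 0.99 down to 0.01; return the first that meets the target.
    for step in range(99, 0, -1):
        threshold = step / 100
        if recall_at(threshold, severe_scores) >= target:
            return threshold
    return None


# Illustrative scores for ten truly severe pairs (not real model output).
scores = [0.95, 0.90, 0.80, 0.72, 0.66, 0.55, 0.48, 0.41, 0.33, 0.20]
print(highest_threshold_for_recall(scores))
```

Lowering the threshold raises severe recall at the cost of more false alarms,
so check precision on the severe class after any change.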

If ECE > 0.05:
  1. Use calibrated voting (Phase 4)
  2. Apply temperature scaling (Phase 5)
  3. Fit the temperature on the validation set
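
Temperature scaling divides the logits by a single scalar T fitted on
validation data. The sketch below is a simplified binary version (severe vs.
not severe) using a grid search; the real model is likely multiclass and may
fit T differently, and the logits shown are invented.

```python
import math


def nll(logits, labels, temperature):
    # Mean negative log-likelihood of binary labels after dividing logits by T.
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / temperature))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(val_logits, val_labels):
    # Grid search over T in [0.5, 5.0]; the grid bounds are an arbitrary choice.
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(val_logits, val_labels, t))


# Illustrative validation logits from an overconfident model (not real data).
val_logits = [6.0, 5.0, -6.0, 5.0, -5.0, 6.0, -5.0, 4.0]
val_labels = [1, 1, 0, 1, 0, 1, 0, 0]
T = fit_temperature(val_logits, val_labels)
print(round(T, 2))
```

A fitted T above 1 softens overconfident probabilities without changing which
class is predicted.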

If p99 Latency > 200ms:
  1. Reduce hidden_dim in Phase 3
  2. Fewer ensemble models in Phase 4
  3. Deploy on GPU instead of CPU

REPRODUCIBILITY
===============

All scripts use a fixed seed (2026) for reproducibility:
  - Train/test splits are deterministic
  - Model initialization is deterministic
  - Optuna trials are deterministic
  - Results are exactly reproducible on the same hardware

For exact reproduction:
  1. Use Python 3.12+
  2. Use exact package versions from requirements.txt
  3. Run on the same hardware type (CPU vs. GPU type matters)
  4. Don't interrupt the workflow
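
A sketch of what fixed-seed setup typically looks like. seed_everything is a
hypothetical helper, not necessarily what the scripts define; NumPy and
PyTorch are seeded only if they are installed.

```python
import os
import random

SEED = 2026  # the seed this README says all scripts use


def seed_everything(seed=SEED):
    # Seed the standard library RNG, plus NumPy and PyTorch when installed.
    # PYTHONHASHSEED only affects interpreters started after it is set
    # (for example subprocesses), not the current process.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


seed_everything()
first = random.random()
seed_everything()
print(first == random.random())
```

Optuna also needs a seeded sampler to repeat its trials, for example
optuna.samplers.TPESampler(seed=2026) passed to optuna.create_study.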

MONITORING (POST-DEPLOYMENT)
============================

After deploying production model, monitor:

Health Check:
  curl http://localhost:8000/health
  → Response should include: "status": "healthy"

Latency:
  → Track p50 and p99 over time (alert if p99 > 200ms)
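
A minimal sketch of the latency check, using the nearest-rank percentile over
a window of request timings. The function names and the sample window are
illustrative; a production setup would more likely read these from a metrics
system.

```python
import math


def percentile(samples, q):
    # Nearest-rank percentile: smallest value with at least q percent of
    # the samples at or below it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(latencies_ms, limit_ms=200.0):
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {'p50_ms': p50, 'p99_ms': p99, 'alert': p99 > limit_ms}


# Illustrative window: 98 fast requests and 2 slow ones (not measured data).
window = [40.0] * 98 + [250.0, 300.0]
print(latency_alert(window))
```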

Severe Recall:
  → If ground truth is available, track the false-negative rate on the severe class

Calibration:
  → Monitor drift: is confidence still matching accuracy?
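
Expected calibration error (ECE) is one way to quantify whether confidence
still matches accuracy. The sketch below uses 10 equal-width confidence bins,
which is a common default and an assumption here; the predictions are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # Weighted average gap between mean confidence and accuracy per bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(mean_conf - accuracy)
    return ece


# Illustrative predictions (not real model output): confidence, and whether
# the prediction turned out to be right (1) or wrong (0).
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
print(round(expected_calibration_error(confidences, correct), 3))
```

Recomputing this on recent labeled traffic and comparing it with the 0.05
target gives a simple drift signal.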

Example Request:
  curl -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
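
The same request can be built from Python with the standard library. This
sketch only constructs the request; the endpoint and payload fields come from
the curl example, while build_predict_request is a hypothetical helper and the
response schema is not documented here.

```python
import json
import urllib.request


def build_predict_request(drug_a, drug_b, base_url='http://localhost:8000'):
    # Build (but do not send) the POST request shown in the curl example.
    payload = json.dumps({'drug_a': drug_a, 'drug_b': drug_b}).encode('utf-8')
    return urllib.request.Request(
        base_url + '/predict',
        data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


request = build_predict_request('aspirin', 'warfarin')
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(json.load(response))
```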

DOCUMENTATION
=============

For complete details, see:
  - ../OPTIMIZATION_FRAMEWORK.py
    → Comprehensive framework documentation (read this first!)
  - ../README.md
    → Quick start and deployment guide
  - ../MEDCARE-DDI-AI/src/inference/app_production.py
    → FastAPI backend specification

QUESTIONS
=========

Q: Which phases are required?
A: All 9 for first-time production. Phase 7 (benchmarks) is minimum validation.

Q: How long does full workflow take?
A: ~2-3 hours on modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).

Q: Can I run phases out of order?
A: Not reliably. Phase 1 informs the validation sets used from Phase 3 onward,
   and Phase 2 results are used in Phase 3. Generally, run 1-9 in order.

Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md.
   Most failures are due to missing dependencies or data files.
   See requirements.txt and ensure the DDInter data is in data/processed/.

Q: How do I select which model to deploy?
A: Use final_model_card.md, which ranks models by a weighted score:
   severe_recall (40%), calibration (20%), auroc (20%), stability (10%), latency (10%)
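
A sketch of how such a weighted ranking could be computed. The weights are the
ones listed in the answer; how each criterion is normalized is not documented
here, so the sketch assumes every score is already scaled to [0, 1] with
higher meaning better. The model names and scores are made up.

```python
# Weights copied from the answer above; they sum to 1.0.
WEIGHTS = {
    'severe_recall': 0.40,
    'calibration': 0.20,
    'auroc': 0.20,
    'stability': 0.10,
    'latency': 0.10,
}


def composite_score(scores):
    # Weighted sum of per-criterion scores, each assumed to lie in [0, 1]
    # with higher meaning better (latency and ECE must be inverted first).
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical candidates with made-up, already-normalized scores.
candidates = {
    'model_a': {'severe_recall': 0.92, 'calibration': 0.80, 'auroc': 0.91,
                'stability': 0.90, 'latency': 0.60},
    'model_b': {'severe_recall': 0.88, 'calibration': 0.95, 'auroc': 0.93,
                'stability': 0.85, 'latency': 0.90},
}
ranking = sorted(candidates, key=lambda name: composite_score(candidates[name]),
                 reverse=True)
print(ranking)
```

A weighted score can rank a model first even when it misses the 90% severe
recall target, so apply the success criteria as a hard gate before ranking.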

Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments.
   See --help for each script:
   python src/validation/optuna_hyperparameter_tune.py --help
"""

from pathlib import Path

if __name__ == '__main__':
    output_path = Path(__file__).parent / 'README.md'
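    # The README contains non-ASCII characters (arrows, check marks), so write
    # UTF-8 explicitly instead of relying on the platform default encoding.
    with output_path.open('w', encoding='utf-8') as f: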
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')