"""Validation and Optimization Module README

This directory contains the complete 9-phase optimization and validation framework for the MEDCARE-DDI AI system.

FILES

Master Orchestration:

run_complete_workflow.py
    Execute all 9 phases or a subset
    Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]
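
The two flags above could be parsed roughly as follows. This is a minimal sketch, not the actual contents of run_complete_workflow.py: the function name select_phases and the rule that --skip-phases is applied after --phases are assumptions made for illustration.

```python
import argparse

ALL_PHASES = list(range(1, 10))  # phases 1..9, as described in this README


def select_phases(argv=None):
    # Hypothetical helper: resolves which phases to run from the two flags.
    parser = argparse.ArgumentParser(description="Phase selection sketch")
    parser.add_argument("--phases", type=int, nargs="+",
                        choices=ALL_PHASES, default=ALL_PHASES)
    parser.add_argument("--skip-phases", type=int, nargs="+",
                        choices=ALL_PHASES, default=[])
    args = parser.parse_args(argv)
    skipped = set(args.skip_phases)
    # Run in ascending order, since later phases consume earlier outputs.
    return [p for p in sorted(set(args.phases)) if p not in skipped]
```

Under these assumptions, select_phases(["--phases", "1", "3", "7"]) returns [1, 3, 7], and passing only --skip-phases 4 runs every phase except Phase 4.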

Phase 1: Dataset Audit

dataset_audit.py
    Detect duplicates, conflicts, class imbalance, data quality issues
    Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv
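
The kinds of checks listed for Phase 1 can be illustrated with a small sketch. It assumes records are (drug_a, drug_b, label) tuples and that a drug pair is order-independent; neither the record layout nor the function name audit_pairs comes from dataset_audit.py.

```python
from collections import Counter, defaultdict


def audit_pairs(records):
    # records: list of (drug_a, drug_b, label) tuples (assumed layout).
    labels_by_pair = defaultdict(list)
    for drug_a, drug_b, label in records:
        # Assumption: (A, B) and (B, A) describe the same interaction.
        key = tuple(sorted((drug_a.lower(), drug_b.lower())))
        labels_by_pair[key].append(label)

    # A pair seen more than once is a duplicate; a duplicate whose labels
    # disagree is a conflict.
    duplicates = {k: len(v) for k, v in labels_by_pair.items() if len(v) > 1}
    conflicts = {k: sorted(set(v)) for k, v in labels_by_pair.items()
                 if len(set(v)) > 1}
    class_counts = Counter(label for _, _, label in records)
    return {
        "duplicates": duplicates,
        "conflicts": conflicts,
        "conflict_rate": len(conflicts) / len(labels_by_pair),
        "class_counts": dict(class_counts),
        "imbalance_ratio": max(class_counts.values()) / min(class_counts.values()),
    }
```

The conflict_rate value is the share of unique pairs with contradictory labels, which is one plausible reading of the "conflicts < 5%" check in the workflow diagram.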

Phase 2: Embedding Benchmarks

embedding_benchmark.py
    Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
    Output: embedding_benchmark_results.csv, embedding_ablation_report.md

Phase 3: Hyperparameter Optimization

optuna_hyperparameter_tune.py
    Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
    Usage: python optuna_hyperparameter_tune.py --n-trials 50
    Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md

Phase 4: Ensemble Ablation

ensemble_ablation_study.py
    Compare voting, blending, stacking strategies
    Output: ensemble_benchmark.csv, ensemble_ablation.md
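
To make two of the compared strategies concrete, here is a rough sketch of soft voting and of a simplified blend. It assumes each model emits a probability vector over the same class order. Blending is reduced here to a fixed weighted average, which is a simplification chosen for illustration; stacking (a meta-model trained on base-model outputs) is omitted. Neither function is taken from ensemble_ablation_study.py.

```python
def soft_vote(prob_rows):
    # prob_rows: one probability vector per model, same class order.
    n_models = len(prob_rows)
    return [sum(col) / n_models for col in zip(*prob_rows)]


def blend(prob_rows, weights):
    # Weighted average of the per-model probabilities. The weights are
    # assumed to come from validation performance (hypothetical choice).
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, col)) / total
            for col in zip(*prob_rows)]
```

With equal weights, blend reduces to soft_vote, so the two strategies differ only in how much each model is trusted.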

Phase 5: Healthcare Safety Tuning

healthcare_safety_tuning.py
    Analyze false negatives, optimize severe thresholds
    Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json
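
One way to think about severe-threshold optimization is as a search for the highest escalation threshold that still meets a recall target. The sketch below assumes a per-example severe-class score and a binary is_severe label; the function name and the selection rule are illustrative assumptions, not the logic of healthcare_safety_tuning.py.

```python
def pick_severe_threshold(scores, is_severe, target_recall=0.90):
    # scores: predicted severe-class probabilities (assumed input).
    # is_severe: 1 if the example is truly severe, else 0.
    positives = sum(is_severe)
    if positives == 0:
        raise ValueError("no severe cases to evaluate")
    # Walk thresholds from strict to lenient; recall can only grow, so the
    # first threshold that reaches the target is the highest one that does.
    for threshold in sorted(set(scores), reverse=True):
        caught = sum(1 for s, y in zip(scores, is_severe)
                     if y and s >= threshold)
        recall = caught / positives
        if recall >= target_recall:
            return threshold, recall
    return None
```

Choosing the highest qualifying threshold keeps false alarms as low as the recall target allows; a lower threshold trades more false positives for fewer missed severe cases.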

Phase 6: Explainability Validation

explainability_validation.py
    Feature importance, SHAP, example explanations
    Output: explainability_examples.md, feature_importance.csv

Phase 7: Comprehensive Benchmarks

comprehensive_benchmark.py
    Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
    Output: final_benchmark_report.md, benchmark_metrics.json

Phase 8: Production Validation

production_validation.py
    FastAPI compatibility, CPU/GPU inference, latency, readiness
    Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

Existing Tests:

test_multimodal_components.py
    Unit tests for calibration, ensemble, embeddings, molecular features
    Usage: python -m unittest test_multimodal_components -v

QUICK START

Run all 9 phases:

cd MEDCARE-DDI-AI
python src/validation/run_complete_workflow.py

Run specific phases:

python src/validation/run_complete_workflow.py --phases 1 3 7

Run individual phase:

python src/validation/dataset_audit.py
python src/validation/embedding_benchmark.py
# ... etc

OUTPUT REPORTS

All reports saved to: MEDCARE-DDI-AI/models/reports/

Key Reports (in reading order):

dataset_audit_report.json → Data quality overview
embedding_benchmark_results.csv → Best embedding model to use
optuna_best_params.json → Optimal hyperparameters (use for training)
ensemble_benchmark.csv → Best ensemble strategy (voting/blending/stacking)
safety_analysis_report.md → False negatives and recommended thresholds
final_benchmark_report.md → Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)
production_readiness_report.md → Deployment checklist and instructions
final_model_card.md → Production model specification

TARGET METRICS

Success criteria (all must be met):

✓ Accuracy ≥ 85%
✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
✓ AUROC ≥ 0.90
✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
✓ p99 Latency < 200ms
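
As an illustration of how these criteria could be checked, the sketch below computes ECE with the common equal-width binning definition and then tests a metrics dictionary against the thresholds above. The dictionary keys (accuracy, severe_recall, auroc, ece, p99_latency_ms) are hypothetical; the schema of benchmark_metrics.json is not documented here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE: sum over bins of (bin share) * |bin accuracy - bin confidence|.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def check_targets(metrics):
    # Thresholds copied from the success criteria above.
    # The metric key names are assumptions made for this sketch.
    checks = {
        "accuracy": metrics["accuracy"] >= 0.85,
        "severe_recall": metrics["severe_recall"] >= 0.90,
        "auroc": metrics["auroc"] >= 0.90,
        "ece": metrics["ece"] < 0.05,
        "p99_latency_ms": metrics["p99_latency_ms"] < 200,
    }
    return checks, all(checks.values())
```

A model that is 90% confident on four predictions but right on only three has an ECE of about 0.15 under this definition, which would fail the ECE criterion even if every other target were met.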

WORKFLOW DIAGRAM

Data
  ↓
Phase 1: Dataset Audit
  ├─→ data quality issues detected?
  └─→ conflicts < 5%?
  ↓
Phase 2: Embedding Benchmarks
  ├─→ Compare 4 embedding models
  ├─→ Choose best (by severe recall)
  └─→ Use for Phase 3+
  ↓
Phase 3: Hyperparameter Optimization
  ├─→ Optuna: 50 trials with healthcare objective
  ├─→ Find: best LR, dropout, hidden_dim, batch_size
  └─→ Use best params for Phase 4+
  ↓
Phase 4: Ensemble Ablation
  ├─→ Compare: voting, blending, stacking
  ├─→ Choose best ensemble strategy
  └─→ Use for Phase 7+
  ↓
Phase 5: Healthcare Safety Tuning
  ├─→ Analyze false negatives
  ├─→ Optimize severe escalation threshold
  └─→ Verify FN rate < 5%
  ↓
Phase 6: Explainability Validation
  ├─→ Feature importance ranking
  ├─→ Example predictions + rationales
  └─→ Verify model behavior is intuitive
  ↓
Phase 7: Comprehensive Benchmarks
  ├─→ Final metrics on all dimensions
  ├─→ Confusion matrices
  └─→ Check: all targets met?
  ↓
Phase 8: Production Validation
  ├─→ FastAPI compatibility
  ├─→ Latency benchmarks (p99 < 200ms?)
  └─→ Deployment readiness
  ↓
Phase 9: Model Selection
  └─→ Choose production model + deployment instructions

TROUBLESHOOTING

If Severe Recall < 90%:

Check safety_analysis_report.md for FN breakdown
Re-run Phase 5 with lower threshold (0.30-0.35)
Re-run Phase 3 with higher focal_gamma
Increase data collection (Phase 1 may show missing classes)
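
For context on the focal_gamma suggestion: the standard focal loss scales cross-entropy by (1 - p_t)^gamma, where p_t is the probability assigned to the true class. The sketch below shows that formula for a single example. Whether this project also applies per-class alpha weights is not stated here, so none are included.

```python
import math


def focal_loss(p_true, gamma=2.0):
    # p_true: probability the model assigned to the correct class.
    # gamma = 0 recovers plain cross-entropy; larger gamma shrinks the
    # loss on confident, already-correct examples.
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

Raising gamma shifts the training signal toward hard examples such as missed severe interactions, because the loss on easy examples (p_true near 1) shrinks much faster than the loss on hard ones.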

If ECE > 0.05:

Use calibrated voting (Phase 4)
Apply temperature scaling (Phase 5)
Fit the temperature on validation data
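
Temperature scaling divides the logits by a single scalar T before the softmax and picks T to minimize negative log-likelihood on held-out validation data. The sketch below uses a plain grid search in pure Python to stay dependency-free; the candidate grid and function names are assumptions, and the project's own calibration code may use a different optimizer.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates=None):
    # Grid search over T in [0.5, 5.0] with step 0.05 (arbitrary choice).
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))
```

A fitted T above 1 softens overconfident predictions without changing the predicted class, so accuracy and severe recall at a fixed argmax are unaffected while ECE can improve.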

If p99 Latency > 200ms:

Reduce hidden_dim in Phase 3
Use fewer ensemble models in Phase 4
Deploy on GPU instead of CPU
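
Before and after applying any of these changes, p50 and p99 can be measured with a small timing loop. This is a minimal sketch using the nearest-rank percentile definition; the function names and the number of calls are illustrative, and the benchmark scripts may measure latency differently.

```python
import math
import time


def percentile(samples, pct):
    # Nearest-rank percentile: smallest value with at least pct% of
    # samples at or below it.
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency_ms(fn, n_calls=200):
    # fn: zero-argument callable wrapping one inference call (assumed).
    timings = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return {"p50": percentile(timings, 50), "p99": percentile(timings, 99)}
```

With only a few hundred calls the p99 estimate rests on a handful of samples, so a larger n_calls gives a steadier number when comparing configurations.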

REPRODUCIBILITY

All scripts use fixed seed (2026) for reproducibility:

Train/test splits deterministic
Model initialization deterministic
Optuna trials deterministic
Results exactly reproducible on same hardware
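
A typical way to apply one seed across libraries is sketched below. Only the seed value 2026 comes from this README; the helper name seed_everything and the assumption that NumPy and PyTorch are the libraries in use are illustrative. Optuna samplers are seeded separately, for example by passing seed to the sampler constructor.

```python
import os
import random

SEED = 2026  # seed value stated in this README


def seed_everything(seed=SEED):
    # Affects hash randomization only in child processes started later;
    # the current interpreter has already fixed its hash seed.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional: assumed to be used by the scripts
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional: assumed deep learning backend
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Seeding alone does not guarantee bit-identical results across different hardware or library versions, which is why the hardware and version requirements in this section still apply.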

For exact reproduction:

Use Python 3.12+
Use exact package versions from requirements.txt
Run on same hardware type (CPU vs GPU type matters)
Don't interrupt workflow

MONITORING (POST-DEPLOYMENT)

After deploying production model, monitor:

Health Check:
    curl http://localhost:8000/health
    → Should return: "status": "healthy"

Latency:
    → Track p50, p99 over time (alert if p99 > 200ms)

Severe Recall:
    → If ground truth is available, track the false-negative rate on the severe class

Calibration:
    → Monitor drift: does confidence still match accuracy?

Example Request:
    curl -X POST http://localhost:8000/predict \
      -H "Content-Type: application/json" \
      -d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
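
The same request can be issued from Python with only the standard library. This sketch mirrors the curl example (endpoint, header, and the drug_a/drug_b payload fields); the helper names are hypothetical, and the response schema is not documented here, so the parsed JSON is returned as-is.

```python
import json
import urllib.request


def build_predict_request(drug_a, drug_b, base_url="http://localhost:8000"):
    # Builds the POST request without sending it.
    payload = json.dumps({"drug_a": drug_a, "drug_b": drug_b}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(drug_a, drug_b, timeout=5.0):
    # Sends the request; requires the service to be running locally.
    request = build_predict_request(drug_a, drug_b)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Wrapping predict("aspirin", "warfarin") in a timing loop gives a client-side view of the latency figures tracked above.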

DOCUMENTATION

For complete details, see:

../OPTIMIZATION_FRAMEWORK.py → Comprehensive framework documentation (read this first!)
../README.md → Quick start and deployment guide
../MEDCARE-DDI-AI/src/inference/app_production.py → FastAPI backend specification

QUESTIONS

Q: Which phases are required?
A: All 9 for a first-time production deployment. Phase 7 (benchmarks) is the minimum validation.

Q: How long does the full workflow take?
A: ~2-3 hours on a modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).

Q: Can I run phases out of order?
A: There are some dependencies: Phase 1 informs the validation sets used from Phase 3 onward, and Phase 2 results are used in Phase 3. In general, run phases 1-9 in order.

Q: What if Phase X fails?
A: Check the error message in COMPLETE_WORKFLOW_REPORT.md. Most failures are due to missing dependencies or data files. See requirements.txt and ensure the DDInter data is in data/processed/

Q: How do I select which model to deploy?
A: Use final_model_card.md, which ranks models by: severe_recall (40%) > calibration (20%) > auroc (20%) > stability (10%) > latency (10%)
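
One reading of that ranking is a weighted sum over the five criteria. The sketch below uses the percentages listed above as weights and assumes every criterion has already been normalized to the range 0 to 1 with higher meaning better (so calibration and latency would need to be inverted first). How final_model_card.md actually normalizes and combines the criteria is not specified here.

```python
# Weights taken from the ranking above; the weighted-sum form is an assumption.
WEIGHTS = {
    "severe_recall": 0.40,
    "calibration": 0.20,
    "auroc": 0.20,
    "stability": 0.10,
    "latency": 0.10,
}


def rank_models(candidates):
    # candidates: {model_name: {criterion: score in [0, 1], higher is better}}
    def total(scores):
        return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

    return sorted(candidates, key=lambda model: total(candidates[model]),
                  reverse=True)
```

Because severe_recall carries 40% of the weight, a gap in severe recall has to be offset by twice as large a gap in calibration or AUROC before the ranking flips.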

Q: Can I customize the workflow?
A: Yes! Each phase is modular and can be run independently with custom arguments. See --help for each script:
python src/validation/optuna_hyperparameter_tune.py --help
"""

from pathlib import Path

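if __name__ == '__main__':
    # Write this module's docstring out as the validation README.
    output_path = Path(__file__).parent / 'README.md'
    # UTF-8 is set explicitly because the text contains non-ASCII symbols.
    with output_path.open('w', encoding='utf-8') as f:
        f.write(__doc__)
    print(f'Saved validation module README to {output_path}')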