	"""Validation and Optimization Module README

	This directory contains the complete 9-phase optimization and validation framework
	for the MEDCARE-DDI AI system.

	FILES
	=====

	Master Orchestration:
	- run_complete_workflow.py
	Execute all 9 phases or a subset
	Usage: python run_complete_workflow.py [--phases 1 2 3] [--skip-phases 4]

	Phase 1: Dataset Audit
	- dataset_audit.py
	Detect duplicates, conflicts, class imbalance, data quality issues
	Output: dataset_audit_report.json, class_balance_report.json, conflict_analysis.csv

	Phase 2: Embedding Benchmarks
	- embedding_benchmark.py
	Compare BioBERT, PubMedBERT, SapBERT, ChemBERTa
	Output: embedding_benchmark_results.csv, embedding_ablation_report.md

	Phase 3: Hyperparameter Optimization
	- optuna_hyperparameter_tune.py
	Optuna-based tuning of LR, dropout, hidden_dim, batch_size, focal_gamma
	Usage: python optuna_hyperparameter_tune.py --n-trials 50
	Output: optuna_trials.json, optuna_best_params.json, hyperparameter_optimization_report.md

	Phase 4: Ensemble Ablation
	- ensemble_ablation_study.py
	Compare voting, blending, stacking strategies
	Output: ensemble_benchmark.csv, ensemble_ablation.md

	Phase 5: Healthcare Safety Tuning
	- healthcare_safety_tuning.py
	Analyze false negatives, optimize severe thresholds
	Output: safety_analysis_report.md, severe_case_review.csv, threshold_optimization.json

	Phase 6: Explainability Validation
	- explainability_validation.py
	Feature importance, SHAP, example explanations
	Output: explainability_examples.md, feature_importance.csv

	Phase 7: Comprehensive Benchmarks
	- comprehensive_benchmark.py
	Final metrics (accuracy, F1, severe recall, AUROC, ECE, latency)
	Output: final_benchmark_report.md, benchmark_metrics.json

	Phase 8: Production Validation
	- production_validation.py
	FastAPI compatibility, CPU/GPU inference, latency, readiness
	Output: production_validation_report.md, production_readiness_report.md, final_model_card.md

	Existing Tests:
	- test_multimodal_components.py
	Unit tests for calibration, ensemble, embeddings, molecular features
	Usage: python -m unittest test_multimodal_components -v

	QUICK START
	===========

	Run all 9 phases:
	```bash
	cd MEDCARE-DDI-AI
	python src/validation/run_complete_workflow.py
	```

	Run specific phases:
	```bash
	python src/validation/run_complete_workflow.py --phases 1 3 7
	```

	Run individual phase:
	```bash
	python src/validation/dataset_audit.py
	python src/validation/embedding_benchmark.py
	# ... etc
	```

	OUTPUT REPORTS
	==============

	All reports are saved to: MEDCARE-DDI-AI/models/reports/

	Key Reports (in reading order):
	1. dataset_audit_report.json
	→ Data quality overview

	2. embedding_benchmark_results.csv
	→ Best embedding model to use

	3. optuna_best_params.json
	→ Optimal hyperparameters (use for training)

	4. ensemble_benchmark.csv
	→ Best ensemble strategy (voting/blending/stacking)

	5. safety_analysis_report.md
	→ False negatives and recommended thresholds

	6. final_benchmark_report.md
	→ Main performance metrics (accuracy, F1, severe recall, AUROC, ECE)

	7. production_readiness_report.md
	→ Deployment checklist and instructions

	8. final_model_card.md
	→ Production model specification

	TARGET METRICS
	==============

	Success criteria (all must be met):
	✓ Accuracy ≥ 85%
	✓ Severe Recall ≥ 90% (CRITICAL: minimize false negatives)
	✓ AUROC ≥ 0.90
	✓ ECE < 0.05 (CRITICAL: trustworthy confidence)
	✓ p99 Latency < 200ms
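
The criteria above can be checked mechanically once Phase 7 has written its metrics. The following is a minimal sketch, not the project's own checker: the metric key names (`accuracy`, `severe_recall`, `auroc`, `ece`, `p99_latency_ms`) are assumptions and may not match the keys actually used in benchmark_metrics.json.

```python
# Thresholds copied from the success criteria above.
# The key names are hypothetical; align them with benchmark_metrics.json.
TARGETS = {
    "accuracy": (">=", 0.85),
    "severe_recall": (">=", 0.90),
    "auroc": (">=", 0.90),
    "ece": ("<", 0.05),
    "p99_latency_ms": ("<", 200.0),
}


def check_targets(metrics):
    # Return the names of criteria that are missing or not met.
    failures = []
    for name, (op, threshold) in TARGETS.items():
        value = metrics.get(name)
        if value is None:
            failures.append(name)
        elif op == ">=" and value < threshold:
            failures.append(name)
        elif op == "<" and value >= threshold:
            failures.append(name)
    return failures


# Illustrative values only, not real benchmark results.
example = {"accuracy": 0.88, "severe_recall": 0.87, "auroc": 0.93,
           "ece": 0.04, "p99_latency_ms": 150.0}
print(check_targets(example))  # ['severe_recall']
```

An empty list means every criterion is met. Since all criteria must hold, any non-empty result should block deployment.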

	WORKFLOW DIAGRAM
	================

	Data
	↓
	Phase 1: Dataset Audit
	├─→ data quality issues detected?
	└─→ conflicts < 5%?
	↓
	Phase 2: Embedding Benchmarks
	├─→ Compare 4 embedding models
	├─→ Choose best (by severe recall)
	└─→ Use for Phase 3+
	↓
	Phase 3: Hyperparameter Optimization
	├─→ Optuna: 50 trials with healthcare objective
	├─→ Find: best LR, dropout, hidden_dim, batch_size, focal_gamma
	└─→ Use best params for Phase 4+
	↓
	Phase 4: Ensemble Ablation
	├─→ Compare: voting, blending, stacking
	├─→ Choose best ensemble strategy
	└─→ Use for Phase 7+
	↓
	Phase 5: Healthcare Safety Tuning
	├─→ Analyze false negatives
	├─→ Optimize severe escalation threshold
	└─→ Verify FN rate < 5%
	↓
	Phase 6: Explainability Validation
	├─→ Feature importance ranking
	├─→ Example predictions + rationales
	└─→ Verify model behavior is intuitive
	↓
	Phase 7: Comprehensive Benchmarks
	├─→ Final metrics on all dimensions
	├─→ Confusion matrices
	└─→ Check: all targets met?
	↓
	Phase 8: Production Validation
	├─→ FastAPI compatibility
	├─→ Latency benchmarks (p99 < 200ms?)
	└─→ Deployment readiness
	↓
	Phase 9: Model Selection
	└─→ Choose production model + deployment instructions

	TROUBLESHOOTING
	===============

	If Severe Recall < 90%:
	1. Check safety_analysis_report.md for FN breakdown
	2. Re-run Phase 5 with lower threshold (0.30-0.35)
	3. Re-run Phase 3 with higher focal_gamma
	4. Collect more data (the Phase 1 audit may reveal missing or under-represented classes)
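
To see why lowering the threshold in step 2 helps, severe recall can be recomputed at several candidate thresholds. This is a minimal sketch under stated assumptions: a case is escalated as severe when its predicted severe-class probability is at or above the threshold, and the function names and sample data are illustrative rather than taken from healthcare_safety_tuning.py.

```python
def severe_recall(probs, labels, threshold):
    # Recall on the severe class: TP / (TP + FN).
    # Assumption: a case is flagged severe when prob >= threshold.
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


def highest_threshold_meeting(probs, labels, target=0.90, candidates=None):
    # Highest candidate threshold that still reaches the target recall,
    # or None if no candidate does.
    if candidates is None:
        candidates = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95
    ok = [t for t in candidates if severe_recall(probs, labels, t) >= target]
    return max(ok) if ok else None


# Made-up severe-class probabilities; label 1 means truly severe.
probs = [0.95, 0.80, 0.45, 0.33, 0.10, 0.60, 0.20]
labels = [1, 1, 1, 1, 0, 0, 0]
print(severe_recall(probs, labels, 0.50))        # 0.5
print(severe_recall(probs, labels, 0.30))        # 1.0
print(highest_threshold_meeting(probs, labels))  # 0.3
```

Picking the highest threshold that still meets the recall target limits the extra false positives that a lower threshold brings.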

	If ECE > 0.05:
	1. Use calibrated voting (Phase 4)
	2. Apply temperature scaling (Phase 5)
	3. Fit the temperature on the validation set, not the test set
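
For reference, the sketch below shows one common way to compute ECE and to fit a single temperature on validation logits. It is pure Python and illustrative only: the equal-width binning, the grid search, and all function names are assumptions, not a description of how the Phase 5 script does it.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    # ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|,
    # using equal-width confidence bins.
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits / temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fit_temperature(val_logits, val_labels, grid=None):
    # Grid search for the temperature minimizing validation NLL.
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(46)]  # roughly 0.5 .. 5.0

    def nll(t):
        total = sum(math.log(softmax(z, t)[y])
                    for z, y in zip(val_logits, val_labels))
        return -total / len(val_labels)

    return min(grid, key=nll)


# Toy over-confident model: about 96% confidence but only 75% accuracy.
val_logits = [[4.0, 0.0, 0.0]] * 4
val_labels = [0, 0, 1, 0]
t = fit_temperature(val_logits, val_labels)


def ece_at(temperature):
    # For brevity this reuses the validation data; report ECE on held-out data.
    probs = [softmax(z, temperature) for z in val_logits]
    conf = [max(p) for p in probs]
    hits = [p.index(max(p)) == y for p, y in zip(probs, val_labels)]
    return expected_calibration_error(conf, hits)


print(t > 1.0, ece_at(t) < ece_at(1.0))  # True True
```

A fitted temperature above 1 softens over-confident probabilities without changing the predicted class, so accuracy and recall are unaffected.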

	If p99 Latency > 200ms:
	1. Reduce hidden_dim in Phase 3
	2. Use fewer ensemble models in Phase 4
	3. Deploy on GPU instead of CPU

	REPRODUCIBILITY
	===============

	All scripts use a fixed seed (2026) for reproducibility:
	- Train/test splits are deterministic
	- Model initialization is deterministic
	- Optuna trials are deterministic
	- Results are exactly reproducible on the same hardware

	For exact reproduction:
	1. Use Python 3.12+
	2. Use exact package versions from requirements.txt
	3. Run on the same hardware type (CPU vs. GPU, and the specific GPU model, both matter)
	4. Don't interrupt the workflow
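
A minimal sketch of how a fixed seed is typically applied is shown below. The helper name `seed_everything` is hypothetical, and whether the scripts seed NumPy and PyTorch this way is an assumption. For Optuna, the usual approach is to pass a seeded sampler such as `optuna.samplers.TPESampler(seed=2026)` when creating the study.

```python
import os
import random

SEED = 2026  # the seed stated in this README


def seed_everything(seed=SEED):
    # Seed Python's RNG; seed NumPy and PyTorch only if they are installed.
    random.seed(seed)
    # PYTHONHASHSEED set here only affects subprocesses started afterwards.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no effect when CUDA is unavailable
    except ImportError:
        pass


seed_everything()
first = [random.random() for _ in range(3)]
seed_everything()
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Seeding alone does not guarantee bit-identical results across hardware, which is why the steps above also pin package versions and hardware type.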

	MONITORING (POST-DEPLOYMENT)
	============================

	After deploying production model, monitor:

	Health Check:
	curl http://localhost:8000/health
	→ Should return: "status": "healthy"

	Latency:
	→ Track p50 and p99 over time (alert if p99 exceeds 200ms)

	Severe Recall:
	→ If ground truth is available, track the false-negative rate on the severe class

	Calibration:
	→ Monitor drift: is confidence still matching accuracy?

	Example Request:
	curl -X POST http://localhost:8000/predict \
	-H "Content-Type: application/json" \
	-d '{"drug_a": "aspirin", "drug_b": "warfarin"}'
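
For the latency item above, p50 and p99 can be computed from recorded request durations. This is a minimal sketch using the nearest-rank percentile definition. The function names, the alert rule, and the sample data are illustrative assumptions, not part of the deployed service.

```python
import math


def percentile(samples, pct):
    # Nearest-rank percentile: the smallest sample such that at least
    # pct percent of all samples are less than or equal to it.
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(latencies_ms, limit_ms=200.0):
    # Alert when p99 reaches the limit (the target is p99 < 200ms).
    p50 = percentile(latencies_ms, 50)
    p99 = percentile(latencies_ms, 99)
    return {"p50_ms": p50, "p99_ms": p99, "alert": p99 >= limit_ms}


# Made-up durations in milliseconds for 100 requests.
latencies = [40.0] * 95 + [120.0] * 4 + [450.0]
print(latency_alert(latencies))
# {'p50_ms': 40.0, 'p99_ms': 120.0, 'alert': False}
```

A single slow request out of 100 does not move the nearest-rank p99. Sustained tail latency does, which is what the alert is meant to catch.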

	DOCUMENTATION
	=============

	For complete details, see:
	- ../OPTIMIZATION_FRAMEWORK.py
	→ Comprehensive framework documentation (read this first!)
	- ../README.md
	→ Quick start and deployment guide
	- ../MEDCARE-DDI-AI/src/inference/app_production.py
	→ FastAPI backend specification

	QUESTIONS
	=========

	Q: Which phases are required?
	A: All 9 for a first-time production deployment. Phase 7 (benchmarks) is the minimum validation.

	Q: How long does full workflow take?
	A: ~2-3 hours on a modern CPU. Phase 3 (Optuna) dominates (~1 hour for 50 trials).

	Q: Can I run phases out of order?
	A: Not freely, because some phases depend on earlier ones: Phase 1 informs the
	validation sets used from Phase 3 onward, and Phase 2 results are used in Phase 3.
	In general, run phases 1-9 in order.

	Q: What if Phase X fails?
	A: Check the error message in COMPLETE_WORKFLOW_REPORT.md.
	Most failures are due to missing dependencies or data files.
	Check requirements.txt and make sure the DDInter data is in data/processed/.

	Q: How do I select which model to deploy?
	A: Use final_model_card.md, which ranks models by a weighted score:
	severe_recall (40%), calibration (20%), auroc (20%), stability (10%), latency (10%)
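
One plausible reading of these percentages is a weighted sum over normalized components. The sketch below assumes every component is already scaled to [0, 1] with higher meaning better (for example, calibration as 1 - ECE). The normalization, the function name, and the candidate values are all assumptions, not taken from final_model_card.md.

```python
# Weights from the answer above.
WEIGHTS = {
    "severe_recall": 0.40,
    "calibration": 0.20,
    "auroc": 0.20,
    "stability": 0.10,
    "latency": 0.10,
}


def selection_score(components):
    # Weighted sum; each component is assumed to lie in [0, 1], higher = better.
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up component scores for two hypothetical candidates.
candidates = {
    "voting": {"severe_recall": 0.92, "calibration": 0.96, "auroc": 0.93,
               "stability": 0.90, "latency": 0.80},
    "stacking": {"severe_recall": 0.88, "calibration": 0.97, "auroc": 0.95,
                 "stability": 0.90, "latency": 0.60},
}
ranked = sorted(candidates, key=lambda name: selection_score(candidates[name]),
                reverse=True)
print(ranked)  # ['voting', 'stacking']
```

Because severe recall carries the largest weight, a candidate with better AUROC and calibration can still rank lower if it misses more severe interactions.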

	Q: Can I customize the workflow?
	A: Yes! Each phase is modular and can be run independently with custom arguments.
	See --help for each script:
	python src/validation/optuna_hyperparameter_tune.py --help
	"""

	from pathlib import Path

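	if __name__ == '__main__':
	    # Write the module docstring above to README.md next to this file.
	    output_path = Path(__file__).parent / 'README.md'
	    # Explicit UTF-8: the text contains non-ASCII symbols (arrows, check marks, box-drawing).
	    with output_path.open('w', encoding='utf-8') as f:
	        f.write(__doc__)
	    print(f'Saved validation module README to {output_path}')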