	# Evaluation Progress

	System: hypothetico-deductive multi-agent v2
	Last updated: 2026-05-09
	Pipeline: BeliefState synthesis · Groq LLM debate · domain-aware ViT routing

	---

	## Summary

	| # | Dataset | Type | Cases | Top-1 | Top-3 | Status |
	|---|---|---|---|---|---|---|
	| 1 | DDxPlus (eval_01) | Structured symptom (baseline) | 500 | 1.4% | 47.2% | ✅ Done (pre-fix baseline) |
	| 2 | DDxPlus (eval_02) | Structured symptom (fixed) | 100 | 68.0% | 86.0% | ✅ Done |
	| 3 | MedQA-USMLE | Real USMLE MCQ | 200 | 0.0% | 0.0% | ✅ Done (task mismatch — see note) |
	| 4 | Chest X-ray Pneumonia | Medical imaging (binary) | 100 | 95.0% | 95.0% | ✅ Done |
	| 5 | CheXpert Plus | Medical imaging (multi-label) | — | — | — | ⏳ Pending |
	| 6 | CUPCase | Real clinical case reports | — | — | — | ⏳ Pending |
	| 7 | MedXpertQA-MM | Multimodal clinical QA | — | — | — | ⏳ Pending |
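
The Top-1 and Top-3 columns can be read as top-k accuracy: a case counts as correct when the gold diagnosis appears among the first k ranked predictions. The sketch below illustrates that metric under the assumption of case-insensitive exact matching on disease names. The function name and the matching rule are illustrative and may differ from what `evaluate.py` actually does.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of cases whose gold label is among the first k predictions.

    ranked_predictions: list of lists, best guess first (hypothetical layout).
    gold_labels: list of gold disease names, one per case.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = 0
    for ranked, gold in zip(ranked_predictions, gold_labels):
        # Assumption: matching is exact after lowercasing and trimming.
        top_k = {name.strip().lower() for name in ranked[:k]}
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_labels)


# Toy data for illustration only; these are not results from the report.
ranked = [
    ["Bronchitis", "COPD exacerbation", "Influenza"],
    ["GERD", "Panic attack", "Anemia"],
    ["COPD exacerbation", "Pneumonia", "Bronchitis"],
    ["Anemia", "GERD", "PSVT"],
]
gold = ["Influenza", "GERD", "Bronchitis", "Tuberculosis"]

top1 = top_k_accuracy(ranked, gold, k=1)
top3 = top_k_accuracy(ranked, gold, k=3)
```

With this definition Top-3 can never be lower than Top-1, which is a quick sanity check for any row in the table.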

	---

	## Completed Evaluations

	### ✅ Eval 1 — DDxPlus (pre-fix baseline)
	- File: `results/eval_01_ddxplus.json`
	- Script: `evaluate.py`
	- Cases: 500 · Top-1: 1.4% · Top-3: 47.2%
	- What went wrong: three pipeline bugs.
	  - (a) The DeepSeek response carried its JSON inside `<think>`, and the `<think>` block was stripped before parsing, leaving `llm_parsed` empty.
	  - (b) With an empty hypothesis, the lab agent applied a −0.20 shift to the correct disease.
	  - (c) Lab `NORMAL_RANGES` category names ("Infection / Leukemia", "Diabetes / Hyperglycemia") dominated `debate_result["ranked"]` and overrode clinical disease names in synthesis.
	  - Average confidence was reported as 100%, a normalization artefact.
	- Fixed in: v2 pipeline — see `agents/orchestrator.py`, `app.py`, `agents/clinical_reasoner.py`
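
Bug (a) above comes down to where the parser looks for JSON. One way to guard against it is to search the text outside `<think>` first and fall back to the contents of the `<think>` block when nothing parses. The sketch below shows that idea only; `parse_llm_output` and `first_json_object` are hypothetical names, and the actual fix in `agents/clinical_reasoner.py` may work differently.

```python
import json
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def first_json_object(text):
    """Return the first parseable JSON object in text, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            # raw_decode parses one JSON value starting at the given index.
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


def parse_llm_output(raw):
    """Prefer JSON outside <think>; fall back to JSON inside it."""
    outside = THINK_BLOCK.sub("", raw)
    parsed = first_json_object(outside)
    if parsed is not None:
        return parsed
    for block in THINK_BLOCK.findall(raw):
        parsed = first_json_object(block)
        if parsed is not None:
            return parsed
    return {}  # caller should treat this as "no hypothesis", not as evidence


# Illustrative response with the JSON trapped inside the reasoning block.
raw = '<think>Likely viral. {"diagnosis": "Bronchitis", "confidence": 0.7}</think>'
llm_parsed = parse_llm_output(raw)
```

Returning an empty dict keeps the failure explicit, so downstream agents can skip their update instead of penalizing a disease on the basis of a missing hypothesis.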

	---

	### ✅ Eval 2 — DDxPlus (fixed pipeline)
	- File: `results/eval_02_ddxplus.json`
	- Script: `evaluate.py`
	- Cases: 100 · Top-1: 68.0% · Top-3: 86.0% · Runtime: 49.4s (0.49s/case)
	- Agents active: ClinicalReasonerAgent (keyword fallback), LabAgent (early return — no labs in DDxPlus), ImageAgent (no_imaging), HistoryAgent, DebateAgent (vote-only), SynthesisAgent (BeliefState)
	- Best diseases (F1=1.0): Tuberculosis, Spontaneous pneumothorax, Scombroid food poisoning, Panic attack, PSVT, Allergic sinusitis, Anemia, Atrial fibrillation, Cluster headache, GERD, Localized edema, Acute pulmonary edema
	- Worst diseases (F1=0.0): Acute rhinosinusitis, Bronchitis, Chronic rhinosinusitis, HIV (initial infection), Influenza
	- Key finding: Respiratory diseases are the main weakness, with only 14% top-1 accuracy in that category. Bronchitis, Pneumonia, and Influenza share overlapping keyword sets and are misclassified as "COPD exacerbation".

	---

	### ✅ Eval 3 — MedQA-USMLE
	- File: `results/eval_02_medqa.json`
	- Script: `eval_medqa.py`
	- Cases: 200 · Top-1: 0.0% · Top-3: 0.0% · Runtime: 403.9s (2.02s/case)
	- Agents active: ClinicalReasonerAgent (keyword fallback), all other agents (early return), DebateAgent (vote-only)
	- Why 0%: Structural task-format mismatch. Our pipeline outputs disease names; MedQA answer options are drugs ("Ketotifen eye drops"), mechanisms ("Cross-linking of DNA"), pathology descriptions ("Lactose-fermenting gram-negative rods"), and ethics choices — not disease names. Scoring 0% is expected and does not indicate a pipeline bug.
	- Most common prediction: `acute otitis media` — its keywords (fever, cough, nasal congestion, ear) appear in many USMLE vignettes.
	- Fix required: add an MCQ answer-mapper layer after the pipeline that maps the predicted disease to the closest MCQ option by cosine similarity.
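
The mapper described above could take roughly the following shape. This is a sketch under stated assumptions: it uses a plain bag-of-words vector as a stand-in for a real sentence embedding so that it runs with the standard library alone, and `map_to_option`, `bag_of_words`, and `cosine` are hypothetical names. A production version would pass an embedding model through the `embed` parameter, since lexical overlap cannot link a disease to a drug or mechanism.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Stand-in embedding: lowercase token counts (not a semantic model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_option(ranked_diseases, options, embed=bag_of_words, similarity=cosine):
    """Return (option_index, score) for the option closest to any ranked disease.

    Returns (None, 0.0) when nothing overlaps, so the caller can abstain.
    """
    option_vectors = [embed(option) for option in options]
    best_index, best_score = None, 0.0
    for disease in ranked_diseases:
        disease_vector = embed(disease)
        for index, option_vector in enumerate(option_vectors):
            score = similarity(disease_vector, option_vector)
            if score > best_score:
                best_index, best_score = index, score
    return best_index, best_score


# Illustrative options; the first three strings are quoted from the note above.
options = [
    "Ketotifen eye drops",
    "Cross-linking of DNA",
    "Lactose-fermenting gram-negative rods",
    "Acute otitis media",
]
index, score = map_to_option(["acute otitis media"], options)
```

The abstain path matters for MedQA: when no option resembles any predicted disease, reporting "no answer" is more informative than forcing a match.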

	---

	### ✅ Eval 4 — Chest X-ray Pneumonia (ImageAgent isolated)
	- File: `results/eval_03_chestxray.json`
	- Script: `eval_chestxray.py`
	- Cases: 100 (shuffled seed=42) · Top-1: 95.0% · Runtime: 39.8s (0.4s/case)
	- Model: `nickmuchi/vit-finetuned-chest-xray-pneumonia` (2-class: NORMAL / PNEUMONIA)
	- Dataset note: Originally requested NIH ChestX-ray14 (`alkzar90/NIH-Chest-X-ray-dataset`) — unavailable due to deprecated dataset loading script. Used `hf-vision/chest-xray-pneumonia` (same source images, Kaggle chest X-ray dataset).
	- Per-class F1: PNEUMONIA 0.955 · NORMAL 0.944
	- Errors: 5 — 3 false positives (NORMAL → PNEUMONIA), 2 false negatives (PNEUMONIA → NORMAL)
	- Bug fixed during eval: `_detect_domain()` was routing nearly-square chest X-rays to the brain model, which produced labels such as "meningioma tumor" and "no tumor". Fixed with a dark-border heuristic (`border_mean < 30`).
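
The error counts and per-class F1 values above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below assumes a class split of 55 PNEUMONIA and 45 NORMAL images. That split is not stated in this report; it is inferred only because it reproduces the reported F1 values.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts: 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Assumption (not in the report): 55 PNEUMONIA and 45 NORMAL cases.
n_pneumonia, n_normal = 55, 45

# Reported error counts.
normal_as_pneumonia = 3  # false positives for the PNEUMONIA class
pneumonia_as_normal = 2  # false negatives for the PNEUMONIA class

correct_pneumonia = n_pneumonia - pneumonia_as_normal
correct_normal = n_normal - normal_as_pneumonia

accuracy = (correct_pneumonia + correct_normal) / (n_pneumonia + n_normal)

# For PNEUMONIA, the 3 misrouted NORMAL images are false positives.
f1_pneumonia = f1_score(correct_pneumonia, normal_as_pneumonia, pneumonia_as_normal)

# For NORMAL the roles swap: missed pneumonia cases become false positives.
f1_normal = f1_score(correct_normal, pneumonia_as_normal, normal_as_pneumonia)

print(f"accuracy={accuracy:.3f} "
      f"F1(PNEUMONIA)={f1_pneumonia:.3f} F1(NORMAL)={f1_normal:.3f}")
```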

	---

	## Pending Evaluations

	### ⏳ Eval 5 — CheXpert Plus (multi-label chest X-ray)
	- Dataset: `stanfordmlgroup/CheXpert` or compatible HuggingFace mirror
	- Goal: Test ImageAgent with the 14-class DenseNet upgrade across multi-label pathologies
	- Prerequisite: Replace `_REGISTRY["chest"]` with `nickmuchi/densenet-finetuned-chest-xray-classification`
	- Expected metrics: Per-class AUC and F1 for 14 CheXpert labels (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion, etc.)
	- Script to create: `eval_chexpert.py`
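
Per-class AUC for a multi-label task treats each label as its own binary problem. The sketch below computes it with the pairwise (Mann-Whitney) formulation in plain Python, assuming labels are already binarized to 0/1 and scores are per-label probabilities. The function names and the toy data are illustrative, and how `eval_chexpert.py` will handle missing or uncertain labels is not decided here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties count as half a win. Returns None when a class has no positives
    or no negatives, because AUC is undefined in that case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def per_class_auc(y_true, y_score, class_names):
    """Compute one AUC per label column of a multi-label problem."""
    result = {}
    for column, name in enumerate(class_names):
        labels = [row[column] for row in y_true]
        scores = [row[column] for row in y_score]
        result[name] = roc_auc(labels, scores)
    return result


# Toy data for two of the labels named above; not real model output.
class_names = ["Atelectasis", "Cardiomegaly"]
y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_score = [[0.9, 0.2], [0.3, 0.4], [0.6, 0.8], [0.7, 0.1]]

aucs = per_class_auc(y_true, y_score, class_names)
```

The pairwise loop is quadratic in the number of cases, which is fine for a few hundred images; a rank-based implementation would be the usual choice for a full dataset.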

	---

	### ⏳ Eval 6 — CUPCase (real clinical case reports)
	- Dataset: Real clinical case report corpus (narrative free-text, ground-truth diagnosis in report conclusion)
	- Goal: Test full pipeline (LLM-enabled) on actual physician-written cases — the closest proxy to real-world deployment
	- Prerequisite: Enable Groq LLM (`_groq_enabled = True`), add rate-limit retry wrapper
	- Expected metrics: Top-1/3 accuracy; also assess `reasoning_chain` quality against the ground-truth diagnostic path
	- Script to create: `eval_cupcase.py`
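
The rate-limit retry wrapper listed as a prerequisite could follow the usual exponential-backoff pattern. The sketch below is one possible form, not the project's implementation: `RateLimitError` is a hypothetical stand-in for whatever exception the LLM client raises on a rate-limit response, and the delay parameters are placeholder values.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


def with_retry(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
               sleep=time.sleep):
    """Run call(); on RateLimitError wait and retry with exponential backoff.

    The delay doubles each attempt, is capped at max_delay, and gets up to
    10% random jitter so parallel workers do not retry in lockstep.
    The last failure is re-raised so the eval script can record it.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

A call site would wrap the LLM request in a zero-argument callable, for example `with_retry(lambda: run_case(case))`, where `run_case` is again a hypothetical name. Injecting `sleep` keeps the wrapper testable without real waiting.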

	---

	### ⏳ Eval 7 — MedXpertQA-MM (multimodal clinical QA)
	- Dataset: MedXpertQA multimodal variant — clinical questions with both image and text
	- Goal: Test the full multi-agent pipeline (text + imaging) end-to-end on cases where both modalities are required
	- Prerequisites: Multi-class imaging model (Eval 5 prerequisite), MCQ answer mapper (fixes MedQA gap), Groq LLM enabled
	- Expected metrics: Top-1 accuracy on multimodal questions; ablation study (text-only vs. text+image)
	- Script to create: `eval_medxpertqa_mm.py`

	---

	## Improvement Roadmap (by expected impact)

	| Priority | Change | Affected Evals | Expected Δ | Effort |
	|---|---|---|---|---|
	| 🔴 1 | Enable Groq LLM in eval scripts (add retry wrapper) | DDxPlus, MedQA | +15–25% DDxPlus; +20–35% MedQA | Low |
	| 🔴 2 | Swap chest ViT → 14-class DenseNet-121 | CheXpert Plus, NIH | Enables 12 new disease classes | Low |
	| 🟡 3 | MCQ answer mapper (sentence-transformers cosine sim) | MedQA | +30–50% on diagnosis questions | Medium |
	| 🟡 4 | Expand keyword vocabulary (rare diseases, autoimmune) | DDxPlus, CUPCase | +5–10% DDxPlus | Medium |
	| 🟢 5 | Vignette lab extractor (regex + LLM pre-pass) | MedQA, CUPCase | Lab critique activates on prose values | Medium |
	| 🟢 6 | Respiratory keyword disambiguation (Bronchitis vs COPD) | DDxPlus | Fixes 14% respiratory category | Low |
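
For roadmap item 5, the regex half of a vignette lab extractor might look like the sketch below. Everything in it is an assumption made for illustration: the analyte aliases, the unit list, and the output shape are hypothetical, and the LLM pre-pass is not shown.

```python
import re

# Hypothetical alias table: surface form -> canonical analyte name.
ANALYTES = {
    "hemoglobin": "hemoglobin",
    "hgb": "hemoglobin",
    "leukocyte count": "wbc",
    "wbc": "wbc",
    "glucose": "glucose",
}

# Longest aliases first so multi-word names win over shorter ones.
_names = "|".join(
    re.escape(alias) for alias in sorted(ANALYTES, key=len, reverse=True)
)

LAB_PATTERN = re.compile(
    r"\b(?P<name>" + _names + r")\b"
    r"(?:\s+(?:is|of|was))?\s*:?\s*"      # optional linking word or colon
    r"(?P<value>\d+(?:[.,]\d+)*)"         # 9.2 or 14,500
    r"\s*(?P<unit>g/dL|mg/dL|/mm3|%)?",   # hypothetical unit whitelist
    re.IGNORECASE,
)


def extract_labs(text):
    """Return (canonical_name, numeric_value, unit_or_None) tuples."""
    results = []
    for match in LAB_PATTERN.finditer(text):
        name = ANALYTES[match.group("name").lower()]
        # Assumption: commas are thousands separators, as in "14,500".
        value = float(match.group("value").replace(",", ""))
        results.append((name, value, match.group("unit")))
    return results


vignette = (
    "Laboratory studies show hemoglobin of 9.2 g/dL, "
    "leukocyte count 14,500/mm3, and serum glucose is 250 mg/dL."
)
labs = extract_labs(vignette)
```

A whitelist of units keeps the pattern from swallowing the next word of the sentence as a unit, at the cost of missing units that are not listed.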
