
Evaluation Progress

System: hypothetico-deductive multi-agent v2
Last updated: 2026-05-09
Pipeline: BeliefState synthesis · Groq LLM debate · domain-aware ViT routing


Summary

| # | Dataset | Type | Cases | Top-1 | Top-3 | Status |
|---|---------|------|-------|-------|-------|--------|
| 1 | DDxPlus (eval_01) | Structured symptom (baseline) | 500 | 1.4% | 47.2% | ✅ Done (pre-fix baseline) |
| 2 | DDxPlus (eval_02) | Structured symptom (fixed) | 100 | 68.0% | 86.0% | ✅ Done |
| 3 | MedQA-USMLE | Real USMLE MCQ | 200 | 0.0% | 0.0% | ✅ Done (task mismatch; see note) |
| 4 | Chest X-ray Pneumonia | Medical imaging (binary) | 100 | 95.0% | 95.0% | ✅ Done |
| 5 | CheXpert Plus | Medical imaging (multi-label) | - | - | - | ⏳ Pending |
| 6 | CUPCase | Real clinical case reports | - | - | - | ⏳ Pending |
| 7 | MedXpertQA-MM | Multimodal clinical QA | - | - | - | ⏳ Pending |
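
The Top-1 and Top-3 columns are most naturally read as top-k accuracy over each case's ranked predictions. The sketch below shows that metric under two assumptions that the report does not confirm: that predictions arrive as a ranked list of disease names, and that matching is a case-insensitive string comparison. The function name and the sample cases are hypothetical and are not taken from evaluate.py.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of cases whose true label appears in the first k ranked predictions."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        return 0.0
    hits = 0
    for preds, gold in zip(ranked, truth):
        # Assumption: labels match after trimming whitespace and lowercasing.
        top_k = {p.strip().lower() for p in preds[:k]}
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(truth)


# Hypothetical ranked outputs paired with ground-truth labels.
cases = [
    (["GERD", "Panic attack", "Anemia"], "GERD"),
    (["COPD exacerbation", "Bronchitis", "Influenza"], "Bronchitis"),
    (["COPD exacerbation", "Pneumonia", "URTI"], "Influenza"),
    (["PSVT", "Atrial fibrillation", "Panic attack"], "PSVT"),
]
ranked = [preds for preds, _ in cases]
truth = [gold for _, gold in cases]

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```

In this toy set the second case is a Top-3 hit but a Top-1 miss, which is the same pattern that separates the 68.0% and 86.0% figures for eval_02.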

Completed Evaluations

✅ Eval 1: DDxPlus (pre-fix baseline)

  • File: results/eval_01_ddxplus.json
  • Script: evaluate.py
  • Cases: 500 · Top-1: 1.4% · Top-3: 47.2%
  • What went wrong: Three pipeline bugs. (a) DeepSeek emitted its JSON inside the <think> block, which was stripped before parsing, leaving llm_parsed empty. (b) With an empty hypothesis, the lab agent applied a −0.20 shift to the correct disease. (c) Lab NORMAL_RANGES category names ("Infection / Leukemia", "Diabetes / Hyperglycemia") dominated debate_result["ranked"] and overrode clinical disease names in synthesis. Average confidence was also reported as 100%, a normalization artefact.
  • Fixed in: v2 pipeline β€” see agents/orchestrator.py, app.py, agents/clinical_reasoner.py
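
Bug (a) comes down to parsing order: if the <think> block is discarded before JSON extraction, any JSON the model wrote inside it is lost. The sketch below shows one way to avoid that, by looking for JSON outside the <think> block first and falling back to its contents. It is an illustration only, not the code in agents/clinical_reasoner.py; the function names and the "hypothesis" key are hypothetical.

```python
import json
import re
from typing import Optional

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL | re.IGNORECASE)


def _first_json_object(text: str) -> Optional[dict]:
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    return None


def parse_llm_json(raw: str) -> dict:
    """Prefer JSON outside <think>; fall back to JSON inside it; else return {}."""
    outside = THINK_RE.sub("", raw)
    parsed = _first_json_object(outside)
    if parsed is None:
        for block in THINK_RE.findall(raw):
            parsed = _first_json_object(block)
            if parsed is not None:
                break
    return parsed if parsed is not None else {}


# Hypothetical model outputs.
parsed_inside = parse_llm_json('<think>Likely cardiac. {"hypothesis": "PSVT", "confidence": 0.7}</think>')
parsed_outside = parse_llm_json('<think>draft {"hypothesis": "Anemia"}</think>\n{"hypothesis": "GERD"}')
parsed_none = parse_llm_json("no structured output here")
```

Preferring the text outside <think> keeps the model's final answer authoritative whenever one exists, and the fallback only applies when the visible output contains no JSON at all.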

✅ Eval 2: DDxPlus (fixed pipeline)

  • File: results/eval_02_ddxplus.json
  • Script: evaluate.py
  • Cases: 100 · Top-1: 68.0% · Top-3: 86.0% · Runtime: 49.4s (0.49s/case)
  • Agents active: ClinicalReasonerAgent (keyword fallback), LabAgent (early return; no labs in DDxPlus), ImageAgent (no_imaging), HistoryAgent, DebateAgent (vote-only), SynthesisAgent (BeliefState)
  • Best diseases (F1=1.0): Tuberculosis, Spontaneous pneumothorax, Scombroid food poisoning, Panic attack, PSVT, Allergic sinusitis, Anemia, Atrial fibrillation, Cluster headache, GERD, Localized edema, Acute pulmonary edema
  • Worst diseases (F1=0.0): Acute rhinosinusitis, Bronchitis, Chronic rhinosinusitis, HIV (initial infection), Influenza
  • Key finding: Respiratory diseases (14% category top-1) are the main weakness. Bronchitis/Pneumonia/Influenza share overlapping keyword sets and are misclassified as "COPD exacerbation".

✅ Eval 3: MedQA-USMLE

  • File: results/eval_02_medqa.json
  • Script: eval_medqa.py
  • Cases: 200 · Top-1: 0.0% · Top-3: 0.0% · Runtime: 403.9s (2.02s/case)
  • Agents active: ClinicalReasonerAgent (keyword fallback), all other agents (early return), DebateAgent (vote-only)
  • Why 0%: Structural task-format mismatch. Our pipeline outputs disease names; MedQA answer options are drugs ("Ketotifen eye drops"), mechanisms ("Cross-linking of DNA"), pathology descriptions ("Lactose-fermenting gram-negative rods"), and ethics choices, not disease names. Scoring 0% is expected and does not indicate a pipeline bug.
  • Most common prediction: acute otitis media. Its keywords (fever, cough, nasal congestion, ear) appear in many USMLE vignettes.
  • Fix required: MCQ answer mapper layer post-pipeline (cosine similarity: disease → MCQ option)
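
The proposed mapper layer can be sketched as a function that embeds the predicted disease and each answer option, then picks the option with the highest cosine similarity. The block below is a self-contained illustration: it uses a toy bag-of-words embedding as a stand-in for a real sentence-embedding model, and every name in it (bow_embed, map_to_option, the sample options) is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, Mapping, Tuple

Vector = Mapping[str, float]


def bow_embed(text: str) -> Vector:
    """Toy stand-in for a sentence embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b.get(token, 0.0) for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_to_option(
    prediction: str,
    options: Dict[str, str],
    embed: Callable[[str], Vector] = bow_embed,
) -> Tuple[str, float]:
    """Return the option key most similar to the predicted disease, with its score."""
    pred_vec = embed(prediction)
    scored = {key: cosine(pred_vec, embed(text)) for key, text in options.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical diagnosis-style question.
options = {
    "A": "Acute otitis media",
    "B": "Allergic conjunctivitis",
    "C": "Viral pharyngitis",
}
best_key, best_score = map_to_option("allergic conjunctivitis", options)
```

Token overlap only works when the options are themselves diagnoses. It cannot connect a disease to a drug such as "Ketotifen eye drops", so a real mapper would need to pass a learned embedding as `embed`, and even then treatment and mechanism questions may need more than similarity matching.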

✅ Eval 4: Chest X-ray Pneumonia (ImageAgent isolated)

  • File: results/eval_03_chestxray.json
  • Script: eval_chestxray.py
  • Cases: 100 (shuffled seed=42) · Top-1: 95.0% · Runtime: 39.8s (0.4s/case)
  • Model: nickmuchi/vit-finetuned-chest-xray-pneumonia (2-class: NORMAL / PNEUMONIA)
  • Dataset note: Originally requested NIH ChestX-ray14 (alkzar90/NIH-Chest-X-ray-dataset), which was unavailable due to a deprecated dataset loading script. Used hf-vision/chest-xray-pneumonia (same source images, Kaggle chest X-ray dataset).
  • Per-class F1: PNEUMONIA 0.955 · NORMAL 0.944
  • Errors: 5 (3 false positives, NORMAL → PNEUMONIA; 2 false negatives, PNEUMONIA → NORMAL)
  • Bug fixed during eval: _detect_domain() was routing nearly-square chest X-rays to the brain model (outputs "meningioma tumor", "no tumor"). Fixed with dark-border heuristic (border_mean < 30).
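
The accuracy, per-class F1 values, and error counts reported for this eval can be cross-checked with a few lines of arithmetic. The report does not state the class split of the 100 sampled cases; the 55 PNEUMONIA / 45 NORMAL split used below is inferred here because it is the split that reproduces both F1 values to three decimals, so treat it as an assumption.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Reported: 100 cases, 3 false positives (NORMAL -> PNEUMONIA),
# 2 false negatives (PNEUMONIA -> NORMAL).
# Assumption (inferred, not stated in the report): 55 PNEUMONIA, 45 NORMAL.
n_pneumonia, n_normal = 55, 45
fp, fn = 3, 2

tp = n_pneumonia - fn  # pneumonia cases predicted as pneumonia
tn = n_normal - fp     # normal cases predicted as normal

accuracy = (tp + tn) / (n_pneumonia + n_normal)
f1_pneumonia = f1(tp, fp, fn)
# For the NORMAL class the roles swap: its true positives are tn,
# its false positives are the missed pneumonia cases (fn),
# and its false negatives are the normal cases flagged as pneumonia (fp).
f1_normal = f1(tn, fn, fp)
```

Rounded to three decimals, these reproduce the reported 95.0% accuracy and the F1 values of 0.955 and 0.944.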

Pending Evaluations

⏳ Eval 5: CheXpert Plus (multi-label chest X-ray)

  • Dataset: stanfordmlgroup/CheXpert or compatible HuggingFace mirror
  • Goal: Test ImageAgent with the 14-class DenseNet upgrade across multi-label pathologies
  • Prerequisite: Replace _REGISTRY["chest"] with nickmuchi/densenet-finetuned-chest-xray-classification
  • Expected metrics: Per-class AUC and F1 for 14 CheXpert labels (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion, etc.)
  • Script to create: eval_chexpert.py
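
Since eval_chexpert.py does not exist yet, the following is only a sketch of how per-class AUC could be computed for a multi-label setup. It assumes binary 0/1 ground-truth labels and one score per class per image; uncertain or missing labels would need a policy decided beforehand. The function names and sample data are hypothetical.

```python
from typing import Dict, Optional, Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when one class is absent
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def per_class_auc(
    y_true: Sequence[Sequence[int]],
    y_score: Sequence[Sequence[float]],
    class_names: Sequence[str],
) -> Dict[str, Optional[float]]:
    """Compute AUC independently for each label column."""
    result: Dict[str, Optional[float]] = {}
    for j, name in enumerate(class_names):
        column_true = [row[j] for row in y_true]
        column_score = [row[j] for row in y_score]
        result[name] = binary_auc(column_true, column_score)
    return result


# Hypothetical labels and scores for four images and three of the 14 classes.
class_names = ["Atelectasis", "Cardiomegaly", "Edema"]
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
y_score = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.6, 0.4, 0.3], [0.7, 0.1, 0.1]]
aucs = per_class_auc(y_true, y_score, class_names)
```

The pairwise comparison is quadratic in the number of cases, which is fine for a small sample but would be replaced by a rank-based or library implementation on a full evaluation set. Returning None for a class with no positives makes it explicit that the metric is undefined rather than zero.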

⏳ Eval 6: CUPCase (real clinical case reports)

  • Dataset: Real clinical case report corpus (narrative free-text, ground-truth diagnosis in report conclusion)
  • Goal: Test full pipeline (LLM-enabled) on actual physician-written cases, the closest proxy to real-world deployment
  • Prerequisite: Enable Groq LLM (_groq_enabled = True), add rate-limit retry wrapper
  • Expected metrics: Top-1/3 accuracy; also measure reasoning_chain quality vs. ground truth diagnosis path
  • Script to create: eval_cupcase.py
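
The rate-limit retry wrapper named in the prerequisite could take the form of exponential backoff with jitter around each LLM call. The sketch below is deliberately client-agnostic: RateLimitError is a hypothetical placeholder for whatever exception the LLM client raises when throttled, and the delay parameters are illustrative defaults rather than values from the project.

```python
import random
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical placeholder for the client's rate-limit exception."""


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (RateLimitError,),
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1


# Demo: a call that is throttled twice and then succeeds.
attempts: List[int] = []


def flaky_call() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


delays: List[float] = []
result = call_with_retry(flaky_call, sleep=delays.append)  # record delays instead of sleeping
```

Injecting `sleep` keeps the wrapper testable without real waiting; in an eval script the default `time.sleep` would be used.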

⏳ Eval 7: MedXpertQA-MM (multimodal clinical QA)

  • Dataset: MedXpertQA multimodal variant (clinical questions with both image and text)
  • Goal: Test the full multi-agent pipeline (text + imaging) end-to-end on cases where both modalities are required
  • Prerequisites: Multi-class imaging model (Eval 5 prerequisite), MCQ answer mapper (fixes MedQA gap), Groq LLM enabled
  • Expected metrics: Top-1 accuracy on multimodal questions; ablation study (text-only vs. text+image)
  • Script to create: eval_medxpertqa_mm.py

Improvement Roadmap (by expected impact)

| Priority | Change | Affected Evals | Expected Δ | Effort |
|----------|--------|----------------|------------|--------|
| 🔴 1 | Enable Groq LLM in eval scripts (add retry wrapper) | DDxPlus, MedQA | +15–25% DDxPlus; +20–35% MedQA | Low |
| 🔴 2 | Swap chest ViT → 14-class DenseNet-121 | CheXpert Plus, NIH | Enables 12 new disease classes | Low |
| 🟡 3 | MCQ answer mapper (sentence-transformers cosine sim) | MedQA | +30–50% on diagnosis questions | Medium |
| 🟡 4 | Expand keyword vocabulary (rare diseases, autoimmune) | DDxPlus, CUPCase | +5–10% DDxPlus | Medium |
| 🟢 5 | Vignette lab extractor (regex + LLM pre-pass) | MedQA, CUPCase | Lab critique activates on prose values | Medium |
| 🟢 6 | Respiratory keyword disambiguation (Bronchitis vs COPD) | DDxPlus | Fixes 14% respiratory category | Low |
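
For roadmap item 5, the regex half of the vignette lab extractor might look like the sketch below, which pulls (name, value, unit) triples out of prose so that a lab critique has structured values to work with. The analyte list, unit list, and function name are hypothetical and far from complete; the LLM pre-pass mentioned in the roadmap is not shown.

```python
import re
from typing import List, Tuple

# Assumption: a small illustrative vocabulary; a real extractor would need many more
# analytes, synonyms, and units.
LAB_PATTERN = re.compile(
    r"(?P<name>hemoglobin|glucose|creatinine|sodium|potassium|leukocyte count)"
    r"\s*(?:is|of|:)?\s*"
    r"(?P<value>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)"
    r"(?:\s*(?P<unit>mg/dL|g/dL|mEq/L|mmol/L|/mm3|%))?",
    re.IGNORECASE,
)


def extract_labs(text: str) -> List[Tuple[str, float, str]]:
    """Return (analyte, numeric value, unit) for each lab mention found in text."""
    labs = []
    for match in LAB_PATTERN.finditer(text):
        value = float(match.group("value").replace(",", ""))  # drop thousands separators
        unit = match.group("unit") or ""  # unit is optional in prose
        labs.append((match.group("name").lower(), value, unit))
    return labs


# Hypothetical vignette sentence.
vignette = (
    "Laboratory studies show hemoglobin 9.2 g/dL, a leukocyte count of "
    "14,500/mm3, and serum glucose is 250 mg/dL."
)
labs = extract_labs(vignette)
```

A fixed unit list is used instead of a free-form unit pattern so that an ordinary word following the number is never mistaken for a unit.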