Evaluation Progress
System: hypothetico-deductive multi-agent v2
Last updated: 2026-05-09
Pipeline: BeliefState synthesis · Groq LLM debate · domain-aware ViT routing
Summary
| # | Dataset | Type | Cases | Top-1 | Top-3 | Status |
|---|---|---|---|---|---|---|
| 1 | DDxPlus (eval_01) | Structured symptom (baseline) | 500 | 1.4% | 47.2% | ✅ Done (pre-fix baseline) |
| 2 | DDxPlus (eval_02) | Structured symptom (fixed) | 100 | 68.0% | 86.0% | ✅ Done |
| 3 | MedQA-USMLE | Real USMLE MCQ | 200 | 0.0% | 0.0% | ✅ Done (task mismatch; see note) |
| 4 | Chest X-ray Pneumonia | Medical imaging (binary) | 100 | 95.0% | 95.0% | ✅ Done |
| 5 | CheXpert Plus | Medical imaging (multi-label) | – | – | – | ⏳ Pending |
| 6 | CUPCase | Real clinical case reports | – | – | – | ⏳ Pending |
| 7 | MedXpertQA-MM | Multimodal clinical QA | – | – | – | ⏳ Pending |
Completed Evaluations
✅ Eval 1: DDxPlus (pre-fix baseline)
- File: `results/eval_01_ddxplus.json`
- Script: `evaluate.py`
- Cases: 500 · Top-1: 1.4% · Top-3: 47.2%
- What went wrong: Three pipeline bugs: (a) DeepSeek JSON inside `<think>` was stripped before parsing, leaving `llm_parsed` empty; (b) with an empty hypothesis, the lab agent applied a −0.20 shift to the correct disease; (c) lab `NORMAL_RANGES` category names ("Infection / Leukemia", "Diabetes / Hyperglycemia") dominated `debate_result["ranked"]`, overriding clinical disease names in synthesis. Average confidence was reported as 100% (a normalization artefact).
- Fixed in: v2 pipeline; see `agents/orchestrator.py`, `app.py`, `agents/clinical_reasoner.py`
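
Bug (a) can be illustrated with a minimal sketch. This is not the fix that landed in `agents/orchestrator.py`; the function names `first_json_object` and `parse_llm_reply` are hypothetical, and the sketch assumes the model reply is a plain string that may wrap its JSON in a closed `<think>...</think>` block. The idea is to look for JSON outside the reasoning block first and only then fall back to JSON inside it, instead of discarding the block unconditionally.

```python
import json
import re

# Matches a closed <think>...</think> reasoning block (assumed reply format).
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def first_json_object(text):
    """Return the first parseable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    return None


def parse_llm_reply(raw):
    """Prefer JSON outside <think>; fall back to JSON inside it."""
    outside = THINK_BLOCK.sub("", raw)
    parsed = first_json_object(outside)
    if parsed is not None:
        return parsed
    for inner in THINK_BLOCK.findall(raw):
        parsed = first_json_object(inner)
        if parsed is not None:
            return parsed
    return {}  # nothing parseable; caller should treat this as "no hypothesis"


if __name__ == "__main__":
    print(parse_llm_reply('<think>{"ranked": ["Pneumonia"]}</think>'))
```

Returning an empty dict keeps the failure visible to the caller, which matters here because bug (b) was triggered by an empty hypothesis being passed downstream silently.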
✅ Eval 2: DDxPlus (fixed pipeline)
- File: `results/eval_02_ddxplus.json`
- Script: `evaluate.py`
- Cases: 100 · Top-1: 68.0% · Top-3: 86.0% · Runtime: 49.4s (0.49s/case)
- Agents active: ClinicalReasonerAgent (keyword fallback), LabAgent (early return: no labs in DDxPlus), ImageAgent (no_imaging), HistoryAgent, DebateAgent (vote-only), SynthesisAgent (BeliefState)
- Best diseases (F1=1.0): Tuberculosis, Spontaneous pneumothorax, Scombroid food poisoning, Panic attack, PSVT, Allergic sinusitis, Anemia, Atrial fibrillation, Cluster headache, GERD, Localized edema, Acute pulmonary edema
- Worst diseases (F1=0.0): Acute rhinosinusitis, Bronchitis, Chronic rhinosinusitis, HIV (initial infection), Influenza
- Key finding: Respiratory diseases (14% category top-1) are the main weakness: Bronchitis, Pneumonia, and Influenza share overlapping keyword sets and are misclassified as "COPD exacerbation".
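
For reference, the Top-1 and Top-3 figures above follow the usual top-k definition: a case counts as a hit if the ground-truth disease appears among the first k ranked predictions. The sketch below is an assumption about how such a metric is typically computed, not a copy of `evaluate.py`; the function name `top_k_accuracy` and the case-insensitive matching are illustrative choices.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of cases whose gold label is among the first k predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = 0
    for ranked, gold in zip(ranked_predictions, gold_labels):
        # Case-insensitive comparison (an assumption about label formatting).
        top_k = [name.lower() for name in ranked[:k]]
        if gold.lower() in top_k:
            hits += 1
    return hits / len(gold_labels)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the evaluation results.
    predictions = [
        ["GERD", "Anemia", "PSVT"],
        ["COPD exacerbation", "Bronchitis", "Influenza"],
        ["Panic attack", "PSVT", "GERD"],
    ]
    gold = ["GERD", "Bronchitis", "Anemia"]
    print(top_k_accuracy(predictions, gold, k=1))
    print(top_k_accuracy(predictions, gold, k=3))
```

The second toy case mirrors the respiratory failure mode: the correct label is present in the ranking but loses the first slot to "COPD exacerbation", so it counts for Top-3 but not Top-1.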
✅ Eval 3: MedQA-USMLE
- File:
results/eval_02_medqa.json - Script:
eval_medqa.py - Cases: 200 Β· Top-1: 0.0% Β· Top-3: 0.0% Β· Runtime: 403.9s (2.02s/case)
- Agents active: ClinicalReasonerAgent (keyword fallback), all other agents (early return), DebateAgent (vote-only)
- Why 0%: Structural task-format mismatch. Our pipeline outputs disease names; MedQA answer options are drugs ("Ketotifen eye drops"), mechanisms ("Cross-linking of DNA"), pathology descriptions ("Lactose-fermenting gram-negative rods"), and ethics choices β not disease names. Scoring 0% is expected and does not indicate a pipeline bug.
- Most common prediction:
acute otitis mediaβ its keywords (fever, cough, nasal congestion, ear) appear in many USMLE vignettes. - Fix required: MCQ answer mapper layer post-pipeline (cosine similarity: disease β MCQ option)
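
The mapper layer could take roughly the following shape. This is a sketch under stated assumptions: it uses a bag-of-words cosine from the standard library as a stand-in for real sentence embeddings, and the names `map_to_option` and `min_score` are hypothetical. A production version would swap `bag_of_words` for an embedding model.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_option(predicted_disease, options, min_score=0.2):
    """Return (option_key, score) for the closest option, or None if none is close."""
    query = bag_of_words(predicted_disease)
    scores = {key: cosine(query, bag_of_words(text)) for key, text in options.items()}
    best = max(scores, key=scores.get)
    if scores[best] < min_score:
        return None  # abstain rather than guess
    return best, scores[best]


if __name__ == "__main__":
    options = {
        "A": "Acute otitis media",
        "B": "Ketotifen eye drops",
        "C": "Cross-linking of DNA",
    }
    print(map_to_option("acute otitis media", options))
    print(map_to_option("allergic conjunctivitis", options))
```

The second call abstains because "allergic conjunctivitis" shares no tokens with "Ketotifen eye drops". That is the case a lexical measure cannot bridge and the reason the fix calls for semantic similarity rather than string overlap.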
✅ Eval 4: Chest X-ray Pneumonia (ImageAgent isolated)
- File: `results/eval_03_chestxray.json`
- Script: `eval_chestxray.py`
- Cases: 100 (shuffled seed=42) · Top-1: 95.0% · Runtime: 39.8s (0.4s/case)
- Model: `nickmuchi/vit-finetuned-chest-xray-pneumonia` (2-class: NORMAL / PNEUMONIA)
- Dataset note: Originally requested NIH ChestX-ray14 (`alkzar90/NIH-Chest-X-ray-dataset`), which was unavailable due to a deprecated dataset loading script. Used `hf-vision/chest-xray-pneumonia` (same source images, Kaggle chest X-ray dataset).
- Per-class F1: PNEUMONIA 0.955 · NORMAL 0.944
- Errors: 5 total: 3 false positives (NORMAL → PNEUMONIA), 2 false negatives (PNEUMONIA → NORMAL)
- Bug fixed during eval: `_detect_domain()` was routing nearly-square chest X-rays to the brain model (outputs "meningioma tumor", "no tumor"). Fixed with a dark-border heuristic (border_mean < 30).
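
A dark-border check of this kind might look like the sketch below. It is illustrative only and not the project's `_detect_domain()`. It assumes an 8-bit grayscale image given as a list of pixel rows, and it assumes the direction of the rule (a border mean below 30 indicates a brain scan on a black background, anything brighter is treated as a chest X-ray), which the note above does not spell out. The names `border_mean` and `detect_domain` are hypothetical.

```python
def border_mean(pixels, width=1):
    """Mean intensity (0-255) of the outer frame of a 2-D grayscale image."""
    rows, cols = len(pixels), len(pixels[0])
    frame = [
        pixels[r][c]
        for r in range(rows)
        for c in range(cols)
        if r < width or r >= rows - width or c < width or c >= cols - width
    ]
    return sum(frame) / len(frame)


def detect_domain(pixels, threshold=30):
    """Route by border brightness; the direction of the rule is an assumption."""
    return "brain" if border_mean(pixels) < threshold else "chest"


if __name__ == "__main__":
    # Toy 4x4 images: black frame around a bright centre vs. uniform mid-grey.
    mri_like = [
        [0, 0, 0, 0],
        [0, 200, 200, 0],
        [0, 200, 200, 0],
        [0, 0, 0, 0],
    ]
    xray_like = [[90] * 4 for _ in range(4)]
    print(detect_domain(mri_like), detect_domain(xray_like))
```

Using the border rather than the whole image keeps the signal independent of anatomy in the centre of the frame, which is where the two modalities overlap most in brightness.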
Pending Evaluations
⏳ Eval 5: CheXpert Plus (multi-label chest X-ray)
- Dataset: `stanfordmlgroup/CheXpert` or a compatible Hugging Face mirror
- Goal: Test ImageAgent with the 14-class DenseNet upgrade across multi-label pathologies
- Prerequisite: Replace `_REGISTRY["chest"]` with `nickmuchi/densenet-finetuned-chest-xray-classification`
- Expected metrics: Per-class AUC and F1 for 14 CheXpert labels (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion, etc.)
- Script to create: `eval_chexpert.py`
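
Since `eval_chexpert.py` does not exist yet, the following is only a sketch of how per-class AUC could be computed for multi-label output. It assumes binary ground-truth rows and per-class score rows of equal width, and uses the pairwise (Mann-Whitney) definition of AUC in pure Python; the function names are hypothetical, and a real script would more likely call an established metrics library.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # undefined when a class has only one label value
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def per_class_auc(y_true, y_score, class_names):
    """Compute one AUC per label column of a multi-label problem."""
    return {
        name: binary_auc([row[j] for row in y_true], [row[j] for row in y_score])
        for j, name in enumerate(class_names)
    }


if __name__ == "__main__":
    # Toy data for illustration only; not CheXpert results.
    classes = ["Atelectasis", "Edema"]
    y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
    y_score = [[0.9, 0.2], [0.1, 0.6], [0.8, 0.7], [0.3, 0.4]]
    print(per_class_auc(y_true, y_score, classes))
```

Returning `None` for a class with no positives (or no negatives) in the sample avoids reporting a misleading number for rare labels in a small evaluation subset.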
⏳ Eval 6: CUPCase (real clinical case reports)
- Dataset: Real clinical case report corpus (narrative free-text, ground-truth diagnosis in report conclusion)
- Goal: Test full pipeline (LLM-enabled) on actual physician-written cases, the closest proxy to real-world deployment
- Prerequisite: Enable Groq LLM (`_groq_enabled = True`), add rate-limit retry wrapper
- Expected metrics: Top-1/3 accuracy; also measure reasoning_chain quality vs. ground truth diagnosis path
- Script to create: `eval_cupcase.py`
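
The rate-limit retry wrapper named in the prerequisite could be as small as the sketch below. It makes no claims about the Groq client: `RateLimitError` is a hypothetical stand-in for whatever exception the real client raises on HTTP 429, and `with_retry`, `max_attempts`, and `base_delay` are illustrative names. The technique shown is exponential backoff with a small random jitter.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


def with_retry(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on RateLimitError wait 1x, 2x, 4x... base_delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the eval script
            delay = base_delay * 2 ** (attempt - 1)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_llm_call():
        # Fails twice, then succeeds; simulates a briefly throttled endpoint.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RateLimitError("429")
        return "ok"

    print(with_retry(flaky_llm_call, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting, and re-raising after the final attempt lets the eval script record the case as failed instead of hanging.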
⏳ Eval 7: MedXpertQA-MM (multimodal clinical QA)
- Dataset: MedXpertQA multimodal variant (clinical questions with both image and text)
- Goal: Test the full multi-agent pipeline (text + imaging) end-to-end on cases where both modalities are required
- Prerequisites: Multi-class imaging model (Eval 5 prerequisite), MCQ answer mapper (fixes MedQA gap), Groq LLM enabled
- Expected metrics: Top-1 accuracy on multimodal questions; ablation study (text-only vs. text+image)
- Script to create: `eval_medxpertqa_mm.py`
Improvement Roadmap (by expected impact)
| Priority | Change | Affected Evals | Expected Δ | Effort |
|---|---|---|---|---|
| 🔴 1 | Enable Groq LLM in eval scripts (add retry wrapper) | DDxPlus, MedQA | +15–25% DDxPlus; +20–35% MedQA | Low |
| 🔴 2 | Swap chest ViT → 14-class DenseNet-121 | CheXpert Plus, NIH | Enables 12 new disease classes | Low |
| 🟡 3 | MCQ answer mapper (sentence-transformers cosine sim) | MedQA | +30–50% on diagnosis questions | Medium |
| 🟡 4 | Expand keyword vocabulary (rare diseases, autoimmune) | DDxPlus, CUPCase | +5–10% DDxPlus | Medium |
| 🟢 5 | Vignette lab extractor (regex + LLM pre-pass) | MedQA, CUPCase | Lab critique activates on prose values | Medium |
| 🟢 6 | Respiratory keyword disambiguation (Bronchitis vs COPD) | DDxPlus | Fixes 14% respiratory category | Low |