# Evaluation Progress

**System:** hypothetico-deductive multi-agent v2  
**Last updated:** 2026-05-09  
**Pipeline:** BeliefState synthesis · Groq LLM debate · domain-aware ViT routing

---

## Summary

| # | Dataset | Type | Cases | Top-1 | Top-3 | Status |
|---|---|---|---|---|---|---|
| 1 | DDxPlus (eval_01) | Structured symptom (baseline) | 500 | 1.4% | 47.2% | ✅ Done (pre-fix baseline) |
| 2 | DDxPlus (eval_02) | Structured symptom (fixed) | 100 | **68.0%** | **86.0%** | ✅ Done |
| 3 | MedQA-USMLE | Real USMLE MCQ | 200 | 0.0% | 0.0% | ✅ Done (task mismatch; see Eval 3) |
| 4 | Chest X-ray Pneumonia | Medical imaging (binary) | 100 | **95.0%** | **95.0%** | ✅ Done |
| 5 | CheXpert Plus | Medical imaging (multi-label) | - | - | - | ⏳ Pending |
| 6 | CUPCase | Real clinical case reports | - | - | - | ⏳ Pending |
| 7 | MedXpertQA-MM | Multimodal clinical QA | - | - | - | ⏳ Pending |

---

## Completed Evaluations

### ✅ Eval 1: DDxPlus (pre-fix baseline)
- **File:** `results/eval_01_ddxplus.json`
- **Script:** `evaluate.py`
- **Cases:** 500 · **Top-1:** 1.4% · **Top-3:** 47.2%
- **What went wrong:** Three pipeline bugs: (a) DeepSeek emitted its JSON inside `<think>` tags, which were stripped before parsing, leaving `llm_parsed` empty; (b) with an empty hypothesis, the lab agent applied a −0.20 shift to the correct disease; (c) lab `NORMAL_RANGES` category names ("Infection / Leukemia", "Diabetes / Hyperglycemia") dominated `debate_result["ranked"]`, overriding clinical disease names in synthesis. Average confidence was reported as 100% (a normalization artefact).
- **Fixed in:** v2 pipeline; see `agents/orchestrator.py`, `app.py`, `agents/clinical_reasoner.py`
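
Bug (a) can be illustrated with a minimal sketch of a parser that prefers JSON outside the `<think>` block and falls back to JSON inside it, rather than discarding the block outright. The function names `first_json_object` and `parse_llm_output` are hypothetical and are not taken from `agents/orchestrator.py`; the actual fix may differ.

```python
import json
import re

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def first_json_object(text):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    return None


def parse_llm_output(raw):
    """Hypothetical helper: prefer JSON outside <think>, else look inside it."""
    outside = THINK_BLOCK.sub("", raw)
    parsed = first_json_object(outside)
    if parsed is not None:
        return parsed
    for inner in THINK_BLOCK.findall(raw):
        parsed = first_json_object(inner)
        if parsed is not None:
            return parsed
    return {}  # nothing parseable anywhere in the response
```

With this ordering, a response whose only JSON sits inside `<think>` still yields a non-empty result, while a well-formed answer after the reasoning block takes precedence.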

---

### ✅ Eval 2: DDxPlus (fixed pipeline)
- **File:** `results/eval_02_ddxplus.json`
- **Script:** `evaluate.py`
- **Cases:** 100 · **Top-1:** 68.0% · **Top-3:** 86.0% · **Runtime:** 49.4s (0.49s/case)
- **Agents active:** ClinicalReasonerAgent (keyword fallback), LabAgent (early return; no labs in DDxPlus), ImageAgent (no_imaging), HistoryAgent, DebateAgent (vote-only), SynthesisAgent (BeliefState)
- **Best diseases (F1=1.0):** Tuberculosis, Spontaneous pneumothorax, Scombroid food poisoning, Panic attack, PSVT, Allergic sinusitis, Anemia, Atrial fibrillation, Cluster headache, GERD, Localized edema, Acute pulmonary edema
- **Worst diseases (F1=0.0):** Acute rhinosinusitis, Bronchitis, Chronic rhinosinusitis, HIV (initial infection), Influenza
- **Key finding:** Respiratory diseases (14% category top-1) are the main weakness: Bronchitis, Pneumonia, and Influenza share overlapping keyword sets and are misclassified as "COPD exacerbation".
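
For reference, the Top-1 and Top-3 figures can be computed as in the sketch below. This is an illustration of the metric only, assuming each case yields a ranked list of disease names; the function name `top_k_accuracy` and the toy data are hypothetical and do not reproduce `evaluate.py` or its results.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of cases whose true label appears in the first k predictions."""
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    if not ground_truth:
        return 0.0
    hits = 0
    for ranked, truth in zip(ranked_predictions, ground_truth):
        # Case-insensitive comparison, assuming labels differ only in casing.
        top_k = [name.lower() for name in ranked[:k]]
        if truth.lower() in top_k:
            hits += 1
    return hits / len(ground_truth)


# Toy data for illustration only (not taken from the evaluation results).
ranked = [
    ["Anemia", "GERD", "PSVT"],
    ["COPD exacerbation", "Bronchitis", "Influenza"],
    ["Panic attack", "GERD", "Anemia"],
]
truth = ["Anemia", "Bronchitis", "Tuberculosis"]
top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
```

The second toy case shows the pattern described in the key finding: the true respiratory label is in the top 3 but is outranked by "COPD exacerbation", so it counts toward Top-3 only.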

---

### ✅ Eval 3: MedQA-USMLE
- **File:** `results/eval_02_medqa.json`
- **Script:** `eval_medqa.py`
- **Cases:** 200 · **Top-1:** 0.0% · **Top-3:** 0.0% · **Runtime:** 403.9s (2.02s/case)
- **Agents active:** ClinicalReasonerAgent (keyword fallback), all other agents (early return), DebateAgent (vote-only)
- **Why 0%:** Structural task-format mismatch. The pipeline outputs disease names, whereas MedQA answer options are drugs ("Ketotifen eye drops"), mechanisms ("Cross-linking of DNA"), pathology descriptions ("Lactose-fermenting gram-negative rods"), and ethics choices rather than disease names. A 0% score is therefore expected and does not indicate a pipeline bug.
- **Most common prediction:** `acute otitis media`, whose keywords (fever, cough, nasal congestion, ear) appear in many USMLE vignettes.
- **Fix required:** An MCQ answer mapper layer after the pipeline (cosine similarity: disease → MCQ option)
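
The mapper idea can be sketched as follows. To stay self-contained, this example uses bag-of-words vectors as a stand-in for sentence embeddings, which is an assumption made for illustration: token overlap cannot link a disease to a drug or mechanism, so a real mapper would need an embedding model. The function names and the option texts other than "Cross-linking of DNA" are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_option(predicted_diseases, options):
    """Pick the MCQ option most similar to any predicted disease name."""
    scores = {}
    for label, option_text in options.items():
        option_vec = bag_of_words(option_text)
        scores[label] = max(
            (cosine(bag_of_words(d), option_vec) for d in predicted_diseases),
            default=0.0,
        )
    best = max(scores, key=scores.get)
    return best, scores[best]


# Illustrative inputs only.
options = {
    "A": "Otitis media with effusion",
    "B": "Cross-linking of DNA",
    "C": "Viral pharyngitis",
}
best_label, best_score = map_to_option(["acute otitis media"], options)
```

A similarity threshold below which the mapper abstains would be a natural addition, since many MedQA questions have no disease-name option at all.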

---

### ✅ Eval 4: Chest X-ray Pneumonia (ImageAgent isolated)
- **File:** `results/eval_03_chestxray.json`
- **Script:** `eval_chestxray.py`
- **Cases:** 100 (shuffled, seed=42) · **Top-1:** 95.0% · **Runtime:** 39.8s (0.4s/case)
- **Model:** `nickmuchi/vit-finetuned-chest-xray-pneumonia` (2-class: NORMAL / PNEUMONIA)
- **Dataset note:** Originally requested NIH ChestX-ray14 (`alkzar90/NIH-Chest-X-ray-dataset`), which was unavailable because its dataset loading script is deprecated. Used `hf-vision/chest-xray-pneumonia` instead (same source images, Kaggle chest X-ray dataset).
- **Per-class F1:** PNEUMONIA 0.955 · NORMAL 0.944
- **Errors:** 5 in total: 3 false positives (NORMAL → PNEUMONIA), 2 false negatives (PNEUMONIA → NORMAL)
- **Bug fixed during eval:** `_detect_domain()` was routing nearly-square chest X-rays to the brain model (which outputs "meningioma tumor", "no tumor"). Fixed with a dark-border heuristic (`border_mean < 30`).
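
A dark-border check of this kind might look like the sketch below. It assumes an 8-bit grayscale image given as a list of pixel rows, an 8-pixel border width, and that a dark border indicates a brain scan while anything else goes to the chest model. Only the `border_mean < 30` threshold comes from the note above; the routing direction, the border width, and the function names are assumptions, not the actual `_detect_domain()` code.

```python
def border_mean(pixels, border=8):
    """Mean intensity (0-255) of the outer `border` pixels of a grayscale image."""
    height = len(pixels)
    width = len(pixels[0])
    # Clamp the border so small images still have a non-empty frame.
    b = max(1, min(border, height // 2, width // 2))
    total = 0
    count = 0
    for y, row in enumerate(pixels):
        for x, value in enumerate(row):
            on_border = y < b or y >= height - b or x < b or x >= width - b
            if on_border:
                total += value
                count += 1
    return total / count if count else 0.0


def detect_domain(pixels, threshold=30):
    """Assumed routing: dark frame -> brain model, otherwise chest model."""
    return "brain" if border_mean(pixels) < threshold else "chest"
```

The appeal of this heuristic over an aspect-ratio test is that it looks at image content, so a nearly-square chest X-ray is no longer routed on shape alone.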

---

## Pending Evaluations

### ⏳ Eval 5: CheXpert Plus (multi-label chest X-ray)
- **Dataset:** `stanfordmlgroup/CheXpert` or compatible HuggingFace mirror
- **Goal:** Test ImageAgent with the 14-class DenseNet upgrade across multi-label pathologies
- **Prerequisite:** Replace `_REGISTRY["chest"]` with `nickmuchi/densenet-finetuned-chest-xray-classification`
- **Expected metrics:** Per-class AUC and F1 for 14 CheXpert labels (Atelectasis, Cardiomegaly, Consolidation, Edema, Pleural Effusion, etc.)
- **Script to create:** `eval_chexpert.py`
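
The per-class metrics could be computed along the lines of this sketch, which assumes binary 0/1 ground-truth labels, one score per label per image, and a fixed 0.5 decision threshold for F1. All function names and the toy data are hypothetical; `eval_chexpert.py` does not exist yet.

```python
def auc(scores, labels):
    """ROC AUC as the probability a positive outranks a negative (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        return None  # undefined when a class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def f1(scores, labels, threshold=0.5):
    """F1 for one label after thresholding the scores."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def per_label_metrics(label_names, score_rows, truth_rows):
    """Compute AUC and F1 independently for each label column."""
    metrics = {}
    for j, name in enumerate(label_names):
        column_scores = [row[j] for row in score_rows]
        column_truth = [row[j] for row in truth_rows]
        metrics[name] = {
            "auc": auc(column_scores, column_truth),
            "f1": f1(column_scores, column_truth),
        }
    return metrics


# Toy data for illustration only: 3 images, 2 labels.
names = ["Atelectasis", "Edema"]
scores = [[0.9, 0.2], [0.4, 0.7], [0.1, 0.6]]
truth = [[1, 0], [0, 1], [0, 0]]
results = per_label_metrics(names, scores, truth)
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for a small evaluation subset; a rank-based implementation would scale better.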

---

### ⏳ Eval 6: CUPCase (real clinical case reports)
- **Dataset:** Real clinical case report corpus (narrative free-text, ground-truth diagnosis in report conclusion)
- **Goal:** Test the full pipeline (LLM-enabled) on actual physician-written cases, the closest proxy to real-world deployment
- **Prerequisite:** Enable Groq LLM (`_groq_enabled = True`), add rate-limit retry wrapper
- **Expected metrics:** Top-1/3 accuracy; also measure reasoning_chain quality vs. ground truth diagnosis path
- **Script to create:** `eval_cupcase.py`
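
The rate-limit retry wrapper named in the prerequisite could be sketched as exponential backoff with a little jitter. `RateLimitError` here is a hypothetical stand-in for whatever exception the LLM client raises on a rate-limit response, and `with_retry` and its parameters are illustrative names, not part of the existing pipeline.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


def with_retry(call, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Run call(); on RateLimitError wait with exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Small random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Usage sketch: a call that is rate-limited twice and then succeeds.
attempts = {"count": 0}
recorded_delays = []


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("simulated rate limit")
    return "ok"


result = with_retry(flaky_call, sleep=recorded_delays.append)
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting; in an evaluation script the default `time.sleep` would be used.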

---

### ⏳ Eval 7: MedXpertQA-MM (multimodal clinical QA)
- **Dataset:** MedXpertQA multimodal variant: clinical questions with both image and text
- **Goal:** Test the full multi-agent pipeline (text + imaging) end-to-end on cases where both modalities are required
- **Prerequisites:** Multi-class imaging model (Eval 5 prerequisite), MCQ answer mapper (fixes MedQA gap), Groq LLM enabled
- **Expected metrics:** Top-1 accuracy on multimodal questions; ablation study (text-only vs. text+image)
- **Script to create:** `eval_medxpertqa_mm.py`

---

## Improvement Roadmap (by expected impact)

| Priority | Change | Affected Evals | Expected Δ | Effort |
|---|---|---|---|---|
| 🔴 1 | Enable Groq LLM in eval scripts (add retry wrapper) | DDxPlus, MedQA | +15–25% DDxPlus; +20–35% MedQA | Low |
| 🔴 2 | Swap chest ViT → 14-class DenseNet-121 | CheXpert Plus, NIH | Enables 12 new disease classes | Low |
| 🟡 3 | MCQ answer mapper (sentence-transformers cosine sim) | MedQA | +30–50% on diagnosis questions | Medium |
| 🟡 4 | Expand keyword vocabulary (rare diseases, autoimmune) | DDxPlus, CUPCase | +5–10% DDxPlus | Medium |
| 🟢 5 | Vignette lab extractor (regex + LLM pre-pass) | MedQA, CUPCase | Lab critique activates on prose values | Medium |
| 🟢 6 | Respiratory keyword disambiguation (Bronchitis vs COPD) | DDxPlus | Fixes 14% respiratory category | Low |