MedPanel Benchmark Results
Single MedGemma vs MedPanel Multi-Agent System
Overview
This benchmark compares two approaches on the same 10 clinical test cases:
- Baseline: Single MedGemma-4B call with a standard diagnostic prompt
- MedPanel: Full 5-agent pipeline (Radiologist + Internist + Evidence Reviewer + Devil's Advocate + Orchestrator)
All cases were selected to represent real-world diagnostic difficulty: the kind of presentations where errors actually happen in clinical practice.
Overall Results
| Metric | Single MedGemma | MedPanel |
|---|---|---|
| Overall Accuracy | 70% (7/10) | 100% (10/10) |
| Easy Cases | 67% (2/3) | 100% (3/3) |
| Medium Cases | 67% (2/3) | 100% (3/3) |
| Hard Cases | 75% (3/4) | 100% (4/4) |
| Dangerous Misses Caught | 0 | 3 out of 10 |
| Improvement | n/a | +30 percentage points |
Case-by-Case Results
| # | Condition | Difficulty | Single MedGemma | MedPanel | Devil's Advocate Helped |
|---|---|---|---|---|---|
| 1 | Tuberculosis | Hard | Correct | Correct | No |
| 2 | Meningitis | Hard | Correct | Correct | No |
| 3 | Myocardial Infarction | Easy | Correct | Correct | No |
| 4 | Interstitial Lung Disease | Medium | Correct | Correct | No |
| 5 | Hypoglycemia | Easy | Correct | Correct | No |
| 6 | Ectopic Pregnancy | Hard | Incorrect (said appendicitis) | Correct | Yes |
| 7 | Spinal Cord Compression | Medium | Incorrect (said stenosis) | Correct | Yes |
| 8 | Malaria | Medium | Correct | Correct | No |
| 9 | Subarachnoid Hemorrhage | Hard | Correct | Correct | No |
| 10 | Cholecystitis | Easy | Incorrect (said colic) | Correct | Yes |
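The per-difficulty and overall figures can be recomputed from the ten case outcomes reported in this benchmark. The snippet below is a minimal sketch for checking the arithmetic, not the project's evaluation script; the `CASES` list, `tally`, and `pct` are illustrative names, and the outcomes are transcribed by hand from the results reported here.

```python
from collections import defaultdict

# (condition, difficulty, single_correct, medpanel_correct),
# transcribed by hand from the reported results (assumed layout).
CASES = [
    ("Tuberculosis", "Hard", True, True),
    ("Meningitis", "Hard", True, True),
    ("Myocardial Infarction", "Easy", True, True),
    ("Interstitial Lung Disease", "Medium", True, True),
    ("Hypoglycemia", "Easy", True, True),
    ("Ectopic Pregnancy", "Hard", False, True),
    ("Spinal Cord Compression", "Medium", False, True),
    ("Malaria", "Medium", True, True),
    ("Subarachnoid Hemorrhage", "Hard", True, True),
    ("Cholecystitis", "Easy", False, True),
]


def tally(cases):
    """Return {level: [single_hits, medpanel_hits, total]}, including an "Overall" row."""
    counts = defaultdict(lambda: [0, 0, 0])
    for _, difficulty, single_ok, panel_ok in cases:
        for key in (difficulty, "Overall"):
            counts[key][0] += int(single_ok)
            counts[key][1] += int(panel_ok)
            counts[key][2] += 1
    return dict(counts)


def pct(hits, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * hits / total)


if __name__ == "__main__":
    for level, (single, panel, total) in tally(CASES).items():
        print(f"{level}: single {pct(single, total)}% ({single}/{total}), "
              f"MedPanel {pct(panel, total)}% ({panel}/{total})")
```

With only three or four cases per difficulty level, a single miss moves a level's accuracy by 25 to 33 percentage points, so the per-level percentages are best read alongside the raw counts.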
Devil's Advocate Impact
The Devil's Advocate agent was the deciding factor in 3 out of 10 cases (30%).
| Case | Single Agent Said | Devil's Advocate Flagged |
|---|---|---|
| Ectopic Pregnancy | Appendicitis | Ectopic pregnancy: tubal rupture risk |
| Spinal Cord Compression | Stenosis | Cord compression: urgent surgical review |
| Cholecystitis | Colic | Cholecystitis: risk of perforation |
Breakdown by Difficulty
Easy Cases (3 total)
| Metric | Single | MedPanel |
|---|---|---|
| Accuracy | 67% | 100% |
| Missed | Cholecystitis | None |
Medium Cases (3 total)
| Metric | Single | MedPanel |
|---|---|---|
| Accuracy | 67% | 100% |
| Missed | Spinal Cord Compression | None |
Hard Cases (4 total)
| Metric | Single | MedPanel |
|---|---|---|
| Accuracy | 75% | 100% |
| Missed | Ectopic Pregnancy | None |
Even "easy" cases like Cholecystitis were missed, so the problem isn't just complexity. Confident wrong answers happen at every level.
Methodology
Evaluation:
- Each case run through both systems independently
- Ground truth checked against predicted output
- Synonym matching used for equivalent terms (e.g. "MI" = "Myocardial Infarction", "TB" = "Tuberculosis")
- Case marked correct only if ground truth appeared in primary output
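The scoring rule described in the list above could look roughly like the following. This is a sketch under assumptions, not the benchmark's actual scorer: the `SYNONYMS` table and the `is_correct` helper are hypothetical names, only the MI and TB entries come from the text, and whole-word matching is a choice made here so that a short abbreviation such as "MI" does not match inside an unrelated word.

```python
import re

# Hypothetical synonym table; only the MI and TB pairs are taken from the text.
SYNONYMS = {
    "myocardial infarction": ["mi"],
    "tuberculosis": ["tb"],
}


def _contains_term(text, term):
    """True if `term` occurs in `text` as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def is_correct(ground_truth, primary_output, synonyms=SYNONYMS):
    """Mark a case correct only if the ground truth (or a listed synonym)
    appears in the model's primary output."""
    terms = [ground_truth] + synonyms.get(ground_truth.lower(), [])
    return any(_contains_term(primary_output, term) for term in terms)
```

For instance, `is_correct("Myocardial Infarction", "Primary diagnosis: acute MI")` is true, while `is_correct("Cholecystitis", "Biliary colic")` is false. String matching of this kind cannot tell a diagnosis that is asserted from one that is mentioned and ruled out, so borderline outputs would still need manual review.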
Single agent prompt:
You are a medical AI assistant. Analyze this clinical case
and provide your primary diagnosis.
Clinical case: {notes}
Provide a clear, concise primary diagnosis.
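The `{notes}` placeholder in the prompt above can be filled with ordinary string formatting. The sketch below only builds the prompt text; it does not call the model, and `SINGLE_AGENT_PROMPT` and `build_single_agent_prompt` are illustrative names rather than identifiers from the project.

```python
# Baseline prompt as quoted above, with {notes} as the only placeholder.
SINGLE_AGENT_PROMPT = (
    "You are a medical AI assistant. Analyze this clinical case\n"
    "and provide your primary diagnosis.\n"
    "Clinical case: {notes}\n"
    "Provide a clear, concise primary diagnosis."
)


def build_single_agent_prompt(notes):
    """Insert the clinical notes into the baseline prompt (hypothetical helper)."""
    return SINGLE_AGENT_PROMPT.format(notes=notes.strip())
```

Because `str.format` only interprets braces in the template, braces that appear inside the clinical notes themselves are passed through unchanged.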
MedPanel pipeline:
- Radiologist Agent: image analysis
- Internist Agent: clinical notes analysis
- Evidence Reviewer: live PubMed RAG (PubMedBERT + FAISS)
- Devil's Advocate: adversarial challenge
- Orchestrator: final synthesis
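One way the five roles listed above might be chained is sketched below. The source does not describe how data flows between agents, so the sequential order, the prompt strings, and the `run_medpanel` function are all assumptions made for illustration. Each agent is modeled as a plain callable from prompt text to response text, so any model backend could be plugged in.

```python
from typing import Callable, Dict, Optional

# An "agent" here is any callable mapping a prompt string to a response string.
Agent = Callable[[str], str]


def run_medpanel(
    notes: str,
    agents: Dict[str, Agent],
    image_desc: Optional[str] = None,
) -> Dict[str, str]:
    """Run the five roles in listed order (assumed) and keep every intermediate output."""
    out: Dict[str, str] = {}
    out["radiologist"] = agents["radiologist"](f"Image: {image_desc or 'none provided'}")
    out["internist"] = agents["internist"](f"Clinical notes: {notes}")

    findings = f"Radiology: {out['radiologist']}\nInternal medicine: {out['internist']}"
    out["evidence"] = agents["evidence_reviewer"](findings)

    # The Devil's Advocate sees the findings plus the evidence and argues against them.
    out["challenge"] = agents["devils_advocate"](f"{findings}\nEvidence: {out['evidence']}")

    # The Orchestrator sees everything, including the challenge, before deciding.
    out["final"] = agents["orchestrator"](
        f"{findings}\nEvidence: {out['evidence']}\nChallenge: {out['challenge']}"
    )
    return out
```

Returning the intermediate outputs rather than only the final answer makes it possible to check afterwards whether the Devil's Advocate step changed the outcome for a given case.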
Model: google/medgemma-4b-it (all agents)
Embeddings: pritamdeka/S-PubMedBert-MS-MARCO
Infrastructure: HuggingFace Spaces, T4 GPU
Average latency: ~50 seconds per full MedPanel run
Key Takeaway
Single MedGemma gave a confident but wrong primary diagnosis in 3 of these 10 cases (30%). MedPanel got all 10 right by forcing agents to argue before committing to a diagnosis. The Devil's Advocate was the deciding factor in all 3 of the dangerous misses caught.
Benchmark conducted as part of the Google MedGemma Impact Challenge 2025. ⚠️ Research only. Not for clinical use without validation and regulatory approval.