MedPanel Benchmark Results

Single MedGemma vs MedPanel Multi-Agent System


Overview

This benchmark compares two approaches on the same 10 clinical test cases:

  • Baseline: Single MedGemma-4B call with a standard diagnostic prompt
  • MedPanel: Full 5-agent pipeline (Radiologist + Internist + Evidence Reviewer + Devil's Advocate + Orchestrator)

All cases were selected to represent real-world diagnostic difficulty: the kind of presentations where errors actually happen in clinical practice.


Overall Results

| Metric | Single MedGemma | MedPanel |
|---|---|---|
| Overall Accuracy | 70% (7/10) | 100% (10/10) |
| Easy Cases | 67% (2/3) | 100% (3/3) |
| Medium Cases | 67% (2/3) | 100% (3/3) |
| Hard Cases | 75% (3/4) | 100% (4/4) |
| Dangerous Misses Caught | 0 | 3 out of 10 |
| Improvement | - | +30 percentage points |

Case-by-Case Results

| # | Condition | Difficulty | Single MedGemma | MedPanel | Devil's Advocate Helped |
|---|---|---|---|---|---|
| 1 | Tuberculosis | Hard | ✅ | ✅ | ❌ |
| 2 | Meningitis | Hard | ✅ | ✅ | ❌ |
| 3 | Myocardial Infarction | Easy | ✅ | ✅ | ❌ |
| 4 | Interstitial Lung Disease | Medium | ✅ | ✅ | ❌ |
| 5 | Hypoglycemia | Easy | ✅ | ✅ | ❌ |
| 6 | Ectopic Pregnancy | Hard | ❌ said appendicitis | ✅ | ✅ |
| 7 | Spinal Cord Compression | Medium | ❌ said stenosis | ✅ | ✅ |
| 8 | Malaria | Medium | ✅ | ✅ | ❌ |
| 9 | Subarachnoid Hemorrhage | Hard | ✅ | ✅ | ❌ |
| 10 | Cholecystitis | Easy | ❌ said colic | ✅ | ✅ |
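
The summary percentages can be re-derived from the case-by-case rows. The following is a minimal sketch, not the project's actual scoring script: the `CASES` list is transcribed by hand from the table above, and the `tally` helper is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# (case number, condition, difficulty, single_correct, medpanel_correct),
# transcribed from the case-by-case table; the tuple layout is an assumption.
CASES = [
    (1, "Tuberculosis", "Hard", True, True),
    (2, "Meningitis", "Hard", True, True),
    (3, "Myocardial Infarction", "Easy", True, True),
    (4, "Interstitial Lung Disease", "Medium", True, True),
    (5, "Hypoglycemia", "Easy", True, True),
    (6, "Ectopic Pregnancy", "Hard", False, True),
    (7, "Spinal Cord Compression", "Medium", False, True),
    (8, "Malaria", "Medium", True, True),
    (9, "Subarachnoid Hemorrhage", "Hard", True, True),
    (10, "Cholecystitis", "Easy", False, True),
]


def tally(cases):
    """Return {level: [single_correct, medpanel_correct, total]}, plus an 'Overall' row."""
    counts = defaultdict(lambda: [0, 0, 0])
    for _, _, difficulty, single_ok, panel_ok in cases:
        for key in (difficulty, "Overall"):
            counts[key][0] += single_ok   # True counts as 1, False as 0
            counts[key][1] += panel_ok
            counts[key][2] += 1
    return dict(counts)


if __name__ == "__main__":
    for level, (single, panel, total) in tally(CASES).items():
        print(f"{level}: single {single}/{total}, MedPanel {panel}/{total}")
```

Keeping the per-case rows as the single source of truth means the overall and per-difficulty figures cannot drift apart when a case is added or re-scored.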

Devil's Advocate Impact

The Devil's Advocate agent was the deciding factor in 3 out of 10 cases (30%).

| Case | Single Agent Said | Devil's Advocate Flagged |
|---|---|---|
| Ectopic Pregnancy | Appendicitis | Ectopic pregnancy: tubal rupture risk |
| Spinal Cord Compression | Stenosis | Cord compression: urgent surgical review |
| Cholecystitis | Colic | Cholecystitis: risk of perforation |

Breakdown by Difficulty

Easy Cases (3 total)

| Metric | Single | MedPanel |
|---|---|---|
| Accuracy | 67% | 100% |
| Missed | Cholecystitis | - |

Medium Cases (3 total)

| Metric | Single | MedPanel |
|---|---|---|
| Accuracy | 67% | 100% |
| Missed | Spinal Cord Compression | - |

Hard Cases (4 total)

| Metric | Single | MedPanel |
|---|---|---|
| Accuracy | 75% | 100% |
| Missed | Ectopic Pregnancy | - |

Even "easy" cases like Cholecystitis were missed: the problem isn't just complexity. Confident wrong answers happen at every level.


Methodology

Evaluation:

  • Each case run through both systems independently
  • Ground truth checked against predicted output
  • Synonym matching used for equivalent terms (e.g. "MI" = "Myocardial Infarction", "TB" = "Tuberculosis")
  • Case marked correct only if ground truth appeared in primary output
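
The scoring rule described in the list above could look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark's actual code: the `SYNONYMS` table contains only the two pairs named in the text, whole-word matching is an assumption (the source does not say whether matching was substring or word based), and `normalize` and `is_correct` are hypothetical helper names.

```python
import re

# Hypothetical synonym table; only the MI and TB pairs come from the text.
SYNONYMS = {
    "myocardial infarction": ["mi"],
    "tuberculosis": ["tb"],
}


def normalize(text):
    """Lowercase the text and collapse any run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_correct(ground_truth, primary_output):
    """True if the ground truth, or a listed synonym, appears as whole words in the output."""
    truth = normalize(ground_truth)
    candidates = [truth] + SYNONYMS.get(truth, [])
    # Pad with spaces so short synonyms like "mi" cannot match inside "migraine".
    padded = f" {normalize(primary_output)} "
    return any(f" {normalize(c)} " in padded for c in candidates)


if __name__ == "__main__":
    print(is_correct("Myocardial Infarction", "Primary diagnosis: acute MI."))
    print(is_correct("Cholecystitis", "Biliary colic"))
```

Whole-word matching is the conservative choice here: with plain substring matching, a two-letter abbreviation would be credited whenever those letters happen to occur inside an unrelated word.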

Single agent prompt:

You are a medical AI assistant. Analyze this clinical case
and provide your primary diagnosis.
Clinical case: {notes}
Provide a clear, concise primary diagnosis.

MedPanel pipeline:

  1. 🩻 Radiologist Agent: image analysis
  2. 🩺 Internist Agent: clinical notes analysis
  3. 📚 Evidence Reviewer: live PubMed RAG (PubMedBERT + FAISS)
  4. 😈 Devil's Advocate: adversarial challenge
  5. 🎯 Orchestrator: final synthesis

  • Model: google/medgemma-4b-it (all agents)
  • Embeddings: pritamdeka/S-PubMedBert-MS-MARCO
  • Infrastructure: HuggingFace Spaces, T4 GPU
  • Average latency: ~50 seconds per full MedPanel run


Key Takeaway

Single MedGemma gave the wrong primary diagnosis in 3 of the 10 cases (30%), with one miss at each difficulty level. MedPanel got all 10 correct by forcing agents to argue before committing to a diagnosis. The Devil's Advocate alone accounts for all 3 dangerous misses caught.


Benchmark conducted as part of the Google MedGemma Impact Challenge 2025.

⚠️ Research only. Not for clinical use without validation and regulatory approval.