MedPanel Benchmark Results

Single MedGemma vs MedPanel Multi-Agent System

Overview

This benchmark compares two approaches on the same 10 clinical test cases:

Baseline: Single MedGemma-4B call with a standard diagnostic prompt
MedPanel: Full 5-agent pipeline (Radiologist + Internist + Evidence Reviewer + Devil's Advocate + Orchestrator)

All cases were selected to represent real-world diagnostic difficulty: presentations where errors actually happen in clinical practice.

Overall Results

Metric	Single MedGemma	MedPanel
Overall Accuracy	70% (7/10)	100% (10/10)
Easy Cases	67% (2/3)	100% (3/3)
Medium Cases	67% (2/3)	100% (3/3)
Hard Cases	75% (3/4)	100% (4/4)
Dangerous Misses Caught	0	3 out of 10
Improvement	—	+30 percentage points

Case-by-Case Results

#	Condition	Difficulty	Single MedGemma	MedPanel	Devil's Advocate Helped
1	Tuberculosis	Hard	✅	✅	❌
2	Meningitis	Hard	✅	✅	❌
3	Myocardial Infarction	Easy	✅	✅	❌
4	Interstitial Lung Disease	Medium	✅	✅	❌
5	Hypoglycemia	Easy	✅	✅	❌
6	Ectopic Pregnancy	Hard	❌ said appendicitis	✅	✅
7	Spinal Cord Compression	Medium	❌ said stenosis	✅	✅
8	Malaria	Medium	✅	✅	❌
9	Subarachnoid Hemorrhage	Hard	✅	✅	❌
10	Cholecystitis	Easy	❌ said colic	✅	✅
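The summary figures can be recomputed from the per-case table. The following is a minimal sketch, not the benchmark's own scoring code: the `CASES` list is transcribed by hand from the table above, and the function name `accuracy_by_difficulty` is hypothetical.

```python
from collections import defaultdict

# (condition, difficulty, single_correct, medpanel_correct),
# transcribed from the case-by-case table.
CASES = [
    ("Tuberculosis", "Hard", True, True),
    ("Meningitis", "Hard", True, True),
    ("Myocardial Infarction", "Easy", True, True),
    ("Interstitial Lung Disease", "Medium", True, True),
    ("Hypoglycemia", "Easy", True, True),
    ("Ectopic Pregnancy", "Hard", False, True),
    ("Spinal Cord Compression", "Medium", False, True),
    ("Malaria", "Medium", True, True),
    ("Subarachnoid Hemorrhage", "Hard", True, True),
    ("Cholecystitis", "Easy", False, True),
]


def accuracy_by_difficulty(cases):
    """Return {difficulty: (single_correct, medpanel_correct, total)}."""
    counts = defaultdict(lambda: [0, 0, 0])
    for _, difficulty, single_ok, panel_ok in cases:
        counts[difficulty][0] += int(single_ok)
        counts[difficulty][1] += int(panel_ok)
        counts[difficulty][2] += 1
    return {tier: tuple(c) for tier, c in counts.items()}


summary = accuracy_by_difficulty(CASES)
for tier in ("Easy", "Medium", "Hard"):
    single, panel, total = summary[tier]
    print(f"{tier}: single {single}/{total} ({single / total:.0%}), "
          f"MedPanel {panel}/{total} ({panel / total:.0%})")
```

Keeping the per-case outcomes as the single source of truth means the overall and per-tier percentages cannot drift apart when a case is added or rescored.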

Devil's Advocate Impact

The Devil's Advocate agent was the deciding factor in 3 out of 10 cases (30%).

Case	Single Agent Said	Devil's Advocate Flagged
Ectopic Pregnancy	Appendicitis	Ectopic pregnancy — tubal rupture risk
Spinal Cord Compression	Stenosis	Cord compression — urgent surgical review
Cholecystitis	Colic	Cholecystitis — risk of perforation

Breakdown by Difficulty

Easy Cases (3 total)

Metric	Single	MedPanel
Accuracy	67%	100%
Missed	Cholecystitis	—

Medium Cases (3 total)

Metric	Single	MedPanel
Accuracy	67%	100%
Missed	Spinal Cord Compression	—

Hard Cases (4 total)

Metric	Single	MedPanel
Accuracy	75%	100%
Missed	Ectopic Pregnancy	—

Even an "easy" case, Cholecystitis, was missed, so the problem isn't just complexity. Confident wrong answers happen at every level.

Methodology

Evaluation:

Each case run through both systems independently
Ground truth checked against predicted output
Synonym matching used for equivalent terms (e.g. "MI" = "Myocardial Infarction", "TB" = "Tuberculosis")
Case marked correct only if ground truth appeared in primary output
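The matching rule above could be implemented roughly as follows. This is a sketch under assumptions, not the benchmark's actual scorer: the `SYNONYMS` table holds only the two examples given in the text, whole-word matching is one possible reading of "appeared in primary output", and the names `normalize` and `is_correct` are hypothetical.

```python
import re

# Only the MI and TB entries come from the methodology text;
# a real synonym table would presumably be larger.
SYNONYMS = {
    "myocardial infarction": ["mi"],
    "tuberculosis": ["tb"],
}


def normalize(text):
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_correct(ground_truth, primary_output, synonyms=SYNONYMS):
    """True if the ground truth or a listed synonym appears as whole words."""
    haystack = f" {normalize(primary_output)} "
    terms = [ground_truth] + synonyms.get(normalize(ground_truth), [])
    return any(f" {normalize(term)} " in haystack for term in terms)


print(is_correct("Myocardial Infarction", "Primary diagnosis: acute MI."))
print(is_correct("Cholecystitis", "Primary diagnosis: biliary colic"))
```

Whole-word matching avoids counting "MI" inside a word such as "mitral", but it would still accept an output that mentions the term only to rule it out, so outputs scored this way may need a manual check.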

Single agent prompt:

You are a medical AI assistant. Analyze this clinical case
and provide your primary diagnosis.
Clinical case: {notes}
Provide a clear, concise primary diagnosis.
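For illustration, the template above can be filled in with a small helper. This is a sketch only: `build_single_agent_prompt` is a hypothetical name, the example notes are invented for the demonstration and are not one of the 10 benchmark cases, and the call that sends the prompt to the model is omitted.

```python
# Template text copied from the single-agent prompt above.
SINGLE_AGENT_PROMPT = (
    "You are a medical AI assistant. Analyze this clinical case\n"
    "and provide your primary diagnosis.\n"
    "Clinical case: {notes}\n"
    "Provide a clear, concise primary diagnosis."
)


def build_single_agent_prompt(notes: str) -> str:
    """Insert the clinical notes into the baseline prompt template."""
    return SINGLE_AGENT_PROMPT.format(notes=notes.strip())


# Invented example notes, used only to show the filled-in prompt.
example_notes = "  58-year-old with crushing chest pain radiating to the left arm.  "
prompt = build_single_agent_prompt(example_notes)
print(prompt)
```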

MedPanel pipeline:

🩻 Radiologist Agent — image analysis
🩺 Internist Agent — clinical notes analysis
📚 Evidence Reviewer — live PubMed RAG (PubMedBERT + FAISS)
😈 Devil's Advocate — adversarial challenge
🎯 Orchestrator — final synthesis
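One way to picture the pipeline is as a sequence of agents that each read the findings recorded before them. The sketch below assumes the agents run strictly in the listed order and share a single state object; both are assumptions, and `CaseState`, `run_panel`, and the stub agents are hypothetical stand-ins for what would be model calls in the real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CaseState:
    """Shared state passed through the panel (a hypothetical structure)."""
    notes: str
    image_path: Optional[str] = None
    findings: Dict[str, str] = field(default_factory=dict)


Agent = Callable[[CaseState], str]


def run_panel(state: CaseState, agents: List[Tuple[str, Agent]]) -> CaseState:
    """Run agents in order; each one can read the findings recorded before it."""
    for name, agent in agents:
        state.findings[name] = agent(state)
    return state


# Stub agents: placeholders that return fixed strings instead of calling a model.
def radiologist(state):
    return "no image supplied" if state.image_path is None else f"read {state.image_path}"


def internist(state):
    return f"working diagnosis from notes ({len(state.notes)} chars)"


def evidence_reviewer(state):
    return "retrieved literature summary"


def devils_advocate(state):
    return f"challenge: what else fits besides '{state.findings['internist']}'?"


def orchestrator(state):
    return f"final synthesis over {len(state.findings)} agent reports"


PANEL = [
    ("radiologist", radiologist),
    ("internist", internist),
    ("evidence_reviewer", evidence_reviewer),
    ("devils_advocate", devils_advocate),
    ("orchestrator", orchestrator),
]

result = run_panel(CaseState(notes="example clinical notes"), PANEL)
for name, report in result.findings.items():
    print(f"{name}: {report}")
```

Because the Devil's Advocate runs after the Internist and before the Orchestrator in this ordering, its challenge is available when the final synthesis is produced.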

Model: google/medgemma-4b-it (all agents)
Embeddings: pritamdeka/S-PubMedBert-MS-MARCO
Infrastructure: HuggingFace Spaces, T4 GPU
Average latency: ~50 seconds per full MedPanel run
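The Evidence Reviewer's retrieval step amounts to a nearest-neighbour search over embedding vectors. The sketch below shows that idea in plain Python with cosine similarity. It is a stand-in, not the project's code: it does not use FAISS or the PubMedBERT encoder, the three-dimensional vectors are toy values rather than real embeddings, and `cosine` and `top_k` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return [(doc_index, score)] for the k most similar documents."""
    scored = sorted(
        ((cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)),
        reverse=True,
    )
    return [(i, score) for score, i in scored[:k]]


# Toy vectors standing in for abstract embeddings.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
query_vec = [0.2, 1.0, 0.0]

hits = top_k(query_vec, doc_vecs, k=2)
print(hits)
```

A vector index such as FAISS serves the same purpose at scale, returning the nearest stored vectors for a query without a Python-level loop over every document.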

Key Takeaway

Single MedGemma gave the wrong primary diagnosis in 3 of the 10 cases (30%), with one miss in each difficulty tier. MedPanel got all 10 right by forcing agents to argue before committing to a diagnosis. The Devil's Advocate alone accounts for all 3 dangerous misses caught.

Benchmark conducted as part of the Google MedGemma Impact Challenge 2025.

⚠️ Research only. Not for clinical use without validation and regulatory approval.