---
title: Diagnostic Devil's Advocate
emoji: 🩺
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 6.4.0
app_file: app.py
pinned: false
license: cc-by-4.0
tags:
  - medgemma
  - medical-imaging
  - multi-agent
  - cognitive-bias
  - radiology
---
## Why This Exists
Diagnostic errors affect an estimated 12 million adults annually in the U.S., with cognitive biases implicated in up to 74% of cases. (Singh et al., 2014)
Doctors are not wrong because they lack knowledge -- they are wrong because the human brain takes shortcuts. A physician who sees "young patient + chest pain after trauma" anchors on rib contusion and stops looking. The pneumothorax goes unseen. The patient deteriorates.
Diagnostic Devil's Advocate acts as an adversarial second opinion. It does not replace the physician -- it challenges them: "Have you considered what happens if you're wrong?"
## Pipeline
Four agents, each with a distinct adversarial role, orchestrated by LangGraph as a linear StateGraph:
Figure 1: Multi-agent pipeline for cognitive debiasing in medical image interpretation.
Diagram generated with Nano Banana Pro
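The linear flow above can be sketched as plain state-passing function composition. This is a conceptual sketch, not the repo's actual `agents/graph.py`: the agent bodies here are illustrative stubs, and the real pipeline wires the same four nodes through a LangGraph `StateGraph` with edges diagnostician → bias detector → devil's advocate → consultant.

```python
# Conceptual sketch of the four-agent linear pipeline: each agent reads the
# shared state dict and adds its own result key. Stub logic is illustrative.

def diagnostician(state):
    # Blinded: sees image + clinical context, but NOT the doctor's diagnosis.
    return {**state, "ai_diagnosis": f"analysis of {state['image']}"}

def bias_detector(state):
    # Compares the doctor's diagnosis against the blinded AI read.
    diverges = state["doctor_diagnosis"] != state["ai_diagnosis"]
    return {**state, "bias_flagged": diverges}

def devil_advocate(state):
    # Adversarial challenge: argues the case for the working diagnosis being wrong.
    return {**state, "challenge": "What if the working diagnosis is wrong?"}

def consultant(state):
    # Synthesizes the prior steps into a collegial consultation note.
    return {**state, "report": f"bias={state['bias_flagged']}: {state['challenge']}"}

PIPELINE = [diagnostician, bias_detector, devil_advocate, consultant]

def run(state):
    for agent in PIPELINE:
        state = agent(state)
    return state
```

Keeping the graph strictly linear means each agent sees exactly the state the previous agents produced, which is what makes the first agent's blinding meaningful.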
## Key Design Choices
- Blinded first agent -- the Diagnostician never sees the doctor's diagnosis, preventing the AI from anchoring on the same conclusion
- Dual-source analysis -- every agent considers both the medical image and clinical context (vitals, labs, risk factors), because many dangerous conditions have subtle imaging but obvious clinical red flags
- MedSigLIP verification -- zero-shot image classification grounds the bias analysis in visual evidence, not just language reasoning
- MedASR voice input -- MedASR enables hands-free clinical context entry via speech-to-text, designed for busy clinical workflows where typing is impractical
- Prompt repetition -- implements the prompt repetition technique from Google Research to improve output quality and consistency in non-reasoning LLMs
- Collegial tone -- the Consultant writes as a consulting colleague ("Have you considered..."), not a critic
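The prompt-repetition technique amounts to stating the full task twice so a non-reasoning model re-reads it before answering. A minimal sketch of the idea — the connective wording here is illustrative, not the repo's actual template:

```python
def with_repetition(prompt: str, enabled: bool = True) -> str:
    """Repeat the full prompt so a non-reasoning LLM re-attends to the task.
    Mirrors the behavior gated by ENABLE_PROMPT_REPETITION; the exact
    template text is an assumption."""
    if not enabled:
        return prompt
    return f"{prompt}\n\nTo repeat the task:\n{prompt}"
```

Because the repetition is applied at the template level, it can be toggled per-agent without touching any agent logic.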
## Model Stack
| Model | Params | Role | VRAM |
|---|---|---|---|
| MedGemma 1.5 4B-IT | 4B | Multimodal image + text analysis | ~4 GB (4-bit) |
| MedGemma 27B Text-IT | 27B | Consultant deep reasoning (optional) | ~54 GB |
| MedSigLIP-448 | 0.9B | Zero-shot sign verification | ~3 GB |
| MedASR | 105M | Medical speech-to-text | ~0.5 GB |
The full pipeline requires ~8 GB VRAM and runs on any 12 GB+ CUDA GPU. All models load locally via Transformers with 4-bit quantization -- zero API costs, fully offline-capable.
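The VRAM figures follow from a back-of-envelope estimate: parameter count × bits per weight gives the weight footprint, and activations, KV cache, and quantization overhead add the rest. The helper below is just that arithmetic, not code from the repo:

```python
def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits / 8 / 1e9

# 4B model at 4-bit: ~2 GB of weights; the table's ~4 GB adds activations,
# KV cache, and quantization overhead.
# 27B model at bf16 (16 bits/weight): ~54 GB of weights, matching the table.
```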
## Getting Started
```bash
# Clone
git clone https://github.com/sypsyp97/diagnostic-devils-advocate
cd diagnostic-devils-advocate

# Install
pip install -r requirements.txt

# Login to Hugging Face (gated models)
huggingface-cli login

# Run
python app.py                                  # 4B quantized (default)
USE_27B=true QUANTIZE_4B=false python app.py   # with 27B Consultant
ENABLE_MEDASR=false python app.py              # without voice input
```
The app launches at http://localhost:7860.
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `USE_27B` | `false` | Enable 27B model for the Consultant agent |
| `QUANTIZE_4B` | `true` | 4-bit quantize the 4B model |
| `ENABLE_MEDASR` | `true` | Enable voice input via MedASR |
| `HF_TOKEN` | -- | Hugging Face token (or use `huggingface-cli login`) |
| `ENABLE_PROMPT_REPETITION` | `true` | Prompt repetition for improved output quality |
| `MODEL_LOCAL_DIR` | -- | Local directory for pre-downloaded models |
| `DEVICE` | `cuda` | Compute device |
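These flags arrive as plain strings in the environment, so a config layer has to coerce them to booleans. A sketch of how that coercion might look — the `env_flag` helper name is hypothetical, not necessarily what `config.py` uses:

```python
import os

def env_flag(name: str, default: bool) -> bool:
    """Read a boolean-like environment variable; accepts 1/true/yes, case-insensitive."""
    return os.environ.get(name, str(default)).strip().lower() in {"1", "true", "yes"}

# Defaults mirror the table above.
USE_27B = env_flag("USE_27B", False)
QUANTIZE_4B = env_flag("QUANTIZE_4B", True)
ENABLE_MEDASR = env_flag("ENABLE_MEDASR", True)
```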
## Project Structure
```
diagnostic-devils-advocate/
├── app.py                  # Gradio entry point
├── config.py               # Model & environment config
├── requirements.txt
├── agents/
│   ├── prompts.py          # All agent prompt templates
│   ├── graph.py            # LangGraph StateGraph pipeline
│   ├── output_parser.py    # JSON parsing (json_repair + llm-output-parser)
│   ├── diagnostician.py    # Agent 1: Blinded analysis
│   ├── bias_detector.py    # Agent 2: Bias detection + MedSigLIP
│   ├── devil_advocate.py   # Agent 3: Adversarial challenge
│   └── consultant.py       # Agent 4: Consultation synthesis
├── models/
│   ├── medgemma_client.py  # MedGemma 4B/27B inference
│   ├── medsiglip_client.py # MedSigLIP zero-shot classification
│   ├── medasr_client.py    # MedASR speech-to-text
│   └── utils.py            # Image preprocessing, token stripping
├── ui/
│   ├── components.py       # Gradio layout
│   ├── callbacks.py        # UI event handlers
│   └── css.py              # Custom styling
└── data/
    └── demo_cases/         # Composite clinical scenarios
```
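The output parser's job is salvaging structured JSON from raw LLM text. A stdlib-only sketch of that fallback logic — the real `agents/output_parser.py` relies on json_repair and llm-output-parser instead of this hand-rolled version:

```python
import json
import re

def parse_llm_json(raw: str):
    """Best-effort JSON extraction from an LLM reply: try a direct parse,
    then strip markdown code fences, then grab the outermost {...} span.
    Returns None if no candidate parses."""
    for candidate in (
        raw,
        re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip()),
        raw[raw.find("{"): raw.rfind("}") + 1],
    ):
        try:
            return json.loads(candidate)
        except (json.JSONDecodeError, ValueError):
            continue
    return None
```

This matters because instruction-tuned models often wrap JSON in code fences or add conversational framing, and a single failed `json.loads` would otherwise break the pipeline mid-run.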
## Disclaimer
This is a research prototype built for the MedGemma Impact Challenge. It is NOT intended for clinical decision-making. All demo cases are educational composites. Medical images are sourced from the University of Saskatchewan Teaching Collection (CC-BY-NC-SA 4.0).
## References
### Diagnostic Error & Cognitive Bias
- Singh H, Meyer AND, Thomas EJ. "The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations." BMJ Quality & Safety, 2014;23(9):727--731. doi:10.1136/bmjqs-2013-002627
- Croskerry P. "The importance of cognitive errors in diagnosis and strategies to minimize them." Academic Medicine, 2003;78(8):775--780. doi:10.1097/00001888-200308000-00003
- Vally ZI, Khammissa RAG, Feller G, et al. "Errors in clinical diagnosis: a narrative review." Journal of International Medical Research, 2023;51(8):03000605231162798. doi:10.1177/03000605231162798
- Staal J, Hooftman J, Gunput STG, et al. "Effect on diagnostic accuracy of cognitive reasoning tools for the workplace setting: systematic review and meta-analysis." BMJ Quality & Safety, 2022;31(12):899--910. doi:10.1136/bmjqs-2022-014865
### AI-Assisted Debiasing & Multi-Agent Systems
- Brown C, Nazeer R, Gibbs A, et al. "Breaking Bias: The Role of Artificial Intelligence in Improving Clinical Decision-Making." Cureus, 2023;15(3):e36415. doi:10.7759/cureus.36415
- Tang X, Zou A, Zhang Z, et al. "MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning." Findings of ACL, 2024:599--621. arXiv:2311.10537
- Kim Y, Park C, Jeong H, et al. "MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making." NeurIPS, 2024. arXiv:2404.15155
- Chen X, Yi H, You M, et al. "Enhancing diagnostic capability with multi-agents conversational large language models." npj Digital Medicine, 2025;8:159. doi:10.1038/s41746-025-01550-0
### Medical Vision-Language Models & Prompt Engineering
- Jang J, Kyung D, Kim SH, et al. "Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders." Scientific Reports, 2024;14:23199. doi:10.1038/s41598-024-73695-z
- Leviathan Y, Kalman M, Matias Y. "Prompt Repetition Improves Non-Reasoning LLMs." arXiv:2512.14982, Google Research, 2025.
- Zaghir J, Naguib M, Bjelogrlic M, et al. "Prompt Engineering Paradigms for Medical Applications: Scoping Review." Journal of Medical Internet Research, 2024;26:e60501. doi:10.2196/60501
- Sellergren A, Kazemzadeh S, Jaroensri T, et al. "MedGemma Technical Report." arXiv:2507.05201, Google, 2025.
Built with Google Health AI Developer Foundations for the MedGemma Impact Challenge