---
title: Diagnostic Devil's Advocate
emoji: "🩺"
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: "6.4.0"
app_file: app.py
pinned: false
license: cc-by-4.0
tags:
  - medgemma
  - medical-imaging
  - multi-agent
  - cognitive-bias
  - radiology
---

<div align="center">

# Diagnostic Devil's Advocate

**AI-Powered Cognitive Debiasing for Medical Image Interpretation**

[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-FFD21E?style=for-the-badge)](https://huggingface.co/spaces/yipengsun/diagnostic-devils-advocate)
[![MedGemma](https://img.shields.io/badge/MedGemma_1.5-4285F4?style=for-the-badge&logo=google&logoColor=white)](https://huggingface.co/google/medgemma-1.5-4b-it)
[![MedSigLIP](https://img.shields.io/badge/MedSigLIP-34A853?style=for-the-badge&logo=google&logoColor=white)](https://huggingface.co/google/medsiglip-448)
[![LangGraph](https://img.shields.io/badge/LangGraph-1C3C3C?style=for-the-badge)](https://langchain-ai.github.io/langgraph/)
[![Gradio](https://img.shields.io/badge/Gradio-F97316?style=for-the-badge)](https://gradio.app)
[![License](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey?style=for-the-badge)](LICENSE)

</div>

---

## Why This Exists

> Diagnostic errors affect an estimated **12 million** adults annually in the U.S., with cognitive biases implicated in up to **74%** of cases. ([Singh et al., 2014](https://pubmed.ncbi.nlm.nih.gov/24742777/))

Doctors are not wrong because they lack knowledge -- they are wrong because the human brain takes shortcuts. A physician who sees "young patient + chest pain after trauma" anchors on **rib contusion** and stops looking. The pneumothorax goes unseen. The patient deteriorates.

**Diagnostic Devil's Advocate** acts as an adversarial second opinion. It does not replace the physician -- it challenges them: *"Have you considered what happens if you're wrong?"*

---

## Pipeline

Four agents, each with a distinct adversarial role, orchestrated by [LangGraph](https://langchain-ai.github.io/langgraph/) as a linear `StateGraph`:

<div align="center">
<img src="assets/workflow.jpg" alt="Workflow Diagram" width="100%">
<br>
<em>Figure 1: Multi-agent pipeline for cognitive debiasing in medical image interpretation.</em>
<br>
<sub>Diagram generated with <a href="https://gemini.google/overview/image-generation/">Nano Banana Pro</a></sub>
</div>
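
For orientation, here is a minimal sketch of how a linear four-node `StateGraph` can be wired in LangGraph. The state schema and node bodies are illustrative placeholders, not the actual implementations in `agents/graph.py`:

```python
# Minimal sketch of a linear four-agent pipeline in LangGraph.
# The state fields and node functions are illustrative stand-ins.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class PipelineState(TypedDict):
    image_findings: str   # blinded read of the image
    bias_report: str      # suspected cognitive biases
    challenge: str        # adversarial "what if you're wrong?"
    consultation: str     # final collegial synthesis

def diagnostician(state: PipelineState) -> dict:
    return {"image_findings": "..."}  # never sees the doctor's diagnosis

def bias_detector(state: PipelineState) -> dict:
    return {"bias_report": "..."}     # compares findings vs. stated diagnosis

def devils_advocate(state: PipelineState) -> dict:
    return {"challenge": "..."}       # argues for the alternatives

def consultant(state: PipelineState) -> dict:
    return {"consultation": "..."}    # writes the collegial second opinion

graph = StateGraph(PipelineState)
graph.add_node("diagnostician", diagnostician)
graph.add_node("bias_detector", bias_detector)
graph.add_node("devils_advocate", devils_advocate)
graph.add_node("consultant", consultant)
graph.add_edge(START, "diagnostician")
graph.add_edge("diagnostician", "bias_detector")
graph.add_edge("bias_detector", "devils_advocate")
graph.add_edge("devils_advocate", "consultant")
graph.add_edge("consultant", END)
app = graph.compile()
```

`graph.compile()` returns a runnable whose `.invoke()` executes the four nodes in order; each node returns a partial state update that LangGraph merges into the shared state.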

### Key Design Choices

- **Blinded first agent** -- the Diagnostician never sees the doctor's diagnosis, preventing the AI from anchoring on the same conclusion
- **Dual-source analysis** -- every agent considers both the medical image and clinical context (vitals, labs, risk factors), because many dangerous conditions have subtle imaging but obvious clinical red flags
- **MedSigLIP verification** -- zero-shot image classification grounds the bias analysis in visual evidence, not just language reasoning (see the sketch after this list)
- **MedASR voice input** -- [MedASR](https://huggingface.co/google/medasr) enables hands-free clinical context entry via speech-to-text, designed for busy clinical workflows where typing is impractical
- **Prompt repetition** -- implements the [prompt repetition technique](https://arxiv.org/abs/2512.14982) from Google Research to improve output quality and consistency in non-reasoning LLMs
- **Collegial tone** -- the Consultant writes as a consulting colleague (*"Have you considered..."*), not a critic
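
To make the MedSigLIP verification step concrete, here is a minimal zero-shot classification sketch using the standard Transformers pipeline. The candidate labels are examples, not the prompts `agents/bias_detector.py` actually uses:

```python
# Illustrative zero-shot sign verification with MedSigLIP.
from transformers import pipeline
from PIL import Image

classifier = pipeline(
    "zero-shot-image-classification",
    model="google/medsiglip-448",
)

image = Image.open("chest_xray.png")
candidate_labels = [
    "chest x-ray with pneumothorax",
    "chest x-ray with rib fracture",
    "normal chest x-ray",
]
for result in classifier(image, candidate_labels=candidate_labels):
    print(f"{result['label']}: {result['score']:.3f}")
```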

---

## Model Stack

| Model | Params | Role | VRAM |
|:------|:------:|:-----|:----:|
| [MedGemma 1.5 4B-IT](https://huggingface.co/google/medgemma-1.5-4b-it) | 4B | Multimodal image + text analysis | ~4 GB (4-bit) |
| [MedGemma 27B Text-IT](https://huggingface.co/google/medgemma-27b-text-it) | 27B | Consultant deep reasoning (optional) | ~54 GB |
| [MedSigLIP-448](https://huggingface.co/google/medsiglip-448) | 0.9B | Zero-shot sign verification | ~3 GB |
| [MedASR](https://huggingface.co/google/medasr) | 105M | Medical speech-to-text | ~0.5 GB |

The full pipeline requires **~8 GB VRAM** and runs on any 12 GB+ CUDA GPU. All models load locally via [Transformers](https://huggingface.co/docs/transformers) with [4-bit quantization](https://huggingface.co/docs/bitsandbytes) -- **zero API costs, fully offline-capable**.
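
As a point of reference, a minimal sketch of a 4-bit load of the 4B model through Transformers and bitsandbytes; the repo's `models/medgemma_client.py` may differ in details, and the model is gated, so a logged-in Hugging Face token is required:

```python
# Illustrative 4-bit load of MedGemma 1.5 4B-IT.
import torch
from transformers import (
    AutoModelForImageTextToText,
    AutoProcessor,
    BitsAndBytesConfig,
)

model_id = "google/medgemma-1.5-4b-it"  # gated: requires huggingface-cli login

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```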

---

## Getting Started

```bash
# Clone
git clone https://github.com/sypsyp97/diagnostic-devils-advocate
cd diagnostic-devils-advocate

# Install
pip install -r requirements.txt

# Login to Hugging Face (gated models)
huggingface-cli login

# Run
python app.py                                    # 4B quantized (default)
USE_27B=true QUANTIZE_4B=false python app.py     # with 27B Consultant
ENABLE_MEDASR=false python app.py                # without voice input
```

The app launches at `http://localhost:7860`.

<details>
<summary><b>Environment Variables</b></summary>

| Variable | Default | Description |
|:---------|:--------|:------------|
| `USE_27B` | `false` | Enable 27B model for the Consultant agent |
| `QUANTIZE_4B` | `true` | 4-bit quantize the 4B model |
| `ENABLE_MEDASR` | `true` | Enable voice input via MedASR |
| `HF_TOKEN` | -- | Hugging Face token (or use `huggingface-cli login`) |
| `ENABLE_PROMPT_REPETITION` | `true` | [Prompt repetition](https://arxiv.org/abs/2512.14982) for improved output quality |
| `MODEL_LOCAL_DIR` | -- | Local directory for pre-downloaded models |
| `DEVICE` | `cuda` | Compute device |

</details>
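
The `ENABLE_PROMPT_REPETITION` toggle is simple to picture: per the cited paper, the technique presents the same prompt twice. A minimal sketch, assuming a plain string template; the transition sentence is an assumption, not the repo's exact wrapper in `agents/prompts.py`:

```python
# Minimal sketch of prompt repetition: present the prompt twice.
# The transition sentence below is an assumption, not the repo's template.
def apply_prompt_repetition(prompt: str, enabled: bool = True) -> str:
    if not enabled:
        return prompt
    return f"{prompt}\n\nTo be clear, I will repeat the request:\n\n{prompt}"
```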

<details>
<summary><b>Project Structure</b></summary>

```
diagnostic-devils-advocate/
├── app.py                     # Gradio entry point
├── config.py                  # Model & environment config
├── requirements.txt
├── agents/
│   ├── prompts.py             # All agent prompt templates
│   ├── graph.py               # LangGraph StateGraph pipeline
│   ├── output_parser.py       # JSON parsing (json_repair + llm-output-parser)
│   ├── diagnostician.py       # Agent 1: Blinded analysis
│   ├── bias_detector.py       # Agent 2: Bias detection + MedSigLIP
│   ├── devil_advocate.py      # Agent 3: Adversarial challenge
│   └── consultant.py          # Agent 4: Consultation synthesis
├── models/
│   ├── medgemma_client.py     # MedGemma 4B/27B inference
│   ├── medsiglip_client.py    # MedSigLIP zero-shot classification
│   ├── medasr_client.py       # MedASR speech-to-text
│   └── utils.py               # Image preprocessing, token stripping
├── ui/
│   ├── components.py          # Gradio layout
│   ├── callbacks.py           # UI event handlers
│   └── css.py                 # Custom styling
└── data/
    └── demo_cases/            # Composite clinical scenarios
```

</details>

---

## Disclaimer

> **This is a research prototype built for the [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge). It is NOT intended for clinical decision-making.** All demo cases are educational composites. Medical images are sourced from the University of Saskatchewan Teaching Collection (CC-BY-NC-SA 4.0).

---

## References

<details>
<summary><b>Diagnostic Error & Cognitive Bias</b></summary>

- Singh H, Meyer AND, Thomas EJ. "The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations." *BMJ Quality & Safety*, 2014;23(9):727--731. [doi:10.1136/bmjqs-2013-002627](https://pubmed.ncbi.nlm.nih.gov/24742777/)
- Croskerry P. "The importance of cognitive errors in diagnosis and strategies to minimize them." *Academic Medicine*, 2003;78(8):775--780. [doi:10.1097/00001888-200308000-00003](https://pubmed.ncbi.nlm.nih.gov/12915363/)
- Vally ZI, Khammissa RAG, Feller G, et al. "Errors in clinical diagnosis: a narrative review." *Journal of International Medical Research*, 2023;51(8):03000605231162798. [doi:10.1177/03000605231162798](https://pubmed.ncbi.nlm.nih.gov/37602466/)
- Staal J, Hooftman J, Gunput STG, et al. "Effect on diagnostic accuracy of cognitive reasoning tools for the workplace setting: systematic review and meta-analysis." *BMJ Quality & Safety*, 2022;31(12):899--910. [doi:10.1136/bmjqs-2022-014865](https://pubmed.ncbi.nlm.nih.gov/36396150/)

</details>

<details>
<summary><b>AI-Assisted Debiasing & Multi-Agent Systems</b></summary>

- Brown C, Nazeer R, Gibbs A, et al. "Breaking Bias: The Role of Artificial Intelligence in Improving Clinical Decision-Making." *Cureus*, 2023;15(3):e36415. [doi:10.7759/cureus.36415](https://pubmed.ncbi.nlm.nih.gov/37090406/)
- Tang X, Zou A, Zhang Z, et al. "MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning." *Findings of ACL*, 2024:599--621. [arXiv:2311.10537](https://arxiv.org/abs/2311.10537)
- Kim Y, Park C, Jeong H, et al. "MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making." *NeurIPS*, 2024. [arXiv:2404.15155](https://arxiv.org/abs/2404.15155)
- Chen X, Yi H, You M, et al. "Enhancing diagnostic capability with multi-agents conversational large language models." *npj Digital Medicine*, 2025;8:159. [doi:10.1038/s41746-025-01550-0](https://pubmed.ncbi.nlm.nih.gov/40082662/)

</details>

<details>
<summary><b>Medical Vision-Language Models & Prompt Engineering</b></summary>

- Jang J, Kyung D, Kim SH, et al. "Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders." *Scientific Reports*, 2024;14:23199. [doi:10.1038/s41598-024-73695-z](https://pubmed.ncbi.nlm.nih.gov/39369048/)
- Leviathan Y, Kalman M, Matias Y. "Prompt Repetition Improves Non-Reasoning LLMs." [arXiv:2512.14982](https://arxiv.org/abs/2512.14982), Google Research, 2025.
- Zaghir J, Naguib M, Bjelogrlic M, et al. "Prompt Engineering Paradigms for Medical Applications: Scoping Review." *Journal of Medical Internet Research*, 2024;26:e60501. [doi:10.2196/60501](https://pubmed.ncbi.nlm.nih.gov/39255030/)
- Sellergren A, Kazemzadeh S, Jaroensri T, et al. "MedGemma Technical Report." [arXiv:2507.05201](https://arxiv.org/abs/2507.05201), Google, 2025.

</details>

---

<div align="center">

Built with [Google Health AI Developer Foundations](https://developers.google.com/health-ai-developer-foundations) for the [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge)

</div>