yipengsun committed
Commit 61e710d · 1 Parent(s): 1c7953f

Refactor code structure for improved readability and maintainability

Files changed (1)
  1. README.md +99 -158
README.md CHANGED
@@ -18,228 +18,169 @@ tags:
 
  <div align="center">
 
- # 🩺 Diagnostic Devil's Advocate
 
- ### AI-Powered Cognitive Debiasing for Clinical Diagnosis
 
- **A multi-agent system that challenges medical diagnoses to catch what doctors might miss.**
-
- [![MedGemma](https://img.shields.io/badge/MedGemma-4B%20%7C%2027B-4285F4?style=for-the-badge&logo=google&logoColor=white)](https://huggingface.co/google/medgemma-1.5-4b-it)
- [![MedSigLIP](https://img.shields.io/badge/MedSigLIP-448-34A853?style=for-the-badge&logo=google&logoColor=white)](https://huggingface.co/google/medsiglip-448)
- [![LangGraph](https://img.shields.io/badge/LangGraph-Agent%20Pipeline-1C3C3C?style=for-the-badge&logo=langchain&logoColor=white)](https://langchain-ai.github.io/langgraph/)
- [![Gradio](https://img.shields.io/badge/Gradio-UI-F97316?style=for-the-badge&logo=gradio&logoColor=white)](https://gradio.app)
-
- [Live Demo](#getting-started) &bull; [Architecture](#architecture) &bull; [Demo Cases](#demo-cases) &bull; [Technical Details](#technical-details)
-
- ---
 
  </div>
 
- ## The Problem
-
- > *Diagnostic errors affect an estimated **12 million** adults annually in the U.S. alone, with cognitive biases — [anchoring](https://en.wikipedia.org/wiki/Anchoring_(cognitive_bias)), [premature closure](https://en.wikipedia.org/wiki/Premature_closure), [confirmation bias](https://en.wikipedia.org/wiki/Confirmation_bias) — implicated in up to **74%** of cases.* ([Singh et al., BMJ Quality & Safety, 2014](https://qualitysafety.bmj.com/content/23/9/727))
-
- Doctors are not wrong because they lack knowledge. They are wrong because the human brain takes shortcuts — and in medicine, shortcuts kill. A physician who sees "young patient + chest pain after trauma" anchors on **rib contusion** and stops looking. The pneumothorax on the X-ray goes unseen. The patient deteriorates.
 
- **Diagnostic Devil's Advocate** is a system that acts as an adversarial second opinion. It does not replace the physician — it challenges them. It asks: *"Have you considered what happens if you're wrong?"*
 
- ## How It Works
 
- The system runs a **4-agent pipeline** orchestrated by [LangGraph](https://langchain-ai.github.io/langgraph/) where each agent has a distinct adversarial role. Every agent analyzes **both the medical image and the full clinical context** (history, vitals, labs, exam findings) — because some dangerous conditions (aortic dissection, pulmonary embolism) may show subtle or no imaging signs but have obvious clinical red flags. Critically, the first agent does this **without seeing the doctor's diagnosis**, preventing the AI itself from being [anchored](https://en.wikipedia.org/wiki/Anchoring_(cognitive_bias)).
 
- ### The Four Agents
 
- | Agent | Role | Model | Key Design Choice |
- |:------|:-----|:------|:------------------|
- | **Diagnostician** | Independent image + clinical analysis | [MedGemma 4B-IT](https://huggingface.co/google/medgemma-1.5-4b-it) (multimodal) | **Blinded** — never sees the doctor's diagnosis. Tags each finding as `imaging`, `clinical`, or `both` to distinguish evidence sources. |
- | **Bias Detector** | Compare doctor vs. AI findings | [MedGemma 4B-IT](https://huggingface.co/google/medgemma-1.5-4b-it) + [MedSigLIP](https://huggingface.co/google/medsiglip-448) | Uses **zero-shot image classification** to verify radiological signs. Flags clinical red flags ignored by either assessment. |
- | **Devil's Advocate** | Adversarial challenge | [MedGemma 4B-IT](https://huggingface.co/google/medgemma-1.5-4b-it) | Deliberately contrarian — uses both imaging and clinical evidence to argue for **[must-not-miss diagnoses](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6775443/)** |
- | **Consultant** | Synthesize final report | [MedGemma 4B-IT](https://huggingface.co/google/medgemma-1.5-4b-it) or [27B Text-IT](https://huggingface.co/google/medgemma-27b-text-it) | Writes as a **collegial consultant**: *"Have you considered..."* not *"You are wrong."* Only this agent optionally upgrades to 27B for deeper reasoning. |
 
- ## Architecture
 
- The pipeline is orchestrated by [LangGraph](https://langchain-ai.github.io/langgraph/) as a linear `StateGraph`:
 
  **Gradio UI** (image upload, diagnosis input, clinical context, [MedASR](https://huggingface.co/google/medasr) voice input)
  → **Diagnostician** — receives image + clinical context but **NOT** the doctor's diagnosis; tags findings by source (`imaging` / `clinical` / `both`)
- → **Bias Detector** — now receives the doctor's diagnosis, compares it against independent findings using image, clinical data, and [MedSigLIP](https://huggingface.co/google/medsiglip-448) sign verification
  → **Devil's Advocate** — challenges the working diagnosis using both imaging and clinical evidence for must-not-miss alternatives
  → **Consultant** — synthesizes a collegial consultation note
  → **Output** (consultation report, alternative diagnoses, recommended workup)
 
- ### MedSigLIP Sign Verification
-
- The Bias Detector doesn't just rely on text reasoning — it uses [**MedSigLIP-448**](https://huggingface.co/google/medsiglip-448) for objective visual verification. For each radiological sign mentioned by the Diagnostician (e.g., "pleural effusion", "cardiomegaly", "pneumothorax"), MedSigLIP performs [zero-shot binary classification](https://huggingface.co/tasks/zero-shot-image-classification): it compares the logits of `"chest radiograph showing [sign]"` vs `"normal chest radiograph with no [sign]"`. A logit difference > 2 is classified as "likely present", grounding the bias analysis in **visual evidence** rather than pure language reasoning.
-
- ## Demo Cases
-
- Three composite clinical scenarios covering the most dangerous diagnostic error patterns:
-
- <table>
- <tr>
- <td width="33%" valign="top">
-
- ### Case 1: Missed Pneumothorax
- **🏷️ TRAUMA**
-
- 32M, motorcycle collision. Doctor diagnoses **rib contusion**, discharges patient. Supine CXR actually shows a **left pneumothorax** with rib fractures.
-
- **Bias**: [Satisfaction of search](https://radiopaedia.org/articles/satisfaction-of-search) — found the rib fractures, stopped looking.
-
- </td>
- <td width="33%" valign="top">
-
- ### Case 2: Aortic Dissection → "GERD"
- **🏷️ VASCULAR**
-
- 58M, hypertensive, tearing chest pain. Doctor diagnoses **acid reflux**, prescribes antacids. Blood pressure asymmetry (178/102 R vs 146/88 L) and D-dimer 4,850 suggest **Stanford type B dissection**.
-
- **Bias**: [Anchoring](https://en.wikipedia.org/wiki/Anchoring_(cognitive_bias)) + [availability heuristic](https://en.wikipedia.org/wiki/Availability_heuristic) — common diagnosis assumed first.
-
- </td>
- <td width="33%" valign="top">
-
- ### Case 3: Postpartum PE → "Anxiety"
- **🏷️ POSTPARTUM**
-
- 29F, day 5 post C-section, dyspnea and tachycardia. Doctor orders **psychiatric consult**. SpO2 91%, ABG shows respiratory alkalosis — classic **pulmonary embolism**.
-
- **Bias**: [Premature closure](https://en.wikipedia.org/wiki/Premature_closure) + [framing effect](https://en.wikipedia.org/wiki/Framing_effect_(psychology)) — young woman = anxiety.
-
- </td>
- </tr>
- </table>
-
- > All cases are educational composites synthesized from published literature. See [`data/demo_cases/SOURCES.md`](data/demo_cases/SOURCES.md) for full citations.
-
- ## Technical Details
-
- ### Model Stack
-
- | Model | Parameters | Role | Loading |
- |:------|:----------|:-----|:--------|
- | [MedGemma 1.5 4B-IT](https://huggingface.co/google/medgemma-1.5-4b-it) | 4B | Multimodal image+text analysis | 4-bit quantized (~4GB VRAM) or BF16 (~8GB) |
- | [MedGemma 27B Text-IT](https://huggingface.co/google/medgemma-27b-text-it) | 27B | Consultant deep reasoning (optional) | BF16 (~54GB VRAM) |
- | [MedSigLIP-448](https://huggingface.co/google/medsiglip-448) | 0.9B | Zero-shot sign verification | FP32 (~3GB VRAM) |
- | [MedASR](https://huggingface.co/google/medasr) | 105M | Medical speech-to-text | FP32 (~0.5GB VRAM) |
 
- ### Hardware
 
- The full pipeline (4B 4-bit + MedSigLIP + MedASR) requires **~8 GB VRAM** and runs on any CUDA GPU with 12GB+ memory. All models load locally via [Transformers](https://huggingface.co/docs/transformers) with [4-bit quantization](https://huggingface.co/docs/bitsandbytes) — **zero API costs, fully offline-capable**.
-
- ### Key Technical Decisions
-
- - **Blinded Diagnostician**: The first agent never sees the doctor's diagnosis. This prevents the AI from anchoring on the same conclusion, enabling genuine independent analysis.
-
- - **Dual-source analysis (imaging + clinical)**: All agents analyze both the medical image and the full clinical context (vitals, labs, risk factors). Each Diagnostician finding is tagged with its source (`imaging`, `clinical`, or `both`). This is critical because many must-not-miss diagnoses — aortic dissection (BP asymmetry), pulmonary embolism (low SpO2, elevated D-dimer) — may have subtle or absent imaging signs but glaring clinical red flags.
-
- - **Structured JSON output**: All agents output structured JSON parsed by [`json_repair`](https://github.com/mangiucugna/json_repair), which handles LLM output quirks (missing commas, truncation, markdown wrapping).
 
- - **Thinking token stripping**: MedGemma wraps internal reasoning in `<unused94>...<unused95>` tags ([model card](https://huggingface.co/google/medgemma-27b-text-it#thinking-mode)). These are stripped via regex before display.
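A minimal sketch of that stripping step, assuming the `<unused94>`/`<unused95>` delimiters described in the bullet above (the helper name is illustrative):

```python
import re

# MedGemma thinking-mode delimiters (special tokens, not HTML-style pairs)
_THINKING = re.compile(r"<unused94>.*?<unused95>", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove internal-reasoning spans before displaying model output."""
    return _THINKING.sub("", text).strip()
```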
 
- - **Adaptive model routing**: The first three agents (Diagnostician, Bias Detector, Devil's Advocate) always use 4B-IT for multimodal image+text analysis. Only the Consultant (text-only synthesis) optionally upgrades to 27B when `USE_27B=true` for deeper clinical reasoning. `generate_with_image()` always uses 4B (only model with vision).
 
- - **Collegial tone**: The Consultant is prompted to write as a consulting colleague, not a critic. Research shows physicians respond better to [collaborative challenge than confrontation](https://pubmed.ncbi.nlm.nih.gov/28493811/).
 
- - **Prompt Repetition**: All agents use the prompt repetition technique from [*"Prompt Repetition Improves Non-Reasoning LLMs"*](https://arxiv.org/abs/2512.14982) (Google Research, 2025). The user prompt is repeated with a transition phrase (`<query> Let me repeat the request: <query>`), which won **47 out of 70** benchmark-model combinations with **zero losses** — at nearly zero cost (only increases prefill tokens, no extra generation). Controllable via `ENABLE_PROMPT_REPETITION` env var.
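A minimal sketch of the repetition wrapper, using the transition phrase quoted in the bullet above (the function name and flag handling are illustrative, not the project's actual API):

```python
def with_repetition(query: str, enabled: bool = True) -> str:
    """Build '<query> Let me repeat the request: <query>' when enabled.
    Only prefill tokens grow; generation length is unchanged."""
    if not enabled:  # mirrors disabling ENABLE_PROMPT_REPETITION
        return query
    return f"{query} Let me repeat the request: {query}"
```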
 
  ## Getting Started
 
- ### Prerequisites
-
- - Python 3.11+
- - CUDA-capable GPU (12GB+ VRAM)
- - [Hugging Face account](https://huggingface.co) with access to gated models (MedGemma, MedSigLIP, MedASR)
-
- ### Installation
-
  ```bash
- # Clone the repository
  git clone https://github.com/sypsyp97/diagnostic-devils-advocate
  cd diagnostic-devils-advocate
 
- # Install dependencies
  pip install -r requirements.txt
 
- # Login to Hugging Face (required for gated models)
  huggingface-cli login
- ```
-
- ### Running
-
- ```bash
- # Standard launch (4B quantized, 12GB GPU)
- python app.py
-
- # With 27B reasoning model (A100 80GB required)
- USE_27B=true QUANTIZE_4B=false python app.py
 
- # Disable voice input
- ENABLE_MEDASR=false python app.py
  ```
 
  The app launches at `http://localhost:7860`.
 
- ### Environment Variables
 
  | Variable | Default | Description |
  |:---------|:--------|:------------|
  | `USE_27B` | `false` | Enable 27B model for the Consultant agent |
  | `QUANTIZE_4B` | `true` | 4-bit quantize the 4B model |
  | `ENABLE_MEDASR` | `true` | Enable voice input via MedASR |
- | `HF_TOKEN` | — | Hugging Face token (or use `huggingface-cli login`) |
  | `ENABLE_PROMPT_REPETITION` | `true` | [Prompt repetition](https://arxiv.org/abs/2512.14982) for improved output quality |
- | `MODEL_LOCAL_DIR` | — | Local directory for pre-downloaded models |
  | `DEVICE` | `cuda` | Compute device |
 
- ## Project Structure
 
  ```
  diagnostic-devils-advocate/
- ├── app.py               # Gradio entry point
- ├── config.py            # Model selection & environment config
  ├── requirements.txt
- │
  ├── agents/
- │   ├── state.py         # LangGraph TypedDict state definitions
- │   ├── prompts.py       # All agent prompt templates
- │   ├── graph.py         # LangGraph StateGraph pipeline
- │   ├── output_parser.py # JSON parsing with json_repair + llm-output-parser
- │   ├── diagnostician.py # Agent 1: Blinded image + clinical analysis
- │   ├── bias_detector.py # Agent 2: Bias detection + MedSigLIP
- │   ├── devil_advocate.py # Agent 3: Adversarial challenge
- │   └── consultant.py    # Agent 4: Consultation note synthesis
- │
  ├── models/
- │   ├── medgemma_client.py  # MedGemma 4B/27B inference client
- │   ├── medsiglip_client.py # MedSigLIP zero-shot classification
- │   ├── medasr_client.py    # MedASR speech-to-text
- │   └── utils.py            # Image preprocessing, token stripping
- │
  ├── ui/
- │   ├── components.py    # Gradio layout & progress visualization
- │   ├── callbacks.py     # UI event handlers & pipeline integration
- │   └── css.py           # Custom styling (responsive design)
- │
  └── data/
-     └── demo_cases/      # 3 composite clinical scenarios
-         └── SOURCES.md   # Full literature citations
  ```
 
  ## Disclaimer
 
- > **This is a research prototype built for the MedGemma Impact Challenge. It is NOT intended for clinical decision-making.** All demo cases are educational composites. Medical images are sourced from the University of Saskatchewan Teaching Collection (CC-BY-NC-SA 4.0).
 
  ## References
 
- - Singh H, et al. "The frequency of diagnostic errors in outpatient care." [*BMJ Quality & Safety*, 2014](https://qualitysafety.bmj.com/content/23/9/727)
- - Graber ML, et al. "Cognitive interventions to reduce diagnostic error." [*BMJ Quality & Safety*, 2012](https://qualitysafety.bmj.com/content/21/7/535)
- - Croskerry P. "The importance of cognitive errors in diagnosis." [*Academic Medicine*, 2003](https://pubmed.ncbi.nlm.nih.gov/12915371/)
- - Ball CG, et al. "Incidence, risk factors, and outcomes for occult pneumothoraces." [*J Trauma*, 2005](https://pubmed.ncbi.nlm.nih.gov/16374282/)
- - Hansen MS, et al. "Frequency of misdiagnosis of acute aortic dissection." [*Am J Cardiol*, 2007](https://pubmed.ncbi.nlm.nih.gov/17350380/)
- - Ivgi M, et al. "Prompt Repetition Improves Non-Reasoning LLMs." [*arXiv:2512.14982*](https://arxiv.org/abs/2512.14982), Google Research, 2025
- - Google Health AI. [Health AI Developer Foundations (HAI-DEF)](https://developers.google.com/health-ai)
- - Yang J, et al. [MedGemma: Medical AI model](https://huggingface.co/collections/google/health-ai-developer-foundations-68544906f8a0a10f7d30ade8) — Hugging Face Collection
 
  ---
 
 
 
  <div align="center">
 
+ # Diagnostic Devil's Advocate
 
+ **AI-Powered Cognitive Debiasing for Medical Image Interpretation**
 
+ [![MedGemma](https://img.shields.io/badge/MedGemma_1.5-4B_|_27B-4285F4?style=flat-square&logo=google&logoColor=white)](https://huggingface.co/google/medgemma-1.5-4b-it)
+ [![MedSigLIP](https://img.shields.io/badge/MedSigLIP-448-34A853?style=flat-square&logo=google&logoColor=white)](https://huggingface.co/google/medsiglip-448)
+ [![LangGraph](https://img.shields.io/badge/LangGraph-Agents-1C3C3C?style=flat-square)](https://langchain-ai.github.io/langgraph/)
+ [![Gradio](https://img.shields.io/badge/Gradio-UI-F97316?style=flat-square)](https://gradio.app)
+ [![License](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey?style=flat-square)](LICENSE)
 
  </div>
 
+ ---
 
+ ## Why This Exists
 
+ > Diagnostic errors affect an estimated **12 million** adults annually in the U.S., with cognitive biases implicated in up to **74%** of cases. ([Singh et al., 2014](https://pubmed.ncbi.nlm.nih.gov/24742777/))
 
+ Doctors are not wrong because they lack knowledge -- they are wrong because the human brain takes shortcuts. A physician who sees "young patient + chest pain after trauma" anchors on **rib contusion** and stops looking. The pneumothorax goes unseen. The patient deteriorates.
 
+ **Diagnostic Devil's Advocate** acts as an adversarial second opinion. It does not replace the physician -- it challenges them: *"Have you considered what happens if you're wrong?"*
 
+ ---
 
+ ## Pipeline
 
+ Four agents, each with a distinct adversarial role, orchestrated by [LangGraph](https://langchain-ai.github.io/langgraph/) as a linear `StateGraph`:
 
  **Gradio UI** (image upload, diagnosis input, clinical context, [MedASR](https://huggingface.co/google/medasr) voice input)
  → **Diagnostician** — receives image + clinical context but **NOT** the doctor's diagnosis; tags findings by source (`imaging` / `clinical` / `both`)
+ → **Bias Detector** — receives the doctor's diagnosis, compares it against independent findings using image, clinical data, and [MedSigLIP](https://huggingface.co/google/medsiglip-448) sign verification
  → **Devil's Advocate** — challenges the working diagnosis using both imaging and clinical evidence for must-not-miss alternatives
  → **Consultant** — synthesizes a collegial consultation note
  → **Output** (consultation report, alternative diagnoses, recommended workup)
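The linear flow above can be sketched in plain Python. The agent stubs here are hypothetical stand-ins for illustration; the actual project wires four agent nodes into a LangGraph `StateGraph`:

```python
from typing import Callable

State = dict  # shared state threaded through the agents

def run_pipeline(state: State, agents: list[Callable[[State], State]]) -> State:
    """Apply each agent in order, each reading and extending the state."""
    for agent in agents:
        state = agent(state)
    return state

# Hypothetical stand-in agents (the real ones call MedGemma / MedSigLIP).
def diagnostician(s):   return {**s, "findings": ["left pneumothorax (imaging)"]}
def bias_detector(s):   return {**s, "biases": ["satisfaction of search"]}
def devils_advocate(s): return {**s, "alternatives": ["tension pneumothorax"]}
def consultant(s):      return {**s, "report": "Have you considered a pneumothorax?"}

result = run_pipeline(
    {"image": "cxr.png", "context": "32M, motorcycle collision"},
    [diagnostician, bias_detector, devils_advocate, consultant],
)
```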
 
+ ### Key Design Choices
 
+ - **Blinded first agent** -- the Diagnostician never sees the doctor's diagnosis, preventing the AI from anchoring on the same conclusion
+ - **Dual-source analysis** -- every agent considers both the medical image and clinical context (vitals, labs, risk factors), because many dangerous conditions have subtle imaging but obvious clinical red flags
+ - **MedSigLIP verification** -- zero-shot image classification grounds the bias analysis in visual evidence, not just language reasoning
+ - **Collegial tone** -- the Consultant writes as a consulting colleague (*"Have you considered..."*), not a critic
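The MedSigLIP check boils down to a logit comparison. A minimal sketch of the decision rule, per the sign-verification description in this commit (in the real pipeline the two logits come from MedSigLIP scoring `"chest radiograph showing [sign]"` against `"normal chest radiograph with no [sign]"`; the function name and threshold default are illustrative):

```python
def classify_sign(logit_present: float, logit_absent: float,
                  threshold: float = 2.0) -> str:
    """Zero-shot binary sign check: compare the logit for a prompt
    asserting the sign against one denying it."""
    if logit_present - logit_absent > threshold:
        return "likely present"
    return "not confirmed"
```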
 
+ ---
 
+ ## Model Stack
 
+ | Model | Params | Role | VRAM |
+ |:------|:------:|:-----|:----:|
+ | [MedGemma 1.5 4B-IT](https://huggingface.co/google/medgemma-1.5-4b-it) | 4B | Multimodal image + text analysis | ~4 GB (4-bit) |
+ | [MedGemma 27B Text-IT](https://huggingface.co/google/medgemma-27b-text-it) | 27B | Consultant deep reasoning (optional) | ~54 GB |
+ | [MedSigLIP-448](https://huggingface.co/google/medsiglip-448) | 0.9B | Zero-shot sign verification | ~3 GB |
+ | [MedASR](https://huggingface.co/google/medasr) | 105M | Medical speech-to-text | ~0.5 GB |
 
+ The full pipeline requires **~8 GB VRAM** and runs on any 12 GB+ CUDA GPU. All models load locally via [Transformers](https://huggingface.co/docs/transformers) with [4-bit quantization](https://huggingface.co/docs/bitsandbytes) -- **zero API costs, fully offline-capable**.
 
+ ---
 
  ## Getting Started
 
  ```bash
+ # Clone
  git clone https://github.com/sypsyp97/diagnostic-devils-advocate
  cd diagnostic-devils-advocate
 
+ # Install
  pip install -r requirements.txt
 
+ # Login to Hugging Face (gated models)
  huggingface-cli login
 
+ # Run
+ python app.py                                  # 4B quantized (default)
+ USE_27B=true QUANTIZE_4B=false python app.py   # with 27B Consultant
+ ENABLE_MEDASR=false python app.py              # without voice input
  ```
 
  The app launches at `http://localhost:7860`.
 
+ <details>
+ <summary><b>Environment Variables</b></summary>
 
  | Variable | Default | Description |
  |:---------|:--------|:------------|
  | `USE_27B` | `false` | Enable 27B model for the Consultant agent |
  | `QUANTIZE_4B` | `true` | 4-bit quantize the 4B model |
  | `ENABLE_MEDASR` | `true` | Enable voice input via MedASR |
+ | `HF_TOKEN` | -- | Hugging Face token (or use `huggingface-cli login`) |
  | `ENABLE_PROMPT_REPETITION` | `true` | [Prompt repetition](https://arxiv.org/abs/2512.14982) for improved output quality |
+ | `MODEL_LOCAL_DIR` | -- | Local directory for pre-downloaded models |
  | `DEVICE` | `cuda` | Compute device |
 
+ </details>
+
+ <details>
+ <summary><b>Project Structure</b></summary>
 
  ```
  diagnostic-devils-advocate/
+ ├── app.py               # Gradio entry point
+ ├── config.py            # Model & environment config
  ├── requirements.txt
  ├── agents/
+ │   ├── prompts.py       # All agent prompt templates
+ │   ├── graph.py         # LangGraph StateGraph pipeline
+ │   ├── output_parser.py # JSON parsing (json_repair + llm-output-parser)
+ │   ├── diagnostician.py # Agent 1: Blinded analysis
+ │   ├── bias_detector.py # Agent 2: Bias detection + MedSigLIP
+ │   ├── devil_advocate.py # Agent 3: Adversarial challenge
+ │   └── consultant.py    # Agent 4: Consultation synthesis
  ├── models/
+ │   ├── medgemma_client.py  # MedGemma 4B/27B inference
+ │   ├── medsiglip_client.py # MedSigLIP zero-shot classification
+ │   ├── medasr_client.py    # MedASR speech-to-text
+ │   └── utils.py            # Image preprocessing, token stripping
  ├── ui/
+ │   ├── components.py    # Gradio layout
+ │   ├── callbacks.py     # UI event handlers
+ │   └── css.py           # Custom styling
  └── data/
+     └── demo_cases/      # Composite clinical scenarios
  ```
 
+ </details>
 
+ ---
 
  ## Disclaimer
 
+ > **This is a research prototype built for the [MedGemma Impact Challenge](https://www.kaggle.com/competitions/medgemma-impact-challenge). It is NOT intended for clinical decision-making.** All demo cases are educational composites. Medical images are sourced from the University of Saskatchewan Teaching Collection (CC-BY-NC-SA 4.0).
 
+ ---
 
  ## References
 
+ <details>
+ <summary><b>Diagnostic Error & Cognitive Bias</b></summary>
+
+ - Singh H, Meyer AND, Thomas EJ. "The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations." *BMJ Quality & Safety*, 2014;23(9):727--731. [doi:10.1136/bmjqs-2013-002627](https://pubmed.ncbi.nlm.nih.gov/24742777/)
+ - Croskerry P. "The importance of cognitive errors in diagnosis and strategies to minimize them." *Academic Medicine*, 2003;78(8):775--780. [doi:10.1097/00001888-200308000-00003](https://pubmed.ncbi.nlm.nih.gov/12915363/)
+ - Vally ZI, Khammissa RAG, Feller G, et al. "Errors in clinical diagnosis: a narrative review." *Journal of International Medical Research*, 2023;51(8):03000605231162798. [doi:10.1177/03000605231162798](https://pubmed.ncbi.nlm.nih.gov/37602466/)
+ - Staal J, Hooftman J, Gunput STG, et al. "Effect on diagnostic accuracy of cognitive reasoning tools for the workplace setting: systematic review and meta-analysis." *BMJ Quality & Safety*, 2022;31(12):899--910. [doi:10.1136/bmjqs-2022-014865](https://pubmed.ncbi.nlm.nih.gov/36396150/)
+
+ </details>
+
+ <details>
+ <summary><b>AI-Assisted Debiasing & Multi-Agent Systems</b></summary>
+
+ - Brown C, Nazeer R, Gibbs A, et al. "Breaking Bias: The Role of Artificial Intelligence in Improving Clinical Decision-Making." *Cureus*, 2023;15(3):e36415. [doi:10.7759/cureus.36415](https://pubmed.ncbi.nlm.nih.gov/37090406/)
+ - Tang X, Zou A, Zhang Z, et al. "MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning." *Findings of ACL*, 2024:599--621. [arXiv:2311.10537](https://arxiv.org/abs/2311.10537)
+ - Kim Y, Park C, Jeong H, et al. "MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making." *NeurIPS*, 2024. [arXiv:2404.15155](https://arxiv.org/abs/2404.15155)
+ - Chen X, Yi H, You M, et al. "Enhancing diagnostic capability with multi-agents conversational large language models." *npj Digital Medicine*, 2025;8:159. [doi:10.1038/s41746-025-01550-0](https://pubmed.ncbi.nlm.nih.gov/40082662/)
+
+ </details>
+
+ <details>
+ <summary><b>Medical Vision-Language Models & Prompt Engineering</b></summary>
+
+ - Jang J, Kyung D, Kim SH, et al. "Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders." *Scientific Reports*, 2024;14:23199. [doi:10.1038/s41598-024-73695-z](https://pubmed.ncbi.nlm.nih.gov/39369048/)
+ - Leviathan Y, Kalman M, Matias Y. "Prompt Repetition Improves Non-Reasoning LLMs." [arXiv:2512.14982](https://arxiv.org/abs/2512.14982), Google Research, 2025.
+ - Zaghir J, Naguib M, Bjelogrlic M, et al. "Prompt Engineering Paradigms for Medical Applications: Scoping Review." *Journal of Medical Internet Research*, 2024;26:e60501. [doi:10.2196/60501](https://pubmed.ncbi.nlm.nih.gov/39255030/)
+ - Sellergren A, Kazemzadeh S, Jaroensri T, et al. "MedGemma Technical Report." [arXiv:2507.05201](https://arxiv.org/abs/2507.05201), Google, 2025.
+
+ </details>
 
  ---