bshepp committed on
Commit eaea340 · 1 Parent(s): fdd7dd5

Clean repo for portfolio: archive competition internals, update README and CLAUDE.md

.gitignore CHANGED
@@ -49,6 +49,9 @@ models/*.pt
models/*.onnx
models/*.safetensors

+ # Archive (competition internals, planning docs)
+ archive/
+
# Notebooks checkpoints
.ipynb_checkpoints/

CLAUDE.md CHANGED
@@ -6,46 +6,20 @@

## Project Overview

- **CDS Agent** is an agentic clinical decision support system built for the [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge) (Kaggle / Google Research). It orchestrates MedGemma across a multi-step pipeline to produce clinical decision support reports.
+ **CDS Agent** is an agentic clinical decision support system that orchestrates MedGemma 27B across a 6-step pipeline to produce clinical decision support reports from free-text patient cases. Originally built for the MedGemma Impact Challenge (Kaggle / Google Research).

- **Deadline:** February 24, 2026, 11:59 PM UTC
+ **Live demo:** [demo.briansheppard.com](https://demo.briansheppard.com)

---

- ## Track System — READ THIS
-
- This project uses an **experimental track system** to evaluate multiple diagnostic accuracy strategies in strict isolation. Each track is an independent pipeline variant with its own files, configuration, and results.
-
- **The track registry is in [TRACKS.md](TRACKS.md).** That file is the single source of truth for:
- - Which tracks exist and what they do
- - Which files belong to which track
- - File tagging conventions
- - Isolation rules
-
- ### Track Isolation Rules (Summary)
-
- 1. **Every file owned by a track MUST have a track tag on line 1** — a comment identifying its track ID (e.g., `# [Track B: RAG Variants]`). The exact format depends on the file type.
- 2. **Never modify a file owned by one track to benefit another.** Shared code lives in `src/backend/tracks/shared/`.
- 3. **The baseline pipeline (`src/backend/app/`) is Track A.** Experimental tracks extend or wrap Track A code — they do NOT modify it.
- 4. **Results from each track are stored separately** under `src/backend/tracks/<track_dir>/results/`.
- 5. **Cross-track comparison** is performed only via shared utilities in `src/backend/tracks/shared/`.
-
- See **[TRACKS.md](TRACKS.md)** for the complete specification.
-
- ---
-
- ## Critical Files
+ ## Key Files

| File | Purpose |
|------|---------|
- | **[TRACKS.md](TRACKS.md)** | Track registry, file ownership, isolation rules — **start here for experimental work** |
- | **[EXPERIMENT_PLAN.md](EXPERIMENT_PLAN.md)** | 4-phase execution plan for accuracy optimization — **the step-by-step playbook** |
- | [TODO.md](TODO.md) | Session-level action items and project status |
| [DEVELOPMENT_LOG.md](DEVELOPMENT_LOG.md) | Chronological build history and decisions |
- | [SUBMISSION_GUIDE.md](SUBMISSION_GUIDE.md) | Competition rules, timeline, and submission checklist |
- | [docs/kaggle_writeup.md](docs/kaggle_writeup.md) | Final writeup content for Kaggle submission |
- | [docs/video_script.md](docs/video_script.md) | 3-minute demo video narration script |
| [docs/architecture.md](docs/architecture.md) | System architecture and design decisions |
+ | [docs/test_results.md](docs/test_results.md) | Detailed test results and benchmarks |
+ | [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HF Endpoint deployment guide |

---

@@ -53,28 +27,35 @@ See **[TRACKS.md](TRACKS.md)** for the complete specification.

```
medgemma_impact_challenge/
- ├── CLAUDE.md              ← You are here
- ├── TRACKS.md              ← Track registry and isolation rules
- ├── TODO.md                ← Next-session action items
- ├── DEVELOPMENT_LOG.md     ← Build history
+ ├── CLAUDE.md                          <- You are here
+ ├── DEVELOPMENT_LOG.md                 <- Build history
├── src/backend/
- │   ├── app/               ← Track A (Baseline) — production pipeline
- │   │   ├── agent/orchestrator.py
- │   │   ├── services/medgemma.py
- │   │   ├── tools/         ← 6 pipeline tools
- │   │   ├── models/schemas.py
- │   │   └── data/clinical_guidelines.json
- │   ├── tracks/            ← Experimental tracks
- │   │   ├── shared/        ← Cross-track utilities (cost tracking, comparison)
- │   │   ├── rag_variants/  ← Track B: Chunking & embedding experiments
- │   │   ├── iterative/     ← Track C: Serial iterative refinement
- │   │   ├── arbitrated/    ← Track D: Parallel specialists + arbiter
- │   │   ├── combined/      ← Track E: Composition of per-axis winners (Phase 3)
- │   │   ├── prompt_arch/   ← Track F: Prompt architecture variants (Phase 2)
- │   │   ├── voting/        ← Track G: Multi-sample voting (Phase 2)
- │   │   └── verification/  ← Track H: Evidence verification (Phase 2)
- │   └── validation/        ← Validation framework (shared across all tracks)
- └── src/frontend/          ← Next.js frontend (not track-specific)
+ │   ├── app/                           <- Production pipeline
+ │   │   ├── agent/orchestrator.py      <- 6-step pipeline orchestrator
+ │   │   ├── services/medgemma.py       <- LLM service (OpenAI-compatible)
+ │   │   ├── tools/                     <- 6 pipeline tools
+ │   │   │   ├── patient_parser.py         Step 1: Free-text -> structured data
+ │   │   │   ├── clinical_reasoning.py     Step 2: Differential diagnosis
+ │   │   │   ├── drug_interactions.py      Step 3: OpenFDA + RxNorm APIs
+ │   │   │   ├── guideline_retrieval.py    Step 4: RAG over ChromaDB
+ │   │   │   ├── conflict_detection.py     Step 5: Guideline vs patient gaps
+ │   │   │   └── synthesis.py              Step 6: CDS report generation
+ │   │   ├── models/schemas.py          <- Pydantic data models
+ │   │   ├── data/clinical_guidelines.json  <- 62 guidelines, 14 specialties
+ │   │   └── api/                       <- REST + WebSocket endpoints
+ │   ├── tracks/                        <- Experimental pipeline variants
+ │   │   ├── shared/                    <- Cross-track utilities
+ │   │   ├── rag_variants/              <- Chunking & embedding experiments
+ │   │   ├── iterative/                 <- Serial iterative refinement
+ │   │   └── arbitrated/                <- Parallel specialists + arbiter
+ │   └── validation/                    <- External dataset validation framework
+ │       ├── harness_medqa.py           <- MedQA (USMLE) diagnostic accuracy
+ │       ├── harness_mtsamples.py       <- MTSamples parse quality
+ │       └── harness_pmc.py             <- PMC Case Reports diagnostic accuracy
+ └── src/frontend/                      <- Next.js 14 + React 18 + TypeScript
+     └── src/
+         ├── components/                <- PatientInput, AgentPipeline, CDSReport
+         └── hooks/                     <- WebSocket state management
```

---

@@ -84,8 +65,6 @@ medgemma_impact_challenge/

- **Python style:** Pydantic v2 for all data models, async throughout, type hints everywhere
- **LLM calls:** Always go through `app/services/medgemma.py` — never instantiate the OpenAI SDK directly
- **Structured output:** Use `medgemma.generate_structured(prompt, response_model)` with Pydantic models
- - **Temperature conventions:** 0.1 for safety-critical/extraction, 0.2–0.3 for reasoning/synthesis
+ - **Temperature conventions:** 0.1 for safety-critical/extraction, 0.2-0.3 for reasoning/synthesis
- **Error handling:** Graceful degradation — return partial results rather than crashing
- **No framework dependencies:** Custom orchestrator, no LangChain/LlamaIndex
- - **Windows compatibility:** ASCII characters only in console output (no box-drawing or Unicode symbols)
- - **Track tagging:** Line 1 of every track-owned file must carry the track tag comment
 
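The `generate_structured` convention in the conventions list above can be sketched as follows. This is illustrative only: the LLM call is stubbed with a canned JSON reply, and a stdlib dataclass stands in for the real Pydantic v2 response model, so the names `fake_llm` and `Differential` are assumptions, not the service's actual API.

```python
import json
from dataclasses import dataclass, fields
from typing import Type, TypeVar

T = TypeVar("T")

@dataclass
class Differential:
    """Illustrative stand-in for a Pydantic response model."""
    primary_diagnosis: str
    confidence: float

def fake_llm(prompt: str) -> str:
    # Stand-in for the MedGemma endpoint; the real service makes an HTTP call.
    return '{"primary_diagnosis": "community-acquired pneumonia", "confidence": 0.72}'

def generate_structured(prompt: str, response_model: Type[T]) -> T:
    """Call the model, parse its JSON reply, and coerce it into the schema
    (a dataclass here; Pydantic v2 does stricter validation in the real service)."""
    raw = json.loads(fake_llm(prompt))
    allowed = {f.name for f in fields(response_model)}
    return response_model(**{k: v for k, v in raw.items() if k in allowed})

report = generate_structured("55F, fever, productive cough...", Differential)
```

The value of the convention is that every LLM call site declares its expected output shape up front, so malformed model output fails at the parsing boundary rather than deep in the pipeline.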
EXPERIMENT_PLAN.md DELETED
@@ -1,806 +0,0 @@
1
- # EXPERIMENT_PLAN.md — 4-Phase Accuracy Optimization Plan
2
-
3
- > **Purpose:** Step-by-step execution plan for an AI agent or human to follow.
4
- > Each step is atomic, has clear inputs/outputs, and explicit success criteria.
5
- >
6
- > **Context:** Baseline accuracy is 36% top-1 on 50-case MedQA (seed=42). Our
7
- > goal is to find the best composite strategy before the Feb 24, 2026 deadline.
8
- >
9
- > **Prerequisite reading:** `CLAUDE.md` → `TRACKS.md` → this file.
10
-
11
- ---
12
-
13
- ## Infrastructure Prerequisites
14
-
15
- Before ANY phase, ensure:
16
-
17
- 1. **HF Endpoint is running.**
18
- - Go to https://ui.endpoints.huggingface.co → `medgemma-27b-cds` → Resume
19
- - Wait until status shows "Running" (5–15 min cold start)
20
- - Cost: ~$2.50/hr — **pause when done**
21
-
22
- 2. **Virtual environment is active.**
23
- ```powershell
24
- cd f:\kaggle\medgemma_impact_challenge\src\backend
25
- .\venv\Scripts\Activate.ps1
26
- ```
27
-
28
- 3. **Dependencies installed.**
29
- ```powershell
30
- pip install -r requirements.txt
31
- pip install sentence-transformers # Needed for Track B embedding variants
32
- ```
33
-
34
- 4. **Environment variables set.**
35
- - `.env` file in `src/backend/` must have `HF_TOKEN`, `MEDGEMMA_API_KEY`, `MEDGEMMA_BASE_URL`
36
- - Verify: `python -c "from app.config import Settings; s = Settings(); print(s.medgemma_base_url)"`
37
-
38
- 5. **Quick health check.** Run 1 case through baseline to confirm the endpoint responds:
39
- ```powershell
40
- python -m validation.run_validation --medqa --max-cases 1
41
- ```
42
- **Success:** Pipeline returns a `CDSReport` without timeout errors.
43
-
44
- ---
45
-
46
- ## Phase 1 — Independent Axis Sweeps
47
-
48
- **Goal:** Find the best single-axis configuration for B, C, and D independently.
49
- **Estimated cost:** ~$15–25 of endpoint time (6–10 hours)
50
- **Estimated cases:** 50 per config × (10 + 4 + 4) = 900 total pipeline runs
51
-
52
- ### Phase 1A — Track B: RAG Variants
53
-
54
- **What we're testing:** Which retrieval configuration gets the best documents in front of the model?
55
-
56
- #### Step 1A.1: Smoke Test (3 cases × 10 variants = 30 runs)
57
-
58
- ```powershell
59
- cd f:\kaggle\medgemma_impact_challenge\src\backend
60
- python -m tracks.rag_variants.run_variants --max-cases 3
61
- ```
62
-
63
- **Check for:**
64
- - [ ] All 10 variants complete without errors
65
- - [ ] Each variant produces a result JSON in `tracks/rag_variants/results/`
66
- - [ ] MedCPT and MPNet embedding models download successfully
67
- - [ ] Reranking variant (B9) loads the cross-encoder model
68
- - [ ] Output shows a comparison table with per-variant scores
69
-
70
- **If any variant fails:** Fix the error, then re-run with `--variant <id>` to test just that one:
71
- ```powershell
72
- python -m tracks.rag_variants.run_variants --variant B6_medcpt --max-cases 3
73
- ```
74
-
75
- **Common failure modes:**
76
- - `sentence-transformers` not installed → `pip install sentence-transformers`
77
- - MedCPT download fails → check `HF_TOKEN` is set
78
- - ChromaDB lock → delete `tracks/rag_variants/data/chroma/` and retry
79
-
80
- #### Step 1A.2: Full Sweep (50 cases × 10 variants = 500 runs)
81
-
82
- ```powershell
83
- python -m tracks.rag_variants.run_variants
84
- ```
85
-
86
- **Expected runtime:** 3–5 hours (50 cases × 10 variants, ~2 min/case with API latency)
87
-
88
- **Output:** Results in `tracks/rag_variants/results/` — one JSON per variant.
89
-
90
- #### Step 1A.3: Identify B*
91
-
92
- Read the comparison table printed at the end, or run:
93
- ```powershell
94
- python -m tracks.shared.compare --tracks B --dataset medqa
95
- ```
96
-
97
- **Record the winner:**
98
- ```
99
- B* = ____________ (variant_id)
100
- B* top-1 accuracy = _____%
101
- B* improvement over B0_baseline = +_____%
102
- ```
103
-
104
- **Decision rules:**
105
- - If the best variant beats B0 by <2%, retrieval isn't the bottleneck. Note this, but still carry B* forward.
106
- - If multiple variants tie within 1%, prefer the one with lower latency/complexity.
107
- - If reranking (B9) wins, note the added latency cost.
108
-
109
- ---
110
-
111
- ### Phase 1B — Track C: Iterative Refinement
112
-
113
- **What we're testing:** Does repeated self-critique improve diagnostic accuracy? At what point do returns diminish?
114
-
115
- #### Step 1B.1: Smoke Test (3 cases × 4 configs = 12 runs)
116
-
117
- ```powershell
118
- python -m tracks.iterative.run_iterative --max-cases 3
119
- ```
120
-
121
- **Check for:**
122
- - [ ] All 4 configs complete without errors
123
- - [ ] Per-iteration accuracy and cost data is printed
124
- - [ ] Convergence detection works (C0_2rounds should always run all 2 iterations; C2_5rounds might converge early)
125
- - [ ] Cost ledger populates correctly
126
-
127
- **If a config hangs:** Likely an LLM timeout. Check that the endpoint is warm. The iterative track makes 2-10× more LLM calls per case than baseline.
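The convergence detection checked above might look roughly like the sketch below; `refine_until_converged`, the critique callable, and the toy critic are all illustrative stand-ins, not the iterative track's actual API.

```python
from typing import Callable, List, Tuple

def refine_until_converged(
    initial: List[str],
    critique: Callable[[List[str]], List[str]],
    max_rounds: int,
) -> Tuple[List[str], int]:
    """Apply self-critique rounds, stopping early once the differential stops changing."""
    current = initial
    for round_idx in range(1, max_rounds + 1):
        revised = critique(current)
        if revised == current:  # converged: the critic made no changes
            return current, round_idx - 1
        current = revised
    return current, max_rounds

# Toy critic that promotes "PE" to the top once, then stabilizes.
def toy_critic(dx: List[str]) -> List[str]:
    return sorted(dx, key=lambda d: d != "PE")

final, rounds_used = refine_until_converged(["MI", "PE", "GERD"], toy_critic, max_rounds=5)
print(final, rounds_used)  # ['PE', 'MI', 'GERD'] 1
```

An early-converging run like this is exactly the signal the decision rules below look for: if C2_5rounds usually stops after 2 rounds, the extra budget is wasted.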
128
-
129
- #### Step 1B.2: Full Sweep (50 cases × 4 configs)
130
-
131
- ```powershell
132
- python -m tracks.iterative.run_iterative
133
- ```
134
-
135
- **Expected runtime:** 2–4 hours (C0 fastest, C3 slowest)
136
-
137
- **Output:** Results in `tracks/iterative/results/`
138
-
139
- #### Step 1B.3: Identify C*
140
-
141
- ```powershell
142
- python -m tracks.shared.compare --tracks C --dataset medqa
143
- ```
144
-
145
- **Record the winner:**
146
- ```
147
- C* = ____________ (config_id)
148
- C* top-1 accuracy = _____%
149
- C* avg iterations used = _____
150
- C* cost per case = $_____
151
- C* improvement over baseline = +_____%
152
- ```
153
-
154
- **Key data to extract:** The per-iteration accuracy curve. Plot or record:
155
- ```
156
- Iteration 0 (baseline): ___% top-1
157
- Iteration 1 (first critique): ___% top-1
158
- Iteration 2: ___% top-1
159
- Iteration 3: ___% top-1 (if applicable)
160
- ...
161
- ```
162
-
163
- **Decision rules:**
164
- - The winning config is the one with the best accuracy/cost ratio, not necessarily the one with the highest absolute accuracy.
165
- - If C2_5rounds converges at iteration 2 in most cases, the extra rounds aren't helping — C1_3rounds is probably enough.
166
- - If C3_aggressive loses accuracy (the critic is too harsh), note this as a failure mode.
167
-
168
- ---
169
-
170
- ### Phase 1C — Track D: Arbitrated Parallel
171
-
172
- **What we're testing:** Do multiple specialist perspectives, coordinated by an arbiter, find diagnoses a generalist misses?
173
-
174
- #### Step 1C.1: Smoke Test (3 cases × 4 configs = 12 runs)
175
-
176
- ```powershell
177
- python -m tracks.arbitrated.run_arbitrated --max-cases 3
178
- ```
179
-
180
- **Check for:**
181
- - [ ] All 4 configs complete without errors
182
- - [ ] Specialist outputs show domain-specific reasoning (cardiologist emphasizes cardiac, etc.)
183
- - [ ] Arbiter merge output is a coherent consensus differential, not just concatenation
184
- - [ ] For multi-round configs (D2, D3): tailored resubmission prompts are generated
185
- - [ ] For multi-round configs: second-round specialist outputs differ from first round
186
- - [ ] Cost tracking shows escalating cost with more specialists/rounds
187
-
188
- **If the arbiter produces garbage:** The merge prompt may need tuning. Check `ARBITER_MERGE_PROMPT` in `tracks/arbitrated/arbiter.py`.
189
-
190
- #### Step 1C.2: Full Sweep (50 cases × 4 configs)
191
-
192
- ```powershell
193
- python -m tracks.arbitrated.run_arbitrated
194
- ```
195
-
196
- **Expected runtime:** 3–6 hours (D0 fastest, D3 slowest — D3 runs 5 specialists × 2 rounds = 12 LLM calls/case)
197
-
198
- **Output:** Results in `tracks/arbitrated/results/`
199
-
200
- #### Step 1C.3: Identify D*
201
-
202
- ```powershell
203
- python -m tracks.shared.compare --tracks D --dataset medqa
204
- ```
205
-
206
- **Record the winner:**
207
- ```
208
- D* = ____________ (config_id)
209
- D* top-1 accuracy = _____%
210
- D* cost per case = $_____
211
- D* improvement over baseline = +_____%
212
- ```
213
-
214
- **Additional data to record:**
215
- ```
216
- Per-specialist contribution analysis:
217
- Cardiologist: Contributed unique correct dx in ___% of cases
218
- Neurologist: ____%
219
- ID Specialist: ____%
220
- General Internist: ____%
221
- Emergency Med: ____%
222
- Arbitration consensus rate: ____% of cases where >3 specialists agreed on top-1
223
- Round 2 lift (if applicable): +____% over round 1
224
- ```
225
-
226
- **Decision rules:**
227
- - If D0 (3-spec, 1-round) matches D3 (5-spec, 2-rounds), the extra cost isn't justified.
228
- - If specialists all agree in round 1, round 2 is wasted computation — future configs can drop it.
229
- - If one specialist consistently disagrees with the correct answer, consider removing it from the ensemble.
230
-
231
- ---
232
-
233
- ### Phase 1D — Cross-Track Comparison
234
-
235
- After all three tracks complete, run the unified comparison:
236
-
237
- ```powershell
238
- python -m tracks.shared.compare --dataset medqa
239
- ```
240
-
241
- **Expected output:**
242
- ```
243
- Cross-Track Comparison: MEDQA
244
- -------------------------------------------------------------
245
- Track Top-1 Top-3 Mentioned Pipeline Cost
246
- -------------------------------------------------------------
247
- A: Baseline 36.0% -- 38.0% 94.0% $X.XX
248
- B: RAG Variants ___% -- ___% ___% $X.XX
249
- C: Iterative ___% -- ___% ___% $X.XX
250
- D: Arbitrated ___% -- ___% ___% $X.XX
251
- -------------------------------------------------------------
252
- ```
253
-
254
- **Record Phase 1 summary:**
255
- ```
256
- B* = __________, accuracy = ____%, delta = +____%
257
- C* = __________, accuracy = ____%, delta = +____%
258
- D* = __________, accuracy = ____%, delta = +____%
259
- Best single axis: Track ___
260
- ```
261
-
262
- **Go/No-Go for Phase 2:**
263
- - If ALL tracks are within 2% of baseline → the model itself may be the bottleneck,
264
- not the pipeline. Consider investigating prompt architecture (Phase 2) more aggressively.
265
- - If ANY single track shows ≥5% lift → strong signal, proceed to Phase 2 and Phase 3.
266
- - If results are noisy (high variance) → increase to 100 cases or use a different seed
267
- to get more statistical power.
268
-
269
- ---
270
-
271
- ## Phase 2 — New Axes (F, G, H)
272
-
273
- **Goal:** Test 3 lightweight axes that are cheap to implement and orthogonal to B/C/D.
274
- **Build these ONLY after Phase 1 data is in.** Phase 1 results inform which axes matter most.
275
-
276
- ### Phase 2A — Track F: Prompt Architecture
277
-
278
- **Axis:** *How* the model is asked to reason, independent of depth (C) or breadth (D).
279
-
280
- **Why:** This is the cheapest axis to test — same token count, different structure. If prompt architecture matters more than retrieval or iteration, we want to know early.
281
-
282
- #### Step 2A.1: Build Track F
283
-
284
- Create `src/backend/tracks/prompt_arch/` with the track system conventions (see TRACKS.md "Adding a New Track").
285
-
286
- **Files to create:**
287
- ```
288
- tracks/prompt_arch/
289
- __init__.py # Track tag, package init
290
- config.py # PromptVariant dataclass + 5 variants
291
- reasoner.py # Modified clinical_reasoning that accepts prompt templates
292
- run_prompt_arch.py # Runner following same pattern as other tracks
293
- results/ # Output directory
294
- ```
295
-
296
- **Variant definitions:**
297
- | ID | Name | Strategy | Prompt Change |
298
- |----|------|----------|---------------|
299
- | F0 | Baseline | Current free-form | No change (control) |
300
- | F1 | Structured Template | Force structured output | System prompt: "For each symptom, list 3 possible causes. Identify diagnoses appearing in ≥2 symptom lists. Rank by frequency of appearance." |
301
- | F2 | Few-Shot | 2 worked examples | Add 2 solved MedQA cases (NOT from test set) to the system prompt as worked examples with reasoning chains |
302
- | F3 | Reverse Reasoning | Falsification | After initial differential: "For each of your top 5 diagnoses, list the findings you would EXPECT. Mark which are present, absent, or unknown in this patient. Re-rank based on match percentage." |
303
- | F4 | Bayesian | Prior updating | "Assign a prior probability to each diagnosis based on prevalence. For each finding, update posterior probability. Show the Bayesian reasoning chain. Final differential ordered by posterior." |
304
-
305
- **Implementation notes:**
306
- - `reasoner.py` should accept a `prompt_template: str` parameter and inject it into the system prompt or user prompt of the clinical reasoning call.
307
- - F0 uses the exact same system prompt as `app/tools/clinical_reasoning.py` — this is the control.
308
- - Few-shot examples (F2) need to come from MedQA TRAIN set, not the 50-case test set. Pick 2 from `validation/data/medqa_test.jsonl` that are NOT in the seed=42 sample, or create synthetic examples from textbook cases.
309
- - F3 and F4 require TWO LLM calls: first the initial differential, then the structured verification/update. This makes them comparable to C in cost but different in mechanism (structured verification vs. open-ended critique).
310
-
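The template-injection mechanism described in the notes above could be sketched like this. The F1 instruction text comes from the variant table; `BASE_SYSTEM`, `PromptVariant`, and `build_messages` are illustrative names, not the real `reasoner.py` API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BASE_SYSTEM = "You are a clinical reasoning assistant. Produce a ranked differential diagnosis."

@dataclass
class PromptVariant:
    variant_id: str
    template: Optional[str]  # extra instructions appended to the system prompt; None = control

VARIANTS = {
    "F0": PromptVariant("F0", None),
    "F1": PromptVariant(
        "F1",
        "For each symptom, list 3 possible causes. Identify diagnoses appearing "
        "in >=2 symptom lists. Rank by frequency of appearance.",
    ),
}

def build_messages(case_text: str, variant: PromptVariant) -> List[Dict[str, str]]:
    """Assemble chat messages, injecting the variant's template (F0 is the control)."""
    system = BASE_SYSTEM if variant.template is None else f"{BASE_SYSTEM}\n\n{variant.template}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": case_text},
    ]

msgs = build_messages("62M with crushing chest pain...", VARIANTS["F1"])
```

Keeping the variant as data rather than five copies of the reasoner keeps F0 byte-identical to the baseline prompt, which is what makes it a valid control.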
311
- #### Step 2A.2: Run Track F
312
-
313
- ```powershell
314
- # Smoke test
315
- python -m tracks.prompt_arch.run_prompt_arch --max-cases 3
316
-
317
- # Full sweep
318
- python -m tracks.prompt_arch.run_prompt_arch
319
- ```
320
-
321
- #### Step 2A.3: Identify F*
322
-
323
- ```
324
- F* = ____________
325
- F* top-1 accuracy = _____%
326
- F* improvement over F0 = +_____%
327
- ```
328
-
329
- ---
330
-
331
- ### Phase 2B — Track G: Multi-Sample Voting (Self-Consistency)
332
-
333
- **Axis:** Statistical diversity via repeated sampling at higher temperature.
334
-
335
- **Why:** Self-consistency is one of the most reliable accuracy boosters in the CoT literature. It's embarrassingly parallel and requires no new prompts — just `asyncio.gather()` over N samples.
336
-
337
- #### Step 2B.1: Build Track G
338
-
339
- Create `src/backend/tracks/voting/`.
340
-
341
- **Files:**
342
- ```
343
- tracks/voting/
344
- __init__.py
345
- config.py # VotingConfig: n_samples, temperature, aggregation_method
346
- voter.py # Generate N reasoning outputs, extract top-k diagnoses, vote
347
- run_voting.py
348
- results/
349
- ```
350
-
351
- **Variant definitions:**
352
- | ID | Samples | Temp | Aggregation | Description |
353
- |----|---------|------|-------------|-------------|
354
- | G0 | 1 | 0.3 | N/A | Control (identical to baseline) |
355
- | G1 | 3 | 0.5 | Majority vote | 3 samples, majority wins |
356
- | G2 | 5 | 0.5 | Majority vote | 5 samples, majority wins |
357
- | G3 | 5 | 0.7 | Weighted vote | 5 samples at higher diversity, weighted by internal consistency |
358
- | G4 | 3 | 0.5 | Best-of-N | 3 samples, pick the one whose differential best matches retrieved guidelines |
359
-
360
- **Implementation notes:**
361
- - `voter.py` calls `medgemma.generate()` N times in parallel with `asyncio.gather()`.
362
- - Temperature must be high enough to get diversity (≥0.5), otherwise all N samples will be nearly identical.
363
- - **Majority vote aggregation:** Extract top-1 diagnosis from each sample. The diagnosis appearing most frequently wins. If tied, use the one from the sample with the longest reasoning (proxy for confidence).
364
- - **Weighted vote (G3):** For each sample, check how many of its diagnoses are mentioned in the retrieved guidelines. Weight = number of guideline-grounded diagnoses. This penalizes hallucinated differentials.
365
- - **Best-of-N (G4):** Score each sample's differential against the retrieved guidelines using fuzzy_match overlap. Pick the highest-scoring sample wholesale.
366
- - Cost scales linearly: G2 costs 5× baseline reasoning per case.
367
-
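The majority-vote aggregation (with the longest-reasoning tie-break) described in the notes above can be sketched as follows; `DifferentialSample` and `majority_vote` are illustrative names, not the track's actual `voter.py` API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class DifferentialSample:
    """One sampled reasoning output (illustrative stand-in for the real schema)."""
    top1_diagnosis: str
    reasoning: str

def majority_vote(samples: List[DifferentialSample]) -> str:
    """Pick the most frequent top-1 diagnosis; break ties by longest reasoning."""
    counts = Counter(s.top1_diagnosis for s in samples)
    best_count = max(counts.values())
    tied = [d for d, c in counts.items() if c == best_count]
    if len(tied) == 1:
        return tied[0]
    # Tie-break: prefer the tied diagnosis whose sample has the longest reasoning
    # (a rough proxy for confidence, per the implementation notes).
    return max(
        (s for s in samples if s.top1_diagnosis in tied),
        key=lambda s: len(s.reasoning),
    ).top1_diagnosis

samples = [
    DifferentialSample("aortic stenosis", "short"),
    DifferentialSample("mitral regurgitation", "a much longer chain of reasoning"),
    DifferentialSample("aortic stenosis", "medium length"),
]
print(majority_vote(samples))  # aortic stenosis (2 of 3 votes)
```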
368
- #### Step 2B.2: Run Track G
369
-
370
- ```powershell
371
- python -m tracks.voting.run_voting --max-cases 3 # smoke
372
- python -m tracks.voting.run_voting # full
373
- ```
374
-
375
- #### Step 2B.3: Identify G*
376
-
377
- ```
378
- G* = ____________
379
- G* top-1 accuracy = _____%
380
- G* cost multiplier vs baseline = _____×
381
- ```
382
-
383
- ---
384
-
385
- ### Phase 2C — Track H: Evidence Verification (Post-Hoc Grounding)
386
-
387
- **Axis:** A structured fact-check pass that re-ranks the differential based on evidence alignment.
388
-
389
- **Why:** The model might rank a diagnosis #1 that isn't actually supported by the evidence. H catches this. It's different from C (which is open-ended self-critique) — H is specifically checking "does the evidence support this ranking?"
390
-
391
- #### Step 2C.1: Build Track H
392
-
393
- Create `src/backend/tracks/verification/`.
394
-
395
- **Files:**
396
- ```
397
- tracks/verification/
398
- __init__.py
399
- config.py # VerificationConfig
400
- verifier.py # Post-hoc evidence grounding check
401
- run_verification.py
402
- results/
403
- ```
404
-
405
- **Method for each case:**
406
- 1. Run baseline pipeline → get differential with top-5 diagnoses
407
- 2. For EACH diagnosis in the differential, make ONE LLM call:
408
- ```
409
- Patient findings: {summary}
410
- Retrieved guidelines: {relevant_guidelines}
411
- Diagnosis under review: {diagnosis_name}
412
-
413
- Task: List the specific findings from this patient that SUPPORT this diagnosis,
414
- the findings that ARGUE AGAINST it, and the findings that are NEUTRAL.
415
- Give a grounding score from 0-10 based on evidence alignment.
416
- ```
417
- 3. Re-rank the differential by grounding score (descending)
418
- 4. Use the re-ranked differential for scoring
419
-
420
- **Variant definitions:**
421
- | ID | Method | LLM Calls | Description |
422
- |----|--------|-----------|-------------|
423
- | H0 | None | 0 extra | Control |
424
- | H1 | Top-5 re-rank | 5 extra | Verify and re-rank all 5 diagnoses |
425
- | H2 | Top-3 re-rank | 3 extra | Verify only top 3 (cheaper) |
426
- | H3 | Eliminate-only | 5 extra | Don't re-rank — just DROP any diagnosis with score ≤3 and promote the rest |
427
-
428
- **Implementation notes:**
429
- - Use `medgemma.generate_structured()` with a Pydantic model for the grounding output:
430
- ```python
431
- class GroundingResult(BaseModel):
432
- diagnosis: str
433
- supporting_findings: List[str]
434
- opposing_findings: List[str]
435
- neutral_findings: List[str]
436
- grounding_score: int # 0-10
437
- ```
438
- - Temperature: 0.1 (this is extraction/evaluation, not generation)
439
- - Each verification call is independent → run all 5 in parallel with `asyncio.gather()`
440
-
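A minimal sketch of the H1 verify-and-re-rank step described above, with the structured LLM call stubbed out: `verify_one` stands in for the real `medgemma.generate_structured` call, and the grounding scores come from a canned table rather than the model.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class GroundingResult:
    diagnosis: str
    grounding_score: int  # 0-10, higher = better evidence alignment

# Stand-in scores; the real call would return a full GroundingResult from the model.
FAKE_SCORES: Dict[str, int] = {"dx_a": 4, "dx_b": 9, "dx_c": 7}

async def verify_one(diagnosis: str) -> GroundingResult:
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return GroundingResult(diagnosis, FAKE_SCORES[diagnosis])

async def rerank_by_grounding(differential: List[str]) -> List[str]:
    """Verify every diagnosis in parallel, then re-rank by grounding score."""
    results = await asyncio.gather(*(verify_one(d) for d in differential))
    return [r.diagnosis for r in sorted(results, key=lambda r: -r.grounding_score)]

reranked = asyncio.run(rerank_by_grounding(["dx_a", "dx_b", "dx_c"]))
print(reranked)  # ['dx_b', 'dx_c', 'dx_a']
```

Because the per-diagnosis calls are independent, `asyncio.gather` keeps H1's wall-clock cost close to a single verification call.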
441
- #### Step 2C.2: Run Track H
442
-
443
- ```powershell
444
- python -m tracks.verification.run_verification --max-cases 3
445
- python -m tracks.verification.run_verification
446
- ```
447
-
448
- #### Step 2C.3: Identify H*
449
-
450
- ```
451
- H* = ____________
452
- H* top-1 accuracy = _____%
453
- H* improvement over baseline = +_____%
454
- ```
455
-
456
- ---
457
-
458
- ### Phase 2D — Phase 2 Cross-Comparison
459
-
460
- After F, G, H are done:
461
-
462
- ```powershell
463
- python -m tracks.shared.compare --dataset medqa
464
- ```
465
-
466
- Update the shared compare.py to include tracks E/F/G/H before running (add entries to `TRACK_DIRS`).
467
-
468
- **Record Phase 2 summary:**
469
- ```
470
- F* = __________, accuracy = ____%, delta = +____%
471
- G* = __________, accuracy = ____%, delta = +____%, cost = _____×
472
- H* = __________, accuracy = ____%, delta = +____%
473
- ```
474
-
475
- **Rank all 6 axes by accuracy lift:**
476
- ```
477
- 1. Track ___ : +____% (cost: ___×)
478
- 2. Track ___ : +____% (cost: ___×)
479
- 3. Track ___ : +____% (cost: ___×)
480
- 4. Track ___ : +____% (cost: ___×)
481
- 5. Track ___ : +____% (cost: ___×)
482
- 6. Track ___ : +____% (cost: ___×)
483
- ```
484
-
485
- ---
486
-
487
- ## Phase 3 — Composition (Track E: Combined)
488
-
489
- **Goal:** Wire the per-axis winners together and test whether gains are additive.
490
- **Only start this after Phase 1 and Phase 2 data is in hand.**
491
-
492
- ### Step 3.1: Build Track E
493
-
494
- Create `src/backend/tracks/combined/`.
495
-
496
- **Files:**
497
- ```
498
- tracks/combined/
499
- __init__.py
500
- config.py # CombinedConfig: which B*/C*/D*/F*/G*/H* to compose
501
- pipeline.py # The composite pipeline that wires winners together
502
- run_combined.py
503
- results/
504
- ```
505
-
506
- **CombinedConfig should reference winner IDs from Phase 1 and 2:**
507
- ```python
508
- @dataclass
509
- class CombinedConfig:
510
- config_id: str
511
- rag_variant_id: Optional[str] # B* winner (or None = baseline retrieval)
512
- iterative_config_id: Optional[str] # C* winner (or None = no iteration)
513
- arbitrated_config_id: Optional[str] # D* winner (or None = single generalist)
514
- prompt_variant_id: Optional[str] # F* winner (or None = default prompt)
515
- voting_config_id: Optional[str] # G* winner (or None = single sample)
516
- verification_config_id: Optional[str] # H* winner (or None = no verification)
517
- composition_pattern: str # "E1", "E2", or "E3"
518
- description: str = ""
519
- ```
520
-
521
- ### Step 3.2: Implement 3 Composition Patterns
522
-
523
- **Pattern E1: Breadth-then-Depth** (recommended starting point)
524
- ```
525
- Parse
526
- → B* retriever (swap guideline retrieval)
527
- → F* prompt template (swap reasoning prompt)
528
- → D* specialists in parallel (each uses F* prompt)
529
- → D* arbiter merge → consensus differential
530
- → C* iterative refinement on consensus
531
- → H* evidence verification on refined output
532
- → G* voting: run the above N times and vote (if G* ≠ G0)
533
- → Drug Check + Conflict Detection
534
- → Synthesis
535
- ```
536
-
537
- **Pattern E2: Depth-within-Breadth**
538
- ```
539
- Parse
540
- → B* retriever
541
- → D* specialists, each with F* prompt, each running C* internal iteration
542
- → D* arbiter merge over refined specialist outputs
543
- → H* evidence verification
544
- → G* voting over the above
545
- → Drug Check + Conflict Detection
546
- → Synthesis
547
- ```
548
-
549
- **Pattern E3: Bookend (full loop)**
550
- ```
551
- Parse
552
- → B* retriever
553
- → D* specialists (round 1, F* prompt)
554
- → D* arbiter merge → rough consensus
555
- → C* iterative refinement on consensus
556
- → D* specialists again (round 2, with refined consensus as additional context)
557
- → D* arbiter re-merge → final differential
558
- → H* evidence verification
559
- → G* voting
560
- → Drug Check + Conflict Detection
561
- → Synthesis
562
- ```
563
-
564
- **Implementation guidance:**
565
- - Import existing track modules — do NOT duplicate code
566
- ```python
567
- from tracks.rag_variants.retriever import VariantRetriever
568
- from tracks.rag_variants.config import VARIANTS
569
- from tracks.iterative.refiner import IterativeRefiner
570
- from tracks.iterative.config import CONFIGS as ITERATIVE_CONFIGS
571
- from tracks.arbitrated.specialists import run_specialists_parallel
572
- from tracks.arbitrated.arbiter import Arbiter
573
- from tracks.arbitrated.config import CONFIGS as ARBITRATED_CONFIGS
574
- ```
575
- - The orchestrator's tools are swappable: `orchestrator.guideline_retrieval = variant_retriever`
576
- - Use a single `CostLedger` that spans ALL stages so the total cost is tracked
577
-
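The swappable-tool and shared-ledger guidance above can be sketched as follows. Everything here is illustrative: this toy `Orchestrator`, `CostLedger`, and the two retrieval functions only mimic the shape of the real classes in `app/` and `tracks/`.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CostLedger:
    """Single ledger shared across all composed stages (illustrative)."""
    entries: List[float] = field(default_factory=list)

    def charge(self, usd: float) -> None:
        self.entries.append(usd)

    @property
    def total(self) -> float:
        return sum(self.entries)

def baseline_retrieval(query: str, ledger: CostLedger) -> str:
    ledger.charge(0.001)
    return f"baseline docs for: {query}"

def variant_retrieval(query: str, ledger: CostLedger) -> str:
    # Stand-in for a B* winner (e.g. a different chunking/embedding retriever).
    ledger.charge(0.002)
    return f"B* docs for: {query}"

@dataclass
class Orchestrator:
    """Toy orchestrator whose retrieval tool is a plain attribute, so a composed
    pipeline can swap it without modifying baseline (Track A) code."""
    guideline_retrieval: Callable[[str, CostLedger], str] = baseline_retrieval

    def run(self, case: str, ledger: CostLedger) -> str:
        return self.guideline_retrieval(case, ledger)

ledger = CostLedger()
orch = Orchestrator()
orch.guideline_retrieval = variant_retrieval  # the swap described above
out = orch.run("chest pain", ledger)
```

Passing one ledger through every stage is what makes the per-pattern cost numbers in Step 3.4 comparable.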
578
- ### Step 3.3: Run Compositions
579
-
580
- ```powershell
581
- # Start with E1 (simplest)
582
- python -m tracks.combined.run_combined --pattern E1 --max-cases 3 # smoke
583
- python -m tracks.combined.run_combined --pattern E1 # full 50 cases
584
-
585
- # Then E2 and E3 if E1 shows promise
586
- python -m tracks.combined.run_combined --pattern E2 --max-cases 10
587
- python -m tracks.combined.run_combined --pattern E3 --max-cases 10
588
- ```
589
-
590
- ### Step 3.4: Evaluate Composition
591
-
592
- **Record:**
593
- ```
594
- E1 top-1 accuracy = _____% | cost/case = $_____ | runtime/case = ___s
595
- E2 top-1 accuracy = _____% | cost/case = $_____ | runtime/case = ___s
596
- E3 top-1 accuracy = _____% | cost/case = $_____ | runtime/case = ___s
597
-
598
- Best single track: Track ___ at ____%
599
- Best composition: Pattern ___ at ____%
600
- Composition lift vs best single track: +____%
601
- ```
602
-
603
- **Key questions to answer:**
604
- 1. Are the gains from B/C/D/F/G/H additive when composed? (If E1 ≈ best single track, they're not.)
605
- 2. Which pattern gives the best accuracy/cost ratio?
606
- 3. Is there a simpler 2-axis composition (e.g., B+C only) that gets 80% of the E1 benefit at 30% of the cost?
607
-
608
- ### Step 3.5: Test Partial Compositions
609
-
610
- Based on the Phase 1+2 ranking, test 2-axis combos of the top 3 axes:
611
-
612
- ```
613
- E_BC: B* + C* only (better retrieval + iteration)
614
- E_BD: B* + D* only (better retrieval + specialists)
615
- E_BF: B* + F* only (better retrieval + prompt architecture)
616
- E_CD: C* + D* only (iteration + specialists)
617
- E_BH: B* + H* only (better retrieval + verification)
618
- ```
619
-
620
- This tells us which pairs compose well and which interfere. Run each at 50 cases.
621
-
622
- **Record pair interaction matrix:**
623
- ```
624
- B* C* D* F* G* H*
625
- B* - ____% ____% ____% ____% ____%
626
- C* - ____% ____% ____% ____%
627
- D* - ____% ____% ____%
628
- F* - ____% ____%
629
- G* - ____%
630
- H* -
631
- ```
632
- (Each cell = top-1 accuracy of that 2-axis composition)
633
-
634
- ---
635
-
636
- ## Phase 4 — Cherry-Pick and Finalize
637
-
638
- **Goal:** Take the best composition from Phase 3 and apply any remaining optimizations.
639
-
640
- ### Step 4.1: Lock the Winner
641
-
642
- Based on Phase 3 data, select the final pipeline configuration:
643
-
644
- ```
645
- FINAL CONFIG:
646
- Retrieval: ____________ (B variant or baseline)
647
- Prompt: ____________ (F variant or baseline)
648
- Reasoning: ____________ (D config, or single generalist)
649
- Iteration: ____________ (C config, or none)
650
- Verification: ____________ (H config, or none)
651
- Voting: ____________ (G config, or single sample)
652
- Composition: ____________ (E pattern)
653
- Top-1 accuracy: ____%
654
- Cost per case: $____
655
- Runtime per case: ____s
656
- ```
657
-
658
- ### Step 4.2: 100-Case Validation
659
-
660
- Run the final config against an expanded dataset to confirm the result isn't a fluke:
661
-
662
- ```powershell
663
- # If possible, run 100 MedQA cases (load more from the JSONL)
664
- python -m tracks.combined.run_combined --pattern <winner> --max-cases 100
665
- ```
666
-
667
- **If 100-case accuracy is within ±3% of 50-case accuracy:** The result is stable.
668
- **If it drops by >5%:** We overfit to the 50-case sample. Re-evaluate.
669
-
670
- ### Step 4.3: Run Complementary Benchmarks
671
-
672
- Run the winner through MTSamples and PMC harnesses (if available) to show generalization:
673
-
674
- ```powershell
675
- # These may need adaptation to work with the combined pipeline
676
- python -m validation.run_validation --mtsamples --max-cases 20
677
- python -m validation.run_validation --pmc --max-cases 10
678
- ```
679
-
680
- ### Step 4.4: Update Submission Materials
681
-
682
- 1. **Update `docs/kaggle_writeup.md`** with final accuracy numbers, the winning configuration,
683
- and the experimental journey (which axes mattered, which didn't, composition effects).
684
-
685
- 2. **Update `docs/video_script.md`** if the demo pipeline changed significantly (e.g., if the
686
- best config uses specialists, the video should show the specialist pipeline).
687
-
688
- 3. **Update `docs/architecture.md`** with the final pipeline diagram.
689
-
690
- 4. **Push to GitHub:**
691
- ```powershell
692
- git add -A
693
- git commit -m "Phase 4: Final pipeline configuration - XX% top-1 accuracy"
694
- git push
695
- ```
696
-
697
- ### Step 4.5: Record Demo Video
698
-
699
- Follow `docs/video_script.md` with the FINAL pipeline configuration running live.
700
-
701
- ### Step 4.6: Submit on Kaggle
702
-
703
- Follow `docs/kaggle_writeup.md` submission steps. Include:
704
- - Final writeup with experimental results
705
- - Video link
706
- - GitHub repo link
707
- - (Optional) Live demo URL if deployed
708
-
709
- ---
710
-
711
- ## Decision Log
712
-
713
- Use this section to record key decisions as you execute the plan.
714
-
715
- ### Phase 1 Results
716
- ```
717
- Date: ___________
718
-
719
- B* = ___________ accuracy: ____% delta: +____% latency: ____ms
720
- C* = ___________ accuracy: ____% delta: +____% avg_iters: ____
721
- D* = ___________ accuracy: ____% delta: +____% cost/case: $____
722
-
723
- Best single axis: Track ___
724
- Notes:
725
- ```
726
-
727
- ### Phase 2 Results
728
- ```
729
- Date: ___________
730
-
731
- F* = ___________ accuracy: ____% delta: +____%
732
- G* = ___________ accuracy: ____% delta: +____% cost: ____×
733
- H* = ___________ accuracy: ____% delta: +____%
734
-
735
- Ranked axes (by lift):
736
- 1. ___ 2. ___ 3. ___ 4. ___ 5. ___ 6. ___
737
-
738
- Notes:
739
- ```
740
-
741
- ### Phase 3 Results
742
- ```
743
- Date: ___________
744
-
745
- E1 accuracy: ____% cost/case: $____
746
- E2 accuracy: ____% cost/case: $____
747
- E3 accuracy: ____% cost/case: $____
748
-
749
- Best pair: ___ + ___ accuracy: ____%
750
- Best triple: ___ + ___ + ___ accuracy: ____%
751
-
752
- Notes:
753
- ```
754
-
755
- ### Phase 4 Final
756
- ```
757
- Date: ___________
758
-
759
- Final config: ___________________________
760
- Final accuracy (50-case): ____%
761
- Final accuracy (100-case): ____%
762
- Cost per case: $____
763
- Runtime per case: ____s
764
-
765
- Submitted: [ ] Yes [ ] No
766
- Video recorded: [ ] Yes [ ] No
767
- ```
768
-
769
- ---
770
-
771
- ## Time Budget
772
-
773
- | Phase | Estimated Endpoint Hours | Estimated Wall Clock | Estimated Cost |
774
- |-------|-------------------------|---------------------|---------------|
775
- | Phase 1 (B+C+D) | 8–12 hrs | 1–2 days | $20–30 |
776
- | Phase 2 (F+G+H) | 6–10 hrs | 1–2 days | $15–25 |
777
- | Phase 3 (Compositions) | 4–8 hrs | 1 day | $10–20 |
778
- | Phase 4 (Finalize) | 2–3 hrs | 1 day | $5–8 |
779
- | **Total** | **20–33 hrs** | **4–7 days** | **$50–83** |
780
-
781
- **Deadline:** February 24, 2026, 11:59 PM UTC
782
- **Today:** February 15, 2026
783
- **Available:** ~9 days
784
-
785
- **Suggested schedule:**
786
- - Feb 15–16: Phase 1 (run overnight, collect in morning)
787
- - Feb 17–18: Phase 2 (build F/G/H, run overnight)
788
- - Feb 19–20: Phase 3 (compositions)
789
- - Feb 21–22: Phase 4 (finalize, video, writeup update)
790
- - Feb 23: Buffer day + final submission
791
- - Feb 24: Deadline
792
-
793
- ---
794
-
795
- ## Abort Conditions
796
-
797
- Stop and re-evaluate the strategy if:
798
-
799
- 1. **Endpoint costs exceed $100 total** — we're overspending for marginal gains
800
- 2. **All Phase 1 tracks show <2% lift** — the model, not the pipeline, is the bottleneck. Consider:
801
- - Switching to `medgemma-4b-it` for faster iteration on prompts
802
- - Focusing entirely on prompt architecture (Track F)
803
- - Reducing scope to best-effort with current accuracy + strong writeup
804
- 3. **Phase 3 compositions LOSE accuracy vs single tracks** — negative interaction effects. Simplify back to best single track.
805
- 4. **Consistent pipeline failures (>10% error rate)** — endpoint stability issue. Fix infrastructure before continuing experiments.
806
- 5. **February 22 reached without Phase 3 complete** — lock whatever is best so far and move directly to Phase 4 (finalize + submit). Do not risk missing the deadline for marginal gains.
README.md CHANGED
@@ -14,8 +14,8 @@ custom_domains:
  
  > An agentic clinical decision support application that orchestrates medical AI with specialized tools to assist clinicians in real time.
  
- **Origin:** [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge) (Kaggle / Google Research)
- **Focus:** Building a genuinely impactful medical application not just a competition entry.
+ **Live demo:** [demo.briansheppard.com](https://demo.briansheppard.com)
+ **Origin:** Built for the [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge) (Kaggle / Google Research).
  
  ---
  
@@ -156,73 +156,44 @@ Sources include ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, APA, AAP, ACR, ASH, K
  
  ```
  medgemma_impact_challenge/
- ├── README.md                            # This file
- ├── DEVELOPMENT_LOG.md                   # Chronological build history & decisions
- ├── SUBMISSION_GUIDE.md                  # Competition submission strategy
- ├── RULES_SUMMARY.md                     # Competition rules checklist
+ ├── README.md
+ ├── CLAUDE.md                            # AI assistant context
+ ├── DEVELOPMENT_LOG.md                   # Build history & decisions
  ├── docs/
- │   ├── architecture.md                  # System architecture & design decisions
- │   ├── test_results.md                  # Detailed test results & benchmarks
- │   ├── writeup_draft.md                 # Project writeup / summary
- │   └── deploy_medgemma_hf.md            # MedGemma HF Endpoint deployment guide
+ │   ├── architecture.md                  # System architecture & design
+ │   ├── test_results.md                  # Test results & benchmarks
+ │   └── deploy_medgemma_hf.md            # HF Endpoint deployment guide
  ├── src/
- │   ├── backend/                         # Python FastAPI backend
- │   │   ├── .env.template                # Environment config template
- │   │   ├── .env                         # Local config (not committed)
- │   │   ├── requirements.txt             # Python dependencies (28 packages)
- │   │   ├── test_e2e.py                  # End-to-end pipeline test
- │   │   ├── test_clinical_cases.py       # 22 clinical scenario test suite
- │   │   ├── test_rag_quality.py          # RAG retrieval quality tests (30 queries)
- │   │   ├── test_poll.py                 # Simple case poller utility
- │   │   ├── validation/                  # External dataset validation framework
- │   │   │   ├── base.py                  # Core framework (runners, scorers, utilities)
- │   │   │   ├── harness_medqa.py         # MedQA (USMLE) diagnostic accuracy harness
- │   │   │   ├── harness_mtsamples.py     # MTSamples parse quality harness
- │   │   │   ├── harness_pmc.py           # PMC Case Reports diagnostic harness
- │   │   │   ├── run_validation.py        # Unified CLI runner
- │   │   │   ├── analyze_results.py       # Question-type categorization & analysis
- │   │   │   └── check_progress.py        # Checkpoint progress monitor
+ │   ├── backend/
+ │   │   ├── requirements.txt
+ │   │   ├── test_e2e.py                  # End-to-end pipeline test
+ │   │   ├── test_clinical_cases.py       # 22 clinical scenario test suite
+ │   │   ├── test_rag_quality.py          # RAG retrieval quality tests
+ │   │   ├── validation/                  # External dataset validation
+ │   │   │   ├── harness_medqa.py         # MedQA (USMLE) accuracy
+ │   │   │   ├── harness_mtsamples.py     # MTSamples parse quality
+ │   │   │   └── harness_pmc.py           # PMC Case Reports accuracy
+ │   │   ├── tracks/                      # Experimental pipeline variants
  │   │   └── app/
- │   │       ├── main.py                  # FastAPI entry (CORS, routers, lifespan)
- │   │       ├── config.py                # Pydantic Settings (ports, models, dirs)
- │   │       ├── __init__.py
- │   │       ├── models/
- │   │       │   └── schemas.py           # All Pydantic models (~280 lines)
- │   │       ├── agent/
- │   │       │   └── orchestrator.py      # 6-step pipeline orchestrator (~300 lines)
- │   │       ├── services/
- │   │       │   └── medgemma.py          # LLM service (OpenAI-compatible API)
+ │   │       ├── main.py                  # FastAPI entry point
+ │   │       ├── config.py                # Settings
+ │   │       ├── agent/orchestrator.py    # 6-step pipeline orchestrator
+ │   │       ├── services/medgemma.py     # LLM service (OpenAI-compatible)
+ │   │       ├── models/schemas.py        # Pydantic data models
  │   │       ├── tools/
- │   │       │   ├── patient_parser.py    # Step 1: Free-text → structured data
- │   │       │   ├── clinical_reasoning.py # Step 2: Differential diagnosis
- │   │       │   ├── drug_interactions.py # Step 3: OpenFDA + RxNorm
- │   │       │   ├── guideline_retrieval.py # Step 4: RAG over ChromaDB
- │   │       │   ├── conflict_detection.py # Step 5: Guideline vs patient conflicts
- │   │       │   └── synthesis.py         # Step 6: CDS report generation
+ │   │       │   ├── patient_parser.py    # Step 1: Free-text → structured data
+ │   │       │   ├── clinical_reasoning.py # Step 2: Differential diagnosis
+ │   │       │   ├── drug_interactions.py # Step 3: OpenFDA + RxNorm
+ │   │       │   ├── guideline_retrieval.py # Step 4: RAG over ChromaDB
+ │   │       │   ├── conflict_detection.py # Step 5: Guideline vs patient gaps
+ │   │       │   └── synthesis.py         # Step 6: CDS report generation
- │   │       ├── data/
- │   │       │   └── clinical_guidelines.json # 62 guidelines, 14 specialties
- │   │       └── api/
- │   │           ├── health.py            # GET /api/health
- │   │           ├── cases.py             # POST /api/cases/submit, GET /api/cases/{id}
- │   │           └── ws.py                # WebSocket /ws/agent
- │   └── frontend/                        # Next.js 14 + React 18 + TypeScript
- │       ├── package.json
- │       ├── next.config.js               # API proxy → backend
- │       ├── tailwind.config.js
+ │   │       ├── data/clinical_guidelines.json # 62 guidelines, 14 specialties
+ │   │       └── api/                     # REST + WebSocket endpoints
+ │   └── frontend/                        # Next.js 14 + React 18 + TypeScript
  │       └── src/
- │           ├── app/
- │           │   ├── layout.tsx
- │           │   ├── page.tsx             # Main CDS interface
- │           │   └── globals.css
- │           ├── components/
- │           │   ├── PatientInput.tsx     # Patient case input + 3 sample cases
- │           │   ├── AgentPipeline.tsx    # Real-time step visualization
- │           │   └── CDSReport.tsx        # Final report renderer
- │           └── hooks/
- │               └── useAgentWebSocket.ts # WebSocket state management
- ├── notebooks/                           # Experiment notebooks
- ├── models/                              # Fine-tuned models (future)
- └── demo/                                # Video & demo assets
+ │           ├── components/              # PatientInput, AgentPipeline, CDSReport
+ │           └── hooks/                   # WebSocket state management
+ └── Dockerfile                           # HuggingFace Spaces deployment
  ```
  
  ---
@@ -344,20 +315,16 @@ curl -X POST http://localhost:8000/api/cases/submit \
  
  ---
  
- ## Documentation Index
+ ## Documentation
  
  | Document | Description |
  |----------|-------------|
- | [README.md](README.md) | This file — overview, setup, results |
  | [docs/architecture.md](docs/architecture.md) | System architecture, pipeline design, design decisions |
  | [docs/test_results.md](docs/test_results.md) | Detailed test results, RAG benchmarks, pipeline timing |
+ | [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HuggingFace Endpoint deployment guide |
  | [DEVELOPMENT_LOG.md](DEVELOPMENT_LOG.md) | Chronological build history, problems solved, decisions made |
- | [docs/writeup_draft.md](docs/writeup_draft.md) | Project writeup / summary |
- | [CONTRIBUTING.md](CONTRIBUTING.md) | How to contribute to the project |
+ | [CONTRIBUTING.md](CONTRIBUTING.md) | How to contribute |
  | [SECURITY.md](SECURITY.md) | Security policy and responsible disclosure |
- | [TODO.md](TODO.md) | Next-session action items and project state |
- | [SUBMISSION_GUIDE.md](SUBMISSION_GUIDE.md) | Competition submission strategy |
- | [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HuggingFace Endpoint deployment guide |
  
  ---
  
RULES_SUMMARY.md DELETED
@@ -1,113 +0,0 @@
- # Rules Summary & Compliance Checklist
-
- > Distilled from the full competition rules. When in doubt, refer to the [full rules](rules.txt).
-
- ---
-
- ## Eligibility
-
- - [x] Must have a registered Kaggle account
- - [x] Must be 18+ (or age of majority in your jurisdiction)
- - [x] Cannot be a resident of: Crimea, DNR, LNR, Cuba, Iran, Syria, or North Korea
- - [x] Cannot be under U.S. export controls or sanctions
- - [x] Google/Kaggle employees may participate but **cannot win prizes**
- - [x] Only **one Kaggle account** per person — no multi-accounting
-
- ---
-
- ## Team Rules
-
- | Rule | Detail |
- |------|--------|
- | Max team size | **5 members** |
- | Team mergers | Allowed before merger deadline |
- | Submissions per team | **1** (can be edited and re-submitted) |
- | Account requirement | Each member needs their own Kaggle account |
- | Must confirm membership | Respond to team notification message |
-
- ---
-
- ## Submission Rules
-
- - **One submission per team** — this single entry covers Main Track + one special award
- - Submission format: **Kaggle Writeup** attached to the competition page
- - Can un-submit, edit, and re-submit unlimited times before deadline
- - Must be received before **February 24, 2026 at 11:59 PM UTC**
-
- ### Private Resources Warning
- > If you attach a **private Kaggle Resource** to your public Writeup, it will **automatically become public** after the deadline.
-
- ---
-
- ## Data & External Resources
-
- | Rule | Detail |
- |------|--------|
- | Competition data | **None provided** |
- | External data | Allowed — must be publicly available & free for all participants |
- | HAI-DEF models | Subject to [HAI-DEF Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms) |
- | Proprietary datasets | Not allowed if cost exceeds "Reasonableness Standard" |
- | AutoML tools | Allowed if properly licensed |
- | Open source | Must use OSI-approved licenses |
-
- ---
-
- ## Code Sharing Rules
-
- | Type | Allowed? | Conditions |
- |------|----------|------------|
- | **Private sharing** (between teams) | **NO** | Grounds for disqualification |
- | **Private sharing** (within team) | Yes | — |
- | **Public sharing** | Yes | Must be shared on Kaggle (forums/notebooks) for all participants |
-
- ---
-
- ## Winner Obligations
-
- If you win, you must:
-
- 1. **Deliver final code** — training code, inference code, environment description
- 2. **Grant CC BY 4.0 license** on your winning submission
- 3. **Sign prize acceptance documents** within 2 weeks of notification
- 4. **Complete tax forms** (W-9 for US, W-8BEN for foreign residents)
- 5. **Respond to winner notification** within 1 week
-
- > If using commercially available software you don't own, you must identify it and explain how to procure it.
- > If input data/pretrained models have incompatible licenses, you don't need to grant open source license for those.
-
- ---
-
- ## Prize Distribution
-
- - Monetary prizes split **evenly** among eligible team members (unless team unanimously agrees to different split)
- - **All taxes are the winner's responsibility**
- - Prizes awarded ~30 days after acceptance documents received
- - Prizes **cannot be transferred or assigned**
-
- ---
-
- ## Disqualification Risks
-
- You can be disqualified for:
- - Using multiple Kaggle accounts
- - Private code sharing outside your team
- - Cheating, deception, or unfair practices
- - Threatening or harassing other participants
- - Not meeting submission requirements
- - Providing false personal information
- - Using non-publicly-available external data
-
- ---
-
- ## Governing Law
-
- - California law applies
- - Disputes litigated in Santa Clara County, California, USA
-
- ---
-
- ## Key Contacts
-
- - **Competition Sponsor:** Google Research — 1600 Amphitheatre Parkway, Mountain View, CA 94043
- - **Platform:** Kaggle Inc.
- - **Support:** www.kaggle.com/contact
SUBMISSION_GUIDE.md DELETED
@@ -1,152 +0,0 @@
- # Submission & Strategy Guide
-
- ## Timeline at a Glance
-
- ```
- Jan 13 ─────────────────────── Feb 24 ──────────── Mar 17-24
- START                          DEADLINE 11:59 PM UTC  RESULTS
-        ◄────── Build & Iterate ──────►
- ```
-
- **⏰ Days remaining as of Feb 15, 2026: ~9 days**
-
- ---
-
- ## Winning Strategy by Track
-
- ### Main Track ($75K)
- Focus on **Execution & Communication (30%)** — this is the highest-weighted criterion. A polished video, clean write-up, and well-organized code can make the difference.
-
- **Priority order:**
- 1. **Execution & Communication (30%)** — Polish everything
- 2. **Effective Use of HAI-DEF (20%)** — Show the models are essential, not bolted on
- 3. **Product Feasibility (20%)** — Prove it can work in production
- 4. **Problem Domain (15%)** — Tell a compelling story about who benefits
- 5. **Impact Potential (15%)** — Quantify the impact with clear estimates
-
- ### Agentic Workflow Prize ($10K)
- - Deploy HAI-DEF models as **intelligent agents** or **callable tools**
- - Demonstrate a **significant overhaul** of a challenging process
- - Show improved efficiency and outcomes via agentic AI
-
- ### Novel Task Prize ($10K)
- - **Fine-tune** a HAI-DEF model for a task it wasn't originally designed for
- - The more creative and useful the adaptation, the better
- - Document fine-tuning methodology thoroughly
-
- ### Edge AI Prize ($5K)
- - Run a HAI-DEF model on **local/edge hardware** (phone, scanner, etc.)
- - Focus on model optimization: quantization, distillation, pruning
- - Demonstrate real-world field deployment scenarios
-
- ---
-
- ## Submission Checklist
-
- ### Required Deliverables
- - [ ] **Kaggle Writeup** — 3 pages or less, following the template
- - [ ] **Video demo** — 3 minutes or less
- - [ ] **Public code repository** — linked in writeup
- - [ ] Uses **at least one HAI-DEF model** (e.g., MedGemma)
- - [ ] Code is **reproducible**
-
- ### Bonus Deliverables
- - [ ] Public interactive live demo app
- - [ ] Open-weight Hugging Face model tracing to HAI-DEF
-
- ### Write-up Quality
- - [ ] Clear project name
- - [ ] Team members with specialties and roles listed
- - [ ] Problem statement addresses "Problem Domain" and "Impact Potential" criteria
- - [ ] Overall solution addresses "Effective Use of HAI-DEF Models" criterion
- - [ ] Technical details address "Product Feasibility" criterion
- - [ ] All links (video, code, demo) are working and accessible
-
- ### Video Quality
- - [ ] 3 minutes or less
- - [ ] Demonstrates the application in action
- - [ ] Explains the problem and solution clearly
- - [ ] Shows HAI-DEF model integration
- - [ ] Professional quality (clear audio, good visuals)
-
- ### Code Quality
- - [ ] Well-organized repository structure
- - [ ] Clear README with setup instructions
- - [ ] Code is commented and readable
- - [ ] Dependencies are documented (requirements.txt / environment.yml)
- - [ ] Results are reproducible from the repository
-
- ---
-
- ## Video Tips (30% of score rides on execution)
-
- 1. **Open with the problem** (30 sec) — Who suffers? What's broken?
- 2. **Show the solution** (90 sec) — Live demo, not just slides
- 3. **Explain the tech** (30 sec) — Which HAI-DEF model, how it's used
- 4. **Quantify impact** (15 sec) — Numbers, estimates, or projections
- 5. **Close strong** (15 sec) — Vision for the future
-
- ---
-
- ## Technical Approach Suggestions
-
- ### Application Ideas Aligned to Criteria
-
- | Idea | Models | Special Award Fit |
- |------|--------|-------------------|
- | Clinical note summarizer with agent routing | MedGemma | Agentic Workflow |
- | Radiology triage assistant | MedGemma (vision) | Main Track |
- | Dermatology screening on mobile | MedGemma (quantized) | Edge AI |
- | Pathology slide analysis for rare diseases | MedGemma (fine-tuned) | Novel Task |
- | Patient education chatbot | MedGemma | Main Track |
- | Lab result interpreter agent pipeline | MedGemma + tools | Agentic Workflow |
- | Wound assessment via phone camera | MedGemma (vision, edge) | Edge AI |
-
- ### Key Technical Considerations
-
- 1. **Model Selection** — Choose the right HAI-DEF model variant for your task
- 2. **Fine-tuning** — Document methodology, hyperparameters, dataset curation
- 3. **Evaluation** — Include performance metrics and analysis
- 4. **Deployment** — Describe your app stack and how it would scale
- 5. **Privacy** — Healthcare data is sensitive; address HIPAA/privacy considerations
- 6. **External Data** — Must be publicly available and equally accessible to all participants
-
- ---
-
- ## External Data & Tools Rules
-
- - External data is allowed but must be **publicly available at no cost** to all participants
- - Use of HAI-DEF/MedGemma is subject to [HAI-DEF Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms)
- - Open source code must use an **OSI-approved license**
- - AutoML tools are permitted if properly licensed
- - **No private code sharing** outside your team during the competition
- - Public code sharing must be done on Kaggle forums/notebooks
-
- ---
-
- ## Draft Writeup Workspace
-
- Use `docs/writeup_draft.md` to iterate on your writeup before submitting on Kaggle:
-
- ```markdown
- ### Project name
- [TODO]
-
- ### Your team
- [TODO: Name, specialty, role for each member]
-
- ### Problem statement
- [TODO: Define the problem, who's affected, magnitude, why AI is the right solution]
- [TODO: Articulate impact — what changes if this works? How did you estimate impact?]
-
- ### Overall solution
- [TODO: Which HAI-DEF model(s)? Why are they the right choice?]
- [TODO: How does the application use them to their fullest potential?]
-
- ### Technical details
- [TODO: Architecture diagram / description]
- [TODO: Fine-tuning details (if applicable)]
- [TODO: Performance metrics and analysis]
- [TODO: Deployment stack and challenges]
- [TODO: How this works in practice, not just benchmarks]
- ```
TODO.md DELETED
@@ -1,141 +0,0 @@
1
- # TODO — Next Session Action Items
2
-
3
- > **Last updated:** Feb 15, 2026 — Experimental track system built.
4
- > **Read this first** if you're a new AI instance picking up this project.
5
- > **See also:** `CLAUDE.md` (project intelligence) and `TRACKS.md` (track registry).
6
-
7
- ---
8
-
9
- ## High Priority (Do Next)
10
-
11
- ### 1. Run Experimental Tracks
12
-
13
- Three experimental tracks are built and ready to test. See `TRACKS.md` for full details.
-
- **Track B — RAG Variants** (`src/backend/tracks/rag_variants/`)
- ```bash
- cd src/backend
- python -m tracks.rag_variants.run_variants --max-cases 10  # smoke test
- python -m tracks.rag_variants.run_variants                 # full sweep
- ```
- Tests 10 configurations: chunking strategies (none, fixed-256, fixed-512, sentence, overlap), embedding models (MiniLM-L6, MiniLM-L12, MPNet, MedCPT), top-k sweep (3, 5, 10), and reranking.
-
- **Track C — Iterative Refinement** (`src/backend/tracks/iterative/`)
- ```bash
- python -m tracks.iterative.run_iterative --max-cases 10
- python -m tracks.iterative.run_iterative
- ```
- Tests 4 configurations: 2-round, 3-round, 5-round, and aggressive-critic. Produces cost/benefit data per iteration.
-
- **Track D — Arbitrated Parallel** (`src/backend/tracks/arbitrated/`)
- ```bash
- python -m tracks.arbitrated.run_arbitrated --max-cases 10
- python -m tracks.arbitrated.run_arbitrated
- ```
- Tests 4 configurations: 3-specialist/1-round, 5-specialist/1-round, 3-specialist/2-round, 5-specialist/2-round. Specialists: Cardiologist, Neurologist, ID, General IM, Emergency Medicine.
-
- **Prerequisites:**
- - Resume HF Endpoint (`medgemma-27b-cds`) — allow 5–15 min cold start (~$2.50/hr)
- - Activate venv: `src/backend/venv/`
- - May need: `pip install sentence-transformers` for MedCPT/MPNet/reranking variants
-
- ### 2. Record the Demo Video
-
- The video script is ready (`docs/video_script.md`); the recording itself is still to do:
- 1. Resume HF Endpoint
- 2. Start backend + frontend locally
- 3. Record ~3 min screencast following the script
- 4. Upload to YouTube/Loom and get the link
-
- ### 3. Submit on Kaggle
-
- Kaggle writeup content is ready: `docs/kaggle_writeup.md`. Steps:
- 1. Go to competition page → "New Writeup"
- 2. Paste writeup content (fill in team name/member info first)
- 3. Select tracks: Main Track + Agentic Workflow Prize
- 4. Add links: video URL, GitHub repo, (optional) live demo
- 5. Click Submit
- 6. **Fill in [Your Name] placeholder** in the team table
-
- ---
-
- ## Medium Priority
-
- ### 4. CI Gating on Validation Scores
-
- Add a GitHub Action or pre-commit check that runs a small validation suite (e.g., 5 MedQA cases) and fails if top-1 accuracy drops below a threshold. This prevents regressions.
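The gate itself can be a few lines of Python the CI step calls after the suite runs. A minimal sketch — the function and result-field names here are hypothetical and would need to be adapted to whatever `run_validation.py` actually emits:

```python
# Sketch of a CI accuracy gate (field names hypothetical — adapt to the
# validation runner's real JSON output). The CI job runs the 5-case suite,
# loads its results, and fails the build when passes_gate() returns False.
def top1_accuracy(results: list[dict]) -> float:
    """Fraction of cases whose top-1 prediction matched the ground truth."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("top1_correct")) / len(results)


def passes_gate(results: list[dict], threshold: float = 0.30) -> bool:
    """True if the small validation suite meets the accuracy threshold."""
    return top1_accuracy(results) >= threshold
```

A wrapper script would `sys.exit(1)` when the gate fails, which is what makes the GitHub Action step (or pre-commit hook) report a failure.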
-
- ### 5. PMC Harness Improvements
-
- The PMC case fetcher currently gets ~5 cases per run. The limiting factor is title-based diagnosis extraction — many PubMed case report titles don't follow parseable patterns. Options:
- - Use the full-text XML API (not just abstracts) to extract "final diagnosis" from structured sections
- - Add more title regex patterns
- - Use the LLM to extract the diagnosis from the abstract itself (meta, but effective)
-
- ### 6. Calibrated Uncertainty Indicators
-
- We deliberately removed numeric confidence scores (see Phase 8 in DEVELOPMENT_LOG.md). If revisiting uncertainty communication:
- - Consider evidence-strength indicators per recommendation instead of a single composite score
- - Look at conformal prediction or test-time compute approaches if fine-tuning
- - Do NOT add back uncalibrated float scores — the anchoring bias risk is real
-
- ---
-
- ## Low Priority / Future
-
- ### 7. Model Optimization
-
- Currently using `google/medgemma-27b-text-it` on 1× A100 80 GB. Options:
- - Smaller/quantized models for latency reduction (medgemma-4b-it for lighter steps)
- - Specialized models for individual pipeline steps (e.g., a parse-only model)
- - Batch inference optimizations
-
- ### 8. EHR Integration Prototype
-
- Current input is manual text paste. A FHIR client could auto-populate patient data. This is a significant scope expansion but would dramatically increase real-world usability.
-
- ### 9. Frontend Polish
-
- - Loading skeletons during pipeline execution
- - Dark mode
- - Export report as PDF
- - Mobile-responsive layout
-
- ---
-
- ## Project State Summary
-
- | Component | Status | Notes |
- |-----------|--------|-------|
- | Backend (6-step pipeline) | ✅ Complete | All steps working, conflict detection added |
- | Frontend (Next.js) | ✅ Complete | Real-time pipeline viz, CDS report with conflicts |
- | RAG (62 guidelines) | ✅ Complete | 30/30 quality test, 100% top-1 accuracy |
- | Conflict Detection | ✅ Complete | Integrated into pipeline, frontend, and docs |
- | MedGemma HF Endpoint | ✅ Deployed | `medgemma-27b-cds`, 1× A100 80 GB, scale-to-zero, **currently paused** |
- | MedQA Validation (50 cases) | ✅ Complete | 36% top-1, 38% mentioned, 94% pipeline success |
- | Validation Framework | ✅ Complete | MedQA done; MTSamples + PMC harnesses built but not yet run at scale |
- | **Track System** | ✅ **Scaffolded** | **4 tracks (A/B/C/D), shared utils, all runners built — needs experimentation** |
- | Track B — RAG Variants | ✅ Built | 10 variants (chunking × embedding × rerank), ready to run |
- | Track C — Iterative Refinement | ✅ Built | 4 configs (2/3/5-round + aggressive), ready to run |
- | Track D — Arbitrated Parallel | ✅ Built | 4 configs (3/5 specialists × 1/2 rounds), ready to run |
- | Documentation (8+ files) | ✅ Audited | All docs updated and cross-checked |
- | test_e2e.py | ✅ Fixed | Now asserts 6 steps + conflict_detection |
- | GitHub | ✅ Pushed | `bshepp/clinical-decision-support-agent` (master) |
- | Kaggle Writeup | ✅ Draft ready | `docs/kaggle_writeup.md` — paste into Kaggle |
- | Video Script | ✅ Ready | `docs/video_script.md` — 3 min narration |
- | Demo Video | ⬜ Not started | Required for submission |
-
- **Key files:**
- - Backend entry: `src/backend/app/main.py`
- - Orchestrator: `src/backend/app/agent/orchestrator.py`
- - MedGemma service: `src/backend/app/services/medgemma.py`
- - Validation CLI: `src/backend/validation/run_validation.py`
- - **Track registry: `TRACKS.md`**
- - **Project intelligence: `CLAUDE.md`**
- - HF Endpoint guide: `docs/deploy_medgemma_hf.md`
- - All docs: `README.md`, `docs/architecture.md`, `docs/test_results.md`, `docs/writeup_draft.md`, `DEVELOPMENT_LOG.md`
-
- **Infrastructure:**
- - HF Endpoint: `medgemma-27b-cds` at `https://lisvpf8if1yhgxn2.us-east-1.aws.endpoints.huggingface.cloud`
- - Dev ports: Backend = 8002 (not 8000 — zombie process issue), Frontend = 3000
- - Virtual env: `src/backend/venv/`

TRACKS.md DELETED
@@ -1,194 +0,0 @@
- # TRACKS.md — Experimental Track Registry
-
- > **Single source of truth** for all experimental tracks, their file ownership, tagging conventions, and isolation rules.
- > Referenced by [CLAUDE.md](CLAUDE.md). Read that file first for general project context.
-
- ---
-
- ## Why Tracks?
-
- The baseline pipeline (Track A) achieves 36% top-1 diagnostic accuracy on MedQA. To improve this, we are evaluating **multiple independent strategies** in parallel. Each strategy is an isolated "track" with its own code, configuration, and results — so we can compare them fairly without cross-contamination.
-
- ---
-
- ## Track Registry
-
- | ID | Name | Directory | Strategy |
- |----|------|-----------|----------|
- | **A** | Baseline | `src/backend/app/` | The production 6-step pipeline. No modifications for experiments. |
- | **B** | RAG Variants | `src/backend/tracks/rag_variants/` | Test different chunking sizes, segment strategies, and embedding models to optimize guideline retrieval quality and downstream diagnostic accuracy. |
- | **C** | Iterative Refinement | `src/backend/tracks/iterative/` | Run the diagnosis step in a serial loop — each iteration critiques and refines the previous output. Continue until the marginal improvement drops below a cost/benefit threshold. Produces a convergence chart. |
- | **D** | Arbitrated Parallel | `src/backend/tracks/arbitrated/` | Run multiple specialist reasoning agents in parallel. An arbiter agent evaluates all outputs, tailors resubmission prompts for each specialist based on their strengths/weaknesses, and repeats until the cost/benefit ratio plateaus. Produces a cost/benefit chart. |
- | **E** | Combined | `src/backend/tracks/combined/` | Compose per-axis winners from B/C/D/F/G/H. Tests 3 composition patterns (breadth-then-depth, depth-within-breadth, bookend). **Phase 3 — build after Phase 1+2 data.** |
- | **F** | Prompt Architecture | `src/backend/tracks/prompt_arch/` | Test how reasoning prompt structure affects accuracy: structured template, few-shot, reverse reasoning, Bayesian framing. **Phase 2.** |
- | **G** | Multi-Sample Voting | `src/backend/tracks/voting/` | Self-consistency via repeated sampling + majority/weighted vote. 1/3/5 samples at varying temperatures. **Phase 2.** |
- | **H** | Evidence Verification | `src/backend/tracks/verification/` | Post-hoc grounding check: verify each diagnosis against patient evidence, re-rank by grounding score. **Phase 2.** |
- | **—** | Shared | `src/backend/tracks/shared/` | Cross-track utilities: cost tracking, comparison harness, chart generation. Not a track itself. |
-
- ---
-
- ## File Tagging Convention
-
- **Every file owned by a track MUST carry a track tag on line 1.** This makes ownership unambiguous when reading any file in isolation.
-
- ### Format by file type
-
- | File Type | Tag Format | Example |
- |-----------|-----------|---------|
- | Python (`.py`) | `# [Track X: Name]` | `# [Track B: RAG Variants]` |
- | JSON (`.json`) | First key in object | `{"_track": "Track B: RAG Variants", ...}` |
- | Markdown (`.md`) | HTML comment | `<!-- [Track B: RAG Variants] -->` |
- | Config (`.env`, `.yaml`) | Comment | `# [Track B: RAG Variants]` |
-
- ### Track A exception
-
- Track A files (`src/backend/app/`) were written before the track system existed. They are tagged with `# [Track A: Baseline]` on line 1, but their code is NOT modified for experimental purposes. Experiments extend or wrap Track A code from within their own track directory.
-
- ---
-
- ## Isolation Rules
-
- These rules prevent cross-contamination between experimental tracks:
-
- ### 1. File Ownership
-
- - Each file belongs to exactly **one track** (identified by its line-1 tag and directory).
- - Files in `src/backend/app/` belong to **Track A**.
- - Files in `src/backend/tracks/<dir>/` belong to the corresponding track.
- - Files in `src/backend/tracks/shared/` are shared utilities, not owned by any single track.
-
- ### 2. No Cross-Modification
-
- - **Never modify a Track A file to serve an experiment.** Instead, import and extend from your track's directory.
- - **Never modify a Track B file from Track C code**, and so forth.
- - If two tracks need the same utility, put it in `shared/`.
-
- ### 3. Import Direction
-
- ```
- Track B/C/D code → may import from → Track A (app/) and shared/
- Track A code → NEVER imports → Track B/C/D
- shared/ code → may import from → Track A (app/) only
- ```
-
- ### 4. Results Isolation
-
- - Each track stores results in `src/backend/tracks/<dir>/results/`.
- - Result filenames include the track ID prefix (e.g., `trackB_medqa_20260215.json`).
- - Cross-track comparison is done **only** via `src/backend/tracks/shared/compare.py`.
-
- ### 5. Configuration Isolation
-
- - Track-specific parameters live in each track's own config or constants — not in `app/config.py`.
- - The shared `app/config.py` provides only baseline/global settings (API keys, endpoints, etc.).
-
- ---
-
- ## Track Details
-
- ### Track A: Baseline
-
- **Purpose:** The production-ready pipeline. The control group for all experiments.
-
- **Pipeline:** Parse → Reason → Drug Check → Guideline Retrieval → Conflict Detection → Synthesis
-
- **Key parameters:**
- - Embedding: `all-MiniLM-L6-v2` (384 dims)
- - RAG top-k: 5
- - No guideline chunking (each guideline = 1 document)
- - Clinical reasoning temperature: 0.3
- - Synthesis temperature: 0.2
- - Single-pass reasoning (no iteration)
-
- **Baseline accuracy (50-case MedQA):** 36% top-1, 38% mentioned
-
- ---
-
- ### Track B: RAG Variants
-
- **Purpose:** Determine whether retrieval quality improvements translate to better diagnostic accuracy.
-
- **Experiments:**
- 1. **Chunking strategies** — Split each guideline into smaller segments (100-word chunks, 200-word chunks, sentence-level) with configurable overlap
- 2. **Embedding models** — Compare `all-MiniLM-L6-v2` (384d) vs `all-mpnet-base-v2` (768d) vs `bge-base-en-v1.5` (768d) vs `medcpt` (medical-specific)
- 3. **Top-k variation** — Test k=3, k=5, k=8, k=10 to find optimal retrieval breadth
- 4. **Re-ranking** — Add a cross-encoder re-ranking step after initial retrieval
-
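Experiment 1 (fixed-size chunks with overlap) can be sketched in a few lines. This is illustrative only — parameter defaults are examples, and the real implementation lives in `chunker.py`:

```python
# Fixed-size word-window chunking with overlap (a sketch of one Track B
# strategy; default sizes are illustrative, not the variant configs).
def chunk_words(text: str, size: int = 200, overlap: int = 25) -> list[str]:
    """Split text into windows of `size` words, each overlapping the previous by `overlap`."""
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - overlap, step)]
```

The overlap keeps a recommendation that straddles a chunk boundary retrievable from at least one chunk.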
- **Measured outcomes:**
- - RAG retrieval accuracy (30-query test suite)
- - MedQA diagnostic accuracy (same 50-case seed=42)
- - Retrieval latency per query
-
- **Key files:**
- - `src/backend/tracks/rag_variants/config.py` — Variant definitions
- - `src/backend/tracks/rag_variants/chunker.py` — Guideline chunking strategies
- - `src/backend/tracks/rag_variants/retriever.py` — Modified retrieval with configurable embedding/chunking
- - `src/backend/tracks/rag_variants/run_variants.py` — Runner that tests all configurations
- - `src/backend/tracks/rag_variants/results/` — Per-variant results
-
- ---
-
- ### Track C: Iterative Refinement
-
- **Purpose:** Determine whether repeated self-critique improves diagnostic accuracy, and find the point of diminishing returns.
-
- **Method:**
- 1. Run baseline clinical reasoning (iteration 0)
- 2. Feed the output back along with the patient data and a critique prompt
- 3. The model reviews its own differential, identifies weaknesses, and produces a refined version
- 4. Repeat until: (a) max iterations reached, or (b) the differential stops changing meaningfully
- 5. Track accuracy and LLM cost at each iteration to produce a convergence/cost-benefit chart
-
- **Measured outcomes:**
- - Accuracy at each iteration (top-1, top-3, mentioned)
- - LLM token cost at each iteration
- - Convergence curve: accuracy vs. cumulative cost
- - Iteration at which improvement drops below threshold
-
- **Key files:**
- - `src/backend/tracks/iterative/config.py` — Max iterations, convergence threshold
- - `src/backend/tracks/iterative/refiner.py` — Iterative reasoning loop with self-critique
- - `src/backend/tracks/iterative/run_iterative.py` — Runner with per-iteration scoring
- - `src/backend/tracks/iterative/results/` — Per-iteration results and charts
-
- ---
-
- ### Track D: Arbitrated Parallel
-
- **Purpose:** Determine whether multiple specialist agents, coordinated by an arbiter, outperform a single-pass generalist — and at what cost.
-
- **Method:**
- 1. Run N specialist reasoning agents **in parallel**, each with a domain-specific system prompt (e.g., cardiologist, neurologist, infectious disease specialist)
- 2. An **arbiter agent** receives all N specialist outputs plus the patient data
- 3. The arbiter evaluates each specialist's differential, identifies agreements and disagreements
- 4. The arbiter generates **tailored resubmission prompts** for each specialist — telling the cardiologist "the neurologist raised X, reconsider Y" and vice versa
- 5. Specialists run again with the arbiter's feedback
- 6. Repeat until: (a) consensus reached, (b) max rounds, or (c) cost/benefit drops below threshold
- 7. The arbiter produces the final merged differential
- 8. Track accuracy and cost at each round to produce a cost/benefit chart
-
- **Measured outcomes:**
- - Accuracy at each arbitration round (top-1, top-3, mentioned)
- - Per-specialist accuracy contribution
- - LLM token cost per round (N specialists + 1 arbiter)
- - Cost/benefit convergence chart
- - Consensus rate across rounds
-
- **Key files:**
- - `src/backend/tracks/arbitrated/config.py` — Specialist definitions, max rounds, threshold
- - `src/backend/tracks/arbitrated/specialists.py` — Domain-specific reasoning agents
- - `src/backend/tracks/arbitrated/arbiter.py` — Arbiter agent that evaluates and coordinates
- - `src/backend/tracks/arbitrated/run_arbitrated.py` — Runner with per-round scoring
- - `src/backend/tracks/arbitrated/results/` — Per-round results and charts
-
- ---
-
- ## Adding a New Track
-
- 1. Choose an unused letter ID (I, J, ... — E through H are already reserved in the registry).
- 2. Create `src/backend/tracks/<dir_name>/` with `__init__.py`.
- 3. Add the track to the **Track Registry** table above.
- 4. Tag every new file on line 1 with `# [Track X: Name]`.
- 5. Store results in `src/backend/tracks/<dir_name>/results/`.
- 6. Add a comparison entry in `src/backend/tracks/shared/compare.py`.
- 7. Never import from another track's directory — only from `app/` and `shared/`.

VALIDATION_PIPELINE_PLAN.md DELETED
@@ -1,1149 +0,0 @@
- # VALIDATION_PIPELINE_PLAN.md — Validation Pipeline Fix Plan
-
- > **Purpose:** Step-by-step implementation plan for fixing the validation/scoring
- > pipeline so accuracy metrics actually reflect the system's capabilities.
- >
- > **Root cause:** The pipeline forces every MedQA question through differential
- > diagnosis generation, but only 7/50 sampled questions are diagnostic. The other
- > 43 are treatment, mechanism, lab-finding, ethics, etc. — producing near-zero
- > accuracy on questions the pipeline was never designed to answer.
- >
- > **Expected impact:** Fixes P5+P3+P6 alone should raise measured MedQA accuracy
- > from ~36% to 60-70%+. Full implementation of all 7 fixes gives honest,
- > stratified metrics and unlocks multi-mode pipeline expansion.
- >
- > **Implementation order:** Bottom-up through the data flow. Each step locks down
- > its interface before the next layer builds on it. No rewrites needed.
-
- ---
-
- ## Step 1: P5 — Fix `fuzzy_match()` for Short Answers
-
- **File:** `src/backend/validation/base.py`
- **Functions:** `fuzzy_match()`, `normalize_text()`
- **Depends on:** Nothing
- **Depended on by:** P4 (type-aware scoring), P6 (MCQ selection comparison)
-
- ### Problem
-
- `fuzzy_match()` uses `min(len(c_tokens), len(t_tokens))` as the denominator for
- token overlap. For a 1-word target like "Clopidogrel", `min(1, 200) = 1`, so a
- single token match gives 100% overlap. But for a 3-word target like "Cross-linking
- of DNA", stop-word removal and normalization can reduce the target to 2 tokens,
- and if the candidate doesn't contain those specific tokens, it fails — even if
- the concept is present in different phrasing.
-
- The substring check (`normalize_text(target) in normalize_text(candidate)`) does
- handle the simple containment case — "clopidogrel" matches inside "clopidogrel
- 75mg daily". The real failure case is when the answer uses different phrasing
- than the pipeline output, e.g.:
- - Target: "Reassurance and continuous monitoring"
- - Pipeline says: "reassure the patient and monitor continuously"
- - Neither substring contains the other, and token overlap may be low
-
-
- ### Changes
-
- ```python
- # In base.py — replace fuzzy_match() entirely
-
- def normalize_text(text: str) -> str:
-     """Lowercase, strip punctuation, normalize whitespace."""
-     text = text.lower().strip()
-     text = re.sub(r'[^\w\s]', ' ', text)
-     text = re.sub(r'\s+', ' ', text)
-     return text.strip()
-
-
- # Medical stopwords that don't carry diagnostic meaning
- _MEDICAL_STOPWORDS = frozenset({
-     "the", "a", "an", "of", "in", "to", "and", "or", "is", "are", "was",
-     "were", "be", "been", "with", "for", "on", "at", "by", "from", "this",
-     "that", "these", "those", "it", "its", "has", "have", "had", "do",
-     "does", "did", "will", "would", "could", "should", "may", "might",
-     "most", "likely", "following", "which", "what", "patient", "patients",
- })
-
-
- def _content_tokens(text: str) -> set[str]:
-     """Extract meaningful content tokens, removing medical stopwords."""
-     tokens = set(normalize_text(text).split())
-     return tokens - _MEDICAL_STOPWORDS
-
-
- def fuzzy_match(candidate: str, target: str, threshold: float = 0.6) -> bool:
-     """
-     Check if candidate text is a fuzzy match for target.
-
-     Strategy (checked in order, first match wins):
-     1. Normalized substring containment (either direction)
-     2. All content tokens of target appear in candidate (recall=1.0)
-     3. Token overlap ratio >= threshold (using content tokens)
-
-     Args:
-         candidate: Text from the pipeline output (may be long)
-         target: Ground truth text (usually short)
-         threshold: Minimum token overlap ratio (0.0-1.0)
-     """
-     c_norm = normalize_text(candidate)
-     t_norm = normalize_text(target)
-
-     if not t_norm:
-         return False
-
-     # 1. Substring containment (either direction)
-     if t_norm in c_norm or c_norm in t_norm:
-         return True
-
-     # 2. All content tokens of target present in candidate
-     # This catches "clopidogrel" in a 500-word report
-     t_content = _content_tokens(target)
-     c_content = _content_tokens(candidate)
-
-     if t_content and t_content.issubset(c_content):
-         return True
-
-     # 3. Token overlap ratio
-     if not t_content or not c_content:
-         return False
-
-     overlap = len(t_content & c_content)
-     # Use target token count as denominator — "what fraction of
-     # the target's meaning is present in the candidate?"
-     recall = overlap / len(t_content)
-
-     return recall >= threshold
- ```
-
- ### Key interface change
-
- - **Signature stays the same:** `fuzzy_match(candidate, target, threshold) -> bool`
- - **Behavior change:** More permissive matching for short targets (all-token-subset check),
-   slightly different threshold semantics (recall-based instead of min-denominator-based).
-   This is strictly better — no downstream code breaks.
-
- ### Tests to write
-
- ```python
- # test_fuzzy_match.py
- def test_short_target_substring():
-     assert fuzzy_match("Start clopidogrel 75mg daily", "Clopidogrel") == True
-
- def test_short_target_all_tokens():
-     assert fuzzy_match("The diagnosis is cholesterol embolization syndrome", "Cholesterol embolization") == True
-
- def test_multi_word_phrasing_variation():
-     # "Reassurance and continuous monitoring" vs report text
-     assert fuzzy_match(
-         "reassure the patient and provide continuous cardiac monitoring",
-         "Reassurance and continuous monitoring",
-     ) == True  # passes via the overlap ratio: {continuous, monitoring} covers 2/3 of the target's content tokens (>= 0.6)
-
- def test_no_false_positive():
-     assert fuzzy_match("Acute myocardial infarction", "Pulmonary embolism") == False
-
- def test_empty_target():
-     assert fuzzy_match("some text", "") == False
- ```
-
- **Note:** The "reassurance" vs "reassure" pair above only matches because the
- other two content tokens push the overlap ratio to 2/3; a shorter target with
- one morphological variant would still fail. Add stemming as a future enhancement
- (e.g., via `nltk.stem.PorterStemmer` or a simple suffix-stripping function). For
- now, the all-token-subset check is the biggest improvement.
-
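A minimal suffix-stripping helper — illustrative only; a real implementation would use a proper stemmer such as nltk's `PorterStemmer`, and the suffix list here is an assumption, not a tested design:

```python
# Crude suffix-stripping sketch (illustrative only). Order matters: longer
# suffixes are tried first, and stems shorter than 4 characters are refused.
_SUFFIXES = ("ations", "ation", "ance", "ence", "ing", "ed", "es", "s", "e")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 4 characters."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word
```

Comparing `crude_stem()` outputs instead of raw tokens would let "reassurance" and "reassure" count as the same content token.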
- ### Validation
-
- Run existing test suite — no existing tests should break because matching is
- strictly more permissive. Verify on a few known failure cases from the 50-case
- run results.
-
- ---
-
- ## Step 2: P3 — Preserve the Question Stem
-
- **File:** `src/backend/validation/harness_medqa.py`
- **Functions:** `_extract_vignette()`, `fetch_medqa()`
- **Depends on:** Nothing (independent of P5, but listed second for logical flow)
- **Depended on by:** P1 (classifier needs the stem), P6 (MCQ step needs the stem + options)
-
- ### Problem
-
- `_extract_vignette()` strips the question stem ("Which of the following is the
- most likely diagnosis?") from the MedQA question. This means:
- 1. The pipeline doesn't know what's being asked — it always defaults to
-    "generate a differential"
- 2. The question classifier (P1) can't classify without the stem
- 3. The MCQ step (P6) can't present the original question
-
- ### Changes
-
- #### 2a. Refactor `_extract_vignette()` → `_split_question()`
-
- ```python
- # In harness_medqa.py — replace _extract_vignette()
-
- def _split_question(question: str) -> tuple[str, str]:
-     """
-     Split a USMLE question into (clinical_vignette, question_stem).
-
-     The vignette is the clinical narrative. The stem is the actual question
-     being asked ("Which of the following is the most likely diagnosis?").
-
-     Returns:
-         (vignette, stem) — stem may be empty if no recognizable stem found.
-         In that case, vignette contains the full question text.
-     """
-     stems = [
-         r"which of the following",
-         r"what is the most likely",
-         r"what is the best next step",
-         r"what is the most appropriate",
-         r"what is the diagnosis",
-         r"the most likely diagnosis is",
-         r"this patient most likely has",
-         r"what would be the next step",
-         r"what is the next best step",
-         r"what is the underlying",
-         r"what is the mechanism",
-         r"what is the pathophysiology",
-     ]
-
-     text = question.strip()
-     for stem_pattern in stems:
-         pattern = re.compile(
-             rf'(\.?\s*)([A-Z][^.]*{stem_pattern}[^.]*[\?\.]?\s*)$',
-             re.IGNORECASE,
-         )
-         match = pattern.search(text)
-         if match:
-             vignette = text[:match.start()].strip()
-             stem_text = match.group(2).strip()
-             if len(vignette) > 50:  # Sanity check
-                 return vignette, stem_text
-
-     # Fallback: no recognizable stem — return full text as vignette
-     return text, ""
- ```
-
- #### 2b. Update `fetch_medqa()` to store stem + vignette separately
-
- ```python
- # In fetch_medqa(), replace the case-building loop body:
-
- vignette, question_stem = _split_question(question)
-
- cases.append(ValidationCase(
-     case_id=f"medqa_{i:04d}",
-     source_dataset="medqa",
-     input_text=vignette,  # Pipeline still gets the vignette
-     ground_truth={
-         "correct_answer": answer_text,
-         "answer_idx": answer_idx,
-         "options": options,
-         "full_question": question,
-     },
-     metadata={
-         "question_stem": question_stem,        # NEW
-         "clinical_vignette": vignette,         # NEW (same as input_text, explicit)
-         "full_question_with_stem": question,   # NEW (redundant with ground_truth but cleaner access)
-     },
- ))
- ```
-
- ### Key interface change
-
- - `ValidationCase.metadata` now has 3 new keys: `question_stem`, `clinical_vignette`,
-   `full_question_with_stem`
- - `input_text` is still just the vignette (pipeline input unchanged)
- - `_extract_vignette()` is renamed to `_split_question()` returning a tuple
- - Old callers of `_extract_vignette()`: only `fetch_medqa()` — update in place
-
- ### Backward compatibility
-
- - `input_text` stays the same → pipeline behavior unchanged
- - `ground_truth` keeps all existing keys → scoring unchanged
- - New data is in `metadata` only → nothing breaks
-
- ---
-
- ## Step 3: P1 — Question-Type Classifier
-
- **New file:** `src/backend/validation/question_classifier.py`
- **Depends on:** P3 (needs `metadata["question_stem"]`)
- **Depended on by:** P4 (type-aware scoring), P6 (routing), P7 (stratified reporting)
-
- ### Design
-
- Two-tier classifier:
- 1. **Heuristic classifier** (fast, no LLM call, used by default) — regex on question stem
- 2. **LLM classifier** (optional, for ambiguous cases) — ask MedGemma to classify
-
- Start with heuristic only. It correctly classified our 50-case sample already
- (7 diagnostic, 6 treatment, 1 mechanism, 2 lab, 34 other — matching manual review).
-
- ### Question type enum
-
- ```python
- # In question_classifier.py
-
- from enum import Enum
-
- class QuestionType(str, Enum):
-     DIAGNOSTIC = "diagnostic"      # "most likely diagnosis/cause/explanation"
-     TREATMENT = "treatment"        # "most appropriate next step/management/treatment"
-     MECHANISM = "mechanism"        # "mechanism of action", "pathophysiology"
-     LAB_FINDING = "lab_finding"    # "expected finding", "characteristic on agar"
-     PHARMACOLOGY = "pharmacology"  # "drug that targets...", "receptor..."
-     EPIDEMIOLOGY = "epidemiology"  # "risk factor", "prevalence", "incidence"
-     ETHICS = "ethics"              # "most appropriate action" (ethical dilemmas)
-     ANATOMY = "anatomy"            # "structure most likely damaged"
-     OTHER = "other"                # Doesn't fit above categories
- ```
-
- ### Heuristic classifier
-
- ```python
- import re
- from typing import Optional
- from validation.base import ValidationCase
-
-
- # Pattern → QuestionType mapping (checked in order, first match wins)
- _STEM_PATTERNS: list[tuple[str, QuestionType]] = [
-     # Diagnostic
-     (r"most likely diagnosis", QuestionType.DIAGNOSTIC),
-     (r"most likely cause", QuestionType.DIAGNOSTIC),
-     (r"most likely explanation", QuestionType.DIAGNOSTIC),
-     (r"what is the diagnosis", QuestionType.DIAGNOSTIC),
-     (r"diagnosis is", QuestionType.DIAGNOSTIC),
-     (r"most likely condition", QuestionType.DIAGNOSTIC),
-     (r"most likely has", QuestionType.DIAGNOSTIC),
-     (r"most likely suffer", QuestionType.DIAGNOSTIC),
-
-     # Treatment / Management
-     (r"most appropriate (next step|management|treatment|intervention|therapy|pharmacotherapy)", QuestionType.TREATMENT),
-     (r"best (next step|initial step|management|treatment)", QuestionType.TREATMENT),
-     (r"most appropriate action", QuestionType.TREATMENT),  # Can be ethics — see below
-     (r"recommended (treatment|management|therapy)", QuestionType.TREATMENT),
-
-     # Mechanism
-     (r"mechanism of action", QuestionType.MECHANISM),
-     (r"pathophysiology", QuestionType.MECHANISM),
-     (r"mediator.*(responsible|involved)", QuestionType.MECHANISM),
-     (r"(inhibit|block|activate).*receptor", QuestionType.MECHANISM),
-     (r"cross-link", QuestionType.MECHANISM),
-
-     # Lab / Findings
-     (r"most likely finding", QuestionType.LAB_FINDING),
-     (r"expected (finding|result|value)", QuestionType.LAB_FINDING),
-     (r"characteristic (finding|feature|appearance)", QuestionType.LAB_FINDING),
-     (r"(agar|culture|stain|gram|biopsy).*show", QuestionType.LAB_FINDING),
-     (r"(laboratory|lab).*(result|finding|value)", QuestionType.LAB_FINDING),
-
-     # Pharmacology
-     (r"drug.*(target|mechanism|receptor|inhibit)", QuestionType.PHARMACOLOGY),
-     (r"(target|act on|bind).*(receptor|enzyme|channel)", QuestionType.PHARMACOLOGY),
-
-     # Epidemiology
-     (r"(risk factor|prevalence|incidence|odds ratio|relative risk)", QuestionType.EPIDEMIOLOGY),
-     (r"most (common|frequent).*(cause|risk|complication)", QuestionType.EPIDEMIOLOGY),
-
-     # Anatomy
-     (r"(structure|nerve|artery|vein|muscle|ligament).*(damaged|injured|affected|involved)", QuestionType.ANATOMY),
-
-     # Ethics (refine: "most appropriate action" in context of disclosure, consent, etc.)
-     (r"(tell|inform|disclose|report|consent|refuse|autonomy|confidentiality)", QuestionType.ETHICS),
- ]
-
-
- def classify_question(case: ValidationCase) -> QuestionType:
-     """
-     Classify a MedQA question by type using heuristics on the question stem.
-
-     Looks at metadata["question_stem"] first, falls back to ground_truth["full_question"].
-
-     Returns:
-         QuestionType enum value
-     """
-     stem = case.metadata.get("question_stem", "")
-     full_q = case.ground_truth.get("full_question", case.input_text)
-
-     # Classify on stem first (more specific), then full question
-     for text in [stem, full_q]:
-         text_lower = text.lower()
-         for pattern, qtype in _STEM_PATTERNS:
-             if re.search(pattern, text_lower):
-                 return qtype
-
-     return QuestionType.OTHER
-
-
- def classify_question_from_text(question_text: str) -> QuestionType:
-     """
-     Classify a raw question string (no ValidationCase needed).
-     Useful for ad-hoc classification.
-     """
-     text_lower = question_text.lower()
-     for pattern, qtype in _STEM_PATTERNS:
-         if re.search(pattern, text_lower):
-             return qtype
-     return QuestionType.OTHER
-
-
- # Convenience: which types are "pipeline-appropriate"?
- DIAGNOSTIC_TYPES = {QuestionType.DIAGNOSTIC}
- PIPELINE_APPROPRIATE_TYPES = {
-     QuestionType.DIAGNOSTIC,
-     QuestionType.TREATMENT,
-     QuestionType.LAB_FINDING,
- }
- ```
-
- ### Integration point
405
-
406
- In `fetch_medqa()`, after building each case, classify it:
407
-
408
- ```python
409
- from validation.question_classifier import classify_question
410
-
411
- # After creating the ValidationCase:
412
- case.metadata["question_type"] = classify_question(case).value
413
- ```
414
-
415
- ### Tests
416
-
417
- ```python
418
- def test_diagnostic_classification():
419
- case = make_case(question="...What is the most likely diagnosis?")
420
- assert classify_question(case) == QuestionType.DIAGNOSTIC
421
-
422
- def test_treatment_classification():
423
- case = make_case(question="...What is the most appropriate next step in management?")
424
- assert classify_question(case) == QuestionType.TREATMENT
425
-
426
- def test_mechanism_classification():
427
- case = make_case(question="...mechanism of action...")
428
- assert classify_question(case) == QuestionType.MECHANISM
429
-
430
- def test_ethics_override():
431
- # "most appropriate action" + disclosure keywords → ethics, not treatment
432
- case = make_case(question="...Tell the attending that he cannot fail to disclose this mistake. What is the most appropriate action?")
433
- assert classify_question(case) == QuestionType.ETHICS
434
- ```
435
-
436
- **Note on ethics override:** The pattern order matters. "most appropriate action"
437
- will match TREATMENT first. To handle ethics, we need the ethics patterns to check
438
- for disclosure/consent keywords in the *answer* or full question context. The
439
- current design checks patterns in order — put ethics keyword patterns before the
440
- generic "most appropriate action" treatment pattern, OR do a two-pass: first check
441
- for ethics keywords, then fall through to treatment.
442
-
443
- **Decision:** Use a two-pass approach. If the question contains ethics keywords
444
- AND a treatment-like stem, classify as ETHICS. Otherwise classify as TREATMENT.
445
- Implement this in `classify_question()` with a special-case check.
446
-
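The two-pass decision above can be sketched as follows (the function and pattern names are illustrative, not the final API — the real check would live inside `classify_question()`):

```python
import re

# Pass 1: ethics keywords + a treatment-like stem -> ETHICS.
# Pass 2: treatment-like stem alone -> TREATMENT.
_ETHICS_KEYWORDS = re.compile(
    r"(tell|inform|disclose|report|consent|refuse|autonomy|confidentiality)"
)
_TREATMENT_STEM = re.compile(r"most appropriate (action|next step)")

def classify_with_ethics_override(text: str) -> str:
    text_lower = text.lower()
    if _TREATMENT_STEM.search(text_lower):
        if _ETHICS_KEYWORDS.search(text_lower):
            return "ethics"  # ethics keywords win over the generic stem
        return "treatment"
    return "other"  # fall through to the normal pattern table
```

Running this check before the `_STEM_PATTERNS` loop means the generic "most appropriate action" treatment pattern never shadows an ethics question.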
447
- ---
448
-
449
- ## Step 4: P4 — Question-Type-Aware Scoring
450
-
451
- **File:** `src/backend/validation/base.py` (new function) + `src/backend/validation/harness_medqa.py` (refactor scoring block)
452
- **Depends on:** P5 (correct fuzzy_match), P1 (question_type in metadata)
453
- **Depended on by:** P7 (stratified reporting)
454
-
455
- ### Problem
456
-
457
- `diagnosis_in_differential()` always searches the same fields in the same order
458
- regardless of question type. Treatment answers get looked up in the differential
459
- (wrong place), and mechanism answers get looked up everywhere (unlikely to match).
460
-
461
- ### Design: `score_case()` dispatcher
462
-
463
- ```python
464
- # In base.py — new function alongside diagnosis_in_differential()
465
-
466
- def score_case(
467
- target_answer: str,
468
- report: CDSReport,
469
- question_type: str = "diagnostic",
470
- reasoning_result: Optional[ClinicalReasoningResult] = None,
471
- ) -> dict[str, float]:
472
- """
473
- Score a case based on its question type.
474
-
475
- Returns a dict of metric_name → score (0.0 or 1.0).
476
- Always includes: "matched", "match_location", "match_rank"
477
- Plus type-specific metrics.
478
- """
479
- qt = question_type.lower()
480
-
481
- if qt == "diagnostic":
482
- return _score_diagnostic(target_answer, report)
483
- elif qt == "treatment":
484
- return _score_treatment(target_answer, report)
485
- elif qt == "mechanism":
486
- return _score_mechanism(target_answer, report, reasoning_result)
487
- elif qt == "lab_finding":
488
- return _score_lab_finding(target_answer, report, reasoning_result)
489
- else:
490
- return _score_generic(target_answer, report, reasoning_result)
491
- ```
492
-
493
- ### Per-type scorers
494
-
495
- ```python
496
- def _score_diagnostic(target: str, report: CDSReport) -> dict:
497
- """Score a diagnostic question — primary field is differential_diagnosis."""
498
- found_top1, _, _ = diagnosis_in_differential(target, report, top_n=1)
499
- found_top3, _, _ = diagnosis_in_differential(target, report, top_n=3)
500
- found_any, ra, la = diagnosis_in_differential(target, report)
501
-
502
- return {
503
- "top1_accuracy": 1.0 if found_top1 else 0.0,
504
- "top3_accuracy": 1.0 if found_top3 else 0.0,
505
- "mentioned_accuracy": 1.0 if found_any else 0.0,
506
- "differential_accuracy": 1.0 if (found_any and la == "differential") else 0.0,
507
- "match_location": la,
508
- "match_rank": ra,
509
- }
510
-
511
-
512
- def _score_treatment(target: str, report: CDSReport) -> dict:
513
- """Score a treatment question — primary fields are next_steps + recommendations."""
514
- # Check suggested_next_steps first (most specific)
515
- for i, action in enumerate(report.suggested_next_steps):
516
- if fuzzy_match(action.action, target):
517
- return {
518
- "top1_accuracy": 1.0 if i == 0 else 0.0,
519
- "top3_accuracy": 1.0 if i < 3 else 0.0,
520
- "mentioned_accuracy": 1.0,
521
- "match_location": "next_steps",
522
- "match_rank": i,
523
- }
524
-
525
- # Check guideline_recommendations
526
- for i, rec in enumerate(report.guideline_recommendations):
527
- if fuzzy_match(rec, target):
528
- return {
529
- "top1_accuracy": 0.0, # Not in primary slot
530
- "top3_accuracy": 0.0,
531
- "mentioned_accuracy": 1.0,
532
- "match_location": "recommendations",
533
- "match_rank": i,
534
- }
535
-
536
- # Check differential reasoning text (treatment may appear in reasoning)
537
- for dx in report.differential_diagnosis:
538
- if fuzzy_match(dx.reasoning, target, threshold=0.3):
539
- return {
540
- "top1_accuracy": 0.0,
541
- "top3_accuracy": 0.0,
542
- "mentioned_accuracy": 1.0,
543
- "match_location": "reasoning_text",
544
- "match_rank": -1,
545
- }
546
-
547
- # Fulltext fallback
548
- full_text = _build_fulltext(report)
549
- if fuzzy_match(full_text, target, threshold=0.3):
550
- return {
551
- "top1_accuracy": 0.0,
552
- "top3_accuracy": 0.0,
553
- "mentioned_accuracy": 1.0,
554
- "match_location": "fulltext",
555
- "match_rank": -1,
556
- }
557
-
558
- return _not_found()
559
-
560
-
561
- def _score_mechanism(
562
- target: str, report: CDSReport,
563
- reasoning_result: Optional[ClinicalReasoningResult] = None,
564
- ) -> dict:
565
- """Score a mechanism question — primary field is reasoning_chain."""
566
- # Check reasoning chain from clinical reasoning step
567
- if reasoning_result and reasoning_result.reasoning_chain:
568
- if fuzzy_match(reasoning_result.reasoning_chain, target, threshold=0.3):
569
- return {
570
- "top1_accuracy": 0.0,
571
- "top3_accuracy": 0.0,
572
- "mentioned_accuracy": 1.0,
573
- "match_location": "reasoning_chain",
574
- "match_rank": -1,
575
- }
576
-
577
- # Check differential reasoning text
578
- for dx in report.differential_diagnosis:
579
- if fuzzy_match(dx.reasoning, target, threshold=0.3):
580
- return {
581
- "top1_accuracy": 0.0,
582
- "top3_accuracy": 0.0,
583
- "mentioned_accuracy": 1.0,
584
- "match_location": "differential_reasoning",
585
- "match_rank": -1,
586
- }
587
-
588
- # Fulltext fallback
589
- full_text = _build_fulltext(report)
590
- if fuzzy_match(full_text, target, threshold=0.3):
591
- return {
592
- "top1_accuracy": 0.0,
593
- "top3_accuracy": 0.0,
594
- "mentioned_accuracy": 1.0,
595
- "match_location": "fulltext",
596
- "match_rank": -1,
597
- }
598
-
599
- return _not_found()
600
-
601
-
602
- def _score_lab_finding(
603
- target: str, report: CDSReport,
604
- reasoning_result: Optional[ClinicalReasoningResult] = None,
605
- ) -> dict:
606
- """Score a lab/finding question — primary field is recommended_workup."""
607
- # Check recommended workup
608
- if reasoning_result:
609
- for i, action in enumerate(reasoning_result.recommended_workup):
610
- if fuzzy_match(action.action, target, threshold=0.4):
611
- return {
612
- "top1_accuracy": 1.0 if i == 0 else 0.0,
613
- "top3_accuracy": 1.0 if i < 3 else 0.0,
614
- "mentioned_accuracy": 1.0,
615
- "match_location": "recommended_workup",
616
- "match_rank": i,
617
- }
618
-
619
- # Check next steps in final report
620
- for i, action in enumerate(report.suggested_next_steps):
621
- if fuzzy_match(action.action, target, threshold=0.4):
622
- return {
623
- "top1_accuracy": 0.0,
624
- "top3_accuracy": 0.0,
625
- "mentioned_accuracy": 1.0,
626
- "match_location": "next_steps",
627
- "match_rank": i,
628
- }
629
-
630
- # Fulltext fallback
631
- full_text = _build_fulltext(report)
632
- if fuzzy_match(full_text, target, threshold=0.3):
633
- return {
634
- "top1_accuracy": 0.0,
635
- "top3_accuracy": 0.0,
636
- "mentioned_accuracy": 1.0,
637
- "match_location": "fulltext",
638
- "match_rank": -1,
639
- }
640
-
641
- return _not_found()
642
-
643
-
644
- def _score_generic(
645
- target: str, report: CDSReport,
646
- reasoning_result: Optional[ClinicalReasoningResult] = None,
647
- ) -> dict:
648
- """Score any question type — searches all fields broadly."""
649
- # Try all specific scorers, return first hit
650
- for scorer in [_score_diagnostic, _score_treatment]:
651
- result = scorer(target, report)
652
- if result.get("mentioned_accuracy", 0.0) > 0.0:
653
- return result
654
-
655
- if reasoning_result:
656
- result = _score_mechanism(target, report, reasoning_result)
657
- if result.get("mentioned_accuracy", 0.0) > 0.0:
658
- return result
659
-
660
- return _not_found()
661
-
662
-
663
- def _build_fulltext(report: CDSReport) -> str:
664
- """Concatenate all report fields into a single searchable string."""
665
- return " ".join([
666
- report.patient_summary or "",
667
- " ".join(report.guideline_recommendations),
668
- " ".join(a.action for a in report.suggested_next_steps),
669
- " ".join(dx.diagnosis + " " + dx.reasoning for dx in report.differential_diagnosis),
670
- " ".join(report.sources_cited),
671
- " ".join(c.description for c in report.conflicts),
672
- ])
673
-
674
-
675
- def _not_found() -> dict:
676
- return {
677
- "top1_accuracy": 0.0,
678
- "top3_accuracy": 0.0,
679
- "mentioned_accuracy": 0.0,
680
- "match_location": "not_found",
681
- "match_rank": -1,
682
- }
683
- ```
684
-
685
- ### Integration in harness_medqa.py
686
-
687
- Replace the scoring block (lines ~242-290) in `validate_medqa()`:
688
-
689
- ```python
690
- # OLD:
691
- # found_top1, rank1, loc1 = diagnosis_in_differential(correct_answer, report, top_n=1)
692
- # ...etc...
693
-
694
- # NEW:
695
- question_type = case.metadata.get("question_type", "other")
696
- scores = score_case(
697
- target_answer=correct_answer,
698
- report=report,
699
- question_type=question_type,
700
- reasoning_result=state.clinical_reasoning if state else None,
701
- )
702
- # Extract individual metrics from the dict
703
- scores["parse_success"] = 1.0
704
- ```
705
-
706
- ### Key interface
707
-
708
- - `score_case()` returns `dict[str, float]` — always includes `top1_accuracy`,
709
- `top3_accuracy`, `mentioned_accuracy`, `match_location`, `match_rank`
710
- - The harness doesn't need to know about question type internals — just passes
711
- the string through
712
- - `diagnosis_in_differential()` is NOT removed — it's still used internally by
713
- `_score_diagnostic()` and as a utility
714
-
715
- ---
716
-
717
- ## Step 5: P6 — MCQ Answer-Selection Step
718
-
719
- **File:** `src/backend/validation/harness_medqa.py` (new function + integration)
720
- **Depends on:** P3 (question stem + options stored in metadata/ground_truth)
721
- **Depended on by:** P7 (reporting), but can be integrated independently
722
-
723
- ### Design
724
-
725
- After the pipeline generates its report, present MedGemma with the original
726
- question + answer choices + the pipeline's analysis, and ask it to select
727
- the best answer choice.
728
-
729
- ```python
730
- # In harness_medqa.py — new function
731
-
732
- from app.services.medgemma import MedGemmaService
733
-
734
-
735
- MCQ_SELECTION_PROMPT = """You are a medical expert taking a USMLE-style exam.
736
-
737
- You have already performed a thorough clinical analysis of this case.
738
- Now, based on your analysis, select the single best answer from the choices below.
739
-
740
- CLINICAL VIGNETTE:
741
- {vignette}
742
-
743
- QUESTION:
744
- {question_stem}
745
-
746
- YOUR CLINICAL ANALYSIS:
747
- - Top diagnoses: {top_diagnoses}
748
- - Key reasoning: {reasoning_summary}
749
- - Recommended next steps: {next_steps}
750
- - Guideline recommendations: {recommendations}
751
-
752
- ANSWER CHOICES:
753
- {formatted_options}
754
-
755
- Based on your clinical analysis above, which answer choice (A, B, C, or D)
756
- is BEST supported? Reply with ONLY the letter (A, B, C, or D) and a one-sentence justification.
757
-
758
- Format: X) Justification"""
759
-
760
-
761
- async def select_mcq_answer(
762
- case: ValidationCase,
763
- report: CDSReport,
764
- state: Optional[AgentState] = None,
765
- ) -> tuple[str, str]:
766
- """
767
- Use MedGemma to select the best MCQ answer given the pipeline's analysis.
768
-
769
- Args:
770
- case: The validation case (must have options in ground_truth)
771
- report: The CDS pipeline output
772
- state: Full agent state (for reasoning_chain access)
773
-
774
- Returns:
775
- (selected_letter, justification) — e.g. ("B", "Consistent with...")
776
- """
777
- options = case.ground_truth.get("options", {})
778
- if not options:
779
- return "", "No options available"
780
-
781
- # Format options
782
- if isinstance(options, dict):
783
- formatted = "\n".join(f"{k}) {v}" for k, v in sorted(options.items()))
784
- else:
785
- formatted = "\n".join(
786
- f"{chr(65+i)}) {v}" for i, v in enumerate(options)
787
- )
788
-
789
- # Build context from report
790
- top_dx = [dx.diagnosis for dx in report.differential_diagnosis[:3]]
791
- reasoning = ""
792
- if state and state.clinical_reasoning:
793
- reasoning = state.clinical_reasoning.reasoning_chain[:500]
794
- next_steps = [a.action for a in report.suggested_next_steps[:3]]
795
- recommendations = report.guideline_recommendations[:3]
796
-
797
- vignette = case.metadata.get("clinical_vignette", case.input_text)
798
- stem = case.metadata.get("question_stem", "")
799
-
800
- prompt = MCQ_SELECTION_PROMPT.format(
801
- vignette=vignette[:1000],
802
- question_stem=stem or "Based on the clinical presentation, select the best answer.",
803
- top_diagnoses=", ".join(top_dx) if top_dx else "None generated",
804
- reasoning_summary=reasoning[:500] if reasoning else "Not available",
805
- next_steps=", ".join(next_steps) if next_steps else "None",
806
- recommendations=", ".join(recommendations) if recommendations else "None",
807
- formatted_options=formatted,
808
- )
809
-
810
- service = MedGemmaService()
811
- raw = await service.generate(
812
- prompt=prompt,
813
- system_prompt="You are a medical expert. Select the single best answer.",
814
- max_tokens=100,
815
- temperature=0.1,
816
- )
817
-
818
- # Parse response — look for a letter A-D
819
- selected = ""
820
- justification = raw.strip()
821
- for char in raw.strip()[:5]:
822
- if char.upper() in "ABCD":
823
- selected = char.upper()
824
- break
825
-
826
- return selected, justification
827
-
828
-
829
- def score_mcq_selection(
830
- selected_letter: str,
831
- correct_idx: str,
832
- ) -> float:
833
- """Return 1.0 if selected matches correct, else 0.0."""
834
- return 1.0 if selected_letter.upper() == correct_idx.upper() else 0.0
835
- ```
836
-
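The letter-scan above only inspects the first five characters, which misses replies like "Answer: B". A slightly more robust parse (a sketch — `parse_mcq_letter` is a hypothetical helper, not an existing function) uses a word-boundary regex:

```python
import re

# Match a standalone uppercase A-D near the start of the reply; accepts
# formats like "B) ...", "(B) ...", or "Answer: B".
_LETTER_RE = re.compile(r"\b([A-D])\b")

def parse_mcq_letter(raw: str) -> str:
    """Return the first standalone A-D near the start of the reply, else ""."""
    match = _LETTER_RE.search(raw.strip()[:80])
    return match.group(1) if match else ""
```

Restricting the search to the first 80 characters keeps a stray capital deep in the justification from being mistaken for the selected answer.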
837
- ### Integration in validate_medqa()
838
-
839
- After the existing scoring block, add:
840
-
841
- ```python
842
- # MCQ selection (optional additional scoring)
843
- if report and case.ground_truth.get("options"):
844
- try:
845
- selected, justification = await select_mcq_answer(case, report, state)
846
- scores["mcq_accuracy"] = score_mcq_selection(
847
- selected, case.ground_truth["answer_idx"]
848
- )
849
- details["mcq_selected"] = selected
850
- details["mcq_justification"] = justification
851
- details["mcq_correct"] = case.ground_truth["answer_idx"]
852
- except Exception as e:
853
- logger.warning(f"MCQ selection failed: {e}")
854
- scores["mcq_accuracy"] = 0.0
855
- ```
856
-
857
- ### Cost consideration
858
-
859
- This adds 1 extra MedGemma call per case (~100 tokens output). For 50 cases,
860
- that's ~5,000 extra output tokens — negligible cost (<$0.10).
861
-
862
- ### Key interface
863
-
864
- - `select_mcq_answer()` is self-contained — can be called or skipped
865
- - Adds `mcq_accuracy` to the scores dict
866
- - Does NOT change any existing score calculations
867
-
868
- ---
869
-
870
- ## Step 6: P7 — Stratified Reporting
871
-
872
- **File:** `src/backend/validation/base.py` (modify `print_summary`, `save_results`)
873
- + `src/backend/validation/harness_medqa.py` (modify aggregation block)
874
- **Depends on:** P1 (question types), P4 (per-type scores)
875
- **Depended on by:** Nothing (terminal node)
876
-
877
- ### Changes to summary aggregation in validate_medqa()
878
-
879
- ```python
880
- # In validate_medqa() — replace the aggregation block at the end
881
-
882
- # Aggregate — overall
883
- total = len(results)
884
- successful = sum(1 for r in results if r.success)
885
-
886
- metric_names = [
887
- "top1_accuracy", "top3_accuracy", "mentioned_accuracy",
888
- "differential_accuracy", "parse_success", "mcq_accuracy",
889
- ]
890
- metrics = {}
891
- for m in metric_names:
892
- values = [r.scores.get(m, 0.0) for r in results if m in r.scores]
893
- metrics[m] = sum(values) / len(values) if values else 0.0
894
-
895
- # Average pipeline time
896
- times = [r.pipeline_time_ms for r in results if r.success]
897
- metrics["avg_pipeline_time_ms"] = sum(times) / len(times) if times else 0
898
-
899
- # ── Stratified metrics ──
900
- from validation.question_classifier import QuestionType, PIPELINE_APPROPRIATE_TYPES
901
-
902
- # Group results by question type
903
- by_type: dict[str, list[ValidationResult]] = {}
904
- for r in results:
905
- qt = r.details.get("question_type", "other")
906
- by_type.setdefault(qt, []).append(r)
907
-
908
- # Per-type metrics
909
- for qt, type_results in by_type.items():
910
- n = len(type_results)
911
- metrics[f"count_{qt}"] = n
912
- for m in ["top1_accuracy", "top3_accuracy", "mentioned_accuracy", "mcq_accuracy"]:
913
- values = [r.scores.get(m, 0.0) for r in type_results if m in r.scores]
914
- if values:
915
- metrics[f"{m}_{qt}"] = sum(values) / len(values)
916
-
917
- # Pipeline-appropriate subset
918
- appropriate_results = [
919
- r for r in results
920
- if r.details.get("question_type", "other") in {t.value for t in PIPELINE_APPROPRIATE_TYPES}
921
- ]
922
- if appropriate_results:
923
- for m in ["top1_accuracy", "top3_accuracy", "mentioned_accuracy"]:
924
- values = [r.scores.get(m, 0.0) for r in appropriate_results]
925
- metrics[f"{m}_pipeline_appropriate"] = sum(values) / len(values) if values else 0.0
926
- metrics["count_pipeline_appropriate"] = len(appropriate_results)
927
- ```
928
-
929
- ### Changes to print_summary()
930
-
931
- ```python
932
- # In base.py — enhanced print_summary()
933
-
934
- def print_summary(summary: ValidationSummary):
935
- """Pretty-print validation results to console."""
936
- print(f"\n{'='*60}")
937
- print(f" Validation Results: {summary.dataset.upper()}")
938
- print(f"{'='*60}")
939
- print(f" Total cases: {summary.total_cases}")
940
- print(f" Successful: {summary.successful_cases}")
941
- print(f" Failed: {summary.failed_cases}")
942
- print(f" Duration: {summary.run_duration_sec:.1f}s")
943
-
944
- # Overall metrics (exclude per-type and count metrics)
945
- print(f"\n Overall Metrics:")
946
- for metric, value in sorted(summary.metrics.items()):
947
- if "_" in metric and any(metric.endswith(f"_{qt}") for qt in
948
- ["diagnostic", "treatment", "mechanism", "lab_finding",
949
- "pharmacology", "epidemiology", "ethics", "anatomy", "other",
950
- "pipeline_appropriate"]):
951
- continue # Print these in stratified section
952
- if metric.startswith("count_"):
953
- continue
954
- if "time" in metric and isinstance(value, (int, float)):
955
- print(f" {metric:35s} {value:.0f}ms")
956
- elif isinstance(value, float):
957
- print(f" {metric:35s} {value:.1%}")
958
- else:
959
- print(f" {metric:35s} {value}")
960
-
961
- # Stratified metrics
962
- type_keys = sorted(set(
963
- k[len("count_"):] for k in summary.metrics  # "count_lab_finding" -> "lab_finding" (rsplit would give "finding")
964
- if k.startswith("count_") and k != "count_pipeline_appropriate"
965
- ))
966
- if type_keys:
967
- print(f"\n By Question Type:")
968
- print(f" {'Type':15s} {'Count':>6s} {'Top-1':>7s} {'Top-3':>7s} {'Mentioned':>10s} {'MCQ':>7s}")
969
- print(f" {'-'*15} {'-'*6} {'-'*7} {'-'*7} {'-'*10} {'-'*7}")
970
- for qt in type_keys:
971
- count = summary.metrics.get(f"count_{qt}", 0)
972
- t1 = summary.metrics.get(f"top1_accuracy_{qt}", None)
973
- t3 = summary.metrics.get(f"top3_accuracy_{qt}", None)
974
- ma = summary.metrics.get(f"mentioned_accuracy_{qt}", None)
975
- mcq = summary.metrics.get(f"mcq_accuracy_{qt}", None)
976
- # A conditional can't appear inside an f-string format spec, so
977
- # pre-format each cell before printing.
978
- cells = [(f"{v:.0%}" if v is not None else "-").rjust(w)
979
- for v, w in [(t1, 7), (t3, 7), (ma, 10), (mcq, 7)]]
980
- print(f"    {qt:15s} {int(count):6d} " + " ".join(cells))
981
-
982
- # Pipeline-appropriate subset
983
- pa_count = summary.metrics.get("count_pipeline_appropriate", 0)
984
- if pa_count > 0:
985
- print(f"\n Pipeline-Appropriate Subset ({int(pa_count)} cases):")
986
- for m in ["top1_accuracy", "top3_accuracy", "mentioned_accuracy"]:
987
- v = summary.metrics.get(f"{m}_pipeline_appropriate")
988
- if v is not None:
989
- print(f" {m:35s} {v:.1%}")
990
-
991
- print(f"{'='*60}\n")
992
- ```
993
-
994
- ### Key interface
995
-
996
- - `ValidationSummary.metrics` dict gains new keys with `_{question_type}` suffixes
997
- - `save_results()` doesn't need changes — it serializes `metrics` as-is
998
- - Console output is richer but backward-compatible (old scripts parsing the JSON
999
- still see all the original keys)
1000
-
1001
- ---
1002
-
1003
- ## Step 7: P2 — Multi-Mode Pipeline (Large — Future)
1004
-
1005
- **Files:** `src/backend/app/agent/orchestrator.py`, `src/backend/app/tools/clinical_reasoning.py`, `src/backend/app/models/schemas.py`
1006
- **Depends on:** P1 (question type routing into the pipeline), P3 (question stem passed to pipeline)
1007
- **Depended on by:** Nothing (this is the final architectural evolution)
1008
-
1009
- ### Overview
1010
-
1011
- This is the biggest change and should be done LAST. It modifies the production
1012
- pipeline, not just the validation framework.
1013
-
1014
- ### 7a. Add `question_context` to `CaseSubmission`
1015
-
1016
- ```python
1017
- # In schemas.py — extend CaseSubmission
1018
-
1019
- class CaseSubmission(BaseModel):
1020
- patient_text: str = Field(..., min_length=10)
1021
- include_drug_check: bool = Field(True)
1022
- include_guidelines: bool = Field(True)
1023
- question_context: Optional[str] = Field(
1024
- None,
1025
- description="The clinical question being asked (e.g., 'What is the most likely diagnosis?'). "
1026
- "If provided, the pipeline adapts its reasoning mode.",
1027
- )
1028
- question_type: Optional[str] = Field(
1029
- None,
1030
- description="Pre-classified question type: diagnostic, treatment, mechanism, etc.",
1031
- )
1032
- ```
1033
-
1034
- ### 7b. Mode-specific system prompts in clinical_reasoning.py
1035
-
1036
- ```python
1037
- # Replace single SYSTEM_PROMPT with a dict:
1038
-
1039
- SYSTEM_PROMPTS = {
1040
- "diagnostic": """You are an expert clinical reasoning assistant...
1041
- [existing diagnostic prompt — mostly unchanged]""",
1042
-
1043
- "treatment": """You are an expert clinical management assistant...
1044
- Given a structured patient profile and clinical question, recommend the
1045
- most appropriate treatment or next step in management.
1046
- Focus on: evidence-based treatment guidelines, patient-specific factors,
1047
- contraindications, and prioritized management steps.
1048
- Generate a ranked list of management options (not diagnoses)...""",
1049
-
1050
- "mechanism": """You are an expert in medical pathophysiology...
1051
- Given a clinical scenario, explain the underlying mechanism,
1052
- pathophysiology, or pharmacological principle being tested.
1053
- Focus on: molecular/cellular mechanism, physiological pathways,
1054
- drug mechanisms of action...""",
1055
-
1056
- "default": """[existing SYSTEM_PROMPT as fallback]""",
1057
- }
1058
- ```
1059
-
1060
- ### 7c. Extend clinical reasoning output model
1061
-
1062
- ```python
1063
- # In schemas.py — new model for non-diagnostic reasoning
1064
-
1065
- class ClinicalAnalysisResult(BaseModel):
1066
- """Flexible clinical analysis output that adapts to question type."""
1067
- analysis_mode: str = Field("diagnostic", description="What type of analysis was performed")
1068
- differential_diagnosis: List[DiagnosisCandidate] = Field(default_factory=list)
1069
- management_options: List[RecommendedAction] = Field(default_factory=list)
1070
- mechanism_explanation: str = Field("", description="Pathophysiology/mechanism explanation")
1071
- recommended_workup: List[RecommendedAction] = Field(default_factory=list)
1072
- reasoning_chain: str = Field("")
1073
- risk_assessment: Optional[str] = None
1074
- direct_answer: Optional[str] = Field(
1075
- None,
1076
- description="Direct answer to the clinical question (when applicable)",
1077
- )
1078
- ```
1079
-
1080
- ### 7d. Orchestrator routing
1081
-
1082
- ```python
1083
- # In orchestrator.py — _step_reason() adapts based on question type
1084
-
1085
- async def _step_reason(self):
1086
- question_type = self._case.question_type or "diagnostic"
1087
- result = await self.clinical_reasoning.run(
1088
- self._state.patient_profile,
1089
- mode=question_type,
1090
- )
1091
- ...
1092
- ```
1093
-
1094
- ### Scope warning
1095
-
1096
- This is a multi-file, multi-model refactor. Do it only after Steps 1-6 are
1097
- working and validated. The validation improvements (Steps 1-6) will already
1098
- give us honest metrics; Step 7 is about actually improving the pipeline's ability
1099
- to handle non-diagnostic questions.
1100
-
1101
- ---
1102
-
1103
- ## Testing Strategy
1104
-
1105
- ### Unit tests (no LLM calls needed)
1106
-
1107
- | Test file | What it tests |
1108
- |-----------|---------------|
1109
- | `test_fuzzy_match.py` | P5: fuzzy_match with short/long targets, edge cases |
1110
- | `test_question_classifier.py` | P1: classification accuracy on known questions |
1111
- | `test_split_question.py` | P3: vignette/stem separation on real MedQA samples |
1112
- | `test_score_case.py` | P4: type-aware scoring with mock CDSReport objects |
1113
-
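As a flavor of `test_question_classifier.py`, here is a self-contained sketch: the stand-in `classify()` inlines a subset of the `_STEM_PATTERNS` shown earlier so the tests run without importing the validation package.

```python
import re

# Stand-in classifier: a subset of _STEM_PATTERNS, first hit wins.
_PATTERNS = [
    (r"most likely finding", "lab_finding"),
    (r"(risk factor|prevalence|incidence|odds ratio|relative risk)", "epidemiology"),
    (r"(inhibit|block|activate).*receptor", "mechanism"),
]

def classify(question: str) -> str:
    text_lower = question.lower()
    for pattern, qtype in _PATTERNS:
        if re.search(pattern, text_lower):
            return qtype
    return "other"

def test_lab_finding_classification():
    assert classify("A biopsy is performed. What is the most likely finding?") == "lab_finding"

def test_epidemiology_classification():
    assert classify("Which of the following is the strongest risk factor?") == "epidemiology"

def test_fallback_to_other():
    assert classify("What should the physician say next?") == "other"
```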
1114
- ### Integration tests (need LLM endpoint)
1115
-
1116
- | Test | What it tests | Cost |
1117
- |------|---------------|------|
1118
- | 3-case smoke test with MCQ | P6: MCQ selection works | ~$0.50 |
1119
- | 10-case run with stratified reporting | P7: reporting output is correct | ~$2.00 |
1120
- | 50-case full run with all fixes | All: end-to-end accuracy comparison | ~$5.00 |
1121
-
1122
- ### Comparison protocol
1123
-
1124
- Run 50-case MedQA (seed=42) twice:
1125
- 1. **Before:** Current code (baseline: 36% top-1, 38% mentioned)
1126
- 2. **After:** All fixes applied
1127
-
1128
- Compare:
1129
- - Overall accuracy (should be similar or slightly higher)
1130
- - Diagnostic-only accuracy (should be similar — same pipeline, better matching)
1131
- - MCQ accuracy (expected 60-70%+ — this is the big win)
1132
- - Pipeline-appropriate accuracy (expected higher than overall)
1133
- - Stratified breakdown by question type
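The before/after comparison can be mechanized with a small helper over the two saved runs (a sketch: it assumes the `save_results()` JSON exposes a top-level `"metrics"` mapping as described above — adjust the key if the schema differs):

```python
import json

def compare_metrics(before: dict, after: dict) -> dict:
    """Map metric name -> (before, after, delta) for metrics present in both runs."""
    return {
        name: (before[name], after[name], after[name] - before[name])
        for name in sorted(before)
        if name in after and isinstance(before[name], (int, float))
    }

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["metrics"]
```

Usage: `compare_metrics(load_metrics("results_before.json"), load_metrics("results_after.json"))` surfaces per-metric deltas; keys that exist only in the after run (e.g. the new `mcq_accuracy`) are skipped, since only shared metrics are comparable.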
1134
-
1135
- ---
1136
-
1137
- ## File Change Summary
1138
-
1139
- | File | Changes | Step |
1140
- |------|---------|------|
1141
- | `validation/base.py` | Rewrite `fuzzy_match()`, add `_content_tokens()`, `_MEDICAL_STOPWORDS`. Add `score_case()` and per-type scorers. Modify `print_summary()`. | P5, P4, P7 |
1142
- | `validation/harness_medqa.py` | Replace `_extract_vignette()` with `_split_question()`. Update `fetch_medqa()` metadata. Refactor scoring block to use `score_case()`. Add `select_mcq_answer()`. Update aggregation. | P3, P4, P6, P7 |
1143
- | `validation/question_classifier.py` | **NEW FILE.** `QuestionType` enum, `classify_question()`, `_STEM_PATTERNS`. | P1 |
1144
- | `app/models/schemas.py` | Add `question_context`, `question_type` to `CaseSubmission`. Add `ClinicalAnalysisResult`. | P2 (Step 7 only) |
1145
- | `app/tools/clinical_reasoning.py` | Add mode-specific system prompts. Accept `mode` param. | P2 (Step 7 only) |
1146
- | `app/agent/orchestrator.py` | Route reasoning step based on question type. | P2 (Step 7 only) |
1147
-
1148
- **Steps 1-6 touch only validation code.** The production pipeline is unchanged
1149
- until Step 7.
 
competition/download_data.txt DELETED
@@ -1 +0,0 @@
1
- kaggle competitions download -c med-gemma-impact-challenge
 
 
competition/overview.txt DELETED
@@ -1,167 +0,0 @@
1
- The MedGemma Impact Challenge
2
- Build human-centered AI applications with MedGemma and other open models from Google’s Health AI Developer Foundations (HAI-DEF).
3
-
4
-
5
- The MedGemma Impact Challenge
6
-
7
- View Writeups
8
- Overview
9
- In this competition, you’ll use MedGemma and other open models from Google’s Health AI Developer Foundations (HAI-DEF) to build human-centered AI applications.
10
-
11
- Start
12
-
13
- a month ago
14
- Close
15
- 11 days to go
16
- Description
17
- AI is already reshaping medicine, from diagnostics to drug discovery. But many clinical environments can’t rely on large, closed models that require constant internet access or centralized infrastructure. They need adaptable, privacy-focused tools that can run anywhere care is delivered.
18
-
19
- To meet this need, Google has released open-weight models specifically designed to help developers more efficiently create novel healthcare and life sciences applications. MedGemma and the rest of HAI-DEF collection give developers a starting point for building powerful tools while allowing them full control over the models and associated infrastructure.
20
-
21
- In this competition, you’ll use these models to build full fledged demonstration applications. Whether you’re building apps to streamline workflows, support patient communication, or facilitate diagnostics, your solution should demonstrate how these tools can enhance healthcare.
22
-
23
- Evaluation
24
- Minimum requirements
25
- To be considered a valid contribution, your submission should include:
26
-
27
- a high-quality writeup describing use of a specific HAI-DEF model,
28
- associated reproducible code for your initial results, and
29
- a video for judging.
30
- Your complete submission consists of a single package containing your video (3 minutes or less) and write-up (3 pages or less). This single entry can be submitted to the main competition track, and one special technology award, so separate submissions are not required. Read the section Submission Instructions for more details. Please follow the provided write-up template and refer to the judging criteria for all content requirements.
31
-
32
- Evaluation Criteria
33
- Submissions are evaluated on the following criteria:
34
-
35
- Criteria (percentage) Description
36
- Effective use of HAI-DEF models
37
- (20%) Are HAI-DEF models used appropriately?
38
-
39
- You will be assessed on: whether the submission proposes an application that uses HAI-DEF models to their fullest potential, where other solutions would likely be less effective.
40
-
41
- Note: Use of at least one of HAI-DEF models such as MedGemma is mandatory.
42
- Problem domain
43
- (15%) How important is this problem to solve and how plausible is it that AI is the right solution?
44
-
45
- You will be assessed on: storytelling, clarity of problem definition, clarity on whether there is an unmet need, the magnitude of the problem, who the user is and their improved journey given your solution.
46
- Impact potential
47
- (15%) If the solution works, what impact would it have?
48
-
49
- You will be assessed on: clear articulation of real or anticipated impact of your application within the given problem domain and description of how you calculated your estimates.
50
- Product feasibility
51
- (20%) Is the technical solution clearly feasible?
52
-
53
- You will be assessed on: technical documentation detailing model fine-tuning, model's performance analysis, your user-facing application stack, deployment challenges and how you plan on overcoming them. Consideration of how a product might be used in practice, rather than only for benchmarking.
54
- Execution and communication (30%) What is the quality of your project's execution and your clear and concise communication of your work? Your main submission package follows the provided template and includes a mandatory video demo and a write-up with links to your source material.
55
-
56
- You will be assessed on: the clarity, polish, and effectiveness of your video demonstration; the completeness and readability of your technical write-up; and the quality of your source code (e.g., organization, comments, reusability). Judges will look for a cohesive and compelling narrative across all submitted materials that effectively articulates how you meet the rest of the judging criteria.
57
- Timeline
58
- January 13, 2026 - Start Date.
59
- February 24, 2026 - Final Submission Deadline.
60
- March 17 - 24, 2026 - Anticipated Results Announcement - Time required to evaluate results is dependent on the number of submissions.
61
- All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.
62
-
63
- Judges
64
- Fereshteh Mahvar
65
- Staff Medical Software Engineer & Solutions Architect, Google Health AI
66
- Omar Sanseviero
67
- Developer Experience Lead, Google DeepMind
68
- Glenn Cameron
69
- Sr. PMM, Google
70
- Can "John" Kirmizi
71
- Software Engineer, Google Research
72
- Andrew Sellergren
73
- Software Engineer, Google Research
74
- Dave Steiner
75
- Clinical Research Scientist, Google
76
- Sunny Virmani
77
- Group Product Manager, Google Research
78
- Liron Yatziv
79
- Research Engineer, Google Research
80
- Daniel Golden
81
- Engineering Manager, Google Research
82
- Yun Liu
83
- Research Scientist, Google Research
84
- Rebecca Hemenway
85
- Health AI Strategic Partnerships, Google Research
86
- Fayaz Jamil
87
- Technical Program Manager, Google Research
88
- Tracks and Awards
89
- Main Track · $75,000
90
- Description
91
- These prizes are awarded to the best overall projects that demonstrate exceptional vision, technical execution, and potential for real-world impact.
92
-
93
- Track Awards
94
-
95
- 1st Place
96
- $30,000
97
-
98
- 2nd Place
99
- $20,000
100
-
101
- 3rd Place
102
- $15,000
103
-
104
- 4th Place
105
- $10,000
106
- Agentic Workflow Prize · $10,000
107
- Description
108
- It is awarded for the project that most effectively reimagines a complex workflow by deploying HAI-DEF models as intelligent agents or callable tools. The winning solution will demonstrate a significant overhaul of a challenging process, showcasing the power of agentic AI to improve efficiency and outcomes.
109
-
110
- Track Awards
111
-
112
- Agentic Workflow Prize 1
113
- $5,000
114
-
115
- Agentic Workflow Prize 2
116
- $5,000
117
- The Novel Task Prize · $10,000
118
- Description
119
- Awarded for the most impressive fine-tuned model that successfully adapts a HAI-DEF model to perform a useful task for which it was not originally trained on pre-release.
120
-
121
- Track Awards
122
-
123
- The Novel Task Prize 1
124
- $5,000
125
-
126
- The Novel Task Prize 2
127
- $5,000
128
- The Edge AI Prize · $5,000
129
- Description
130
- This prize is awarded to the most impressive solution that brings AI out of the cloud and into the field. It will be awarded to the team that best adapts a HAI-DEF model to run effectively on a local device like a mobile phone, portable scanner, lab instrument, or other edge hardware.
131
-
132
- Track Awards
133
-
134
- The Edge AI Prize
135
- $5,000
136
- Submission Instructions
137
- Your submission must be a Kaggle Writeup and it must be attached to this page. To create a new Writeup, click on the "New Writeup" button here. After you have saved your Writeup, you should see a "Submit" button in the top right corner. Each team is limited to submitting only a single Writeup, but that same Writeup can be un-submitted, edited, and re-submitted as many times as you'd like. Your Writeup should contain a summary of your overall project along with links to supporting resources.
138
-
139
- Choosing a track
140
- All submissions compete in the Main Track, and are eligible to win one special award prize (Agentic Workflow Prize, The Novel Task Prize, or The Edge of AI Prize). While you will have the option to select multiple tracks when you create your writeup, you can only chose the main track and one special award prize. If you choose multiple special awards, we will only consider your submission for one of your indicated special awards (randomly selected).
141
-
142
- Links
143
- Required: Video (3 min or less)
144
- Required: Public code repository
145
- Bonus: Public interactive live demo app
146
- Bonus: Open-weight Hugging Face model tracing to a HAI-DEF model
147
- Proposed Writeup template
148
- Use the following structure and in 3 pages or less present your work. Less is more! You should take advantage of the video to convey most of the concepts and keep the write-up as high level as possible.
149
-
150
- ### Project name
151
- [A concise name for your project.]
152
-
153
- ### Your team
154
- [Name your team members, their speciality and the role they played.]
155
-
156
- ### Problem statement
157
- [Your answer to the “Problem domain” & “Impact potential” criteria]
158
-
159
- ### Overall solution:
160
- [Your answer to “Effective use of HAI-DEF models” criterion]
161
-
162
- ### Technical details
163
- [Your answer to “Product feasibility” criterion]
164
- Note: If you attach a private Kaggle Resource to your public Kaggle Writeup, your private Resource will automatically be made public after the deadline.
165
-
166
- Citation
167
- Fereshteh Mahvar, Yun Liu, Daniel Golden, Fayaz Jamil, Sunny Jansen, Can Kirmizi, Rory Pilgrim, David F. Steiner, Andrew Sellergren, Richa Tiwari, Sunny Virmani, Liron Yatziv, Rebecca Hemenway, Yossi Matias, Ronit Levavi Morad, Avinatan Hassidim, Shravya Shetty, and María Cruz. The MedGemma Impact Challenge. https://kaggle.com/competitions/med-gemma-impact-challenge, 2026. Kaggle.


competition/rules.txt DELETED
@@ -1,163 +0,0 @@
1
- Competition Rules
2
- ENTRY IN THIS COMPETITION CONSTITUTES YOUR ACCEPTANCE OF THESE OFFICIAL COMPETITION RULES.
3
- See Section 3.18 for defined terms
4
-
5
- The Competition named below is a skills-based competition to promote and further the field of data science. You must register via the Competition Website to enter. To enter the Competition, you must agree to these Official Competition Rules, which incorporate by reference the provisions and content of the Competition Website and any Specific Competition Rules herein (collectively, the "Rules"). Please read these Rules carefully before entry to ensure you understand and agree. You further agree that Submission in the Competition constitutes agreement to these Rules. You may not submit to the Competition and are not eligible to receive the prizes associated with this Competition unless you agree to these Rules. These Rules form a binding legal agreement between you and the Competition Sponsor with respect to the Competition. Your competition Submissions must conform to the requirements stated on the Competition Website. Your Submissions will be scored based on the evaluation metric described on the Competition Website. Subject to compliance with the Competition Rules, Prizes, if any, will be awarded to Participants with the best scores, based on the merits of the data science models submitted. See below for the complete Competition Rules. For Competitions designated as hackathons by the Competition Sponsor (“Hackathons”), your Submissions will be judged by the Competition Sponsor based on the evaluation rubric set forth on the Competition Website (“Evaluation Rubric”). The Prizes, if any, will be awarded to Participants with the highest ranking(s) as determined by the Competition Sponsor based on such rubric.
6
-
7
- You cannot sign up to Kaggle from multiple accounts and therefore you cannot enter or submit from multiple accounts.
8
-
9
- 1. COMPETITION-SPECIFIC TERMS
10
- 1. COMPETITION TITLE
11
- The MedGemma Impact Challenge
12
-
13
- 2. COMPETITION SPONSOR
14
- Google Research
15
-
16
- 3. COMPETITION SPONSOR ADDRESS
17
- 1600 Amphitheatre Parkway, Mountain View, California 94043 USA
18
-
19
- 4. COMPETITION WEBSITE
20
- https://www.kaggle.com/competitions/med-gemma-impact-challenge
21
-
22
- 5. TOTAL PRIZES AVAILABLE: $100,000
23
- Main track: $75,000
24
-
25
- 1st Place: $30,000
26
- 2nd Place: $20,000
27
- 3rd Place: $15,000
28
- 4th Place: $10,000
29
- Special Technology Awards: $25,000
30
-
31
- Agentic Workflow prize: $10,000 (Two prizes of $5,000)
32
- The Edge AI Prize: $5,000
33
- The Novel Task Prize: $10,000 (Two prizes of $5,000)
34
- 6. WINNER LICENSE TYPE
35
- CC BY 4.0
36
-
37
- 7. DATA ACCESS AND USE
38
- No data is provided for this competition. Use of HAI-DEF and MedGemma are subject to the HAI-DEF Terms of Use.
39
-
40
- 2. COMPETITION-SPECIFIC RULES
41
- In addition to the provisions of the General Competition Rules below, you understand and agree to these Competition-Specific Rules required by the Competition Sponsor:
42
-
43
- 1. TEAM LIMITS
44
- The maximum Team size is five (5). b. Team mergers are allowed and can be performed by the Team leader. In order to merge, the combined Team must have a total Submission count less than or equal to the maximum allowed as of the Team Merger Deadline. The maximum allowed is the number of Submissions per day multiplied by the number of days the competition has been running. For Hackathons, each team is allowed one (1) Submission; any Submissions submitted by Participants before merging into a Team will be unsubmitted.
45
-
46
- 2. SUBMISSION LIMITS
47
- For Hackathons, each Team may submit one (1) Submission. This single entry can be submitted to the main competition track, and one special technology award, so separate submissions are not required.
48
-
49
- 3. COMPETITION TIMELINE
50
- Competition Timeline dates (including Entry Deadline, Final Submission Deadline, Start Date, and Team Merger Deadline, as applicable) are reflected on the competition’s Overview > Timeline page.
51
-
52
- 4. COMPETITION DATA
53
- a. Data Access and Use
54
- None. Competition Data will not be provided by Competition Sponsor for this Competition.
55
- b. Data Security
56
- You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.
57
- 5. WINNER LICENSE
58
- a. Under Section 2.8 (Winners Obligations) of the General Rules below, you hereby grant and will grant the Competition Sponsor the following license(s) with respect to your Submission if you are a Competition winner:
59
-
60
- Open Source: You hereby license and will license your winning Submission and the source code used to generate the Submission to the Competition Sponsor under CC BY 4.0 that in no event limits commercial use of such code or model containing or depending on such code.
61
-
62
- For generally commercially available software that you used to generate your Submission that is not owned by you, but that can be procured by the Competition Sponsor without undue expense, you do not need to grant the license in the preceding Section for that software.
63
-
64
- In the event that input data or pretrained models with an incompatible license are used to generate your winning solution, you do not need to grant an open source license in the preceding Section for that data and/or model(s).
65
-
66
- b. You may be required by the Sponsor to provide a detailed description of how the winning Submission was generated, to the Competition Sponsor’s specifications, as outlined in Section 2.8, Winner’s Obligations. This may include a detailed description of methodology, where one must be able to reproduce the approach by reading the description, and includes a detailed explanation of the architecture, preprocessing, loss function, training details, hyper-parameters, etc. The description should also include a link to a code repository with complete and detailed instructions so that the results obtained can be reproduced.
67
-
68
- 6. EXTERNAL DATA AND TOOLS
69
- a. You may use data other than the Competition Data (“External Data”) to develop and test your Submissions. However, you will ensure the External Data is either publicly available and equally accessible to use by all Participants of the Competition for purposes of the competition at no cost to the other Participants, or satisfies the Reasonableness criteria as outlined in Section 2.6.b below. The ability to use External Data under this Section does not limit your other obligations under these Competition Rules, including but not limited to Section 2.8 (Winners Obligations).
70
-
71
- b. Use of HAI-DEF and MedGemma are subject to the HAI-DEF Terms of Use
72
-
73
- c. The use of external data and models is acceptable unless specifically prohibited by the Host. Because of the potential costs or restrictions (e.g., “geo restrictions”) associated with obtaining rights to use external data or certain software and associated tools, their use must be “reasonably accessible to all” and of “minimal cost”. Also, regardless of the cost challenges as they might affect all Participants during the course of the competition, the costs of potentially procuring a license for software used to generate a Submission, must also be considered. The Host will employ an assessment of whether or not the following criteria can exclude the use of the particular LLM, data set(s), or tool(s):
74
-
75
- Are Participants being excluded from a competition because of the "excessive" costs for access to certain LLMs, external data, or tools that might be used by other Participants. The Host will assess the excessive cost concern by applying a “Reasonableness” standard (the “Reasonableness Standard”). The Reasonableness Standard will be determined and applied by the Host in light of things like cost thresholds and accessibility.
76
-
77
- By way of example only, a small subscription charge to use additional elements of a large language model such as Gemini Advanced are acceptable if meeting the Reasonableness Standard of Sec. 8.2. Purchasing a license to use a proprietary dataset that exceeds the cost of a prize in the competition would not be considered reasonable.
78
-
79
- d. Automated Machine Learning Tools (“AMLT”)
80
-
81
- Individual Participants and Teams may use automated machine learning tool(s) (“AMLT”) (e.g., Google toML, H2O Driverless AI, etc.) to create a Submission, provided that the Participant or Team ensures that they have an appropriate license to the AMLT such that they are able to comply with the Competition Rules.
82
- 7. ELIGIBILITY
83
- a. Unless otherwise stated in the Competition-Specific Rules above or prohibited by internal policies of the Competition Entities, employees, interns, contractors, officers and directors of Competition Entities may enter and participate in the Competition, but are not eligible to win any Prizes. "Competition Entities" means the Competition Sponsor, Kaggle Inc., and their respective parent companies, subsidiaries and affiliates. If you are such a Participant from a Competition Entity, you are subject to all applicable internal policies of your employer with respect to your participation.
84
-
85
- 8. WINNER’S OBLIGATIONS
86
- a. As a condition to being awarded a Prize, a Prize winner must fulfill the following obligations:
87
-
88
- Deliver to the Competition Sponsor the final model's software code as used to generate the winning Submission and associated documentation. The delivered software code should follow these documentation guidelines, must be capable of generating the winning Submission, and contain a description of resources required to build and/or run the executable code successfully. For avoidance of doubt, delivered software code should include training code, inference code, and a description of the required computational environment. For Hackathons, the Submission deliverables will be as described on the Competition Website, which may be information or materials that are not software code.
89
- b. To the extent that the final model’s software code includes generally commercially available software that is not owned by you, but that can be procured by the Competition Sponsor without undue expense, then instead of delivering the code for that software to the Competition Sponsor, you must identify that software, method for procuring it, and any parameters or other information necessary to replicate the winning Submission; Individual Participants and Teams who create a Submission using an AMLT may win a Prize. However, for clarity, the potential winner’s Submission must still meet the requirements of these Rules, including but not limited to Section 2.5 (Winners License), Section 2.8 (Winners Obligations), and Section 3.14 (Warranty, Indemnity, and Release).”
90
-
91
- c. Individual Participants and Teams who create a Submission using an AMLT may win a Prize. However, for clarity, the potential winner’s Submission must still meet the requirements of these Rules,
92
-
93
- Grant to the Competition Sponsor the license to the winning Submission stated in the Competition Specific Rules above, and represent that you have the unrestricted right to grant that license;
94
-
95
- Sign and return all Prize acceptance documents as may be required by Competition Sponsor or Kaggle, including without limitation: (a) eligibility certifications; (b) licenses, releases and other agreements required under the Rules; and (c) U.S. tax forms (such as IRS Form W-9 if U.S. resident, IRS Form W-8BEN if foreign resident, or future equivalents).
96
-
97
- 9. GOVERNING LAW
98
- a. Unless otherwise provided in the Competition Specific Rules above, all claims arising out of or relating to these Rules will be governed by California law, excluding its conflict of laws rules, and will be litigated exclusively in the Federal or State courts of Santa Clara County, California, USA. The parties consent to personal jurisdiction in those courts. If any provision of these Rules is held to be invalid or unenforceable, all remaining provisions of the Rules will remain in full force and effect.
99
-
100
- 3. GENERAL COMPETITION RULES - BINDING AGREEMENT
101
- 1. ELIGIBILITY
102
- a. To be eligible to enter the Competition, you must be:
103
-
104
- a registered account holder at Kaggle.com;
105
- the older of 18 years old or the age of majority in your jurisdiction of residence (unless otherwise agreed to by Competition Sponsor and appropriate parental/guardian consents have been obtained by Competition Sponsor);
106
- not a resident of Crimea, so-called Donetsk People's Republic (DNR) or Luhansk People's Republic (LNR), Cuba, Iran, Syria, or North Korea; and
107
- not a person or representative of an entity under U.S. export controls or sanctions (see: https://www.treasury.gov/resourcecenter/sanctions/Programs/Pages/Programs.aspx).
108
- b. Competitions are open to residents of the United States and worldwide, except that if you are a resident of Crimea, so-called Donetsk People's Republic (DNR) or Luhansk People's Republic (LNR), Cuba, Iran, Syria, North Korea, or are subject to U.S. export controls or sanctions, you may not enter the Competition. Other local rules and regulations may apply to you, so please check your local laws to ensure that you are eligible to participate in skills-based competitions. The Competition Host reserves the right to forego or award alternative Prizes where needed to comply with local laws. If a winner is located in a country where prizes cannot be awarded, then they are not eligible to receive a prize.
109
-
110
- c. If you are entering as a representative of a company, educational institution or other legal entity, or on behalf of your employer, these rules are binding on you, individually, and the entity you represent or where you are an employee. If you are acting within the scope of your employment, or as an agent of another party, you warrant that such party or your employer has full knowledge of your actions and has consented thereto, including your potential receipt of a Prize. You further warrant that your actions do not violate your employer's or entity's policies and procedures.
111
-
112
- d. The Competition Sponsor reserves the right to verify eligibility and to adjudicate on any dispute at any time. If you provide any false information relating to the Competition concerning your identity, residency, mailing address, telephone number, email address, ownership of right, or information required for entering the Competition, you may be immediately disqualified from the Competition.
113
-
114
- 2. SPONSOR AND HOSTING PLATFORM
115
- a. The Competition is sponsored by Competition Sponsor named above. The Competition is hosted on behalf of Competition Sponsor by Kaggle Inc. ("Kaggle"). Kaggle is an independent contractor of Competition Sponsor, and is not a party to this or any agreement between you and Competition Sponsor. You understand that Kaggle has no responsibility with respect to selecting the potential Competition winner(s) or awarding any Prizes. Kaggle will perform certain administrative functions relating to hosting the Competition, and you agree to abide by the provisions relating to Kaggle under these Rules. As a Kaggle.com account holder and user of the Kaggle competition platform, remember you have accepted and are subject to the Kaggle Terms of Service at www.kaggle.com/terms in addition to these Rules.
116
-
117
- 3. COMPETITION PERIOD
118
- a. For the purposes of Prizes, the Competition will run from the Start Date and time to the Final Submission Deadline (such duration the “Competition Period”). The Competition Timeline is subject to change, and Competition Sponsor may introduce additional hurdle deadlines during the Competition Period. Any updated or additional deadlines will be publicized on the Competition Website. It is your responsibility to check the Competition Website regularly to stay informed of any deadline changes. YOU ARE RESPONSIBLE FOR DETERMINING THE CORRESPONDING TIME ZONE IN YOUR LOCATION.
119
-
120
- 4. COMPETITION ENTRY
121
- a. NO PURCHASE NECESSARY TO ENTER OR WIN. To enter the Competition, you must register on the Competition Website prior to the Entry Deadline, and follow the instructions for developing and entering your Submission through the Competition Website. Your Submissions must be made in the manner and format, and in compliance with all other requirements, stated on the Competition Website (the "Requirements"). Submissions must be received before any Submission deadlines stated on the Competition Website. Submissions not received by the stated deadlines will not be eligible to receive a Prize. b. Except as expressly allowed in Hackathons as set forth on the Competition Website, submissions may not use or incorporate information from hand labeling or human prediction of the validation dataset or test data records. c. If the Competition is a multi-stage competition with temporally separate training and/or test data, one or more valid Submissions may be required during each Competition stage in the manner described on the Competition Website in order for the Submissions to be Prize eligible. d. Submissions are void if they are in whole or part illegible, incomplete, damaged, altered, counterfeit, obtained through fraud, or late. Competition Sponsor reserves the right to disqualify any entrant who does not follow these Rules, including making a Submission that does not meet the Requirements.
122
-
123
- 5. INDIVIDUALS AND TEAMS
124
- a. Individual Account. You may make Submissions only under one, unique Kaggle.com account. You will be disqualified if you make Submissions through more than one Kaggle account, or attempt to falsify an account to act as your proxy. You may submit up to the maximum number of Submissions per day as specified on the Competition Website. b. Teams. If permitted under the Competition Website guidelines, multiple individuals may collaborate as a Team; however, you may join or form only one Team. Each Team member must be a single individual with a separate Kaggle account. You must register individually for the Competition before joining a Team. You must confirm your Team membership to make it official by responding to the Team notification message sent to your Kaggle account. Team membership may not exceed the Maximum Team Size stated on the Competition Website. c. Team Merger. Teams (or individual Participants) may request to merge via the Competition Website. Team mergers may be allowed provided that: (i) the combined Team does not exceed the Maximum Team Size; (ii) the number of Submissions made by the merging Teams does not exceed the number of Submissions permissible for one Team at the date of the merger request; (iii) the merger is completed before the earlier of: any merger deadline or the Competition deadline; and (iv) the proposed combined Team otherwise meets all the requirements of these Rules. d. Private Sharing. No private sharing outside of Teams. Privately sharing code or data outside of Teams is not permitted. It's okay to share code if made available to all Participants on the forums.
125
-
126
- 6. SUBMISSION CODE REQUIREMENTS
127
- a. Private Code Sharing. Unless otherwise specifically permitted under the Competition Website or Competition Specific Rules above, during the Competition Period, you are not allowed to privately share source or executable code developed in connection with or based upon the Competition Data or other source or executable code relevant to the Competition (“Competition Code”). This prohibition includes sharing Competition Code between separate Teams, unless a Team merger occurs. Any such sharing of Competition Code is a breach of these Competition Rules and may result in disqualification. b. Public Code Sharing. You are permitted to publicly share Competition Code, provided that such public sharing does not violate the intellectual property rights of any third party. If you do choose to share Competition Code or other such code, you are required to share it on Kaggle.com on the discussion forum or notebooks associated specifically with the Competition for the benefit of all competitors. By so sharing, you are deemed to have licensed the shared code under an Open Source Initiative-approved license (see www.opensource.org) that in no event limits commercial use of such Competition Code or model containing or depending on such Competition Code. c. Use of Open Source. Unless otherwise stated in the Specific Competition Rules above, if open source code is used in the model to generate the Submission, then you must only use open source code licensed under an Open Source Initiative-approved license (see www.opensource.org) that in no event limits commercial use of such code or model containing or depending on such code.
-
- 7. DETERMINING WINNERS
- a. Each Submission will be scored and/or ranked by the evaluation metric, or Evaluation Rubric (in the case of Hackathon Competitions), stated on the Competition Website. During the Competition Period, the current ranking will be visible on the Competition Website's Public Leaderboard. The potential winner(s) are determined solely by the leaderboard ranking on the Private Leaderboard, subject to compliance with these Rules. The Public Leaderboard will be based on the public test set and the Private Leaderboard will be based on the private test set. There will be no leaderboards for Hackathon Competitions. b. In the event of a tie, the Submission that was entered first to the Competition will be the winner. In the event a potential winner is disqualified for any reason, the Submission that received the next highest score rank will be chosen as the potential winner. For Hackathon Competitions, each of the top Submissions will get a unique ranking and there will be no tiebreakers.
-
- 8. NOTIFICATION OF WINNERS & DISQUALIFICATION
- a. The potential winner(s) will be notified by email. b. If a potential winner (i) does not respond to the notification attempt within one (1) week from the first notification attempt or (ii) notifies Kaggle within one week after the Final Submission Deadline that the potential winner does not want to be nominated as a winner or does not want to receive a Prize, then, in each case (i) and (ii) such potential winner will not receive any Prize, and an alternate potential winner will be selected from among all eligible entries received based on the Competition’s judging criteria. c. In case (i) and (ii) above Kaggle may disqualify the Participant. However, in case (ii) above, if requested by Kaggle, such potential winner may provide code and documentation to verify the Participant’s compliance with these Rules. If the potential winner provides code and documentation to the satisfaction of Kaggle, the Participant will not be disqualified pursuant to this paragraph. d. Competition Sponsor reserves the right to disqualify any Participant from the Competition if the Competition Sponsor reasonably believes that the Participant has attempted to undermine the legitimate operation of the Competition by cheating, deception, or other unfair playing practices or abuses, threatens or harasses any other Participants, Competition Sponsor or Kaggle. e. A disqualified Participant may be removed from the Competition leaderboard, at Kaggle's sole discretion. If a Participant is removed from the Competition Leaderboard, additional winning features associated with the Kaggle competition platform, for example Kaggle points or medals, may also not be awarded. f. The final leaderboard list will be publicly displayed at Kaggle.com. Determinations of Competition Sponsor are final and binding.
-
- 9. PRIZES
- a. Prize(s) are as described on the Competition Website and are only available for winning during the time period described on the Competition Website. The odds of winning any Prize depends on the number of eligible Submissions received during the Competition Period and the skill of the Participants. b. All Prizes are subject to Competition Sponsor's review and verification of the Participant’s eligibility and compliance with these Rules, and the compliance of the winning Submissions with the Submissions Requirements. In the event that the Submission demonstrates non-compliance with these Competition Rules, Competition Sponsor may at its discretion take either of the following actions: (i) disqualify the Submission(s); or (ii) require the potential winner to remediate within one week after notice all issues identified in the Submission(s) (including, without limitation, the resolution of license conflicts, the fulfillment of all obligations required by software licenses, and the removal of any software that violates the software restrictions). c. A potential winner may decline to be nominated as a Competition winner in accordance with Section 3.8. d. Potential winners must return all required Prize acceptance documents within two (2) weeks following notification of such required documents, or such potential winner will be deemed to have forfeited the prize and another potential winner will be selected. Prize(s) will be awarded within approximately thirty (30) days after receipt by Competition Sponsor or Kaggle of the required Prize acceptance documents. Transfer or assignment of a Prize is not allowed. e. You are not eligible to receive any Prize if you do not meet the Eligibility requirements in Section 2.7 and Section 3.1 above. f. If a Team wins a monetary Prize, the Prize money will be allocated in even shares between the eligible Team members, unless the Team unanimously opts for a different Prize split and notifies Kaggle before Prizes are issued.
-
- 10. TAXES
- a. ALL TAXES IMPOSED ON PRIZES ARE THE SOLE RESPONSIBILITY OF THE WINNERS. Payments to potential winners are subject to the express requirement that they submit all documentation requested by Competition Sponsor or Kaggle for compliance with applicable state, federal, local and foreign (including provincial) tax reporting and withholding requirements. Prizes will be net of any taxes that Competition Sponsor is required by law to withhold. If a potential winner fails to provide any required documentation or comply with applicable laws, the Prize may be forfeited and Competition Sponsor may select an alternative potential winner. Any winners who are U.S. residents will receive an IRS Form-1099 in the amount of their Prize.
-
- 11. GENERAL CONDITIONS
- a. All federal, state, provincial and local laws and regulations apply.
-
- 12. PUBLICITY
- a. You agree that Competition Sponsor, Kaggle and its affiliates may use your name and likeness for advertising and promotional purposes without additional compensation, unless prohibited by law.
-
- 13. PRIVACY
- a. You acknowledge and agree that Competition Sponsor and Kaggle may collect, store, share and otherwise use personally identifiable information provided by you during the Kaggle account registration process and the Competition, including but not limited to, name, mailing address, phone number, and email address (“Personal Information”). Kaggle acts as an independent controller with regard to its collection, storage, sharing, and other use of this Personal Information, and will use this Personal Information in accordance with its Privacy Policy <www.kaggle.com/privacy>, including for administering the Competition. As a Kaggle.com account holder, you have the right to request access to, review, rectification, portability or deletion of any personal data held by Kaggle about you by logging into your account and/or contacting Kaggle Support at <www.kaggle.com/contact>. b. As part of Competition Sponsor performing this contract between you and the Competition Sponsor, Kaggle will transfer your Personal Information to Competition Sponsor, which acts as an independent controller with regard to this Personal Information. As a controller of such Personal Information, Competition Sponsor agrees to comply with all U.S. and foreign data protection obligations with regard to your Personal Information. Kaggle will transfer your Personal Information to Competition Sponsor in the country specified in the Competition Sponsor Address listed above, which may be a country outside the country of your residence. Such country may not have privacy laws and regulations similar to those of the country of your residence.
-
- 14. WARRANTY, INDEMNITY AND RELEASE
- a. You warrant that your Submission is your own original work and, as such, you are the sole and exclusive owner and rights holder of the Submission, and you have the right to make the Submission and grant all required licenses. You agree not to make any Submission that: (i) infringes any third party proprietary rights, intellectual property rights, industrial property rights, personal or moral rights or any other rights, including without limitation, copyright, trademark, patent, trade secret, privacy, publicity or confidentiality obligations, or defames any person; or (ii) otherwise violates any applicable U.S. or foreign state or federal law. b. To the maximum extent permitted by law, you indemnify and agree to keep indemnified Competition Entities at all times from and against any liability, claims, demands, losses, damages, costs and expenses resulting from any of your acts, defaults or omissions and/or a breach of any warranty set forth herein. To the maximum extent permitted by law, you agree to defend, indemnify and hold harmless the Competition Entities from and against any and all claims, actions, suits or proceedings, as well as any and all losses, liabilities, damages, costs and expenses (including reasonable attorneys fees) arising out of or accruing from: (a) your Submission or other material uploaded or otherwise provided by you that infringes any third party proprietary rights, intellectual property rights, industrial property rights, personal or moral rights or any other rights, including without limitation, copyright, trademark, patent, trade secret, privacy, publicity or confidentiality obligations, or defames any person; (b) any misrepresentation made by you in connection with the Competition; (c) any non-compliance by you with these Rules or any applicable U.S. or foreign state or federal law; (d) claims brought by persons or entities other than the parties to these Rules arising from or related to your involvement with the Competition; and (e) your acceptance, possession, misuse or use of any Prize, or your participation in the Competition and any Competition-related activity. c. You hereby release Competition Entities from any liability associated with: (a) any malfunction or other problem with the Competition Website; (b) any error in the collection, processing, or retention of any Submission; or (c) any typographical or other error in the printing, offering or announcement of any Prize or winners.
-
- 15. INTERNET
- a. Competition Entities are not responsible for any malfunction of the Competition Website or any late, lost, damaged, misdirected, incomplete, illegible, undeliverable, or destroyed Submissions or entry materials due to system errors, failed, incomplete or garbled computer or other telecommunication transmission malfunctions, hardware or software failures of any kind, lost or unavailable network connections, typographical or system/human errors and failures, technical malfunction(s) of any telephone network or lines, cable connections, satellite transmissions, servers or providers, or computer equipment, traffic congestion on the Internet or at the Competition Website, or any combination thereof, which may limit a Participant’s ability to participate.
-
- 16. RIGHT TO CANCEL, MODIFY OR DISQUALIFY
- a. If for any reason the Competition is not capable of running as planned, including infection by computer virus, bugs, tampering, unauthorized intervention, fraud, technical failures, or any other causes which corrupt or affect the administration, security, fairness, integrity, or proper conduct of the Competition, Competition Sponsor reserves the right to cancel, terminate, modify or suspend the Competition. Competition Sponsor further reserves the right to disqualify any Participant who tampers with the submission process or any other part of the Competition or Competition Website. Any attempt by a Participant to deliberately damage any website, including the Competition Website, or undermine the legitimate operation of the Competition is a violation of criminal and civil laws. Should such an attempt be made, Competition Sponsor and Kaggle each reserves the right to seek damages from any such Participant to the fullest extent of the applicable law.
-
- 17. NOT AN OFFER OR CONTRACT OF EMPLOYMENT
- a. Under no circumstances will the entry of a Submission, the awarding of a Prize, or anything in these Rules be construed as an offer or contract of employment with Competition Sponsor or any of the Competition Entities. You acknowledge that you have submitted your Submission voluntarily and not in confidence or in trust. You acknowledge that no confidential, fiduciary, agency, employment or other similar relationship is created between you and Competition Sponsor or any of the Competition Entities by your acceptance of these Rules or your entry of your Submission.
-
- 18. DEFINITIONS
- a. "Competition Data" are the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to Participants. b. An “Entry” is when a Participant has joined, signed up, or accepted the rules of a competition. Entry is required to make a Submission to a competition. c. A “Final Submission” is the Submission selected by the user, or automatically selected by Kaggle in the event not selected by the user, that is/are used for final placement on the competition leaderboard. d. A “Participant” or “Participant User” is an individual who participates in a competition by entering the competition and making a Submission. e. The “Private Leaderboard” is a ranked display of Participants’ Submission scores against the private test set. The Private Leaderboard determines the final standing in the competition. f. The “Public Leaderboard” is a ranked display of Participants’ Submission scores against a representative sample of the test data. This leaderboard is visible throughout the competition. g. A “Sponsor” is responsible for hosting the competition, which includes but is not limited to providing the data for the competition, determining winners, and enforcing competition rules. h. A “Submission” is anything provided by the Participant to the Sponsor to be evaluated for competition purposes and determine leaderboard position. A Submission may be made as a model, notebook, prediction file, or other format as determined by the Sponsor. i. A “Team” is one or more Participants participating together in a Kaggle competition, by officially merging together as a Team within the competition platform.
 
docs/deploy_medgemma_hf.md CHANGED
@@ -8,7 +8,7 @@ OpenAI-compatible API.
 
 | Feature | Details |
 |---|---|
- | **Model** | `google/medgemma-27b-text-it` (HAI-DEF, competition-required) |
+ | **Model** | `google/medgemma-27b-text-it` (HAI-DEF) |
 | **Cost** | ~$2.50/hr (1× A100 80 GB on AWS) |
 | **Scale-to-zero** | Yes — no charges while idle |
 | **API format** | OpenAI-compatible (TGI) — zero code changes |
@@ -101,7 +101,7 @@ python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
 |---|---|---|
 | Validation run (120 cases @ ~1 min/case) | ~2 hrs | ~$5 |
 | Development / debugging (4 hrs) | ~4 hrs | ~$10 |
- | Competition demo recording | ~1 hr | ~$2.50 |
+ | Demo recording | ~1 hr | ~$2.50 |
 | **Total estimated** | **~7 hrs** | **~$17.50** |
 
 With scale-to-zero enabled, the endpoint automatically shuts down after 15 min
docs/kaggle_writeup.md DELETED
@@ -1,87 +0,0 @@
- # CDS Agent — Agentic Clinical Decision Support System
-
- ### Project name
-
- **CDS Agent** — An agentic pipeline that orchestrates MedGemma across six specialized clinical reasoning steps, augmented with drug safety APIs and guideline RAG, to produce comprehensive decision support reports in real time.
-
- ### Your team
-
- | Name | Specialty | Role |
- |------|-----------|------|
- | [Your Name] | Software Engineering / AI | Architecture, full-stack development, agent pipeline, RAG system, validation framework |
-
- ### Problem statement
-
- **The problem:** Clinical decision-making is among the most cognitively demanding tasks in medicine. For every patient encounter, a clinician must simultaneously parse the clinical narrative, generate a differential diagnosis, recall drug interactions across the medication list, remember relevant clinical guidelines, and synthesize all of this into a care plan — often while fatigued and managing multiple patients.
-
- This cognitive burden has real consequences. Diagnostic errors affect approximately 12 million Americans annually. Medication errors harm over 1.5 million people per year. Many of these errors are not from lack of knowledge, but from the difficulty of integrating information from multiple sources under time pressure.
-
- **Who benefits:** Emergency physicians, hospitalists, and primary care clinicians — anyone making complex diagnostic and treatment decisions at the point of care. Patients benefit from more thorough, evidence-based care with fewer diagnostic and medication errors.
-
- **Impact potential:** The U.S. alone sees ~140 million ED visits per year. Even a modest improvement in diagnostic completeness or medication safety across a fraction of these encounters represents significant harm reduction. Our system surfaces specific, actionable conflicts between clinical guidelines and patient data — the kind of gap that leads to missed diagnoses, omitted treatments, and monitoring failures. By automating the information-gathering and synthesis steps of clinical reasoning, CDS Agent gives clinicians back cognitive bandwidth for the parts of medicine that require human judgment.
-
- ### Overall solution
-
- **HAI-DEF model:** MedGemma (`google/medgemma-27b-text-it`) — Google's medical-domain model from the Health AI Developer Foundations collection, deployed on a HuggingFace Dedicated Endpoint (1× A100 80 GB, TGI, bfloat16).
-
- **Why MedGemma is essential, not bolted on:** MedGemma is the reasoning engine in four of six pipeline steps. It is not a wrapper around a general-purpose model — it leverages MedGemma's medical training to:
-
- 1. **Parse** free-text clinical narratives into structured patient profiles (demographics, vitals, labs, medications, allergies, history)
- 2. **Reason** about the case via chain-of-thought to produce a ranked differential diagnosis with explicit evidence for/against each candidate
- 3. **Detect conflicts** between guideline recommendations and the patient's actual data — identifying omissions, contradictions, dosage concerns, and monitoring gaps
- 4. **Synthesize** all pipeline outputs into a comprehensive CDS report with recommendations, warnings, and citations
-
- Steps 3 and 4 augment MedGemma with external tools: **OpenFDA + RxNorm APIs** for drug interaction data, and **ChromaDB RAG** over 62 curated clinical guidelines spanning 14 specialties (sourced from ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, and others).
-
- The agentic architecture is critical: no single LLM call can parse patient data, check drug interactions against federal databases, retrieve specialty-specific guidelines, AND cross-reference those guidelines against the patient's profile. The orchestrated pipeline produces results that no individual component could achieve alone.
-
- ### Technical details
-
- **Architecture:**
-
- ```
- Frontend (Next.js 14) ←WebSocket→ Backend (FastAPI)
- ↓
- Orchestrator (6-step pipeline)
- ├── Step 1: Parse Patient Data (MedGemma)
- ├── Step 2: Clinical Reasoning (MedGemma)
- ├── Step 3: Drug Interaction Check (OpenFDA + RxNorm)
- ├── Step 4: Guideline Retrieval (ChromaDB RAG, 62 guidelines)
- ├── Step 5: Conflict Detection (MedGemma)
- └── Step 6: Synthesis (MedGemma)
- ```
-
- All inter-step data is strongly typed (Pydantic v2). Each step streams its status to the frontend via WebSocket — the clinician watches the pipeline execute in real time, building trust through transparency.
-
- **Key design decisions:**
- - **Custom orchestrator** over LangChain — simpler, more transparent, no framework overhead
- - **Conflict detection over confidence scores** — we deliberately rejected numeric "confidence" scores (uncalibrated LLM outputs create dangerous anchoring bias). Instead, we compare guidelines against patient data to surface specific, actionable conflicts with cited sources and suggested resolutions.
- - **RAG with curated guidelines** — 62 guidelines across 14 specialties, indexed with sentence-transformer embeddings (all-MiniLM-L6-v2). 100% top-1 retrieval accuracy across 30 test queries.
-
- **Validation results:**
-
- | Test | Result |
- |------|--------|
- | RAG retrieval accuracy | 30/30 (100%) — correct guideline ranked #1 for every query |
- | E2E pipeline (ACS case) | All 6 steps passed, 75 s total |
- | Clinical test suite | 22 scenarios across 14 specialties |
- | MedQA (50 USMLE cases) | 94% pipeline success, 36% top-1 diagnostic accuracy, 38% mentioned |
- | MedQA diagnostic-only (36 cases) | 39% mentioned correct diagnosis in report |
-
- The 36% top-1 on MedQA reflects that many questions are non-diagnostic (treatment, mechanism, statistics) — the pipeline generates differential diagnoses, not multiple-choice answers. On diagnostic questions specifically, 39% mentioned the correct diagnosis.
-
- **Deployment:**
- - **Model hosting:** HuggingFace Dedicated Endpoint (`medgemma-27b-cds`), 1× A100 80 GB, scale-to-zero billing
- - **HIPAA path:** MedGemma is open-weight and can be self-hosted on-premises, eliminating external data transmission
- - **Scalability:** FastAPI async + uvicorn workers; production path includes task queue and horizontal scaling
- - **EHR integration:** Current input is manual text paste; production system would use FHIR APIs for automatic patient data extraction
-
- **Stack:** Python 3.10, FastAPI, ChromaDB, sentence-transformers, Next.js 14, React 18, TypeScript, Tailwind CSS
-
- ---
-
- **Links:**
- - **Video:** [TODO — insert video link]
- - **Code:** [github.com/bshepp/clinical-decision-support-agent](https://github.com/bshepp/clinical-decision-support-agent)
- - **Live Demo:** [TODO — insert demo link if deployed]
- - **HuggingFace Model:** [google/medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it)
 
docs/video_script.md DELETED
@@ -1,125 +0,0 @@
- # CDS Agent — Demo Video Script
-
- > **Target length:** 3 minutes (max)
- > **Format:** Screen recording with voiceover
- > **Tool suggestion:** OBS Studio, Loom, or similar
-
- ---
-
- ## PRE-RECORDING CHECKLIST
-
- - [ ] Ensure HF Dedicated Endpoint is running (check `https://bshepp-cds-agent.hf.space/api/health/config`)
- - [ ] Open browser to `https://demo.briansheppard.com` (or `https://bshepp-cds-agent.hf.space`)
- - [ ] Close unnecessary tabs/notifications
- - [ ] Submit one case end-to-end before recording to confirm model is warm (watch for warm-up screen)
- - [ ] Browser zoom ~110-125% for readability on video
- - [ ] **Local fallback** (if Space is down): `cd src/backend && uvicorn app.main:app --host 0.0.0.0 --port 8002` + `cd src/frontend && npm run dev`, then open `http://localhost:3000`
-
- ---
-
- ## SCRIPT
-
- ### OPENING — The Problem (0:00 – 0:30)
-
- **[SCREEN: Title slide or the app landing page]**
-
- > "Clinical decision-making is one of the most cognitively demanding tasks in medicine. For every patient, a clinician must simultaneously parse the history, generate a differential, recall drug interactions, remember guidelines, and synthesize a care plan — all under time pressure.
- >
- > Diagnostic errors affect 12 million Americans annually. Many aren't from lack of knowledge — they're from the difficulty of integrating information from multiple sources at once.
- >
- > CDS Agent solves this with an agentic pipeline powered by MedGemma."
-
- ---
-
- ### LIVE DEMO — The Pipeline in Action (0:30 – 2:00)
-
- **[SCREEN: App interface — PatientInput component visible]**
-
- > "Let me show you how it works. I'll load a built-in sample case — a 55-year-old male presenting to the ED with acute substernal chest pain radiating to his left arm and jaw, with diaphoresis and nausea. He has hypertension, type 2 diabetes, and hyperlipidemia, and he's on metformin, lisinopril, atorvastatin, and aspirin."
-
- **[ACTION: Click the "Chest Pain (55M)" sample case button, then click "Analyze Patient Case"]**
-
- > "When I submit this case, the agent pipeline kicks off. You can see each step executing in real time on the left."
-
- **[SCREEN: AgentPipeline component showing steps lighting up one by one]**
-
- > "Step 1 — MedGemma parses the free-text narrative into structured patient data: demographics, vitals, labs, medications, allergies, history."
-
- **[Wait for Step 1 to complete]**
-
- > "Step 2 — Clinical reasoning. MedGemma generates a ranked differential diagnosis with chain-of-thought reasoning. It's considering ACS, GERD, PE, aortic dissection — weighing evidence for and against each."
-
- **[Wait for Step 2 to complete]**
-
- > "Steps 3 and 4 run in parallel. Step 3 — Drug interaction check. This isn't the LLM guessing — it's querying the actual OpenFDA and RxNorm databases for his four medications. Real API data, not hallucination. Step 4 — Guideline retrieval. Our RAG system searches 62 curated clinical guidelines across 14 specialties. For this case it pulls the ACC/AHA chest pain and ACS guidelines."
-
- **[Wait for Steps 3 & 4 to complete]**
-
- > "Step 5 — and this is what makes it a real safety tool — Conflict Detection. MedGemma compares what the guidelines recommend against what the patient is actually receiving. It surfaces omissions, contradictions, dosage concerns, and monitoring gaps."
-
- **[Wait for Step 5 to complete]**
-
- > "Step 6 — Synthesis. Everything gets integrated into a single comprehensive report."
-
- **[Wait for Step 6 to complete. Total pipeline ~2-3 minutes]**
-
- ---
-
- ### THE REPORT — Reviewing Results (2:00 – 2:40)
-
- **[SCREEN: Scroll through the CDSReport component]**
-
- > "Here's the CDS report. At the top — the ranked differential diagnosis. ACS is correctly identified as the leading diagnosis, with clear reasoning. The elevated troponin and ST elevation in II, III, and aVF support an inferior STEMI."
-
- **[ACTION: Scroll to drug interactions section]**
-
- > "Drug interaction warnings pulled from federal databases — not LLM-generated, real data."
-
- **[ACTION: Scroll to Conflicts & Gaps section — highlight the red-bordered cards]**
-
- > "This is the most important section — Conflicts and Gaps. Each card shows a specific conflict: what the guideline recommends, what the patient data shows, the severity, and a suggested resolution. These are the gaps that lead to missed diagnoses and omitted treatments in real clinical practice."
-
- **[ACTION: Scroll to guidelines section]**
-
- > "Cited guideline recommendations from authoritative sources — ACC/AHA, ADA, and others."
-
- **[ACTION: Click the "Download .md" button in the left panel]**
-
- > "And clinicians can download the full report as Markdown for their records."
-
- ---
-
- ### CLOSING — Technical & Impact (2:40 – 3:00)
-
- **[SCREEN: Back to app overview or a summary slide]**
-
- > "Under the hood: MedGemma 27B powers four of six pipeline steps — parsing, reasoning, conflict detection, and synthesis. It's augmented with OpenFDA and RxNorm APIs for drug safety, and a 62-guideline RAG corpus for evidence-based recommendations.
- >
- > We validated on 50 MedQA USMLE cases with 94% pipeline reliability and 38% diagnostic mention rate — before any fine-tuning.
- >
- > With 140 million ED visits per year in the U.S. alone, even a modest improvement in diagnostic completeness and medication safety represents lives saved. CDS Agent is built to make that happen."
-
- **[END]**
-
- ---
-
- ## TIMING SUMMARY
-
- | Section | Duration | Cumulative |
- |---------|----------|------------|
- | Opening — The Problem | 30 sec | 0:30 |
- | Live Demo — Pipeline Execution | 90 sec | 2:00 |
- | Report Review | 40 sec | 2:40 |
- | Closing — Tech & Impact | 20 sec | 3:00 |
-
- > **Note on timing:** The pipeline typically takes 2-3 minutes on the live endpoint. You can speed up the wait portions (1.5x-2x) in post-editing while keeping narration at normal speed to fit within 3 minutes. Alternatively, record narration separately and overlay it.
-
- ## TIPS
-
- - **Warm up before recording** — Submit a test case first. If the model has scaled to zero you'll see a "Model Warming Up" spinner; wait for it to complete (~1-2 min) before the real recording
- - **Speak during pipeline wait times** — the pipeline execution is perfect narration time
- - **Don't rush** — the real-time pipeline visualization IS the demo; let it breathe
- - **Zoom into the Conflicts section** — it's the most visually impressive and differentiating feature
- - **If the endpoint is slow** — speed up wait portions in post-editing (1.5x-2x) while keeping narration at normal speed
- - **Retry resilience** — if a pipeline run fails, the "Try Again" button lets you retry without reloading the page
- - **Backup plan** — if the HF endpoint is down, you can use Google AI Studio with Gemma 3 27B IT as a fallback (update .env accordingly)
 
docs/writeup_draft.md DELETED
@@ -1,169 +0,0 @@
- # CDS Agent — Project Writeup
-
- > Competition writeup template filled in with actual project details.
- > Also serves as the primary project summary document.
-
- ---
-
- ### Project name
-
- **CDS Agent** — Agentic Clinical Decision Support System
-
- ### Your team
-
- | Name | Specialty | Role |
- |------|-----------|------|
- | (Developer) | Software Engineering / AI | Full-stack development, agent architecture, RAG system, testing |
-
- ### Problem statement
-
- **The Problem:**
-
- Clinical decision-making is one of the most cognitively demanding tasks in medicine. A clinician seeing a patient must simultaneously: review the patient's history and current presentation, mentally generate a differential diagnosis, recall drug interactions for current and proposed medications, remember relevant clinical guidelines, and synthesize all of this into a coherent care plan — often while fatigued, time-pressured, and managing multiple patients.
-
- Medical errors remain a leading cause of patient harm. Studies estimate that diagnostic errors affect approximately 12 million Americans annually, and medication errors harm over 1.5 million people per year. Many of these errors stem not from lack of knowledge, but from the cognitive burden of integrating information from multiple sources under time pressure.
-
- **Who is affected:**
-
- - **Clinicians** (primary users) — physicians, nurse practitioners, physician assistants in emergency departments, urgent care, and inpatient settings where rapid, comprehensive decision-making is critical
- - **Patients** — who benefit from more thorough, evidence-based care with fewer diagnostic and medication errors
- - **Health systems** — which bear the cost of medical errors, readmissions, and liability
-
- **Why AI is the right solution:**
-
- This problem cannot be solved with traditional rule-based systems because:
- 1. Clinical reasoning requires understanding free-text narratives, not just coded data
- 2. Differential diagnosis generation requires probabilistic reasoning over thousands of conditions
- 3. Guideline retrieval requires semantic understanding of clinical context
- 4. Synthesis requires integrating heterogeneous data (structured labs, free-text guidelines, API-sourced drug data) into coherent recommendations
-
- Large language models — specifically medical-domain models like MedGemma — can perform all of these tasks. But a single LLM call is insufficient. The agent architecture orchestrates the LLM across multiple specialized steps, augmented with external tools (drug APIs, RAG), to produce a result that no single component could achieve alone.
-
- **Impact potential:**
-
- If deployed, this system could:
- - Reduce diagnostic error rates by providing systematic differential diagnosis generation for every patient encounter
- - Catch drug interactions that clinicians might miss, especially in polypharmacy patients
- - Ensure guideline-concordant care by surfacing relevant, current clinical guidelines at the point of care
- - Save clinician time by automating the information-gathering and synthesis steps of clinical reasoning
-
- Estimated reach: There are approximately 140 million ED visits per year in the US alone. Even a modest improvement in diagnostic accuracy or medication safety across a fraction of these encounters would represent significant impact.
-
- ### Overall solution
-
- **HAI-DEF models used:**
-
- - **MedGemma** (`google/medgemma-27b-text-it`) — Google's medical-domain model from the Health AI Developer Foundations (HAI-DEF) collection
- - Development/validation also performed with **Gemma 3 27B IT** (`gemma-3-27b-it`) via Google AI Studio for rapid iteration
-
- **Why MedGemma:**
-
- MedGemma is purpose-built for medical applications and is part of Google's HAI-DEF collection:
- - Trained specifically for health and biomedical tasks, providing stronger clinical reasoning than general-purpose models
- - Open-weight model that can be self-hosted for HIPAA compliance in production
- - Large enough (27B parameters) for complex chain-of-thought clinical reasoning
- - Designed to be the foundation for healthcare AI applications — exactly what this competition demands
-
- **How the model is used:**
-
- The model serves as the reasoning engine in a 6-step agentic pipeline:
-
- 1. **Patient Data Parsing** (LLM) — Extracts structured patient data from free-text clinical narratives
- 2. **Clinical Reasoning** (LLM) — Generates ranked differential diagnoses with chain-of-thought reasoning
- 3. **Drug Interaction Check** (External APIs) — Queries OpenFDA and RxNorm for medication safety
- 4. **Guideline Retrieval** (RAG) — Retrieves relevant clinical guidelines from a 62-guideline corpus using ChromaDB
- 5. **Conflict Detection** (LLM) — Compares guideline recommendations against patient data to identify omissions, contradictions, dosage concerns, monitoring gaps, allergy risks, and interaction gaps
- 6. **Synthesis** (LLM) — Integrates all outputs into a comprehensive CDS report with conflicts prominently featured
-
- The model is used in Steps 1, 2, 5, and 6 — parsing, reasoning, conflict detection, and synthesis. This demonstrates the model being used "to its fullest potential" across multiple distinct clinical tasks within a single workflow.
-
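The six-step flow above can be sketched as a chain of state-transforming functions. This is an illustrative sketch only: the function names and the dict-based state are invented here, not the project's actual API, and the LLM and external-API calls are stubbed with canned values.

```python
# Hypothetical sketch of the 6-step agentic pipeline; all names are
# illustrative and every LLM/API call is replaced by a stub.
from typing import Any, Callable

Step = Callable[[dict[str, Any]], dict[str, Any]]

def parse_patient(state):        # Step 1 (LLM in the real system)
    state["patient"] = {"age": 58, "complaint": state["narrative"]}
    return state

def clinical_reasoning(state):   # Step 2 (LLM)
    state["differential"] = ["ACS", "GERD", "PE"]
    return state

def drug_check(state):           # Step 3 (OpenFDA/RxNorm in the real system)
    state["interactions"] = []
    return state

def guideline_retrieval(state):  # Step 4 (RAG)
    state["guidelines"] = ["chest pain guideline excerpt"]
    return state

def conflict_detection(state):   # Step 5 (LLM)
    state["conflicts"] = []
    return state

def synthesis(state):            # Step 6 (LLM)
    state["report"] = f"Top differential: {state['differential'][0]}"
    return state

PIPELINE: list[Step] = [parse_patient, clinical_reasoning, drug_check,
                        guideline_retrieval, conflict_detection, synthesis]

def run_pipeline(narrative: str) -> dict[str, Any]:
    # Each step reads and enriches a shared state dict in sequence.
    state: dict[str, Any] = {"narrative": narrative}
    for step in PIPELINE:
        state = step(state)
    return state

result = run_pipeline("58M with crushing substernal chest pain")
print(result["report"])  # → Top differential: ACS
```

The sequential-chain shape is the point here; the real system additionally streams per-step progress to the frontend.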
- ### Technical details
-
- **Architecture:**
-
- ```
- Frontend (Next.js 14) ←→ Backend (FastAPI + Python 3.10)
-             ↓
- Orchestrator (6-step pipeline)
- ├── Step 1: Patient Parser (LLM)
- ├── Step 2: Clinical Reasoning (LLM)
- ├── Step 3: Drug Check (OpenFDA + RxNorm APIs)
- ├── Step 4: Guideline Retrieval (ChromaDB RAG)
- ├── Step 5: Conflict Detection (LLM)
- └── Step 6: Synthesis (LLM)
- ```
-
- All inter-step data is strongly typed with Pydantic v2 models. The pipeline streams each step's progress to the frontend via WebSocket for real-time visibility.
-
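The strong typing between steps can be illustrated with a dependency-free sketch. The project itself uses Pydantic v2; the stdlib `dataclasses` version below shows the same idea, and every field and class name here is invented for illustration.

```python
# Stdlib stand-in for the Pydantic v2 inter-step models; class and
# field names are hypothetical, not taken from the project.
from dataclasses import dataclass, field

@dataclass
class DifferentialItem:
    diagnosis: str
    rank: int            # 1 = most likely
    rationale: str = ""

@dataclass
class ReasoningOutput:
    differential: list[DifferentialItem] = field(default_factory=list)

    def top1(self) -> str:
        # Downstream steps use typed accessors instead of digging
        # through untyped dicts, so schema drift fails loudly.
        return min(self.differential, key=lambda d: d.rank).diagnosis

out = ReasoningOutput(differential=[
    DifferentialItem("GERD", rank=2),
    DifferentialItem("Acute coronary syndrome", rank=1),
])
print(out.top1())  # → Acute coronary syndrome
```

Pydantic adds runtime validation and serialization on top of this shape, which matters when step outputs come back as model-generated JSON.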
- **Fine-tuning:**
-
- No fine-tuning was performed in the current version. The base MedGemma model (`medgemma-27b-text-it`) was used with carefully crafted prompt engineering for each pipeline step. Fine-tuning on clinical reasoning datasets is a planned improvement.
-
- **Performance analysis:**
-
- | Test | Result |
- |------|--------|
- | E2E pipeline (chest pain / ACS) | All 6 steps passed, ~75–85 s total |
- | RAG retrieval quality | 30/30 queries passed (100%), avg relevance 0.639 |
- | Clinical test suite | 22 scenarios across 14 specialties |
- | Top-1 RAG accuracy | 100% — correct guideline ranked #1 for all queries |
- | **MedQA 50-case validation** | **36% top-1, 38% top-3, 38% mentioned, 94% pipeline success** |
- | MedQA diagnostic-only (36 cases) | 39% mentioned, 14% differential |
-
- **Application stack:**
-
- | Layer | Technology |
- |-------|-----------|
- | Frontend | Next.js 14, React 18, TypeScript, Tailwind CSS |
- | Backend | FastAPI, Python 3.10, Pydantic v2, WebSocket |
- | LLM | MedGemma 27B Text IT (HAI-DEF) + Gemma 3 27B IT for dev |
- | RAG | ChromaDB + sentence-transformers (all-MiniLM-L6-v2) |
- | Drug Data | OpenFDA API, RxNorm / NLM API |
-
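The RAG layer in the stack table can be reduced to a minimal ranking sketch. In the real system embeddings come from all-MiniLM-L6-v2 and ranking is handled by ChromaDB; the hand-made 3-dimensional vectors and guideline IDs below are placeholders so the cosine-similarity ranking runs standalone.

```python
# Toy stand-in for ChromaDB + sentence-transformers retrieval.
# Vectors and guideline IDs are invented; only the ranking logic
# mirrors what a vector store does internally.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# guideline_id -> pretend embedding
corpus = {
    "acs_guideline":    [0.9, 0.1, 0.0],
    "copd_guideline":   [0.1, 0.9, 0.1],
    "sepsis_guideline": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=2):
    # Rank the whole corpus by similarity and keep the top k.
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [gid for gid, _ in scored[:k]]

# A "chest pain" query vector points toward the ACS guideline.
print(retrieve([1.0, 0.0, 0.1]))
```

The reported "avg relevance 0.639" is exactly this kind of similarity score, averaged over retrieved guideline chunks.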
- **Deployment considerations:**
-
- - **HIPAA compliance:** MedGemma is an open-weight model that can be self-hosted on-premises, eliminating the need to send patient data to external APIs. This is critical for healthcare deployment.
- - **Latency:** The current pipeline takes ~75 s for a single E2E case (local), or ~204 s avg on the HuggingFace Dedicated Endpoint (50-case MedQA validation). For production, this could be reduced with smaller/distilled models, parallel LLM calls, or GPU-accelerated inference with higher throughput.
- - **Scalability:** FastAPI + uvicorn supports async request handling. For high-throughput deployment, add worker processes and a task queue (e.g., Celery).
- - **EHR integration:** Current input is manual text paste. A production system would integrate with EHR systems via FHIR APIs for automatic patient data extraction.
-
- ### Validation methodology
-
- The project includes an external dataset validation framework (`src/backend/validation/`) that tests the full pipeline against real-world clinical data:
-
- | Dataset | Source | What It Tests |
- |---------|--------|---------------|
- | **MedQA (USMLE)** | HuggingFace (1,273 test cases) | Diagnostic accuracy — does the pipeline's top differential match the USMLE correct answer? |
- | **MTSamples** | GitHub (~5,000 medical transcriptions) | Parse quality, field completeness, specialty alignment on real clinical notes |
- | **PMC Case Reports** | PubMed E-utilities (dynamic) | Diagnostic accuracy on published case reports with known diagnoses |
-
- The validation harness calls the `Orchestrator` directly (no HTTP server), enabling rapid batch testing. Each dataset has a dedicated harness that fetches data, converts it to patient narratives, runs the pipeline, and scores the output against ground truth.
-
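The harness pattern described above (drive the pipeline directly, then score its top differential against ground truth) can be sketched as follows. The orchestrator here is a stub and the case data is invented; only the scoring shape is meant to carry over.

```python
# Sketch of top-1 accuracy scoring as used by a batch validation
# harness. `fake_orchestrator` is a stub; the real harness calls the
# project's Orchestrator directly.

def fake_orchestrator(narrative: str) -> list[str]:
    # Stand-in for the full pipeline: returns a ranked differential.
    if "chest pain" in narrative:
        return ["Acute coronary syndrome", "GERD"]
    return ["Viral URI"]

cases = [
    {"narrative": "58M with chest pain", "answer": "Acute coronary syndrome"},
    {"narrative": "24F with cough",      "answer": "Pneumonia"},
]

def top1_accuracy(cases, run):
    # A case counts as a hit only if the top-ranked diagnosis
    # exactly matches the ground-truth answer.
    hits = sum(1 for c in cases if run(c["narrative"])[0] == c["answer"])
    return hits / len(cases)

print(top1_accuracy(cases, fake_orchestrator))  # → 0.5
```

Looser metrics like "mentioned in report" follow the same loop with a substring check over the synthesized report instead of an exact top-1 match.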
- **Initial smoke test (3 MedQA cases):** 100% parse success, 66.7% top-1 diagnostic accuracy, ~94 s avg per case.
-
- **50-case MedQA validation (MedGemma 27B via HF Endpoint):** 94% pipeline success, 36% top-1 diagnostic accuracy, 38% mentioned in report, 204 s avg per case. On diagnostic-only questions (36/50), 39% mentioned the correct diagnosis. Full results in [docs/test_results.md](docs/test_results.md).
-
- **Practical usage:**
-
- In a real clinical setting, the system would be used at the point of care:
- 1. Clinician opens the CDS Agent interface (embedded in the EHR or as a standalone app)
- 2. Patient data is automatically pulled from the EHR (or pasted manually)
- 3. The agent pipeline runs in ~60–90 seconds, during which the clinician can continue other tasks
- 4. The CDS report appears with:
-    - Ranked differential diagnoses with reasoning chains (transparent AI)
-    - Drug interaction warnings with severity levels
-    - **Conflicts & gaps** between guideline recommendations and the patient's actual data — prominently displayed with specific guideline citations, patient data comparisons, and suggested resolutions
-    - Relevant clinical guideline excerpts with citations to authoritative sources
-    - Suggested next steps (immediate, short-term, long-term)
- 5. The clinician reviews the recommendations and incorporates them into their clinical judgment
-
- The system is explicitly designed as a **decision support** tool, not a decision-making tool. All recommendations include caveats and limitations. The clinician retains full authority over patient care.
-
- ---
-
- **Links:**
-
- - Video: [To be recorded]
- - Code Repository: [github.com/bshepp/clinical-decision-support-agent](https://github.com/bshepp/clinical-decision-support-agent)
- - Live Demo: [To be deployed]
- - Hugging Face Model: [google/medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it)