# Development Log - CDS Agent

> Chronological record of the build process, problems encountered, and solutions applied.

---

## Phase 1: Project Scaffolding

### Decision: Track Selection

Chose the **Agentic Workflow Prize** track ($10K) for the MedGemma Impact Challenge. The clinical decision support use case maps perfectly to an agentic architecture: multiple specialized tools orchestrated by a central agent.

### Architecture Design

Designed a 5-step sequential pipeline:

1. Parse patient data (LLM)
2. Clinical reasoning / differential diagnosis (LLM)
3. Drug interaction check (external APIs)
4. Guideline retrieval (RAG)
5. Synthesis into CDS report (LLM)
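In skeleton form, the sequential orchestration is just a loop over steps that share an accumulating state. This is a minimal sketch under that assumption; names are illustrative, and the real `orchestrator.py` also streams step status over WebSocket.

```python
# Minimal sketch of a sequential agent pipeline: each step reads the shared
# state dict and stores its result under its own name, so later steps can
# consume earlier outputs. Illustrative only, not the actual orchestrator.py.
async def run_pipeline(raw_case: str, steps) -> dict:
    state = {"input": raw_case}
    for name, step in steps:
        state[name] = await step(state)  # each step sees all prior results
    return state
```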

**Key design choices:**
- **Custom orchestrator** instead of LangChain - simpler, more transparent, no framework overhead
- **WebSocket streaming** - clinician sees each step execute in real time (critical for trust)
- **Pydantic v2 everywhere** - all inter-step data is strongly typed

### Backend Scaffold

Built the FastAPI backend from scratch:

- `app/main.py` - FastAPI app with CORS, router includes, lifespan
- `app/config.py` - Pydantic Settings from `.env`
- `app/models/schemas.py` - All domain models (~238 lines, 10+ Pydantic models)
- `app/agent/orchestrator.py` - 5-step pipeline (267 lines)
- `app/services/medgemma.py` - LLM service wrapping the OpenAI SDK
- `app/tools/` - 5 tool modules (one per pipeline step)
- `app/api/` - 3 route modules (health, cases, WebSocket)

### Frontend Scaffold

Built the Next.js 14 frontend:

- `PatientInput.tsx` - Text area + 3 pre-loaded sample cases
- `AgentPipeline.tsx` - Real-time 5-step status visualization
- `CDSReport.tsx` - Final report renderer
- `useAgentWebSocket.ts` - WebSocket hook for real-time updates
- `next.config.js` - API proxy to the backend

---

## Phase 2: Integration & Bug Fixes

### Bug: Gemma System Prompt 400 Error

**Problem:** The first LLM call failed with HTTP 400. Gemma models via the Google AI Studio OpenAI-compatible endpoint do not support `role: "system"` messages, a fundamental difference from OpenAI's API.

**Solution:** Modified `medgemma.py` to detect system messages and fold them into the first user message with a `[System Instructions]` prefix. All pipeline steps now work correctly.

**File changed:** `src/backend/app/services/medgemma.py`
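A minimal sketch of the folding logic; the function name and message shapes are illustrative, not the exact `medgemma.py` code.

```python
# Merge system messages into the first user message, since Gemma endpoints
# reject role="system". Sketch only; the real implementation may differ.
def fold_system_messages(messages: list) -> list:
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if system_parts and rest and rest[0]["role"] == "user":
        prefix = "[System Instructions]\n" + "\n".join(system_parts) + "\n\n"
        rest[0] = {"role": "user", "content": prefix + rest[0]["content"]}
    return rest
```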

### Bug: RxNorm API β€” `rxnormId` Is a List

**Problem:** The drug interaction checker crashed when querying RxNorm. The NLM API returns `rxnormId` as a **list** (e.g., `["12345"]`), not a scalar string. The code assumed a string.

**Solution:** Added type checking: if `rxnormId` is a list, take the first element; if it's a string, use it directly.

**File changed:** `src/backend/app/tools/drug_interactions.py`
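The fix reduces to a small normalizer; this is a sketch, and the actual helper in `drug_interactions.py` may be shaped differently.

```python
from typing import Optional

def extract_rxnorm_id(raw) -> Optional[str]:
    """Normalize the NLM API's rxnormId field, which may arrive as a
    list (e.g. ["12345"]) or a plain string. Illustrative sketch."""
    if isinstance(raw, list):
        return raw[0] if raw else None
    if isinstance(raw, str):
        return raw
    return None
```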

### Bug: OpenAI SDK Version Mismatch

**Problem:** `openai==1.0.0` introduced breaking API changes relative to the older API pattern the code was written against.

**Solution:** Pinned to `openai==1.51.0` in `requirements.txt`, which is compatible with both the modern SDK API and the Google AI Studio OpenAI-compatible endpoint.

**File changed:** `src/backend/requirements.txt`

### Bug: Port 8000 Zombie Processes

**Problem:** Previous server instances left zombie processes holding port 8000. New `uvicorn` instances couldn't bind.

**Solution:** Switched to port 8002 for development. Updated `next.config.js` and `useAgentWebSocket.ts` to proxy to 8002.

**Files changed:** `src/frontend/next.config.js`, `src/frontend/src/hooks/useAgentWebSocket.ts`

---

## Phase 3: First Successful E2E Test

### Test Case: Chest Pain / ACS

Submitted a 62-year-old male with crushing substernal chest pain, diaphoresis, HTN, on lisinopril + metformin + atorvastatin.

**Results - all 5 steps passed:**

| Step | Duration | Outcome |
|------|----------|---------|
| Parse | 7.8 s | Correct structured extraction |
| Reason | 21.2 s | ACS as top differential (correct) |
| Drug Check | 11.3 s | Queried all 3 medications |
| Guidelines | 9.6 s | Retrieved ACS/chest pain guidelines |
| Synthesis | 25.3 s | Comprehensive report with recommendations |

This was the first end-to-end success. Total pipeline: ~75 seconds.

---

## Phase 4: Project Direction Shift

### Decision: From Competition to Real Application

After achieving the first successful E2E test, made the decision to shift focus from "winning a competition" to "building a genuinely important medical application." The clinical decision support problem is real and impactful regardless of competition outcomes.

This shift influenced subsequent work β€” emphasis on:
- Comprehensive clinical coverage (more specialties, more guidelines)
- Thorough testing (not just demos)
- Proper documentation

---

## Phase 5: RAG Expansion

### Guideline Corpus: 2 β†’ 62

The initial RAG system had only 2 minimal fallback guidelines. Expanded to a comprehensive corpus:

- **Created:** `app/data/clinical_guidelines.json` - 62 guidelines across 14 specialties
- **Updated:** `guideline_retrieval.py` - loads from JSON, stores specialty/ID metadata in ChromaDB
- **Sources:** ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, APA, AAP, ACR, ASH, KDIGO, WHO, USPSTF

### ChromaDB Rebuild

Had to kill locking processes holding the ChromaDB files before rebuilding. After clearing locks, ChromaDB successfully indexed all 62 guidelines with `all-MiniLM-L6-v2` embeddings (384 dimensions).
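Preparing the corpus for indexing is mostly data shaping. The sketch below assumes a hypothetical record shape for `clinical_guidelines.json` (the real schema may differ) and shows the ChromaDB call as a comment:

```python
import json

# Hypothetical record shape for clinical_guidelines.json; field names
# here are assumptions, not the project's actual schema.
guidelines = json.loads("""[
  {"id": "acc-aha-chest-pain", "specialty": "Cardiology",
   "source": "ACC/AHA", "text": "Evaluation of acute chest pain ..."}
]""")

ids = [g["id"] for g in guidelines]
documents = [g["text"] for g in guidelines]
metadatas = [{"specialty": g["specialty"], "source": g["source"]} for g in guidelines]

# With the lists prepared, indexing is a single call, e.g.:
# collection = chromadb.PersistentClient(path="./data/chroma") \
#     .get_or_create_collection("guidelines")
# collection.add(ids=ids, documents=documents, metadatas=metadatas)
```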

---

## Phase 6: Comprehensive Test Suite

### RAG Quality Tests (30 queries)

Created `test_rag_quality.py` with 30 clinical queries, each mapped to an expected guideline ID:

- **Result: 30/30 passed (100%)**
- Average relevance score: 0.639
- Every query returned the correct guideline as the #1 result
- All 14 specialty categories achieved 100% pass rate

### Clinical Test Cases (22 scenarios)

Created `test_clinical_cases.py` with 22 diverse clinical scenarios:

- Covers 14+ specialties (Cardiology, EM, Endocrinology, Neurology, Pulmonology, GI, ID, Psych, Peds, Nephrology, Toxicology, Geriatrics)
- Each case has: clinical vignette, expected specialty, validation keywords
- Supports CLI flags: `--case`, `--specialty`, `--list`, `--report`, `--quiet`

---

## Phase 7: Documentation

Performed comprehensive documentation audit. Found:
- README was outdated (wrong port, missing test info, incomplete structure tree)
- Architecture doc lacked implementation specifics (RAG details, Gemma workaround, timing)
- Writeup draft was 100% TODO placeholders
- No test results documentation existed
- No development log existed

Rewrote/created all documentation:
- **README.md** - Complete rewrite with results, RAG corpus info, updated structure, corrected setup
- **docs/architecture.md** - Updated with actual implementation details, timing, config, limitations
- **docs/test_results.md** - New file documenting all test results and reproduction steps
- **DEVELOPMENT_LOG.md** - This file
- **docs/writeup_draft.md** - Filled in with actual project information

---

## Phase 8: Conflict Detection Feature

### Design Decision: Drop Confidence Scores, Add Conflict Detection

During review, identified that the system's "confidence" was just the LLM picking a label (LOW/MODERATE/HIGH), not a calibrated score. Composite numeric confidence scores were considered and **rejected** because:
- Uncalibrated confidence values are dangerous (clinician anchoring bias)
- No training data exists to calibrate outputs
- A single number hides more than it reveals

**Instead, added Conflict Detection** - a new pipeline step that compares guideline recommendations against the patient's actual data to identify specific, actionable gaps. This provides direct patient safety value without requiring calibration.

### Implementation

**New models added to `schemas.py`:**
- `ConflictType` enum - 6 categories: omission, contradiction, dosage, monitoring, allergy_risk, interaction_gap
- `ClinicalConflict` model - each conflict has: type, severity, guideline_source, guideline_text, patient_data, description, suggested_resolution
- `ConflictDetectionResult` - list of conflicts + summary + guidelines_checked count
- `conflicts` field added to `CDSReport`
- `conflict_detection` field added to `AgentState`
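The actual models are Pydantic v2; below is a stdlib `dataclass` sketch of the same shapes, with field names taken from the list above (types are assumptions):

```python
from dataclasses import dataclass, field
from enum import Enum

class ConflictType(str, Enum):
    OMISSION = "omission"
    CONTRADICTION = "contradiction"
    DOSAGE = "dosage"
    MONITORING = "monitoring"
    ALLERGY_RISK = "allergy_risk"
    INTERACTION_GAP = "interaction_gap"

@dataclass
class ClinicalConflict:
    type: ConflictType
    severity: str                # e.g. critical / high / moderate / low
    guideline_source: str
    guideline_text: str
    patient_data: str
    description: str
    suggested_resolution: str

@dataclass
class ConflictDetectionResult:
    conflicts: list = field(default_factory=list)
    summary: str = ""
    guidelines_checked: int = 0
```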

**New tool: `conflict_detection.py`:**
- Takes patient profile, clinical reasoning, drug interactions, and guidelines
- Uses MedGemma at low temperature (0.1) for safety-critical analysis
- Returns structured `ConflictDetectionResult` with specific, actionable conflicts
- Graceful degradation: returns empty if no guidelines available

**Pipeline changes (`orchestrator.py`):**
- Pipeline expanded from 5 to 6 steps
- New Step 5: Conflict Detection (between guideline retrieval and synthesis)
- Synthesis (now Step 6) receives conflict data and prominently includes it in the report

**Synthesis changes (`synthesis.py`):**
- Accepts `conflict_detection` parameter
- New "Conflicts & Gaps" section in synthesis prompt
- Fallback: copies detected conflicts directly into report if LLM doesn't populate the structured field

**Frontend changes (`CDSReport.tsx`):**
- New "Conflicts & Gaps Detected" section with high visual prominence
- Red border container, severity-coded left-accent cards (critical=red, high=orange, moderate=yellow, low=blue)
- Side-by-side "Guideline says" vs "Patient data" comparison
- Green-highlighted suggested resolutions
- Positioned immediately after drug interactions for maximum visibility

**Files created:** `src/backend/app/tools/conflict_detection.py` (1 new file)
**Files modified:** `schemas.py`, `orchestrator.py`, `synthesis.py`, `CDSReport.tsx` (4 files)

---

## Dependency Inventory

### Python Backend (`requirements.txt`)

| Package | Version | Purpose |
|---------|---------|---------|
| fastapi | 0.115.0 | Web framework |
| uvicorn | 0.30.6 | ASGI server |
| openai | 1.51.0 | LLM API client (OpenAI-compatible) |
| chromadb | 0.5.7 | Vector database for RAG |
| sentence-transformers | 3.1.1 | Embedding model |
| httpx | 0.27.2 | Async HTTP client (API calls) |
| torch | 2.4.1 | PyTorch (sentence-transformers dependency) |
| transformers | 4.45.0 | HuggingFace transformers |
| pydantic-settings | 2.5.2 | Settings management |
| pydantic | 2.9.2 | Data validation |
| websockets | 13.1 | WebSocket support |
| python-dotenv | 1.0.1 | .env file loading |
| numpy | 1.26.4 | Numerical computing |

### Frontend (`package.json`)

| Package | Purpose |
|---------|---------|
| next 14.x | React framework |
| react 18.x | UI library |
| typescript | Type safety |
| tailwindcss | Styling |

---

## Environment Configuration

All config via `.env` (template in `.env.template`):

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `MEDGEMMA_API_KEY` | Yes | - | HuggingFace API token or Google AI Studio API key |
| `MEDGEMMA_BASE_URL` | No | `""` (empty) | LLM endpoint (HF Endpoint URL/v1 or Google AI Studio URL) |
| `MEDGEMMA_MODEL_ID` | No | `google/medgemma` | Model identifier (`tgi` for HF Endpoints, or full model name) |
| `HF_TOKEN` | No | `""` | HuggingFace token for dataset downloads |
| `CHROMA_PERSIST_DIR` | No | `./data/chroma` | ChromaDB storage |
| `EMBEDDING_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | RAG embeddings |
| `MAX_GUIDELINES` | No | `5` | Guidelines per RAG query |
| `AGENT_TIMEOUT` | No | `120` | Pipeline timeout (seconds) |
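A minimal stdlib sketch of how these defaults resolve. The project itself uses pydantic-settings in `app/config.py`; this is illustrative only, covering a subset of the variables:

```python
import os

def load_config() -> dict:
    """Resolve a subset of the documented settings with their defaults.
    Illustrative sketch, not the actual app/config.py."""
    return {
        "MEDGEMMA_API_KEY": os.environ["MEDGEMMA_API_KEY"],  # required
        "MEDGEMMA_BASE_URL": os.environ.get("MEDGEMMA_BASE_URL", ""),
        "MEDGEMMA_MODEL_ID": os.environ.get("MEDGEMMA_MODEL_ID", "google/medgemma"),
        "CHROMA_PERSIST_DIR": os.environ.get("CHROMA_PERSIST_DIR", "./data/chroma"),
        "MAX_GUIDELINES": int(os.environ.get("MAX_GUIDELINES", "5")),
        "AGENT_TIMEOUT": int(os.environ.get("AGENT_TIMEOUT", "120")),
    }
```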

---

## Phase 9: External Dataset Validation Framework

### Motivation

Internal tests (RAG quality, clinical cases) are useful but don't measure diagnostic accuracy against ground truth. Added a validation framework to test the full pipeline against real-world clinical datasets with known correct answers.

### Datasets Evaluated

| Dataset | Source | What It Tests |
|---------|--------|---------------|
| **MedQA (USMLE)** | HuggingFace - `GBaker/MedQA-USMLE-4-options` | Diagnostic accuracy (1,273 USMLE-style questions with verified answers) |
| **MTSamples** | GitHub - `socd06/medical-nlp` | Parse quality & field completeness on real medical transcription notes |
| **PMC Case Reports** | PubMed E-utilities (esearch + efetch) | Diagnostic accuracy on published case reports with known diagnoses |

### Architecture

Created `src/backend/validation/` package:

- **`base.py`** - Core framework: `ValidationCase`, `ValidationResult`, `ValidationSummary` dataclasses. `run_cds_pipeline()` invokes the Orchestrator directly (no HTTP server needed). Includes a `fuzzy_match()` token-overlap scorer and a `diagnosis_in_differential()` checker.
- **`harness_medqa.py`** - Downloads JSONL from HuggingFace, extracts clinical vignettes (strips question stems), scores top-1/top-3/mentioned diagnostic accuracy.
- **`harness_mtsamples.py`** - Downloads CSV, filters to relevant specialties, stratified sampling. Scores parse success, field completeness, specialty alignment, has_differential, has_recommendations.
- **`harness_pmc.py`** - Uses NCBI E-utilities with 20 curated queries across specialties. Extracts diagnoses from article titles via regex patterns. Scores diagnostic accuracy.
- **`run_validation.py`** - Unified CLI: `python -m validation.run_validation --all --max-cases 10`. Supports `--fetch-only`, `--no-drugs`, `--no-guidelines`, `--seed`, `--delay`.
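The token-overlap idea behind `fuzzy_match()` can be sketched as follows; this is a hypothetical implementation, and the real scorer in `base.py` may normalize or weight tokens differently:

```python
def fuzzy_match(expected: str, predicted: str) -> float:
    """Fraction of expected-diagnosis tokens that appear in the prediction.
    Hypothetical token-overlap scorer, not the actual base.py code."""
    exp = set(expected.lower().split())
    pred = set(predicted.lower().split())
    return len(exp & pred) / len(exp) if exp else 0.0
```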

### Problems Solved

1. **MedQA URL 404:** Original GitHub raw URL was stale. Fixed to HuggingFace direct download.
2. **MTSamples URL 404:** Original mirror was down. Found working mirror at `socd06/medical-nlp`.
3. **PMC fetcher returned 0 cases:** PubMed API worked, but title regex patterns didn't match common formats like "X: A Case Report." Added 3 new title patterns and fixed query-based fallback extraction.
4. **`datetime.utcnow()` deprecation:** Replaced with `datetime.now(timezone.utc)` throughout.
5. **Pipeline time display bug:** `print_summary` showed time metrics as percentages. Fixed by reordering type checks.

### Initial Results (Smoke Test)

Ran 3 MedQA cases through the full pipeline:
- **Parse success:** 100% (3/3)
- **Top-1 diagnostic accuracy:** 66.7% (2/3)
- **Avg pipeline time:** ~94 seconds per case

Full validation runs (50–100+ cases) are planned for the next session.

**Files created:** `validation/__init__.py`, `validation/base.py`, `validation/harness_medqa.py`, `validation/harness_mtsamples.py`, `validation/harness_pmc.py`, `validation/run_validation.py`  
**Files modified:** `.gitignore` (added `validation/data/` and `validation/results/`)

---

## Phase 11: MedGemma HuggingFace Dedicated Endpoint

### Motivation

The competition requires using HAI-DEF models (MedGemma). Google AI Studio served `gemma-3-27b-it` for development, but for the final submission we needed the actual `google/medgemma-27b-text-it` model. HuggingFace Dedicated Endpoints provide an OpenAI-compatible TGI server with scale-to-zero billing.

### Deployment

- **Endpoint name:** `medgemma-27b-cds`
- **Model:** `google/medgemma-27b-text-it`
- **Instance:** 1× NVIDIA A100 80 GB (AWS `us-east-1`)
- **Container:** Text Generation Inference (TGI) with `DTYPE=bfloat16`
- **Scale-to-zero:** Enabled (15 min idle timeout)
- **Cost:** ~$2.50/hr when running

### Key Configuration

After initial deployment, the default TGI token limits (`MAX_INPUT_TOKENS=4096`) caused 422 errors on longer synthesis prompts. Updated endpoint environment:

- `MAX_INPUT_TOKENS=12288`
- `MAX_TOTAL_TOKENS=16384`

Also reduced per-step `max_tokens` to stay within limits:
- `patient_parser.py`: 1500
- `clinical_reasoning.py`: 3072
- `conflict_detection.py`: 2000
- `synthesis.py`: 3000

### Code Changes

- **`medgemma.py`:** Updated to send `role: "system"` natively (TGI supports it), with automatic fallback to folding system prompt into user message for Google AI Studio compatibility.
- **`.env`:** Updated `MEDGEMMA_BASE_URL` to HF endpoint URL, `MEDGEMMA_API_KEY` to HF token, `MEDGEMMA_MODEL_ID=tgi`.
- **`.env.template`:** Updated with MedGemma model name and HF Endpoint instructions.

### Verification

Single-case test: a Chikungunya question, where the correct diagnosis appeared at rank 5 in the differential. All 6 pipeline steps completed in 281 s.

**Deployment guide:** `docs/deploy_medgemma_hf.md`

---

## Phase 12: 50-Case MedQA Validation

### Setup

Ran 50 MedQA (USMLE) cases through the full pipeline using the MedGemma HF Endpoint:

```bash
cd src/backend
python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
```

### Results

| Metric | Value |
|--------|-------|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s/case |
| Total run time | ~60 min |

### Question Type Breakdown

Used `analyze_results.py` to categorize the 50 cases:

| Type | Count | Mentioned | Differential |
|------|-------|-----------|-------------|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | - | - |
| Pathophysiology | 6 | - | - |
| Statistics | 1 | - | - |
| Anatomy | 1 | - | - |

### Key Observations

1. **MedQA includes many non-diagnostic questions** (treatment, mechanism, stats) that the CDS pipeline is not designed to answer: it generates differential diagnoses, not multiple-choice answers.
2. **On diagnostic questions specifically**, 39% mentioned accuracy is reasonable for a pipeline that wasn't optimized for exam-style questions.
3. **Pipeline failures (3/50)** were caused by the HF endpoint scaling to zero mid-run. The `--resume` flag successfully continued from the checkpoint.
4. **Improved clinical reasoning prompt** to demand disease-level diagnoses rather than symptom categories (e.g., "Chikungunya" not "viral arthritis").

### Infrastructure Improvements

- **Incremental JSONL checkpoints:** Each case result is appended to `medqa_checkpoint.jsonl` as it completes.
- **`--resume` flag:** Skips already-completed cases, enabling graceful recovery from endpoint failures.
- **`check_progress.py`:** Utility to monitor checkpoint progress during long runs.
- **`analyze_results.py`:** Categorizes MedQA results by question type for more meaningful accuracy analysis.
- **Unicode fixes:** Replaced box-drawing characters (`╔═╗║╚╝`) and symbols (`✓✗─`) with ASCII equivalents for Windows console compatibility.
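The checkpoint/resume pattern can be sketched in a few lines. The file name matches the log above, but the function names and the `case_id` field are assumptions:

```python
import json
from pathlib import Path

CHECKPOINT = Path("medqa_checkpoint.jsonl")

def append_result(result: dict) -> None:
    """Append one case result as a JSON line, so a crash loses at most
    the in-flight case. Illustrative sketch."""
    with CHECKPOINT.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")

def completed_case_ids() -> set:
    """Read back the IDs of already-completed cases so --resume can
    skip them."""
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open(encoding="utf-8") as f:
        return {json.loads(line)["case_id"] for line in f if line.strip()}
```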

**Files created:** `validation/analyze_results.py`, `validation/check_progress.py`  
**Files modified:** `validation/base.py`, `validation/harness_medqa.py`, `validation/run_validation.py`, `app/tools/clinical_reasoning.py`, `app/tools/synthesis.py`, `app/tools/conflict_detection.py`, `app/tools/patient_parser.py`

---

## Phase 10: Final Documentation Audit & Cleanup

Performed a full accuracy audit of all 5 documentation files and `test_e2e.py`.

**Issues found and fixed:**
- README.md: step count said "5" in E2E table (fixed to 6), missing Conflict Detection row, missing `validation/` in project structure, missing validation section and test commands
- architecture.md: Design Decision #1 said "5-step" (fixed to 6), Decision #4 said "Gemma in two roles" (fixed to four), no validation framework section
- test_results.md: no external validation section, stale line count for test_e2e.py
- DEVELOPMENT_LOG.md: Phase 7 said "(Current)", missing Phase 9 for validation framework
- writeup_draft.md: referenced "confidence levels" (removed earlier), placeholder links, no validation methodology
- test_e2e.py: no assertions on step count or conflict_detection step

**Created:** `TODO.md` in project root with next-session action items for easy pickup by future contributors or AI instances.