# Development Log – CDS Agent
> Chronological record of the build process, problems encountered, and solutions applied.
---
## Phase 1: Project Scaffolding
### Decision: Track Selection
Chose the **Agentic Workflow Prize** track ($10K) for the MedGemma Impact Challenge. The clinical decision support use case maps perfectly to an agentic architecture: multiple specialized tools orchestrated by a central agent.
### Architecture Design
Designed a 5-step sequential pipeline:
1. Parse patient data (LLM)
2. Clinical reasoning / differential diagnosis (LLM)
3. Drug interaction check (external APIs)
4. Guideline retrieval (RAG)
5. Synthesis into CDS report (LLM)
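The sequential design above can be sketched as a small orchestrator loop. This is an illustrative sketch, not the actual `orchestrator.py` API: the step names and stub tools here are invented for the example.

```python
import asyncio
from typing import Awaitable, Callable

# A step is an async callable that reads the shared state and returns its output.
Step = Callable[[dict], Awaitable[dict]]

async def run_pipeline(steps: list[tuple[str, Step]], state: dict) -> dict:
    """Execute steps in order, storing each result under the step's name."""
    for name, step in steps:
        state[name] = await step(state)
    return state

# Illustrative stand-ins for the real tool modules.
async def parse_patient(state: dict) -> dict:
    return {"age": 62, "chief_complaint": "chest pain"}

async def clinical_reasoning(state: dict) -> dict:
    return {"differential": ["ACS", "PE", "GERD"]}

state = asyncio.run(run_pipeline(
    [("parse", parse_patient), ("reason", clinical_reasoning)],
    {"raw_text": "62M with crushing substernal chest pain"},
))
```

The real pipeline adds the non-LLM steps (drug check, RAG retrieval) as more entries in the same list, which is what keeps the orchestrator transparent.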
**Key design choices:**
- **Custom orchestrator** instead of LangChain – simpler, more transparent, no framework overhead
- **WebSocket streaming** – clinician sees each step execute in real time (critical for trust)
- **Pydantic v2 everywhere** – all inter-step data is strongly typed
### Backend Scaffold
Built the FastAPI backend from scratch:
- `app/main.py` – FastAPI app with CORS, router includes, lifespan
- `app/config.py` – Pydantic Settings from `.env`
- `app/models/schemas.py` – All domain models (~238 lines, 10+ Pydantic models)
- `app/agent/orchestrator.py` – 5-step pipeline (267 lines)
- `app/services/medgemma.py` – LLM service wrapping the OpenAI SDK
- `app/tools/` – 5 tool modules (one per pipeline step)
- `app/api/` – 3 route modules (health, cases, WebSocket)
### Frontend Scaffold
Built the Next.js 14 frontend:
- `PatientInput.tsx` – Text area + 3 pre-loaded sample cases
- `AgentPipeline.tsx` – Real-time 5-step status visualization
- `CDSReport.tsx` – Final report renderer
- `useAgentWebSocket.ts` – WebSocket hook for real-time updates
- `next.config.js` – API proxy to backend
---
## Phase 2: Integration & Bug Fixes
### Bug: Gemma System Prompt 400 Error
**Problem:** The first LLM call failed with HTTP 400. Gemma models via the Google AI Studio OpenAI-compatible endpoint do not support `role: "system"` messages – a fundamental difference from OpenAI's API.
**Solution:** Modified `medgemma.py` to detect system messages and fold them into the first user message with a `[System Instructions]` prefix. All pipeline steps now work correctly.
**File changed:** `src/backend/app/services/medgemma.py`
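The folding step can be sketched as below. The function name is illustrative; the `[System Instructions]` prefix matches the fix described above.

```python
def fold_system_messages(messages: list[dict]) -> list[dict]:
    """Fold system messages into the first user message, since the Google AI
    Studio Gemma endpoint rejects role='system'."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if system_parts and rest and rest[0]["role"] == "user":
        prefix = "[System Instructions]\n" + "\n".join(system_parts) + "\n\n"
        rest[0] = {**rest[0], "content": prefix + rest[0]["content"]}
    return rest

msgs = fold_system_messages([
    {"role": "system", "content": "You are a clinical decision support assistant."},
    {"role": "user", "content": "Summarize this case."},
])
```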
### Bug: RxNorm API β `rxnormId` Is a List
**Problem:** The drug interaction checker crashed when querying RxNorm. The NLM API returns `rxnormId` as a **list** (e.g., `["12345"]`), not a scalar string. The code assumed a string.
**Solution:** Added type checking: if `rxnormId` is a list, take the first element; if it's a string, use it directly.
**File changed:** `src/backend/app/tools/drug_interactions.py`
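A minimal sketch of the normalization (the helper name is illustrative, not the actual function in `drug_interactions.py`):

```python
def normalize_rxnorm_id(value):
    """The NLM RxNorm API may return rxnormId as a list (e.g. ["12345"]) or a
    bare string; normalize to a single RxCUI string either way."""
    if isinstance(value, list):
        return value[0] if value else None
    return value if isinstance(value, str) else None
```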
### Bug: OpenAI SDK Version Mismatch
**Problem:** The installed `openai==1.0.0` was incompatible with the API pattern the service code was written against.
**Solution:** Pinned to `openai==1.51.0` in `requirements.txt`, which is compatible with both the modern SDK API and the Google AI Studio OpenAI-compatible endpoint.
**File changed:** `src/backend/requirements.txt`
### Bug: Port 8000 Zombie Processes
**Problem:** Previous server instances left zombie processes holding port 8000. New `uvicorn` instances couldn't bind.
**Solution:** Switched to port 8002 for development. Updated `next.config.js` and `useAgentWebSocket.ts` to proxy to 8002.
**Files changed:** `src/frontend/next.config.js`, `src/frontend/src/hooks/useAgentWebSocket.ts`
---
## Phase 3: First Successful E2E Test
### Test Case: Chest Pain / ACS
Submitted a 62-year-old male with crushing substernal chest pain, diaphoresis, HTN, on lisinopril + metformin + atorvastatin.
**Results – all 5 steps passed:**
| Step | Duration | Outcome |
|------|----------|---------|
| Parse | 7.8 s | Correct structured extraction |
| Reason | 21.2 s | ACS as top differential (correct) |
| Drug Check | 11.3 s | Queried all 3 medications |
| Guidelines | 9.6 s | Retrieved ACS/chest pain guidelines |
| Synthesis | 25.3 s | Comprehensive report with recommendations |
This was the first end-to-end success. Total pipeline: ~75 seconds.
---
## Phase 4: Project Direction Shift
### Decision: From Competition to Real Application
After the first successful E2E test, shifted focus from "winning a competition" to "building a genuinely important medical application." The clinical decision support problem is real and impactful regardless of competition outcomes.
This shift influenced subsequent work, putting emphasis on:
- Comprehensive clinical coverage (more specialties, more guidelines)
- Thorough testing (not just demos)
- Proper documentation
---
## Phase 5: RAG Expansion
### Guideline Corpus: 2 → 62
The initial RAG system had only 2 minimal fallback guidelines. Expanded to a comprehensive corpus:
- **Created:** `app/data/clinical_guidelines.json` – 62 guidelines across 14 specialties
- **Updated:** `guideline_retrieval.py` – loads from JSON, stores specialty/ID metadata in ChromaDB
- **Sources:** ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, APA, AAP, ACR, ASH, KDIGO, WHO, USPSTF
### ChromaDB Rebuild
Had to kill processes holding locks on the ChromaDB files before rebuilding. After clearing the locks, ChromaDB successfully indexed all 62 guidelines with `all-MiniLM-L6-v2` embeddings (384 dimensions).
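The indexing flow amounts to shaping the JSON corpus into the parallel lists ChromaDB expects. The guideline entry below is illustrative; the real `clinical_guidelines.json` schema may differ.

```python
import json

# Illustrative single-entry corpus standing in for clinical_guidelines.json.
guidelines = json.loads("""[
  {"id": "acc-aha-chest-pain-2021", "specialty": "Cardiology",
   "title": "Chest Pain Evaluation", "text": "Risk-stratify with hs-troponin..."}
]""")

# Shape the corpus into the parallel lists that collection.add() expects.
ids = [g["id"] for g in guidelines]
documents = [f"{g['title']}\n{g['text']}" for g in guidelines]
metadatas = [{"specialty": g["specialty"], "guideline_id": g["id"]}
             for g in guidelines]

# With chromadb installed, indexing would then look like:
#   client = chromadb.PersistentClient(path="./data/chroma")
#   collection = client.get_or_create_collection("guidelines")
#   collection.add(ids=ids, documents=documents, metadatas=metadatas)
# ChromaDB's default embedding function is all-MiniLM-L6-v2 (384-dim vectors).
```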
---
## Phase 6: Comprehensive Test Suite
### RAG Quality Tests (30 queries)
Created `test_rag_quality.py` with 30 clinical queries, each mapped to an expected guideline ID:
- **Result: 30/30 passed (100%)**
- Average relevance score: 0.639
- Every query returned the correct guideline as the #1 result
- All 14 specialty categories achieved 100% pass rate
### Clinical Test Cases (22 scenarios)
Created `test_clinical_cases.py` with 22 diverse clinical scenarios:
- Covers 14+ specialties (including Cardiology, EM, Endocrinology, Neurology, Pulmonology, GI, ID, Psych, Peds, Nephrology, Toxicology, and Geriatrics)
- Each case has: clinical vignette, expected specialty, validation keywords
- Supports CLI flags: `--case`, `--specialty`, `--list`, `--report`, `--quiet`
---
## Phase 7: Documentation
Performed comprehensive documentation audit. Found:
- README was outdated (wrong port, missing test info, incomplete structure tree)
- Architecture doc lacked implementation specifics (RAG details, Gemma workaround, timing)
- Writeup draft consisted entirely of TODO placeholders
- No test results documentation existed
- No development log existed
Rewrote/created all documentation:
- **README.md** – Complete rewrite with results, RAG corpus info, updated structure, corrected setup
- **docs/architecture.md** – Updated with actual implementation details, timing, config, limitations
- **docs/test_results.md** – New file documenting all test results and reproduction steps
- **DEVELOPMENT_LOG.md** – This file
- **docs/writeup_draft.md** – Filled in with actual project information
---
## Phase 8: Conflict Detection Feature
### Design Decision: Drop Confidence Scores, Add Conflict Detection
During review, identified that the system's "confidence" was just the LLM picking a label (LOW/MODERATE/HIGH) – not a calibrated score. Composite numeric confidence scores were considered and **rejected** because:
- Uncalibrated confidence values are dangerous (clinician anchoring bias)
- No training data exists to calibrate outputs
- A single number hides more than it reveals
**Instead, added Conflict Detection** – a new pipeline step that compares guideline recommendations against the patient's actual data to identify specific, actionable gaps. This provides direct patient safety value without requiring calibration.
### Implementation
**New models added to `schemas.py`:**
- `ConflictType` enum – 6 categories: omission, contradiction, dosage, monitoring, allergy_risk, interaction_gap
- `ClinicalConflict` model – each conflict has: type, severity, guideline_source, guideline_text, patient_data, description, suggested_resolution
- `ConflictDetectionResult` – list of conflicts + summary + guidelines_checked count
- `conflicts` field added to `CDSReport`
- `conflict_detection` field added to `AgentState`
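The shape of the models listed above can be sketched as follows. Shown here with stdlib dataclasses for brevity; the real `schemas.py` versions are Pydantic v2 models, and the example values are invented.

```python
from dataclasses import dataclass, field
from enum import Enum

class ConflictType(str, Enum):
    OMISSION = "omission"
    CONTRADICTION = "contradiction"
    DOSAGE = "dosage"
    MONITORING = "monitoring"
    ALLERGY_RISK = "allergy_risk"
    INTERACTION_GAP = "interaction_gap"

@dataclass
class ClinicalConflict:
    type: ConflictType
    severity: str
    guideline_source: str
    guideline_text: str
    patient_data: str
    description: str
    suggested_resolution: str

@dataclass
class ConflictDetectionResult:
    conflicts: list = field(default_factory=list)
    summary: str = ""
    guidelines_checked: int = 0

# Illustrative instance (values are invented, not real output).
result = ConflictDetectionResult(
    conflicts=[ClinicalConflict(
        type=ConflictType.MONITORING, severity="moderate",
        guideline_source="ADA 2024", guideline_text="Check eGFR before metformin.",
        patient_data="No renal function documented.",
        description="Metformin prescribed without documented eGFR.",
        suggested_resolution="Order a basic metabolic panel.")],
    summary="1 monitoring gap found.", guidelines_checked=5)
```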
**New tool: `conflict_detection.py`:**
- Takes patient profile, clinical reasoning, drug interactions, and guidelines
- Uses MedGemma at low temperature (0.1) for safety-critical analysis
- Returns structured `ConflictDetectionResult` with specific, actionable conflicts
- Graceful degradation: returns empty if no guidelines available
**Pipeline changes (`orchestrator.py`):**
- Pipeline expanded from 5 to 6 steps
- New Step 5: Conflict Detection (between guideline retrieval and synthesis)
- Synthesis (now Step 6) receives conflict data and prominently includes it in the report
**Synthesis changes (`synthesis.py`):**
- Accepts `conflict_detection` parameter
- New "Conflicts & Gaps" section in synthesis prompt
- Fallback: copies detected conflicts directly into report if LLM doesn't populate the structured field
**Frontend changes (`CDSReport.tsx`):**
- New "Conflicts & Gaps Detected" section with high visual prominence
- Red border container, severity-coded left-accent cards (critical=red, high=orange, moderate=yellow, low=blue)
- Side-by-side "Guideline says" vs "Patient data" comparison
- Green-highlighted suggested resolutions
- Positioned immediately after drug interactions for maximum visibility
**Files created:** `src/backend/app/tools/conflict_detection.py` (1 new file)
**Files modified:** `schemas.py`, `orchestrator.py`, `synthesis.py`, `CDSReport.tsx` (4 files)
---
## Dependency Inventory
### Python Backend (`requirements.txt`)
| Package | Version | Purpose |
|---------|---------|---------|
| fastapi | 0.115.0 | Web framework |
| uvicorn | 0.30.6 | ASGI server |
| openai | 1.51.0 | LLM API client (OpenAI-compatible) |
| chromadb | 0.5.7 | Vector database for RAG |
| sentence-transformers | 3.1.1 | Embedding model |
| httpx | 0.27.2 | Async HTTP client (API calls) |
| torch | 2.4.1 | PyTorch (sentence-transformers dependency) |
| transformers | 4.45.0 | HuggingFace transformers |
| pydantic-settings | 2.5.2 | Settings management |
| pydantic | 2.9.2 | Data validation |
| websockets | 13.1 | WebSocket support |
| python-dotenv | 1.0.1 | .env file loading |
| numpy | 1.26.4 | Numerical computing |
### Frontend (`package.json`)
| Package | Purpose |
|---------|---------|
| next 14.x | React framework |
| react 18.x | UI library |
| typescript | Type safety |
| tailwindcss | Styling |
---
## Environment Configuration
All config via `.env` (template in `.env.template`):
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `MEDGEMMA_API_KEY` | Yes | – | HuggingFace API token or Google AI Studio API key |
| `MEDGEMMA_BASE_URL` | No | `""` (empty) | LLM endpoint (HF Endpoint URL/v1 or Google AI Studio URL) |
| `MEDGEMMA_MODEL_ID` | No | `google/medgemma` | Model identifier (`tgi` for HF Endpoints, or full model name) |
| `HF_TOKEN` | No | `""` | HuggingFace token for dataset downloads |
| `CHROMA_PERSIST_DIR` | No | `./data/chroma` | ChromaDB storage |
| `EMBEDDING_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | RAG embeddings |
| `MAX_GUIDELINES` | No | `5` | Guidelines per RAG query |
| `AGENT_TIMEOUT` | No | `120` | Pipeline timeout (seconds) |
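Putting the table together, a minimal `.env` for the HF Endpoint setup might look like this (the URL and key are placeholders, not real values):

```bash
# Only MEDGEMMA_API_KEY is required; everything else has a default.
MEDGEMMA_API_KEY=hf_xxxxxxxxxxxx
MEDGEMMA_BASE_URL=https://<endpoint-host>/v1
MEDGEMMA_MODEL_ID=tgi
CHROMA_PERSIST_DIR=./data/chroma
MAX_GUIDELINES=5
AGENT_TIMEOUT=120
```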
---
## Phase 9: External Dataset Validation Framework
### Motivation
Internal tests (RAG quality, clinical cases) are useful but don't measure diagnostic accuracy against ground truth. Added a validation framework to test the full pipeline against real-world clinical datasets with known correct answers.
### Datasets Evaluated
| Dataset | Source | What It Tests |
|---------|--------|---------------|
| **MedQA (USMLE)** | HuggingFace – `GBaker/MedQA-USMLE-4-options` | Diagnostic accuracy (1,273 USMLE-style questions with verified answers) |
| **MTSamples** | GitHub – `socd06/medical-nlp` | Parse quality & field completeness on real medical transcription notes |
| **PMC Case Reports** | PubMed E-utilities (esearch + efetch) | Diagnostic accuracy on published case reports with known diagnoses |
### Architecture
Created `src/backend/validation/` package:
- **`base.py`** – Core framework: `ValidationCase`, `ValidationResult`, `ValidationSummary` dataclasses. `run_cds_pipeline()` invokes the Orchestrator directly (no HTTP server needed). Includes `fuzzy_match()` token-overlap scorer and `diagnosis_in_differential()` checker.
- **`harness_medqa.py`** – Downloads JSONL from HuggingFace, extracts clinical vignettes (strips question stems), scores top-1/top-3/mentioned diagnostic accuracy.
- **`harness_mtsamples.py`** – Downloads CSV, filters to relevant specialties, stratified sampling. Scores parse success, field completeness, specialty alignment, has_differential, has_recommendations.
- **`harness_pmc.py`** – Uses NCBI E-utilities with 20 curated queries across specialties. Extracts diagnosis from article titles via regex patterns. Scores diagnostic accuracy.
- **`run_validation.py`** – Unified CLI: `python -m validation.run_validation --all --max-cases 10`. Supports `--fetch-only`, `--no-drugs`, `--no-guidelines`, `--seed`, `--delay`.
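The scoring helpers in `base.py` might look roughly like this. This is an illustrative reconstruction: the real implementation and the match threshold are assumptions.

```python
import re

def fuzzy_match(expected: str, predicted: str) -> float:
    """Token-overlap score: fraction of expected-diagnosis tokens that
    appear in the predicted diagnosis (case-insensitive)."""
    tokenize = lambda s: set(re.findall(r"[a-z0-9]+", s.lower()))
    expected_tokens, predicted_tokens = tokenize(expected), tokenize(predicted)
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & predicted_tokens) / len(expected_tokens)

def diagnosis_in_differential(expected: str, differential: list,
                              threshold: float = 0.6) -> bool:
    """True if any differential item overlaps the expected diagnosis enough."""
    return any(fuzzy_match(expected, d) >= threshold for d in differential)
```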
### Problems Solved
1. **MedQA URL 404:** Original GitHub raw URL was stale. Fixed to HuggingFace direct download.
2. **MTSamples URL 404:** Original mirror was down. Found working mirror at `socd06/medical-nlp`.
3. **PMC fetcher returned 0 cases:** PubMed API worked, but title regex patterns didn't match common formats like "X: A Case Report." Added 3 new title patterns and fixed query-based fallback extraction.
4. **`datetime.utcnow()` deprecation:** Replaced with `datetime.now(timezone.utc)` throughout.
5. **Pipeline time display bug:** `print_summary` showed time metrics as percentages. Fixed by reordering type checks.
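The deprecation fix in item 4 is a one-line substitution:

```python
from datetime import datetime, timezone

# Before (deprecated, returns a naive datetime): datetime.utcnow()
# After (timezone-aware UTC):
now = datetime.now(timezone.utc)
```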
### Initial Results (Smoke Test)
Ran 3 MedQA cases through the full pipeline:
- **Parse success:** 100% (3/3)
- **Top-1 diagnostic accuracy:** 66.7% (2/3)
- **Avg pipeline time:** ~94 seconds per case
Full validation runs (50–100+ cases) are planned for the next session.
**Files created:** `validation/__init__.py`, `validation/base.py`, `validation/harness_medqa.py`, `validation/harness_mtsamples.py`, `validation/harness_pmc.py`, `validation/run_validation.py`
**Files modified:** `.gitignore` (added `validation/data/` and `validation/results/`)
---
## Phase 11: MedGemma HuggingFace Dedicated Endpoint
### Motivation
The competition requires using HAI-DEF models (MedGemma). Google AI Studio served `gemma-3-27b-it` for development, but for the final submission we needed the actual `google/medgemma-27b-text-it` model. HuggingFace Dedicated Endpoints provide an OpenAI-compatible TGI server with scale-to-zero billing.
### Deployment
- **Endpoint name:** `medgemma-27b-cds`
- **Model:** `google/medgemma-27b-text-it`
- **Instance:** 1× NVIDIA A100 80 GB (AWS `us-east-1`)
- **Container:** Text Generation Inference (TGI) with `DTYPE=bfloat16`
- **Scale-to-zero:** Enabled (15 min idle timeout)
- **Cost:** ~$2.50/hr when running
### Key Configuration
After initial deployment, the default TGI token limits (`MAX_INPUT_TOKENS=4096`) caused 422 errors on longer synthesis prompts. Updated endpoint environment:
- `MAX_INPUT_TOKENS=12288`
- `MAX_TOTAL_TOKENS=16384`
Also reduced per-step `max_tokens` to stay within limits:
- `patient_parser.py`: 1500
- `clinical_reasoning.py`: 3072
- `conflict_detection.py`: 2000
- `synthesis.py`: 3000
### Code Changes
- **`medgemma.py`:** Updated to send `role: "system"` natively (TGI supports it), with automatic fallback to folding system prompt into user message for Google AI Studio compatibility.
- **`.env`:** Updated `MEDGEMMA_BASE_URL` to HF endpoint URL, `MEDGEMMA_API_KEY` to HF token, `MEDGEMMA_MODEL_ID=tgi`.
- **`.env.template`:** Updated with MedGemma model name and HF Endpoint instructions.
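The dual-endpoint message handling can be sketched as below. The boolean flag is illustrative; the real `medgemma.py` detects the fallback case automatically rather than taking a parameter.

```python
def build_messages(system_prompt: str, user_prompt: str,
                   supports_system_role: bool) -> list[dict]:
    """TGI (HF Endpoints) accepts role='system'; the Google AI Studio Gemma
    endpoint does not, so fold the system prompt into the user turn there."""
    if supports_system_role:
        return [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}]
    return [{"role": "user",
             "content": f"[System Instructions]\n{system_prompt}\n\n{user_prompt}"}]
```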
### Verification
Single-case test: a Chikungunya question – the correct diagnosis appeared at rank 5 in the differential. All 6 pipeline steps completed in 281 s.
**Deployment guide:** `docs/deploy_medgemma_hf.md`
---
## Phase 12: 50-Case MedQA Validation
### Setup
Ran 50 MedQA (USMLE) cases through the full pipeline using the MedGemma HF Endpoint:
```bash
cd src/backend
python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
```
### Results
| Metric | Value |
|--------|-------|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s/case |
| Total run time | ~60 min |
### Question Type Breakdown
Used `analyze_results.py` to categorize the 50 cases:
| Type | Count | Mentioned | Differential |
|------|-------|-----------|-------------|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | – | – |
| Pathophysiology | 6 | – | – |
| Statistics | 1 | – | – |
| Anatomy | 1 | – | – |
### Key Observations
1. **MedQA includes many non-diagnostic questions** (treatment, mechanism, stats) that the CDS pipeline is not designed to answer – it generates differential diagnoses, not multiple-choice answers.
2. **On diagnostic questions specifically**, 39% mentioned accuracy is reasonable for a pipeline that wasn't optimized for exam-style questions.
3. **Pipeline failures (3/50)** were caused by the HF endpoint scaling to zero mid-run. The `--resume` flag successfully continued from the checkpoint.
4. **Improved clinical reasoning prompt** to demand disease-level diagnoses rather than symptom categories (e.g., "Chikungunya" not "viral arthritis").
### Infrastructure Improvements
- **Incremental JSONL checkpoints:** Each case result is appended to `medqa_checkpoint.jsonl` as it completes.
- **`--resume` flag:** Skips already-completed cases, enabling graceful recovery from endpoint failures.
- **`check_progress.py`:** Utility to monitor checkpoint progress during long runs.
- **`analyze_results.py`:** Categorizes MedQA results by question type for more meaningful accuracy analysis.
- **Unicode fixes:** Replaced box-drawing characters and other non-ASCII symbols with ASCII equivalents for Windows console compatibility.
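The checkpoint/resume mechanics amount to append-only JSONL plus a skip set. A sketch, assuming a `case_id` field; the actual field names in `medqa_checkpoint.jsonl` may differ.

```python
import json
from pathlib import Path

def append_checkpoint(path: Path, case_id: str, result: dict) -> None:
    """Append one completed case as a JSON line (crash-safe: earlier lines
    survive even if the run dies mid-case)."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"case_id": case_id, **result}) + "\n")

def completed_case_ids(path: Path) -> set:
    """Read the checkpoint so --resume can skip already-finished cases."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["case_id"] for line in f if line.strip()}
```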
**Files created:** `validation/analyze_results.py`, `validation/check_progress.py`
**Files modified:** `validation/base.py`, `validation/harness_medqa.py`, `validation/run_validation.py`, `app/tools/clinical_reasoning.py`, `app/tools/synthesis.py`, `app/tools/conflict_detection.py`, `app/tools/patient_parser.py`
---
## Phase 10: Final Documentation Audit & Cleanup
Performed a full accuracy audit of all 5 documentation files and `test_e2e.py`.
**Issues found and fixed:**
- README.md: step count said "5" in E2E table (fixed to 6), missing Conflict Detection row, missing `validation/` in project structure, missing validation section and test commands
- architecture.md: Design Decision #1 said "5-step" (fixed to 6), Decision #4 said "Gemma in two roles" (fixed to four), no validation framework section
- test_results.md: no external validation section, stale line count for test_e2e.py
- DEVELOPMENT_LOG.md: Phase 7 said "(Current)", missing Phase 9 for validation framework
- writeup_draft.md: referenced "confidence levels" (removed earlier), placeholder links, no validation methodology
- test_e2e.py: no assertions on step count or conflict_detection step
**Created:** `TODO.md` in project root with next-session action items for easy pickup by future contributors or AI instances.