
Development Log - CDS Agent

Chronological record of the build process, problems encountered, and solutions applied.


Phase 1: Project Scaffolding

Decision: Track Selection

Chose the Agentic Workflow Prize track ($10K) for the MedGemma Impact Challenge. The clinical decision support use case maps naturally to an agentic architecture: multiple specialized tools orchestrated by a central agent.

Architecture Design

Designed a 5-step sequential pipeline:

  1. Parse patient data (LLM)
  2. Clinical reasoning / differential diagnosis (LLM)
  3. Drug interaction check (external APIs)
  4. Guideline retrieval (RAG)
  5. Synthesis into CDS report (LLM)
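
The steps above can be sketched as a plain sequential runner. This is an illustrative shape only, not the actual orchestrator.py (which is async, uses Pydantic models for inter-step state, and streams step status over WebSocket); all names here are hypothetical:

```python
# Minimal sketch of a sequential agent pipeline (illustrative names).
class Pipeline:
    def __init__(self, steps, on_status=None):
        self.steps = steps                      # list of (name, step_fn) pairs
        self.on_status = on_status or (lambda *a: None)

    def run(self, state):
        for name, step_fn in self.steps:
            self.on_status(name, "running")     # what the WebSocket stream reports
            state[name] = step_fn(state)        # each step sees all prior outputs
            self.on_status(name, "done")
        return state

pipeline = Pipeline([
    ("parse", lambda s: {"age": 62, "complaint": "chest pain"}),
    ("reason", lambda s: ["ACS"] if "chest pain" in s["parse"]["complaint"] else []),
])
```

The real pipeline wires parse -> reason -> drug check -> guidelines -> synthesis in the same accumulate-and-pass-forward shape.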

Key design choices:

  • Custom orchestrator instead of LangChain β€” simpler, more transparent, no framework overhead
  • WebSocket streaming β€” clinician sees each step execute in real time (critical for trust)
  • Pydantic v2 everywhere β€” all inter-step data is strongly typed

Backend Scaffold

Built the FastAPI backend from scratch:

  • app/main.py β€” FastAPI app with CORS, router includes, lifespan
  • app/config.py β€” Pydantic Settings from .env
  • app/models/schemas.py β€” All domain models (~238 lines, 10+ Pydantic models)
  • app/agent/orchestrator.py β€” 5-step pipeline (267 lines)
  • app/services/medgemma.py β€” LLM service wrapping OpenAI SDK
  • app/tools/ β€” 5 tool modules (one per pipeline step)
  • app/api/ β€” 3 route modules (health, cases, WebSocket)

Frontend Scaffold

Built the Next.js 14 frontend:

  • PatientInput.tsx β€” Text area + 3 pre-loaded sample cases
  • AgentPipeline.tsx β€” Real-time 5-step status visualization
  • CDSReport.tsx β€” Final report renderer
  • useAgentWebSocket.ts β€” WebSocket hook for real-time updates
  • next.config.js β€” API proxy to backend

Phase 2: Integration & Bug Fixes

Bug: Gemma System Prompt 400 Error

Problem: The first LLM call failed with HTTP 400. Gemma models via the Google AI Studio OpenAI-compatible endpoint do not support role: "system" messages - a fundamental difference from OpenAI's API.

Solution: Modified medgemma.py to detect system messages and fold them into the first user message with a [System Instructions] prefix. All pipeline steps now work correctly.

File changed: src/backend/app/services/medgemma.py
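
A minimal sketch of this folding step; the function name and exact prefix handling are illustrative, not the literal medgemma.py code:

```python
# Fold role:"system" messages into the first user message, since the
# Google AI Studio OpenAI-compatible endpoint rejects the system role.
def fold_system_messages(messages):
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    prefix = "[System Instructions]\n" + "\n".join(system_parts) + "\n\n"
    if rest and rest[0]["role"] == "user":
        rest[0]["content"] = prefix + rest[0]["content"]
    else:
        # No user message to fold into: emit the instructions as one
        rest.insert(0, {"role": "user", "content": prefix.rstrip()})
    return rest
```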

Bug: RxNorm API - rxnormId Is a List

Problem: The drug interaction checker crashed when querying RxNorm. The NLM API returns rxnormId as a list (e.g., ["12345"]), not a scalar string. The code assumed a string.

Solution: Added type checking: if rxnormId is a list, take the first element; if it is a string, use it directly.

File changed: src/backend/app/tools/drug_interactions.py
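
The fix can be sketched as a small normalizer (helper name is illustrative, not the actual drug_interactions.py code):

```python
def normalize_rxnorm_id(value):
    """The NLM RxNorm API may return rxnormId as a list (e.g. ["12345"])
    or a plain string; normalize to one string, or None if absent."""
    if isinstance(value, list):
        return str(value[0]) if value else None
    if isinstance(value, str):
        return value or None
    return None
```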

Bug: OpenAI SDK Version Mismatch

Problem: openai==1.0.0 had breaking API changes compared to the code written for the older API pattern.

Solution: Pinned to openai==1.51.0 in requirements.txt, which is compatible with both the modern SDK API and the Google AI Studio OpenAI-compatible endpoint.

File changed: src/backend/requirements.txt

Bug: Port 8000 Zombie Processes

Problem: Previous server instances left stale processes holding port 8000, so new uvicorn instances couldn't bind.

Solution: Switched to port 8002 for development. Updated next.config.js and useAgentWebSocket.ts to proxy to 8002.

Files changed: src/frontend/next.config.js, src/frontend/src/hooks/useAgentWebSocket.ts


Phase 3: First Successful E2E Test

Test Case: Chest Pain / ACS

Submitted a 62-year-old male with crushing substernal chest pain, diaphoresis, HTN, on lisinopril + metformin + atorvastatin.

Results - all 5 steps passed:

| Step | Duration | Outcome |
|---|---|---|
| Parse | 7.8 s | Correct structured extraction |
| Reason | 21.2 s | ACS as top differential (correct) |
| Drug Check | 11.3 s | Queried all 3 medications |
| Guidelines | 9.6 s | Retrieved ACS/chest pain guidelines |
| Synthesis | 25.3 s | Comprehensive report with recommendations |

This was the first end-to-end success. Total pipeline: ~75 seconds.


Phase 4: Project Direction Shift

Decision: From Competition to Real Application

After achieving the first successful E2E test, made the decision to shift focus from "winning a competition" to "building a genuinely important medical application." The clinical decision support problem is real and impactful regardless of competition outcomes.

This shift influenced subsequent work, with emphasis on:

  • Comprehensive clinical coverage (more specialties, more guidelines)
  • Thorough testing (not just demos)
  • Proper documentation

Phase 5: RAG Expansion

Guideline Corpus: 2 → 62

The initial RAG system had only 2 minimal fallback guidelines. Expanded to a comprehensive corpus:

  • Created: app/data/clinical_guidelines.json β€” 62 guidelines across 14 specialties
  • Updated: guideline_retrieval.py β€” loads from JSON, stores specialty/ID metadata in ChromaDB
  • Sources: ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, APA, AAP, ACR, ASH, KDIGO, WHO, USPSTF

ChromaDB Rebuild

Had to kill locking processes holding the ChromaDB files before rebuilding. After clearing locks, ChromaDB successfully indexed all 62 guidelines with all-MiniLM-L6-v2 embeddings (384 dimensions).


Phase 6: Comprehensive Test Suite

RAG Quality Tests (30 queries)

Created test_rag_quality.py with 30 clinical queries, each mapped to an expected guideline ID:

  • Result: 30/30 passed (100%)
  • Average relevance score: 0.639
  • Every query returned the correct guideline as the #1 result
  • All 14 specialty categories achieved 100% pass rate

Clinical Test Cases (22 scenarios)

Created test_clinical_cases.py with 22 diverse clinical scenarios:

  • Covers 14+ specialties (Cardiology, EM, Endocrinology, Neurology, Pulmonology, GI, ID, Psych, Peds, Nephrology, Toxicology, Geriatrics)
  • Each case has: clinical vignette, expected specialty, validation keywords
  • Supports CLI flags: --case, --specialty, --list, --report, --quiet

Phase 7: Documentation

Performed comprehensive documentation audit. Found:

  • README was outdated (wrong port, missing test info, incomplete structure tree)
  • Architecture doc lacked implementation specifics (RAG details, Gemma workaround, timing)
  • Writeup draft was 100% TODO placeholders
  • No test results documentation existed
  • No development log existed

Rewrote/created all documentation:

  • README.md β€” Complete rewrite with results, RAG corpus info, updated structure, corrected setup
  • docs/architecture.md β€” Updated with actual implementation details, timing, config, limitations
  • docs/test_results.md β€” New file documenting all test results and reproduction steps
  • DEVELOPMENT_LOG.md β€” This file
  • docs/writeup_draft.md β€” Filled in with actual project information

Phase 8: Conflict Detection Feature

Design Decision: Drop Confidence Scores, Add Conflict Detection

During review, identified that the system's "confidence" was just the LLM picking a label (LOW/MODERATE/HIGH), not a calibrated score. Composite numeric confidence scores were considered and rejected because:

  • Uncalibrated confidence values are dangerous (clinician anchoring bias)
  • No training data exists to calibrate outputs
  • A single number hides more than it reveals

Instead, added Conflict Detection: a new pipeline step that compares guideline recommendations against the patient's actual data to identify specific, actionable gaps. This provides direct patient safety value without requiring calibration.

Implementation

New models added to schemas.py:

  • ConflictType enum β€” 6 categories: omission, contradiction, dosage, monitoring, allergy_risk, interaction_gap
  • ClinicalConflict model β€” Each conflict has: type, severity, guideline_source, guideline_text, patient_data, description, suggested_resolution
  • ConflictDetectionResult β€” List of conflicts + summary + guidelines_checked count
  • conflicts field added to CDSReport
  • conflict_detection field added to AgentState

New tool: conflict_detection.py:

  • Takes patient profile, clinical reasoning, drug interactions, and guidelines
  • Uses MedGemma at low temperature (0.1) for safety-critical analysis
  • Returns structured ConflictDetectionResult with specific, actionable conflicts
  • Graceful degradation: returns empty if no guidelines available

Pipeline changes (orchestrator.py):

  • Pipeline expanded from 5 to 6 steps
  • New Step 5: Conflict Detection (between guideline retrieval and synthesis)
  • Synthesis (now Step 6) receives conflict data and prominently includes it in the report

Synthesis changes (synthesis.py):

  • Accepts conflict_detection parameter
  • New "Conflicts & Gaps" section in synthesis prompt
  • Fallback: copies detected conflicts directly into report if LLM doesn't populate the structured field

Frontend changes (CDSReport.tsx):

  • New "Conflicts & Gaps Detected" section with high visual prominence
  • Red border container, severity-coded left-accent cards (critical=red, high=orange, moderate=yellow, low=blue)
  • Side-by-side "Guideline says" vs "Patient data" comparison
  • Green-highlighted suggested resolutions
  • Positioned immediately after drug interactions for maximum visibility

Files created: src/backend/app/tools/conflict_detection.py (1 new file)
Files modified: schemas.py, orchestrator.py, synthesis.py, CDSReport.tsx (4 files)


Dependency Inventory

Python Backend (requirements.txt)

| Package | Version | Purpose |
|---|---|---|
| fastapi | 0.115.0 | Web framework |
| uvicorn | 0.30.6 | ASGI server |
| openai | 1.51.0 | LLM API client (OpenAI-compatible) |
| chromadb | 0.5.7 | Vector database for RAG |
| sentence-transformers | 3.1.1 | Embedding model |
| httpx | 0.27.2 | Async HTTP client (API calls) |
| torch | 2.4.1 | PyTorch (sentence-transformers dependency) |
| transformers | 4.45.0 | HuggingFace transformers |
| pydantic-settings | 2.5.2 | Settings management |
| pydantic | 2.9.2 | Data validation |
| websockets | 13.1 | WebSocket support |
| python-dotenv | 1.0.1 | .env file loading |
| numpy | 1.26.4 | Numerical computing |

Frontend (package.json)

| Package | Purpose |
|---|---|
| next 14.x | React framework |
| react 18.x | UI library |
| typescript | Type safety |
| tailwindcss | Styling |

Environment Configuration

All config via .env (template in .env.template):

| Variable | Required | Default | Description |
|---|---|---|---|
| MEDGEMMA_API_KEY | Yes | - | HuggingFace API token or Google AI Studio API key |
| MEDGEMMA_BASE_URL | No | "" (empty) | LLM endpoint (HF Endpoint URL/v1 or Google AI Studio URL) |
| MEDGEMMA_MODEL_ID | No | google/medgemma | Model identifier (tgi for HF Endpoints, or full model name) |
| HF_TOKEN | No | "" | HuggingFace token for dataset downloads |
| CHROMA_PERSIST_DIR | No | ./data/chroma | ChromaDB storage |
| EMBEDDING_MODEL | No | sentence-transformers/all-MiniLM-L6-v2 | RAG embeddings |
| MAX_GUIDELINES | No | 5 | Guidelines per RAG query |
| AGENT_TIMEOUT | No | 120 | Pipeline timeout (seconds) |
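
Pulling the table together, a minimal .env for the HF Endpoint configuration might look like this (all values are illustrative placeholders, not real credentials or hostnames):

```shell
# Illustrative .env for an HF Dedicated Endpoint setup
MEDGEMMA_API_KEY="hf_xxxxxxxxxxxxxxxx"                              # required
MEDGEMMA_BASE_URL="https://your-endpoint.endpoints.huggingface.cloud/v1"
MEDGEMMA_MODEL_ID="tgi"                                             # "tgi" for HF Endpoints
CHROMA_PERSIST_DIR="./data/chroma"
MAX_GUIDELINES="5"
AGENT_TIMEOUT="120"
```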

Phase 9: External Dataset Validation Framework

Motivation

Internal tests (RAG quality, clinical cases) are useful but don't measure diagnostic accuracy against ground truth. Added a validation framework to test the full pipeline against real-world clinical datasets with known correct answers.

Datasets Evaluated

| Dataset | Source | What It Tests |
|---|---|---|
| MedQA (USMLE) | HuggingFace - GBaker/MedQA-USMLE-4-options | Diagnostic accuracy (1,273 USMLE-style questions with verified answers) |
| MTSamples | GitHub - socd06/medical-nlp | Parse quality & field completeness on real medical transcription notes |
| PMC Case Reports | PubMed E-utilities (esearch + efetch) | Diagnostic accuracy on published case reports with known diagnoses |

Architecture

Created src/backend/validation/ package:

  • base.py β€” Core framework: ValidationCase, ValidationResult, ValidationSummary dataclasses. run_cds_pipeline() invokes the Orchestrator directly (no HTTP server needed). Includes fuzzy_match() token-overlap scorer and diagnosis_in_differential() checker.
  • harness_medqa.py β€” Downloads JSONL from HuggingFace, extracts clinical vignettes (strips question stems), scores top-1/top-3/mentioned diagnostic accuracy.
  • harness_mtsamples.py β€” Downloads CSV, filters to relevant specialties, stratified sampling. Scores parse success, field completeness, specialty alignment, has_differential, has_recommendations.
  • harness_pmc.py β€” Uses NCBI E-utilities with 20 curated queries across specialties. Extracts diagnosis from article titles via regex patterns. Scores diagnostic accuracy.
  • run_validation.py β€” Unified CLI: python -m validation.run_validation --all --max-cases 10. Supports --fetch-only, --no-drugs, --no-guidelines, --seed, --delay.

Problems Solved

  1. MedQA URL 404: Original GitHub raw URL was stale. Fixed to HuggingFace direct download.
  2. MTSamples URL 404: Original mirror was down. Found working mirror at socd06/medical-nlp.
  3. PMC fetcher returned 0 cases: PubMed API worked, but title regex patterns didn't match common formats like "X: A Case Report." Added 3 new title patterns and fixed query-based fallback extraction.
  4. datetime.utcnow() deprecation: Replaced with datetime.now(timezone.utc) throughout.
  5. Pipeline time display bug: print_summary showed time metrics as percentages. Fixed by reordering type checks.
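
For problem 4, the replacement pattern is worth showing, since datetime.utcnow() returns a naive timestamp and is deprecated as of Python 3.12 (wrapper name is illustrative):

```python
from datetime import datetime, timezone

def utc_timestamp():
    # Timezone-aware replacement for the deprecated datetime.utcnow()
    return datetime.now(timezone.utc).isoformat()
```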

Initial Results (Smoke Test)

Ran 3 MedQA cases through the full pipeline:

  • Parse success: 100% (3/3)
  • Top-1 diagnostic accuracy: 66.7% (2/3)
  • Avg pipeline time: ~94 seconds per case

Full validation runs (50-100+ cases) are planned for the next session.

Files created: validation/__init__.py, validation/base.py, validation/harness_medqa.py, validation/harness_mtsamples.py, validation/harness_pmc.py, validation/run_validation.py
Files modified: .gitignore (added validation/data/ and validation/results/)


Phase 11: MedGemma HuggingFace Dedicated Endpoint

Motivation

The competition requires using HAI-DEF models (MedGemma). Google AI Studio served gemma-3-27b-it for development, but for the final submission we needed the actual google/medgemma-27b-text-it model. HuggingFace Dedicated Endpoints provide an OpenAI-compatible TGI server with scale-to-zero billing.

Deployment

  • Endpoint name: medgemma-27b-cds
  • Model: google/medgemma-27b-text-it
  • Instance: 1Γ— NVIDIA A100 80 GB (AWS us-east-1)
  • Container: Text Generation Inference (TGI) with DTYPE=bfloat16
  • Scale-to-zero: Enabled (15 min idle timeout)
  • Cost: ~$2.50/hr when running

Key Configuration

After initial deployment, the default TGI token limits (MAX_INPUT_TOKENS=4096) caused 422 errors on longer synthesis prompts. Updated endpoint environment:

  • MAX_INPUT_TOKENS=12288
  • MAX_TOTAL_TOKENS=16384

Also reduced per-step max_tokens to stay within limits:

  • patient_parser.py: 1500
  • clinical_reasoning.py: 3072
  • conflict_detection.py: 2000
  • synthesis.py: 3000

Code Changes

  • medgemma.py: Updated to send role: "system" natively (TGI supports it), with automatic fallback to folding system prompt into user message for Google AI Studio compatibility.
  • .env: Updated MEDGEMMA_BASE_URL to HF endpoint URL, MEDGEMMA_API_KEY to HF token, MEDGEMMA_MODEL_ID=tgi.
  • .env.template: Updated with MedGemma model name and HF Endpoint instructions.

Verification

Single-case test: Chikungunya question → the correct diagnosis appeared at rank 5 in the differential. All 6 pipeline steps completed in 281 s.

Deployment guide: docs/deploy_medgemma_hf.md


Phase 12: 50-Case MedQA Validation

Setup

Ran 50 MedQA (USMLE) cases through the full pipeline using the MedGemma HF Endpoint:

cd src/backend
python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2

Results

| Metric | Value |
|---|---|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s/case |
| Total run time | ~60 min |

Question Type Breakdown

Used analyze_results.py to categorize the 50 cases:

| Type | Count | Mentioned | Differential |
|---|---|---|---|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | - | - |
| Pathophysiology | 6 | - | - |
| Statistics | 1 | - | - |
| Anatomy | 1 | - | - |

Key Observations

  1. MedQA includes many non-diagnostic questions (treatment, mechanism, stats) that the CDS pipeline is not designed to answer: it generates differential diagnoses, not multiple-choice answers.
  2. On diagnostic questions specifically, 39% mentioned accuracy is reasonable for a pipeline that wasn't optimized for exam-style questions.
  3. Pipeline failures (3/50) were caused by the HF endpoint scaling to zero mid-run. The --resume flag successfully continued from the checkpoint.
  4. Improved clinical reasoning prompt to demand disease-level diagnoses rather than symptom categories (e.g., "Chikungunya" not "viral arthritis").

Infrastructure Improvements

  • Incremental JSONL checkpoints: Each case result is appended to medqa_checkpoint.jsonl as it completes.
  • --resume flag: Skips already-completed cases, enabling graceful recovery from endpoint failures.
  • check_progress.py: Utility to monitor checkpoint progress during long runs.
  • analyze_results.py: Categorizes MedQA results by question type for more meaningful accuracy analysis.
  • Unicode fixes: Replaced box-drawing characters (β•”β•β•—β•‘β•šβ•) and symbols (βœ“βœ—β”€) with ASCII equivalents for Windows console compatibility.

Files created: validation/analyze_results.py, validation/check_progress.py
Files modified: validation/base.py, validation/harness_medqa.py, validation/run_validation.py, app/tools/clinical_reasoning.py, app/tools/synthesis.py, app/tools/conflict_detection.py, app/tools/patient_parser.py


Phase 10: Final Documentation Audit & Cleanup

Performed a full accuracy audit of all 5 documentation files and test_e2e.py.

Issues found and fixed:

  • README.md: step count said "5" in E2E table (fixed to 6), missing Conflict Detection row, missing validation/ in project structure, missing validation section and test commands
  • architecture.md: Design Decision #1 said "5-step" (fixed to 6), Decision #4 said "Gemma in two roles" (fixed to four), no validation framework section
  • test_results.md: no external validation section, stale line count for test_e2e.py
  • DEVELOPMENT_LOG.md: Phase 7 said "(Current)", missing Phase 9 for validation framework
  • writeup_draft.md: referenced "confidence levels" (removed earlier), placeholder links, no validation methodology
  • test_e2e.py: no assertions on step count or conflict_detection step

Created: TODO.md in project root with next-session action items for easy pickup by future contributors or AI instances.