Development Log - CDS Agent
Chronological record of the build process, problems encountered, and solutions applied.
Phase 1: Project Scaffolding
Decision: Track Selection
Chose the Agentic Workflow Prize track ($10K) for the MedGemma Impact Challenge. The clinical decision support use case maps perfectly to an agentic architecture: multiple specialized tools orchestrated by a central agent.
Architecture Design
Designed a 5-step sequential pipeline:
- Parse patient data (LLM)
- Clinical reasoning / differential diagnosis (LLM)
- Drug interaction check (external APIs)
- Guideline retrieval (RAG)
- Synthesis into CDS report (LLM)
Key design choices:
- Custom orchestrator instead of LangChain - simpler, more transparent, no framework overhead
- WebSocket streaming - clinician sees each step execute in real time (critical for trust)
- Pydantic v2 everywhere - all inter-step data is strongly typed
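As a rough illustration of how these choices combine, here is a minimal sketch of a sequential step-runner that streams per-step status over a FastAPI WebSocket. The step registry, step names, and payload fields are illustrative assumptions, not the actual orchestrator code.

```python
# Minimal sketch (not the actual orchestrator): run registered steps in order
# and stream status updates over a WebSocket. The STEPS registry, step names,
# and payload fields are illustrative assumptions.
import time
from typing import Any, Awaitable, Callable

from fastapi import WebSocket

StepFn = Callable[[dict[str, Any]], Awaitable[Any]]
STEPS: list[tuple[str, StepFn]] = []  # e.g., [("parse", parse_patient), ...]

async def run_pipeline(ws: WebSocket, case_text: str) -> dict[str, Any]:
    state: dict[str, Any] = {"input": case_text}
    for name, step in STEPS:
        await ws.send_json({"step": name, "status": "running"})
        started = time.perf_counter()
        state[name] = await step(state)  # each step sees all prior outputs
        await ws.send_json({
            "step": name,
            "status": "complete",
            "duration_s": round(time.perf_counter() - started, 1),
        })
    return state
```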
Backend Scaffold
Built the FastAPI backend from scratch:
- `app/main.py` - FastAPI app with CORS, router includes, lifespan
- `app/config.py` - Pydantic Settings from `.env`
- `app/models/schemas.py` - All domain models (~238 lines, 10+ Pydantic models)
- `app/agent/orchestrator.py` - 5-step pipeline (267 lines)
- `app/services/medgemma.py` - LLM service wrapping the OpenAI SDK
- `app/tools/` - 5 tool modules (one per pipeline step)
- `app/api/` - 3 route modules (health, cases, WebSocket)
Frontend Scaffold
Built the Next.js 14 frontend:
- `PatientInput.tsx` - Text area + 3 pre-loaded sample cases
- `AgentPipeline.tsx` - Real-time 5-step status visualization
- `CDSReport.tsx` - Final report renderer
- `useAgentWebSocket.ts` - WebSocket hook for real-time updates
- `next.config.js` - API proxy to backend
Phase 2: Integration & Bug Fixes
Bug: Gemma System Prompt 400 Error
Problem: The first LLM call failed with HTTP 400. Gemma models via the Google AI Studio OpenAI-compatible endpoint do not support `role: "system"` messages, a fundamental difference from OpenAI's API.
Solution: Modified medgemma.py to detect system messages and fold them into the first user message with a [System Instructions] prefix. All pipeline steps now work correctly.
File changed: src/backend/app/services/medgemma.py
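A minimal sketch of the folding approach, assuming OpenAI-style message dicts; the helper name is illustrative rather than the actual `medgemma.py` code.

```python
# Sketch of folding system messages into the first user message for endpoints
# that reject role="system". Message dicts follow the OpenAI chat format;
# the function name is an illustrative assumption.
def fold_system_prompt(messages: list[dict]) -> list[dict]:
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if system_parts and rest and rest[0]["role"] == "user":
        prefix = "[System Instructions]\n" + "\n".join(system_parts) + "\n\n"
        rest[0]["content"] = prefix + rest[0]["content"]
    return rest
```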
Bug: RxNorm API β rxnormId Is a List
Problem: The drug interaction checker crashed when querying RxNorm. The NLM API returns rxnormId as a list (e.g., ["12345"]), not a scalar string. The code assumed a string.
Solution: Added type checking: if rxnormId is a list, take the first element; if it's a string, use it directly.
File changed: src/backend/app/tools/drug_interactions.py
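A sketch of the normalization, with a hypothetical helper name:

```python
# Sketch: normalize rxnormId, which the RxNorm response may return as a list
# (e.g., ["12345"]) or a plain string. Helper name is hypothetical.
def extract_rxcui(rxnorm_id) -> str | None:
    if isinstance(rxnorm_id, list):
        return str(rxnorm_id[0]) if rxnorm_id else None
    if isinstance(rxnorm_id, str) and rxnorm_id:
        return rxnorm_id
    return None
```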
Bug: OpenAI SDK Version Mismatch
Problem: The pinned openai==1.0.0 release had breaking API differences from the SDK pattern the code was written against.
Solution: Pinned to openai==1.51.0 in requirements.txt, which is compatible with both the modern SDK API and the Google AI Studio OpenAI-compatible endpoint.
File changed: src/backend/requirements.txt
Bug: Port 8000 Zombie Processes
Problem: Previous server instances left zombie processes holding port 8000. New uvicorn instances couldn't bind.
Solution: Switched to port 8002 for development. Updated next.config.js and useAgentWebSocket.ts to proxy to 8002.
Files changed: src/frontend/next.config.js, src/frontend/src/hooks/useAgentWebSocket.ts
Phase 3: First Successful E2E Test
Test Case: Chest Pain / ACS
Submitted a 62-year-old male with crushing substernal chest pain, diaphoresis, HTN, on lisinopril + metformin + atorvastatin.
Results - all 5 steps passed:
| Step | Duration | Outcome |
|---|---|---|
| Parse | 7.8 s | Correct structured extraction |
| Reason | 21.2 s | ACS as top differential (correct) |
| Drug Check | 11.3 s | Queried all 3 medications |
| Guidelines | 9.6 s | Retrieved ACS/chest pain guidelines |
| Synthesis | 25.3 s | Comprehensive report with recommendations |
This was the first end-to-end success. Total pipeline: ~75 seconds.
Phase 4: Project Direction Shift
Decision: From Competition to Real Application
After achieving the first successful E2E test, made the decision to shift focus from "winning a competition" to "building a genuinely important medical application." The clinical decision support problem is real and impactful regardless of competition outcomes.
This shift influenced subsequent work, with emphasis on:
- Comprehensive clinical coverage (more specialties, more guidelines)
- Thorough testing (not just demos)
- Proper documentation
Phase 5: RAG Expansion
Guideline Corpus: 2 → 62
The initial RAG system had only 2 minimal fallback guidelines. Expanded to a comprehensive corpus:
- Created: `app/data/clinical_guidelines.json` - 62 guidelines across 14 specialties
- Updated: `guideline_retrieval.py` - loads from JSON, stores specialty/ID metadata in ChromaDB
- Sources: ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, APA, AAP, ACR, ASH, KDIGO, WHO, USPSTF
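A rough sketch of the load-and-index step, assuming each JSON entry has `id`, `specialty`, `source`, and `text` fields (the real corpus schema is not documented here):

```python
# Sketch of loading the guideline corpus from JSON into ChromaDB with
# all-MiniLM-L6-v2 embeddings. The JSON field names are assumptions.
import json

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./data/chroma")
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection(
    "clinical_guidelines", embedding_function=embed_fn
)

with open("app/data/clinical_guidelines.json", encoding="utf-8") as f:
    guidelines = json.load(f)

collection.add(
    ids=[g["id"] for g in guidelines],
    documents=[g["text"] for g in guidelines],
    metadatas=[{"specialty": g["specialty"], "source": g.get("source", "")} for g in guidelines],
)
```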
ChromaDB Rebuild
Had to kill locking processes holding the ChromaDB files before rebuilding. After clearing locks, ChromaDB successfully indexed all 62 guidelines with all-MiniLM-L6-v2 embeddings (384 dimensions).
Phase 6: Comprehensive Test Suite
RAG Quality Tests (30 queries)
Created test_rag_quality.py with 30 clinical queries, each mapped to an expected guideline ID:
- Result: 30/30 passed (100%)
- Average relevance score: 0.639
- Every query returned the correct guideline as the #1 result
- All 14 specialty categories achieved 100% pass rate
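For illustration, a single top-1 check against such a collection might look like the following; the query text and guideline ID are hypothetical, not actual test fixtures.

```python
# Sketch of one RAG quality check: the expected guideline must be the
# top-ranked result for the query. Uses ChromaDB's query() return shape.
def top1_matches(collection, query: str, expected_id: str) -> bool:
    res = collection.query(query_texts=[query], n_results=5)
    top_ids = res["ids"][0]  # ranked IDs for the first (only) query
    return bool(top_ids) and top_ids[0] == expected_id

# e.g.: assert top1_matches(collection, "crushing substernal chest pain", "acs-chest-pain")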
Clinical Test Cases (22 scenarios)
Created test_clinical_cases.py with 22 diverse clinical scenarios:
- Covers 14+ specialties (Cardiology, EM, Endocrinology, Neurology, Pulmonology, GI, ID, Psych, Peds, Nephrology, Toxicology, Geriatrics)
- Each case has: clinical vignette, expected specialty, validation keywords
- Supports CLI flags: `--case`, `--specialty`, `--list`, `--report`, `--quiet`
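A minimal argparse sketch of that CLI surface; flag semantics are inferred from the flag names and may not match the real script.

```python
# Sketch of the CLI flags described above. Semantics and defaults are
# assumptions inferred from the flag names.
import argparse

parser = argparse.ArgumentParser(description="Run clinical test cases")
parser.add_argument("--case", help="Run a single case by ID")
parser.add_argument("--specialty", help="Run only cases for one specialty")
parser.add_argument("--list", action="store_true", help="List available cases and exit")
parser.add_argument("--report", help="Write a results report to this path")
parser.add_argument("--quiet", action="store_true", help="Suppress per-case output")
args = parser.parse_args()
```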
Phase 7: Documentation
Performed comprehensive documentation audit. Found:
- README was outdated (wrong port, missing test info, incomplete structure tree)
- Architecture doc lacked implementation specifics (RAG details, Gemma workaround, timing)
- Writeup draft was 100% TODO placeholders
- No test results documentation existed
- No development log existed
Rewrote/created all documentation:
- README.md - Complete rewrite with results, RAG corpus info, updated structure, corrected setup
- docs/architecture.md - Updated with actual implementation details, timing, config, limitations
- docs/test_results.md - New file documenting all test results and reproduction steps
- DEVELOPMENT_LOG.md - This file
- docs/writeup_draft.md - Filled in with actual project information
Phase 8: Conflict Detection Feature
Design Decision: Drop Confidence Scores, Add Conflict Detection
During review, identified that the system's "confidence" was just the LLM picking a label (LOW/MODERATE/HIGH), not a calibrated score. Composite numeric confidence scores were considered and rejected because:
- Uncalibrated confidence values are dangerous (clinician anchoring bias)
- No training data exists to calibrate outputs
- A single number hides more than it reveals
Instead, added Conflict Detection, a new pipeline step that compares guideline recommendations against the patient's actual data to identify specific, actionable gaps. This provides direct patient safety value without requiring calibration.
Implementation
New models added to schemas.py:
- `ConflictType` enum - 6 categories: omission, contradiction, dosage, monitoring, allergy_risk, interaction_gap
- `ClinicalConflict` model - each conflict has: type, severity, guideline_source, guideline_text, patient_data, description, suggested_resolution
- `ConflictDetectionResult` - list of conflicts + summary + guidelines_checked count
- `conflicts` field added to `CDSReport`
- `conflict_detection` field added to `AgentState`
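A sketch of these models in Pydantic v2, with fields taken from the description above and types assumed:

```python
# Sketch of the conflict-detection models. Field names follow the log's
# description; exact types and defaults are assumptions.
from enum import Enum

from pydantic import BaseModel

class ConflictType(str, Enum):
    OMISSION = "omission"
    CONTRADICTION = "contradiction"
    DOSAGE = "dosage"
    MONITORING = "monitoring"
    ALLERGY_RISK = "allergy_risk"
    INTERACTION_GAP = "interaction_gap"

class ClinicalConflict(BaseModel):
    type: ConflictType
    severity: str  # e.g., critical / high / moderate / low
    guideline_source: str
    guideline_text: str
    patient_data: str
    description: str
    suggested_resolution: str

class ConflictDetectionResult(BaseModel):
    conflicts: list[ClinicalConflict] = []
    summary: str = ""
    guidelines_checked: int = 0
```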
New tool: conflict_detection.py:
- Takes patient profile, clinical reasoning, drug interactions, and guidelines
- Uses MedGemma at low temperature (0.1) for safety-critical analysis
- Returns a structured `ConflictDetectionResult` with specific, actionable conflicts
- Graceful degradation: returns an empty result if no guidelines are available
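The control flow could be sketched as follows, reusing the models above; `llm_complete` and `build_conflict_prompt` are hypothetical stand-ins for the project's LLM service and prompt builder.

```python
# Sketch of the tool's control flow (not the actual conflict_detection.py).
# llm_complete and build_conflict_prompt are hypothetical stand-ins.
async def detect_conflicts(patient, reasoning, interactions, guidelines,
                           llm_complete, build_conflict_prompt) -> ConflictDetectionResult:
    if not guidelines:
        # Graceful degradation: nothing to compare against.
        return ConflictDetectionResult(summary="No guidelines retrieved; check skipped.")
    prompt = build_conflict_prompt(patient, reasoning, interactions, guidelines)
    raw_json = await llm_complete(prompt, temperature=0.1, max_tokens=2000)
    return ConflictDetectionResult.model_validate_json(raw_json)
```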
Pipeline changes (orchestrator.py):
- Pipeline expanded from 5 to 6 steps
- New Step 5: Conflict Detection (between guideline retrieval and synthesis)
- Synthesis (now Step 6) receives conflict data and prominently includes it in the report
Synthesis changes (synthesis.py):
- Accepts a `conflict_detection` parameter
- New "Conflicts & Gaps" section in the synthesis prompt
- Fallback: copies detected conflicts directly into report if LLM doesn't populate the structured field
Frontend changes (CDSReport.tsx):
- New "Conflicts & Gaps Detected" section with high visual prominence
- Red border container, severity-coded left-accent cards (critical=red, high=orange, moderate=yellow, low=blue)
- Side-by-side "Guideline says" vs "Patient data" comparison
- Green-highlighted suggested resolutions
- Positioned immediately after drug interactions for maximum visibility
Files created: src/backend/app/tools/conflict_detection.py (1 new file)
Files modified: schemas.py, orchestrator.py, synthesis.py, CDSReport.tsx (4 files)
Dependency Inventory
Python Backend (requirements.txt)
| Package | Version | Purpose |
|---|---|---|
| fastapi | 0.115.0 | Web framework |
| uvicorn | 0.30.6 | ASGI server |
| openai | 1.51.0 | LLM API client (OpenAI-compatible) |
| chromadb | 0.5.7 | Vector database for RAG |
| sentence-transformers | 3.1.1 | Embedding model |
| httpx | 0.27.2 | Async HTTP client (API calls) |
| torch | 2.4.1 | PyTorch (sentence-transformers dependency) |
| transformers | 4.45.0 | HuggingFace transformers |
| pydantic-settings | 2.5.2 | Settings management |
| pydantic | 2.9.2 | Data validation |
| websockets | 13.1 | WebSocket support |
| python-dotenv | 1.0.1 | .env file loading |
| numpy | 1.26.4 | Numerical computing |
Frontend (package.json)
| Package | Purpose |
|---|---|
| next 14.x | React framework |
| react 18.x | UI library |
| typescript | Type safety |
| tailwindcss | Styling |
Environment Configuration
All config via .env (template in .env.template):
| Variable | Required | Default | Description |
|---|---|---|---|
| `MEDGEMMA_API_KEY` | Yes | - | HuggingFace API token or Google AI Studio API key |
| `MEDGEMMA_BASE_URL` | No | `""` (empty) | LLM endpoint (HF Endpoint URL/v1 or Google AI Studio URL) |
| `MEDGEMMA_MODEL_ID` | No | `google/medgemma` | Model identifier (`tgi` for HF Endpoints, or full model name) |
| `HF_TOKEN` | No | `""` | HuggingFace token for dataset downloads |
| `CHROMA_PERSIST_DIR` | No | `./data/chroma` | ChromaDB storage |
| `EMBEDDING_MODEL` | No | `sentence-transformers/all-MiniLM-L6-v2` | RAG embeddings |
| `MAX_GUIDELINES` | No | `5` | Guidelines per RAG query |
| `AGENT_TIMEOUT` | No | `120` | Pipeline timeout (seconds) |
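For reference, a minimal pydantic-settings sketch that would read these variables; defaults mirror the table, but the actual `config.py` may differ.

```python
# Sketch of reading the variables above with pydantic-settings.
# Types and defaults mirror the table; this is not the project's config.py.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    MEDGEMMA_API_KEY: str
    MEDGEMMA_BASE_URL: str = ""
    MEDGEMMA_MODEL_ID: str = "google/medgemma"
    HF_TOKEN: str = ""
    CHROMA_PERSIST_DIR: str = "./data/chroma"
    EMBEDDING_MODEL: str = "sentence-transformers/all-MiniLM-L6-v2"
    MAX_GUIDELINES: int = 5
    AGENT_TIMEOUT: int = 120

settings = Settings()
```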
Phase 9: External Dataset Validation Framework
Motivation
Internal tests (RAG quality, clinical cases) are useful but don't measure diagnostic accuracy against ground truth. Added a validation framework to test the full pipeline against real-world clinical datasets with known correct answers.
Datasets Evaluated
| Dataset | Source | What It Tests |
|---|---|---|
| MedQA (USMLE) | HuggingFace - `GBaker/MedQA-USMLE-4-options` | Diagnostic accuracy (1,273 USMLE-style questions with verified answers) |
| MTSamples | GitHub - `socd06/medical-nlp` | Parse quality & field completeness on real medical transcription notes |
| PMC Case Reports | PubMed E-utilities (esearch + efetch) | Diagnostic accuracy on published case reports with known diagnoses |
Architecture
Created src/backend/validation/ package:
- `base.py` - Core framework: `ValidationCase`, `ValidationResult`, `ValidationSummary` dataclasses. `run_cds_pipeline()` invokes the Orchestrator directly (no HTTP server needed). Includes a `fuzzy_match()` token-overlap scorer and a `diagnosis_in_differential()` checker.
- `harness_medqa.py` - Downloads JSONL from HuggingFace, extracts clinical vignettes (strips question stems), scores top-1/top-3/mentioned diagnostic accuracy.
- `harness_mtsamples.py` - Downloads the CSV, filters to relevant specialties, stratified sampling. Scores parse success, field completeness, specialty alignment, has_differential, has_recommendations.
- `harness_pmc.py` - Uses NCBI E-utilities with 20 curated queries across specialties. Extracts diagnoses from article titles via regex patterns. Scores diagnostic accuracy.
- `run_validation.py` - Unified CLI: `python -m validation.run_validation --all --max-cases 10`. Supports `--fetch-only`, `--no-drugs`, `--no-guidelines`, `--seed`, `--delay`.
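As an illustration of the token-overlap idea behind `fuzzy_match()`, here is a sketch; the threshold and tokenization are assumptions, not the framework's exact logic.

```python
# Sketch of a token-overlap fuzzy matcher like the one described for base.py.
# Threshold and tokenization are assumptions.
import re

def fuzzy_match(predicted: str, expected: str, threshold: float = 0.6) -> bool:
    def tokens(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    p, e = tokens(predicted), tokens(expected)
    if not e:
        return False
    # Fraction of expected-diagnosis tokens present in the prediction.
    return len(p & e) / len(e) >= threshold
```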
Problems Solved
- MedQA URL 404: The original GitHub raw URL was stale. Fixed to a direct HuggingFace download.
- MTSamples URL 404: The original mirror was down. Found a working mirror at `socd06/medical-nlp`.
- PMC fetcher returned 0 cases: The PubMed API worked, but the title regex patterns didn't match common formats like "X: A Case Report." Added 3 new title patterns and fixed the query-based fallback extraction.
- `datetime.utcnow()` deprecation: Replaced with `datetime.now(timezone.utc)` throughout.
- Pipeline time display bug: `print_summary` showed time metrics as percentages. Fixed by reordering the type checks.
Initial Results (Smoke Test)
Ran 3 MedQA cases through the full pipeline:
- Parse success: 100% (3/3)
- Top-1 diagnostic accuracy: 66.7% (2/3)
- Avg pipeline time: ~94 seconds per case
Full validation runs (50β100+ cases) are planned for the next session.
Files created: validation/__init__.py, validation/base.py, validation/harness_medqa.py, validation/harness_mtsamples.py, validation/harness_pmc.py, validation/run_validation.py
Files modified: .gitignore (added validation/data/ and validation/results/)
Phase 11: MedGemma HuggingFace Dedicated Endpoint
Motivation
The competition requires using HAI-DEF models (MedGemma). Google AI Studio served gemma-3-27b-it for development, but for the final submission we needed the actual google/medgemma-27b-text-it model. HuggingFace Dedicated Endpoints provide an OpenAI-compatible TGI server with scale-to-zero billing.
Deployment
- Endpoint name: `medgemma-27b-cds`
- Model: `google/medgemma-27b-text-it`
- Instance: 1× NVIDIA A100 80 GB (AWS `us-east-1`)
- Container: Text Generation Inference (TGI) with `DTYPE=bfloat16`
- Scale-to-zero: Enabled (15-minute idle timeout)
- Cost: ~$2.50/hr when running
Key Configuration
After initial deployment, the default TGI token limits (MAX_INPUT_TOKENS=4096) caused 422 errors on longer synthesis prompts. Updated endpoint environment:
- `MAX_INPUT_TOKENS=12288`
- `MAX_TOTAL_TOKENS=16384`
Also reduced per-step max_tokens to stay within limits:
- `patient_parser.py`: 1500
- `clinical_reasoning.py`: 3072
- `conflict_detection.py`: 2000
- `synthesis.py`: 3000
Code Changes
- `medgemma.py`: Updated to send `role: "system"` natively (TGI supports it), with automatic fallback to folding the system prompt into the user message for Google AI Studio compatibility.
- `.env`: Updated `MEDGEMMA_BASE_URL` to the HF endpoint URL, `MEDGEMMA_API_KEY` to the HF token, `MEDGEMMA_MODEL_ID=tgi`.
- `.env.template`: Updated with the MedGemma model name and HF Endpoint instructions.
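A sketch of that dual-path behavior using the OpenAI SDK; the error handling details are assumptions, and `fold_system_prompt` refers to the Phase 2 sketch above.

```python
# Sketch: try the native system role first (TGI accepts it); on a 400 from
# endpoints that reject it (e.g., Google AI Studio), fold and retry.
# Error-handling specifics are assumptions, not the actual medgemma.py.
from openai import AsyncOpenAI, BadRequestError

async def chat(client: AsyncOpenAI, model: str, messages: list[dict], **kwargs):
    try:
        return await client.chat.completions.create(model=model, messages=messages, **kwargs)
    except BadRequestError:
        folded = fold_system_prompt(messages)  # folding helper from the Phase 2 sketch
        return await client.chat.completions.create(model=model, messages=folded, **kwargs)
```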
Verification
Single-case test: Chikungunya question; the correct diagnosis appeared at rank 5 in the differential. All 6 pipeline steps completed in 281 s.
Deployment guide: docs/deploy_medgemma_hf.md
Phase 12: 50-Case MedQA Validation
Setup
Ran 50 MedQA (USMLE) cases through the full pipeline using the MedGemma HF Endpoint:
cd src/backend
python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
Results
| Metric | Value |
|---|---|
| Cases run | 50 |
| Pipeline success | 94% (47/50) |
| Top-1 diagnostic accuracy | 36% |
| Top-3 diagnostic accuracy | 38% |
| Differential accuracy | 10% |
| Mentioned in report | 38% |
| Avg pipeline time | 204 s/case |
| Total run time | ~60 min |
Question Type Breakdown
Used analyze_results.py to categorize the 50 cases:
| Type | Count | Mentioned | Differential |
|---|---|---|---|
| Diagnostic | 36 | 14 (39%) | 5 (14%) |
| Treatment | 6 | - | - |
| Pathophysiology | 6 | - | - |
| Statistics | 1 | - | - |
| Anatomy | 1 | - | - |
Key Observations
- MedQA includes many non-diagnostic questions (treatment, mechanism, stats) that the CDS pipeline is not designed to answer; it generates differential diagnoses, not multiple-choice answers.
- On diagnostic questions specifically, 39% mentioned accuracy is reasonable for a pipeline that wasn't optimized for exam-style questions.
- Pipeline failures (3/50) were caused by the HF endpoint scaling to zero mid-run. The `--resume` flag successfully continued from the checkpoint.
- Improved the clinical reasoning prompt to demand disease-level diagnoses rather than symptom categories (e.g., "Chikungunya" not "viral arthritis").
Infrastructure Improvements
- Incremental JSONL checkpoints: each case result is appended to `medqa_checkpoint.jsonl` as it completes.
- `--resume` flag: skips already-completed cases, enabling graceful recovery from endpoint failures.
- `check_progress.py`: utility to monitor checkpoint progress during long runs.
- `analyze_results.py`: categorizes MedQA results by question type for more meaningful accuracy analysis.
- Unicode fixes: replaced box-drawing characters and non-ASCII symbols with ASCII equivalents for Windows console compatibility.
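A sketch of the checkpoint/resume pattern; the file name matches the log, while record fields and the run loop are illustrative.

```python
# Sketch of incremental JSONL checkpointing with resume. Record fields
# (e.g., "case_id") and the run loop are illustrative assumptions.
import json
from pathlib import Path

def load_completed(path: Path) -> set[str]:
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["case_id"] for line in f if line.strip()}

def append_result(path: Path, result: dict) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")

# Resume loop: skip anything already checkpointed.
# checkpoint = Path("medqa_checkpoint.jsonl")
# done = load_completed(checkpoint)
# for case in cases:
#     if case.case_id in done:
#         continue
#     append_result(checkpoint, run_case(case))
```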
Files created: validation/analyze_results.py, validation/check_progress.py
Files modified: validation/base.py, validation/harness_medqa.py, validation/run_validation.py, app/tools/clinical_reasoning.py, app/tools/synthesis.py, app/tools/conflict_detection.py, app/tools/patient_parser.py
Phase 10: Final Documentation Audit & Cleanup
Performed a full accuracy audit of all 5 documentation files and test_e2e.py.
Issues found and fixed:
- README.md: step count said "5" in the E2E table (fixed to 6), missing Conflict Detection row, missing `validation/` in the project structure, missing validation section and test commands
- architecture.md: Design Decision #1 said "5-step" (fixed to 6), Decision #4 said "Gemma in two roles" (fixed to four), no validation framework section
- test_results.md: no external validation section, stale line count for test_e2e.py
- DEVELOPMENT_LOG.md: Phase 7 said "(Current)", missing Phase 9 for validation framework
- writeup_draft.md: referenced "confidence levels" (removed earlier), placeholder links, no validation methodology
- test_e2e.py: no assertions on step count or conflict_detection step
Created: TODO.md in project root with next-session action items for easy pickup by future contributors or AI instances.