bshepp committed on
Commit eaea340 · 1 Parent(s): fdd7dd5

Clean repo for portfolio: archive competition internals, update README and CLAUDE.md

.gitignore CHANGED
@@ -49,6 +49,9 @@ models/*.pt
models/*.onnx
models/*.safetensors

+ # Archive (competition internals, planning docs)
+ archive/
+
# Notebooks checkpoints
.ipynb_checkpoints/

CLAUDE.md CHANGED
@@ -6,46 +6,20 @@

## Project Overview

- **CDS Agent** is an agentic clinical decision support system built for the [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge) (Kaggle / Google Research). It orchestrates MedGemma across a multi-step pipeline to produce clinical decision support reports.
+ **CDS Agent** is an agentic clinical decision support system that orchestrates MedGemma 27B across a 6-step pipeline to produce clinical decision support reports from free-text patient cases. Originally built for the MedGemma Impact Challenge (Kaggle / Google Research).

- **Deadline:** February 24, 2026, 11:59 PM UTC
+ **Live demo:** [demo.briansheppard.com](https://demo.briansheppard.com)

---

- ## Track System — READ THIS
-
- This project uses an **experimental track system** to evaluate multiple diagnostic accuracy strategies in strict isolation. Each track is an independent pipeline variant with its own files, configuration, and results.
-
- **The track registry is in [TRACKS.md](TRACKS.md).** That file is the single source of truth for:
- - Which tracks exist and what they do
- - Which files belong to which track
- - File tagging conventions
- - Isolation rules
-
- ### Track Isolation Rules (Summary)
-
- 1. **Every file owned by a track MUST have a track tag on line 1** — a comment identifying its track ID (e.g., `# [Track B: RAG Variants]`). The exact format depends on the file type.
- 2. **Never modify a file owned by one track to benefit another.** Shared code lives in `src/backend/tracks/shared/`.
- 3. **The baseline pipeline (`src/backend/app/`) is Track A.** Experimental tracks extend or wrap Track A code — they do NOT modify it.
- 4. **Results from each track are stored separately** under `src/backend/tracks/<track_dir>/results/`.
- 5. **Cross-track comparison** is performed only via shared utilities in `src/backend/tracks/shared/`.
-
- See **[TRACKS.md](TRACKS.md)** for the complete specification.
-
- ---
-
- ## Critical Files
+ ## Key Files

| File | Purpose |
|------|---------|
- | **[TRACKS.md](TRACKS.md)** | Track registry, file ownership, isolation rules — **start here for experimental work** |
- | **[EXPERIMENT_PLAN.md](EXPERIMENT_PLAN.md)** | 4-phase execution plan for accuracy optimization — **the step-by-step playbook** |
- | [TODO.md](TODO.md) | Session-level action items and project status |
| [DEVELOPMENT_LOG.md](DEVELOPMENT_LOG.md) | Chronological build history and decisions |
- | [SUBMISSION_GUIDE.md](SUBMISSION_GUIDE.md) | Competition rules, timeline, and submission checklist |
- | [docs/kaggle_writeup.md](docs/kaggle_writeup.md) | Final writeup content for Kaggle submission |
- | [docs/video_script.md](docs/video_script.md) | 3-minute demo video narration script |
| [docs/architecture.md](docs/architecture.md) | System architecture and design decisions |
+ | [docs/test_results.md](docs/test_results.md) | Detailed test results and benchmarks |
+ | [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HF Endpoint deployment guide |

---

@@ -53,28 +27,35 @@ See **[TRACKS.md](TRACKS.md)** for the complete specification.

```
medgemma_impact_challenge/
- ├── CLAUDE.md              ← You are here
- ├── TRACKS.md              ← Track registry and isolation rules
- ├── TODO.md                ← Next-session action items
- ├── DEVELOPMENT_LOG.md     ← Build history
+ ├── CLAUDE.md                          <- You are here
+ ├── DEVELOPMENT_LOG.md                 <- Build history
├── src/backend/
- │   ├── app/               ← Track A (Baseline) — production pipeline
- │   │   ├── agent/orchestrator.py
- │   │   ├── services/medgemma.py
- │   │   ├── tools/         ← 6 pipeline tools
- │   │   ├── models/schemas.py
- │   │   └── data/clinical_guidelines.json
- │   ├── tracks/            ← Experimental tracks
- │   │   ├── shared/        ← Cross-track utilities (cost tracking, comparison)
- │   │   ├── rag_variants/  ← Track B: Chunking & embedding experiments
- │   │   ├── iterative/     ← Track C: Serial iterative refinement
- │   │   ├── arbitrated/    ← Track D: Parallel specialists + arbiter
- │   │   ├── combined/      ← Track E: Composition of per-axis winners (Phase 3)
- │   │   ├── prompt_arch/   ← Track F: Prompt architecture variants (Phase 2)
- │   │   ├── voting/        ← Track G: Multi-sample voting (Phase 2)
- │   │   └── verification/  ← Track H: Evidence verification (Phase 2)
- │   └── validation/        ← Validation framework (shared across all tracks)
- └── src/frontend/          ← Next.js frontend (not track-specific)
+ │   ├── app/                           <- Production pipeline
+ │   │   ├── agent/orchestrator.py      <- 6-step pipeline orchestrator
+ │   │   ├── services/medgemma.py       <- LLM service (OpenAI-compatible)
+ │   │   ├── tools/                     <- 6 pipeline tools
+ │   │   │   ├── patient_parser.py         Step 1: Free-text -> structured data
+ │   │   │   ├── clinical_reasoning.py     Step 2: Differential diagnosis
+ │   │   │   ├── drug_interactions.py      Step 3: OpenFDA + RxNorm APIs
+ │   │   │   ├── guideline_retrieval.py    Step 4: RAG over ChromaDB
+ │   │   │   ├── conflict_detection.py     Step 5: Guideline vs patient gaps
+ │   │   │   └── synthesis.py              Step 6: CDS report generation
+ │   │   ├── models/schemas.py          <- Pydantic data models
+ │   │   ├── data/clinical_guidelines.json  <- 62 guidelines, 14 specialties
+ │   │   └── api/                       <- REST + WebSocket endpoints
+ │   ├── tracks/                        <- Experimental pipeline variants
+ │   │   ├── shared/                    <- Cross-track utilities
+ │   │   ├── rag_variants/              <- Chunking & embedding experiments
+ │   │   ├── iterative/                 <- Serial iterative refinement
+ │   │   └── arbitrated/                <- Parallel specialists + arbiter
+ │   └── validation/                    <- External dataset validation framework
+ │       ├── harness_medqa.py           <- MedQA (USMLE) diagnostic accuracy
+ │       ├── harness_mtsamples.py       <- MTSamples parse quality
+ │       └── harness_pmc.py             <- PMC Case Reports diagnostic accuracy
+ └── src/frontend/                      <- Next.js 14 + React 18 + TypeScript
+     └── src/
+         ├── components/                <- PatientInput, AgentPipeline, CDSReport
+         └── hooks/                     <- WebSocket state management
```

---

@@ -84,8 +65,6 @@ medgemma_impact_challenge/

- **Python style:** Pydantic v2 for all data models, async throughout, type hints everywhere
- **LLM calls:** Always go through `app/services/medgemma.py` — never instantiate the OpenAI SDK directly
- **Structured output:** Use `medgemma.generate_structured(prompt, response_model)` with Pydantic models
- - **Temperature conventions:** 0.1 for safety-critical/extraction, 0.2–0.3 for reasoning/synthesis
+ - **Temperature conventions:** 0.1 for safety-critical/extraction, 0.2-0.3 for reasoning/synthesis
- **Error handling:** Graceful degradation — return partial results rather than crashing
- **No framework dependencies:** Custom orchestrator, no LangChain/LlamaIndex
- - **Windows compatibility:** ASCII characters only in console output (no box-drawing or Unicode symbols)
- - **Track tagging:** Line 1 of every track-owned file must carry the track tag comment
 
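The `generate_structured` convention in the conventions list above can be sketched as follows. This is illustrative only: the LLM call is stubbed with a canned JSON reply, and a stdlib dataclass stands in for the real Pydantic v2 response model, so the names `fake_llm` and `Differential` are assumptions, not the service's actual API.

```python
import json
from dataclasses import dataclass, fields
from typing import Type, TypeVar

T = TypeVar("T")

@dataclass
class Differential:
    """Illustrative stand-in for a Pydantic response model."""
    primary_diagnosis: str
    confidence: float

def fake_llm(prompt: str) -> str:
    # Stand-in for the MedGemma endpoint; the real service makes an HTTP call.
    return '{"primary_diagnosis": "community-acquired pneumonia", "confidence": 0.72}'

def generate_structured(prompt: str, response_model: Type[T]) -> T:
    """Call the model, parse its JSON reply, and coerce it into the schema
    (a dataclass here; Pydantic v2 does stricter validation in the real service)."""
    raw = json.loads(fake_llm(prompt))
    allowed = {f.name for f in fields(response_model)}
    return response_model(**{k: v for k, v in raw.items() if k in allowed})

report = generate_structured("55F, fever, productive cough...", Differential)
```

The value of the convention is that every LLM call site declares its expected output shape up front, so malformed model output fails at the parsing boundary rather than deep in the pipeline.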
EXPERIMENT_PLAN.md DELETED
@@ -1,806 +0,0 @@
1
- # EXPERIMENT_PLAN.md — 4-Phase Accuracy Optimization Plan
2
-
3
- > **Purpose:** Step-by-step execution plan for an AI agent or human to follow.
4
- > Each step is atomic, has clear inputs/outputs, and explicit success criteria.
5
- >
6
- > **Context:** Baseline accuracy is 36% top-1 on 50-case MedQA (seed=42). Our
7
- > goal is to find the best composite strategy before the Feb 24, 2026 deadline.
8
- >
9
- > **Prerequisite reading:** `CLAUDE.md` → `TRACKS.md` → this file.
10
-
11
- ---
12
-
13
- ## Infrastructure Prerequisites
14
-
15
- Before ANY phase, ensure:
16
-
17
- 1. **HF Endpoint is running.**
18
- - Go to https://ui.endpoints.huggingface.co → `medgemma-27b-cds` → Resume
19
- - Wait until status shows "Running" (5–15 min cold start)
20
- - Cost: ~$2.50/hr — **pause when done**
21
-
22
- 2. **Virtual environment is active.**
23
- ```powershell
24
- cd f:\kaggle\medgemma_impact_challenge\src\backend
25
- .\venv\Scripts\Activate.ps1
26
- ```
27
-
28
- 3. **Dependencies installed.**
29
- ```powershell
30
- pip install -r requirements.txt
31
- pip install sentence-transformers # Needed for Track B embedding variants
32
- ```
33
-
34
- 4. **Environment variables set.**
35
- - `.env` file in `src/backend/` must have `HF_TOKEN`, `MEDGEMMA_API_KEY`, `MEDGEMMA_BASE_URL`
36
- - Verify: `python -c "from app.config import Settings; s = Settings(); print(s.medgemma_base_url)"`
37
-
38
- 5. **Quick health check.** Run 1 case through baseline to confirm the endpoint responds:
39
- ```powershell
40
- python -m validation.run_validation --medqa --max-cases 1
41
- ```
42
- **Success:** Pipeline returns a `CDSReport` without timeout errors.
43
-
44
- ---
45
-
46
- ## Phase 1 — Independent Axis Sweeps
47
-
48
- **Goal:** Find the best single-axis configuration for B, C, and D independently.
49
- **Estimated cost:** ~$15–25 of endpoint time (6–10 hours)
50
- **Estimated cases:** 50 per config × (10 + 4 + 4) = 900 total pipeline runs
51
-
52
- ### Phase 1A — Track B: RAG Variants
53
-
54
- **What we're testing:** Which retrieval configuration gets the best documents in front of the model?
55
-
56
- #### Step 1A.1: Smoke Test (3 cases × 10 variants = 30 runs)
57
-
58
- ```powershell
59
- cd f:\kaggle\medgemma_impact_challenge\src\backend
60
- python -m tracks.rag_variants.run_variants --max-cases 3
61
- ```
62
-
63
- **Check for:**
64
- - [ ] All 10 variants complete without errors
65
- - [ ] Each variant produces a result JSON in `tracks/rag_variants/results/`
66
- - [ ] MedCPT and MPNet embedding models download successfully
67
- - [ ] Reranking variant (B9) loads the cross-encoder model
68
- - [ ] Output shows a comparison table with per-variant scores
69
-
70
- **If any variant fails:** Fix the error, then re-run with `--variant <id>` to test just that one:
71
- ```powershell
72
- python -m tracks.rag_variants.run_variants --variant B6_medcpt --max-cases 3
73
- ```
74
-
75
- **Common failure modes:**
76
- - `sentence-transformers` not installed → `pip install sentence-transformers`
77
- - MedCPT download fails → check `HF_TOKEN` is set
78
- - ChromaDB lock → delete `tracks/rag_variants/data/chroma/` and retry
79
-
80
- #### Step 1A.2: Full Sweep (50 cases × 10 variants = 500 runs)
81
-
82
- ```powershell
83
- python -m tracks.rag_variants.run_variants
84
- ```
85
-
86
- **Expected runtime:** 3–5 hours (50 cases × 10 variants, ~2 min/case with API latency)
87
-
88
- **Output:** Results in `tracks/rag_variants/results/` — one JSON per variant.
89
-
90
- #### Step 1A.3: Identify B*
91
-
92
- Read the comparison table printed at the end, or run:
93
- ```powershell
94
- python -m tracks.shared.compare --tracks B --dataset medqa
95
- ```
96
-
97
- **Record the winner:**
98
- ```
99
- B* = ____________ (variant_id)
100
- B* top-1 accuracy = _____%
101
- B* improvement over B0_baseline = +_____%
102
- ```
103
-
104
- **Decision rules:**
105
- - If the best variant beats B0 by <2%, retrieval isn't the bottleneck. Note this, but still carry B* forward.
106
- - If multiple variants tie within 1%, prefer the one with lower latency/complexity.
107
- - If reranking (B9) wins, note the added latency cost.
108
-
109
- ---
110
-
111
- ### Phase 1B — Track C: Iterative Refinement
112
-
113
- **What we're testing:** Does repeated self-critique improve diagnostic accuracy? At what point do returns diminish?
114
-
115
- #### Step 1B.1: Smoke Test (3 cases × 4 configs = 12 runs)
116
-
117
- ```powershell
118
- python -m tracks.iterative.run_iterative --max-cases 3
119
- ```
120
-
121
- **Check for:**
122
- - [ ] All 4 configs complete without errors
123
- - [ ] Per-iteration accuracy and cost data is printed
124
- - [ ] Convergence detection works (C0_2rounds should always run all 2 iterations; C2_5rounds might converge early)
125
- - [ ] Cost ledger populates correctly
126
-
127
- **If a config hangs:** Likely an LLM timeout. Check that the endpoint is warm. The iterative track makes 2-10× more LLM calls per case than baseline.
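The convergence detection checked above might look roughly like the sketch below; `refine_until_converged`, the critique callable, and the toy critic are all illustrative stand-ins, not the iterative track's actual API.

```python
from typing import Callable, List, Tuple

def refine_until_converged(
    initial: List[str],
    critique: Callable[[List[str]], List[str]],
    max_rounds: int,
) -> Tuple[List[str], int]:
    """Apply self-critique rounds, stopping early once the differential stops changing."""
    current = initial
    for round_idx in range(1, max_rounds + 1):
        revised = critique(current)
        if revised == current:  # converged: the critic made no changes
            return current, round_idx - 1
        current = revised
    return current, max_rounds

# Toy critic that promotes "PE" to the top once, then stabilizes.
def toy_critic(dx: List[str]) -> List[str]:
    return sorted(dx, key=lambda d: d != "PE")

final, rounds_used = refine_until_converged(["MI", "PE", "GERD"], toy_critic, max_rounds=5)
print(final, rounds_used)  # ['PE', 'MI', 'GERD'] 1
```

An early-converging run like this is exactly the signal the decision rules below look for: if C2_5rounds usually stops after 2 rounds, the extra budget is wasted.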
128
-
129
- #### Step 1B.2: Full Sweep (50 cases × 4 configs)
130
-
131
- ```powershell
132
- python -m tracks.iterative.run_iterative
133
- ```
134
-
135
- **Expected runtime:** 2–4 hours (C0 fastest, C3 slowest)
136
-
137
- **Output:** Results in `tracks/iterative/results/`
138
-
139
- #### Step 1B.3: Identify C*
140
-
141
- ```powershell
142
- python -m tracks.shared.compare --tracks C --dataset medqa
143
- ```
144
-
145
- **Record the winner:**
146
- ```
147
- C* = ____________ (config_id)
148
- C* top-1 accuracy = _____%
149
- C* avg iterations used = _____
150
- C* cost per case = $_____
151
- C* improvement over baseline = +_____%
152
- ```
153
-
154
- **Key data to extract:** The per-iteration accuracy curve. Plot or record:
155
- ```
156
- Iteration 0 (baseline): ___% top-1
157
- Iteration 1 (first critique): ___% top-1
158
- Iteration 2: ___% top-1
159
- Iteration 3: ___% top-1 (if applicable)
160
- ...
161
- ```
162
-
163
- **Decision rules:**
164
- - The winning config is the one with the best accuracy/cost ratio, not necessarily the one with the highest absolute accuracy.
165
- - If C2_5rounds converges at iteration 2 in most cases, the extra rounds aren't helping — C1_3rounds is probably enough.
166
- - If C3_aggressive loses accuracy (the critic is too harsh), note this as a failure mode.
167
-
168
- ---
169
-
170
- ### Phase 1C — Track D: Arbitrated Parallel
171
-
172
- **What we're testing:** Do multiple specialist perspectives, coordinated by an arbiter, find diagnoses a generalist misses?
173
-
174
- #### Step 1C.1: Smoke Test (3 cases × 4 configs = 12 runs)
175
-
176
- ```powershell
177
- python -m tracks.arbitrated.run_arbitrated --max-cases 3
178
- ```
179
-
180
- **Check for:**
181
- - [ ] All 4 configs complete without errors
182
- - [ ] Specialist outputs show domain-specific reasoning (cardiologist emphasizes cardiac, etc.)
183
- - [ ] Arbiter merge output is a coherent consensus differential, not just concatenation
184
- - [ ] For multi-round configs (D2, D3): tailored resubmission prompts are generated
185
- - [ ] For multi-round configs: second-round specialist outputs differ from first round
186
- - [ ] Cost tracking shows escalating cost with more specialists/rounds
187
-
188
- **If the arbiter produces garbage:** The merge prompt may need tuning. Check `ARBITER_MERGE_PROMPT` in `tracks/arbitrated/arbiter.py`.
189
-
190
- #### Step 1C.2: Full Sweep (50 cases × 4 configs)
191
-
192
- ```powershell
193
- python -m tracks.arbitrated.run_arbitrated
194
- ```
195
-
196
- **Expected runtime:** 3–6 hours (D0 fastest, D3 slowest — D3 runs 5 specialists × 2 rounds = 12 LLM calls/case)
197
-
198
- **Output:** Results in `tracks/arbitrated/results/`
199
-
200
- #### Step 1C.3: Identify D*
201
-
202
- ```powershell
203
- python -m tracks.shared.compare --tracks D --dataset medqa
204
- ```
205
-
206
- **Record the winner:**
207
- ```
208
- D* = ____________ (config_id)
209
- D* top-1 accuracy = _____%
210
- D* cost per case = $_____
211
- D* improvement over baseline = +_____%
212
- ```
213
-
214
- **Additional data to record:**
215
- ```
216
- Per-specialist contribution analysis:
217
- Cardiologist: Contributed unique correct dx in ___% of cases
218
- Neurologist: ____%
219
- ID Specialist: ____%
220
- General Internist: ____%
221
- Emergency Med: ____%
222
- Arbitration consensus rate: ____% of cases where >3 specialists agreed on top-1
223
- Round 2 lift (if applicable): +____% over round 1
224
- ```
225
-
226
- **Decision rules:**
227
- - If D0 (3-spec, 1-round) matches D3 (5-spec, 2-rounds), the extra cost isn't justified.
228
- - If specialists all agree in round 1, round 2 is wasted computation — future configs can drop it.
229
- - If one specialist consistently disagrees with the correct answer, consider removing it from the ensemble.
230
-
231
- ---
232
-
233
- ### Phase 1D — Cross-Track Comparison
234
-
235
- After all three tracks complete, run the unified comparison:
236
-
237
- ```powershell
238
- python -m tracks.shared.compare --dataset medqa
239
- ```
240
-
241
- **Expected output:**
242
- ```
243
- Cross-Track Comparison: MEDQA
244
- -------------------------------------------------------------
245
- Track Top-1 Top-3 Mentioned Pipeline Cost
246
- -------------------------------------------------------------
247
- A: Baseline 36.0% -- 38.0% 94.0% $X.XX
248
- B: RAG Variants ___% -- ___% ___% $X.XX
249
- C: Iterative ___% -- ___% ___% $X.XX
250
- D: Arbitrated ___% -- ___% ___% $X.XX
251
- -------------------------------------------------------------
252
- ```
253
-
254
- **Record Phase 1 summary:**
255
- ```
256
- B* = __________, accuracy = ____%, delta = +____%
257
- C* = __________, accuracy = ____%, delta = +____%
258
- D* = __________, accuracy = ____%, delta = +____%
259
- Best single axis: Track ___
260
- ```
261
-
262
- **Go/No-Go for Phase 2:**
263
- - If ALL tracks are within 2% of baseline → the model itself may be the bottleneck,
264
- not the pipeline. Consider investigating prompt architecture (Phase 2) more aggressively.
265
- - If ANY single track shows ≥5% lift → strong signal, proceed to Phase 2 and Phase 3.
266
- - If results are noisy (high variance) → increase to 100 cases or use a different seed
267
- to get more statistical power.
268
-
269
- ---
270
-
271
- ## Phase 2 — New Axes (F, G, H)
272
-
273
- **Goal:** Test 3 lightweight axes that are cheap to implement and orthogonal to B/C/D.
274
- **Build these ONLY after Phase 1 data is in.** Phase 1 results inform which axes matter most.
275
-
276
- ### Phase 2A — Track F: Prompt Architecture
277
-
278
- **Axis:** *How* the model is asked to reason, independent of depth (C) or breadth (D).
279
-
280
- **Why:** This is the cheapest axis to test — same token count, different structure. If prompt architecture matters more than retrieval or iteration, we want to know early.
281
-
282
- #### Step 2A.1: Build Track F
283
-
284
- Create `src/backend/tracks/prompt_arch/` with the track system conventions (see TRACKS.md "Adding a New Track").
285
-
286
- **Files to create:**
287
- ```
288
- tracks/prompt_arch/
289
- __init__.py # Track tag, package init
290
- config.py # PromptVariant dataclass + 5 variants
291
- reasoner.py # Modified clinical_reasoning that accepts prompt templates
292
- run_prompt_arch.py # Runner following same pattern as other tracks
293
- results/ # Output directory
294
- ```
295
-
296
- **Variant definitions:**
297
- | ID | Name | Strategy | Prompt Change |
298
- |----|------|----------|---------------|
299
- | F0 | Baseline | Current free-form | No change (control) |
300
- | F1 | Structured Template | Force structured output | System prompt: "For each symptom, list 3 possible causes. Identify diagnoses appearing in ≥2 symptom lists. Rank by frequency of appearance." |
301
- | F2 | Few-Shot | 2 worked examples | Add 2 solved MedQA cases (NOT from test set) to the system prompt as worked examples with reasoning chains |
302
- | F3 | Reverse Reasoning | Falsification | After initial differential: "For each of your top 5 diagnoses, list the findings you would EXPECT. Mark which are present, absent, or unknown in this patient. Re-rank based on match percentage." |
303
- | F4 | Bayesian | Prior updating | "Assign a prior probability to each diagnosis based on prevalence. For each finding, update posterior probability. Show the Bayesian reasoning chain. Final differential ordered by posterior." |
304
-
305
- **Implementation notes:**
306
- - `reasoner.py` should accept a `prompt_template: str` parameter and inject it into the system prompt or user prompt of the clinical reasoning call.
307
- - F0 uses the exact same system prompt as `app/tools/clinical_reasoning.py` — this is the control.
308
- - Few-shot examples (F2) need to come from MedQA TRAIN set, not the 50-case test set. Pick 2 from `validation/data/medqa_test.jsonl` that are NOT in the seed=42 sample, or create synthetic examples from textbook cases.
309
- - F3 and F4 require TWO LLM calls: first the initial differential, then the structured verification/update. This makes them comparable to C in cost but different in mechanism (structured verification vs. open-ended critique).
310
-
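The template-injection mechanism described in the notes above could be sketched like this. The F1 instruction text comes from the variant table; `BASE_SYSTEM`, `PromptVariant`, and `build_messages` are illustrative names, not the real `reasoner.py` API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BASE_SYSTEM = "You are a clinical reasoning assistant. Produce a ranked differential diagnosis."

@dataclass
class PromptVariant:
    variant_id: str
    template: Optional[str]  # extra instructions appended to the system prompt; None = control

VARIANTS = {
    "F0": PromptVariant("F0", None),
    "F1": PromptVariant(
        "F1",
        "For each symptom, list 3 possible causes. Identify diagnoses appearing "
        "in >=2 symptom lists. Rank by frequency of appearance.",
    ),
}

def build_messages(case_text: str, variant: PromptVariant) -> List[Dict[str, str]]:
    """Assemble chat messages, injecting the variant's template (F0 is the control)."""
    system = BASE_SYSTEM if variant.template is None else f"{BASE_SYSTEM}\n\n{variant.template}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": case_text},
    ]

msgs = build_messages("62M with crushing chest pain...", VARIANTS["F1"])
```

Keeping the variant as data rather than five copies of the reasoner keeps F0 byte-identical to the baseline prompt, which is what makes it a valid control.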
311
- #### Step 2A.2: Run Track F
312
-
313
- ```powershell
314
- # Smoke test
315
- python -m tracks.prompt_arch.run_prompt_arch --max-cases 3
316
-
317
- # Full sweep
318
- python -m tracks.prompt_arch.run_prompt_arch
319
- ```
320
-
321
- #### Step 2A.3: Identify F*
322
-
323
- ```
324
- F* = ____________
325
- F* top-1 accuracy = _____%
326
- F* improvement over F0 = +_____%
327
- ```
328
-
329
- ---
330
-
331
- ### Phase 2B — Track G: Multi-Sample Voting (Self-Consistency)
332
-
333
- **Axis:** Statistical diversity via repeated sampling at higher temperature.
334
-
335
- **Why:** Self-consistency is one of the most reliable accuracy boosters in the CoT literature. It's embarrassingly parallel and requires no new prompts — just `asyncio.gather()` over N samples.
336
-
337
- #### Step 2B.1: Build Track G
338
-
339
- Create `src/backend/tracks/voting/`.
340
-
341
- **Files:**
342
- ```
343
- tracks/voting/
344
- __init__.py
345
- config.py # VotingConfig: n_samples, temperature, aggregation_method
346
- voter.py # Generate N reasoning outputs, extract top-k diagnoses, vote
347
- run_voting.py
348
- results/
349
- ```
350
-
351
- **Variant definitions:**
352
- | ID | Samples | Temp | Aggregation | Description |
353
- |----|---------|------|-------------|-------------|
354
- | G0 | 1 | 0.3 | N/A | Control (identical to baseline) |
355
- | G1 | 3 | 0.5 | Majority vote | 3 samples, majority wins |
356
- | G2 | 5 | 0.5 | Majority vote | 5 samples, majority wins |
357
- | G3 | 5 | 0.7 | Weighted vote | 5 samples at higher diversity, weighted by internal consistency |
358
- | G4 | 3 | 0.5 | Best-of-N | 3 samples, pick the one whose differential best matches retrieved guidelines |
359
-
360
- **Implementation notes:**
361
- - `voter.py` calls `medgemma.generate()` N times in parallel with `asyncio.gather()`.
362
- - Temperature must be high enough to get diversity (≥0.5), otherwise all N samples will be nearly identical.
363
- - **Majority vote aggregation:** Extract top-1 diagnosis from each sample. The diagnosis appearing most frequently wins. If tied, use the one from the sample with the longest reasoning (proxy for confidence).
364
- - **Weighted vote (G3):** For each sample, check how many of its diagnoses are mentioned in the retrieved guidelines. Weight = number of guideline-grounded diagnoses. This penalizes hallucinated differentials.
365
- - **Best-of-N (G4):** Score each sample's differential against the retrieved guidelines using fuzzy_match overlap. Pick the highest-scoring sample wholesale.
366
- - Cost scales linearly: G2 costs 5× baseline reasoning per case.
367
-
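The majority-vote aggregation (with the longest-reasoning tie-break) described in the notes above can be sketched as follows; `DifferentialSample` and `majority_vote` are illustrative names, not the track's actual `voter.py` API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class DifferentialSample:
    """One sampled reasoning output (illustrative stand-in for the real schema)."""
    top1_diagnosis: str
    reasoning: str

def majority_vote(samples: List[DifferentialSample]) -> str:
    """Pick the most frequent top-1 diagnosis; break ties by longest reasoning."""
    counts = Counter(s.top1_diagnosis for s in samples)
    best_count = max(counts.values())
    tied = [d for d, c in counts.items() if c == best_count]
    if len(tied) == 1:
        return tied[0]
    # Tie-break: prefer the tied diagnosis whose sample has the longest reasoning
    # (a rough proxy for confidence, per the implementation notes).
    return max(
        (s for s in samples if s.top1_diagnosis in tied),
        key=lambda s: len(s.reasoning),
    ).top1_diagnosis

samples = [
    DifferentialSample("aortic stenosis", "short"),
    DifferentialSample("mitral regurgitation", "a much longer chain of reasoning"),
    DifferentialSample("aortic stenosis", "medium length"),
]
print(majority_vote(samples))  # aortic stenosis (2 of 3 votes)
```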
368
- #### Step 2B.2: Run Track G
369
-
370
- ```powershell
371
- python -m tracks.voting.run_voting --max-cases 3 # smoke
372
- python -m tracks.voting.run_voting # full
373
- ```
374
-
375
- #### Step 2B.3: Identify G*
376
-
377
- ```
378
- G* = ____________
379
- G* top-1 accuracy = _____%
380
- G* cost multiplier vs baseline = _____×
381
- ```
382
-
383
- ---
384
-
385
- ### Phase 2C — Track H: Evidence Verification (Post-Hoc Grounding)
386
-
387
- **Axis:** A structured fact-check pass that re-ranks the differential based on evidence alignment.
388
-
389
- **Why:** The model might rank a diagnosis #1 that isn't actually supported by the evidence. H catches this. It's different from C (which is open-ended self-critique) — H is specifically checking "does the evidence support this ranking?"
390
-
391
- #### Step 2C.1: Build Track H
392
-
393
- Create `src/backend/tracks/verification/`.
394
-
395
- **Files:**
396
- ```
397
- tracks/verification/
398
- __init__.py
399
- config.py # VerificationConfig
400
- verifier.py # Post-hoc evidence grounding check
401
- run_verification.py
402
- results/
403
- ```
404
-
405
- **Method for each case:**
406
- 1. Run baseline pipeline → get differential with top-5 diagnoses
407
- 2. For EACH diagnosis in the differential, make ONE LLM call:
408
- ```
409
- Patient findings: {summary}
410
- Retrieved guidelines: {relevant_guidelines}
411
- Diagnosis under review: {diagnosis_name}
412
-
413
- Task: List the specific findings from this patient that SUPPORT this diagnosis,
414
- the findings that ARGUE AGAINST it, and the findings that are NEUTRAL.
415
- Give a grounding score from 0-10 based on evidence alignment.
416
- ```
417
- 3. Re-rank the differential by grounding score (descending)
418
- 4. Use the re-ranked differential for scoring
419
-
420
- **Variant definitions:**
421
- | ID | Method | LLM Calls | Description |
422
- |----|--------|-----------|-------------|
423
- | H0 | None | 0 extra | Control |
424
- | H1 | Top-5 re-rank | 5 extra | Verify and re-rank all 5 diagnoses |
425
- | H2 | Top-3 re-rank | 3 extra | Verify only top 3 (cheaper) |
426
- | H3 | Eliminate-only | 5 extra | Don't re-rank — just DROP any diagnosis with score ≤3 and promote the rest |
427
-
428
- **Implementation notes:**
429
- - Use `medgemma.generate_structured()` with a Pydantic model for the grounding output:
430
- ```python
431
- class GroundingResult(BaseModel):
432
- diagnosis: str
433
- supporting_findings: List[str]
434
- opposing_findings: List[str]
435
- neutral_findings: List[str]
436
- grounding_score: int # 0-10
437
- ```
438
- - Temperature: 0.1 (this is extraction/evaluation, not generation)
439
- - Each verification call is independent → run all 5 in parallel with `asyncio.gather()`
440
-
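A minimal sketch of the H1 verify-and-re-rank step described above, with the structured LLM call stubbed out: `verify_one` stands in for the real `medgemma.generate_structured` call, and the grounding scores come from a canned table rather than the model.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class GroundingResult:
    diagnosis: str
    grounding_score: int  # 0-10, higher = better evidence alignment

# Stand-in scores; the real call would return a full GroundingResult from the model.
FAKE_SCORES: Dict[str, int] = {"dx_a": 4, "dx_b": 9, "dx_c": 7}

async def verify_one(diagnosis: str) -> GroundingResult:
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return GroundingResult(diagnosis, FAKE_SCORES[diagnosis])

async def rerank_by_grounding(differential: List[str]) -> List[str]:
    """Verify every diagnosis in parallel, then re-rank by grounding score."""
    results = await asyncio.gather(*(verify_one(d) for d in differential))
    return [r.diagnosis for r in sorted(results, key=lambda r: -r.grounding_score)]

reranked = asyncio.run(rerank_by_grounding(["dx_a", "dx_b", "dx_c"]))
print(reranked)  # ['dx_b', 'dx_c', 'dx_a']
```

Because the per-diagnosis calls are independent, `asyncio.gather` keeps H1's wall-clock cost close to a single verification call.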
441
- #### Step 2C.2: Run Track H
442
-
443
- ```powershell
444
- python -m tracks.verification.run_verification --max-cases 3
445
- python -m tracks.verification.run_verification
446
- ```
447
-
448
- #### Step 2C.3: Identify H*
449
-
450
- ```
451
- H* = ____________
452
- H* top-1 accuracy = _____%
453
- H* improvement over baseline = +_____%
454
- ```
455
-
456
- ---
457
-
458
- ### Phase 2D — Phase 2 Cross-Comparison
459
-
460
- After F, G, H are done:
461
-
462
- ```powershell
463
- python -m tracks.shared.compare --dataset medqa
464
- ```
465
-
466
- Update the shared compare.py to include tracks E/F/G/H before running (add entries to `TRACK_DIRS`).
467
-
468
- **Record Phase 2 summary:**
469
- ```
470
- F* = __________, accuracy = ____%, delta = +____%
471
- G* = __________, accuracy = ____%, delta = +____%, cost = _____×
472
- H* = __________, accuracy = ____%, delta = +____%
473
- ```
474
-
475
- **Rank all 6 axes by accuracy lift:**
476
- ```
477
- 1. Track ___ : +____% (cost: ___×)
478
- 2. Track ___ : +____% (cost: ___×)
479
- 3. Track ___ : +____% (cost: ___×)
480
- 4. Track ___ : +____% (cost: ___×)
481
- 5. Track ___ : +____% (cost: ___×)
482
- 6. Track ___ : +____% (cost: ___×)
483
- ```
484
-
485
- ---
486
-
487
- ## Phase 3 — Composition (Track E: Combined)
488
-
489
- **Goal:** Wire the per-axis winners together and test whether gains are additive.
490
- **Only start this after Phase 1 and Phase 2 data is in hand.**
491
-
492
- ### Step 3.1: Build Track E
493
-
494
- Create `src/backend/tracks/combined/`.
495
-
496
- **Files:**
497
- ```
498
- tracks/combined/
499
- __init__.py
500
- config.py # CombinedConfig: which B*/C*/D*/F*/G*/H* to compose
501
- pipeline.py # The composite pipeline that wires winners together
502
- run_combined.py
503
- results/
504
- ```
505
-
506
- **CombinedConfig should reference winner IDs from Phase 1 and 2:**
507
- ```python
508
- @dataclass
509
- class CombinedConfig:
510
- config_id: str
511
- rag_variant_id: Optional[str] # B* winner (or None = baseline retrieval)
512
- iterative_config_id: Optional[str] # C* winner (or None = no iteration)
513
- arbitrated_config_id: Optional[str] # D* winner (or None = single generalist)
514
- prompt_variant_id: Optional[str] # F* winner (or None = default prompt)
515
- voting_config_id: Optional[str] # G* winner (or None = single sample)
516
- verification_config_id: Optional[str] # H* winner (or None = no verification)
517
- composition_pattern: str # "E1", "E2", or "E3"
518
- description: str = ""
519
- ```
520
-
521
- ### Step 3.2: Implement 3 Composition Patterns
522
-
523
- **Pattern E1: Breadth-then-Depth** (recommended starting point)
524
- ```
525
- Parse
526
- → B* retriever (swap guideline retrieval)
527
- → F* prompt template (swap reasoning prompt)
528
- → D* specialists in parallel (each uses F* prompt)
529
- → D* arbiter merge → consensus differential
530
- → C* iterative refinement on consensus
531
- → H* evidence verification on refined output
532
- → G* voting: run the above N times and vote (if G* ≠ G0)
533
- → Drug Check + Conflict Detection
534
- → Synthesis
535
- ```
536
-
537
- **Pattern E2: Depth-within-Breadth**
538
- ```
539
- Parse
540
- → B* retriever
541
- → D* specialists, each with F* prompt, each running C* internal iteration
542
- → D* arbiter merge over refined specialist outputs
543
- → H* evidence verification
544
- → G* voting over the above
545
- → Drug Check + Conflict Detection
546
- → Synthesis
547
- ```
548
-
549
- **Pattern E3: Bookend (full loop)**
550
- ```
551
- Parse
552
- → B* retriever
553
- → D* specialists (round 1, F* prompt)
554
- → D* arbiter merge → rough consensus
555
- → C* iterative refinement on consensus
556
- → D* specialists again (round 2, with refined consensus as additional context)
557
- → D* arbiter re-merge → final differential
558
- → H* evidence verification
559
- → G* voting
560
- → Drug Check + Conflict Detection
561
- → Synthesis
562
- ```
563
-
564
- **Implementation guidance:**
565
- - Import existing track modules — do NOT duplicate code
566
- ```python
567
- from tracks.rag_variants.retriever import VariantRetriever
568
- from tracks.rag_variants.config import VARIANTS
569
- from tracks.iterative.refiner import IterativeRefiner
570
- from tracks.iterative.config import CONFIGS as ITERATIVE_CONFIGS
571
- from tracks.arbitrated.specialists import run_specialists_parallel
572
- from tracks.arbitrated.arbiter import Arbiter
573
- from tracks.arbitrated.config import CONFIGS as ARBITRATED_CONFIGS
574
- ```
575
- - The orchestrator's tools are swappable: `orchestrator.guideline_retrieval = variant_retriever`
576
- - Use a single `CostLedger` that spans ALL stages so the total cost is tracked
577
-
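The swappable-tool and shared-ledger guidance above can be sketched as follows. Everything here is illustrative: this toy `Orchestrator`, `CostLedger`, and the two retrieval functions only mimic the shape of the real classes in `app/` and `tracks/`.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CostLedger:
    """Single ledger shared across all composed stages (illustrative)."""
    entries: List[float] = field(default_factory=list)

    def charge(self, usd: float) -> None:
        self.entries.append(usd)

    @property
    def total(self) -> float:
        return sum(self.entries)

def baseline_retrieval(query: str, ledger: CostLedger) -> str:
    ledger.charge(0.001)
    return f"baseline docs for: {query}"

def variant_retrieval(query: str, ledger: CostLedger) -> str:
    # Stand-in for a B* winner (e.g. a different chunking/embedding retriever).
    ledger.charge(0.002)
    return f"B* docs for: {query}"

@dataclass
class Orchestrator:
    """Toy orchestrator whose retrieval tool is a plain attribute, so a composed
    pipeline can swap it without modifying baseline (Track A) code."""
    guideline_retrieval: Callable[[str, CostLedger], str] = baseline_retrieval

    def run(self, case: str, ledger: CostLedger) -> str:
        return self.guideline_retrieval(case, ledger)

ledger = CostLedger()
orch = Orchestrator()
orch.guideline_retrieval = variant_retrieval  # the swap described above
out = orch.run("chest pain", ledger)
```

Passing one ledger through every stage is what makes the per-pattern cost numbers in Step 3.4 comparable.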
578
- ### Step 3.3: Run Compositions
579
-
580
- ```powershell
581
- # Start with E1 (simplest)
582
- python -m tracks.combined.run_combined --pattern E1 --max-cases 3 # smoke
583
- python -m tracks.combined.run_combined --pattern E1 # full 50 cases
584
-
585
- # Then E2 and E3 if E1 shows promise
586
- python -m tracks.combined.run_combined --pattern E2 --max-cases 10
587
- python -m tracks.combined.run_combined --pattern E3 --max-cases 10
588
- ```
589
-
590
- ### Step 3.4: Evaluate Composition
591
-
592
- **Record:**
593
- ```
594
- E1 top-1 accuracy = _____% | cost/case = $_____ | runtime/case = ___s
595
- E2 top-1 accuracy = _____% | cost/case = $_____ | runtime/case = ___s
596
- E3 top-1 accuracy = _____% | cost/case = $_____ | runtime/case = ___s
597
-
598
- Best single track: Track ___ at ____%
599
- Best composition: Pattern ___ at ____%
600
- Composition lift vs best single track: +____%
601
- ```
602
-
603
- **Key questions to answer:**
604
- 1. Are the gains from B/C/D/F/G/H additive when composed? (If E1 ≈ best single track, they're not.)
605
- 2. Which pattern gives the best accuracy/cost ratio?
606
- 3. Is there a simpler 2-axis composition (e.g., B+C only) that gets 80% of the E1 benefit at 30% of the cost?
607
-
608
- ### Step 3.5: Test Partial Compositions
609
-
610
- Based on the Phase 1+2 ranking, test 2-axis combos of the top 3 axes:
611
-
612
- ```
613
- E_BC: B* + C* only (better retrieval + iteration)
614
- E_BD: B* + D* only (better retrieval + specialists)
615
- E_BF: B* + F* only (better retrieval + prompt architecture)
616
- E_CD: C* + D* only (iteration + specialists)
617
- E_BH: B* + H* only (better retrieval + verification)
618
- ```
619
-
620
- This tells us which pairs compose well and which interfere. Run each at 50 cases.
621
-
622
- **Record pair interaction matrix:**
623
- ```
624
- B* C* D* F* G* H*
625
- B* - ____% ____% ____% ____% ____%
626
- C* - ____% ____% ____% ____%
627
- D* - ____% ____% ____%
628
- F* - ____% ____%
629
- G* - ____%
630
- H* -
631
- ```
632
- (Each cell = top-1 accuracy of that 2-axis composition)
633
-
634
- ---
635
-
636
- ## Phase 4 — Cherry-Pick and Finalize
637
-
638
- **Goal:** Take the best composition from Phase 3 and apply any remaining optimizations.
639
-
640
- ### Step 4.1: Lock the Winner
641
-
642
- Based on Phase 3 data, select the final pipeline configuration:
643
-
644
- ```
645
- FINAL CONFIG:
646
- Retrieval: ____________ (B variant or baseline)
647
- Prompt: ____________ (F variant or baseline)
648
- Reasoning: ____________ (D config, or single generalist)
649
- Iteration: ____________ (C config, or none)
650
- Verification: ____________ (H config, or none)
651
- Voting: ____________ (G config, or single sample)
652
- Composition: ____________ (E pattern)
653
- Top-1 accuracy: ____%
654
- Cost per case: $____
655
- Runtime per case: ____s
656
- ```
657
-
658
- ### Step 4.2: 100-Case Validation
659
-
660
- Run the final config against an expanded dataset to confirm the result isn't a fluke:
661
-
662
- ```powershell
663
- # If possible, run 100 MedQA cases (load more from the JSONL)
664
- python -m tracks.combined.run_combined --pattern <winner> --max-cases 100
665
- ```
666
-
667
- **If 100-case accuracy is within ±3% of 50-case accuracy:** The result is stable.
668
- **If it drops by >5%:** We overfit to the 50-case sample. Re-evaluate.
669
-
670
- ### Step 4.3: Run Complementary Benchmarks
671
-
672
- Run the winner through MTSamples and PMC harnesses (if available) to show generalization:
673
-
674
- ```powershell
675
- # These may need adaptation to work with the combined pipeline
676
- python -m validation.run_validation --mtsamples --max-cases 20
677
- python -m validation.run_validation --pmc --max-cases 10
678
- ```
679
-
680
- ### Step 4.4: Update Submission Materials
681
-
682
- 1. **Update `docs/kaggle_writeup.md`** with final accuracy numbers, the winning configuration,
683
- and the experimental journey (which axes mattered, which didn't, composition effects).
684
-
685
- 2. **Update `docs/video_script.md`** if the demo pipeline changed significantly (e.g., if the
686
- best config uses specialists, the video should show the specialist pipeline).
687
-
688
- 3. **Update `docs/architecture.md`** with the final pipeline diagram.
689
-
690
- 4. **Push to GitHub:**
691
- ```powershell
692
- git add -A
693
- git commit -m "Phase 4: Final pipeline configuration - XX% top-1 accuracy"
694
- git push
695
- ```
696
-
697
- ### Step 4.5: Record Demo Video
698
-
699
- Follow `docs/video_script.md` with the FINAL pipeline configuration running live.
700
-
701
- ### Step 4.6: Submit on Kaggle
702
-
703
- Follow `docs/kaggle_writeup.md` submission steps. Include:
704
- - Final writeup with experimental results
705
- - Video link
706
- - GitHub repo link
707
- - (Optional) Live demo URL if deployed
708
-
709
- ---
710
-
711
- ## Decision Log
712
-
713
- Use this section to record key decisions as you execute the plan.
714
-
715
- ### Phase 1 Results
716
- ```
717
- Date: ___________
718
-
719
- B* = ___________ accuracy: ____% delta: +____% latency: ____ms
720
- C* = ___________ accuracy: ____% delta: +____% avg_iters: ____
721
- D* = ___________ accuracy: ____% delta: +____% cost/case: $____
722
-
723
- Best single axis: Track ___
724
- Notes:
725
- ```
726
-
727
- ### Phase 2 Results
728
- ```
729
- Date: ___________
730
-
731
- F* = ___________ accuracy: ____% delta: +____%
732
- G* = ___________ accuracy: ____% delta: +____% cost: ____×
733
- H* = ___________ accuracy: ____% delta: +____%
734
-
735
- Ranked axes (by lift):
736
- 1. ___ 2. ___ 3. ___ 4. ___ 5. ___ 6. ___
737
-
738
- Notes:
739
- ```
740
-
741
- ### Phase 3 Results
742
- ```
743
- Date: ___________
744
-
745
- E1 accuracy: ____% cost/case: $____
746
- E2 accuracy: ____% cost/case: $____
747
- E3 accuracy: ____% cost/case: $____
748
-
749
- Best pair: ___ + ___ accuracy: ____%
750
- Best triple: ___ + ___ + ___ accuracy: ____%
751
-
752
- Notes:
753
- ```
754
-
755
- ### Phase 4 Final
756
- ```
757
- Date: ___________
758
-
759
- Final config: ___________________________
760
- Final accuracy (50-case): ____%
761
- Final accuracy (100-case): ____%
762
- Cost per case: $____
763
- Runtime per case: ____s
764
-
765
- Submitted: [ ] Yes [ ] No
766
- Video recorded: [ ] Yes [ ] No
767
- ```
768
-
769
- ---
770
-
771
- ## Time Budget
772
-
773
- | Phase | Estimated Endpoint Hours | Estimated Wall Clock | Estimated Cost |
774
- |-------|-------------------------|---------------------|---------------|
775
- | Phase 1 (B+C+D) | 8–12 hrs | 1–2 days | $20–30 |
776
- | Phase 2 (F+G+H) | 6–10 hrs | 1–2 days | $15–25 |
777
- | Phase 3 (Compositions) | 4–8 hrs | 1 day | $10–20 |
778
- | Phase 4 (Finalize) | 2–3 hrs | 1 day | $5–8 |
779
- | **Total** | **20–33 hrs** | **4–7 days** | **$50–83** |
780
-
781
- **Deadline:** February 24, 2026, 11:59 PM UTC
782
- **Today:** February 15, 2026
783
- **Available:** ~9 days
784
-
785
- **Suggested schedule:**
786
- - Feb 15–16: Phase 1 (run overnight, collect in morning)
787
- - Feb 17–18: Phase 2 (build F/G/H, run overnight)
788
- - Feb 19–20: Phase 3 (compositions)
789
- - Feb 21–22: Phase 4 (finalize, video, writeup update)
790
- - Feb 23: Buffer day + final submission
791
- - Feb 24: Deadline
792
-
793
- ---
794
-
795
- ## Abort Conditions
796
-
797
- Stop and re-evaluate the strategy if:
798
-
799
- 1. **Endpoint costs exceed $100 total** — we're overspending for marginal gains
800
- 2. **All Phase 1 tracks show <2% lift** — the model, not the pipeline, is the bottleneck. Consider:
801
- - Switching to `medgemma-4b-it` for faster iteration on prompts
802
- - Focusing entirely on prompt architecture (Track F)
803
- - Reducing scope to best-effort with current accuracy + strong writeup
804
- 3. **Phase 3 compositions LOSE accuracy vs single tracks** — negative interaction effects. Simplify back to best single track.
805
- 4. **Consistent pipeline failures (>10% error rate)** — endpoint stability issue. Fix infrastructure before continuing experiments.
806
- 5. **February 22 reached without Phase 3 complete** — lock whatever is best so far and move directly to Phase 4 (finalize + submit). Do not risk missing the deadline for marginal gains.
README.md CHANGED
@@ -14,8 +14,8 @@ custom_domains:
  
  > An agentic clinical decision support application that orchestrates medical AI with specialized tools to assist clinicians in real time.
  
- **Origin:** [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge) (Kaggle / Google Research)
- **Focus:** Building a genuinely impactful medical application not just a competition entry.
+ **Live demo:** [demo.briansheppard.com](https://demo.briansheppard.com)
+ **Origin:** Built for the [MedGemma Impact Challenge](https://www.kaggle.com/competitions/med-gemma-impact-challenge) (Kaggle / Google Research).
  
  ---
  
@@ -156,73 +156,44 @@ Sources include ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, APA, AAP, ACR, ASH, K
  
  ```
  medgemma_impact_challenge/
- ├── README.md                            # This file
- ├── DEVELOPMENT_LOG.md                   # Chronological build history & decisions
- ├── SUBMISSION_GUIDE.md                  # Competition submission strategy
- ├── RULES_SUMMARY.md                     # Competition rules checklist
+ ├── README.md
+ ├── CLAUDE.md                            # AI assistant context
+ ├── DEVELOPMENT_LOG.md                   # Build history & decisions
  ├── docs/
- │   ├── architecture.md                  # System architecture & design decisions
- │   ├── test_results.md                  # Detailed test results & benchmarks
- │   ├── writeup_draft.md                 # Project writeup / summary
- │   └── deploy_medgemma_hf.md            # MedGemma HF Endpoint deployment guide
+ │   ├── architecture.md                  # System architecture & design
+ │   ├── test_results.md                  # Test results & benchmarks
+ │   └── deploy_medgemma_hf.md            # HF Endpoint deployment guide
  ├── src/
- │   ├── backend/                         # Python FastAPI backend
- │   │   ├── .env.template                # Environment config template
- │   │   ├── .env                         # Local config (not committed)
- │   │   ├── requirements.txt             # Python dependencies (28 packages)
- │   │   ├── test_e2e.py                  # End-to-end pipeline test
- │   │   ├── test_clinical_cases.py       # 22 clinical scenario test suite
- │   │   ├── test_rag_quality.py          # RAG retrieval quality tests (30 queries)
- │   │   ├── test_poll.py                 # Simple case poller utility
- │   │   ├── validation/                  # External dataset validation framework
- │   │   │   ├── base.py                  # Core framework (runners, scorers, utilities)
- │   │   │   ├── harness_medqa.py         # MedQA (USMLE) diagnostic accuracy harness
- │   │   │   ├── harness_mtsamples.py     # MTSamples parse quality harness
- │   │   │   ├── harness_pmc.py           # PMC Case Reports diagnostic harness
- │   │   │   ├── run_validation.py        # Unified CLI runner
- │   │   │   ├── analyze_results.py       # Question-type categorization & analysis
- │   │   │   └── check_progress.py        # Checkpoint progress monitor
+ │   ├── backend/
+ │   │   ├── requirements.txt
+ │   │   ├── test_e2e.py                  # End-to-end pipeline test
+ │   │   ├── test_clinical_cases.py       # 22 clinical scenario test suite
+ │   │   ├── test_rag_quality.py          # RAG retrieval quality tests
+ │   │   ├── validation/                  # External dataset validation
+ │   │   │   ├── harness_medqa.py         # MedQA (USMLE) accuracy
+ │   │   │   ├── harness_mtsamples.py     # MTSamples parse quality
+ │   │   │   └── harness_pmc.py           # PMC Case Reports accuracy
+ │   │   ├── tracks/                      # Experimental pipeline variants
  │   │   └── app/
- │   │       ├── main.py                  # FastAPI entry (CORS, routers, lifespan)
- │   │       ├── config.py                # Pydantic Settings (ports, models, dirs)
- │   │       ├── __init__.py
- │   │       ├── models/
- │   │       │   └── schemas.py           # All Pydantic models (~280 lines)
- │   │       ├── agent/
- │   │       │   └── orchestrator.py      # 6-step pipeline orchestrator (~300 lines)
- │   │       ├── services/
- │   │       │   └── medgemma.py          # LLM service (OpenAI-compatible API)
+ │   │       ├── main.py                  # FastAPI entry point
+ │   │       ├── config.py                # Settings
+ │   │       ├── agent/orchestrator.py    # 6-step pipeline orchestrator
+ │   │       ├── services/medgemma.py     # LLM service (OpenAI-compatible)
+ │   │       ├── models/schemas.py        # Pydantic data models
  │   │       ├── tools/
- │   │       │   ├── patient_parser.py    # Step 1: Free-text → structured data
- │   │       │   ├── clinical_reasoning.py # Step 2: Differential diagnosis
- │   │       │   ├── drug_interactions.py # Step 3: OpenFDA + RxNorm
- │   │       │   ├── guideline_retrieval.py # Step 4: RAG over ChromaDB
- │   │       │   ├── conflict_detection.py # Step 5: Guideline vs patient conflicts
- │   │       │   └── synthesis.py         # Step 6: CDS report generation
+ │   │       │   ├── patient_parser.py    # Step 1: Free-text → structured data
+ │   │       │   ├── clinical_reasoning.py # Step 2: Differential diagnosis
+ │   │       │   ├── drug_interactions.py # Step 3: OpenFDA + RxNorm
+ │   │       │   ├── guideline_retrieval.py # Step 4: RAG over ChromaDB
+ │   │       │   ├── conflict_detection.py # Step 5: Guideline vs patient gaps
+ │   │       │   └── synthesis.py         # Step 6: CDS report generation
- │   │       ├── data/
- │   │       │   └── clinical_guidelines.json # 62 guidelines, 14 specialties
- │   │       └── api/
- │   │           ├── health.py            # GET /api/health
- │   │           ├── cases.py             # POST /api/cases/submit, GET /api/cases/{id}
- │   │           └── ws.py                # WebSocket /ws/agent
- │   └── frontend/                        # Next.js 14 + React 18 + TypeScript
- │       ├── package.json
- │       ├── next.config.js               # API proxy → backend
- │       ├── tailwind.config.js
+ │   │       ├── data/clinical_guidelines.json # 62 guidelines, 14 specialties
+ │   │       └── api/                     # REST + WebSocket endpoints
+ │   └── frontend/                        # Next.js 14 + React 18 + TypeScript
  │       └── src/
- │           ├── app/
- │           │   ├── layout.tsx
- │           │   ├── page.tsx             # Main CDS interface
- │           │   └── globals.css
- │           ├── components/
- │           │   ├── PatientInput.tsx     # Patient case input + 3 sample cases
- │           │   ├── AgentPipeline.tsx    # Real-time step visualization
- │           │   └── CDSReport.tsx        # Final report renderer
- │           └── hooks/
- │               └── useAgentWebSocket.ts # WebSocket state management
- ├── notebooks/                           # Experiment notebooks
- ├── models/                              # Fine-tuned models (future)
- └── demo/                                # Video & demo assets
+ │           ├── components/              # PatientInput, AgentPipeline, CDSReport
+ │           └── hooks/                   # WebSocket state management
+ └── Dockerfile                           # HuggingFace Spaces deployment
  ```
  
  ---
@@ -344,20 +315,16 @@ curl -X POST http://localhost:8000/api/cases/submit \
  
  ---
  
- ## Documentation Index
+ ## Documentation
  
  | Document | Description |
  |----------|-------------|
- | [README.md](README.md) | This file — overview, setup, results |
  | [docs/architecture.md](docs/architecture.md) | System architecture, pipeline design, design decisions |
  | [docs/test_results.md](docs/test_results.md) | Detailed test results, RAG benchmarks, pipeline timing |
+ | [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HuggingFace Endpoint deployment guide |
  | [DEVELOPMENT_LOG.md](DEVELOPMENT_LOG.md) | Chronological build history, problems solved, decisions made |
- | [docs/writeup_draft.md](docs/writeup_draft.md) | Project writeup / summary |
- | [CONTRIBUTING.md](CONTRIBUTING.md) | How to contribute to the project |
+ | [CONTRIBUTING.md](CONTRIBUTING.md) | How to contribute |
  | [SECURITY.md](SECURITY.md) | Security policy and responsible disclosure |
- | [TODO.md](TODO.md) | Next-session action items and project state |
- | [SUBMISSION_GUIDE.md](SUBMISSION_GUIDE.md) | Competition submission strategy |
- | [docs/deploy_medgemma_hf.md](docs/deploy_medgemma_hf.md) | MedGemma HuggingFace Endpoint deployment guide |
  
  ---
  
RULES_SUMMARY.md DELETED
@@ -1,113 +0,0 @@
- # Rules Summary & Compliance Checklist
-
- > Distilled from the full competition rules. When in doubt, refer to the [full rules](rules.txt).
-
- ---
-
- ## Eligibility
-
- - [x] Must have a registered Kaggle account
- - [x] Must be 18+ (or age of majority in your jurisdiction)
- - [x] Cannot be a resident of: Crimea, DNR, LNR, Cuba, Iran, Syria, or North Korea
- - [x] Cannot be under U.S. export controls or sanctions
- - [x] Google/Kaggle employees may participate but **cannot win prizes**
- - [x] Only **one Kaggle account** per person — no multi-accounting
-
- ---
-
- ## Team Rules
-
- | Rule | Detail |
- |------|--------|
- | Max team size | **5 members** |
- | Team mergers | Allowed before merger deadline |
- | Submissions per team | **1** (can be edited and re-submitted) |
- | Account requirement | Each member needs their own Kaggle account |
- | Must confirm membership | Respond to team notification message |
-
- ---
-
- ## Submission Rules
-
- - **One submission per team** — this single entry covers Main Track + one special award
- - Submission format: **Kaggle Writeup** attached to the competition page
- - Can un-submit, edit, and re-submit unlimited times before deadline
- - Must be received before **February 24, 2026 at 11:59 PM UTC**
-
- ### Private Resources Warning
- > If you attach a **private Kaggle Resource** to your public Writeup, it will **automatically become public** after the deadline.
-
- ---
-
- ## Data & External Resources
-
- | Rule | Detail |
- |------|--------|
- | Competition data | **None provided** |
- | External data | Allowed — must be publicly available & free for all participants |
- | HAI-DEF models | Subject to [HAI-DEF Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms) |
- | Proprietary datasets | Not allowed if cost exceeds "Reasonableness Standard" |
- | AutoML tools | Allowed if properly licensed |
- | Open source | Must use OSI-approved licenses |
-
- ---
-
- ## Code Sharing Rules
-
- | Type | Allowed? | Conditions |
- |------|----------|------------|
- | **Private sharing** (between teams) | **NO** | Grounds for disqualification |
- | **Private sharing** (within team) | Yes | — |
- | **Public sharing** | Yes | Must be shared on Kaggle (forums/notebooks) for all participants |
-
- ---
-
- ## Winner Obligations
-
- If you win, you must:
-
- 1. **Deliver final code** — training code, inference code, environment description
- 2. **Grant CC BY 4.0 license** on your winning submission
- 3. **Sign prize acceptance documents** within 2 weeks of notification
- 4. **Complete tax forms** (W-9 for US, W-8BEN for foreign residents)
- 5. **Respond to winner notification** within 1 week
-
- > If using commercially available software you don't own, you must identify it and explain how to procure it.
- > If input data/pretrained models have incompatible licenses, you don't need to grant open source license for those.
-
- ---
-
- ## Prize Distribution
-
- - Monetary prizes split **evenly** among eligible team members (unless team unanimously agrees to different split)
- - **All taxes are the winner's responsibility**
- - Prizes awarded ~30 days after acceptance documents received
- - Prizes **cannot be transferred or assigned**
-
- ---
-
- ## Disqualification Risks
-
- You can be disqualified for:
- - Using multiple Kaggle accounts
- - Private code sharing outside your team
- - Cheating, deception, or unfair practices
- - Threatening or harassing other participants
- - Not meeting submission requirements
- - Providing false personal information
- - Using non-publicly-available external data
-
- ---
-
- ## Governing Law
-
- - California law applies
- - Disputes litigated in Santa Clara County, California, USA
-
- ---
-
- ## Key Contacts
-
- - **Competition Sponsor:** Google Research — 1600 Amphitheatre Parkway, Mountain View, CA 94043
- - **Platform:** Kaggle Inc.
- - **Support:** www.kaggle.com/contact
SUBMISSION_GUIDE.md DELETED
@@ -1,152 +0,0 @@
- # Submission & Strategy Guide
-
- ## Timeline at a Glance
-
- ```
- Jan 13 ─────────────────────── Feb 24 ──────────── Mar 17-24
- START                          DEADLINE 11:59 PM UTC  RESULTS
-        ◄────── Build & Iterate ──────►
- ```
-
- **⏰ Days remaining as of Feb 15, 2026: ~9 days**
-
- ---
-
- ## Winning Strategy by Track
-
- ### Main Track ($75K)
- Focus on **Execution & Communication (30%)** — this is the highest-weighted criterion. A polished video, clean write-up, and well-organized code can make the difference.
-
- **Priority order:**
- 1. **Execution & Communication (30%)** — Polish everything
- 2. **Effective Use of HAI-DEF (20%)** — Show the models are essential, not bolted on
- 3. **Product Feasibility (20%)** — Prove it can work in production
- 4. **Problem Domain (15%)** — Tell a compelling story about who benefits
- 5. **Impact Potential (15%)** — Quantify the impact with clear estimates
-
- ### Agentic Workflow Prize ($10K)
- - Deploy HAI-DEF models as **intelligent agents** or **callable tools**
- - Demonstrate a **significant overhaul** of a challenging process
- - Show improved efficiency and outcomes via agentic AI
-
- ### Novel Task Prize ($10K)
- - **Fine-tune** a HAI-DEF model for a task it wasn't originally designed for
- - The more creative and useful the adaptation, the better
- - Document fine-tuning methodology thoroughly
-
- ### Edge AI Prize ($5K)
- - Run a HAI-DEF model on **local/edge hardware** (phone, scanner, etc.)
- - Focus on model optimization: quantization, distillation, pruning
- - Demonstrate real-world field deployment scenarios
-
- ---
-
- ## Submission Checklist
-
- ### Required Deliverables
- - [ ] **Kaggle Writeup** — 3 pages or less, following the template
- - [ ] **Video demo** — 3 minutes or less
- - [ ] **Public code repository** — linked in writeup
- - [ ] Uses **at least one HAI-DEF model** (e.g., MedGemma)
- - [ ] Code is **reproducible**
-
- ### Bonus Deliverables
- - [ ] Public interactive live demo app
- - [ ] Open-weight Hugging Face model tracing to HAI-DEF
-
- ### Write-up Quality
- - [ ] Clear project name
- - [ ] Team members with specialties and roles listed
- - [ ] Problem statement addresses "Problem Domain" and "Impact Potential" criteria
- - [ ] Overall solution addresses "Effective Use of HAI-DEF Models" criterion
- - [ ] Technical details address "Product Feasibility" criterion
- - [ ] All links (video, code, demo) are working and accessible
-
- ### Video Quality
- - [ ] 3 minutes or less
- - [ ] Demonstrates the application in action
- - [ ] Explains the problem and solution clearly
- - [ ] Shows HAI-DEF model integration
- - [ ] Professional quality (clear audio, good visuals)
-
- ### Code Quality
- - [ ] Well-organized repository structure
- - [ ] Clear README with setup instructions
- - [ ] Code is commented and readable
- - [ ] Dependencies are documented (requirements.txt / environment.yml)
- - [ ] Results are reproducible from the repository
-
- ---
-
- ## Video Tips (30% of score rides on execution)
-
- 1. **Open with the problem** (30 sec) — Who suffers? What's broken?
- 2. **Show the solution** (90 sec) — Live demo, not just slides
- 3. **Explain the tech** (30 sec) — Which HAI-DEF model, how it's used
- 4. **Quantify impact** (15 sec) — Numbers, estimates, or projections
- 5. **Close strong** (15 sec) — Vision for the future
-
- ---
-
- ## Technical Approach Suggestions
-
- ### Application Ideas Aligned to Criteria
-
- | Idea | Models | Special Award Fit |
- |------|--------|-------------------|
- | Clinical note summarizer with agent routing | MedGemma | Agentic Workflow |
- | Radiology triage assistant | MedGemma (vision) | Main Track |
- | Dermatology screening on mobile | MedGemma (quantized) | Edge AI |
- | Pathology slide analysis for rare diseases | MedGemma (fine-tuned) | Novel Task |
- | Patient education chatbot | MedGemma | Main Track |
- | Lab result interpreter agent pipeline | MedGemma + tools | Agentic Workflow |
- | Wound assessment via phone camera | MedGemma (vision, edge) | Edge AI |
-
- ### Key Technical Considerations
-
- 1. **Model Selection** — Choose the right HAI-DEF model variant for your task
- 2. **Fine-tuning** — Document methodology, hyperparameters, dataset curation
- 3. **Evaluation** — Include performance metrics and analysis
- 4. **Deployment** — Describe your app stack and how it would scale
- 5. **Privacy** — Healthcare data is sensitive; address HIPAA/privacy considerations
- 6. **External Data** — Must be publicly available and equally accessible to all participants
-
- ---
-
- ## External Data & Tools Rules
-
- - External data is allowed but must be **publicly available at no cost** to all participants
- - Use of HAI-DEF/MedGemma is subject to [HAI-DEF Terms of Use](https://developers.google.com/health-ai-developer-foundations/terms)
- - Open source code must use an **OSI-approved license**
- - AutoML tools are permitted if properly licensed
- - **No private code sharing** outside your team during the competition
- - Public code sharing must be done on Kaggle forums/notebooks
-
- ---
-
- ## Draft Writeup Workspace
-
- Use `docs/writeup_draft.md` to iterate on your writeup before submitting on Kaggle:
-
- ```markdown
- ### Project name
- [TODO]
-
- ### Your team
- [TODO: Name, specialty, role for each member]
-
- ### Problem statement
- [TODO: Define the problem, who's affected, magnitude, why AI is the right solution]
- [TODO: Articulate impact — what changes if this works? How did you estimate impact?]
-
- ### Overall solution
- [TODO: Which HAI-DEF model(s)? Why are they the right choice?]
- [TODO: How does the application use them to their fullest potential?]
-
- ### Technical details
- [TODO: Architecture diagram / description]
- [TODO: Fine-tuning details (if applicable)]
- [TODO: Performance metrics and analysis]
- [TODO: Deployment stack and challenges]
- [TODO: How this works in practice, not just benchmarks]
- ```
TODO.md DELETED
@@ -1,141 +0,0 @@
1
- # TODO — Next Session Action Items
2
-
3
- > **Last updated:** Feb 15, 2026 — Experimental track system built.
4
- > **Read this first** if you're a new AI instance picking up this project.
5
- > **See also:** `CLAUDE.md` (project intelligence) and `TRACKS.md` (track registry).
6
-
7
- ---
8
-
9
- ## High Priority (Do Next)
10
-
11
- ### 1. Run Experimental Tracks
12
-
13
- Three experimental tracks are built and ready to test. See `TRACKS.md` for full details.
-
- **Track B — RAG Variants** (`src/backend/tracks/rag_variants/`)
- ```bash
- cd src/backend
- python -m tracks.rag_variants.run_variants --max-cases 10  # smoke test
- python -m tracks.rag_variants.run_variants                 # full sweep
- ```
- Tests 10 configurations: chunking strategies (none, fixed-256, fixed-512, sentence, overlap), embedding models (MiniLM-L6, MiniLM-L12, MPNet, MedCPT), top-k sweep (3, 5, 10), and reranking.
-
- **Track C — Iterative Refinement** (`src/backend/tracks/iterative/`)
- ```bash
- python -m tracks.iterative.run_iterative --max-cases 10
- python -m tracks.iterative.run_iterative
- ```
- Tests 4 configurations: 2-round, 3-round, 5-round, and aggressive-critic. Produces cost/benefit data per iteration.
-
- **Track D — Arbitrated Parallel** (`src/backend/tracks/arbitrated/`)
- ```bash
- python -m tracks.arbitrated.run_arbitrated --max-cases 10
- python -m tracks.arbitrated.run_arbitrated
- ```
- Tests 4 configurations: 3-specialist/1-round, 5-specialist/1-round, 3-specialist/2-round, 5-specialist/2-round. Specialists: Cardiologist, Neurologist, ID, General IM, Emergency Medicine.
-
- **Prerequisites:**
- - Resume HF Endpoint (`medgemma-27b-cds`) — allow 5–15 min cold start (~$2.50/hr)
- - Activate venv: `src/backend/venv/`
- - May need: `pip install sentence-transformers` for MedCPT/MPNet/reranking variants
-
- ### 2. Record the Demo Video
-
- The video script is ready (`docs/video_script.md`); the recording itself is still to do:
- 1. Resume HF Endpoint
- 2. Start backend + frontend locally
- 3. Record ~3 min screencast following the script
- 4. Upload to YouTube/Loom and get the link
-
- ### 3. Submit on Kaggle
-
- Kaggle writeup content is ready: `docs/kaggle_writeup.md`. Steps:
- 1. Go to competition page → "New Writeup"
- 2. Paste writeup content (fill in team name/member info first)
- 3. Select tracks: Main Track + Agentic Workflow Prize
- 4. Add links: video URL, GitHub repo, (optional) live demo
- 5. Click Submit
- 6. **Fill in [Your Name] placeholder** in the team table
-
- ---
-
- ## Medium Priority
-
- ### 4. CI Gating on Validation Scores
-
- Add a GitHub Action or pre-commit check that runs a small validation suite (e.g., 5 MedQA cases) and fails if top-1 accuracy drops below a threshold. This prevents regressions.
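The gate itself can be a few lines of Python the CI step calls after the suite runs. A minimal sketch — the function and result-field names here are hypothetical and would need to be adapted to whatever `run_validation.py` actually emits:

```python
# Sketch of a CI accuracy gate (field names hypothetical — adapt to the
# validation runner's real JSON output). The CI job runs the 5-case suite,
# loads its results, and fails the build when passes_gate() returns False.
def top1_accuracy(results: list[dict]) -> float:
    """Fraction of cases whose top-1 prediction matched the ground truth."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("top1_correct")) / len(results)


def passes_gate(results: list[dict], threshold: float = 0.30) -> bool:
    """True if the small validation suite meets the accuracy threshold."""
    return top1_accuracy(results) >= threshold
```

A wrapper script would `sys.exit(1)` when the gate fails, which is what makes the GitHub Action step (or pre-commit hook) report a failure.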
-
- ### 5. PMC Harness Improvements
-
- The PMC case fetcher currently gets ~5 cases per run. The limiting factor is title-based diagnosis extraction — many PubMed case report titles don't follow parseable patterns. Options:
- - Use the full-text XML API (not just abstracts) to extract "final diagnosis" from structured sections
- - Add more title regex patterns
- - Use the LLM to extract the diagnosis from the abstract itself (meta, but effective)
-
- ### 6. Calibrated Uncertainty Indicators
-
- We deliberately removed numeric confidence scores (see Phase 8 in DEVELOPMENT_LOG.md). If revisiting uncertainty communication:
- - Consider evidence-strength indicators per recommendation instead of a single composite score
- - Look at conformal prediction or test-time compute approaches if fine-tuning
- - Do NOT add back uncalibrated float scores — the anchoring bias risk is real
-
- ---
-
- ## Low Priority / Future
-
- ### 7. Model Optimization
-
- Currently using `google/medgemma-27b-text-it` on 1× A100 80 GB. Options:
- - Smaller/quantized models for latency reduction (medgemma-4b-it for lighter steps)
- - Specialized models for individual pipeline steps (e.g., a parse-only model)
- - Batch inference optimizations
-
- ### 8. EHR Integration Prototype
-
- Current input is manual text paste. A FHIR client could auto-populate patient data. This is a significant scope expansion but would dramatically increase real-world usability.
-
- ### 9. Frontend Polish
-
- - Loading skeletons during pipeline execution
- - Dark mode
- - Export report as PDF
- - Mobile-responsive layout
-
- ---
-
- ## Project State Summary
-
- | Component | Status | Notes |
- |-----------|--------|-------|
- | Backend (6-step pipeline) | ✅ Complete | All steps working, conflict detection added |
- | Frontend (Next.js) | ✅ Complete | Real-time pipeline viz, CDS report with conflicts |
- | RAG (62 guidelines) | ✅ Complete | 30/30 quality test, 100% top-1 accuracy |
- | Conflict Detection | ✅ Complete | Integrated into pipeline, frontend, and docs |
- | MedGemma HF Endpoint | ✅ Deployed | `medgemma-27b-cds`, 1× A100 80 GB, scale-to-zero, **currently paused** |
- | MedQA Validation (50 cases) | ✅ Complete | 36% top-1, 38% mentioned, 94% pipeline success |
- | Validation Framework | ✅ Complete | MedQA done; MTSamples + PMC harnesses built but not yet run at scale |
- | **Track System** | ✅ **Scaffolded** | **4 tracks (A/B/C/D), shared utils, all runners built — needs experimentation** |
- | Track B — RAG Variants | ✅ Built | 10 variants (chunking × embedding × rerank), ready to run |
- | Track C — Iterative Refinement | ✅ Built | 4 configs (2/3/5-round + aggressive), ready to run |
- | Track D — Arbitrated Parallel | ✅ Built | 4 configs (3/5 specialists × 1/2 rounds), ready to run |
- | Documentation (8+ files) | ✅ Audited | All docs updated and cross-checked |
- | test_e2e.py | ✅ Fixed | Now asserts 6 steps + conflict_detection |
- | GitHub | ✅ Pushed | `bshepp/clinical-decision-support-agent` (master) |
- | Kaggle Writeup | ✅ Draft ready | `docs/kaggle_writeup.md` — paste into Kaggle |
- | Video Script | ✅ Ready | `docs/video_script.md` — 3 min narration |
- | Demo Video | ⬜ Not started | Required for submission |
-
- **Key files:**
- - Backend entry: `src/backend/app/main.py`
- - Orchestrator: `src/backend/app/agent/orchestrator.py`
- - MedGemma service: `src/backend/app/services/medgemma.py`
- - Validation CLI: `src/backend/validation/run_validation.py`
- - **Track registry: `TRACKS.md`**
- - **Project intelligence: `CLAUDE.md`**
- - HF Endpoint guide: `docs/deploy_medgemma_hf.md`
- - All docs: `README.md`, `docs/architecture.md`, `docs/test_results.md`, `docs/writeup_draft.md`, `DEVELOPMENT_LOG.md`
-
- **Infrastructure:**
- - HF Endpoint: `medgemma-27b-cds` at `https://lisvpf8if1yhgxn2.us-east-1.aws.endpoints.huggingface.cloud`
- - Dev ports: Backend = 8002 (not 8000 — zombie process issue), Frontend = 3000
- - Virtual env: `src/backend/venv/`

TRACKS.md DELETED
@@ -1,194 +0,0 @@
- # TRACKS.md — Experimental Track Registry
-
- > **Single source of truth** for all experimental tracks, their file ownership, tagging conventions, and isolation rules.
- > Referenced by [CLAUDE.md](CLAUDE.md). Read that file first for general project context.
-
- ---
-
- ## Why Tracks?
-
- The baseline pipeline (Track A) achieves 36% top-1 diagnostic accuracy on MedQA. To improve this, we are evaluating **multiple independent strategies** in parallel. Each strategy is an isolated "track" with its own code, configuration, and results — so we can compare them fairly without cross-contamination.
-
- ---
-
- ## Track Registry
-
- | ID | Name | Directory | Strategy |
- |----|------|-----------|----------|
- | **A** | Baseline | `src/backend/app/` | The production 6-step pipeline. No modifications for experiments. |
- | **B** | RAG Variants | `src/backend/tracks/rag_variants/` | Test different chunking sizes, segment strategies, and embedding models to optimize guideline retrieval quality and downstream diagnostic accuracy. |
- | **C** | Iterative Refinement | `src/backend/tracks/iterative/` | Run the diagnosis step in a serial loop — each iteration critiques and refines the previous output. Continue until the marginal improvement drops below a cost/benefit threshold. Produces a convergence chart. |
- | **D** | Arbitrated Parallel | `src/backend/tracks/arbitrated/` | Run multiple specialist reasoning agents in parallel. An arbiter agent evaluates all outputs, tailors resubmission prompts for each specialist based on their strengths/weaknesses, and repeats until the cost/benefit ratio plateaus. Produces a cost/benefit chart. |
- | **E** | Combined | `src/backend/tracks/combined/` | Compose per-axis winners from B/C/D/F/G/H. Tests 3 composition patterns (breadth-then-depth, depth-within-breadth, bookend). **Phase 3 — build after Phase 1+2 data.** |
- | **F** | Prompt Architecture | `src/backend/tracks/prompt_arch/` | Test how reasoning prompt structure affects accuracy: structured template, few-shot, reverse reasoning, Bayesian framing. **Phase 2.** |
- | **G** | Multi-Sample Voting | `src/backend/tracks/voting/` | Self-consistency via repeated sampling + majority/weighted vote. 1/3/5 samples at varying temperatures. **Phase 2.** |
- | **H** | Evidence Verification | `src/backend/tracks/verification/` | Post-hoc grounding check: verify each diagnosis against patient evidence, re-rank by grounding score. **Phase 2.** |
- | **—** | Shared | `src/backend/tracks/shared/` | Cross-track utilities: cost tracking, comparison harness, chart generation. Not a track itself. |
-
- ---
-
- ## File Tagging Convention
-
- **Every file owned by a track MUST carry a track tag on line 1.** This makes ownership unambiguous when reading any file in isolation.
-
- ### Format by file type
-
- | File Type | Tag Format | Example |
- |-----------|-----------|---------|
- | Python (`.py`) | `# [Track X: Name]` | `# [Track B: RAG Variants]` |
- | JSON (`.json`) | First key in object | `{"_track": "Track B: RAG Variants", ...}` |
- | Markdown (`.md`) | HTML comment | `<!-- [Track B: RAG Variants] -->` |
- | Config (`.env`, `.yaml`) | Comment | `# [Track B: RAG Variants]` |
-
- ### Track A exception
-
- Track A files (`src/backend/app/`) were written before the track system existed. They are tagged with `# [Track A: Baseline]` on line 1, but their code is NOT modified for experimental purposes. Experiments extend or wrap Track A code from within their own track directory.
-
- ---
-
- ## Isolation Rules
-
- These rules prevent cross-contamination between experimental tracks:
-
- ### 1. File Ownership
-
- - Each file belongs to exactly **one track** (identified by its line-1 tag and directory).
- - Files in `src/backend/app/` belong to **Track A**.
- - Files in `src/backend/tracks/<dir>/` belong to the corresponding track.
- - Files in `src/backend/tracks/shared/` are shared utilities, not owned by any single track.
-
- ### 2. No Cross-Modification
-
- - **Never modify a Track A file to serve an experiment.** Instead, import and extend from your track's directory.
- - **Never modify a Track B file from Track C code**, and so forth.
- - If two tracks need the same utility, put it in `shared/`.
-
- ### 3. Import Direction
-
- ```
- Track B/C/D code → may import from → Track A (app/) and shared/
- Track A code → NEVER imports → Track B/C/D
- shared/ code → may import from → Track A (app/) only
- ```
-
- ### 4. Results Isolation
-
- - Each track stores results in `src/backend/tracks/<dir>/results/`.
- - Result filenames include the track ID prefix (e.g., `trackB_medqa_20260215.json`).
- - Cross-track comparison is done **only** via `src/backend/tracks/shared/compare.py`.
-
- ### 5. Configuration Isolation
-
- - Track-specific parameters live in each track's own config or constants — not in `app/config.py`.
- - The shared `app/config.py` provides only baseline/global settings (API keys, endpoints, etc.).
-
- ---
-
- ## Track Details
-
- ### Track A: Baseline
-
- **Purpose:** The production-ready pipeline. The control group for all experiments.
-
- **Pipeline:** Parse → Reason → Drug Check → Guideline Retrieval → Conflict Detection → Synthesis
-
- **Key parameters:**
- - Embedding: `all-MiniLM-L6-v2` (384 dims)
- - RAG top-k: 5
- - No guideline chunking (each guideline = 1 document)
- - Clinical reasoning temperature: 0.3
- - Synthesis temperature: 0.2
- - Single-pass reasoning (no iteration)
-
- **Baseline accuracy (50-case MedQA):** 36% top-1, 38% mentioned
-
- ---
-
- ### Track B: RAG Variants
-
- **Purpose:** Determine whether retrieval quality improvements translate to better diagnostic accuracy.
-
- **Experiments:**
- 1. **Chunking strategies** — Split each guideline into smaller segments (100-word chunks, 200-word chunks, sentence-level) with configurable overlap
- 2. **Embedding models** — Compare `all-MiniLM-L6-v2` (384d) vs `all-mpnet-base-v2` (768d) vs `bge-base-en-v1.5` (768d) vs `medcpt` (medical-specific)
- 3. **Top-k variation** — Test k=3, k=5, k=8, k=10 to find optimal retrieval breadth
- 4. **Re-ranking** — Add a cross-encoder re-ranking step after initial retrieval
-
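Experiment 1 (fixed-size chunks with overlap) can be sketched in a few lines. This is illustrative only — parameter defaults are examples, and the real implementation lives in `chunker.py`:

```python
# Fixed-size word-window chunking with overlap (a sketch of one Track B
# strategy; default sizes are illustrative, not the variant configs).
def chunk_words(text: str, size: int = 200, overlap: int = 25) -> list[str]:
    """Split text into windows of `size` words, each overlapping the previous by `overlap`."""
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - overlap, step)]
```

The overlap keeps a recommendation that straddles a chunk boundary retrievable from at least one chunk.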
- **Measured outcomes:**
- - RAG retrieval accuracy (30-query test suite)
- - MedQA diagnostic accuracy (same 50-case seed=42)
- - Retrieval latency per query
-
- **Key files:**
- - `src/backend/tracks/rag_variants/config.py` — Variant definitions
- - `src/backend/tracks/rag_variants/chunker.py` — Guideline chunking strategies
- - `src/backend/tracks/rag_variants/retriever.py` — Modified retrieval with configurable embedding/chunking
- - `src/backend/tracks/rag_variants/run_variants.py` — Runner that tests all configurations
- - `src/backend/tracks/rag_variants/results/` — Per-variant results
-
- ---
-
- ### Track C: Iterative Refinement
-
- **Purpose:** Determine whether repeated self-critique improves diagnostic accuracy, and find the point of diminishing returns.
-
- **Method:**
- 1. Run baseline clinical reasoning (iteration 0)
- 2. Feed the output back along with the patient data and a critique prompt
- 3. The model reviews its own differential, identifies weaknesses, and produces a refined version
- 4. Repeat until: (a) max iterations reached, or (b) the differential stops changing meaningfully
- 5. Track accuracy and LLM cost at each iteration to produce a convergence/cost-benefit chart
-
- **Measured outcomes:**
- - Accuracy at each iteration (top-1, top-3, mentioned)
- - LLM token cost at each iteration
- - Convergence curve: accuracy vs. cumulative cost
- - Iteration at which improvement drops below threshold
-
- **Key files:**
- - `src/backend/tracks/iterative/config.py` — Max iterations, convergence threshold
- - `src/backend/tracks/iterative/refiner.py` — Iterative reasoning loop with self-critique
- - `src/backend/tracks/iterative/run_iterative.py` — Runner with per-iteration scoring
- - `src/backend/tracks/iterative/results/` — Per-iteration results and charts
-
- ---
-
- ### Track D: Arbitrated Parallel
-
- **Purpose:** Determine whether multiple specialist agents, coordinated by an arbiter, outperform a single-pass generalist — and at what cost.
-
- **Method:**
- 1. Run N specialist reasoning agents **in parallel**, each with a domain-specific system prompt (e.g., cardiologist, neurologist, infectious disease specialist)
- 2. An **arbiter agent** receives all N specialist outputs plus the patient data
- 3. The arbiter evaluates each specialist's differential, identifies agreements and disagreements
- 4. The arbiter generates **tailored resubmission prompts** for each specialist — telling the cardiologist "the neurologist raised X, reconsider Y" and vice versa
- 5. Specialists run again with the arbiter's feedback
- 6. Repeat until: (a) consensus reached, (b) max rounds, or (c) cost/benefit drops below threshold
- 7. The arbiter produces the final merged differential
- 8. Track accuracy and cost at each round to produce a cost/benefit chart
-
- **Measured outcomes:**
- - Accuracy at each arbitration round (top-1, top-3, mentioned)
- - Per-specialist accuracy contribution
- - LLM token cost per round (N specialists + 1 arbiter)
- - Cost/benefit convergence chart
- - Consensus rate across rounds
-
- **Key files:**
- - `src/backend/tracks/arbitrated/config.py` — Specialist definitions, max rounds, threshold
- - `src/backend/tracks/arbitrated/specialists.py` — Domain-specific reasoning agents
- - `src/backend/tracks/arbitrated/arbiter.py` — Arbiter agent that evaluates and coordinates
- - `src/backend/tracks/arbitrated/run_arbitrated.py` — Runner with per-round scoring
- - `src/backend/tracks/arbitrated/results/` — Per-round results and charts
-
- ---
-
- ## Adding a New Track
-
- 1. Choose an unused letter ID (I, J, ... — E through H are already reserved in the registry).
- 2. Create `src/backend/tracks/<dir_name>/` with `__init__.py`.
- 3. Add the track to the **Track Registry** table above.
- 4. Tag every new file on line 1 with `# [Track X: Name]`.
- 5. Store results in `src/backend/tracks/<dir_name>/results/`.
- 6. Add a comparison entry in `src/backend/tracks/shared/compare.py`.
- 7. Never import from another track's directory — only from `app/` and `shared/`.

VALIDATION_PIPELINE_PLAN.md DELETED
@@ -1,1149 +0,0 @@
- # VALIDATION_PIPELINE_PLAN.md — Validation Pipeline Fix Plan
-
- > **Purpose:** Step-by-step implementation plan for fixing the validation/scoring
- > pipeline so accuracy metrics actually reflect the system's capabilities.
- >
- > **Root cause:** The pipeline forces every MedQA question through differential
- > diagnosis generation, but only 7/50 sampled questions are diagnostic. The other
- > 43 are treatment, mechanism, lab-finding, ethics, etc. — producing near-zero
- > accuracy on questions the pipeline was never designed to answer.
- >
- > **Expected impact:** Fixes P5+P3+P6 alone should raise measured MedQA accuracy
- > from ~36% to 60-70%+. Full implementation of all 7 fixes gives honest,
- > stratified metrics and unlocks multi-mode pipeline expansion.
- >
- > **Implementation order:** Bottom-up through the data flow. Each step locks down
- > its interface before the next layer builds on it. No rewrites needed.
-
- ---
-
- ## Step 1: P5 — Fix `fuzzy_match()` for Short Answers
-
- **File:** `src/backend/validation/base.py`
- **Functions:** `fuzzy_match()`, `normalize_text()`
- **Depends on:** Nothing
- **Depended on by:** P4 (type-aware scoring), P6 (MCQ selection comparison)
-
- ### Problem
-
- `fuzzy_match()` uses `min(len(c_tokens), len(t_tokens))` as the denominator for
- token overlap. For a 1-word target like "Clopidogrel", `min(1, 200) = 1`, so a
- single token match gives 100% overlap. But for a 3-word target like "Cross-linking
- of DNA", stop-word removal and normalization can reduce the target to 2 tokens,
- and if the candidate doesn't contain those specific tokens, it fails — even if
- the concept is present in different phrasing.
-
- The substring check (`normalize_text(target) in normalize_text(candidate)`) does
- handle the simple containment case — "clopidogrel" matches inside "clopidogrel
- 75mg daily". The real failure case is when the answer uses different phrasing
- than the pipeline output, e.g.:
- - Target: "Reassurance and continuous monitoring"
- - Pipeline says: "reassure the patient and monitor continuously"
- - Neither substring contains the other, and token overlap may be low
-
-
- ### Changes
-
- ```python
- # In base.py — replace fuzzy_match() entirely
-
- def normalize_text(text: str) -> str:
-     """Lowercase, strip punctuation, normalize whitespace."""
-     text = text.lower().strip()
-     text = re.sub(r'[^\w\s]', ' ', text)
-     text = re.sub(r'\s+', ' ', text)
-     return text.strip()
-
-
- # Medical stopwords that don't carry diagnostic meaning
- _MEDICAL_STOPWORDS = frozenset({
-     "the", "a", "an", "of", "in", "to", "and", "or", "is", "are", "was",
-     "were", "be", "been", "with", "for", "on", "at", "by", "from", "this",
-     "that", "these", "those", "it", "its", "has", "have", "had", "do",
-     "does", "did", "will", "would", "could", "should", "may", "might",
-     "most", "likely", "following", "which", "what", "patient", "patients",
- })
-
-
- def _content_tokens(text: str) -> set[str]:
-     """Extract meaningful content tokens, removing medical stopwords."""
-     tokens = set(normalize_text(text).split())
-     return tokens - _MEDICAL_STOPWORDS
-
-
- def fuzzy_match(candidate: str, target: str, threshold: float = 0.6) -> bool:
-     """
-     Check if candidate text is a fuzzy match for target.
-
-     Strategy (checked in order, first match wins):
-     1. Normalized substring containment (either direction)
-     2. All content tokens of target appear in candidate (recall=1.0)
-     3. Token overlap ratio >= threshold (using content tokens)
-
-     Args:
-         candidate: Text from the pipeline output (may be long)
-         target: Ground truth text (usually short)
-         threshold: Minimum token overlap ratio (0.0-1.0)
-     """
-     c_norm = normalize_text(candidate)
-     t_norm = normalize_text(target)
-
-     if not t_norm:
-         return False
-
-     # 1. Substring containment (either direction)
-     if t_norm in c_norm or c_norm in t_norm:
-         return True
-
-     # 2. All content tokens of target present in candidate
-     # This catches "clopidogrel" in a 500-word report
-     t_content = _content_tokens(target)
-     c_content = _content_tokens(candidate)
-
-     if t_content and t_content.issubset(c_content):
-         return True
-
-     # 3. Token overlap ratio
-     if not t_content or not c_content:
-         return False
-
-     overlap = len(t_content & c_content)
-     # Use target token count as denominator — "what fraction of
-     # the target's meaning is present in the candidate?"
-     recall = overlap / len(t_content)
-
-     return recall >= threshold
- ```
-
- ### Key interface change
-
- - **Signature stays the same:** `fuzzy_match(candidate, target, threshold) -> bool`
- - **Behavior change:** More permissive matching for short targets (all-token-subset check),
-   slightly different threshold semantics (recall-based instead of min-denominator-based).
-   This is strictly better — no downstream code breaks.
-
- ### Tests to write
-
- ```python
- # test_fuzzy_match.py
- def test_short_target_substring():
-     assert fuzzy_match("Start clopidogrel 75mg daily", "Clopidogrel") == True
-
- def test_short_target_all_tokens():
-     assert fuzzy_match("The diagnosis is cholesterol embolization syndrome", "Cholesterol embolization") == True
-
- def test_multi_word_phrasing_variation():
-     # "Reassurance and continuous monitoring" vs report text
-     assert fuzzy_match(
-         "reassure the patient and provide continuous cardiac monitoring",
-         "Reassurance and continuous monitoring",
-     ) == True  # passes via the overlap ratio: {continuous, monitoring} covers 2/3 of the target's content tokens (>= 0.6)
-
- def test_no_false_positive():
-     assert fuzzy_match("Acute myocardial infarction", "Pulmonary embolism") == False
-
- def test_empty_target():
-     assert fuzzy_match("some text", "") == False
- ```
-
- **Note:** The "reassurance" vs "reassure" pair above only matches because the
- other two content tokens push the overlap ratio to 2/3; a shorter target with
- one morphological variant would still fail. Add stemming as a future enhancement
- (e.g., via `nltk.stem.PorterStemmer` or a simple suffix-stripping function). For
- now, the all-token-subset check is the biggest improvement.
-
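A minimal suffix-stripping helper — illustrative only; a real implementation would use a proper stemmer such as nltk's `PorterStemmer`, and the suffix list here is an assumption, not a tested design:

```python
# Crude suffix-stripping sketch (illustrative only). Order matters: longer
# suffixes are tried first, and stems shorter than 4 characters are refused.
_SUFFIXES = ("ations", "ation", "ance", "ence", "ing", "ed", "es", "s", "e")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 4 characters."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word
```

Comparing `crude_stem()` outputs instead of raw tokens would let "reassurance" and "reassure" count as the same content token.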
- ### Validation
-
- Run existing test suite — no existing tests should break because matching is
- strictly more permissive. Verify on a few known failure cases from the 50-case
- run results.
-
- ---
-
- ## Step 2: P3 — Preserve the Question Stem
-
- **File:** `src/backend/validation/harness_medqa.py`
- **Functions:** `_extract_vignette()`, `fetch_medqa()`
- **Depends on:** Nothing (independent of P5, but listed second for logical flow)
- **Depended on by:** P1 (classifier needs the stem), P6 (MCQ step needs the stem + options)
-
- ### Problem
-
- `_extract_vignette()` strips the question stem ("Which of the following is the
- most likely diagnosis?") from the MedQA question. This means:
- 1. The pipeline doesn't know what's being asked — it always defaults to
-    "generate a differential"
- 2. The question classifier (P1) can't classify without the stem
- 3. The MCQ step (P6) can't present the original question
-
- ### Changes
-
- #### 2a. Refactor `_extract_vignette()` → `_split_question()`
-
- ```python
- # In harness_medqa.py — replace _extract_vignette()
-
- def _split_question(question: str) -> tuple[str, str]:
-     """
-     Split a USMLE question into (clinical_vignette, question_stem).
-
-     The vignette is the clinical narrative. The stem is the actual question
-     being asked ("Which of the following is the most likely diagnosis?").
-
-     Returns:
-         (vignette, stem) — stem may be empty if no recognizable stem found.
-         In that case, vignette contains the full question text.
-     """
-     stems = [
-         r"which of the following",
-         r"what is the most likely",
-         r"what is the best next step",
-         r"what is the most appropriate",
-         r"what is the diagnosis",
-         r"the most likely diagnosis is",
-         r"this patient most likely has",
-         r"what would be the next step",
-         r"what is the next best step",
-         r"what is the underlying",
-         r"what is the mechanism",
-         r"what is the pathophysiology",
-     ]
-
-     text = question.strip()
-     for stem_pattern in stems:
-         pattern = re.compile(
-             rf'(\.?\s*)([A-Z][^.]*{stem_pattern}[^.]*[\?\.]?\s*)$',
-             re.IGNORECASE,
-         )
-         match = pattern.search(text)
-         if match:
-             vignette = text[:match.start()].strip()
-             stem_text = match.group(2).strip()
-             if len(vignette) > 50:  # Sanity check
-                 return vignette, stem_text
-
-     # Fallback: no recognizable stem — return full text as vignette
-     return text, ""
- ```
-
- #### 2b. Update `fetch_medqa()` to store stem + vignette separately
-
- ```python
- # In fetch_medqa(), replace the case-building loop body:
-
- vignette, question_stem = _split_question(question)
-
- cases.append(ValidationCase(
-     case_id=f"medqa_{i:04d}",
-     source_dataset="medqa",
-     input_text=vignette,  # Pipeline still gets the vignette
-     ground_truth={
-         "correct_answer": answer_text,
-         "answer_idx": answer_idx,
-         "options": options,
-         "full_question": question,
-     },
-     metadata={
-         "question_stem": question_stem,        # NEW
-         "clinical_vignette": vignette,         # NEW (same as input_text, explicit)
-         "full_question_with_stem": question,   # NEW (redundant with ground_truth but cleaner access)
-     },
- ))
- ```
-
- ### Key interface change
-
- - `ValidationCase.metadata` now has 3 new keys: `question_stem`, `clinical_vignette`,
-   `full_question_with_stem`
- - `input_text` is still just the vignette (pipeline input unchanged)
- - `_extract_vignette()` is renamed to `_split_question()` returning a tuple
- - Old callers of `_extract_vignette()`: only `fetch_medqa()` — update in place
-
- ### Backward compatibility
-
- - `input_text` stays the same → pipeline behavior unchanged
- - `ground_truth` keeps all existing keys → scoring unchanged
- - New data is in `metadata` only → nothing breaks
-
- ---
-
- ## Step 3: P1 — Question-Type Classifier
-
- **New file:** `src/backend/validation/question_classifier.py`
- **Depends on:** P3 (needs `metadata["question_stem"]`)
- **Depended on by:** P4 (type-aware scoring), P6 (routing), P7 (stratified reporting)
-
- ### Design
-
- Two-tier classifier:
- 1. **Heuristic classifier** (fast, no LLM call, used by default) — regex on question stem
- 2. **LLM classifier** (optional, for ambiguous cases) — ask MedGemma to classify
-
- Start with heuristic only. It correctly classified our 50-case sample already
- (7 diagnostic, 6 treatment, 1 mechanism, 2 lab, 34 other — matching manual review).
-
- ### Question type enum
-
- ```python
- # In question_classifier.py
-
- from enum import Enum
-
- class QuestionType(str, Enum):
-     DIAGNOSTIC = "diagnostic"      # "most likely diagnosis/cause/explanation"
-     TREATMENT = "treatment"        # "most appropriate next step/management/treatment"
-     MECHANISM = "mechanism"        # "mechanism of action", "pathophysiology"
-     LAB_FINDING = "lab_finding"    # "expected finding", "characteristic on agar"
-     PHARMACOLOGY = "pharmacology"  # "drug that targets...", "receptor..."
-     EPIDEMIOLOGY = "epidemiology"  # "risk factor", "prevalence", "incidence"
-     ETHICS = "ethics"              # "most appropriate action" (ethical dilemmas)
-     ANATOMY = "anatomy"            # "structure most likely damaged"
-     OTHER = "other"                # Doesn't fit above categories
- ```
-
- ### Heuristic classifier
-
- ```python
- import re
- from typing import Optional
- from validation.base import ValidationCase
-
-
- # Pattern → QuestionType mapping (checked in order, first match wins)
- _STEM_PATTERNS: list[tuple[str, QuestionType]] = [
-     # Diagnostic
-     (r"most likely diagnosis", QuestionType.DIAGNOSTIC),
-     (r"most likely cause", QuestionType.DIAGNOSTIC),
-     (r"most likely explanation", QuestionType.DIAGNOSTIC),
-     (r"what is the diagnosis", QuestionType.DIAGNOSTIC),
-     (r"diagnosis is", QuestionType.DIAGNOSTIC),
-     (r"most likely condition", QuestionType.DIAGNOSTIC),
-     (r"most likely has", QuestionType.DIAGNOSTIC),
-     (r"most likely suffer", QuestionType.DIAGNOSTIC),
-
-     # Treatment / Management
-     (r"most appropriate (next step|management|treatment|intervention|therapy|pharmacotherapy)", QuestionType.TREATMENT),
-     (r"best (next step|initial step|management|treatment)", QuestionType.TREATMENT),
-     (r"most appropriate action", QuestionType.TREATMENT),  # Can be ethics — see below
-     (r"recommended (treatment|management|therapy)", QuestionType.TREATMENT),
-
-     # Mechanism
-     (r"mechanism of action", QuestionType.MECHANISM),
-     (r"pathophysiology", QuestionType.MECHANISM),
-     (r"mediator.*(responsible|involved)", QuestionType.MECHANISM),
-     (r"(inhibit|block|activate).*receptor", QuestionType.MECHANISM),
-     (r"cross-link", QuestionType.MECHANISM),
-
-     # Lab / Findings
-     (r"most likely finding", QuestionType.LAB_FINDING),
-     (r"expected (finding|result|value)", QuestionType.LAB_FINDING),
-     (r"characteristic (finding|feature|appearance)", QuestionType.LAB_FINDING),
-     (r"(agar|culture|stain|gram|biopsy).*show", QuestionType.LAB_FINDING),
-     (r"(laboratory|lab).*(result|finding|value)", QuestionType.LAB_FINDING),
-
-     # Pharmacology
-     (r"drug.*(target|mechanism|receptor|inhibit)", QuestionType.PHARMACOLOGY),
-     (r"(target|act on|bind).*(receptor|enzyme|channel)", QuestionType.PHARMACOLOGY),
-
-     # Epidemiology
-     (r"(risk factor|prevalence|incidence|odds ratio|relative risk)", QuestionType.EPIDEMIOLOGY),
-     (r"most (common|frequent).*(cause|risk|complication)", QuestionType.EPIDEMIOLOGY),
-
-     # Anatomy
-     (r"(structure|nerve|artery|vein|muscle|ligament).*(damaged|injured|affected|involved)", QuestionType.ANATOMY),
-
-     # Ethics (refine: "most appropriate action" in context of disclosure, consent, etc.)
-     (r"(tell|inform|disclose|report|consent|refuse|autonomy|confidentiality)", QuestionType.ETHICS),
- ]
-
-
- def classify_question(case: ValidationCase) -> QuestionType:
-     """
-     Classify a MedQA question by type using heuristics on the question stem.
-
-     Looks at metadata["question_stem"] first, falls back to ground_truth["full_question"].
-
-     Returns:
-         QuestionType enum value
-     """
-     stem = case.metadata.get("question_stem", "")
-     full_q = case.ground_truth.get("full_question", case.input_text)
-
-     # Classify on stem first (more specific), then full question
-     for text in [stem, full_q]:
-         text_lower = text.lower()
-         for pattern, qtype in _STEM_PATTERNS:
-             if re.search(pattern, text_lower):
-                 return qtype
-
-     return QuestionType.OTHER
-
-
- def classify_question_from_text(question_text: str) -> QuestionType:
-     """
-     Classify a raw question string (no ValidationCase needed).
-     Useful for ad-hoc classification.
-     """
-     text_lower = question_text.lower()
-     for pattern, qtype in _STEM_PATTERNS:
-         if re.search(pattern, text_lower):
-             return qtype
-     return QuestionType.OTHER
-
-
- # Convenience: which types are "pipeline-appropriate"?
- DIAGNOSTIC_TYPES = {QuestionType.DIAGNOSTIC}
- PIPELINE_APPROPRIATE_TYPES = {
-     QuestionType.DIAGNOSTIC,
-     QuestionType.TREATMENT,
-     QuestionType.LAB_FINDING,
- }
- ```
-
- ### Integration point
405
-
406
- In `fetch_medqa()`, after building each case, classify it:
407
-
408
- ```python
409
- from validation.question_classifier import classify_question
410
-
411
- # After creating the ValidationCase:
412
- case.metadata["question_type"] = classify_question(case).value
413
- ```
414
-
415
- ### Tests
416
-
417
- ```python
418
- def test_diagnostic_classification():
419
- case = make_case(question="...What is the most likely diagnosis?")
420
- assert classify_question(case) == QuestionType.DIAGNOSTIC
421
-
422
- def test_treatment_classification():
423
- case = make_case(question="...What is the most appropriate next step in management?")
424
- assert classify_question(case) == QuestionType.TREATMENT
425
-
426
- def test_mechanism_classification():
427
- case = make_case(question="...mechanism of action...")
428
- assert classify_question(case) == QuestionType.MECHANISM
429
-
430
- def test_ethics_override():
431
- # "most appropriate action" + disclosure keywords → ethics, not treatment
432
- case = make_case(question="...Tell the attending that he cannot fail to disclose this mistake. What is the most appropriate action?")
433
- assert classify_question(case) == QuestionType.ETHICS
434
- ```
435
-
436
- **Note on ethics override:** The pattern order matters. "most appropriate action"
437
- will match TREATMENT first. To handle ethics, we need the ethics patterns to check
438
- for disclosure/consent keywords in the *answer* or full question context. The
439
- current design checks patterns in order — put ethics keyword patterns before the
440
- generic "most appropriate action" treatment pattern, OR do a two-pass: first check
441
- for ethics keywords, then fall through to treatment.
442
-
443
- **Decision:** Use a two-pass approach. If the question contains ethics keywords
444
- AND a treatment-like stem, classify as ETHICS. Otherwise classify as TREATMENT.
445
- Implement this in `classify_question()` with a special-case check.
446
-
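The two-pass decision above can be sketched as follows (the function and pattern names are illustrative, not the final API — the real check would live inside `classify_question()`):

```python
import re

# Pass 1: ethics keywords + a treatment-like stem -> ETHICS.
# Pass 2: treatment-like stem alone -> TREATMENT.
_ETHICS_KEYWORDS = re.compile(
    r"(tell|inform|disclose|report|consent|refuse|autonomy|confidentiality)"
)
_TREATMENT_STEM = re.compile(r"most appropriate (action|next step)")

def classify_with_ethics_override(text: str) -> str:
    text_lower = text.lower()
    if _TREATMENT_STEM.search(text_lower):
        if _ETHICS_KEYWORDS.search(text_lower):
            return "ethics"  # ethics keywords win over the generic stem
        return "treatment"
    return "other"  # fall through to the normal pattern table
```

Running this check before the `_STEM_PATTERNS` loop means the generic "most appropriate action" treatment pattern never shadows an ethics question.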
447
- ---
448
-
449
- ## Step 4: P4 — Question-Type-Aware Scoring
450
-
451
- **File:** `src/backend/validation/base.py` (new function) + `src/backend/validation/harness_medqa.py` (refactor scoring block)
452
- **Depends on:** P5 (correct fuzzy_match), P1 (question_type in metadata)
453
- **Depended on by:** P7 (stratified reporting)
454
-
455
- ### Problem
456
-
457
- `diagnosis_in_differential()` always searches the same fields in the same order
458
- regardless of question type. Treatment answers get looked up in the differential
459
- (wrong place), and mechanism answers get looked up everywhere (unlikely to match).
460
-
461
- ### Design: `score_case()` dispatcher
462
-
463
- ```python
464
- # In base.py — new function alongside diagnosis_in_differential()
465
-
466
- def score_case(
467
- target_answer: str,
468
- report: CDSReport,
469
- question_type: str = "diagnostic",
470
- reasoning_result: Optional[ClinicalReasoningResult] = None,
471
- ) -> dict[str, float]:
472
- """
473
- Score a case based on its question type.
474
-
475
- Returns a dict of metric_name → score (0.0 or 1.0).
476
- Always includes: "matched", "match_location", "match_rank"
477
- Plus type-specific metrics.
478
- """
479
- qt = question_type.lower()
480
-
481
- if qt == "diagnostic":
482
- return _score_diagnostic(target_answer, report)
483
- elif qt == "treatment":
484
- return _score_treatment(target_answer, report)
485
- elif qt == "mechanism":
486
- return _score_mechanism(target_answer, report, reasoning_result)
487
- elif qt == "lab_finding":
488
- return _score_lab_finding(target_answer, report, reasoning_result)
489
- else:
490
- return _score_generic(target_answer, report, reasoning_result)
491
- ```
492
-
493
- ### Per-type scorers
494
-
495
- ```python
496
- def _score_diagnostic(target: str, report: CDSReport) -> dict:
497
- """Score a diagnostic question — primary field is differential_diagnosis."""
498
- found_top1, _, _ = diagnosis_in_differential(target, report, top_n=1)
499
- found_top3, _, _ = diagnosis_in_differential(target, report, top_n=3)
500
- found_any, ra, la = diagnosis_in_differential(target, report)
501
-
502
- return {
503
- "top1_accuracy": 1.0 if found_top1 else 0.0,
504
- "top3_accuracy": 1.0 if found_top3 else 0.0,
505
- "mentioned_accuracy": 1.0 if found_any else 0.0,
506
- "differential_accuracy": 1.0 if (found_any and la == "differential") else 0.0,
507
- "match_location": la,
508
- "match_rank": ra,
509
- }
510
-
511
-
512
- def _score_treatment(target: str, report: CDSReport) -> dict:
513
- """Score a treatment question — primary fields are next_steps + recommendations."""
514
- # Check suggested_next_steps first (most specific)
515
- for i, action in enumerate(report.suggested_next_steps):
516
- if fuzzy_match(action.action, target):
517
- return {
518
- "top1_accuracy": 1.0 if i == 0 else 0.0,
519
- "top3_accuracy": 1.0 if i < 3 else 0.0,
520
- "mentioned_accuracy": 1.0,
521
- "match_location": "next_steps",
522
- "match_rank": i,
523
- }
524
-
525
- # Check guideline_recommendations
526
- for i, rec in enumerate(report.guideline_recommendations):
527
- if fuzzy_match(rec, target):
528
- return {
529
- "top1_accuracy": 0.0, # Not in primary slot
530
- "top3_accuracy": 0.0,
531
- "mentioned_accuracy": 1.0,
532
- "match_location": "recommendations",
533
- "match_rank": i,
534
- }
535
-
536
- # Check differential reasoning text (treatment may appear in reasoning)
537
- for dx in report.differential_diagnosis:
538
- if fuzzy_match(dx.reasoning, target, threshold=0.3):
539
- return {
540
- "top1_accuracy": 0.0,
541
- "top3_accuracy": 0.0,
542
- "mentioned_accuracy": 1.0,
543
- "match_location": "reasoning_text",
544
- "match_rank": -1,
545
- }
546
-
547
- # Fulltext fallback
548
- full_text = _build_fulltext(report)
549
- if fuzzy_match(full_text, target, threshold=0.3):
550
- return {
551
- "top1_accuracy": 0.0,
552
- "top3_accuracy": 0.0,
553
- "mentioned_accuracy": 1.0,
554
- "match_location": "fulltext",
555
- "match_rank": -1,
556
- }
557
-
558
- return _not_found()
559
-
560
-
561
- def _score_mechanism(
562
- target: str, report: CDSReport,
563
- reasoning_result: Optional[ClinicalReasoningResult] = None,
564
- ) -> dict:
565
- """Score a mechanism question — primary field is reasoning_chain."""
566
- # Check reasoning chain from clinical reasoning step
567
- if reasoning_result and reasoning_result.reasoning_chain:
568
- if fuzzy_match(reasoning_result.reasoning_chain, target, threshold=0.3):
569
- return {
570
- "top1_accuracy": 0.0,
571
- "top3_accuracy": 0.0,
572
- "mentioned_accuracy": 1.0,
573
- "match_location": "reasoning_chain",
574
- "match_rank": -1,
575
- }
576
-
577
- # Check differential reasoning text
578
- for dx in report.differential_diagnosis:
579
- if fuzzy_match(dx.reasoning, target, threshold=0.3):
580
- return {
581
- "top1_accuracy": 0.0,
582
- "top3_accuracy": 0.0,
583
- "mentioned_accuracy": 1.0,
584
- "match_location": "differential_reasoning",
585
- "match_rank": -1,
586
- }
587
-
588
- # Fulltext fallback
589
- full_text = _build_fulltext(report)
590
- if fuzzy_match(full_text, target, threshold=0.3):
591
- return {
592
- "top1_accuracy": 0.0,
593
- "top3_accuracy": 0.0,
594
- "mentioned_accuracy": 1.0,
595
- "match_location": "fulltext",
596
- "match_rank": -1,
597
- }
598
-
599
- return _not_found()
600
-
601
-
602
- def _score_lab_finding(
603
- target: str, report: CDSReport,
604
- reasoning_result: Optional[ClinicalReasoningResult] = None,
605
- ) -> dict:
606
- """Score a lab/finding question — primary field is recommended_workup."""
607
- # Check recommended workup
608
- if reasoning_result:
609
- for i, action in enumerate(reasoning_result.recommended_workup):
610
- if fuzzy_match(action.action, target, threshold=0.4):
611
- return {
612
- "top1_accuracy": 1.0 if i == 0 else 0.0,
613
- "top3_accuracy": 1.0 if i < 3 else 0.0,
614
- "mentioned_accuracy": 1.0,
615
- "match_location": "recommended_workup",
616
- "match_rank": i,
617
- }
618
-
619
- # Check next steps in final report
620
- for i, action in enumerate(report.suggested_next_steps):
621
- if fuzzy_match(action.action, target, threshold=0.4):
622
- return {
623
- "top1_accuracy": 0.0,
624
- "top3_accuracy": 0.0,
625
- "mentioned_accuracy": 1.0,
626
- "match_location": "next_steps",
627
- "match_rank": i,
628
- }
629
-
630
- # Fulltext fallback
631
- full_text = _build_fulltext(report)
632
- if fuzzy_match(full_text, target, threshold=0.3):
633
- return {
634
- "top1_accuracy": 0.0,
635
- "top3_accuracy": 0.0,
636
- "mentioned_accuracy": 1.0,
637
- "match_location": "fulltext",
638
- "match_rank": -1,
639
- }
640
-
641
- return _not_found()
642
-
643
-
644
- def _score_generic(
645
- target: str, report: CDSReport,
646
- reasoning_result: Optional[ClinicalReasoningResult] = None,
647
- ) -> dict:
648
- """Score any question type — searches all fields broadly."""
649
- # Try all specific scorers, return first hit
650
- for scorer in [_score_diagnostic, _score_treatment]:
651
- result = scorer(target, report)
652
- if result.get("mentioned_accuracy", 0.0) > 0.0:
653
- return result
654
-
655
- if reasoning_result:
656
- result = _score_mechanism(target, report, reasoning_result)
657
- if result.get("mentioned_accuracy", 0.0) > 0.0:
658
- return result
659
-
660
- return _not_found()
661
-
662
-
663
- def _build_fulltext(report: CDSReport) -> str:
664
- """Concatenate all report fields into a single searchable string."""
665
- return " ".join([
666
- report.patient_summary or "",
667
- " ".join(report.guideline_recommendations),
668
- " ".join(a.action for a in report.suggested_next_steps),
669
- " ".join(dx.diagnosis + " " + dx.reasoning for dx in report.differential_diagnosis),
670
- " ".join(report.sources_cited),
671
- " ".join(c.description for c in report.conflicts),
672
- ])
673
-
674
-
675
- def _not_found() -> dict:
676
- return {
677
- "top1_accuracy": 0.0,
678
- "top3_accuracy": 0.0,
679
- "mentioned_accuracy": 0.0,
680
- "match_location": "not_found",
681
- "match_rank": -1,
682
- }
683
- ```
684
-
685
- ### Integration in harness_medqa.py
686
-
687
- Replace the scoring block (lines ~242-290) in `validate_medqa()`:
688
-
689
- ```python
690
- # OLD:
691
- # found_top1, rank1, loc1 = diagnosis_in_differential(correct_answer, report, top_n=1)
692
- # ...etc...
693
-
694
- # NEW:
695
- question_type = case.metadata.get("question_type", "other")
696
- scores = score_case(
697
- target_answer=correct_answer,
698
- report=report,
699
- question_type=question_type,
700
- reasoning_result=state.clinical_reasoning if state else None,
701
- )
702
- # Extract individual metrics from the dict
703
- scores["parse_success"] = 1.0
704
- ```
705
-
706
- ### Key interface
707
-
708
- - `score_case()` returns `dict[str, float]` — always includes `top1_accuracy`,
709
- `top3_accuracy`, `mentioned_accuracy`, `match_location`, `match_rank`
710
- - The harness doesn't need to know about question type internals — just passes
711
- the string through
712
- - `diagnosis_in_differential()` is NOT removed — it's still used internally by
713
- `_score_diagnostic()` and as a utility
714
-
715
- ---
716
-
717
- ## Step 5: P6 — MCQ Answer-Selection Step
718
-
719
- **File:** `src/backend/validation/harness_medqa.py` (new function + integration)
720
- **Depends on:** P3 (question stem + options stored in metadata/ground_truth)
721
- **Depended on by:** P7 (reporting), but can be integrated independently
722
-
723
- ### Design
724
-
725
- After the pipeline generates its report, present MedGemma with the original
726
- question + answer choices + the pipeline's analysis, and ask it to select
727
- the best answer choice.
728
-
729
- ```python
730
- # In harness_medqa.py — new function
731
-
732
- from app.services.medgemma import MedGemmaService
733
-
734
-
735
- MCQ_SELECTION_PROMPT = """You are a medical expert taking a USMLE-style exam.
736
-
737
- You have already performed a thorough clinical analysis of this case.
738
- Now, based on your analysis, select the single best answer from the choices below.
739
-
740
- CLINICAL VIGNETTE:
741
- {vignette}
742
-
743
- QUESTION:
744
- {question_stem}
745
-
746
- YOUR CLINICAL ANALYSIS:
747
- - Top diagnoses: {top_diagnoses}
748
- - Key reasoning: {reasoning_summary}
749
- - Recommended next steps: {next_steps}
750
- - Guideline recommendations: {recommendations}
751
-
752
- ANSWER CHOICES:
753
- {formatted_options}
754
-
755
- Based on your clinical analysis above, which answer choice (A, B, C, or D)
756
- is BEST supported? Reply with ONLY the letter (A, B, C, or D) and a one-sentence justification.
757
-
758
- Format: X) Justification"""
759
-
760
-
761
- async def select_mcq_answer(
762
- case: ValidationCase,
763
- report: CDSReport,
764
- state: Optional[AgentState] = None,
765
- ) -> tuple[str, str]:
766
- """
767
- Use MedGemma to select the best MCQ answer given the pipeline's analysis.
768
-
769
- Args:
770
- case: The validation case (must have options in ground_truth)
771
- report: The CDS pipeline output
772
- state: Full agent state (for reasoning_chain access)
773
-
774
- Returns:
775
- (selected_letter, justification) — e.g. ("B", "Consistent with...")
776
- """
777
- options = case.ground_truth.get("options", {})
778
- if not options:
779
- return "", "No options available"
780
-
781
- # Format options
782
- if isinstance(options, dict):
783
- formatted = "\n".join(f"{k}) {v}" for k, v in sorted(options.items()))
784
- else:
785
- formatted = "\n".join(
786
- f"{chr(65+i)}) {v}" for i, v in enumerate(options)
787
- )
788
-
789
- # Build context from report
790
- top_dx = [dx.diagnosis for dx in report.differential_diagnosis[:3]]
791
- reasoning = ""
792
- if state and state.clinical_reasoning:
793
- reasoning = state.clinical_reasoning.reasoning_chain[:500]
794
- next_steps = [a.action for a in report.suggested_next_steps[:3]]
795
- recommendations = report.guideline_recommendations[:3]
796
-
797
- vignette = case.metadata.get("clinical_vignette", case.input_text)
798
- stem = case.metadata.get("question_stem", "")
799
-
800
- prompt = MCQ_SELECTION_PROMPT.format(
801
- vignette=vignette[:1000],
802
- question_stem=stem or "Based on the clinical presentation, select the best answer.",
803
- top_diagnoses=", ".join(top_dx) if top_dx else "None generated",
804
- reasoning_summary=reasoning[:500] if reasoning else "Not available",
805
- next_steps=", ".join(next_steps) if next_steps else "None",
806
- recommendations=", ".join(recommendations) if recommendations else "None",
807
- formatted_options=formatted,
808
- )
809
-
810
- service = MedGemmaService()
811
- raw = await service.generate(
812
- prompt=prompt,
813
- system_prompt="You are a medical expert. Select the single best answer.",
814
- max_tokens=100,
815
- temperature=0.1,
816
- )
817
-
818
- # Parse response — look for a letter A-D
819
- selected = ""
820
- justification = raw.strip()
821
- for char in raw.strip()[:5]:
822
- if char.upper() in "ABCD":
823
- selected = char.upper()
824
- break
825
-
826
- return selected, justification
827
-
828
-
829
- def score_mcq_selection(
830
- selected_letter: str,
831
- correct_idx: str,
832
- ) -> float:
833
- """Return 1.0 if selected matches correct, else 0.0."""
834
- return 1.0 if selected_letter.upper() == correct_idx.upper() else 0.0
835
- ```
836
-
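The letter-scan above only inspects the first five characters, which misses replies like "Answer: B". A slightly more robust parse (a sketch — `parse_mcq_letter` is a hypothetical helper, not an existing function) uses a word-boundary regex:

```python
import re

# Match a standalone uppercase A-D near the start of the reply; accepts
# formats like "B) ...", "(B) ...", or "Answer: B".
_LETTER_RE = re.compile(r"\b([A-D])\b")

def parse_mcq_letter(raw: str) -> str:
    """Return the first standalone A-D near the start of the reply, else ""."""
    match = _LETTER_RE.search(raw.strip()[:80])
    return match.group(1) if match else ""
```

Restricting the search to the first 80 characters keeps a stray capital deep in the justification from being mistaken for the selected answer.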
837
- ### Integration in validate_medqa()
838
-
839
- After the existing scoring block, add:
840
-
841
- ```python
842
- # MCQ selection (optional additional scoring)
843
- if report and case.ground_truth.get("options"):
844
- try:
845
- selected, justification = await select_mcq_answer(case, report, state)
846
- scores["mcq_accuracy"] = score_mcq_selection(
847
- selected, case.ground_truth["answer_idx"]
848
- )
849
- details["mcq_selected"] = selected
850
- details["mcq_justification"] = justification
851
- details["mcq_correct"] = case.ground_truth["answer_idx"]
852
- except Exception as e:
853
- logger.warning(f"MCQ selection failed: {e}")
854
- scores["mcq_accuracy"] = 0.0
855
- ```
856
-
857
- ### Cost consideration
858
-
859
- This adds 1 extra MedGemma call per case (~100 tokens output). For 50 cases,
860
- that's ~5,000 extra output tokens — negligible cost (<$0.10).
861
-
862
- ### Key interface
863
-
864
- - `select_mcq_answer()` is self-contained — can be called or skipped
865
- - Adds `mcq_accuracy` to the scores dict
866
- - Does NOT change any existing score calculations
867
-
868
- ---
869
-
870
- ## Step 6: P7 — Stratified Reporting
871
-
872
- **File:** `src/backend/validation/base.py` (modify `print_summary`, `save_results`)
873
- + `src/backend/validation/harness_medqa.py` (modify aggregation block)
874
- **Depends on:** P1 (question types), P4 (per-type scores)
875
- **Depended on by:** Nothing (terminal node)
876
-
877
- ### Changes to summary aggregation in validate_medqa()
878
-
879
- ```python
880
- # In validate_medqa() — replace the aggregation block at the end
881
-
882
- # Aggregate — overall
883
- total = len(results)
884
- successful = sum(1 for r in results if r.success)
885
-
886
- metric_names = [
887
- "top1_accuracy", "top3_accuracy", "mentioned_accuracy",
888
- "differential_accuracy", "parse_success", "mcq_accuracy",
889
- ]
890
- metrics = {}
891
- for m in metric_names:
892
- values = [r.scores.get(m, 0.0) for r in results if m in r.scores]
893
- metrics[m] = sum(values) / len(values) if values else 0.0
894
-
895
- # Average pipeline time
896
- times = [r.pipeline_time_ms for r in results if r.success]
897
- metrics["avg_pipeline_time_ms"] = sum(times) / len(times) if times else 0
898
-
899
- # ── Stratified metrics ──
900
- from validation.question_classifier import QuestionType, PIPELINE_APPROPRIATE_TYPES
901
-
902
- # Group results by question type
903
- by_type: dict[str, list[ValidationResult]] = {}
904
- for r in results:
905
- qt = r.details.get("question_type", "other")
906
- by_type.setdefault(qt, []).append(r)
907
-
908
- # Per-type metrics
909
- for qt, type_results in by_type.items():
910
- n = len(type_results)
911
- metrics[f"count_{qt}"] = n
912
- for m in ["top1_accuracy", "top3_accuracy", "mentioned_accuracy", "mcq_accuracy"]:
913
- values = [r.scores.get(m, 0.0) for r in type_results if m in r.scores]
914
- if values:
915
- metrics[f"{m}_{qt}"] = sum(values) / len(values)
916
-
917
- # Pipeline-appropriate subset
918
- appropriate_results = [
919
- r for r in results
920
- if r.details.get("question_type", "other") in {t.value for t in PIPELINE_APPROPRIATE_TYPES}
921
- ]
922
- if appropriate_results:
923
- for m in ["top1_accuracy", "top3_accuracy", "mentioned_accuracy"]:
924
- values = [r.scores.get(m, 0.0) for r in appropriate_results]
925
- metrics[f"{m}_pipeline_appropriate"] = sum(values) / len(values) if values else 0.0
926
- metrics["count_pipeline_appropriate"] = len(appropriate_results)
927
- ```
928
-
929
- ### Changes to print_summary()
930
-
931
- ```python
932
- # In base.py — enhanced print_summary()
933
-
934
- def print_summary(summary: ValidationSummary):
935
- """Pretty-print validation results to console."""
936
- print(f"\n{'='*60}")
937
- print(f" Validation Results: {summary.dataset.upper()}")
938
- print(f"{'='*60}")
939
- print(f" Total cases: {summary.total_cases}")
940
- print(f" Successful: {summary.successful_cases}")
941
- print(f" Failed: {summary.failed_cases}")
942
- print(f" Duration: {summary.run_duration_sec:.1f}s")
943
-
944
- # Overall metrics (exclude per-type and count metrics)
945
- print(f"\n Overall Metrics:")
946
- for metric, value in sorted(summary.metrics.items()):
947
- if "_" in metric and any(metric.endswith(f"_{qt}") for qt in
948
- ["diagnostic", "treatment", "mechanism", "lab_finding",
949
- "pharmacology", "epidemiology", "ethics", "anatomy", "other",
950
- "pipeline_appropriate"]):
951
- continue # Print these in stratified section
952
- if metric.startswith("count_"):
953
- continue
954
- if "time" in metric and isinstance(value, (int, float)):
955
- print(f" {metric:35s} {value:.0f}ms")
956
- elif isinstance(value, float):
957
- print(f" {metric:35s} {value:.1%}")
958
- else:
959
- print(f" {metric:35s} {value}")
960
-
961
- # Stratified metrics
962
- type_keys = sorted(set(
963
- k[len("count_"):] for k in summary.metrics  # "count_lab_finding" -> "lab_finding" (rsplit would give "finding")
964
- if k.startswith("count_") and k != "count_pipeline_appropriate"
965
- ))
966
- if type_keys:
967
- print(f"\n By Question Type:")
968
- print(f" {'Type':15s} {'Count':>6s} {'Top-1':>7s} {'Top-3':>7s} {'Mentioned':>10s} {'MCQ':>7s}")
969
- print(f" {'-'*15} {'-'*6} {'-'*7} {'-'*7} {'-'*10} {'-'*7}")
970
- for qt in type_keys:
971
- count = summary.metrics.get(f"count_{qt}", 0)
972
- t1 = summary.metrics.get(f"top1_accuracy_{qt}", None)
973
- t3 = summary.metrics.get(f"top3_accuracy_{qt}", None)
974
- ma = summary.metrics.get(f"mentioned_accuracy_{qt}", None)
975
- mcq = summary.metrics.get(f"mcq_accuracy_{qt}", None)
976
- # A conditional can't appear inside an f-string format spec, so
977
- # pre-format each cell before printing.
978
- cells = [(f"{v:.0%}" if v is not None else "-").rjust(w)
979
- for v, w in [(t1, 7), (t3, 7), (ma, 10), (mcq, 7)]]
980
- print(f"    {qt:15s} {int(count):6d} " + " ".join(cells))
981
-
982
- # Pipeline-appropriate subset
983
- pa_count = summary.metrics.get("count_pipeline_appropriate", 0)
984
- if pa_count > 0:
985
- print(f"\n Pipeline-Appropriate Subset ({int(pa_count)} cases):")
986
- for m in ["top1_accuracy", "top3_accuracy", "mentioned_accuracy"]:
987
- v = summary.metrics.get(f"{m}_pipeline_appropriate")
988
- if v is not None:
989
- print(f" {m:35s} {v:.1%}")
990
-
991
- print(f"{'='*60}\n")
992
- ```
993
-
994
- ### Key interface
995
-
996
- - `ValidationSummary.metrics` dict gains new keys with `_{question_type}` suffixes
997
- - `save_results()` doesn't need changes — it serializes `metrics` as-is
998
- - Console output is richer but backward-compatible (old scripts parsing the JSON
999
- still see all the original keys)
1000
-
1001
- ---
1002
-
1003
- ## Step 7: P2 — Multi-Mode Pipeline (Large — Future)
1004
-
1005
- **Files:** `src/backend/app/agent/orchestrator.py`, `src/backend/app/tools/clinical_reasoning.py`, `src/backend/app/models/schemas.py`
1006
- **Depends on:** P1 (question type routing into the pipeline), P3 (question stem passed to pipeline)
1007
- **Depended on by:** Nothing (this is the final architectural evolution)
1008
-
1009
- ### Overview
1010
-
1011
- This is the biggest change and should be done LAST. It modifies the production
1012
- pipeline, not just the validation framework.
1013
-
1014
- ### 7a. Add `question_context` to `CaseSubmission`
1015
-
1016
- ```python
1017
- # In schemas.py — extend CaseSubmission
1018
-
1019
- class CaseSubmission(BaseModel):
1020
- patient_text: str = Field(..., min_length=10)
1021
- include_drug_check: bool = Field(True)
1022
- include_guidelines: bool = Field(True)
1023
- question_context: Optional[str] = Field(
1024
- None,
1025
- description="The clinical question being asked (e.g., 'What is the most likely diagnosis?'). "
1026
- "If provided, the pipeline adapts its reasoning mode.",
1027
- )
1028
- question_type: Optional[str] = Field(
1029
- None,
1030
- description="Pre-classified question type: diagnostic, treatment, mechanism, etc.",
1031
- )
1032
- ```
1033
-
1034
- ### 7b. Mode-specific system prompts in clinical_reasoning.py
1035
-
1036
- ```python
1037
- # Replace single SYSTEM_PROMPT with a dict:
1038
-
1039
- SYSTEM_PROMPTS = {
1040
- "diagnostic": """You are an expert clinical reasoning assistant...
1041
- [existing diagnostic prompt — mostly unchanged]""",
1042
-
1043
- "treatment": """You are an expert clinical management assistant...
1044
- Given a structured patient profile and clinical question, recommend the
1045
- most appropriate treatment or next step in management.
1046
- Focus on: evidence-based treatment guidelines, patient-specific factors,
1047
- contraindications, and prioritized management steps.
1048
- Generate a ranked list of management options (not diagnoses)...""",
1049
-
1050
- "mechanism": """You are an expert in medical pathophysiology...
1051
- Given a clinical scenario, explain the underlying mechanism,
1052
- pathophysiology, or pharmacological principle being tested.
1053
- Focus on: molecular/cellular mechanism, physiological pathways,
1054
- drug mechanisms of action...""",
1055
-
1056
- "default": """[existing SYSTEM_PROMPT as fallback]""",
1057
- }
1058
- ```
1059
-
1060
- ### 7c. Extend clinical reasoning output model
1061
-
1062
- ```python
1063
- # In schemas.py — new model for non-diagnostic reasoning
1064
-
1065
- class ClinicalAnalysisResult(BaseModel):
1066
- """Flexible clinical analysis output that adapts to question type."""
1067
- analysis_mode: str = Field("diagnostic", description="What type of analysis was performed")
1068
- differential_diagnosis: List[DiagnosisCandidate] = Field(default_factory=list)
1069
- management_options: List[RecommendedAction] = Field(default_factory=list)
1070
- mechanism_explanation: str = Field("", description="Pathophysiology/mechanism explanation")
1071
- recommended_workup: List[RecommendedAction] = Field(default_factory=list)
1072
- reasoning_chain: str = Field("")
1073
- risk_assessment: Optional[str] = None
1074
- direct_answer: Optional[str] = Field(
1075
- None,
1076
- description="Direct answer to the clinical question (when applicable)",
1077
- )
1078
- ```
1079
-
1080
- ### 7d. Orchestrator routing
1081
-
1082
- ```python
1083
- # In orchestrator.py — _step_reason() adapts based on question type
1084
-
1085
- async def _step_reason(self):
1086
- question_type = self._case.question_type or "diagnostic"
1087
- result = await self.clinical_reasoning.run(
1088
- self._state.patient_profile,
1089
- mode=question_type,
1090
- )
1091
- ...
1092
- ```
1093
-
1094
- ### Scope warning
1095
-
1096
- This is a multi-file, multi-model refactor. Do it only after Steps 1-6 are
1097
- working and validated. The validation improvements (Steps 1-6) will already
1098
- give us honest metrics; Step 7 is about actually improving the pipeline's ability
1099
- to handle non-diagnostic questions.
1100
-
1101
- ---
1102
-
1103
- ## Testing Strategy
1104
-
1105
- ### Unit tests (no LLM calls needed)
1106
-
1107
- | Test file | What it tests |
1108
- |-----------|---------------|
1109
- | `test_fuzzy_match.py` | P5: fuzzy_match with short/long targets, edge cases |
1110
- | `test_question_classifier.py` | P1: classification accuracy on known questions |
1111
- | `test_split_question.py` | P3: vignette/stem separation on real MedQA samples |
1112
- | `test_score_case.py` | P4: type-aware scoring with mock CDSReport objects |
1113
-
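As a flavor of `test_question_classifier.py`, here is a self-contained sketch: the stand-in `classify()` inlines a subset of the `_STEM_PATTERNS` shown earlier so the tests run without importing the validation package.

```python
import re

# Stand-in classifier: a subset of _STEM_PATTERNS, first hit wins.
_PATTERNS = [
    (r"most likely finding", "lab_finding"),
    (r"(risk factor|prevalence|incidence|odds ratio|relative risk)", "epidemiology"),
    (r"(inhibit|block|activate).*receptor", "mechanism"),
]

def classify(question: str) -> str:
    text_lower = question.lower()
    for pattern, qtype in _PATTERNS:
        if re.search(pattern, text_lower):
            return qtype
    return "other"

def test_lab_finding_classification():
    assert classify("A biopsy is performed. What is the most likely finding?") == "lab_finding"

def test_epidemiology_classification():
    assert classify("Which of the following is the strongest risk factor?") == "epidemiology"

def test_fallback_to_other():
    assert classify("What should the physician say next?") == "other"
```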
1114
- ### Integration tests (need LLM endpoint)
1115
-
1116
- | Test | What it tests | Cost |
1117
- |------|---------------|------|
1118
- | 3-case smoke test with MCQ | P6: MCQ selection works | ~$0.50 |
1119
- | 10-case run with stratified reporting | P7: reporting output is correct | ~$2.00 |
1120
- | 50-case full run with all fixes | All: end-to-end accuracy comparison | ~$5.00 |
1121
-
1122
- ### Comparison protocol
1123
-
1124
- Run 50-case MedQA (seed=42) twice:
1125
- 1. **Before:** Current code (baseline: 36% top-1, 38% mentioned)
1126
- 2. **After:** All fixes applied
1127
-
1128
- Compare:
1129
- - Overall accuracy (should be similar or slightly higher)
1130
- - Diagnostic-only accuracy (should be similar — same pipeline, better matching)
1131
- - MCQ accuracy (expected 60-70%+ — this is the big win)
1132
- - Pipeline-appropriate accuracy (expected higher than overall)
1133
- - Stratified breakdown by question type
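The before/after comparison can be mechanized with a small helper over the two saved runs (a sketch: it assumes the `save_results()` JSON exposes a top-level `"metrics"` mapping as described above — adjust the key if the schema differs):

```python
import json

def compare_metrics(before: dict, after: dict) -> dict:
    """Map metric name -> (before, after, delta) for metrics present in both runs."""
    return {
        name: (before[name], after[name], after[name] - before[name])
        for name in sorted(before)
        if name in after and isinstance(before[name], (int, float))
    }

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["metrics"]
```

Usage: `compare_metrics(load_metrics("results_before.json"), load_metrics("results_after.json"))` surfaces per-metric deltas; keys that exist only in the after run (e.g. the new `mcq_accuracy`) are skipped, since only shared metrics are comparable.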
1134
-
1135
- ---
1136
-
1137
- ## File Change Summary
1138
-
1139
- | File | Changes | Step |
1140
- |------|---------|------|
1141
- | `validation/base.py` | Rewrite `fuzzy_match()`, add `_content_tokens()`, `_MEDICAL_STOPWORDS`. Add `score_case()` and per-type scorers. Modify `print_summary()`. | P5, P4, P7 |
1142
- | `validation/harness_medqa.py` | Replace `_extract_vignette()` with `_split_question()`. Update `fetch_medqa()` metadata. Refactor scoring block to use `score_case()`. Add `select_mcq_answer()`. Update aggregation. | P3, P4, P6, P7 |
1143
- | `validation/question_classifier.py` | **NEW FILE.** `QuestionType` enum, `classify_question()`, `_STEM_PATTERNS`. | P1 |
1144
- | `app/models/schemas.py` | Add `question_context`, `question_type` to `CaseSubmission`. Add `ClinicalAnalysisResult`. | P2 (Step 7 only) |
1145
- | `app/tools/clinical_reasoning.py` | Add mode-specific system prompts. Accept `mode` param. | P2 (Step 7 only) |
1146
- | `app/agent/orchestrator.py` | Route reasoning step based on question type. | P2 (Step 7 only) |
1147
-
1148
- **Steps 1-6 touch only validation code.** The production pipeline is unchanged
1149
- until Step 7.
 
competition/download_data.txt DELETED
@@ -1 +0,0 @@
1
- kaggle competitions download -c med-gemma-impact-challenge
 
 
competition/overview.txt DELETED
@@ -1,167 +0,0 @@
1
- The MedGemma Impact Challenge
2
- Build human-centered AI applications with MedGemma and other open models from Google’s Health AI Developer Foundations (HAI-DEF).
3
-
4
-
5
- The MedGemma Impact Challenge
6
-
7
- View Writeups
8
- Overview
9
- In this competition, you’ll use MedGemma and other open models from Google’s Health AI Developer Foundations (HAI-DEF) to build human-centered AI applications.
10
-
11
- Start
12
-
13
- a month ago
14
- Close
15
- 11 days to go
16
- Description
17
- AI is already reshaping medicine, from diagnostics to drug discovery. But many clinical environments can’t rely on large, closed models that require constant internet access or centralized infrastructure. They need adaptable, privacy-focused tools that can run anywhere care is delivered.
18
-
19
- To meet this need, Google has released open-weight models specifically designed to help developers more efficiently create novel healthcare and life sciences applications. MedGemma and the rest of HAI-DEF collection give developers a starting point for building powerful tools while allowing them full control over the models and associated infrastructure.
20
-
21
- In this competition, you’ll use these models to build full fledged demonstration applications. Whether you’re building apps to streamline workflows, support patient communication, or facilitate diagnostics, your solution should demonstrate how these tools can enhance healthcare.
22
-
23
- Evaluation
24
- Minimum requirements
25
- To be considered a valid contribution, your submission should include:
26
-
27
- a high-quality writeup describing use of a specific HAI-DEF model,
28
- associated reproducible code for your initial results, and
29
- a video for judging.
30
- Your complete submission consists of a single package containing your video (3 minutes or less) and write-up (3 pages or less). This single entry can be submitted to the main competition track, and one special technology award, so separate submissions are not required. Read the section Submission Instructions for more details. Please follow the provided write-up template and refer to the judging criteria for all content requirements.
31
-
32
- Evaluation Criteria
33
- Submissions are evaluated on the following criteria:
34
-
35
- Criteria (percentage) Description
36
- Effective use of HAI-DEF models
37
- (20%) Are HAI-DEF models used appropriately?
38
-
39
- You will be assessed on: whether the submission proposes an application that uses HAI-DEF models to their fullest potential, where other solutions would likely be less effective.
40
-
41
- Note: Use of at least one of HAI-DEF models such as MedGemma is mandatory.
42
- Problem domain
43
- (15%) How important is this problem to solve and how plausible is it that AI is the right solution?
44
-
45
- You will be assessed on: storytelling, clarity of problem definition, clarity on whether there is an unmet need, the magnitude of the problem, who the user is and their improved journey given your solution.
46
- Impact potential
47
- (15%) If the solution works, what impact would it have?
48
-
49
- You will be assessed on: clear articulation of real or anticipated impact of your application within the given problem domain and description of how you calculated your estimates.
50
- Product feasibility
51
- (20%) Is the technical solution clearly feasible?
52
-
53
- You will be assessed on: technical documentation detailing model fine-tuning, model's performance analysis, your user-facing application stack, deployment challenges and how you plan on overcoming them. Consideration of how a product might be used in practice, rather than only for benchmarking.
54
- Execution and communication (30%) What is the quality of your project's execution and your clear and concise communication of your work? Your main submission package follows the provided template and includes a mandatory video demo and a write-up with links to your source material.
55
-
56
- You will be assessed on: the clarity, polish, and effectiveness of your video demonstration; the completeness and readability of your technical write-up; and the quality of your source code (e.g., organization, comments, reusability). Judges will look for a cohesive and compelling narrative across all submitted materials that effectively articulates how you meet the rest of the judging criteria.
57
- Timeline
58
- January 13, 2026 - Start Date.
59
- February 24, 2026 - Final Submission Deadline.
60
- March 17 - 24, 2026 - Anticipated Results Announcement - Time required to evaluate results is dependent on the number of submissions.
61
- All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.
62
-
63
- Judges
64
- Fereshteh Mahvar
65
- Staff Medical Software Engineer & Solutions Architect, Google Health AI
66
- Omar Sanseviero
67
- Developer Experience Lead, Google DeepMind
68
- Glenn Cameron
69
- Sr. PMM, Google
70
- Can "John" Kirmizi
71
- Software Engineer, Google Research
72
- Andrew Sellergren
73
- Software Engineer, Google Research
74
- Dave Steiner
75
- Clinical Research Scientist, Google
76
- Sunny Virmani
77
- Group Product Manager, Google Research
78
- Liron Yatziv
79
- Research Engineer, Google Research
80
- Daniel Golden
81
- Engineering Manager, Google Research
82
- Yun Liu
83
- Research Scientist, Google Research
84
- Rebecca Hemenway
85
- Health AI Strategic Partnerships, Google Research
86
- Fayaz Jamil
87
- Technical Program Manager, Google Research
88
- Tracks and Awards
89
- Main Track · $75,000
90
- Description
91
- These prizes are awarded to the best overall projects that demonstrate exceptional vision, technical execution, and potential for real-world impact.
92
-
93
- Track Awards
94
-
95
- 1st Place
96
- $30,000
97
-
98
- 2nd Place
99
- $20,000
100
-
101
- 3rd Place
102
- $15,000
103
-
104
- 4th Place
105
- $10,000
106
- Agentic Workflow Prize · $10,000
107
- Description
108
- It is awarded for the project that most effectively reimagines a complex workflow by deploying HAI-DEF models as intelligent agents or callable tools. The winning solution will demonstrate a significant overhaul of a challenging process, showcasing the power of agentic AI to improve efficiency and outcomes.
109
-
110
- Track Awards
111
-
112
- Agentic Workflow Prize 1
113
- $5,000
114
-
115
- Agentic Workflow Prize 2
116
- $5,000
117
- The Novel Task Prize · $10,000
118
- Description
119
- Awarded for the most impressive fine-tuned model that successfully adapts a HAI-DEF model to perform a useful task for which it was not originally trained on pre-release.
120
-
121
- Track Awards
122
-
123
- The Novel Task Prize 1
124
- $5,000
125
-
126
- The Novel Task Prize 2
127
- $5,000
128
- The Edge AI Prize · $5,000
129
- Description
130
- This prize is awarded to the most impressive solution that brings AI out of the cloud and into the field. It will be awarded to the team that best adapts a HAI-DEF model to run effectively on a local device like a mobile phone, portable scanner, lab instrument, or other edge hardware.
131
-
132
- Track Awards
133
-
134
- The Edge AI Prize
135
- $5,000
136
- Submission Instructions
137
- Your submission must be a Kaggle Writeup and it must be attached to this page. To create a new Writeup, click on the "New Writeup" button here. After you have saved your Writeup, you should see a "Submit" button in the top right corner. Each team is limited to submitting only a single Writeup, but that same Writeup can be un-submitted, edited, and re-submitted as many times as you'd like. Your Writeup should contain a summary of your overall project along with links to supporting resources.
138
-
139
- Choosing a track
140
- All submissions compete in the Main Track, and are eligible to win one special award prize (Agentic Workflow Prize, The Novel Task Prize, or The Edge of AI Prize). While you will have the option to select multiple tracks when you create your writeup, you can only chose the main track and one special award prize. If you choose multiple special awards, we will only consider your submission for one of your indicated special awards (randomly selected).
141
-
142
- Links
143
- Required: Video (3 min or less)
144
- Required: Public code repository
145
- Bonus: Public interactive live demo app
146
- Bonus: Open-weight Hugging Face model tracing to a HAI-DEF model
147
- Proposed Writeup template
148
- Use the following structure and in 3 pages or less present your work. Less is more! You should take advantage of the video to convey most of the concepts and keep the write-up as high level as possible.
149
-
150
- ### Project name
151
- [A concise name for your project.]
152
-
153
- ### Your team
154
- [Name your team members, their speciality and the role they played.]
155
-
156
- ### Problem statement
157
- [Your answer to the “Problem domain” & “Impact potential” criteria]
158
-
159
- ### Overall solution:
160
- [Your answer to “Effective use of HAI-DEF models” criterion]
161
-
162
- ### Technical details
163
- [Your answer to “Product feasibility” criterion]
164
- Note: If you attach a private Kaggle Resource to your public Kaggle Writeup, your private Resource will automatically be made public after the deadline.
165
-
166
- Citation
167
- Fereshteh Mahvar, Yun Liu, Daniel Golden, Fayaz Jamil, Sunny Jansen, Can Kirmizi, Rory Pilgrim, David F. Steiner, Andrew Sellergren, Richa Tiwari, Sunny Virmani, Liron Yatziv, Rebecca Hemenway, Yossi Matias, Ronit Levavi Morad, Avinatan Hassidim, Shravya Shetty, and María Cruz. The MedGemma Impact Challenge. https://kaggle.com/competitions/med-gemma-impact-challenge, 2026. Kaggle.


competition/rules.txt DELETED
@@ -1,163 +0,0 @@
1
- Competition Rules
2
- ENTRY IN THIS COMPETITION CONSTITUTES YOUR ACCEPTANCE OF THESE OFFICIAL COMPETITION RULES.
3
- See Section 3.18 for defined terms
4
-
5
- The Competition named below is a skills-based competition to promote and further the field of data science. You must register via the Competition Website to enter. To enter the Competition, you must agree to these Official Competition Rules, which incorporate by reference the provisions and content of the Competition Website and any Specific Competition Rules herein (collectively, the "Rules"). Please read these Rules carefully before entry to ensure you understand and agree. You further agree that Submission in the Competition constitutes agreement to these Rules. You may not submit to the Competition and are not eligible to receive the prizes associated with this Competition unless you agree to these Rules. These Rules form a binding legal agreement between you and the Competition Sponsor with respect to the Competition. Your competition Submissions must conform to the requirements stated on the Competition Website. Your Submissions will be scored based on the evaluation metric described on the Competition Website. Subject to compliance with the Competition Rules, Prizes, if any, will be awarded to Participants with the best scores, based on the merits of the data science models submitted. See below for the complete Competition Rules. For Competitions designated as hackathons by the Competition Sponsor (“Hackathons”), your Submissions will be judged by the Competition Sponsor based on the evaluation rubric set forth on the Competition Website (“Evaluation Rubric”). The Prizes, if any, will be awarded to Participants with the highest ranking(s) as determined by the Competition Sponsor based on such rubric.
6
-
7
- You cannot sign up to Kaggle from multiple accounts and therefore you cannot enter or submit from multiple accounts.
8
-
9
- 1. COMPETITION-SPECIFIC TERMS
10
- 1. COMPETITION TITLE
11
- The MedGemma Impact Challenge
12
-
13
- 2. COMPETITION SPONSOR
14
- Google Research
15
-
16
- 3. COMPETITION SPONSOR ADDRESS
17
- 1600 Amphitheatre Parkway, Mountain View, California 94043 USA
18
-
19
- 4. COMPETITION WEBSITE
20
- https://www.kaggle.com/competitions/med-gemma-impact-challenge
21
-
22
- 5. TOTAL PRIZES AVAILABLE: $100,000
23
- Main track: $75,000
24
-
25
- 1st Place: $30,000
26
- 2nd Place: $20,000
27
- 3rd Place: $15,000
28
- 4th Place: $10,000
29
- Special Technology Awards: $25,000
30
-
31
- Agentic Workflow prize: $10,000 (Two prizes of $5,000)
32
- The Edge AI Prize: $5,000
33
- The Novel Task Prize: $10,000 (Two prizes of $5,000)
34
- 6. WINNER LICENSE TYPE
35
- CC BY 4.0
36
-
37
- 7. DATA ACCESS AND USE
38
- No data is provided for this competition. Use of HAI-DEF and MedGemma are subject to the HAI-DEF Terms of Use.
39
-
40
- 2. COMPETITION-SPECIFIC RULES
41
- In addition to the provisions of the General Competition Rules below, you understand and agree to these Competition-Specific Rules required by the Competition Sponsor:
42
-
43
- 1. TEAM LIMITS
44
- The maximum Team size is five (5). b. Team mergers are allowed and can be performed by the Team leader. In order to merge, the combined Team must have a total Submission count less than or equal to the maximum allowed as of the Team Merger Deadline. The maximum allowed is the number of Submissions per day multiplied by the number of days the competition has been running. For Hackathons, each team is allowed one (1) Submission; any Submissions submitted by Participants before merging into a Team will be unsubmitted.
45
-
46
- 2. SUBMISSION LIMITS
47
- For Hackathons, each Team may submit one (1) Submission. This single entry can be submitted to the main competition track, and one special technology award, so separate submissions are not required.
48
-
49
- 3. COMPETITION TIMELINE
50
- Competition Timeline dates (including Entry Deadline, Final Submission Deadline, Start Date, and Team Merger Deadline, as applicable) are reflected on the competition’s Overview > Timeline page.
51
-
52
- 4. COMPETITION DATA
53
- a. Data Access and Use
54
- None. Competition Data will not be provided by Competition Sponsor for this Competition.
55
- b. Data Security
56
- You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.
57
- 5. WINNER LICENSE
58
- a. Under Section 2.8 (Winners Obligations) of the General Rules below, you hereby grant and will grant the Competition Sponsor the following license(s) with respect to your Submission if you are a Competition winner:
59
-
60
- Open Source: You hereby license and will license your winning Submission and the source code used to generate the Submission to the Competition Sponsor under CC BY 4.0 that in no event limits commercial use of such code or model containing or depending on such code.
61
-
62
- For generally commercially available software that you used to generate your Submission that is not owned by you, but that can be procured by the Competition Sponsor without undue expense, you do not need to grant the license in the preceding Section for that software.
63
-
64
- In the event that input data or pretrained models with an incompatible license are used to generate your winning solution, you do not need to grant an open source license in the preceding Section for that data and/or model(s).
65
-
66
- b. You may be required by the Sponsor to provide a detailed description of how the winning Submission was generated, to the Competition Sponsor’s specifications, as outlined in Section 2.8, Winner’s Obligations. This may include a detailed description of methodology, where one must be able to reproduce the approach by reading the description, and includes a detailed explanation of the architecture, preprocessing, loss function, training details, hyper-parameters, etc. The description should also include a link to a code repository with complete and detailed instructions so that the results obtained can be reproduced.
67
-
68
- 6. EXTERNAL DATA AND TOOLS
69
- a. You may use data other than the Competition Data (“External Data”) to develop and test your Submissions. However, you will ensure the External Data is either publicly available and equally accessible to use by all Participants of the Competition for purposes of the competition at no cost to the other Participants, or satisfies the Reasonableness criteria as outlined in Section 2.6.b below. The ability to use External Data under this Section does not limit your other obligations under these Competition Rules, including but not limited to Section 2.8 (Winners Obligations).
70
-
71
- b. Use of HAI-DEF and MedGemma are subject to the HAI-DEF Terms of Use
72
-
73
- c. The use of external data and models is acceptable unless specifically prohibited by the Host. Because of the potential costs or restrictions (e.g., “geo restrictions”) associated with obtaining rights to use external data or certain software and associated tools, their use must be “reasonably accessible to all” and of “minimal cost”. Also, regardless of the cost challenges as they might affect all Participants during the course of the competition, the costs of potentially procuring a license for software used to generate a Submission, must also be considered. The Host will employ an assessment of whether or not the following criteria can exclude the use of the particular LLM, data set(s), or tool(s):
74
-
75
- Are Participants being excluded from a competition because of the "excessive" costs for access to certain LLMs, external data, or tools that might be used by other Participants. The Host will assess the excessive cost concern by applying a “Reasonableness” standard (the “Reasonableness Standard”). The Reasonableness Standard will be determined and applied by the Host in light of things like cost thresholds and accessibility.
76
-
77
- By way of example only, a small subscription charge to use additional elements of a large language model such as Gemini Advanced are acceptable if meeting the Reasonableness Standard of Sec. 8.2. Purchasing a license to use a proprietary dataset that exceeds the cost of a prize in the competition would not be considered reasonable.
78
-
79
- d. Automated Machine Learning Tools (“AMLT”)
80
-
81
- Individual Participants and Teams may use automated machine learning tool(s) (“AMLT”) (e.g., Google toML, H2O Driverless AI, etc.) to create a Submission, provided that the Participant or Team ensures that they have an appropriate license to the AMLT such that they are able to comply with the Competition Rules.
82
- 7. ELIGIBILITY
83
- a. Unless otherwise stated in the Competition-Specific Rules above or prohibited by internal policies of the Competition Entities, employees, interns, contractors, officers and directors of Competition Entities may enter and participate in the Competition, but are not eligible to win any Prizes. "Competition Entities" means the Competition Sponsor, Kaggle Inc., and their respective parent companies, subsidiaries and affiliates. If you are such a Participant from a Competition Entity, you are subject to all applicable internal policies of your employer with respect to your participation.
84
-
85
- 8. WINNER’S OBLIGATIONS
86
- a. As a condition to being awarded a Prize, a Prize winner must fulfill the following obligations:
87
-
88
- Deliver to the Competition Sponsor the final model's software code as used to generate the winning Submission and associated documentation. The delivered software code should follow these documentation guidelines, must be capable of generating the winning Submission, and contain a description of resources required to build and/or run the executable code successfully. For avoidance of doubt, delivered software code should include training code, inference code, and a description of the required computational environment. For Hackathons, the Submission deliverables will be as described on the Competition Website, which may be information or materials that are not software code.
89
- b. To the extent that the final model’s software code includes generally commercially available software that is not owned by you, but that can be procured by the Competition Sponsor without undue expense, then instead of delivering the code for that software to the Competition Sponsor, you must identify that software, method for procuring it, and any parameters or other information necessary to replicate the winning Submission; Individual Participants and Teams who create a Submission using an AMLT may win a Prize. However, for clarity, the potential winner’s Submission must still meet the requirements of these Rules, including but not limited to Section 2.5 (Winners License), Section 2.8 (Winners Obligations), and Section 3.14 (Warranty, Indemnity, and Release).”
90
-
91
- c. Individual Participants and Teams who create a Submission using an AMLT may win a Prize. However, for clarity, the potential winner’s Submission must still meet the requirements of these Rules,
92
-
93
- Grant to the Competition Sponsor the license to the winning Submission stated in the Competition Specific Rules above, and represent that you have the unrestricted right to grant that license;
94
-
95
- Sign and return all Prize acceptance documents as may be required by Competition Sponsor or Kaggle, including without limitation: (a) eligibility certifications; (b) licenses, releases and other agreements required under the Rules; and (c) U.S. tax forms (such as IRS Form W-9 if U.S. resident, IRS Form W-8BEN if foreign resident, or future equivalents).
96
-
97
- 9. GOVERNING LAW
98
- a. Unless otherwise provided in the Competition Specific Rules above, all claims arising out of or relating to these Rules will be governed by California law, excluding its conflict of laws rules, and will be litigated exclusively in the Federal or State courts of Santa Clara County, California, USA. The parties consent to personal jurisdiction in those courts. If any provision of these Rules is held to be invalid or unenforceable, all remaining provisions of the Rules will remain in full force and effect.
99
-
100
- 3. GENERAL COMPETITION RULES - BINDING AGREEMENT
101
- 1. ELIGIBILITY
102
- a. To be eligible to enter the Competition, you must be:
103
-
104
- a registered account holder at Kaggle.com;
105
- the older of 18 years old or the age of majority in your jurisdiction of residence (unless otherwise agreed to by Competition Sponsor and appropriate parental/guardian consents have been obtained by Competition Sponsor);
106
- not a resident of Crimea, so-called Donetsk People's Republic (DNR) or Luhansk People's Republic (LNR), Cuba, Iran, Syria, or North Korea; and
107
- not a person or representative of an entity under U.S. export controls or sanctions (see: https://www.treasury.gov/resourcecenter/sanctions/Programs/Pages/Programs.aspx).
108
- b. Competitions are open to residents of the United States and worldwide, except that if you are a resident of Crimea, so-called Donetsk People's Republic (DNR) or Luhansk People's Republic (LNR), Cuba, Iran, Syria, North Korea, or are subject to U.S. export controls or sanctions, you may not enter the Competition. Other local rules and regulations may apply to you, so please check your local laws to ensure that you are eligible to participate in skills-based competitions. The Competition Host reserves the right to forego or award alternative Prizes where needed to comply with local laws. If a winner is located in a country where prizes cannot be awarded, then they are not eligible to receive a prize.
109
-
110
- c. If you are entering as a representative of a company, educational institution or other legal entity, or on behalf of your employer, these rules are binding on you, individually, and the entity you represent or where you are an employee. If you are acting within the scope of your employment, or as an agent of another party, you warrant that such party or your employer has full knowledge of your actions and has consented thereto, including your potential receipt of a Prize. You further warrant that your actions do not violate your employer's or entity's policies and procedures.
111
-
112
- d. The Competition Sponsor reserves the right to verify eligibility and to adjudicate on any dispute at any time. If you provide any false information relating to the Competition concerning your identity, residency, mailing address, telephone number, email address, ownership of right, or information required for entering the Competition, you may be immediately disqualified from the Competition.
113
-
114
- 2. SPONSOR AND HOSTING PLATFORM
115
- a. The Competition is sponsored by Competition Sponsor named above. The Competition is hosted on behalf of Competition Sponsor by Kaggle Inc. ("Kaggle"). Kaggle is an independent contractor of Competition Sponsor, and is not a party to this or any agreement between you and Competition Sponsor. You understand that Kaggle has no responsibility with respect to selecting the potential Competition winner(s) or awarding any Prizes. Kaggle will perform certain administrative functions relating to hosting the Competition, and you agree to abide by the provisions relating to Kaggle under these Rules. As a Kaggle.com account holder and user of the Kaggle competition platform, remember you have accepted and are subject to the Kaggle Terms of Service at www.kaggle.com/terms in addition to these Rules.
116
-
117
- 3. COMPETITION PERIOD
118
- a. For the purposes of Prizes, the Competition will run from the Start Date and time to the Final Submission Deadline (such duration the “Competition Period”). The Competition Timeline is subject to change, and Competition Sponsor may introduce additional hurdle deadlines during the Competition Period. Any updated or additional deadlines will be publicized on the Competition Website. It is your responsibility to check the Competition Website regularly to stay informed of any deadline changes. YOU ARE RESPONSIBLE FOR DETERMINING THE CORRESPONDING TIME ZONE IN YOUR LOCATION.
119
-
120
- 4. COMPETITION ENTRY
121
- a. NO PURCHASE NECESSARY TO ENTER OR WIN. To enter the Competition, you must register on the Competition Website prior to the Entry Deadline, and follow the instructions for developing and entering your Submission through the Competition Website. Your Submissions must be made in the manner and format, and in compliance with all other requirements, stated on the Competition Website (the "Requirements"). Submissions must be received before any Submission deadlines stated on the Competition Website. Submissions not received by the stated deadlines will not be eligible to receive a Prize. b. Except as expressly allowed in Hackathons as set forth on the Competition Website, submissions may not use or incorporate information from hand labeling or human prediction of the validation dataset or test data records. c. If the Competition is a multi-stage competition with temporally separate training and/or test data, one or more valid Submissions may be required during each Competition stage in the manner described on the Competition Website in order for the Submissions to be Prize eligible. d. Submissions are void if they are in whole or part illegible, incomplete, damaged, altered, counterfeit, obtained through fraud, or late. Competition Sponsor reserves the right to disqualify any entrant who does not follow these Rules, including making a Submission that does not meet the Requirements.
122
-
123
- 5. INDIVIDUALS AND TEAMS
124
- a. Individual Account. You may make Submissions only under one, unique Kaggle.com account. You will be disqualified if you make Submissions through more than one Kaggle account, or attempt to falsify an account to act as your proxy. You may submit up to the maximum number of Submissions per day as specified on the Competition Website. b. Teams. If permitted under the Competition Website guidelines, multiple individuals may collaborate as a Team; however, you may join or form only one Team. Each Team member must be a single individual with a separate Kaggle account. You must register individually for the Competition before joining a Team. You must confirm your Team membership to make it official by responding to the Team notification message sent to your Kaggle account. Team membership may not exceed the Maximum Team Size stated on the Competition Website. c. Team Merger. Teams (or individual Participants) may request to merge via the Competition Website. Team mergers may be allowed provided that: (i) the combined Team does not exceed the Maximum Team Size; (ii) the number of Submissions made by the merging Teams does not exceed the number of Submissions permissible for one Team at the date of the merger request; (iii) the merger is completed before the earlier of: any merger deadline or the Competition deadline; and (iv) the proposed combined Team otherwise meets all the requirements of these Rules. d. Private Sharing. No private sharing outside of Teams. Privately sharing code or data outside of Teams is not permitted. It's okay to share code if made available to all Participants on the forums.
125
-
126
- 6. SUBMISSION CODE REQUIREMENTS
127
- a. Private Code Sharing. Unless otherwise specifically permitted under the Competition Website or Competition Specific Rules above, during the Competition Period, you are not allowed to privately share source or executable code developed in connection with or based upon the Competition Data or other source or executable code relevant to the Competition (“Competition Code”). This prohibition includes sharing Competition Code between separate Teams, unless a Team merger occurs. Any such sharing of Competition Code is a breach of these Competition Rules and may result in disqualification. b. Public Code Sharing. You are permitted to publicly share Competition Code, provided that such public sharing does not violate the intellectual property rights of any third party. If you do choose to share Competition Code or other such code, you are required to share it on Kaggle.com on the discussion forum or notebooks associated specifically with the Competition for the benefit of all competitors. By so sharing, you are deemed to have licensed the shared code under an Open Source Initiative-approved license (see www.opensource.org) that in no event limits commercial use of such Competition Code or model containing or depending on such Competition Code. c. Use of Open Source. Unless otherwise stated in the Specific Competition Rules above, if open source code is used in the model to generate the Submission, then you must only use open source code licensed under an Open Source Initiative-approved license (see www.opensource.org) that in no event limits commercial use of such code or model containing or depending on such code.
-
- 7. DETERMINING WINNERS
- a. Each Submission will be scored and/or ranked by the evaluation metric, or Evaluation Rubric (in the case of Hackathon Competitions), stated on the Competition Website. During the Competition Period, the current ranking will be visible on the Competition Website's Public Leaderboard. The potential winner(s) are determined solely by the leaderboard ranking on the Private Leaderboard, subject to compliance with these Rules. The Public Leaderboard will be based on the public test set and the Private Leaderboard will be based on the private test set. There will be no leaderboards for Hackathon Competitions. b. In the event of a tie, the Submission that was entered first to the Competition will be the winner. In the event a potential winner is disqualified for any reason, the Submission that received the next highest score rank will be chosen as the potential winner. For Hackathon Competitions, each of the top Submissions will get a unique ranking and there will be no tiebreakers.
-
- 8. NOTIFICATION OF WINNERS & DISQUALIFICATION
- a. The potential winner(s) will be notified by email. b. If a potential winner (i) does not respond to the notification attempt within one (1) week from the first notification attempt or (ii) notifies Kaggle within one week after the Final Submission Deadline that the potential winner does not want to be nominated as a winner or does not want to receive a Prize, then, in each case (i) and (ii) such potential winner will not receive any Prize, and an alternate potential winner will be selected from among all eligible entries received based on the Competition’s judging criteria. c. In case (i) and (ii) above Kaggle may disqualify the Participant. However, in case (ii) above, if requested by Kaggle, such potential winner may provide code and documentation to verify the Participant’s compliance with these Rules. If the potential winner provides code and documentation to the satisfaction of Kaggle, the Participant will not be disqualified pursuant to this paragraph. d. Competition Sponsor reserves the right to disqualify any Participant from the Competition if the Competition Sponsor reasonably believes that the Participant has attempted to undermine the legitimate operation of the Competition by cheating, deception, or other unfair playing practices or abuses, threatens or harasses any other Participants, Competition Sponsor or Kaggle. e. A disqualified Participant may be removed from the Competition leaderboard, at Kaggle's sole discretion. If a Participant is removed from the Competition Leaderboard, additional winning features associated with the Kaggle competition platform, for example Kaggle points or medals, may also not be awarded. f. The final leaderboard list will be publicly displayed at Kaggle.com. Determinations of Competition Sponsor are final and binding.
-
- 9. PRIZES
- a. Prize(s) are as described on the Competition Website and are only available for winning during the time period described on the Competition Website. The odds of winning any Prize depends on the number of eligible Submissions received during the Competition Period and the skill of the Participants. b. All Prizes are subject to Competition Sponsor's review and verification of the Participant’s eligibility and compliance with these Rules, and the compliance of the winning Submissions with the Submissions Requirements. In the event that the Submission demonstrates non-compliance with these Competition Rules, Competition Sponsor may at its discretion take either of the following actions: (i) disqualify the Submission(s); or (ii) require the potential winner to remediate within one week after notice all issues identified in the Submission(s) (including, without limitation, the resolution of license conflicts, the fulfillment of all obligations required by software licenses, and the removal of any software that violates the software restrictions). c. A potential winner may decline to be nominated as a Competition winner in accordance with Section 3.8. d. Potential winners must return all required Prize acceptance documents within two (2) weeks following notification of such required documents, or such potential winner will be deemed to have forfeited the prize and another potential winner will be selected. Prize(s) will be awarded within approximately thirty (30) days after receipt by Competition Sponsor or Kaggle of the required Prize acceptance documents. Transfer or assignment of a Prize is not allowed. e. You are not eligible to receive any Prize if you do not meet the Eligibility requirements in Section 2.7 and Section 3.1 above. f. If a Team wins a monetary Prize, the Prize money will be allocated in even shares between the eligible Team members, unless the Team unanimously opts for a different Prize split and notifies Kaggle before Prizes are issued.
-
- 10. TAXES
- a. ALL TAXES IMPOSED ON PRIZES ARE THE SOLE RESPONSIBILITY OF THE WINNERS. Payments to potential winners are subject to the express requirement that they submit all documentation requested by Competition Sponsor or Kaggle for compliance with applicable state, federal, local and foreign (including provincial) tax reporting and withholding requirements. Prizes will be net of any taxes that Competition Sponsor is required by law to withhold. If a potential winner fails to provide any required documentation or comply with applicable laws, the Prize may be forfeited and Competition Sponsor may select an alternative potential winner. Any winners who are U.S. residents will receive an IRS Form-1099 in the amount of their Prize.
-
- 11. GENERAL CONDITIONS
- a. All federal, state, provincial and local laws and regulations apply.
-
- 12. PUBLICITY
- a. You agree that Competition Sponsor, Kaggle and its affiliates may use your name and likeness for advertising and promotional purposes without additional compensation, unless prohibited by law.
-
- 13. PRIVACY
- a. You acknowledge and agree that Competition Sponsor and Kaggle may collect, store, share and otherwise use personally identifiable information provided by you during the Kaggle account registration process and the Competition, including but not limited to, name, mailing address, phone number, and email address (“Personal Information”). Kaggle acts as an independent controller with regard to its collection, storage, sharing, and other use of this Personal Information, and will use this Personal Information in accordance with its Privacy Policy <www.kaggle.com/privacy>, including for administering the Competition. As a Kaggle.com account holder, you have the right to request access to, review, rectification, portability or deletion of any personal data held by Kaggle about you by logging into your account and/or contacting Kaggle Support at <www.kaggle.com/contact>. b. As part of Competition Sponsor performing this contract between you and the Competition Sponsor, Kaggle will transfer your Personal Information to Competition Sponsor, which acts as an independent controller with regard to this Personal Information. As a controller of such Personal Information, Competition Sponsor agrees to comply with all U.S. and foreign data protection obligations with regard to your Personal Information. Kaggle will transfer your Personal Information to Competition Sponsor in the country specified in the Competition Sponsor Address listed above, which may be a country outside the country of your residence. Such country may not have privacy laws and regulations similar to those of the country of your residence.
-
- 14. WARRANTY, INDEMNITY AND RELEASE
- a. You warrant that your Submission is your own original work and, as such, you are the sole and exclusive owner and rights holder of the Submission, and you have the right to make the Submission and grant all required licenses. You agree not to make any Submission that: (i) infringes any third party proprietary rights, intellectual property rights, industrial property rights, personal or moral rights or any other rights, including without limitation, copyright, trademark, patent, trade secret, privacy, publicity or confidentiality obligations, or defames any person; or (ii) otherwise violates any applicable U.S. or foreign state or federal law. b. To the maximum extent permitted by law, you indemnify and agree to keep indemnified Competition Entities at all times from and against any liability, claims, demands, losses, damages, costs and expenses resulting from any of your acts, defaults or omissions and/or a breach of any warranty set forth herein. To the maximum extent permitted by law, you agree to defend, indemnify and hold harmless the Competition Entities from and against any and all claims, actions, suits or proceedings, as well as any and all losses, liabilities, damages, costs and expenses (including reasonable attorneys fees) arising out of or accruing from: (a) your Submission or other material uploaded or otherwise provided by you that infringes any third party proprietary rights, intellectual property rights, industrial property rights, personal or moral rights or any other rights, including without limitation, copyright, trademark, patent, trade secret, privacy, publicity or confidentiality obligations, or defames any person; (b) any misrepresentation made by you in connection with the Competition; (c) any non-compliance by you with these Rules or any applicable U.S. or foreign state or federal law; (d) claims brought by persons or entities other than the parties to these Rules arising from or related to your involvement with the Competition; and (e) your acceptance, possession, misuse or use of any Prize, or your participation in the Competition and any Competition-related activity. c. You hereby release Competition Entities from any liability associated with: (a) any malfunction or other problem with the Competition Website; (b) any error in the collection, processing, or retention of any Submission; or (c) any typographical or other error in the printing, offering or announcement of any Prize or winners.
-
- 15. INTERNET
- a. Competition Entities are not responsible for any malfunction of the Competition Website or any late, lost, damaged, misdirected, incomplete, illegible, undeliverable, or destroyed Submissions or entry materials due to system errors, failed, incomplete or garbled computer or other telecommunication transmission malfunctions, hardware or software failures of any kind, lost or unavailable network connections, typographical or system/human errors and failures, technical malfunction(s) of any telephone network or lines, cable connections, satellite transmissions, servers or providers, or computer equipment, traffic congestion on the Internet or at the Competition Website, or any combination thereof, which may limit a Participant’s ability to participate.
-
- 16. RIGHT TO CANCEL, MODIFY OR DISQUALIFY
- a. If for any reason the Competition is not capable of running as planned, including infection by computer virus, bugs, tampering, unauthorized intervention, fraud, technical failures, or any other causes which corrupt or affect the administration, security, fairness, integrity, or proper conduct of the Competition, Competition Sponsor reserves the right to cancel, terminate, modify or suspend the Competition. Competition Sponsor further reserves the right to disqualify any Participant who tampers with the submission process or any other part of the Competition or Competition Website. Any attempt by a Participant to deliberately damage any website, including the Competition Website, or undermine the legitimate operation of the Competition is a violation of criminal and civil laws. Should such an attempt be made, Competition Sponsor and Kaggle each reserves the right to seek damages from any such Participant to the fullest extent of the applicable law.
-
- 17. NOT AN OFFER OR CONTRACT OF EMPLOYMENT
- a. Under no circumstances will the entry of a Submission, the awarding of a Prize, or anything in these Rules be construed as an offer or contract of employment with Competition Sponsor or any of the Competition Entities. You acknowledge that you have submitted your Submission voluntarily and not in confidence or in trust. You acknowledge that no confidential, fiduciary, agency, employment or other similar relationship is created between you and Competition Sponsor or any of the Competition Entities by your acceptance of these Rules or your entry of your Submission.
-
- 18. DEFINITIONS
- a. "Competition Data" are the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to Participants. b. An “Entry” is when a Participant has joined, signed up, or accepted the rules of a competition. Entry is required to make a Submission to a competition. c. A “Final Submission” is the Submission selected by the user, or automatically selected by Kaggle in the event not selected by the user, that is/are used for final placement on the competition leaderboard. d. A “Participant” or “Participant User” is an individual who participates in a competition by entering the competition and making a Submission. e. The “Private Leaderboard” is a ranked display of Participants’ Submission scores against the private test set. The Private Leaderboard determines the final standing in the competition. f. The “Public Leaderboard” is a ranked display of Participants’ Submission scores against a representative sample of the test data. This leaderboard is visible throughout the competition. g. A “Sponsor” is responsible for hosting the competition, which includes but is not limited to providing the data for the competition, determining winners, and enforcing competition rules. h. A “Submission” is anything provided by the Participant to the Sponsor to be evaluated for competition purposes and determine leaderboard position. A Submission may be made as a model, notebook, prediction file, or other format as determined by the Sponsor. i. A “Team” is one or more Participants participating together in a Kaggle competition, by officially merging together as a Team within the competition platform.
 
docs/deploy_medgemma_hf.md CHANGED
@@ -8,7 +8,7 @@ OpenAI-compatible API.
 
 | Feature | Details |
 |---|---|
- | **Model** | `google/medgemma-27b-text-it` (HAI-DEF, competition-required) |
+ | **Model** | `google/medgemma-27b-text-it` (HAI-DEF) |
 | **Cost** | ~$2.50/hr (1× A100 80 GB on AWS) |
 | **Scale-to-zero** | Yes — no charges while idle |
 | **API format** | OpenAI-compatible (TGI) — zero code changes |
@@ -101,7 +101,7 @@ python -m validation.run_validation --medqa --max-cases 50 --seed 42 --delay 2
 |---|---|---|
 | Validation run (120 cases @ ~1 min/case) | ~2 hrs | ~$5 |
 | Development / debugging (4 hrs) | ~4 hrs | ~$10 |
- | Competition demo recording | ~1 hr | ~$2.50 |
+ | Demo recording | ~1 hr | ~$2.50 |
 | **Total estimated** | **~7 hrs** | **~$17.50** |
 
 With scale-to-zero enabled, the endpoint automatically shuts down after 15 min
docs/kaggle_writeup.md DELETED
@@ -1,87 +0,0 @@
- # CDS Agent — Agentic Clinical Decision Support System
-
- ### Project name
-
- **CDS Agent** — An agentic pipeline that orchestrates MedGemma across six specialized clinical reasoning steps, augmented with drug safety APIs and guideline RAG, to produce comprehensive decision support reports in real time.
-
- ### Your team
-
- | Name | Specialty | Role |
- |------|-----------|------|
- | [Your Name] | Software Engineering / AI | Architecture, full-stack development, agent pipeline, RAG system, validation framework |
-
- ### Problem statement
-
- **The problem:** Clinical decision-making is among the most cognitively demanding tasks in medicine. For every patient encounter, a clinician must simultaneously parse the clinical narrative, generate a differential diagnosis, recall drug interactions across the medication list, remember relevant clinical guidelines, and synthesize all of this into a care plan — often while fatigued and managing multiple patients.
-
- This cognitive burden has real consequences. Diagnostic errors affect approximately 12 million Americans annually. Medication errors harm over 1.5 million people per year. Many of these errors are not from lack of knowledge, but from the difficulty of integrating information from multiple sources under time pressure.
-
- **Who benefits:** Emergency physicians, hospitalists, and primary care clinicians — anyone making complex diagnostic and treatment decisions at the point of care. Patients benefit from more thorough, evidence-based care with fewer diagnostic and medication errors.
-
- **Impact potential:** The U.S. alone sees ~140 million ED visits per year. Even a modest improvement in diagnostic completeness or medication safety across a fraction of these encounters represents significant harm reduction. Our system surfaces specific, actionable conflicts between clinical guidelines and patient data — the kind of gap that leads to missed diagnoses, omitted treatments, and monitoring failures. By automating the information-gathering and synthesis steps of clinical reasoning, CDS Agent gives clinicians back cognitive bandwidth for the parts of medicine that require human judgment.
-
- ### Overall solution
-
- **HAI-DEF model:** MedGemma (`google/medgemma-27b-text-it`) — Google's medical-domain model from the Health AI Developer Foundations collection, deployed on a HuggingFace Dedicated Endpoint (1× A100 80 GB, TGI, bfloat16).
-
- **Why MedGemma is essential, not bolted on:** MedGemma is the reasoning engine in four of six pipeline steps. It is not a wrapper around a general-purpose model — it leverages MedGemma's medical training to:
-
- 1. **Parse** free-text clinical narratives into structured patient profiles (demographics, vitals, labs, medications, allergies, history)
- 2. **Reason** about the case via chain-of-thought to produce a ranked differential diagnosis with explicit evidence for/against each candidate
- 3. **Detect conflicts** between guideline recommendations and the patient's actual data — identifying omissions, contradictions, dosage concerns, and monitoring gaps
- 4. **Synthesize** all pipeline outputs into a comprehensive CDS report with recommendations, warnings, and citations
-
- Steps 3 and 4 augment MedGemma with external tools: **OpenFDA + RxNorm APIs** for drug interaction data, and **ChromaDB RAG** over 62 curated clinical guidelines spanning 14 specialties (sourced from ACC/AHA, ADA, GOLD, GINA, IDSA, ACOG, AAN, and others).
-
- The agentic architecture is critical: no single LLM call can parse patient data, check drug interactions against federal databases, retrieve specialty-specific guidelines, AND cross-reference those guidelines against the patient's profile. The orchestrated pipeline produces results that no individual component could achieve alone.
-
- ### Technical details
-
- **Architecture:**
-
- ```
- Frontend (Next.js 14) ←WebSocket→ Backend (FastAPI)
- ↓
- Orchestrator (6-step pipeline)
- ├── Step 1: Parse Patient Data (MedGemma)
- ├── Step 2: Clinical Reasoning (MedGemma)
- ├── Step 3: Drug Interaction Check (OpenFDA + RxNorm)
- ├── Step 4: Guideline Retrieval (ChromaDB RAG, 62 guidelines)
- ├── Step 5: Conflict Detection (MedGemma)
- └── Step 6: Synthesis (MedGemma)
- ```
-
- All inter-step data is strongly typed (Pydantic v2). Each step streams its status to the frontend via WebSocket — the clinician watches the pipeline execute in real time, building trust through transparency.
-
- **Key design decisions:**
- - **Custom orchestrator** over LangChain — simpler, more transparent, no framework overhead
- - **Conflict detection over confidence scores** — we deliberately rejected numeric "confidence" scores (uncalibrated LLM outputs create dangerous anchoring bias). Instead, we compare guidelines against patient data to surface specific, actionable conflicts with cited sources and suggested resolutions.
- - **RAG with curated guidelines** — 62 guidelines across 14 specialties, indexed with sentence-transformer embeddings (all-MiniLM-L6-v2). 100% top-1 retrieval accuracy across 30 test queries.
-
- **Validation results:**
-
- | Test | Result |
- |------|--------|
- | RAG retrieval accuracy | 30/30 (100%) — correct guideline ranked #1 for every query |
- | E2E pipeline (ACS case) | All 6 steps passed, 75 s total |
- | Clinical test suite | 22 scenarios across 14 specialties |
- | MedQA (50 USMLE cases) | 94% pipeline success, 36% top-1 diagnostic accuracy, 38% mentioned |
- | MedQA diagnostic-only (36 cases) | 39% mentioned correct diagnosis in report |
-
- The 36% top-1 on MedQA reflects that many questions are non-diagnostic (treatment, mechanism, statistics) — the pipeline generates differential diagnoses, not multiple-choice answers. On diagnostic questions specifically, 39% mentioned the correct diagnosis.
-
- **Deployment:**
- - **Model hosting:** HuggingFace Dedicated Endpoint (`medgemma-27b-cds`), 1× A100 80 GB, scale-to-zero billing
- - **HIPAA path:** MedGemma is open-weight and can be self-hosted on-premises, eliminating external data transmission
- - **Scalability:** FastAPI async + uvicorn workers; production path includes task queue and horizontal scaling
- - **EHR integration:** Current input is manual text paste; production system would use FHIR APIs for automatic patient data extraction
-
- **Stack:** Python 3.10, FastAPI, ChromaDB, sentence-transformers, Next.js 14, React 18, TypeScript, Tailwind CSS
-
- ---
-
- **Links:**
- - **Video:** [TODO — insert video link]
- - **Code:** [github.com/bshepp/clinical-decision-support-agent](https://github.com/bshepp/clinical-decision-support-agent)
- - **Live Demo:** [TODO — insert demo link if deployed]
- - **HuggingFace Model:** [google/medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it)
 
docs/video_script.md DELETED
@@ -1,125 +0,0 @@
- # CDS Agent — Demo Video Script
-
- > **Target length:** 3 minutes (max)
- > **Format:** Screen recording with voiceover
- > **Tool suggestion:** OBS Studio, Loom, or similar
-
- ---
-
- ## PRE-RECORDING CHECKLIST
-
- - [ ] Ensure HF Dedicated Endpoint is running (check `https://bshepp-cds-agent.hf.space/api/health/config`)
- - [ ] Open browser to `https://demo.briansheppard.com` (or `https://bshepp-cds-agent.hf.space`)
- - [ ] Close unnecessary tabs/notifications
- - [ ] Submit one case end-to-end before recording to confirm model is warm (watch for warm-up screen)
- - [ ] Browser zoom ~110-125% for readability on video
- - [ ] **Local fallback** (if Space is down): `cd src/backend && uvicorn app.main:app --host 0.0.0.0 --port 8002` + `cd src/frontend && npm run dev`, then open `http://localhost:3000`
-
- ---
-
- ## SCRIPT
-
- ### OPENING — The Problem (0:00 – 0:30)
-
- **[SCREEN: Title slide or the app landing page]**
-
- > "Clinical decision-making is one of the most cognitively demanding tasks in medicine. For every patient, a clinician must simultaneously parse the history, generate a differential, recall drug interactions, remember guidelines, and synthesize a care plan — all under time pressure.
- >
- > Diagnostic errors affect 12 million Americans annually. Many aren't from lack of knowledge — they're from the difficulty of integrating information from multiple sources at once.
- >
- > CDS Agent solves this with an agentic pipeline powered by MedGemma."
-
- ---
-
- ### LIVE DEMO — The Pipeline in Action (0:30 – 2:00)
-
- **[SCREEN: App interface — PatientInput component visible]**
-
- > "Let me show you how it works. I'll load a built-in sample case — a 55-year-old male presenting to the ED with acute substernal chest pain radiating to his left arm and jaw, with diaphoresis and nausea. He has hypertension, type 2 diabetes, and hyperlipidemia, and he's on metformin, lisinopril, atorvastatin, and aspirin."
-
- **[ACTION: Click the "Chest Pain (55M)" sample case button, then click "Analyze Patient Case"]**
-
- > "When I submit this case, the agent pipeline kicks off. You can see each step executing in real time on the left."
-
- **[SCREEN: AgentPipeline component showing steps lighting up one by one]**
-
- > "Step 1 — MedGemma parses the free-text narrative into structured patient data: demographics, vitals, labs, medications, allergies, history."
-
- **[Wait for Step 1 to complete]**
-
- > "Step 2 — Clinical reasoning. MedGemma generates a ranked differential diagnosis with chain-of-thought reasoning. It's considering ACS, GERD, PE, aortic dissection — weighing evidence for and against each."
-
- **[Wait for Step 2 to complete]**
-
- > "Steps 3 and 4 run in parallel. Step 3 — Drug interaction check. This isn't the LLM guessing — it's querying the actual OpenFDA and RxNorm databases for his four medications. Real API data, not hallucination. Step 4 — Guideline retrieval. Our RAG system searches 62 curated clinical guidelines across 14 specialties. For this case it pulls the ACC/AHA chest pain and ACS guidelines."
-
- **[Wait for Steps 3 & 4 to complete]**
-
- > "Step 5 — and this is what makes it a real safety tool — Conflict Detection. MedGemma compares what the guidelines recommend against what the patient is actually receiving. It surfaces omissions, contradictions, dosage concerns, and monitoring gaps."
-
- **[Wait for Step 5 to complete]**
-
- > "Step 6 — Synthesis. Everything gets integrated into a single comprehensive report."
-
- **[Wait for Step 6 to complete. Total pipeline ~2-3 minutes]**
-
- ---
-
- ### THE REPORT — Reviewing Results (2:00 – 2:40)
-
- **[SCREEN: Scroll through the CDSReport component]**
-
- > "Here's the CDS report. At the top — the ranked differential diagnosis. ACS is correctly identified as the leading diagnosis, with clear reasoning. The elevated troponin and ST elevation in II, III, and aVF support an inferior STEMI."
-
- **[ACTION: Scroll to drug interactions section]**
-
- > "Drug interaction warnings pulled from federal databases — not LLM-generated, real data."
-
- **[ACTION: Scroll to Conflicts & Gaps section — highlight the red-bordered cards]**
-
- > "This is the most important section — Conflicts and Gaps. Each card shows a specific conflict: what the guideline recommends, what the patient data shows, the severity, and a suggested resolution. These are the gaps that lead to missed diagnoses and omitted treatments in real clinical practice."
-
- **[ACTION: Scroll to guidelines section]**
-
- > "Cited guideline recommendations from authoritative sources — ACC/AHA, ADA, and others."
-
- **[ACTION: Click the "Download .md" button in the left panel]**
-
- > "And clinicians can download the full report as Markdown for their records."
-
- ---
-
- ### CLOSING — Technical & Impact (2:40 – 3:00)
-
- **[SCREEN: Back to app overview or a summary slide]**
-
- > "Under the hood: MedGemma 27B powers four of six pipeline steps — parsing, reasoning, conflict detection, and synthesis. It's augmented with OpenFDA and RxNorm APIs for drug safety, and a 62-guideline RAG corpus for evidence-based recommendations.
- >
- > We validated on 50 MedQA USMLE cases with 94% pipeline reliability and 38% diagnostic mention rate — before any fine-tuning.
- >
- > With 140 million ED visits per year in the U.S. alone, even a modest improvement in diagnostic completeness and medication safety represents lives saved. CDS Agent is built to make that happen."
-
- **[END]**
-
- ---
-
- ## TIMING SUMMARY
-
- | Section | Duration | Cumulative |
- |---------|----------|------------|
- | Opening — The Problem | 30 sec | 0:30 |
- | Live Demo — Pipeline Execution | 90 sec | 2:00 |
- | Report Review | 40 sec | 2:40 |
- | Closing — Tech & Impact | 20 sec | 3:00 |
-
- > **Note on timing:** The pipeline typically takes 2-3 minutes on the live endpoint. You can speed up the wait portions (1.5x-2x) in post-editing while keeping narration at normal speed to fit within 3 minutes. Alternatively, record narration separately and overlay it.
-
- ## TIPS
-
- - **Warm up before recording** — Submit a test case first. If the model has scaled to zero you'll see a "Model Warming Up" spinner; wait for it to complete (~1-2 min) before the real recording
- - **Speak during pipeline wait times** — the pipeline execution is perfect narration time
- - **Don't rush** — the real-time pipeline visualization IS the demo; let it breathe
- - **Zoom into the Conflicts section** — it's the most visually impressive and differentiating feature
- - **If the endpoint is slow** — speed up wait portions in post-editing (1.5x-2x) while keeping narration at normal speed
- - **Retry resilience** — if a pipeline run fails, the "Try Again" button lets you retry without reloading the page
- - **Backup plan** — if the HF endpoint is down, you can use Google AI Studio with Gemma 3 27B IT as a fallback (update .env accordingly)
 
docs/writeup_draft.md DELETED
@@ -1,169 +0,0 @@
- # CDS Agent — Project Writeup
-
- > Competition writeup template filled in with actual project details.
- > Also serves as the primary project summary document.
-
- ---
-
- ### Project name
-
- **CDS Agent** — Agentic Clinical Decision Support System
-
- ### Your team
-
- | Name | Specialty | Role |
- |------|-----------|------|
- | (Developer) | Software Engineering / AI | Full-stack development, agent architecture, RAG system, testing |
-
- ### Problem statement
-
- **The Problem:**
-
- Clinical decision-making is one of the most cognitively demanding tasks in medicine. A clinician seeing a patient must simultaneously: review the patient's history and current presentation, mentally generate a differential diagnosis, recall drug interactions for current and proposed medications, remember relevant clinical guidelines, and synthesize all of this into a coherent care plan — often while fatigued, time-pressured, and managing multiple patients.
-
- Medical errors remain a leading cause of patient harm. Studies estimate that diagnostic errors affect approximately 12 million Americans annually, and medication errors harm over 1.5 million people per year. Many of these errors stem not from lack of knowledge, but from the cognitive burden of integrating information from multiple sources under time pressure.
-
- **Who is affected:**
-
- - **Clinicians** (primary users) — physicians, nurse practitioners, physician assistants in emergency departments, urgent care, and inpatient settings where rapid, comprehensive decision-making is critical
- - **Patients** — who benefit from more thorough, evidence-based care with fewer diagnostic and medication errors
- - **Health systems** — which bear the cost of medical errors, readmissions, and liability
-
- **Why AI is the right solution:**
-
- This problem cannot be solved with traditional rule-based systems because:
- 1. Clinical reasoning requires understanding free-text narratives, not just coded data
- 2. Differential diagnosis generation requires probabilistic reasoning over thousands of conditions
- 3. Guideline retrieval requires semantic understanding of clinical context
- 4. Synthesis requires integrating heterogeneous data (structured labs, free-text guidelines, API-sourced drug data) into coherent recommendations
-
- Large language models — specifically medical-domain models like MedGemma — can perform all of these tasks. But a single LLM call is insufficient. The agent architecture orchestrates the LLM across multiple specialized steps, augmented with external tools (drug APIs, RAG), to produce a result that no single component could achieve alone.
-
- **Impact potential:**
-
- If deployed, this system could:
- - Reduce diagnostic error rates by providing systematic differential diagnosis generation for every patient encounter
- - Catch drug interactions that clinicians might miss, especially in polypharmacy patients
- - Ensure guideline-concordant care by surfacing relevant, current clinical guidelines at the point of care
- - Save clinician time by automating the information-gathering and synthesis steps of clinical reasoning
-
- Estimated reach: There are approximately 140 million ED visits per year in the US alone. Even a modest improvement in diagnostic accuracy or medication safety across a fraction of these encounters would represent significant impact.
-
- ### Overall solution
-
- **HAI-DEF models used:**
-
- - **MedGemma** (`google/medgemma-27b-text-it`) — Google's medical-domain model from the Health AI Developer Foundations (HAI-DEF) collection
- - Development/validation also performed with **Gemma 3 27B IT** (`gemma-3-27b-it`) via Google AI Studio for rapid iteration
-
- **Why MedGemma:**
-
- MedGemma is purpose-built for medical applications and is part of Google's HAI-DEF collection:
- - Trained specifically for health and biomedical tasks, providing stronger clinical reasoning than general-purpose models
- - Open-weight model that can be self-hosted for HIPAA compliance in production
- - Large enough (27B parameters) for complex chain-of-thought clinical reasoning
- - Designed to be the foundation for healthcare AI applications — exactly what this competition demands
-
- **How the model is used:**
-
- The model serves as the reasoning engine in a 6-step agentic pipeline:
-
- 1. **Patient Data Parsing** (LLM) — Extracts structured patient data from free-text clinical narratives
- 2. **Clinical Reasoning** (LLM) — Generates ranked differential diagnoses with chain-of-thought reasoning
- 3. **Drug Interaction Check** (External APIs) — Queries OpenFDA and RxNorm for medication safety
- 4. **Guideline Retrieval** (RAG) — Retrieves relevant clinical guidelines from a 62-guideline corpus using ChromaDB
- 5. **Conflict Detection** (LLM) — Compares guideline recommendations against patient data to identify omissions, contradictions, dosage concerns, monitoring gaps, allergy risks, and interaction gaps
- 6. **Synthesis** (LLM) — Integrates all outputs into a comprehensive CDS report with conflicts prominently featured
-
- The model is used in Steps 1, 2, 5, and 6 — parsing, reasoning, conflict detection, and synthesis. This demonstrates the model being used "to its fullest potential" across multiple distinct clinical tasks within a single workflow.
-
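The six-step flow above can be sketched as a chain of state-transforming functions. This is an illustrative sketch only: the function names and the dict-based state are invented here, not the project's actual API, and the LLM and external-API calls are stubbed with canned values.

```python
# Hypothetical sketch of the 6-step agentic pipeline; all names are
# illustrative and every LLM/API call is replaced by a stub.
from typing import Any, Callable

Step = Callable[[dict[str, Any]], dict[str, Any]]

def parse_patient(state):        # Step 1 (LLM in the real system)
    state["patient"] = {"age": 58, "complaint": state["narrative"]}
    return state

def clinical_reasoning(state):   # Step 2 (LLM)
    state["differential"] = ["ACS", "GERD", "PE"]
    return state

def drug_check(state):           # Step 3 (OpenFDA/RxNorm in the real system)
    state["interactions"] = []
    return state

def guideline_retrieval(state):  # Step 4 (RAG)
    state["guidelines"] = ["chest pain guideline excerpt"]
    return state

def conflict_detection(state):   # Step 5 (LLM)
    state["conflicts"] = []
    return state

def synthesis(state):            # Step 6 (LLM)
    state["report"] = f"Top differential: {state['differential'][0]}"
    return state

PIPELINE: list[Step] = [parse_patient, clinical_reasoning, drug_check,
                        guideline_retrieval, conflict_detection, synthesis]

def run_pipeline(narrative: str) -> dict[str, Any]:
    # Each step reads and enriches a shared state dict in sequence.
    state: dict[str, Any] = {"narrative": narrative}
    for step in PIPELINE:
        state = step(state)
    return state

result = run_pipeline("58M with crushing substernal chest pain")
print(result["report"])  # → Top differential: ACS
```

The sequential-chain shape is the point here; the real system additionally streams per-step progress to the frontend.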
- ### Technical details
-
- **Architecture:**
-
- ```
- Frontend (Next.js 14) ←→ Backend (FastAPI + Python 3.10)
-             ↓
- Orchestrator (6-step pipeline)
- ├── Step 1: Patient Parser (LLM)
- ├── Step 2: Clinical Reasoning (LLM)
- ├── Step 3: Drug Check (OpenFDA + RxNorm APIs)
- ├── Step 4: Guideline Retrieval (ChromaDB RAG)
- ├── Step 5: Conflict Detection (LLM)
- └── Step 6: Synthesis (LLM)
- ```
-
- All inter-step data is strongly typed with Pydantic v2 models. The pipeline streams each step's progress to the frontend via WebSocket for real-time visibility.
-
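The strong typing between steps can be illustrated with a dependency-free sketch. The project itself uses Pydantic v2; the stdlib `dataclasses` version below shows the same idea, and every field and class name here is invented for illustration.

```python
# Stdlib stand-in for the Pydantic v2 inter-step models; class and
# field names are hypothetical, not taken from the project.
from dataclasses import dataclass, field

@dataclass
class DifferentialItem:
    diagnosis: str
    rank: int            # 1 = most likely
    rationale: str = ""

@dataclass
class ReasoningOutput:
    differential: list[DifferentialItem] = field(default_factory=list)

    def top1(self) -> str:
        # Downstream steps use typed accessors instead of digging
        # through untyped dicts, so schema drift fails loudly.
        return min(self.differential, key=lambda d: d.rank).diagnosis

out = ReasoningOutput(differential=[
    DifferentialItem("GERD", rank=2),
    DifferentialItem("Acute coronary syndrome", rank=1),
])
print(out.top1())  # → Acute coronary syndrome
```

Pydantic adds runtime validation and serialization on top of this shape, which matters when step outputs come back as model-generated JSON.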
- **Fine-tuning:**
-
- No fine-tuning was performed in the current version. The base MedGemma model (`medgemma-27b-text-it`) was used with carefully crafted prompt engineering for each pipeline step. Fine-tuning on clinical reasoning datasets is a planned improvement.
-
- **Performance analysis:**
-
- | Test | Result |
- |------|--------|
- | E2E pipeline (chest pain / ACS) | All 6 steps passed, ~75–85 s total |
- | RAG retrieval quality | 30/30 queries passed (100%), avg relevance 0.639 |
- | Clinical test suite | 22 scenarios across 14 specialties |
- | Top-1 RAG accuracy | 100% — correct guideline ranked #1 for all queries |
- | **MedQA 50-case validation** | **36% top-1, 38% top-3, 38% mentioned, 94% pipeline success** |
- | MedQA diagnostic-only (36 cases) | 39% mentioned, 14% differential |
-
- **Application stack:**
-
- | Layer | Technology |
- |-------|-----------|
- | Frontend | Next.js 14, React 18, TypeScript, Tailwind CSS |
- | Backend | FastAPI, Python 3.10, Pydantic v2, WebSocket |
- | LLM | MedGemma 27B Text IT (HAI-DEF) + Gemma 3 27B IT for dev |
- | RAG | ChromaDB + sentence-transformers (all-MiniLM-L6-v2) |
- | Drug Data | OpenFDA API, RxNorm / NLM API |
-
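The RAG layer in the stack table can be reduced to a minimal ranking sketch. In the real system embeddings come from all-MiniLM-L6-v2 and ranking is handled by ChromaDB; the hand-made 3-dimensional vectors and guideline IDs below are placeholders so the cosine-similarity ranking runs standalone.

```python
# Toy stand-in for ChromaDB + sentence-transformers retrieval.
# Vectors and guideline IDs are invented; only the ranking logic
# mirrors what a vector store does internally.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# guideline_id -> pretend embedding
corpus = {
    "acs_guideline":    [0.9, 0.1, 0.0],
    "copd_guideline":   [0.1, 0.9, 0.1],
    "sepsis_guideline": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, k=2):
    # Rank the whole corpus by similarity and keep the top k.
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [gid for gid, _ in scored[:k]]

# A "chest pain" query vector points toward the ACS guideline.
print(retrieve([1.0, 0.0, 0.1]))
```

The reported "avg relevance 0.639" is exactly this kind of similarity score, averaged over retrieved guideline chunks.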
- **Deployment considerations:**
-
- - **HIPAA compliance:** MedGemma is an open-weight model that can be self-hosted on-premises, eliminating the need to send patient data to external APIs. This is critical for healthcare deployment.
- - **Latency:** The current pipeline takes ~75 s for a single E2E case (local), or ~204 s avg on the HuggingFace Dedicated Endpoint (50-case MedQA validation). For production, this could be reduced with smaller/distilled models, parallel LLM calls, or GPU-accelerated inference with higher throughput.
- - **Scalability:** FastAPI + uvicorn supports async request handling. For high-throughput deployment, add worker processes and a task queue (e.g., Celery).
- - **EHR integration:** Current input is manual text paste. A production system would integrate with EHR systems via FHIR APIs for automatic patient data extraction.
-
- ### Validation methodology
-
- The project includes an external dataset validation framework (`src/backend/validation/`) that tests the full pipeline against real-world clinical data:
-
- | Dataset | Source | What It Tests |
- |---------|--------|---------------|
- | **MedQA (USMLE)** | HuggingFace (1,273 test cases) | Diagnostic accuracy — does the pipeline's top differential match the USMLE correct answer? |
- | **MTSamples** | GitHub (~5,000 medical transcriptions) | Parse quality, field completeness, specialty alignment on real clinical notes |
- | **PMC Case Reports** | PubMed E-utilities (dynamic) | Diagnostic accuracy on published case reports with known diagnoses |
-
- The validation harness calls the `Orchestrator` directly (no HTTP server), enabling rapid batch testing. Each dataset has a dedicated harness that fetches data, converts it to patient narratives, runs the pipeline, and scores the output against ground truth.
-
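The harness pattern described above (drive the pipeline directly, then score its top differential against ground truth) can be sketched as follows. The orchestrator here is a stub and the case data is invented; only the scoring shape is meant to carry over.

```python
# Sketch of top-1 accuracy scoring as used by a batch validation
# harness. `fake_orchestrator` is a stub; the real harness calls the
# project's Orchestrator directly.

def fake_orchestrator(narrative: str) -> list[str]:
    # Stand-in for the full pipeline: returns a ranked differential.
    if "chest pain" in narrative:
        return ["Acute coronary syndrome", "GERD"]
    return ["Viral URI"]

cases = [
    {"narrative": "58M with chest pain", "answer": "Acute coronary syndrome"},
    {"narrative": "24F with cough",      "answer": "Pneumonia"},
]

def top1_accuracy(cases, run):
    # A case counts as a hit only if the top-ranked diagnosis
    # exactly matches the ground-truth answer.
    hits = sum(1 for c in cases if run(c["narrative"])[0] == c["answer"])
    return hits / len(cases)

print(top1_accuracy(cases, fake_orchestrator))  # → 0.5
```

Looser metrics like "mentioned in report" follow the same loop with a substring check over the synthesized report instead of an exact top-1 match.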
- **Initial smoke test (3 MedQA cases):** 100% parse success, 66.7% top-1 diagnostic accuracy, ~94 s avg per case.
-
- **50-case MedQA validation (MedGemma 27B via HF Endpoint):** 94% pipeline success, 36% top-1 diagnostic accuracy, 38% mentioned in report, 204 s avg per case. On diagnostic-only questions (36/50), 39% mentioned the correct diagnosis. Full results in [docs/test_results.md](docs/test_results.md).
-
- **Practical usage:**
-
- In a real clinical setting, the system would be used at the point of care:
- 1. Clinician opens the CDS Agent interface (embedded in the EHR or as a standalone app)
- 2. Patient data is automatically pulled from the EHR (or pasted manually)
- 3. The agent pipeline runs in ~60–90 seconds, during which the clinician can continue other tasks
- 4. The CDS report appears with:
-    - Ranked differential diagnoses with reasoning chains (transparent AI)
-    - Drug interaction warnings with severity levels
-    - **Conflicts & gaps** between guideline recommendations and the patient's actual data — prominently displayed with specific guideline citations, patient data comparisons, and suggested resolutions
-    - Relevant clinical guideline excerpts with citations to authoritative sources
-    - Suggested next steps (immediate, short-term, long-term)
- 5. The clinician reviews the recommendations and incorporates them into their clinical judgment
-
- The system is explicitly designed as a **decision support** tool, not a decision-making tool. All recommendations include caveats and limitations. The clinician retains full authority over patient care.
-
- ---
-
- **Links:**
-
- - Video: [To be recorded]
- - Code Repository: [github.com/bshepp/clinical-decision-support-agent](https://github.com/bshepp/clinical-decision-support-agent)
- - Live Demo: [To be deployed]
- - Hugging Face Model: [google/medgemma-27b-text-it](https://huggingface.co/google/medgemma-27b-text-it)