# TRACKS.md — Experimental Track Registry

> **Single source of truth** for all experimental tracks, their file ownership, tagging conventions, and isolation rules.
> Referenced by [CLAUDE.md](CLAUDE.md). Read that file first for general project context.

---

## Why Tracks?

The baseline pipeline (Track A) achieves 36% top-1 diagnostic accuracy on MedQA. To improve this, we are evaluating **multiple independent strategies** in parallel. Each strategy is an isolated "track" with its own code, configuration, and results — so we can compare them fairly without cross-contamination.

---

## Track Registry

| ID | Name | Directory | Strategy |
|----|------|-----------|----------|
| **A** | Baseline | `src/backend/app/` | The production 6-step pipeline. No modifications for experiments. |
| **B** | RAG Variants | `src/backend/tracks/rag_variants/` | Test different chunking sizes, segment strategies, and embedding models to optimize guideline retrieval quality and downstream diagnostic accuracy. |
| **C** | Iterative Refinement | `src/backend/tracks/iterative/` | Run the diagnosis step in a serial loop — each iteration critiques and refines the previous output. Continue until the marginal improvement drops below a cost/benefit threshold. Produces a convergence chart. |
| **D** | Arbitrated Parallel | `src/backend/tracks/arbitrated/` | Run multiple specialist reasoning agents in parallel. An arbiter agent evaluates all outputs, tailors resubmission prompts for each specialist based on their strengths/weaknesses, and repeats until the cost/benefit ratio plateaus. Produces a cost/benefit chart. |
| **E** | Combined | `src/backend/tracks/combined/` | Compose per-axis winners from B/C/D/F/G/H. Tests 3 composition patterns (breadth-then-depth, depth-within-breadth, bookend). **Phase 3 — build after Phase 1+2 data.** |
| **F** | Prompt Architecture | `src/backend/tracks/prompt_arch/` | Test how reasoning prompt structure affects accuracy: structured template, few-shot, reverse reasoning, Bayesian framing. **Phase 2.** |
| **G** | Multi-Sample Voting | `src/backend/tracks/voting/` | Self-consistency via repeated sampling + majority/weighted vote. 1/3/5 samples at varying temperatures. **Phase 2.** |
| **H** | Evidence Verification | `src/backend/tracks/verification/` | Post-hoc grounding check: verify each diagnosis against patient evidence, re-rank by grounding score. **Phase 2.** |
| **—** | Shared | `src/backend/tracks/shared/` | Cross-track utilities: cost tracking, comparison harness, chart generation. Not a track itself. |

---

## File Tagging Convention

**Every file owned by a track MUST carry a track tag on line 1.** This makes ownership unambiguous when reading any file in isolation.

### Format by file type

| File Type | Tag Format | Example |
|-----------|-----------|---------|
| Python (`.py`) | `# [Track X: Name]` | `# [Track B: RAG Variants]` |
| JSON (`.json`) | First key in object | `{"_track": "Track B: RAG Variants", ...}` |
| Markdown (`.md`) | HTML comment | `<!-- [Track B: RAG Variants] -->` |
| Config (`.env`, `.yaml`) | Comment | `# [Track B: RAG Variants]` |

### Track A exception

Track A files (`src/backend/app/`) were written before the track system existed. They are tagged with `# [Track A: Baseline]` on line 1, but their code is NOT modified for experimental purposes. Experiments extend or wrap Track A code from within their own track directory.
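This convention is easy to lint mechanically. Below is a minimal sketch of such a check; the script, its location, and the regexes are hypothetical, not existing project files:

```python
# [Hypothetical helper — name, location, and regexes are illustrative only]
"""Fail CI if any file under src/backend is missing its line-1 track tag."""
import json
import re
import sys
from pathlib import Path

COMMENT_TAG = re.compile(r"^#\s*\[Track [A-Z]: .+\]")            # .py, .env, .yaml
MARKDOWN_TAG = re.compile(r"^<!--\s*\[Track [A-Z]: .+\]\s*-->")  # .md

def is_tagged(path: Path) -> bool:
    if path.suffix == ".json":
        # JSON carries the tag as the first key of the top-level object.
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and next(iter(data), None) == "_track"
    lines = path.read_text(encoding="utf-8").splitlines()[:1]
    line1 = lines[0] if lines else ""
    pattern = MARKDOWN_TAG if path.suffix == ".md" else COMMENT_TAG
    return bool(pattern.match(line1))

def main(root: str = "src/backend") -> int:
    exts = {".py", ".json", ".md", ".yaml"}
    untagged = [
        p for p in sorted(Path(root).rglob("*"))
        if p.is_file() and (p.suffix in exts or p.name == ".env") and not is_tagged(p)
    ]
    for p in untagged:
        print(f"missing track tag: {p}")
    return 1 if untagged else 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```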
---

## Isolation Rules

These rules prevent cross-contamination between experimental tracks:

### 1. File Ownership

- Each file belongs to exactly **one track** (identified by its line-1 tag and directory).
- Files in `src/backend/app/` belong to **Track A**.
- Files in `src/backend/tracks/<track_dir>/` belong to the corresponding track.
- Files in `src/backend/tracks/shared/` are shared utilities, not owned by any single track.

### 2. No Cross-Modification

- **Never modify a Track A file to serve an experiment.** Instead, import and extend from your track's directory.
- **Never modify a Track B file from Track C code**, and so forth.
- If two tracks need the same utility, put it in `shared/`.

### 3. Import Direction

```
Track B–H code → may import from → Track A (app/) and shared/
Track A code   → NEVER imports   → Track B–H
shared/ code   → may import from → Track A (app/) only
```

### 4. Results Isolation

- Each track stores results in `src/backend/tracks/<track_dir>/results/`.
- Result filenames include the track ID prefix (e.g., `trackB_medqa_20260215.json`).
- Cross-track comparison is done **only** via `src/backend/tracks/shared/compare.py`.

### 5. Configuration Isolation

- Track-specific parameters live in each track's own config or constants — not in `app/config.py`.
- The shared `app/config.py` provides only baseline/global settings (API keys, endpoints, etc.).

---

## Track Details

### Track A: Baseline

**Purpose:** The production-ready pipeline. The control group for all experiments.

**Pipeline:** Parse → Reason → Drug Check → Guideline Retrieval → Conflict Detection → Synthesis

**Key parameters:**

- Embedding: `all-MiniLM-L6-v2` (384 dims)
- RAG top-k: 5
- No guideline chunking (each guideline = 1 document)
- Clinical reasoning temperature: 0.3
- Synthesis temperature: 0.2
- Single-pass reasoning (no iteration)

**Baseline accuracy (50-case MedQA):** 36% top-1, 38% mentioned

---

### Track B: RAG Variants

**Purpose:** Determine whether retrieval quality improvements translate to better diagnostic accuracy.

**Experiments:**

1. **Chunking strategies** — Split each guideline into smaller segments (100-word chunks, 200-word chunks, sentence-level) with configurable overlap
2. **Embedding models** — Compare `all-MiniLM-L6-v2` (384d) vs `all-mpnet-base-v2` (768d) vs `bge-base-en-v1.5` (768d) vs `medcpt` (medical-specific)
3. **Top-k variation** — Test k=3, k=5, k=8, k=10 to find optimal retrieval breadth
4. **Re-ranking** — Add a cross-encoder re-ranking step after initial retrieval

**Measured outcomes:**

- RAG retrieval accuracy (30-query test suite)
- MedQA diagnostic accuracy (same 50-case set, seed=42)
- Retrieval latency per query

**Key files:**

- `src/backend/tracks/rag_variants/config.py` — Variant definitions
- `src/backend/tracks/rag_variants/chunker.py` — Guideline chunking strategies
- `src/backend/tracks/rag_variants/retriever.py` — Modified retrieval with configurable embedding/chunking
- `src/backend/tracks/rag_variants/run_variants.py` — Runner that tests all configurations
- `src/backend/tracks/rag_variants/results/` — Per-variant results

---

### Track C: Iterative Refinement

**Purpose:** Determine whether repeated self-critique improves diagnostic accuracy, and find the point of diminishing returns.

**Method:**

1. Run baseline clinical reasoning (iteration 0)
2. Feed the output back along with the patient data and a critique prompt
3. The model reviews its own differential, identifies weaknesses, and produces a refined version
4. Repeat until: (a) max iterations reached, or (b) the differential stops changing meaningfully
5. Track accuracy and LLM cost at each iteration to produce a convergence/cost-benefit chart

**Measured outcomes:**

- Accuracy at each iteration (top-1, top-3, mentioned)
- LLM token cost at each iteration
- Convergence curve: accuracy vs. cumulative cost
- Iteration at which improvement drops below threshold

**Key files:**

- `src/backend/tracks/iterative/config.py` — Max iterations, convergence threshold
- `src/backend/tracks/iterative/refiner.py` — Iterative reasoning loop with self-critique
- `src/backend/tracks/iterative/run_iterative.py` — Runner with per-iteration scoring
- `src/backend/tracks/iterative/results/` — Per-iteration results and charts
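In outline, the refinement loop pairs a critique step with a convergence test. The sketch below is illustrative: `run_reasoning` and `critique_and_refine` are hypothetical stand-ins for the LLM calls, and the top-3 churn test is one plausible stopping rule (the real loop and threshold live in `refiner.py` and `config.py`):

```python
# [Track C: Iterative Refinement] — illustrative sketch; helper names are hypothetical
from dataclasses import dataclass
from typing import Callable

@dataclass
class IterationResult:
    differential: list[str]  # ranked candidate diagnoses
    tokens_used: int         # LLM cost of this iteration

def refine_until_converged(
    patient_data: dict,
    run_reasoning: Callable[[dict], IterationResult],
    critique_and_refine: Callable[[dict, IterationResult], IterationResult],
    max_iterations: int = 5,
    churn_threshold: float = 0.34,  # stop once at most 1 of the top 3 changes
) -> list[IterationResult]:
    """Run baseline reasoning, then critique-and-refine until convergence."""
    history = [run_reasoning(patient_data)]  # iteration 0
    for _ in range(max_iterations):
        refined = critique_and_refine(patient_data, history[-1])
        prev = set(history[-1].differential[:3])
        curr = set(refined.differential[:3])
        history.append(refined)
        # Churn = fraction of the top-3 that changed between iterations.
        churn = 1 - len(prev & curr) / max(len(curr), 1)
        if churn < churn_threshold:  # the differential has stabilized
            break
    return history  # one entry per iteration, for the convergence/cost chart
```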
---

### Track D: Arbitrated Parallel

**Purpose:** Determine whether multiple specialist agents, coordinated by an arbiter, outperform a single-pass generalist — and at what cost.

**Method:**

1. Run N specialist reasoning agents **in parallel**, each with a domain-specific system prompt (e.g., cardiologist, neurologist, infectious disease specialist)
2. An **arbiter agent** receives all N specialist outputs plus the patient data
3. The arbiter evaluates each specialist's differential, identifies agreements and disagreements
4. The arbiter generates **tailored resubmission prompts** for each specialist — telling the cardiologist "the neurologist raised X, reconsider Y" and vice versa
5. Specialists run again with the arbiter's feedback
6. Repeat until: (a) consensus reached, (b) max rounds, or (c) cost/benefit drops below threshold
7. The arbiter produces the final merged differential
8. Track accuracy and cost at each round to produce a cost/benefit chart

**Measured outcomes:**

- Accuracy at each arbitration round (top-1, top-3, mentioned)
- Per-specialist accuracy contribution
- LLM token cost per round (N specialists + 1 arbiter)
- Cost/benefit convergence chart
- Consensus rate across rounds

**Key files:**

- `src/backend/tracks/arbitrated/config.py` — Specialist definitions, max rounds, threshold
- `src/backend/tracks/arbitrated/specialists.py` — Domain-specific reasoning agents
- `src/backend/tracks/arbitrated/arbiter.py` — Arbiter agent that evaluates and coordinates
- `src/backend/tracks/arbitrated/run_arbitrated.py` — Runner with per-round scoring
- `src/backend/tracks/arbitrated/results/` — Per-round results and charts

---

## Adding a New Track

1. Choose an unused letter ID (I, J, ...).
2. Create `src/backend/tracks/<track_dir>/` with `__init__.py`.
3. Add the track to the **Track Registry** table above.
4. Tag every new file on line 1 with `# [Track X: Name]`.
5. Store results in `src/backend/tracks/<track_dir>/results/`.
6. Add a comparison entry in `src/backend/tracks/shared/compare.py`.
7. Never import from another track's directory — only from `app/` and `shared/`.
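Steps 2, 4, and 5 are mechanical; a minimal sketch of a scaffold helper (hypothetical, not an existing project file):

```python
# [Hypothetical scaffold helper — illustrative only, not a project file]
"""Scaffold a new experimental track directory per steps 2, 4, and 5 above."""
from pathlib import Path

def scaffold_track(track_id: str, name: str, dir_name: str) -> Path:
    root = Path("src/backend/tracks") / dir_name
    # Creates the track directory (step 2) and its results/ subdirectory (step 5).
    (root / "results").mkdir(parents=True, exist_ok=True)
    init = root / "__init__.py"
    if not init.exists():
        init.write_text(f"# [Track {track_id}: {name}]\n", encoding="utf-8")  # line-1 tag (step 4)
    return root

# Example: scaffold_track("I", "New Strategy", "new_strategy")
```

Steps 3 and 6 (the registry row and the `compare.py` entry) still require manual edits.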