# TRACKS.md – Experimental Track Registry

Single source of truth for all experimental tracks, their file ownership, tagging conventions, and isolation rules.

Referenced by CLAUDE.md. Read that file first for general project context.
## Why Tracks?

The baseline pipeline (Track A) achieves 36% top-1 diagnostic accuracy on MedQA. To improve this, we are evaluating multiple independent strategies in parallel. Each strategy is an isolated "track" with its own code, configuration, and results, so we can compare them fairly without cross-contamination.
## Track Registry

| ID | Name | Directory | Strategy |
|---|---|---|---|
| A | Baseline | `src/backend/app/` | The production 6-step pipeline. No modifications for experiments. |
| B | RAG Variants | `src/backend/tracks/rag_variants/` | Test different chunk sizes, segmentation strategies, and embedding models to optimize guideline retrieval quality and downstream diagnostic accuracy. |
| C | Iterative Refinement | `src/backend/tracks/iterative/` | Run the diagnosis step in a serial loop: each iteration critiques and refines the previous output. Continue until the marginal improvement drops below a cost/benefit threshold. Produces a convergence chart. |
| D | Arbitrated Parallel | `src/backend/tracks/arbitrated/` | Run multiple specialist reasoning agents in parallel. An arbiter agent evaluates all outputs, tailors resubmission prompts for each specialist based on their strengths and weaknesses, and repeats until the cost/benefit ratio plateaus. Produces a cost/benefit chart. |
| E | Combined | `src/backend/tracks/combined/` | Compose per-axis winners from B/C/D/F/G/H. Tests 3 composition patterns (breadth-then-depth, depth-within-breadth, bookend). Phase 3: build after Phase 1+2 data. |
| F | Prompt Architecture | `src/backend/tracks/prompt_arch/` | Test how reasoning prompt structure affects accuracy: structured template, few-shot, reverse reasoning, Bayesian framing. Phase 2. |
| G | Multi-Sample Voting | `src/backend/tracks/voting/` | Self-consistency via repeated sampling plus majority/weighted vote. 1/3/5 samples at varying temperatures. Phase 2. |
| H | Evidence Verification | `src/backend/tracks/verification/` | Post-hoc grounding check: verify each diagnosis against patient evidence, re-rank by grounding score. Phase 2. |
| – | Shared | `src/backend/tracks/shared/` | Cross-track utilities: cost tracking, comparison harness, chart generation. Not a track itself. |
## File Tagging Convention

Every file owned by a track MUST carry a track tag on line 1. This makes ownership unambiguous when reading any file in isolation.

### Format by file type
| File Type | Tag Format | Example |
|---|---|---|
| Python (`.py`) | `# [Track X: Name]` | `# [Track B: RAG Variants]` |
| JSON (`.json`) | First key in object | `{"_track": "Track B: RAG Variants", ...}` |
| Markdown (`.md`) | HTML comment | `<!-- [Track B: RAG Variants] -->` |
| Config (`.env`, `.yaml`) | Comment | `# [Track B: RAG Variants]` |
### Track A exception

Track A files (`src/backend/app/`) were written before the track system existed. They are tagged with `# [Track A: Baseline]` on line 1, but their code is NOT modified for experimental purposes. Experiments extend or wrap Track A code from within their own track directory.
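The tag convention can be enforced mechanically. Below is a minimal sketch of such a checker; the function name and the exact set of covered extensions are illustrative, not existing project code (JSON's first-key convention would need a JSON parse and is omitted here):

```python
import re
from pathlib import Path

# Expected line-1 tag per file type, per the table above.
_TAG = r"\[Track [A-Z]: .+\]"
TAG_PATTERNS = {
    ".py":   re.compile(rf"^# {_TAG}"),
    ".md":   re.compile(rf"^<!-- {_TAG} -->"),
    ".env":  re.compile(rf"^# {_TAG}"),
    ".yaml": re.compile(rf"^# {_TAG}"),
}

def has_track_tag(path: Path) -> bool:
    """True if the file's first line carries a valid track tag,
    or if its type is not covered by the convention."""
    pattern = TAG_PATTERNS.get(path.suffix)
    if pattern is None:
        return True
    text = path.read_text(encoding="utf-8")
    first_line = text.splitlines()[0] if text else ""
    return bool(pattern.match(first_line))
```

A CI step could walk `src/backend/tracks/` with such a predicate and fail on any untagged file.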
## Isolation Rules

These rules prevent cross-contamination between experimental tracks:

### 1. File Ownership

- Each file belongs to exactly one track (identified by its line-1 tag and directory).
- Files in `src/backend/app/` belong to Track A.
- Files in `src/backend/tracks/<dir>/` belong to the corresponding track.
- Files in `src/backend/tracks/shared/` are shared utilities, not owned by any single track.

### 2. No Cross-Modification

- Never modify a Track A file to serve an experiment. Instead, import and extend from your track's directory.
- Never modify a Track B file from Track C code, and so forth.
- If two tracks need the same utility, put it in `shared/`.
### 3. Import Direction

- Track B/C/D code may import from Track A (`app/`) and `shared/`.
- Track A code NEVER imports from Track B/C/D.
- `shared/` code may import from Track A (`app/`) only.
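These direction rules can be expressed as a single predicate, e.g. for a lint script. This is an illustrative sketch, not existing project code; it assumes dotted module paths relative to `src/backend` (so `app.config`, `tracks.shared.compare`, `tracks.rag_variants.retriever`):

```python
def import_allowed(owner: str, imported: str) -> bool:
    """Return True if a module at `owner` may import `imported`,
    per the direction rules above. Both are dotted module paths
    relative to src/backend."""
    o, i = owner.split("."), imported.split(".")
    if i[0] != "tracks":
        # Importing app/ (or anything outside tracks/) is unrestricted here.
        return True
    if o[0] == "app":
        return False  # Track A never imports track code
    imported_track = i[1] if len(i) > 1 else ""
    own_track = o[1] if o[0] == "tracks" and len(o) > 1 else ""
    if own_track == "shared":
        return False  # shared/ imports from app/ only
    # A track may import itself and shared/, never a sibling track.
    return imported_track in (own_track, "shared")
```

A lint pass could walk each file's `import` statements (e.g. via `ast`) and flag any pair for which this returns `False`.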
### 4. Results Isolation

- Each track stores results in `src/backend/tracks/<dir>/results/`.
- Result filenames include the track ID prefix (e.g., `trackB_medqa_20260215.json`).
- Cross-track comparison is done only via `src/backend/tracks/shared/compare.py`.
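A filename following the prefix rule can be built with a small helper like the following (illustrative, not an existing project function):

```python
from datetime import date

def result_filename(track_id: str, dataset: str, run_date: date) -> str:
    """Build a results filename with the track ID prefix,
    matching the pattern shown in the rule above."""
    return f"track{track_id}_{dataset}_{run_date:%Y%m%d}.json"

# result_filename("B", "medqa", date(2026, 2, 15))
# -> "trackB_medqa_20260215.json"
```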
### 5. Configuration Isolation

- Track-specific parameters live in each track's own config or constants, not in `app/config.py`.
- The shared `app/config.py` provides only baseline/global settings (API keys, endpoints, etc.).
## Track Details

### Track A: Baseline

**Purpose:** The production-ready pipeline. The control group for all experiments.

**Pipeline:** Parse → Reason → Drug Check → Guideline Retrieval → Conflict Detection → Synthesis

**Key parameters:**

- Embedding: `all-MiniLM-L6-v2` (384 dims)
- RAG top-k: 5
- No guideline chunking (each guideline = 1 document)
- Clinical reasoning temperature: 0.3
- Synthesis temperature: 0.2
- Single-pass reasoning (no iteration)

**Baseline accuracy (50-case MedQA):** 36% top-1, 38% mentioned
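As a sketch, the parameters above might be expressed as constants like the following; the names are illustrative, and the real values live in `app/config.py` and the per-step code:

```python
# Illustrative constants mirroring the baseline parameters listed above;
# actual names and locations in app/config.py may differ.
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # 384-dim sentence embeddings
EMBEDDING_DIM = 384
RAG_TOP_K = 5                  # guidelines retrieved per query
GUIDELINE_CHUNKING = None      # each guideline indexed as one document
REASONING_TEMPERATURE = 0.3    # clinical reasoning step
SYNTHESIS_TEMPERATURE = 0.2    # final synthesis step
REASONING_ITERATIONS = 1       # single-pass, no refinement loop
```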
### Track B: RAG Variants

**Purpose:** Determine whether retrieval quality improvements translate to better diagnostic accuracy.

**Experiments:**

- Chunking strategies – split each guideline into smaller segments (100-word chunks, 200-word chunks, sentence-level) with configurable overlap
- Embedding models – compare `all-MiniLM-L6-v2` (384d) vs `all-mpnet-base-v2` (768d) vs `bge-base-en-v1.5` (768d) vs `medcpt` (medical-specific)
- Top-k variation – test k=3, k=5, k=8, k=10 to find optimal retrieval breadth
- Re-ranking – add a cross-encoder re-ranking step after initial retrieval
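The word-window chunking with overlap described in the first bullet can be sketched as follows; the function name and signature are illustrative, not the actual `chunker.py` API:

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into word windows of `size` words; consecutive
    windows share `overlap` words (requires size > overlap >= 0)."""
    if overlap < 0 or size <= overlap:
        raise ValueError("need size > overlap >= 0")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # last window already covers the tail
    return chunks
```

Sentence-level chunking would split on sentence boundaries instead of fixed word counts, but the overlap idea carries over.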
**Measured outcomes:**

- RAG retrieval accuracy (30-query test suite)
- MedQA diagnostic accuracy (same 50 cases, seed=42)
- Retrieval latency per query

**Key files:**

- `src/backend/tracks/rag_variants/config.py` – Variant definitions
- `src/backend/tracks/rag_variants/chunker.py` – Guideline chunking strategies
- `src/backend/tracks/rag_variants/retriever.py` – Modified retrieval with configurable embedding/chunking
- `src/backend/tracks/rag_variants/run_variants.py` – Runner that tests all configurations
- `src/backend/tracks/rag_variants/results/` – Per-variant results
### Track C: Iterative Refinement

**Purpose:** Determine whether repeated self-critique improves diagnostic accuracy, and find the point of diminishing returns.

**Method:**

- Run baseline clinical reasoning (iteration 0)
- Feed the output back along with the patient data and a critique prompt
- The model reviews its own differential, identifies weaknesses, and produces a refined version
- Repeat until: (a) max iterations reached, or (b) the differential stops changing meaningfully
- Track accuracy and LLM cost at each iteration to produce a convergence/cost-benefit chart
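The loop above can be sketched with the LLM call injected as a plain function, which keeps the control flow testable. Everything here is illustrative: the names, and the crude "unchanged set of diagnoses" stopping rule standing in for "stops changing meaningfully":

```python
def refine_until_stable(patient, initial, refine_fn, max_iters=5):
    """Iterative self-critique loop for the method above.
    refine_fn(patient, differential) -> refined differential
    (a list of diagnosis strings); it stands in for the LLM
    critique-and-refine call. Stops at max_iters or when an
    iteration leaves the differential unchanged."""
    history = [list(initial)]   # iteration 0 = baseline output
    current = list(initial)
    for _ in range(max_iters):
        refined = refine_fn(patient, current)
        history.append(list(refined))
        if set(refined) == set(current):
            break               # converged: no meaningful change
        current = list(refined)
    return current, history
```

Per-iteration accuracy and token cost would be recorded alongside `history` to draw the convergence chart.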
**Measured outcomes:**

- Accuracy at each iteration (top-1, top-3, mentioned)
- LLM token cost at each iteration
- Convergence curve: accuracy vs. cumulative cost
- Iteration at which improvement drops below threshold

**Key files:**

- `src/backend/tracks/iterative/config.py` – Max iterations, convergence threshold
- `src/backend/tracks/iterative/refiner.py` – Iterative reasoning loop with self-critique
- `src/backend/tracks/iterative/run_iterative.py` – Runner with per-iteration scoring
- `src/backend/tracks/iterative/results/` – Per-iteration results and charts
### Track D: Arbitrated Parallel

**Purpose:** Determine whether multiple specialist agents, coordinated by an arbiter, outperform a single-pass generalist, and at what cost.

**Method:**

- Run N specialist reasoning agents in parallel, each with a domain-specific system prompt (e.g., cardiologist, neurologist, infectious disease specialist)
- An arbiter agent receives all N specialist outputs plus the patient data
- The arbiter evaluates each specialist's differential, identifies agreements and disagreements
- The arbiter generates tailored resubmission prompts for each specialist, telling the cardiologist "the neurologist raised X, reconsider Y" and vice versa
- Specialists run again with the arbiter's feedback
- Repeat until: (a) consensus reached, (b) max rounds, or (c) cost/benefit drops below threshold
- The arbiter produces the final merged differential
- Track accuracy and cost at each round to produce a cost/benefit chart
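The control flow of the method above can be sketched with the specialist and arbiter LLM calls injected as plain functions. This is an illustration of the loop, not the real agents; names and signatures are assumptions:

```python
def run_arbitration(patient, specialists, arbiter, max_rounds=3):
    """Arbitrated parallel loop (requires max_rounds >= 1).
    specialists: {name: fn(patient, feedback_str) -> differential}
    arbiter: fn(patient, {name: differential}) ->
             (merged_differential, {name: feedback_str}, done_flag)
    """
    feedback = {name: "" for name in specialists}  # round 1: no feedback
    merged = []
    rounds_used = 0
    for _ in range(max_rounds):
        rounds_used += 1
        # Specialists run "in parallel" (sequentially here for clarity).
        outputs = {name: fn(patient, feedback[name])
                   for name, fn in specialists.items()}
        # Arbiter merges, writes per-specialist feedback, and decides
        # whether consensus / the cost-benefit threshold was reached.
        merged, feedback, done = arbiter(patient, outputs)
        if done:
            break
    return merged, rounds_used
```

In the real track the specialist calls would be issued concurrently, and per-round token cost would be logged for the cost/benefit chart.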
**Measured outcomes:**

- Accuracy at each arbitration round (top-1, top-3, mentioned)
- Per-specialist accuracy contribution
- LLM token cost per round (N specialists + 1 arbiter)
- Cost/benefit convergence chart
- Consensus rate across rounds

**Key files:**

- `src/backend/tracks/arbitrated/config.py` – Specialist definitions, max rounds, threshold
- `src/backend/tracks/arbitrated/specialists.py` – Domain-specific reasoning agents
- `src/backend/tracks/arbitrated/arbiter.py` – Arbiter agent that evaluates and coordinates
- `src/backend/tracks/arbitrated/run_arbitrated.py` – Runner with per-round scoring
- `src/backend/tracks/arbitrated/results/` – Per-round results and charts
## Adding a New Track

- Choose an unused letter ID (I, J, ...).
- Create `src/backend/tracks/<dir_name>/` with `__init__.py`.
- Add the track to the Track Registry table above.
- Tag every new file on line 1 with `# [Track X: Name]`.
- Store results in `src/backend/tracks/<dir_name>/results/`.
- Add a comparison entry in `src/backend/tracks/shared/compare.py`.
- Never import from another track's directory; import only from `app/` and `shared/`.