# EXPERIMENT_PLAN.md – 4-Phase Accuracy Optimization Plan

> **Purpose:** Step-by-step execution plan for an AI agent or human to follow.
> Each step is atomic, has clear inputs/outputs, and explicit success criteria.
>
> **Context:** Baseline accuracy is 36% top-1 on 50-case MedQA (seed=42). Our
> goal is to find the best composite strategy before the Feb 24, 2026 deadline.
>
> **Prerequisite reading:** `CLAUDE.md` → `TRACKS.md` → this file.

---
## Infrastructure Prerequisites

Before ANY phase, ensure:

1. **HF Endpoint is running.**
   - Go to https://ui.endpoints.huggingface.co → `medgemma-27b-cds` → Resume
   - Wait until status shows "Running" (5–15 min cold start)
   - Cost: ~$2.50/hr – **pause when done**
2. **Virtual environment is active.**
   ```powershell
   cd f:\kaggle\medgemma_impact_challenge\src\backend
   .\venv\Scripts\Activate.ps1
   ```
3. **Dependencies installed.**
   ```powershell
   pip install -r requirements.txt
   pip install sentence-transformers  # Needed for Track B embedding variants
   ```
4. **Environment variables set.**
   - The `.env` file in `src/backend/` must have `HF_TOKEN`, `MEDGEMMA_API_KEY`, `MEDGEMMA_BASE_URL`
   - Verify: `python -c "from app.config import Settings; s = Settings(); print(s.medgemma_base_url)"`
5. **Quick health check.** Run 1 case through the baseline to confirm the endpoint responds:
   ```powershell
   python -m validation.run_validation --medqa --max-cases 1
   ```
   **Success:** Pipeline returns a `CDSReport` without timeout errors.

---
## Phase 1 – Independent Axis Sweeps

**Goal:** Find the best single-axis configuration for B, C, and D independently.
**Estimated cost:** ~$15–25 of endpoint time (6–10 hours)
**Estimated cases:** 50 per config × (10 + 4 + 4) = 900 total pipeline runs

### Phase 1A – Track B: RAG Variants

**What we're testing:** Which retrieval configuration gets the best documents in front of the model?

#### Step 1A.1: Smoke Test (3 cases × 10 variants = 30 runs)

```powershell
cd f:\kaggle\medgemma_impact_challenge\src\backend
python -m tracks.rag_variants.run_variants --max-cases 3
```

**Check for:**
- [ ] All 10 variants complete without errors
- [ ] Each variant produces a result JSON in `tracks/rag_variants/results/`
- [ ] MedCPT and MPNet embedding models download successfully
- [ ] Reranking variant (B9) loads the cross-encoder model
- [ ] Output shows a comparison table with per-variant scores

**If any variant fails:** Fix the error, then re-run with `--variant <id>` to test just that one:

```powershell
python -m tracks.rag_variants.run_variants --variant B6_medcpt --max-cases 3
```

**Common failure modes:**
- `sentence-transformers` not installed → `pip install sentence-transformers`
- MedCPT download fails → check that `HF_TOKEN` is set
- ChromaDB lock → delete `tracks/rag_variants/data/chroma/` and retry
#### Step 1A.2: Full Sweep (50 cases × 10 variants = 500 runs)

```powershell
python -m tracks.rag_variants.run_variants
```

**Expected runtime:** 3–5 hours (50 cases × 10 variants, ~2 min/case with API latency)
**Output:** Results in `tracks/rag_variants/results/` – one JSON per variant.

#### Step 1A.3: Identify B*

Read the comparison table printed at the end, or run:

```powershell
python -m tracks.shared.compare --tracks B --dataset medqa
```

**Record the winner:**

```
B* = ____________ (variant_id)
B* top-1 accuracy = _____%
B* improvement over B0_baseline = +_____%
```

**Decision rules:**
- If the best variant beats B0 by <2%, retrieval isn't the bottleneck. Note this, but still carry B* forward.
- If multiple variants tie within 1%, prefer the one with lower latency/complexity.
- If reranking (B9) wins, note the added latency cost.

---
### Phase 1B – Track C: Iterative Refinement

**What we're testing:** Does repeated self-critique improve diagnostic accuracy? At what point do returns diminish?

#### Step 1B.1: Smoke Test (3 cases × 4 configs = 12 runs)

```powershell
python -m tracks.iterative.run_iterative --max-cases 3
```

**Check for:**
- [ ] All 4 configs complete without errors
- [ ] Per-iteration accuracy and cost data is printed
- [ ] Convergence detection works (C0_2rounds should always run both iterations; C2_5rounds may converge early)
- [ ] Cost ledger populates correctly

**If a config hangs:** Likely an LLM timeout. Check that the endpoint is warm. The iterative track makes 2–10× more LLM calls per case than the baseline.

#### Step 1B.2: Full Sweep (50 cases × 4 configs)

```powershell
python -m tracks.iterative.run_iterative
```

**Expected runtime:** 2–4 hours (C0 fastest, C3 slowest)
**Output:** Results in `tracks/iterative/results/`
#### Step 1B.3: Identify C*

```powershell
python -m tracks.shared.compare --tracks C --dataset medqa
```

**Record the winner:**

```
C* = ____________ (config_id)
C* top-1 accuracy = _____%
C* avg iterations used = _____
C* cost per case = $_____
C* improvement over baseline = +_____%
```

**Key data to extract:** The per-iteration accuracy curve. Plot or record:

```
Iteration 0 (baseline):       ___% top-1
Iteration 1 (first critique): ___% top-1
Iteration 2:                  ___% top-1
Iteration 3:                  ___% top-1 (if applicable)
...
```

**Decision rules:**
- The winning config is the one with the best accuracy/cost ratio, not necessarily the one with the highest absolute accuracy.
- If C2_5rounds converges at iteration 2 in most cases, the extra rounds aren't helping – C1_3rounds is probably enough.
- If C3_aggressive loses accuracy (the critic is too harsh), note this as a failure mode.
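The "best accuracy/cost ratio" rule can be applied mechanically. A minimal sketch, assuming hypothetical per-config summaries with `top1_accuracy` and `cost_per_case` fields (placeholders; the real result JSON schema from `run_iterative` may differ):

```python
# Hypothetical per-config summaries; field names and numbers are placeholders,
# not the actual result schema or measured values.
configs = [
    {"config_id": "C0_2rounds", "top1_accuracy": 0.42, "cost_per_case": 0.08},
    {"config_id": "C1_3rounds", "top1_accuracy": 0.44, "cost_per_case": 0.14},
    {"config_id": "C2_5rounds", "top1_accuracy": 0.44, "cost_per_case": 0.23},
]

def value_score(cfg, baseline_accuracy=0.36, baseline_cost=0.04):
    """Accuracy lift per extra dollar spent, relative to the baseline."""
    lift = cfg["top1_accuracy"] - baseline_accuracy
    extra_cost = max(cfg["cost_per_case"] - baseline_cost, 1e-9)
    return lift / extra_cost

best = max(configs, key=value_score)
print(best["config_id"])  # C0_2rounds: +6 points for the least extra cost
```

With these made-up numbers, C0 wins even though C1/C2 have higher absolute accuracy, which is exactly the trade-off the rule is meant to surface.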
---

### Phase 1C – Track D: Arbitrated Parallel

**What we're testing:** Do multiple specialist perspectives, coordinated by an arbiter, find diagnoses a generalist misses?

#### Step 1C.1: Smoke Test (3 cases × 4 configs = 12 runs)

```powershell
python -m tracks.arbitrated.run_arbitrated --max-cases 3
```

**Check for:**
- [ ] All 4 configs complete without errors
- [ ] Specialist outputs show domain-specific reasoning (cardiologist emphasizes cardiac, etc.)
- [ ] Arbiter merge output is a coherent consensus differential, not just concatenation
- [ ] For multi-round configs (D2, D3): tailored resubmission prompts are generated
- [ ] For multi-round configs: second-round specialist outputs differ from first round
- [ ] Cost tracking shows escalating cost with more specialists/rounds

**If the arbiter produces garbage:** The merge prompt may need tuning. Check `ARBITER_MERGE_PROMPT` in `tracks/arbitrated/arbiter.py`.

#### Step 1C.2: Full Sweep (50 cases × 4 configs)

```powershell
python -m tracks.arbitrated.run_arbitrated
```

**Expected runtime:** 3–6 hours (D0 fastest, D3 slowest – 5 specialists × 2 rounds plus arbiter merges ≈ 12 LLM calls/case)
**Output:** Results in `tracks/arbitrated/results/`

#### Step 1C.3: Identify D*

```powershell
python -m tracks.shared.compare --tracks D --dataset medqa
```
**Record the winner:**

```
D* = ____________ (config_id)
D* top-1 accuracy = _____%
D* cost per case = $_____
D* improvement over baseline = +_____%
```

**Additional data to record:**

```
Per-specialist contribution analysis:
  Cardiologist:      Contributed unique correct dx in ___% of cases
  Neurologist:       ____%
  ID Specialist:     ____%
  General Internist: ____%
  Emergency Med:     ____%
Arbitration consensus rate: ____% of cases where >3 specialists agreed on top-1
Round 2 lift (if applicable): +____% over round 1
```
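The "unique correct dx" percentage above can be computed from per-case records. A minimal sketch with made-up data; the `gold` and per-specialist `top1` fields are assumptions about the result schema, not the real one:

```python
# Hypothetical per-case records: the gold diagnosis plus each specialist's
# top-1 call. Field names are assumptions, not the actual result JSON.
cases = [
    {"gold": "aortic dissection",
     "top1": {"cardiology": "aortic dissection", "neurology": "migraine",
              "emergency": "aortic dissection"}},
    {"gold": "bacterial meningitis",
     "top1": {"cardiology": "endocarditis", "neurology": "bacterial meningitis",
              "emergency": "sepsis"}},
]

def unique_contribution(cases, specialist):
    """Fraction of cases where ONLY this specialist called the gold top-1."""
    hits = 0
    for case in cases:
        correct = [s for s, dx in case["top1"].items() if dx == case["gold"]]
        if correct == [specialist]:
            hits += 1
    return hits / len(cases)

print(unique_contribution(cases, "neurology"))  # 0.5 (the meningitis case)
```

A specialist with a persistently zero unique contribution is a candidate for removal under the decision rules below.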
**Decision rules:**
- If D0 (3-spec, 1-round) matches D3 (5-spec, 2-rounds), the extra cost isn't justified.
- If specialists all agree in round 1, round 2 is wasted computation – future configs can drop it.
- If one specialist consistently disagrees with the correct answer, consider removing it from the ensemble.

---

### Phase 1D – Cross-Track Comparison

After all three tracks complete, run the unified comparison:

```powershell
python -m tracks.shared.compare --dataset medqa
```

**Expected output:**

```
Cross-Track Comparison: MEDQA
-------------------------------------------------------------
Track             Top-1   Top-3   Mentioned   Pipeline   Cost
-------------------------------------------------------------
A: Baseline       36.0%   --      38.0%       94.0%      $X.XX
B: RAG Variants   ___%    --      ___%        ___%       $X.XX
C: Iterative      ___%    --      ___%        ___%       $X.XX
D: Arbitrated     ___%    --      ___%        ___%       $X.XX
-------------------------------------------------------------
```

**Record Phase 1 summary:**

```
B* = __________, accuracy = ____%, delta = +____%
C* = __________, accuracy = ____%, delta = +____%
D* = __________, accuracy = ____%, delta = +____%
Best single axis: Track ___
```

**Go/No-Go for Phase 2:**
- If ALL tracks are within 2% of baseline → the model itself may be the bottleneck,
  not the pipeline. Consider investigating prompt architecture (Phase 2) more aggressively.
- If ANY single track shows ≥5% lift → strong signal, proceed to Phase 2 and Phase 3.
- If results are noisy (high variance) → increase to 100 cases or use a different seed
  to get more statistical power.

---
## Phase 2 – New Axes (F, G, H)

**Goal:** Test 3 lightweight axes that are cheap to implement and orthogonal to B/C/D.
**Build these ONLY after Phase 1 data is in.** Phase 1 results inform which axes matter most.

### Phase 2A – Track F: Prompt Architecture

**Axis:** *How* the model is asked to reason, independent of depth (C) or breadth (D).
**Why:** This is the cheapest axis to test – same token count, different structure. If prompt architecture matters more than retrieval or iteration, we want to know early.

#### Step 2A.1: Build Track F

Create `src/backend/tracks/prompt_arch/` following the track system conventions (see TRACKS.md "Adding a New Track").

**Files to create:**

```
tracks/prompt_arch/
  __init__.py          # Track tag, package init
  config.py            # PromptVariant dataclass + 5 variants
  reasoner.py          # Modified clinical_reasoning that accepts prompt templates
  run_prompt_arch.py   # Runner following the same pattern as other tracks
  results/             # Output directory
```

**Variant definitions:**

| ID | Name | Strategy | Prompt Change |
|----|------|----------|---------------|
| F0 | Baseline | Current free-form | No change (control) |
| F1 | Structured Template | Force structured output | System prompt: "For each symptom, list 3 possible causes. Identify diagnoses appearing in ≥2 symptom lists. Rank by frequency of appearance." |
| F2 | Few-Shot | 2 worked examples | Add 2 solved MedQA cases (NOT from the test set) to the system prompt as worked examples with reasoning chains |
| F3 | Reverse Reasoning | Falsification | After initial differential: "For each of your top 5 diagnoses, list the findings you would EXPECT. Mark which are present, absent, or unknown in this patient. Re-rank based on match percentage." |
| F4 | Bayesian | Prior updating | "Assign a prior probability to each diagnosis based on prevalence. For each finding, update the posterior probability. Show the Bayesian reasoning chain. Final differential ordered by posterior." |

**Implementation notes:**
- `reasoner.py` should accept a `prompt_template: str` parameter and inject it into the system prompt or user prompt of the clinical reasoning call.
- F0 uses the exact same system prompt as `app/tools/clinical_reasoning.py` – this is the control.
- Few-shot examples (F2) need to come from the MedQA TRAIN set, not the 50-case test set. Pick 2 from `validation/data/medqa_test.jsonl` that are NOT in the seed=42 sample, or create synthetic examples from textbook cases.
- F3 and F4 require TWO LLM calls: first the initial differential, then the structured verification/update. This makes them comparable to C in cost but different in mechanism (structured verification vs. open-ended critique).
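The prompt-injection note could look roughly like this in `reasoner.py`. A sketch under assumptions: the `PromptVariant` fields, the `build_system_prompt` helper, and the baseline prompt text are all hypothetical, not the real code:

```python
# Sketch of template injection for Track F. Everything here is illustrative:
# the real PromptVariant dataclass and baseline prompt live in config.py and
# app/tools/clinical_reasoning.py respectively.
from dataclasses import dataclass

BASELINE_SYSTEM_PROMPT = "You are a clinical reasoning assistant..."  # stand-in

@dataclass
class PromptVariant:
    variant_id: str
    name: str
    system_prompt_suffix: str  # "" for F0 -> identical to baseline (control)

def build_system_prompt(variant: PromptVariant) -> str:
    """Append the variant's template to the baseline prompt; F0 is unchanged."""
    if not variant.system_prompt_suffix:
        return BASELINE_SYSTEM_PROMPT
    return BASELINE_SYSTEM_PROMPT + "\n\n" + variant.system_prompt_suffix

f1 = PromptVariant(
    variant_id="F1",
    name="Structured Template",
    system_prompt_suffix=(
        "For each symptom, list 3 possible causes. Identify diagnoses "
        "appearing in >=2 symptom lists. Rank by frequency of appearance."
    ),
)
assert build_system_prompt(f1).startswith(BASELINE_SYSTEM_PROMPT)
```

Keeping F0 as a byte-identical pass-through of the baseline prompt is what makes it a valid control.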
#### Step 2A.2: Run Track F

```powershell
# Smoke test
python -m tracks.prompt_arch.run_prompt_arch --max-cases 3
# Full sweep
python -m tracks.prompt_arch.run_prompt_arch
```

#### Step 2A.3: Identify F*

```
F* = ____________
F* top-1 accuracy = _____%
F* improvement over F0 = +_____%
```

---
### Phase 2B – Track G: Multi-Sample Voting (Self-Consistency)

**Axis:** Statistical diversity via repeated sampling at higher temperature.
**Why:** Self-consistency is one of the most reliable accuracy boosters in the CoT literature. It's embarrassingly parallel and requires no new prompts – just `asyncio.gather()` over N samples.

#### Step 2B.1: Build Track G

Create `src/backend/tracks/voting/`.

**Files:**

```
tracks/voting/
  __init__.py
  config.py       # VotingConfig: n_samples, temperature, aggregation_method
  voter.py        # Generate N reasoning outputs, extract top-k diagnoses, vote
  run_voting.py
  results/
```

**Variant definitions:**

| ID | Samples | Temp | Aggregation | Description |
|----|---------|------|-------------|-------------|
| G0 | 1 | 0.3 | N/A | Control (identical to baseline) |
| G1 | 3 | 0.5 | Majority vote | 3 samples, majority wins |
| G2 | 5 | 0.5 | Majority vote | 5 samples, majority wins |
| G3 | 5 | 0.7 | Weighted vote | 5 samples at higher diversity, weighted by internal consistency |
| G4 | 3 | 0.5 | Best-of-N | 3 samples, pick the one whose differential best matches retrieved guidelines |

**Implementation notes:**
- `voter.py` calls `medgemma.generate()` N times in parallel with `asyncio.gather()`.
- Temperature must be high enough to get diversity (≥0.5), otherwise all N samples will be nearly identical.
- **Majority vote aggregation:** Extract the top-1 diagnosis from each sample. The diagnosis appearing most frequently wins. If tied, use the one from the sample with the longest reasoning (a proxy for confidence).
- **Weighted vote (G3):** For each sample, check how many of its diagnoses are mentioned in the retrieved guidelines. Weight = number of guideline-grounded diagnoses. This penalizes hallucinated differentials.
- **Best-of-N (G4):** Score each sample's differential against the retrieved guidelines using fuzzy_match overlap. Pick the highest-scoring sample wholesale.
- Cost scales linearly: G2 costs 5× the baseline reasoning per case.
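The majority-vote aggregation with the longest-reasoning tie-break might be sketched like this (the per-sample `top1`/`reasoning` fields are assumptions about the sample output shape, not the real schema):

```python
# Sketch of majority-vote aggregation for voter.py. Sample dicts are a
# made-up shape; the real samples would come from medgemma.generate().
from collections import Counter

def majority_vote(samples):
    """Pick the most frequent top-1; break ties with the longest reasoning."""
    counts = Counter(s["top1"] for s in samples)
    best_count = max(counts.values())
    tied = {dx for dx, n in counts.items() if n == best_count}
    # Among tied diagnoses, prefer the sample with the longest reasoning chain
    winner = max((s for s in samples if s["top1"] in tied),
                 key=lambda s: len(s["reasoning"]))
    return winner["top1"]

samples = [
    {"top1": "pulmonary embolism", "reasoning": "short"},
    {"top1": "pneumonia", "reasoning": "a much longer reasoning chain here"},
    {"top1": "pulmonary embolism", "reasoning": "medium length"},
]
print(majority_vote(samples))  # pulmonary embolism (2 of 3 votes)
```

The tie-break only fires when vote counts are equal, so a clear majority always wins regardless of reasoning length.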
#### Step 2B.2: Run Track G

```powershell
python -m tracks.voting.run_voting --max-cases 3   # smoke
python -m tracks.voting.run_voting                 # full
```

#### Step 2B.3: Identify G*

```
G* = ____________
G* top-1 accuracy = _____%
G* cost multiplier vs baseline = _____×
```

---
### Phase 2C – Track H: Evidence Verification (Post-Hoc Grounding)

**Axis:** A structured fact-check pass that re-ranks the differential based on evidence alignment.
**Why:** The model might rank a diagnosis #1 that isn't actually supported by the evidence. H catches this. It differs from C (open-ended self-critique) in that H specifically asks "does the evidence support this ranking?"

#### Step 2C.1: Build Track H

Create `src/backend/tracks/verification/`.

**Files:**

```
tracks/verification/
  __init__.py
  config.py            # VerificationConfig
  verifier.py          # Post-hoc evidence grounding check
  run_verification.py
  results/
```

**Method for each case:**
1. Run the baseline pipeline → get a differential with top-5 diagnoses
2. For EACH diagnosis in the differential, make ONE LLM call:
   ```
   Patient findings: {summary}
   Retrieved guidelines: {relevant_guidelines}
   Diagnosis under review: {diagnosis_name}
   Task: List the specific findings from this patient that SUPPORT this diagnosis,
   the findings that ARGUE AGAINST it, and the findings that are NEUTRAL.
   Give a grounding score from 0-10 based on evidence alignment.
   ```
3. Re-rank the differential by grounding score (descending)
4. Use the re-ranked differential for scoring

**Variant definitions:**

| ID | Method | LLM Calls | Description |
|----|--------|-----------|-------------|
| H0 | None | 0 extra | Control |
| H1 | Top-5 re-rank | 5 extra | Verify and re-rank all 5 diagnoses |
| H2 | Top-3 re-rank | 3 extra | Verify only the top 3 (cheaper) |
| H3 | Eliminate-only | 5 extra | Don't re-rank – just DROP any diagnosis with score ≤3 and promote the rest |

**Implementation notes:**
- Use `medgemma.generate_structured()` with a Pydantic model for the grounding output:
  ```python
  class GroundingResult(BaseModel):
      diagnosis: str
      supporting_findings: List[str]
      opposing_findings: List[str]
      neutral_findings: List[str]
      grounding_score: int  # 0-10
  ```
- Temperature: 0.1 (this is extraction/evaluation, not generation)
- Each verification call is independent – run all 5 in parallel with `asyncio.gather()`
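Steps 2-3 of the method (parallel grounding calls, then re-rank by score) could be sketched as follows. `grade_diagnosis` is a stand-in stub for the real `medgemma.generate_structured()` call, and the scores are illustrative only:

```python
# Sketch of parallel verification + re-rank for verifier.py. The stub below
# replaces the real LLM call; scores are made up for illustration.
import asyncio

async def grade_diagnosis(diagnosis: str) -> dict:
    # Stand-in: the real call would send the grounding prompt and parse a
    # GroundingResult. Hard-coded scores keep the sketch self-contained.
    fake_scores = {"sarcoidosis": 4, "tuberculosis": 8, "lymphoma": 6}
    return {"diagnosis": diagnosis, "grounding_score": fake_scores[diagnosis]}

async def rerank(differential: list[str]) -> list[str]:
    # All calls are independent, so fire them concurrently (step 2)...
    results = await asyncio.gather(*(grade_diagnosis(dx) for dx in differential))
    # ...then sort by grounding score, descending (step 3).
    results.sort(key=lambda r: r["grounding_score"], reverse=True)
    return [r["diagnosis"] for r in results]

reranked = asyncio.run(rerank(["sarcoidosis", "tuberculosis", "lymphoma"]))
print(reranked)  # ['tuberculosis', 'lymphoma', 'sarcoidosis']
```

For H3 (eliminate-only), the sort would be replaced by a filter on `grounding_score > 3` while preserving the original order.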
#### Step 2C.2: Run Track H

```powershell
python -m tracks.verification.run_verification --max-cases 3
python -m tracks.verification.run_verification
```

#### Step 2C.3: Identify H*

```
H* = ____________
H* top-1 accuracy = _____%
H* improvement over baseline = +_____%
```

---
### Phase 2D – Phase 2 Cross-Comparison

After F, G, and H are done, update the shared `compare.py` to include tracks E/F/G/H (add entries to `TRACK_DIRS`), then run:

```powershell
python -m tracks.shared.compare --dataset medqa
```

**Record Phase 2 summary:**

```
F* = __________, accuracy = ____%, delta = +____%
G* = __________, accuracy = ____%, delta = +____%, cost = _____×
H* = __________, accuracy = ____%, delta = +____%
```

**Rank all 6 axes by accuracy lift:**

```
1. Track ___ : +____%  (cost: ___×)
2. Track ___ : +____%  (cost: ___×)
3. Track ___ : +____%  (cost: ___×)
4. Track ___ : +____%  (cost: ___×)
5. Track ___ : +____%  (cost: ___×)
6. Track ___ : +____%  (cost: ___×)
```

---
## Phase 3 – Composition (Track E: Combined)

**Goal:** Wire the per-axis winners together and test whether the gains are additive.
**Only start this after Phase 1 and Phase 2 data is in hand.**

### Step 3.1: Build Track E

Create `src/backend/tracks/combined/`.

**Files:**

```
tracks/combined/
  __init__.py
  config.py         # CombinedConfig: which B*/C*/D*/F*/G*/H* to compose
  pipeline.py       # The composite pipeline that wires winners together
  run_combined.py
  results/
```

**CombinedConfig should reference winner IDs from Phases 1 and 2:**

```python
@dataclass
class CombinedConfig:
    config_id: str
    rag_variant_id: Optional[str]          # B* winner (or None = baseline retrieval)
    iterative_config_id: Optional[str]     # C* winner (or None = no iteration)
    arbitrated_config_id: Optional[str]    # D* winner (or None = single generalist)
    prompt_variant_id: Optional[str]       # F* winner (or None = default prompt)
    voting_config_id: Optional[str]        # G* winner (or None = single sample)
    verification_config_id: Optional[str]  # H* winner (or None = no verification)
    composition_pattern: str               # "E1", "E2", or "E3"
    description: str = ""
```
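For illustration, here is a config wired with hypothetical winners. The IDs below are placeholders, not results, and the dataclass is repeated only to keep the sketch self-contained:

```python
# Illustrative only: every winner ID here is a placeholder until Phase 1/2
# actually produce B*/C*/D*/F*/G*/H*.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CombinedConfig:
    config_id: str
    rag_variant_id: Optional[str]
    iterative_config_id: Optional[str]
    arbitrated_config_id: Optional[str]
    prompt_variant_id: Optional[str]
    voting_config_id: Optional[str]
    verification_config_id: Optional[str]
    composition_pattern: str
    description: str = ""

e1 = CombinedConfig(
    config_id="E1_full",
    rag_variant_id="B6_medcpt",        # hypothetical B* winner
    iterative_config_id="C1_3rounds",  # hypothetical C* winner
    arbitrated_config_id="D0",         # hypothetical D* winner
    prompt_variant_id="F3",            # hypothetical F* winner
    voting_config_id=None,             # G* = G0 -> single sample, skip voting
    verification_config_id="H2",       # hypothetical H* winner
    composition_pattern="E1",
    description="Breadth-then-depth with Phase 1/2 winners",
)
assert e1.voting_config_id is None  # None means that axis stays at baseline
```

The `None` convention matters: it lets the same pipeline code express partial compositions (Step 3.5) without special cases.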
### Step 3.2: Implement 3 Composition Patterns

**Pattern E1: Breadth-then-Depth** (recommended starting point)

```
Parse
→ B* retriever (swap guideline retrieval)
→ F* prompt template (swap reasoning prompt)
→ D* specialists in parallel (each uses F* prompt)
→ D* arbiter merge → consensus differential
→ C* iterative refinement on consensus
→ H* evidence verification on refined output
→ G* voting: run the above N times and vote (if G* ≠ G0)
→ Drug Check + Conflict Detection
→ Synthesis
```

**Pattern E2: Depth-within-Breadth**

```
Parse
→ B* retriever
→ D* specialists, each with F* prompt, each running C* internal iteration
→ D* arbiter merge over refined specialist outputs
→ H* evidence verification
→ G* voting over the above
→ Drug Check + Conflict Detection
→ Synthesis
```

**Pattern E3: Bookend (full loop)**

```
Parse
→ B* retriever
→ D* specialists (round 1, F* prompt)
→ D* arbiter merge → rough consensus
→ C* iterative refinement on consensus
→ D* specialists again (round 2, with refined consensus as additional context)
→ D* arbiter re-merge → final differential
→ H* evidence verification
→ G* voting
→ Drug Check + Conflict Detection
→ Synthesis
```

**Implementation guidance:**
- Import the existing track modules – do NOT duplicate code:
  ```python
  from tracks.rag_variants.retriever import VariantRetriever
  from tracks.rag_variants.config import VARIANTS
  from tracks.iterative.refiner import IterativeRefiner
  from tracks.iterative.config import CONFIGS as ITERATIVE_CONFIGS
  from tracks.arbitrated.specialists import run_specialists_parallel
  from tracks.arbitrated.arbiter import Arbiter
  from tracks.arbitrated.config import CONFIGS as ARBITRATED_CONFIGS
  ```
- The orchestrator's tools are swappable: `orchestrator.guideline_retrieval = variant_retriever`
- Use a single `CostLedger` that spans ALL stages so the total cost is tracked
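The shared-ledger guidance might be sketched like this; the `CostLedger` API and the per-stage costs shown are assumptions, not the real implementation:

```python
# Sketch of a single ledger spanning all composite stages. The charge/total
# API and the dollar figures are illustrative assumptions.
class CostLedger:
    def __init__(self):
        self.entries = []

    def charge(self, stage: str, usd: float):
        self.entries.append((stage, usd))

    @property
    def total(self) -> float:
        return sum(usd for _, usd in self.entries)

ledger = CostLedger()
ledger.charge("retrieval", 0.001)      # B* stage
ledger.charge("specialists", 0.030)    # D* stage (parallel calls, summed)
ledger.charge("refinement", 0.020)     # C* stage
ledger.charge("verification", 0.010)   # H* stage
print(f"${ledger.total:.3f} per case")
```

Passing one ledger instance through every stage (rather than one ledger per track) is what makes the composite cost-per-case number in Step 3.4 directly comparable across patterns.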
### Step 3.3: Run Compositions

```powershell
# Start with E1 (simplest)
python -m tracks.combined.run_combined --pattern E1 --max-cases 3   # smoke
python -m tracks.combined.run_combined --pattern E1                 # full 50 cases
# Then E2 and E3 if E1 shows promise
python -m tracks.combined.run_combined --pattern E2 --max-cases 10
python -m tracks.combined.run_combined --pattern E3 --max-cases 10
```

### Step 3.4: Evaluate Composition

**Record:**

```
E1 top-1 accuracy = _____%  |  cost/case = $_____  |  runtime/case = ___s
E2 top-1 accuracy = _____%  |  cost/case = $_____  |  runtime/case = ___s
E3 top-1 accuracy = _____%  |  cost/case = $_____  |  runtime/case = ___s
Best single track: Track ___ at ____%
Best composition:  Pattern ___ at ____%
Composition lift vs best single track: +____%
```

**Key questions to answer:**
1. Are the gains from B/C/D/F/G/H additive when composed? (If E1 ≈ best single track, they're not.)
2. Which pattern gives the best accuracy/cost ratio?
3. Is there a simpler 2-axis composition (e.g., B+C only) that gets 80% of the E1 benefit at 30% of the cost?

### Step 3.5: Test Partial Compositions

Based on the Phase 1+2 ranking, test 2-axis combos of the top 3 axes:

```
E_BC: B* + C* only (better retrieval + iteration)
E_BD: B* + D* only (better retrieval + specialists)
E_BF: B* + F* only (better retrieval + prompt architecture)
E_CD: C* + D* only (iteration + specialists)
E_BH: B* + H* only (better retrieval + verification)
```

This tells us which pairs compose well and which interfere. Run each at 50 cases.

**Record the pair interaction matrix:**

```
      B*      C*      D*      F*      G*      H*
B*    -     ____%   ____%   ____%   ____%   ____%
C*            -     ____%   ____%   ____%   ____%
D*                    -     ____%   ____%   ____%
F*                            -     ____%   ____%
G*                                    -     ____%
H*                                            -
```

(Each cell = top-1 accuracy of that 2-axis composition)
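If the full matrix is wanted rather than just the top-3 pairs, the 2-axis combos can be enumerated programmatically (the winner IDs below are placeholders):

```python
# Enumerate all 2-axis composition IDs for the interaction matrix.
# Winner IDs are placeholders for whatever Phase 1/2 actually select.
from itertools import combinations

winners = {"B": "B6", "C": "C1", "D": "D0", "F": "F3", "G": "G2", "H": "H2"}
pairs = [f"E_{a}{b}" for a, b in combinations(sorted(winners), 2)]
print(len(pairs), pairs[:3])  # 15 combos: E_BC, E_BD, E_BF, ...
```

At 50 cases each, all 15 cells is a substantial run; the five combos listed above cover the likely top-3 axes first.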
---

## Phase 4 – Cherry-Pick and Finalize

**Goal:** Take the best composition from Phase 3 and apply any remaining optimizations.

### Step 4.1: Lock the Winner

Based on Phase 3 data, select the final pipeline configuration:

```
FINAL CONFIG:
  Retrieval:     ____________ (B variant or baseline)
  Prompt:        ____________ (F variant or baseline)
  Reasoning:     ____________ (D config, or single generalist)
  Iteration:     ____________ (C config, or none)
  Verification:  ____________ (H config, or none)
  Voting:        ____________ (G config, or single sample)
  Composition:   ____________ (E pattern)
  Top-1 accuracy:   ____%
  Cost per case:    $____
  Runtime per case: ____s
```

### Step 4.2: 100-Case Validation

Run the final config against an expanded dataset to confirm the result isn't a fluke:

```powershell
# If possible, run 100 MedQA cases (load more from the JSONL)
python -m tracks.combined.run_combined --pattern <winner> --max-cases 100
```

**If 100-case accuracy is within ±3% of 50-case accuracy:** The result is stable.
**If it drops by >5%:** We overfit to the 50-case sample. Re-evaluate.
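As a rough guide for interpreting those ±3%/5% thresholds, the sampling error of an accuracy estimate on n cases can be approximated with a normal 95% interval; note that it is fairly wide at these sample sizes:

```python
# Normal-approximation 95% margin of error for an accuracy measured on n cases.
# A rough sanity check, not a formal power analysis.
import math

def margin_95(accuracy: float, n: int) -> float:
    return 1.96 * math.sqrt(accuracy * (1 - accuracy) / n)

print(round(margin_95(0.40, 50), 3))   # ~0.136 -> +/-13.6 points at n=50
print(round(margin_95(0.40, 100), 3))  # ~0.096 -> +/-9.6 points at n=100
```

Since the margins at n=50 and n=100 are larger than the 3-5 point thresholds, a small 50-vs-100 gap is weak evidence either way; a drop well beyond the margin is the clearer overfitting signal.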
### Step 4.3: Run Complementary Benchmarks

Run the winner through the MTSamples and PMC harnesses (if available) to show generalization:

```powershell
# These may need adaptation to work with the combined pipeline
python -m validation.run_validation --mtsamples --max-cases 20
python -m validation.run_validation --pmc --max-cases 10
```

### Step 4.4: Update Submission Materials

1. **Update `docs/kaggle_writeup.md`** with the final accuracy numbers, the winning configuration,
   and the experimental journey (which axes mattered, which didn't, composition effects).
2. **Update `docs/video_script.md`** if the demo pipeline changed significantly (e.g., if the
   best config uses specialists, the video should show the specialist pipeline).
3. **Update `docs/architecture.md`** with the final pipeline diagram.
4. **Push to GitHub:**
   ```powershell
   git add -A
   git commit -m "Phase 4: Final pipeline configuration - XX% top-1 accuracy"
   git push
   ```

### Step 4.5: Record Demo Video

Follow `docs/video_script.md` with the FINAL pipeline configuration running live.

### Step 4.6: Submit on Kaggle

Follow the submission steps in `docs/kaggle_writeup.md`. Include:
- Final writeup with experimental results
- Video link
- GitHub repo link
- (Optional) Live demo URL if deployed

---
## Decision Log

Use this section to record key decisions as you execute the plan.

### Phase 1 Results

```
Date: ___________
B* = ___________  accuracy: ____%  delta: +____%  latency: ____ms
C* = ___________  accuracy: ____%  delta: +____%  avg_iters: ____
D* = ___________  accuracy: ____%  delta: +____%  cost/case: $____
Best single axis: Track ___
Notes:
```

### Phase 2 Results

```
Date: ___________
F* = ___________  accuracy: ____%  delta: +____%
G* = ___________  accuracy: ____%  delta: +____%  cost: ____×
H* = ___________  accuracy: ____%  delta: +____%
Ranked axes (by lift):
1. ___  2. ___  3. ___  4. ___  5. ___  6. ___
Notes:
```

### Phase 3 Results

```
Date: ___________
E1 accuracy: ____%  cost/case: $____
E2 accuracy: ____%  cost/case: $____
E3 accuracy: ____%  cost/case: $____
Best pair:   ___ + ___        accuracy: ____%
Best triple: ___ + ___ + ___  accuracy: ____%
Notes:
```

### Phase 4 Final

```
Date: ___________
Final config: ___________________________
Final accuracy (50-case):  ____%
Final accuracy (100-case): ____%
Cost per case:    $____
Runtime per case: ____s
Submitted:      [ ] Yes  [ ] No
Video recorded: [ ] Yes  [ ] No
```

---
## Time Budget

| Phase | Estimated Endpoint Hours | Estimated Wall Clock | Estimated Cost |
|-------|--------------------------|----------------------|----------------|
| Phase 1 (B+C+D) | 8–12 hrs | 1–2 days | $20–30 |
| Phase 2 (F+G+H) | 6–10 hrs | 1–2 days | $15–25 |
| Phase 3 (Compositions) | 4–8 hrs | 1 day | $10–20 |
| Phase 4 (Finalize) | 2–3 hrs | 1 day | $5–8 |
| **Total** | **20–33 hrs** | **4–7 days** | **$50–83** |

**Deadline:** February 24, 2026, 11:59 PM UTC
**Today:** February 15, 2026
**Available:** ~9 days

**Suggested schedule:**
- Feb 15–16: Phase 1 (run overnight, collect in morning)
- Feb 17–18: Phase 2 (build F/G/H, run overnight)
- Feb 19–20: Phase 3 (compositions)
- Feb 21–22: Phase 4 (finalize, video, writeup update)
- Feb 23: Buffer day + final submission
- Feb 24: Deadline

---

## Abort Conditions

Stop and re-evaluate the strategy if:

1. **Endpoint costs exceed $100 total** → we're overspending for marginal gains
2. **All Phase 1 tracks show <2% lift** → the model, not the pipeline, is the bottleneck. Consider:
   - Switching to `medgemma-4b-it` for faster iteration on prompts
   - Focusing entirely on prompt architecture (Track F)
   - Reducing scope to best-effort with the current accuracy + a strong writeup
3. **Phase 3 compositions LOSE accuracy vs. single tracks** → negative interaction effects. Simplify back to the best single track.
4. **Consistent pipeline failures (>10% error rate)** → endpoint stability issue. Fix infrastructure before continuing experiments.
5. **February 22 reached without Phase 3 complete** → lock whatever is best so far and move directly to Phase 4 (finalize + submit). Do not risk missing the deadline for marginal gains.