# Skill-Uplift Eval: methodology (read before trusting any number)

## The claim we are testing

A generated SKILL.md gives a capable LLM real uplift on a *related future task*, over that same model solving the task with no skill. This is the judge's bar: "useful even to a frontier model that is already capable without it."

## Why this is hard to measure honestly (and how we handle each trap)

1. **Confounding by model strength.** A strong model may ace the task with or without the skill, hiding any uplift. We pick tasks at the *edge* of the model's ability (it sometimes fails them unaided), where a skill can move the needle. We report the unaided baseline pass rate so the headroom is visible.
2. **Leakage / circularity.** If the eval task is the *same* problem the skill was built from, the skill is just an answer key, which makes the result meaningless. So the held-out task is a DIFFERENT problem in the SAME class as the session. The skill must transfer, not memorize. We state the session→task pairing explicitly.
3. **Grader bias.** A grader that sees which answer used the skill will favor it. The grader is BLIND: it receives the two answers in randomized order with the condition labels stripped, and judges only correctness/quality.
4. **Cherry-picking.** We fix the task set and the sessions BEFORE running, list them here, and report every item including failures. No post-hoc dropping.
5. **The "skill is just hints" objection.** A skill that smuggles the answer is not uplift, it's cheating. We verify each skill contains transferable PROCEDURE (gotchas, what-not-to-do), not the specific solution to the eval task. Any skill whose gotchas name the eval task's exact answer is disqualified and noted.

## Design

- N sessions, each paired with a DISTINCT held-out task in the same problem class.
- For each task, the SAME model answers twice:
  - A) NO-SKILL: task only.
  - B) WITH-SKILL: task + the SKILL.md generated from the paired session.

  Which answer is generated first is irrelevant (they are separate calls), but the two answers are handed to the grader in RANDOM order with labels stripped.
- A blind grader (a separate strong model) scores each answer 0..1 on task success, not knowing which had the skill. We also run a small HUMAN-labeled calibration set first (like the groundedness eval) to check the grader agrees with us.
- Uplift = mean(with_skill_score) - mean(no_skill_score). We report:
  - per-task scores (both conditions),
  - the unaided baseline (headroom),
  - the win/tie/loss count (how often skill helped / didn't / hurt),
  - the calibration agreement,
  - and the raw generations, saved to disk, so anyone can re-score.

## What an honest result looks like

We commit IN ADVANCE to reporting the number as-is. Possible honest outcomes:

- Clear positive uplift -> the skill works; report it.
- Near-zero uplift -> the skill is pleasant but not load-bearing; say so.
- Negative on some tasks -> the skill sometimes misleads; report which and why.

Any of these is a credible result. Only a hidden or massaged number is not.

## Honest limitations (stated up front)

- Small N. This is an indicative eval, not a benchmark. We report N and treat the result as directional, exactly as we did with the 25-transcript groundedness eval.
- Single grader model. Grader bias is reduced by blinding but not eliminated; the calibration set is how we keep ourselves honest about it.
- Task-class choice matters. We pick classes where a skill *could* plausibly help (procedural/gotcha-heavy domains); we do not claim uplift on trivia.
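
The blinding and scoring steps from the Design section can be sketched roughly as below. This is a minimal illustration, not the eval's actual harness: the function names (`blind_pair`, `unblind`, `summarize`), the condition labels, and the tie tolerance are all assumptions made for the example, and the scores in the demo are placeholders, not results.

```python
import random
from statistics import mean


def blind_pair(no_skill_answer, with_skill_answer, rng):
    """Shuffle the two answers and strip condition labels.

    Returns (answers, key): `answers` is what the grader sees,
    `key` stays hidden and records which condition sits at each position.
    """
    items = [("no_skill", no_skill_answer), ("with_skill", with_skill_answer)]
    rng.shuffle(items)
    answers = [text for _, text in items]
    key = [condition for condition, _ in items]
    return answers, key


def unblind(scores, key):
    """Map the grader's positional scores (each 0..1) back to conditions."""
    return dict(zip(key, scores))


def summarize(per_task, tie_eps=1e-9):
    """Compute baseline, uplift, and win/tie/loss over all tasks.

    `per_task` is a list of {"no_skill": float, "with_skill": float}.
    `tie_eps` is a hypothetical tolerance for calling two scores a tie.
    """
    no = [t["no_skill"] for t in per_task]
    ws = [t["with_skill"] for t in per_task]
    wins = sum(w - n > tie_eps for n, w in zip(no, ws))
    losses = sum(n - w > tie_eps for n, w in zip(no, ws))
    return {
        "n": len(per_task),
        "baseline": mean(no),             # unaided pass rate (headroom)
        "uplift": mean(ws) - mean(no),    # mean(with_skill) - mean(no_skill)
        "wins": wins,
        "ties": len(per_task) - wins - losses,
        "losses": losses,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is reproducible
    answers, key = blind_pair("answer without skill", "answer with skill", rng)
    # The grader would see only `answers`; `key` is kept out of its prompt.

    # Placeholder scores for illustration only, not real eval results.
    placeholder = [
        {"no_skill": 0.5, "with_skill": 1.0},
        {"no_skill": 0.0, "with_skill": 0.0},
        {"no_skill": 1.0, "with_skill": 0.5},
        {"no_skill": 0.5, "with_skill": 1.0},
    ]
    print(summarize(placeholder))
```

Keeping the key separate from the grader's input is what makes the grading blind; every task goes into `summarize`, including ties and losses, which matches the commitment to report every item.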