# Skill-Uplift Eval — methodology (read before trusting any number)

## The claim we are testing

A generated SKILL.md gives a capable LLM real uplift on a *related future task*,
over that same model solving the task with no skill. This is the judge's bar:
"useful even to a frontier model that is already capable without it."

## Why this is hard to measure honestly (and how we handle each trap)

1. **Confounding by model strength.** A strong model may ace the task with or
   without the skill, hiding any uplift. We pick tasks at the *edge* of the
   model's ability (it sometimes fails them unaided), where a skill can move the
   needle. We report the unaided baseline pass rate so the headroom is visible.

2. **Leakage / circularity.** If the eval task is the *same* problem the skill was
   built from, the skill is just an answer key and any measured uplift is
   meaningless. So the held-out task is a DIFFERENT problem in the SAME class as
   the session. The skill must transfer, not memorize. We state the session→task
   pairing explicitly.

3. **Grader bias.** A grader that sees which answer used the skill may favor it.
   So the grader is BLIND: it receives the two answers in randomized order with
   the condition labels stripped, and judges only correctness/quality.

4. **Cherry-picking.** We fix the task set and the sessions BEFORE running, list
   them here, and report every item including failures. No post-hoc dropping.

5. **The "skill is just hints" objection.** A skill that smuggles the answer is not
   uplift, it's cheating. We verify each skill contains transferable PROCEDURE
   (gotchas, what-not-to-do), not the specific solution to the eval task. Any
   skill whose gotchas name the eval task's exact answer is disqualified and noted.
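
The blinding in trap 3 can be sketched roughly as follows. This is a minimal illustration, not the eval's actual harness: `BlindPair`, `make_blind_pair`, `unblind`, and the condition names `no_skill` / `with_skill` are hypothetical names chosen for the example. The idea is that the grader only ever sees positional answers, while the mapping back to conditions is kept privately and applied after scoring.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlindPair:
    """What the grader sees: a task id and two unlabeled answers."""
    task_id: str
    first: str
    second: str


def make_blind_pair(task_id, no_skill_answer, with_skill_answer, rng):
    """Shuffle the two answers and strip the condition labels.

    Returns (pair, key). `pair` goes to the grader; `key` maps each
    position ("first"/"second") to its condition and stays private.
    """
    items = [("no_skill", no_skill_answer), ("with_skill", with_skill_answer)]
    rng.shuffle(items)  # randomized presentation order
    pair = BlindPair(task_id, items[0][1], items[1][1])
    key = {"first": items[0][0], "second": items[1][0]}
    return pair, key


def unblind(key, position_scores):
    """Map grader scores keyed by position back to condition names."""
    return {key[pos]: score for pos, score in position_scores.items()}
```

Passing in a seeded `random.Random` instance keeps the presentation order reproducible, which helps when the saved generations are re-scored later.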

## Design

- N sessions, each paired with a DISTINCT held-out task in the same problem class.
- For each task, the SAME model answers twice:
    A) NO-SKILL: task only.
    B) WITH-SKILL: task + the SKILL.md generated from the paired session.
  The two answers come from separate, independent calls, so generation order is
  irrelevant; they are handed to the grader in RANDOM order with labels stripped.
- A blind grader (a separate strong model) scores each answer 0..1 on task success,
  not knowing which had the skill. We also run a small HUMAN-labeled calibration
  set first (like the groundedness eval) to check that the grader's scores agree
  with our own labels.
- Uplift = mean(with_skill_score) - mean(no_skill_score). We report:
    - per-task scores (both conditions),
    - the unaided baseline (headroom),
    - the win/tie/loss count (how often skill helped / didn't / hurt),
    - the calibration agreement,
    - and the raw generations, saved to disk, so anyone can re-score.
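
The headline numbers above could be computed along these lines. This is a sketch under assumptions: the function name `summarize`, the tuple layout of `scores`, and the `tie_eps` tolerance are hypothetical, and the document does not specify how close two scores must be to count as a tie (the default here treats only exactly equal scores as ties).

```python
from statistics import mean


def summarize(scores, tie_eps=0.0):
    """Summarize per-task scores from the blind grader.

    `scores` is a list of (task_id, no_skill_score, with_skill_score),
    each score in the range 0..1. `tie_eps` is an assumed tolerance:
    score differences no larger than it count as a tie.
    """
    no_skill = [n for _, n, _ in scores]
    with_skill = [w for _, _, w in scores]

    # Win = skill helped, loss = skill hurt, tie = no meaningful difference.
    wins = sum(1 for n, w in zip(no_skill, with_skill) if w - n > tie_eps)
    losses = sum(1 for n, w in zip(no_skill, with_skill) if n - w > tie_eps)
    ties = len(scores) - wins - losses

    return {
        "n": len(scores),
        "baseline": mean(no_skill),       # unaided score, shows the headroom
        "with_skill": mean(with_skill),
        "uplift": mean(with_skill) - mean(no_skill),
        "wins": wins,
        "ties": ties,
        "losses": losses,
    }
```

Reporting the win/tie/loss counts next to the mean matters at small N, since a single large per-task swing can dominate the mean uplift.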

## What an honest result looks like

We commit IN ADVANCE to reporting the number as-is. Possible honest outcomes:

- Clear positive uplift -> the skill works; report it.
- Near-zero uplift -> the skill is pleasant but not load-bearing; say so.
- Negative on some tasks -> the skill sometimes misleads; report which and why.

Any of these is a credible result. Only a hidden or massaged number is not.

## Honest limitations (stated up front)

- Small N. This is an indicative eval, not a benchmark. We report N and treat the
  result as directional, exactly as we did with the 25-transcript groundedness eval.
- Single grader model. Grader bias is reduced by blinding but not eliminated; the
  calibration set is how we keep ourselves honest about it.
- Task-class choice matters. We pick classes where a skill *could* plausibly help
  (procedural/gotcha-heavy domains); we do not claim uplift on trivia.