Spaces:

build-small-hackathon
/

TurboSkillSlug

Running

App Files Files Community

TurboSkillSlug / METHODOLOGY.md

legendarydragontamer

deploy

51a9974 19 days ago

preview code

Raw

History Blame Contribute Delete

3.63 kB

	# Skill-Uplift Eval — methodology (read before trusting any number)

	## The claim we are testing

	A generated SKILL.md gives a capable LLM real uplift on a related future task,
	over that same model solving the task with no skill. This is the judge's bar:
	"useful even to a frontier model that is already capable without it."

	## Why this is hard to measure honestly (and how we handle each trap)

	1. Confounding by model strength. A strong model may ace the task with or
	without the skill, hiding any uplift. We pick tasks at the edge of the
	model's ability (it sometimes fails them unaided), where a skill can move the
	needle. We report the unaided baseline pass rate so the headroom is visible.

	2. Leakage / circularity. If the eval task is the same problem the skill was
	built from, the skill is just an answer key — meaningless. So the held-out task
	is a DIFFERENT problem in the SAME class as the session. The skill must
	transfer, not memorize. We state the session→task pairing explicitly.

	3. Grader bias. A grader that sees which answer used the skill will favor it.
	The grader is BLIND: it receives the two answers in randomized order with the
	condition labels stripped, and judges only correctness/quality.

	4. Cherry-picking. We fix the task set and the sessions BEFORE running, list
	them here, and report every item including failures. No post-hoc dropping.

	5. The "skill is just hints" objection. A skill that smuggles the answer is not
	uplift, it's cheating. We verify each skill contains transferable PROCEDURE
	(gotchas, what-not-to-do), not the specific solution to the eval task. Any
	skill whose gotchas name the eval task's exact answer is disqualified and noted.

	## Design

	- N sessions, each paired with a DISTINCT held-out task in the same problem class.
	- For each task, the SAME model answers twice:
	A) NO-SKILL: task only.
	B) WITH-SKILL: task + the SKILL.md generated from the paired session.
	Order of which is generated first is irrelevant (separate calls), but the two
	answers are handed to the grader in RANDOM order with labels stripped.
	- A blind grader (a separate strong model) scores each answer 0..1 on task success,
	not knowing which had the skill. We also run a small HUMAN-labeled calibration
	set first (like the groundedness eval) to check the grader agrees with us.
	- Uplift = mean(with_skill_score) - mean(no_skill_score). We report:
	- per-task scores (both conditions),
	- the unaided baseline (headroom),
	- the win/tie/loss count (how often skill helped / didn't / hurt),
	- the calibration agreement,
	- and the raw generations, saved to disk, so anyone can re-score.

	## What an honest result looks like

	We commit IN ADVANCE to reporting the number as-is. Possible honest outcomes:
	- Clear positive uplift -> the skill works; report it.
	- Near-zero uplift -> the skill is pleasant but not load-bearing; say so.
	- Negative on some tasks -> the skill sometimes misleads; report which and why.
	Any of these is a credible result. Only a hidden or massaged number is not.

	## Honest limitations (stated up front)

	- Small N. This is an indicative eval, not a benchmark. We report N and treat the
	result as directional, exactly as we did with the 25-transcript groundedness eval.
	- Single grader model. Grader bias is reduced by blinding but not eliminated; the
	calibration set is how we keep ourselves honest about it.
	- Task-class choice matters. We pick classes where a skill could plausibly help
	(procedural/gotcha-heavy domains); we do not claim uplift on trivia.