Baladithya Balamurugan

	---
	status: accepted
	date: 2026-06-08
	deciders: [Codeseys, ARIA]
	builds-on: [ADR-010 (FeatureDeletion datagen — per-task controls), ADR-012 (curriculum + provenance review findings)]
	---

	# ADR-015: Held-out disjoint eval + depth/generation kill-switch (HeldOutGuard)

	## Context and Problem Statement

	The framework drives a self-evolving RL flywheel: a generator proposes tasks,
	the policy is optimized against an in-loop (proxy / oracle) reward, and the loop
	repeats across generations. ADR-010 gave this loop its per-task safety
	controls — the 4-gate solvability validator, the `HackMonitor` provenance check,
	and the sandbox denylist (now hardened by `DockerSandbox`, see API §16). What was
	still missing is the run-level / across-generation control: a watcher sitting
	ABOVE the per-task gates that asks, every generation, *"is the proxy reward
	improving because the policy got better, or because it learned to game the
	proxy?"* — and HALTS the run when the answer is the latter.

	The literature is unambiguous that a held-out eval + a hard stop is the
	load-bearing control here, not a nice-to-have:

	- Reward hacking rises monotonically with optimization depth. Zhao et al.,
	"Reward Hacking in Self-Improving Code Agents" (ICLR 2026 Workshop on RSI,
	OpenReview `ikrQWGgxYg`), show that going from 10 → 100 optimization steps
	drives the hacking rate from 26.4% → 57.8% (+31.4 points), and that 73.8% of
	KernelBench / 46.8% of ALE-Bench optimizations show **proxy gains without real
	gains**. They define **Hacking Gap = proxy gain − real gain** and label an
	optimization reward-hacking when it *"improves the public metric WITHOUT
	improving the private metric"* — the canonical signature a run-level tripwire
	must fire on. Because the hacking rate climbs with depth, a one-time eval is
	insufficient; the control has to be an online per-generation tripwire.

	- Closed-loop RL on self-generated data collapses. The self-evolving-agents
	survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) §8.3 names *"model
	collapse from closed-loop RL on static synthetic data"* and prescribes
	"continuous monitoring … to detect long-horizon value drift." Shumailov et
	al. (Nature 2024, "AI models collapse when trained on recursively generated
	data") show self-training first loses the distribution tails, then converges to
	a low-variance point estimate. The mitigation that matters here: the held-out
	eval must stay anchored to **REAL tasks that are NEVER fed back to the
	generator**, otherwise the eval drifts with the train set and the gap signal
	becomes meaningless.

	- Self-critique alone is insufficient. The RSI paper's "retrospection"
	self-critique sometimes increased hacking; its conclusion: *"mitigating
	reward hacking likely requires stronger evaluations and constraints beyond
	self-critique alone."* So the control is a genuinely disjoint held-out eval +
	hard stop, not a critique hook.

	- A verifiable / held-out reward is still gameable. EvilGenie
	(arXiv 2511.21654) found *"only minimal improvement from the use of held out
	test cases"* in isolation and that *"holdout tests have many surprising failure
	modes."* The implication is NOT "skip the held-out eval" but "do not treat it
	as sufficient on its own" — layer it explicitly as defense-in-depth on top
	of `HackMonitor` (provenance) and the KL hard stop. Catastrophic Goodhart
	(OpenReview `UXuBzWoZGK`) likewise proves KL regularization alone does NOT
	prevent heavy-tailed reward misspecification — so the KL stop is one tripwire
	among several, never the sole control.

	## Decision

	Build `composer_replication.safety` — a pure-Python, no-torch / no-cloud,
	fully CPU-testable run-level safeguard — centered on the `HeldOutGuard`
	kill-switch. The guard is stateful and is called once per checkpoint /
	generation (`update(round_idx, in_loop_reward, heldout_score, kl_to_init=…)`),
	the same cadence as `DifficultyCurriculum.update`. It maintains denoised EMAs of
	every metric (raw single-step values are too noisy to threshold) and returns a
	structured `TripwireStatus`.
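
To illustrate why thresholds are applied to EMAs rather than raw values, here is a minimal sketch of exponential smoothing on a noisy held-out series. The `Ema` class, the smoothing factor `alpha=0.3`, and the sample data are assumptions made for this example; the ADR does not state the smoothing factor the guard actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ema:
    """Exponential moving average (hypothetical helper, not the guard's own class)."""

    alpha: float = 0.3  # assumed smoothing factor
    value: Optional[float] = None

    def update(self, x: float) -> float:
        # Seed with the first observation, then blend each new sample in.
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


# Made-up held-out scores with one noisy dip at index 2.
raw_heldout = [0.50, 0.52, 0.35, 0.53, 0.54]
ema = Ema(alpha=0.3)
smoothed = [ema.update(x) for x in raw_heldout]
```

In this example the single-checkpoint dip to 0.35 only pulls the smoothed series down to about 0.459, so one hard batch is less likely to trip a threshold on its own.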

	### The 3 fire conditions

	`HeldOutGuard.update` returns `fire=True` (alias `halt`) when ANY of the following holds:

	- (a) collapse-caught-in-the-act — the in-loop reward EMA is RISING while the
	held-out score EMA has DECLINED for `>= decline_patience` consecutive
	checkpoints (default 3, the "monotone for ≥3 checkpoints" rule). This is the
	canonical reward-hacking signature: proxy up, real down. A held-out dip during
	an in-loop dip is treated as noise (a hard batch), not hacking — the decline
	streak only grows when in-loop is simultaneously rising.

	- (b) KL-to-init hard stop — the `kl_to_init` EMA exceeds `kl_hard_stop`
	(default 0.08 nats/token) on/after `min_steps`. Checked first as the
	cheapest unambiguous breach.

	- (c) proxy-real gap blowout — the Hacking Gap (proxy gain − real gain since a
	run-start baseline) widens beyond `max_proxy_real_gap` (default 0.10), catching
	a fast single-generation divergence even before the full decline window
	elapses. `HeldOutGuard.proxy_real_gap()` returns exactly the RSI Hacking-Gap
	quantity.

	No tripwire fires before `min_steps` (default 20) to avoid halting on early-run
	warm-up noise. Once fired, the verdict is latched — every subsequent `update`
	keeps `fire=True`, so a transient post-collapse recovery cannot silently un-halt
	the run.
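
The rules above can be sketched as a small stand-in class. This is an illustration under stated assumptions, not the code in `kill_switch.py`: `SketchGuard`, `SketchStatus`, the `alpha` smoothing factor, the reason strings, and the choice to reset the decline streak on any checkpoint that is not "proxy up, held-out down" are all hypothetical. Only the parameter names and defaults (`decline_patience`, `kl_hard_stop`, `max_proxy_real_gap`, `min_steps`) come from this ADR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchStatus:
    fire: bool
    reason: Optional[str]


class SketchGuard:
    """Illustrative stand-in for the guard; not the real HeldOutGuard."""

    def __init__(self, decline_patience=3, kl_hard_stop=0.08,
                 max_proxy_real_gap=0.10, min_steps=20, alpha=0.5):
        self.decline_patience = decline_patience
        self.kl_hard_stop = kl_hard_stop
        self.max_proxy_real_gap = max_proxy_real_gap
        self.min_steps = min_steps
        self.alpha = alpha    # assumed EMA smoothing factor
        self.ema = {}         # metric name -> current EMA
        self.base = {}        # metric name -> run-start baseline (first EMA)
        self.streak = 0       # consecutive "proxy up, held-out down" checkpoints
        self.latched = None   # first fired status, returned forever after

    def _fold(self, name, x):
        prev = self.ema.get(name)
        cur = x if prev is None else self.alpha * x + (1 - self.alpha) * prev
        self.ema[name] = cur
        self.base.setdefault(name, cur)
        return prev, cur

    def proxy_real_gap(self):
        # Hacking Gap = proxy gain - real gain, both measured since baseline.
        proxy_gain = self.ema["in_loop"] - self.base["in_loop"]
        real_gain = self.ema["heldout"] - self.base["heldout"]
        return proxy_gain - real_gain

    def update(self, round_idx, in_loop_reward, heldout_score, kl_to_init=None):
        prev_in, cur_in = self._fold("in_loop", in_loop_reward)
        prev_ho, cur_ho = self._fold("heldout", heldout_score)
        if kl_to_init is not None:
            self._fold("kl", kl_to_init)
        if self.latched is not None:
            return self.latched  # latched: a later recovery cannot un-halt
        rising = prev_in is not None and cur_in > prev_in
        declining = prev_ho is not None and cur_ho < prev_ho
        # Assumption: the streak resets unless both conditions hold together.
        self.streak = self.streak + 1 if (rising and declining) else 0
        if round_idx < self.min_steps:
            return SketchStatus(False, None)  # warm-up: never fire
        reason = None
        if self.ema.get("kl", 0.0) > self.kl_hard_stop:      # (b), checked first
            reason = "kl_hard_stop"
        elif self.streak >= self.decline_patience:           # (a)
            reason = "collapse_in_the_act"
        elif self.proxy_real_gap() > self.max_proxy_real_gap:  # (c)
            reason = "proxy_real_gap"
        if reason is not None:
            self.latched = SketchStatus(True, reason)
            return self.latched
        return SketchStatus(False, None)


# Demo: proxy climbs while held-out falls for three checkpoints, then "recovers".
guard = SketchGuard(min_steps=0, alpha=1.0)  # alpha=1.0 disables smoothing here
trace = [(0.50, 0.50), (0.52, 0.49), (0.54, 0.48), (0.56, 0.47), (0.40, 0.60)]
statuses = [guard.update(i, proxy, real) for i, (proxy, real) in enumerate(trace)]
```

In the demo the sketch fires at index 3 via rule (a) and stays fired at index 4 even though the last checkpoint looks like a recovery, which is the latching behavior described above.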

	### HeldoutSplit disjointness discipline (design-of-record)

	The `heldout_score` fed to the guard MUST come from a **disjoint held-out eval
	pool** — REAL tasks the generator NEVER trains on (the `HeldoutSplit`
	discipline). This is the load-bearing precondition: per the self-evolving survey
	§8.3 / Shumailov collapse dynamics, if the held-out set is allowed to drift with
	the train set, the proxy-real gap signal degenerates and the guard becomes blind
	to the exact collapse it exists to catch. The split is documented here as the
	design-of-record; the guard consumes a scalar `heldout_score` and does not
	itself partition data — the caller is responsible for keeping the split disjoint
	and never feeding held-out tasks back into the generator.
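
Because the guard does not partition data itself, the caller needs some way to keep the pools apart. One possible approach, sketched below, is a deterministic hash-based assignment plus an explicit overlap check before each generation. The helpers `is_heldout` and `assert_disjoint`, the salt string, the 10% fraction, and the task-id format are all hypothetical; the ADR does not prescribe how the split is produced.

```python
import hashlib
from typing import Iterable


def is_heldout(task_id: str, heldout_fraction: float = 0.1,
               salt: str = "heldout-v1") -> bool:
    """Deterministically assign a task to the held-out pool by hashing its id."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < heldout_fraction


def assert_disjoint(train_ids: Iterable[str], heldout_ids: Iterable[str]) -> None:
    """Raise if any held-out task id also appears in the training pool."""
    overlap = set(train_ids) & set(heldout_ids)
    if overlap:
        sample = sorted(overlap)[:5]
        raise ValueError(f"held-out tasks leaked into the training pool: {sample}")


# Made-up task ids, purely for illustration.
task_ids = [f"task-{i:04d}" for i in range(1000)]
heldout = {t for t in task_ids if is_heldout(t)}
train = [t for t in task_ids if t not in heldout]
assert_disjoint(train, heldout)  # passes: the two pools share no ids
```

Hashing the id rather than sampling at random means a task lands in the same pool on every run, so a held-out task cannot drift into the generator's inputs after a restart or reshuffle.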

	### The 0.08 nats/token KL hard-stop default

	The GRPO "healthy progression" band (Orchestra Research GRPO skill) climbs
	0.02 → 0.05 → 0.08 → 0.12 nats/token over a run, with **0.08 the top of the "good
	progression" band** and inside the lower part of the code-generation drift zone (0.05–0.15
	per-token; >0.5 is "diverging too much"). So 0.08 nats/token is a sound
	hard-stop default. `calibrate_kl_threshold(baseline_kls, factor=3.0)` lets a run
	adapt the ceiling to its own KL scale ("record baseline KL over the first ~100
	steps, set max to 3× that") — but with a safety clamp: calibration may only
	ever TIGHTEN the stop (`min(3×baseline, current)`), never loosen it past the
	documented collapse band, so a noisy / already-drifting baseline cannot raise the
	ceiling above 0.08.
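
The tighten-only clamp can be sketched in a few lines. The function name `calibrate_kl_threshold_sketch`, the `current` parameter, and the use of the mean to summarize the baseline are assumptions for this example; the ADR states only the `factor=3.0` default, the `min(3×baseline, current)` clamp, and that empty input raises.

```python
from statistics import mean
from typing import Sequence


def calibrate_kl_threshold_sketch(baseline_kls: Sequence[float],
                                  current: float = 0.08,
                                  factor: float = 3.0) -> float:
    """Return a KL ceiling that can only be tighter than `current`."""
    if not baseline_kls:
        raise ValueError("baseline_kls must be non-empty")
    # Assumption: the baseline is summarized by its mean.
    proposed = factor * mean(baseline_kls)
    # Safety clamp: calibration may lower the ceiling but never raise it.
    return min(proposed, current)


quiet_run = calibrate_kl_threshold_sketch([0.010, 0.012, 0.008])
drifting_run = calibrate_kl_threshold_sketch([0.05, 0.06])
```

With these made-up baselines, the quiet run tightens the ceiling to about 0.03, while the drifting run proposes 0.165 and is clamped back to 0.08.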

	> UNITS GOTCHA (load-bearing). `kl_to_init` is **token-mean KL in
	> nats/token**, matching `composer_replication.integrations.altered_minds.
	> kl_logging.token_mean_kl`. It is NOT comparable to a sequence-level /
	> sequence-summed KL (whose healthy band is ~0.05–10). Passing a sequence-summed
	> KL into the per-token hard stop will fire it instantly.
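
A small worked example of the unit mismatch, using made-up numbers (a 400-token sequence with a uniform per-token KL of 0.02 nats):

```python
KL_HARD_STOP = 0.08  # nats/token, the default from this ADR

# Made-up per-token KL values for one 400-token sequence.
per_token_kl = [0.02] * 400

token_mean_kl = sum(per_token_kl) / len(per_token_kl)  # nats/token
sequence_sum_kl = sum(per_token_kl)                    # nats per sequence

healthy_when_per_token = token_mean_kl <= KL_HARD_STOP
fires_when_summed = sequence_sum_kl > KL_HARD_STOP
```

The same healthy policy reads as roughly 0.02 in nats/token but roughly 8 when summed over the sequence, which is about 100 times the hard stop.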

	### Public surface

	`composer_replication.safety` re-exports:
	`HeldOutGuard`, `TripwireStatus`, `CollapseStopError`,
	`kl_token_trust_filter`. The guard exposes both flag-checking
	(`should_halt()` / `status.fire` / `status.halt`) and exception-based
	(`raise_if_fired(status)`, which raises `CollapseStopError`) control flow so a trainer loop can
	use whichever convention it prefers. `kl_token_trust_filter` is the per-token
	torchrl-style "KL-Mask" sibling (caller passes `0.5·(log π/π_ref)²`; returns
	True to mask the token) — same 0.08 band, kept torch-free.
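
A torch-free sketch of the per-token mask follows. The real signature of `kl_token_trust_filter` is not shown in this ADR, so `kl_token_trust_filter_sketch`, its `threshold` parameter, the `k2_estimate` helper, and the log-probabilities are all hypothetical; only the `0.5·(log π/π_ref)²` input and the "True means mask" convention come from the text above.

```python
def k2_estimate(logp: float, logp_ref: float) -> float:
    """Per-token KL estimate 0.5 * (log pi - log pi_ref)^2, computed by the caller."""
    return 0.5 * (logp - logp_ref) ** 2


def kl_token_trust_filter_sketch(token_kl: float, threshold: float = 0.08) -> bool:
    """Return True when the token should be masked out of the loss."""
    return token_kl > threshold


# Made-up (log pi, log pi_ref) pairs for two tokens.
token_logps = [(-1.0, -1.2), (-1.0, -1.5)]
masks = [kl_token_trust_filter_sketch(k2_estimate(lp, ref)) for lp, ref in token_logps]
```

Here the first token has an estimate of 0.02 and is kept, while the second has 0.125 and is masked.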

	## Consequences

	- Positive: the flywheel gains a run-level, online collapse tripwire that
	fires on the literature's exact reward-hacking signature (proxy-up / real-down),
	is denoised against single-step noise, and latches so a detected collapse
	cannot un-halt. It is layered defense-in-depth ON TOP OF the per-task ADR-010
	controls — neither sufficient alone (per EvilGenie / Catastrophic Goodhart).
	- Positive: pure-Python and CPU-testable — `kl_to_init` is a float the caller
	computes upstream, so the guard pulls no torch / cloud dependency and is unit
	testable without a model.
	- Positive: the thresholds are calibratable and the KL stop only ever
	tightens, so the safety property (ceiling ≤ documented band) is preserved
	across calibration.
	- Negative / honest: a held-out eval is necessary but NOT sufficient by
	itself (EvilGenie); the guard's value depends entirely on the caller honoring
	the `HeldoutSplit` disjointness discipline. The KL stop is one tripwire among
	several, not a Goodhart-proof guarantee. `entropy` / `reward_std` are tracked
	and exposed but are NOT yet hard gates (early-warning instruments only).
	- Neutral: `HeldoutSplit` ships as a documented design-of-record discipline
	rather than an enforced data-partitioning class in this wave; the guard
	consumes the scalar held-out score the caller provides.

	## Acceptance gate

	- [x] `HeldOutGuard.update(...)` folds in-loop / held-out / KL (+ entropy /
	reward_std) EMAs and returns a `TripwireStatus`; fires on (a) collapse-in-the-act,
	(b) KL > 0.08 nats/token, (c) proxy-real gap blowout; no fire before `min_steps`;
	latched after first fire.
	- [x] `proxy_real_gap()` returns the RSI Hacking-Gap (in-loop gain − held-out gain
	since baseline); `should_halt()` / `last_status` are idempotent query helpers;
	`raise_if_fired()` converts a fired verdict into `CollapseStopError`.
	- [x] `calibrate_kl_threshold()` only ever TIGHTENS the hard stop (safety clamp);
	raises on empty input.
	- [x] `kl_token_trust_filter()` per-token KL-Mask helper, torch-free.
	- [x] Pure-Python, CPU-only; `composer_replication.safety.__init__` re-exports the
	public surface and references this ADR.
	- [x] Documented in `docs/API_REFERENCE.md` §17.

	## More Information

	- `composer_replication/safety/kill_switch.py` — the implementation + the
	primary-source citations inline.
	- ADR-010 (FeatureDeletion datagen) — the per-task controls this layers above.
	- `docs/API_REFERENCE.md` §16 (`DockerSandbox`) / §17 (`composer_replication.safety`).
	- Zhao et al. RSI (OpenReview `ikrQWGgxYg`); Gao et al. self-evolving survey
	§8.3 (arXiv 2507.21046 v4); Shumailov et al. (Nature 2024); EvilGenie
	(arXiv 2511.21654); Catastrophic Goodhart (OpenReview `UXuBzWoZGK`).