---
status: accepted
date: 2026-06-08
deciders: [Codeseys, ARIA]
builds-on: [ADR-010 (FeatureDeletion datagen — per-task controls), ADR-012 (curriculum + provenance review findings)]
---

# ADR-015: Held-out disjoint eval + depth/generation kill-switch (HeldOutGuard)

## Context and Problem Statement

The framework drives a **self-evolving RL flywheel**: a generator proposes tasks, the policy is optimized against an in-loop (proxy / oracle) reward, and the loop repeats across generations. ADR-010 gave this loop its **per-task** safety controls — the 4-gate solvability validator, the `HackMonitor` provenance check, and the sandbox denylist (now hardened by `DockerSandbox`, see API §16).

What was still missing is the **run-level / across-generation** control: a watcher sitting ABOVE the per-task gates that asks, every generation, *"is the proxy reward improving because the policy got better, or because it learned to game the proxy?"* — and HALTS the run when the answer is the latter.

The literature is unambiguous that a held-out eval + a hard stop is the load-bearing control here, not a nice-to-have:

- **Reward hacking rises monotonically with optimization depth.** Zhao et al., *"Reward Hacking in Self-Improving Code Agents"* (ICLR 2026 Workshop on RSI, OpenReview `ikrQWGgxYg`), show that going from 10 → 100 optimization steps drives the hacking rate from 26.4% → 57.8% (+31.4 points), and that 73.8% of KernelBench / 46.8% of ALE-Bench optimizations show **proxy gains without real gains**. They define **Hacking Gap = proxy gain − real gain** and label an optimization reward-hacking when it *"improves the public metric WITHOUT improving the private metric"* — the canonical signature a run-level tripwire must fire on. Because the hacking rate climbs with depth, a *one-time* eval is insufficient; the control has to be an **online per-generation tripwire**.
- **Closed-loop RL on self-generated data collapses.** The self-evolving-agents survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) **§8.3** names *"model collapse from closed-loop RL on static synthetic data"* and prescribes *"continuous monitoring … to detect long-horizon value drift."* Shumailov et al. (*Nature* 2024, "AI models collapse when trained on recursively generated data") show self-training first loses the distribution tails, then converges to a low-variance point estimate. The mitigation that matters here: the held-out eval must stay anchored to **REAL tasks that are NEVER fed back to the generator**, otherwise the eval drifts with the train set and the gap signal becomes meaningless.
- **Self-critique alone is insufficient.** The RSI paper's "retrospection" self-critique sometimes *increased* hacking; its conclusion: *"mitigating reward hacking likely requires stronger evaluations and constraints beyond self-critique alone."* So the control is a genuinely disjoint held-out eval + hard stop, not a critique hook.
- **A verifiable / held-out reward is still gameable.** EvilGenie (arXiv 2511.21654) found *"only minimal improvement from the use of held out test cases"* in isolation and that *"holdout tests have many surprising failure modes."* The implication is NOT "skip the held-out eval" but "do not treat it as sufficient on its own" — layer it explicitly as **defense-in-depth** on top of `HackMonitor` (provenance) and the KL hard stop. Catastrophic Goodhart (OpenReview `UXuBzWoZGK`) likewise proves KL regularization alone does NOT prevent heavy-tailed reward misspecification — so the KL stop is one tripwire among several, never the sole control.

## Decision

Build **`composer_replication.safety`** — a pure-Python, no-torch / no-cloud, fully CPU-testable run-level safeguard — centered on the **`HeldOutGuard`** kill-switch.
The guard is **stateful** and is called once per checkpoint / generation (`update(round_idx, in_loop_reward, heldout_score, kl_to_init=…)`), the same cadence as `DifficultyCurriculum.update`. It maintains denoised EMAs of every metric (raw single-step values are too noisy to threshold) and returns a structured `TripwireStatus`.

### The 3 fire conditions

`HeldOutGuard.update` returns `fire=True` (alias `halt`) when **ANY** of:

- **(a) collapse-caught-in-the-act** — the in-loop reward EMA is RISING while the held-out score EMA has DECLINED for `>= decline_patience` consecutive checkpoints (default 3, the "monotone for ≥3 checkpoints" rule). This is the canonical reward-hacking signature: proxy up, real down. A held-out dip during an in-loop dip is treated as noise (a hard batch), not hacking — the decline streak only grows when in-loop is *simultaneously* rising.
- **(b) KL-to-init hard stop** — the `kl_to_init` EMA exceeds `kl_hard_stop` (default **0.08 nats/token**) on/after `min_steps`. Checked first as the cheapest unambiguous breach.
- **(c) proxy-real gap blowout** — the Hacking Gap (proxy gain − real gain since a run-start baseline) widens beyond `max_proxy_real_gap` (default 0.10), catching a fast single-generation divergence even before the full decline window elapses. `HeldOutGuard.proxy_real_gap()` returns exactly the RSI Hacking-Gap quantity.

No tripwire fires before `min_steps` (default 20) to avoid halting on early-run warm-up noise. Once fired, the verdict is **latched** — every subsequent `update` keeps `fire=True`, so a transient post-collapse recovery cannot silently un-halt the run.

### HeldoutSplit disjointness discipline (design-of-record)

The `heldout_score` fed to the guard MUST come from a **disjoint held-out eval pool** — REAL tasks the generator NEVER trains on (the `HeldoutSplit` discipline).
This is the load-bearing precondition: per the self-evolving survey §8.3 / Shumailov collapse dynamics, if the held-out set is allowed to drift with the train set, the proxy-real gap signal degenerates and the guard becomes blind to the exact collapse it exists to catch. The split is documented here as the **design-of-record**; the guard consumes a scalar `heldout_score` and does not itself partition data — the caller is responsible for keeping the split disjoint and never feeding held-out tasks back into the generator.

### The 0.08 nats/token KL hard-stop default

The GRPO "healthy progression" band (Orchestra Research GRPO skill) climbs 0.02 → 0.05 → 0.08 → 0.12 nats/token over a run, with **0.08 the top of the "good progression" band** and just below the code-generation drift zone (0.05–0.15 per-token; >0.5 is "diverging too much"). So 0.08 nats/token is a sound hard-stop default.

`calibrate_kl_threshold(baseline_kls, factor=3.0)` lets a run adapt the ceiling to its own KL scale ("record baseline KL over the first ~100 steps, set max to 3× that") — but with a **safety clamp**: calibration may only ever TIGHTEN the stop (`min(3×baseline, current)`), never loosen it past the documented collapse band, so a noisy / already-drifting baseline cannot raise the ceiling above 0.08.

> **UNITS GOTCHA (load-bearing).** `kl_to_init` is **token-mean KL in
> nats/token**, matching
> `composer_replication.integrations.altered_minds.kl_logging.token_mean_kl`.
> It is NOT comparable to a sequence-level / sequence-summed KL (whose healthy
> band is ~0.05–10). Passing a sequence-summed KL into the per-token hard stop
> will fire it instantly.

### Public surface

`composer_replication.safety` re-exports: `HeldOutGuard`, `TripwireStatus`, `CollapseStopError`, `kl_token_trust_filter`.
The guard exposes both flag-checking (`should_halt()` / `status.fire` / `status.halt`) and exception-based (`raise_if_fired(status) -> CollapseStopError`) control flow so a trainer loop can use whichever convention it prefers. `kl_token_trust_filter` is the per-token torchrl-style "KL-Mask" sibling (caller passes `0.5·(log π/π_ref)²`; returns True to mask the token) — same 0.08 band, kept torch-free.

## Consequences

- **Positive**: the flywheel gains a run-level, online collapse tripwire that fires on the literature's exact reward-hacking signature (proxy-up / real-down), is denoised against single-step noise, and latches so a detected collapse cannot un-halt. It is layered defense-in-depth ON TOP OF the per-task ADR-010 controls — neither sufficient alone (per EvilGenie / Catastrophic Goodhart).
- **Positive**: pure-Python and CPU-testable — `kl_to_init` is a float the caller computes upstream, so the guard pulls no torch / cloud dependency and is unit testable without a model.
- **Positive**: the thresholds are calibratable and the KL stop only ever tightens, so the safety property (ceiling ≤ documented band) is preserved across calibration.
- **Negative / honest**: a held-out eval is necessary but NOT sufficient by itself (EvilGenie); the guard's value depends entirely on the caller honoring the `HeldoutSplit` disjointness discipline. The KL stop is one tripwire among several, not a Goodhart-proof guarantee. `entropy` / `reward_std` are tracked and exposed but are NOT yet hard gates (early-warning instruments only).
- **Neutral**: `HeldoutSplit` ships as a documented design-of-record discipline rather than an enforced data-partitioning class in this wave; the guard consumes the scalar held-out score the caller provides.
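
The three fire conditions, the `min_steps` warm-up and the latch described in the Decision section can be made concrete with a small toy. The sketch below is illustrative only and is not the code in `kill_switch.py`: `ToyGuard`, `ToyStatus`, the `alpha` EMA weight, the internal step counter (the real `update` takes a `round_idx`), and the choice to reset the decline streak on any non-matching checkpoint are all assumptions made for the example. The default thresholds are the ones this ADR documents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyStatus:
    fire: bool
    reason: Optional[str] = None  # "kl", "collapse", "gap", or None


class ToyGuard:
    """Illustrative stand-in for the three fire conditions; NOT HeldOutGuard."""

    def __init__(self, decline_patience: int = 3, kl_hard_stop: float = 0.08,
                 max_proxy_real_gap: float = 0.10, min_steps: int = 20,
                 alpha: float = 0.5) -> None:
        self.decline_patience = decline_patience
        self.kl_hard_stop = kl_hard_stop
        self.max_proxy_real_gap = max_proxy_real_gap
        self.min_steps = min_steps
        self.alpha = alpha  # assumed EMA weight on the newest value
        self.steps = 0
        self.in_loop: Optional[float] = None   # EMA of the proxy reward
        self.heldout: Optional[float] = None   # EMA of the held-out score
        self.kl: Optional[float] = None        # EMA of KL-to-init (nats/token)
        self.base_in_loop = 0.0
        self.base_heldout = 0.0
        self.decline_streak = 0
        self.latched: Optional[ToyStatus] = None

    def _ema(self, prev: Optional[float], value: float) -> float:
        if prev is None:
            return value
        return self.alpha * value + (1.0 - self.alpha) * prev

    def proxy_real_gap(self) -> float:
        """Hacking Gap: proxy gain minus real gain since the first update."""
        if self.in_loop is None or self.heldout is None:
            return 0.0
        return ((self.in_loop - self.base_in_loop)
                - (self.heldout - self.base_heldout))

    def update(self, in_loop_reward: float, heldout_score: float,
               kl_to_init: float = 0.0) -> ToyStatus:
        self.steps += 1
        prev_in, prev_held = self.in_loop, self.heldout
        self.in_loop = self._ema(prev_in, in_loop_reward)
        self.heldout = self._ema(prev_held, heldout_score)
        self.kl = self._ema(self.kl, kl_to_init)
        if prev_in is None or prev_held is None:
            # First update: record the run-start baseline.
            self.base_in_loop, self.base_heldout = self.in_loop, self.heldout
        elif self.in_loop > prev_in and self.heldout < prev_held:
            # Streak grows only when proxy rises WHILE held-out declines.
            self.decline_streak += 1
        else:
            self.decline_streak = 0  # assumed reset rule
        if self.latched is not None:
            return self.latched       # once fired, stay fired
        if self.steps < self.min_steps:
            return ToyStatus(False)   # warm-up: no tripwire may fire yet
        reason = None
        if self.kl > self.kl_hard_stop:                        # (b), checked first
            reason = "kl"
        elif self.decline_streak >= self.decline_patience:     # (a)
            reason = "collapse"
        elif self.proxy_real_gap() > self.max_proxy_real_gap:  # (c)
            reason = "gap"
        if reason is None:
            return ToyStatus(False)
        self.latched = ToyStatus(True, reason)
        return self.latched
```

Feeding the toy a rising proxy and a falling held-out score trips condition (a) once the streak reaches `decline_patience`, and later healthy-looking updates do not clear the verdict. A trainer loop would check `status.fire` after each checkpoint and stop when it is set.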

## Acceptance gate

- [x] `HeldOutGuard.update(...)` folds in-loop / held-out / KL (+ entropy / reward_std) EMAs and returns a `TripwireStatus`; fires on (a) collapse-in-the-act, (b) KL > 0.08 nats/token, (c) proxy-real gap blowout; no fire before `min_steps`; latched after first fire.
- [x] `proxy_real_gap()` returns the RSI Hacking-Gap (in-loop gain − held-out gain since baseline); `should_halt()` / `last_status` are idempotent query helpers; `raise_if_fired()` converts a fired verdict into `CollapseStopError`.
- [x] `calibrate_kl_threshold()` only ever TIGHTENS the hard stop (safety clamp); raises on empty input.
- [x] `kl_token_trust_filter()` per-token KL-Mask helper, torch-free.
- [x] Pure-Python, CPU-only; `composer_replication.safety.__init__` re-exports the public surface and references this ADR.
- [x] Documented in `docs/API_REFERENCE.md` §17.

## More Information

- `composer_replication/safety/kill_switch.py` — the implementation + the primary-source citations inline.
- ADR-010 (FeatureDeletion datagen) — the per-task controls this layers above.
- `docs/API_REFERENCE.md` §16 (`DockerSandbox`) / §17 (`composer_replication.safety`).
- Zhao et al. RSI (OpenReview `ikrQWGgxYg`); Gao et al. self-evolving survey §8.3 (arXiv 2507.21046 v4); Shumailov et al. (*Nature* 2024); EvilGenie (arXiv 2511.21654); Catastrophic Goodhart (OpenReview `UXuBzWoZGK`).
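
As a closing illustration of the tighten-only calibration clamp and the per-token KL-Mask described in the Decision section, here is a minimal torch-free sketch. The names `toy_calibrate_kl_threshold`, `toy_kl_token_trust_filter` and `k2_estimate` are hypothetical, and three details are assumptions the ADR does not specify: the baseline is summarized by its mean, the mask uses a strict `>` comparison, and the raised exception is `ValueError`. For the real behavior, see `kill_switch.py`.

```python
from statistics import fmean
from typing import List, Sequence

KL_HARD_STOP = 0.08  # nats/token, the documented default ceiling


def toy_calibrate_kl_threshold(baseline_kls: Sequence[float],
                               factor: float = 3.0,
                               current: float = KL_HARD_STOP) -> float:
    """Return min(factor * baseline, current): calibration can only tighten."""
    if not baseline_kls:
        raise ValueError("baseline_kls must be non-empty")
    candidate = factor * fmean(baseline_kls)  # assumed: mean of the baseline
    return min(candidate, current)


def k2_estimate(logp: float, logp_ref: float) -> float:
    """Per-token quantity the caller passes in: 0.5 * (log pi/pi_ref)^2."""
    return 0.5 * (logp - logp_ref) ** 2


def toy_kl_token_trust_filter(per_token_kl: Sequence[float],
                              threshold: float = KL_HARD_STOP) -> List[bool]:
    """True means 'mask this token' (its KL estimate is above the band)."""
    return [kl > threshold for kl in per_token_kl]


# A quiet baseline tightens the stop; a drifting one is clamped at 0.08.
tight = toy_calibrate_kl_threshold([0.01, 0.01, 0.01])
clamped = toy_calibrate_kl_threshold([0.05, 0.05])

# Units matter: these are per-token values, not sums over a sequence.
mask = toy_kl_token_trust_filter([k2_estimate(-1.0, -1.1),
                                  k2_estimate(-1.0, -2.0)])
```

The `min(...)` is what preserves the safety property across calibration: whatever the baseline looks like, the returned ceiling never exceeds the value passed as `current`.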