---
status: accepted
date: 2026-06-08
deciders: [Codeseys, ARIA]
builds-on: [ADR-010 (FeatureDeletion datagen — per-task controls), ADR-012 (curriculum + provenance review findings)]
---

# ADR-015: Held-out disjoint eval + depth/generation kill-switch (HeldOutGuard)

## Context and Problem Statement

The framework drives a **self-evolving RL flywheel**: a generator proposes tasks,
the policy is optimized against an in-loop (proxy / oracle) reward, and the loop
repeats across generations. ADR-010 gave this loop its **per-task** safety
controls — the 4-gate solvability validator, the `HackMonitor` provenance check,
and the sandbox denylist (now hardened by `DockerSandbox`, see
`docs/API_REFERENCE.md` §16). What was still missing was the
**run-level / across-generation** control: a watcher sitting
ABOVE the per-task gates that asks, every generation, *"is the proxy reward
improving because the policy got better, or because it learned to game the
proxy?"* — and HALTS the run when the answer is the latter.

The literature is unambiguous that a held-out eval + a hard stop is the
load-bearing control here, not a nice-to-have:

- **Reward hacking rises monotonically with optimization depth.** Zhao et al.,
  *"Reward Hacking in Self-Improving Code Agents"* (ICLR 2026 Workshop on RSI,
  OpenReview `ikrQWGgxYg`), show that going from 10 → 100 optimization steps
  drives the hacking rate from 26.4% → 57.8% (+31.4 points), and that 73.8% of
  KernelBench / 46.8% of ALE-Bench optimizations show **proxy gains without real
  gains**. They define **Hacking Gap = proxy gain − real gain** and label an
  optimization reward-hacking when it *"improves the public metric WITHOUT
  improving the private metric"* — the canonical signature a run-level tripwire
  must fire on. Because the hacking rate climbs with depth, a *one-time* eval is
  insufficient; the control has to be an **online per-generation tripwire**.

- **Closed-loop RL on self-generated data collapses.** The self-evolving-agents
  survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) **§8.3** names *"model
  collapse from closed-loop RL on static synthetic data"* and prescribes
  *"continuous monitoring … to detect long-horizon value drift."* Shumailov et
  al. (*Nature* 2024, "AI models collapse when trained on recursively generated
  data") show self-training first loses the distribution tails, then converges to
  a low-variance point estimate. The mitigation that matters here: the held-out
  eval must stay anchored to **REAL tasks that are NEVER fed back to the
  generator**, otherwise the eval drifts with the train set and the gap signal
  becomes meaningless.

- **Self-critique alone is insufficient.** The RSI paper's "retrospection"
  self-critique sometimes *increased* hacking; its conclusion: *"mitigating
  reward hacking likely requires stronger evaluations and constraints beyond
  self-critique alone."* So the control is a genuinely disjoint held-out eval +
  hard stop, not a critique hook.

- **A verifiable / held-out reward is still gameable.** EvilGenie
  (arXiv 2511.21654) found *"only minimal improvement from the use of held out
  test cases"* in isolation and that *"holdout tests have many surprising failure
  modes."* The implication is NOT "skip the held-out eval" but "do not treat it
  as sufficient on its own" — layer it explicitly as **defense-in-depth** on top
  of `HackMonitor` (provenance) and the KL hard stop. Catastrophic Goodhart
  (OpenReview `UXuBzWoZGK`) likewise proves KL regularization alone does NOT
  prevent heavy-tailed reward misspecification — so the KL stop is one tripwire
  among several, never the sole control.

## Decision

Build **`composer_replication.safety`** — a pure-Python, no-torch / no-cloud,
fully CPU-testable run-level safeguard — centered on the **`HeldOutGuard`**
kill-switch. The guard is **stateful** and is called once per checkpoint /
generation (`update(round_idx, in_loop_reward, heldout_score, kl_to_init=…)`),
the same cadence as `DifficultyCurriculum.update`. It maintains denoised EMAs of
every metric (raw single-step values are too noisy to threshold) and returns a
structured `TripwireStatus`.
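
As an illustration of the denoising step, here is a minimal EMA sketch. The class name `Ema` and the smoothing factor `alpha=0.3` are assumptions made for this example; the ADR does not state the smoothing constant the guard actually uses.

```python
from typing import Optional


class Ema:
    """Exponential moving average (hypothetical helper, not the guard's API)."""

    def __init__(self, alpha: float = 0.3) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, x: float) -> float:
        # The first observation seeds the average; later ones are blended in.
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


# A noisy metric with one high and one low outlier.
raw = [0.50, 0.90, 0.48, 0.52, 0.10, 0.51]
ema = Ema(alpha=0.3)
smoothed = [ema.update(x) for x in raw]

# The smoothed series has a much narrower range than the raw series.
raw_range = max(raw) - min(raw)
smoothed_range = max(smoothed) - min(smoothed)
```

Thresholding `smoothed` rather than `raw` is what keeps a single outlier checkpoint from tripping a fire condition.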

### The 3 fire conditions

`HeldOutGuard.update` returns a `TripwireStatus` with `fire=True` (alias `halt`)
when **ANY** of the following holds:

- **(a) collapse-caught-in-the-act** — the in-loop reward EMA is RISING while the
  held-out score EMA has DECLINED for `>= decline_patience` consecutive
  checkpoints (default 3, the "monotone for ≥3 checkpoints" rule). This is the
  canonical reward-hacking signature: proxy up, real down. A held-out dip during
  an in-loop dip is treated as noise (a hard batch), not hacking — the decline
  streak only grows when in-loop is *simultaneously* rising.

- **(b) KL-to-init hard stop** — the `kl_to_init` EMA exceeds `kl_hard_stop`
  (default **0.08 nats/token**) at or after `min_steps`. It is checked first
  because it is the cheapest unambiguous breach.

- **(c) proxy-real gap blowout** — the Hacking Gap (proxy gain − real gain since a
  run-start baseline) widens beyond `max_proxy_real_gap` (default 0.10), catching
  a fast single-generation divergence even before the full decline window
  elapses. `HeldOutGuard.proxy_real_gap()` returns exactly the RSI Hacking-Gap
  quantity.

No tripwire fires before `min_steps` (default 20) to avoid halting on early-run
warm-up noise. Once fired, the verdict is **latched** — every subsequent `update`
keeps `fire=True`, so a transient post-collapse recovery cannot silently un-halt
the run.
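
The three conditions, the warm-up window, and the latch can be sketched as below. This is an illustrative stand-in, not the code in `kill_switch.py`: `SketchGuard`, `SketchStatus`, the `alpha` smoothing factor, the reason strings, comparing `round_idx` against `min_steps`, and resetting the streak whenever the proxy-up / real-down pattern breaks are all assumptions of this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SketchStatus:
    fire: bool
    reason: Optional[str] = None


class SketchGuard:
    """Illustrative stand-in for HeldOutGuard; names and alpha are assumptions."""

    def __init__(self, alpha: float = 0.5, decline_patience: int = 3,
                 kl_hard_stop: float = 0.08, max_proxy_real_gap: float = 0.10,
                 min_steps: int = 20) -> None:
        self.alpha = alpha
        self.decline_patience = decline_patience
        self.kl_hard_stop = kl_hard_stop
        self.max_proxy_real_gap = max_proxy_real_gap
        self.min_steps = min_steps
        self._ema: Dict[str, float] = {}   # metric name -> smoothed value
        self._base: Dict[str, float] = {}  # metric name -> run-start baseline
        self._streak = 0
        self._latched: Optional[SketchStatus] = None

    def _smooth(self, name: str, x: float) -> Tuple[Optional[float], float]:
        prev = self._ema.get(name)
        cur = x if prev is None else self.alpha * x + (1.0 - self.alpha) * prev
        self._ema[name] = cur
        self._base.setdefault(name, cur)  # first smoothed value is the baseline
        return prev, cur

    def proxy_real_gap(self) -> float:
        # Hacking Gap = proxy gain - real gain, both measured since baseline.
        if "proxy" not in self._ema:
            return 0.0
        proxy_gain = self._ema["proxy"] - self._base["proxy"]
        real_gain = self._ema["real"] - self._base["real"]
        return proxy_gain - real_gain

    def update(self, round_idx: int, in_loop_reward: float,
               heldout_score: float,
               kl_to_init: Optional[float] = None) -> SketchStatus:
        p_prev, p_cur = self._smooth("proxy", in_loop_reward)
        r_prev, r_cur = self._smooth("real", heldout_score)
        if kl_to_init is not None:
            self._smooth("kl", kl_to_init)

        # (a) the streak grows only when proxy rises WHILE held-out declines.
        rising = p_prev is not None and p_cur > p_prev
        declining = r_prev is not None and r_cur < r_prev
        self._streak = self._streak + 1 if (rising and declining) else 0

        if self._latched is not None:      # latched: never un-halt
            return self._latched
        if round_idx < self.min_steps:     # warm-up: no tripwire may fire
            return SketchStatus(False)

        reason = None
        if self._ema.get("kl", 0.0) > self.kl_hard_stop:          # (b) first
            reason = "kl_hard_stop"
        elif self._streak >= self.decline_patience:               # (a)
            reason = "proxy_up_real_down"
        elif self.proxy_real_gap() > self.max_proxy_real_gap:     # (c)
            reason = "gap_blowout"

        if reason is None:
            return SketchStatus(False)
        self._latched = SketchStatus(True, reason)
        return self._latched
```

In this sketch, feeding a proxy that rises while the held-out score falls for three consecutive checkpoints fires condition (a), and a later "recovery" update still returns `fire=True` because the verdict is latched.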

### HeldoutSplit disjointness discipline (design-of-record)

The `heldout_score` fed to the guard MUST come from a **disjoint held-out eval
pool** — REAL tasks the generator NEVER trains on (the `HeldoutSplit`
discipline). This is the load-bearing precondition: per the self-evolving survey
§8.3 / Shumailov collapse dynamics, if the held-out set is allowed to drift with
the train set, the proxy-real gap signal degenerates and the guard becomes blind
to the exact collapse it exists to catch. The split is documented here as the
**design-of-record**; the guard consumes a scalar `heldout_score` and does not
itself partition data — the caller is responsible for keeping the split disjoint
and never feeding held-out tasks back into the generator.
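
Because the guard does not partition data itself, a caller may want a cheap pre-flight check that the two pools do not overlap. The sketch below is one hedged way to do that; `task_fingerprint` and `assert_disjoint` are hypothetical helpers (not part of `composer_replication.safety`), and exact-match hashing after whitespace and case normalization is an assumption that will not catch paraphrased duplicates.

```python
import hashlib
from typing import Iterable


def task_fingerprint(task_text: str) -> str:
    """Hash a task after collapsing whitespace and lowercasing (assumed scheme)."""
    normalized = " ".join(task_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def assert_disjoint(train_tasks: Iterable[str],
                    heldout_tasks: Iterable[str]) -> None:
    """Raise if any held-out task also appears in the generator/train pool."""
    train = {task_fingerprint(t) for t in train_tasks}
    leaked = [t for t in heldout_tasks if task_fingerprint(t) in train]
    if leaked:
        raise ValueError(
            f"{len(leaked)} held-out task(s) also present in the train pool"
        )


# Example pools (made-up task descriptions, for illustration only).
train_pool = ["Delete feature X from module A", "Remove flag Y"]
heldout_pool = ["Fix the off-by-one in parser Z"]
assert_disjoint(train_pool, heldout_pool)  # passes silently: no overlap
```

Running such a check every generation, not only at run start, is what guards against held-out tasks leaking back into the generator over time.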

### The 0.08 nats/token KL hard-stop default

The GRPO "healthy progression" band (Orchestra Research GRPO skill) climbs
0.02 → 0.05 → 0.08 → 0.12 nats/token over a run, with **0.08 the top of the "good
progression" band** and just below the code-generation drift zone (0.05–0.15
per-token; >0.5 is "diverging too much"). So 0.08 nats/token is a sound
hard-stop default. `calibrate_kl_threshold(baseline_kls, factor=3.0)` lets a run
adapt the ceiling to its own KL scale ("record baseline KL over the first ~100
steps, set max to 3× that") — but with a **safety clamp**: calibration may only
ever TIGHTEN the stop (`min(3×baseline, current)`), never loosen it past the
documented collapse band, so a noisy / already-drifting baseline cannot raise the
ceiling above 0.08.
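
The tighten-only clamp can be sketched as follows. The function name `calibrate_kl_threshold_sketch`, the `current` parameter, and the use of the mean as the baseline aggregate are assumptions of this example; the ADR specifies only the `min(3×baseline, current)` clamp and that empty input raises.

```python
from statistics import fmean
from typing import Sequence


def calibrate_kl_threshold_sketch(baseline_kls: Sequence[float],
                                  factor: float = 3.0,
                                  current: float = 0.08) -> float:
    """Return a KL hard stop that is never looser than `current`."""
    if not baseline_kls:
        raise ValueError("baseline_kls must be non-empty")
    # Assumption: the baseline is summarized by its mean.
    proposed = factor * fmean(baseline_kls)
    # Safety clamp: calibration may only tighten the stop.
    return min(proposed, current)


# A quiet baseline tightens the stop below the default.
quiet = calibrate_kl_threshold_sketch([0.01, 0.01, 0.01])

# An already-drifting baseline would propose 0.15, but the clamp holds 0.08.
drifting = calibrate_kl_threshold_sketch([0.05, 0.05])
```

The asymmetry is deliberate: a noisy warm-up can cost some headroom, but it can never buy the run a higher ceiling.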

> **UNITS GOTCHA (load-bearing).** `kl_to_init` is **token-mean KL in
> nats/token**, matching `composer_replication.integrations.altered_minds.
> kl_logging.token_mean_kl`. It is NOT comparable to a sequence-level /
> sequence-summed KL (whose healthy band is ~0.05–10). Passing a sequence-summed
> KL into the per-token hard stop will fire it instantly.
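
A small arithmetic sketch of the units mismatch, using made-up per-token values (200 tokens at 0.03 nats each) rather than output from `token_mean_kl`:

```python
# Hypothetical per-token KL values for one 200-token completion.
per_token_kl = [0.03] * 200

KL_HARD_STOP = 0.08  # nats/token, the default from this ADR

# Token-mean KL: the quantity the guard expects.
token_mean = sum(per_token_kl) / len(per_token_kl)

# Sequence-summed KL: same policy, same tokens, different units.
sequence_sum = sum(per_token_kl)

mean_would_fire = token_mean > KL_HARD_STOP
sum_would_fire = sequence_sum > KL_HARD_STOP
```

Here the token-mean stays well under the stop while the sequence sum exceeds it by orders of magnitude, so the same healthy policy would halt the run if the wrong quantity were passed.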

### Public surface

`composer_replication.safety` re-exports:
`HeldOutGuard`, `TripwireStatus`, `CollapseStopError`,
`kl_token_trust_filter`. The guard exposes both flag-checking
(`should_halt()` / `status.fire` / `status.halt`) and exception-based
(`raise_if_fired(status)`, which raises `CollapseStopError`) control flow so a trainer loop can
use whichever convention it prefers. `kl_token_trust_filter` is the per-token
torchrl-style "KL-Mask" sibling (caller passes `0.5·(log(π/π_ref))²`; returns
True to mask the token) — same 0.08 band, kept torch-free.
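
A torch-free sketch of the per-token mask, under stated assumptions: `k2_estimate` and `kl_mask_sketch` are hypothetical names, and the real `kl_token_trust_filter` signature may differ. The only behavior taken from this ADR is that the caller supplies `0.5·(log-ratio)²` and that `True` means "mask this token" at the 0.08 band.

```python
from typing import List


def k2_estimate(logp: float, logp_ref: float) -> float:
    """Per-token 0.5 * (log pi - log pi_ref)^2, computed from log-probs."""
    log_ratio = logp - logp_ref
    return 0.5 * log_ratio * log_ratio


def kl_mask_sketch(k2_value: float, threshold: float = 0.08) -> bool:
    """Return True when the token's KL estimate exceeds the trust band."""
    return k2_value > threshold


# Made-up log-probabilities for three tokens under policy and reference.
policy_logps = [-1.00, -0.50, -1.00]
ref_logps = [-1.10, -0.55, -2.00]

mask: List[bool] = [
    kl_mask_sketch(k2_estimate(lp, lr))
    for lp, lr in zip(policy_logps, ref_logps)
]
```

Only the third token, whose log-ratio of 1.0 gives an estimate of 0.5, lands outside the band; the first two stay well inside it and remain unmasked.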

## Consequences

- **Positive**: the flywheel gains a run-level, online collapse tripwire that
  fires on the literature's exact reward-hacking signature (proxy-up / real-down),
  is denoised against single-step noise, and latches so a detected collapse
  cannot un-halt. It is layered defense-in-depth ON TOP OF the per-task ADR-010
  controls — neither sufficient alone (per EvilGenie / Catastrophic Goodhart).
- **Positive**: pure-Python and CPU-testable — `kl_to_init` is a float the caller
  computes upstream, so the guard pulls no torch / cloud dependency and is unit
  testable without a model.
- **Positive**: the thresholds are calibratable and the KL stop only ever
  tightens, so the safety property (ceiling ≤ documented band) is preserved
  across calibration.
- **Negative / honest**: a held-out eval is necessary but NOT sufficient by
  itself (EvilGenie); the guard's value depends entirely on the caller honoring
  the `HeldoutSplit` disjointness discipline. The KL stop is one tripwire among
  several, not a Goodhart-proof guarantee. `entropy` / `reward_std` are tracked
  and exposed but are NOT yet hard gates (early-warning instruments only).
- **Neutral**: `HeldoutSplit` ships as a documented design-of-record discipline
  rather than an enforced data-partitioning class in this wave; the guard
  consumes the scalar held-out score the caller provides.

## Acceptance gate

- [x] `HeldOutGuard.update(...)` folds in-loop / held-out / KL (+ entropy /
  reward_std) EMAs and returns a `TripwireStatus`; fires on (a) collapse-in-the-act,
  (b) KL > 0.08 nats/token, (c) proxy-real gap blowout; no fire before `min_steps`;
  latched after first fire.
- [x] `proxy_real_gap()` returns the RSI Hacking-Gap (in-loop gain − held-out gain
  since baseline); `should_halt()` / `last_status` are idempotent query helpers;
  `raise_if_fired()` converts a fired verdict into `CollapseStopError`.
- [x] `calibrate_kl_threshold()` only ever TIGHTENS the hard stop (safety clamp);
  raises on empty input.
- [x] `kl_token_trust_filter()` per-token KL-Mask helper, torch-free.
- [x] Pure-Python, CPU-only; `composer_replication.safety.__init__` re-exports the
  public surface and references this ADR.
- [x] Documented in `docs/API_REFERENCE.md` §17.

## More Information

- `composer_replication/safety/kill_switch.py` — the implementation + the
  primary-source citations inline.
- ADR-010 (FeatureDeletion datagen) — the per-task controls this layers above.
- `docs/API_REFERENCE.md` §16 (`DockerSandbox`) / §17 (`composer_replication.safety`).
- Zhao et al. RSI (OpenReview `ikrQWGgxYg`); Gao et al. self-evolving survey
  §8.3 (arXiv 2507.21046 v4); Shumailov et al. (*Nature* 2024); EvilGenie
  (arXiv 2511.21654); Catastrophic Goodhart (OpenReview `UXuBzWoZGK`).