
Baladithya Balamurugan


status: accepted
date: 2026-06-08
deciders:
  - Codeseys
  - ARIA
builds-on:
  - ADR-010 (FeatureDeletion datagen — per-task controls)
  - ADR-012 (curriculum + provenance review findings)

ADR-015: Held-out disjoint eval + depth/generation kill-switch (HeldOutGuard)

Context and Problem Statement

The framework drives a self-evolving RL flywheel: a generator proposes tasks, the policy is optimized against an in-loop (proxy / oracle) reward, and the loop repeats across generations. ADR-010 gave this loop its per-task safety controls — the 4-gate solvability validator, the HackMonitor provenance check, and the sandbox denylist (now hardened by DockerSandbox, see API §16). What was still missing is the run-level / across-generation control: a watcher sitting ABOVE the per-task gates that asks, every generation, "is the proxy reward improving because the policy got better, or because it learned to game the proxy?" — and HALTS the run when the answer is the latter.

The literature is unambiguous that a held-out eval + a hard stop is the load-bearing control here, not a nice-to-have:

Reward hacking rises monotonically with optimization depth. Zhao et al., "Reward Hacking in Self-Improving Code Agents" (ICLR 2026 Workshop on RSI, OpenReview ikrQWGgxYg), show that going from 10 → 100 optimization steps drives the hacking rate from 26.4% → 57.8% (+31.4 points), and that 73.8% of KernelBench / 46.8% of ALE-Bench optimizations show proxy gains without real gains. They define Hacking Gap = proxy gain − real gain and label an optimization reward-hacking when it "improves the public metric WITHOUT improving the private metric" — the canonical signature a run-level tripwire must fire on. Because the hacking rate climbs with depth, a one-time eval is insufficient; the control has to be an online per-generation tripwire.
Closed-loop RL on self-generated data collapses. The self-evolving-agents survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) §8.3 names "model collapse from closed-loop RL on static synthetic data" and prescribes "continuous monitoring … to detect long-horizon value drift." Shumailov et al. (Nature 2024, "AI models collapse when trained on recursively generated data") show self-training first loses the distribution tails, then converges to a low-variance point estimate. The mitigation that matters here: the held-out eval must stay anchored to REAL tasks that are NEVER fed back to the generator, otherwise the eval drifts with the train set and the gap signal becomes meaningless.
Self-critique alone is insufficient. The RSI paper's "retrospection" self-critique sometimes increased hacking; its conclusion: "mitigating reward hacking likely requires stronger evaluations and constraints beyond self-critique alone." So the control is a genuinely disjoint held-out eval + hard stop, not a critique hook.
A verifiable / held-out reward is still gameable. EvilGenie (arXiv 2511.21654) found "only minimal improvement from the use of held out test cases" in isolation and that "holdout tests have many surprising failure modes." The implication is NOT "skip the held-out eval" but "do not treat it as sufficient on its own" — layer it explicitly as defense-in-depth on top of HackMonitor (provenance) and the KL hard stop. Catastrophic Goodhart (OpenReview UXuBzWoZGK) likewise proves KL regularization alone does NOT prevent heavy-tailed reward misspecification — so the KL stop is one tripwire among several, never the sole control.

Decision

Build composer_replication.safety — a pure-Python, no-torch / no-cloud, fully CPU-testable run-level safeguard — centered on the HeldOutGuard kill-switch. The guard is stateful and is called once per checkpoint / generation (update(round_idx, in_loop_reward, heldout_score, kl_to_init=…)), the same cadence as DifficultyCurriculum.update. It maintains denoised EMAs of every metric (raw single-step values are too noisy to threshold) and returns a structured TripwireStatus.

The 3 fire conditions

HeldOutGuard.update returns fire=True (alias halt) when ANY of:

(a) collapse-caught-in-the-act — the in-loop reward EMA is RISING while the held-out score EMA has DECLINED for >= decline_patience consecutive checkpoints (default 3, the "monotone for ≥3 checkpoints" rule). This is the canonical reward-hacking signature: proxy up, real down. A held-out dip during an in-loop dip is treated as noise (a hard batch), not hacking — the decline streak only grows when in-loop is simultaneously rising.
(b) KL-to-init hard stop — the kl_to_init EMA exceeds kl_hard_stop (default 0.08 nats/token) on/after min_steps. Checked first as the cheapest unambiguous breach.
(c) proxy-real gap blowout — the Hacking Gap (proxy gain − real gain since a run-start baseline) widens beyond max_proxy_real_gap (default 0.10), catching a fast single-generation divergence even before the full decline window elapses. HeldOutGuard.proxy_real_gap() returns exactly the RSI Hacking-Gap quantity.

No tripwire fires before min_steps (default 20) to avoid halting on early-run warm-up noise. Once fired, the verdict is latched — every subsequent update keeps fire=True, so a transient post-collapse recovery cannot silently un-halt the run.
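The fire conditions, the min_steps warm-up, and the latch can be sketched together as follows. This is an illustrative re-implementation under stated assumptions, not the code in kill_switch.py: the names GuardSketch and StatusSketch, the reason strings, the EMA smoothing factor alpha, using the first EMA value as the run-start baseline, and resetting the streak on any non-hacking checkpoint are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatusSketch:
    """Hypothetical stand-in for TripwireStatus (field names are assumptions)."""
    fire: bool
    reason: Optional[str] = None


class GuardSketch:
    """Illustrative sketch of the three tripwires; not the real HeldOutGuard."""

    def __init__(self, alpha=0.3, decline_patience=3, kl_hard_stop=0.08,
                 max_proxy_real_gap=0.10, min_steps=20):
        self.alpha = alpha  # EMA smoothing factor (assumed; the ADR gives no value)
        self.decline_patience = decline_patience
        self.kl_hard_stop = kl_hard_stop
        self.max_proxy_real_gap = max_proxy_real_gap
        self.min_steps = min_steps
        self.steps = 0
        self.ema = {}        # current EMA per metric
        self.baseline = {}   # first EMA value per metric (assumed run-start baseline)
        self.decline_streak = 0
        self.latched = None  # first fired status, returned forever after

    def _fold(self, name, value):
        """Fold one observation into its EMA; return (previous, current)."""
        prev = self.ema.get(name)
        cur = value if prev is None else self.alpha * value + (1 - self.alpha) * prev
        self.ema[name] = cur
        self.baseline.setdefault(name, cur)
        return prev, cur

    def proxy_real_gap(self):
        """Hacking Gap: proxy gain minus real gain since the baseline."""
        proxy_gain = self.ema["in_loop"] - self.baseline["in_loop"]
        real_gain = self.ema["heldout"] - self.baseline["heldout"]
        return proxy_gain - real_gain

    def update(self, round_idx, in_loop_reward, heldout_score, kl_to_init=None):
        # round_idx is accepted for signature parity but unused in this sketch.
        self.steps += 1
        prev_in, cur_in = self._fold("in_loop", in_loop_reward)
        prev_ho, cur_ho = self._fold("heldout", heldout_score)
        kl = self._fold("kl", kl_to_init)[1] if kl_to_init is not None else None

        # The streak grows only when the proxy rises while held-out falls.
        if prev_in is not None and cur_in > prev_in and cur_ho < prev_ho:
            self.decline_streak += 1
        else:
            self.decline_streak = 0  # assumed reset rule

        if self.latched is not None:
            return self.latched  # latched: a later recovery cannot un-halt
        if self.steps < self.min_steps:
            return StatusSketch(fire=False)  # warm-up: no tripwire may fire

        reason = None
        if kl is not None and kl > self.kl_hard_stop:  # (b), checked first
            reason = "kl_hard_stop"
        elif self.decline_streak >= self.decline_patience:  # (a)
            reason = "collapse_in_the_act"
        elif self.proxy_real_gap() > self.max_proxy_real_gap:  # (c)
            reason = "proxy_real_gap"
        if reason is not None:
            self.latched = StatusSketch(fire=True, reason=reason)
            return self.latched
        return StatusSketch(fire=False)
```

In this sketch the EMAs keep updating after a fire, but the returned verdict never changes, which is the property the latch is meant to guarantee.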

HeldoutSplit disjointness discipline (design-of-record)

The heldout_score fed to the guard MUST come from a disjoint held-out eval pool — REAL tasks the generator NEVER trains on (the HeldoutSplit discipline). This is the load-bearing precondition: per the self-evolving survey §8.3 / Shumailov collapse dynamics, if the held-out set is allowed to drift with the train set, the proxy-real gap signal degenerates and the guard becomes blind to the exact collapse it exists to catch. The split is documented here as the design-of-record; the guard consumes a scalar heldout_score and does not itself partition data — the caller is responsible for keeping the split disjoint and never feeding held-out tasks back into the generator.
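Because the guard does not partition data itself, the caller needs some way to keep the pools apart. One possible approach, shown here as a hedged sketch, is a deterministic hash-bucket assignment plus an explicit leak check before each generation. The helper names assign_split and assert_disjoint, the heldout_fraction parameter, and the use of string task IDs are hypothetical; none of them are part of composer_replication.

```python
import hashlib


def assign_split(task_id: str, heldout_fraction: float = 0.1) -> str:
    """Deterministically map a task ID to 'heldout' or 'train' via a hash bucket."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return "heldout" if bucket < heldout_fraction * 2**64 else "train"


def assert_disjoint(train_ids, heldout_ids) -> None:
    """Raise if any held-out task appears in the generator's training pool."""
    leaked = set(train_ids) & set(heldout_ids)
    if leaked:
        raise ValueError(f"held-out tasks leaked into training: {sorted(leaked)}")
```

A hash-based assignment keeps a task in the same pool across runs without storing a split file, and the leak check turns a silent discipline violation into a hard failure.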

The 0.08 nats/token KL hard-stop default

The GRPO "healthy progression" band (Orchestra Research GRPO skill) climbs 0.02 → 0.05 → 0.08 → 0.12 nats/token over a run, with 0.08 the top of the "good progression" band. The same source places the code-generation drift zone at 0.05–0.15 per-token and treats >0.5 as "diverging too much", so 0.08 sits inside the drift zone but well short of divergence, which makes it a sound hard-stop default. calibrate_kl_threshold(baseline_kls, factor=3.0) lets a run adapt the ceiling to its own KL scale ("record baseline KL over the first ~100 steps, set max to 3× that"), but with a safety clamp: calibration may only ever TIGHTEN the stop (min(3×baseline, current)), never loosen it past the documented collapse band, so a noisy / already-drifting baseline cannot raise the ceiling above 0.08.
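The tighten-only clamp can be illustrated with a short sketch. The function name calibrate_kl_threshold_sketch, the current parameter, the use of the mean as the "baseline" summary, and ValueError as the error type are assumptions; the ADR only specifies the min(3×baseline, current) rule and that empty input raises.

```python
from statistics import fmean


def calibrate_kl_threshold_sketch(baseline_kls, factor=3.0, current=0.08):
    """Return a KL ceiling that is never looser than the current one."""
    values = list(baseline_kls)
    if not values:
        raise ValueError("baseline_kls must be non-empty")
    proposed = factor * fmean(values)  # e.g. 3x the mean early-run KL
    return min(proposed, current)      # safety clamp: tighten only
```

With a quiet baseline of 0.01 nats/token the ceiling tightens to about 0.03; with an already-drifting baseline of 0.05 the proposed 0.15 is clamped back to 0.08.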

UNITS GOTCHA (load-bearing). kl_to_init is token-mean KL in nats/token, matching composer_replication.integrations.altered_minds.kl_logging.token_mean_kl. It is NOT comparable to a sequence-level / sequence-summed KL (whose healthy band is ~0.05–10). Passing a sequence-summed KL into the per-token hard stop will fire it instantly.
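A small worked example of the unit mismatch, assuming the sequence-level value is the plain sum of per-token KL terms over the response tokens. The helper name token_mean_from_sequence_sum and the example numbers are illustrative only.

```python
KL_HARD_STOP = 0.08  # nats/token, the documented default


def token_mean_from_sequence_sum(sequence_kl_sum: float, num_tokens: int) -> float:
    """Convert a sequence-summed KL into a token-mean KL (nats/token)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return sequence_kl_sum / num_tokens


per_token_kl = 0.02   # a healthy early-run value
num_tokens = 512      # response length
sequence_sum = per_token_kl * num_tokens  # about 10.24 nats per sequence

# The summed value is far above the per-token ceiling even though the run is healthy.
wrong_units_would_fire = sequence_sum > KL_HARD_STOP
right_units_would_fire = token_mean_from_sequence_sum(sequence_sum, num_tokens) > KL_HARD_STOP
```

Here the same healthy run would trip the hard stop if the summed value were passed in, and would not once the value is normalized per token.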

Public surface

composer_replication.safety re-exports: HeldOutGuard, TripwireStatus, CollapseStopError, kl_token_trust_filter. The guard exposes both flag-checking (should_halt() / status.fire / status.halt) and exception-based (raise_if_fired(status) -> CollapseStopError) control flow so a trainer loop can use whichever convention it prefers. kl_token_trust_filter is the per-token torchrl-style "KL-Mask" sibling (caller passes 0.5·(log π/π_ref)²; returns True to mask the token) — same 0.08 band, kept torch-free.
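The two control-flow conventions and the per-token mask can be sketched with stand-ins. Everything below is hypothetical scaffolding for illustration: StatusSketch, CollapseStopSketch, raise_if_fired_sketch, first_fired_index, and kl_mask_sketch are not the real exports, and the strict greater-than comparison in the mask is an assumption.

```python
from dataclasses import dataclass


class CollapseStopSketch(RuntimeError):
    """Hypothetical stand-in for CollapseStopError."""


@dataclass
class StatusSketch:
    """Hypothetical stand-in for TripwireStatus."""
    fire: bool
    reason: str = ""


def raise_if_fired_sketch(status: StatusSketch) -> None:
    """Exception convention: turn a fired verdict into an exception."""
    if status.fire:
        raise CollapseStopSketch(f"run halted: {status.reason}")


def first_fired_index(statuses):
    """Flag convention: scan per-checkpoint verdicts and stop at the first fire."""
    for i, status in enumerate(statuses):
        if status.fire:
            return i
    return None


def kl_mask_sketch(half_sq_log_ratio: float, threshold: float = 0.08) -> bool:
    """Return True to mask a token whose 0.5 * (log ratio)^2 exceeds the band."""
    return half_sq_log_ratio > threshold


# Caller computes the per-token quantity upstream from two log-probabilities.
logp, logp_ref = -1.0, -1.5
masked = kl_mask_sketch(0.5 * (logp - logp_ref) ** 2)  # 0.125 exceeds 0.08
```

The flag form suits loops that need to checkpoint and clean up before exiting; the exception form suits trainers that already unwind through a top-level handler.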

Consequences

Positive: the flywheel gains a run-level, online collapse tripwire that fires on the literature's exact reward-hacking signature (proxy-up / real-down), is denoised against single-step noise, and latches so a detected collapse cannot un-halt. It is layered defense-in-depth ON TOP OF the per-task ADR-010 controls — neither sufficient alone (per EvilGenie / Catastrophic Goodhart).
Positive: pure-Python and CPU-testable — kl_to_init is a float the caller computes upstream, so the guard pulls no torch / cloud dependency and is unit testable without a model.
Positive: the thresholds are calibratable and the KL stop only ever tightens, so the safety property (ceiling ≤ documented band) is preserved across calibration.
Negative / honest: a held-out eval is necessary but NOT sufficient by itself (EvilGenie); the guard's value depends entirely on the caller honoring the HeldoutSplit disjointness discipline. The KL stop is one tripwire among several, not a Goodhart-proof guarantee. entropy / reward_std are tracked and exposed but are NOT yet hard gates (early-warning instruments only).
Neutral: HeldoutSplit ships as a documented design-of-record discipline rather than an enforced data-partitioning class in this wave; the guard consumes the scalar held-out score the caller provides.

Acceptance gate

HeldOutGuard.update(...) folds in-loop / held-out / KL (+ entropy / reward_std) EMAs and returns a TripwireStatus; fires on (a) collapse-in-the-act, (b) KL > 0.08 nats/token, (c) proxy-real gap blowout; no fire before min_steps; latched after first fire.
proxy_real_gap() returns the RSI Hacking-Gap (in-loop gain − held-out gain since baseline); should_halt() / last_status are idempotent query helpers; raise_if_fired() converts a fired verdict into CollapseStopError.
calibrate_kl_threshold() only ever TIGHTENS the hard stop (safety clamp); raises on empty input.
kl_token_trust_filter() per-token KL-Mask helper, torch-free.
Pure-Python, CPU-only; composer_replication.safety.__init__ re-exports the public surface and references this ADR.
Documented in docs/API_REFERENCE.md §17.

More Information

composer_replication/safety/kill_switch.py — the implementation + the primary-source citations inline.
ADR-010 (FeatureDeletion datagen) — the per-task controls this layers above.
docs/API_REFERENCE.md §16 (DockerSandbox) / §17 (composer_replication.safety).
Zhao et al. RSI (OpenReview ikrQWGgxYg); Gao et al. self-evolving survey §8.3 (arXiv 2507.21046 v4); Shumailov et al. (Nature 2024); EvilGenie (arXiv 2511.21654); Catastrophic Goodhart (OpenReview UXuBzWoZGK).