Baladithya Balamurugan
Wave 3: close the HIGH review findings (kill-switch wiring, HeldoutSplit, EKS entrypoint bug)
bd0c358

status: accepted
date: 2026-06-08
deciders: [Codeseys, ARIA]
builds-on: [ADR-010 (FeatureDeletion datagen - per-task controls), ADR-012 (curriculum + provenance review findings)]

# ADR-015: Held-out disjoint eval + depth/generation kill-switch (HeldOutGuard)

## Context and Problem Statement

The framework drives a **self-evolving RL flywheel**: a generator proposes tasks,
the policy is optimized against an in-loop (proxy / oracle) reward, and the loop
repeats across generations. ADR-010 gave this loop its **per-task** safety
controls: the 4-gate solvability validator, the `HackMonitor` provenance check,
and the sandbox denylist (now hardened by `DockerSandbox`, see API §16). What was
still missing is the **run-level / across-generation** control: a watcher sitting
ABOVE the per-task gates that asks, every generation, *"is the proxy reward
improving because the policy got better, or because it learned to game the
proxy?"* and HALTS the run when the answer is the latter.

The literature is unambiguous that a held-out eval + a hard stop is the
load-bearing control here, not a nice-to-have:

- **Reward hacking rises monotonically with optimization depth.** Zhao et al.,
  *"Reward Hacking in Self-Improving Code Agents"* (ICLR 2026 Workshop on RSI,
  OpenReview `ikrQWGgxYg`), show that going from 10 → 100 optimization steps
  drives the hacking rate from 26.4% → 57.8% (+31.4 points), and that 73.8% of
  KernelBench / 46.8% of ALE-Bench optimizations show **proxy gains without real
  gains**. They define **Hacking Gap = proxy gain − real gain** and label an
  optimization reward-hacking when it *"improves the public metric WITHOUT
  improving the private metric"*, which is the canonical signature a run-level
  tripwire must fire on. Because the hacking rate climbs with depth, a *one-time*
  eval is insufficient; the control has to be an **online per-generation tripwire**.
- **Closed-loop RL on self-generated data collapses.** The self-evolving-agents
  survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) **§8.3** names *"model
  collapse from closed-loop RL on static synthetic data"* and prescribes
  *"continuous monitoring … to detect long-horizon value drift."* Shumailov et
  al. (*Nature* 2024, "AI models collapse when trained on recursively generated
  data") show self-training first loses the distribution tails, then converges to
  a low-variance point estimate. The mitigation that matters here: the held-out
  eval must stay anchored to **REAL tasks that are NEVER fed back to the
  generator**, otherwise the eval drifts with the train set and the gap signal
  becomes meaningless.
- **Self-critique alone is insufficient.** The RSI paper's "retrospection"
  self-critique sometimes *increased* hacking; its conclusion: *"mitigating
  reward hacking likely requires stronger evaluations and constraints beyond
  self-critique alone."* So the control is a genuinely disjoint held-out eval +
  hard stop, not a critique hook.
- **A verifiable / held-out reward is still gameable.** EvilGenie
  (arXiv 2511.21654) found *"only minimal improvement from the use of held out
  test cases"* in isolation and that *"holdout tests have many surprising failure
  modes."* The implication is NOT "skip the held-out eval" but "do not treat it
  as sufficient on its own": layer it explicitly as **defense-in-depth** on top
  of `HackMonitor` (provenance) and the KL hard stop. Catastrophic Goodhart
  (OpenReview `UXuBzWoZGK`) likewise proves KL regularization alone does NOT
  prevent heavy-tailed reward misspecification, so the KL stop is one tripwire
  among several, never the sole control.
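
The Hacking Gap quoted in the first bullet is plain arithmetic over two metrics. The following is a minimal sketch, assuming scalar proxy (public) and real (private) scores measured before and after an optimization; the helper names `hacking_gap` and `is_reward_hack` are illustrative and are not part of the framework.

```python
def hacking_gap(proxy_before: float, proxy_after: float,
                real_before: float, real_after: float) -> float:
    """Hacking Gap = proxy gain - real gain (the definition quoted above)."""
    proxy_gain = proxy_after - proxy_before
    real_gain = real_after - real_before
    return proxy_gain - real_gain


def is_reward_hack(proxy_before: float, proxy_after: float,
                   real_before: float, real_after: float) -> bool:
    """True when the public metric improves without the private metric improving."""
    return proxy_after > proxy_before and real_after <= real_before


# Hypothetical numbers: proxy rises by 0.20 while the real score drops by 0.05,
# so the gap is roughly 0.25 and the optimization is labelled reward-hacking.
gap = hacking_gap(0.50, 0.70, 0.50, 0.45)
hacked = is_reward_hack(0.50, 0.70, 0.50, 0.45)
print(f"gap={gap:.2f} hacked={hacked}")
```

A positive gap alone does not imply hacking under this labelling: a run whose real score also improves, just more slowly than the proxy, has a positive gap but is not flagged by `is_reward_hack`.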

## Decision

Build **`composer_replication.safety`**, a pure-Python, no-torch / no-cloud,
fully CPU-testable run-level safeguard centered on the **`HeldOutGuard`**
kill-switch. The guard is **stateful** and is called once per checkpoint /
generation (`update(round_idx, in_loop_reward, heldout_score, kl_to_init=…)`),
the same cadence as `DifficultyCurriculum.update`. It tracks an exponential
moving average (EMA) of every metric, because raw single-step values are too
noisy to threshold, and returns a structured `TripwireStatus`.

### The 3 fire conditions

`HeldOutGuard.update` returns `fire=True` (alias `halt`) when **ANY** of the
following holds:

- **(a) collapse-caught-in-the-act**: the in-loop reward EMA is RISING while the
  held-out score EMA has DECLINED for `>= decline_patience` consecutive
  checkpoints (default 3, the "monotone for ≥3 checkpoints" rule). This is the
  canonical reward-hacking signature: proxy up, real down. A held-out dip during
  an in-loop dip is treated as noise (a hard batch), not hacking; the decline
  streak only grows when in-loop is *simultaneously* rising.
- **(b) KL-to-init hard stop**: the `kl_to_init` EMA exceeds `kl_hard_stop`
  (default **0.08 nats/token**) on/after `min_steps`. This condition is checked
  first because it is the cheapest unambiguous breach.
- **(c) proxy-real gap blowout**: the Hacking Gap (proxy gain − real gain since a
  run-start baseline) widens beyond `max_proxy_real_gap` (default 0.10), catching
  a fast single-generation divergence even before the full decline window
  elapses. `HeldOutGuard.proxy_real_gap()` returns exactly the RSI Hacking-Gap
  quantity.

No tripwire fires before `min_steps` (default 20), which avoids halting on
early-run warm-up noise. Once fired, the verdict is **latched**: every subsequent
`update` keeps `fire=True`, so a transient post-collapse recovery cannot silently
un-halt the run.
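
The three conditions, the warm-up window, and the latch can be sketched in a few dozen lines. This is an illustrative stand-in, not the code in `kill_switch.py`: the class names `SketchGuard` / `SketchStatus`, the EMA smoothing factor, counting `min_steps` in `update` calls, and resetting the streak whenever the "proxy up, held-out down" pattern breaks are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchStatus:
    fire: bool
    reason: Optional[str] = None
    round_idx: Optional[int] = None


class SketchGuard:
    """Illustrative stand-in for HeldOutGuard (assumed internals)."""

    def __init__(self, ema_alpha=0.3, decline_patience=3, kl_hard_stop=0.08,
                 max_proxy_real_gap=0.10, min_steps=20):
        self.alpha = ema_alpha  # assumed smoothing factor
        self.decline_patience = decline_patience
        self.kl_hard_stop = kl_hard_stop
        self.max_gap = max_proxy_real_gap
        self.min_steps = min_steps
        self.steps = 0          # number of update() calls seen so far
        self.streak = 0         # consecutive "proxy up, held-out down" checkpoints
        self.in_loop = self.heldout = self.kl = None        # EMAs
        self.base_in_loop = self.base_heldout = None        # run-start baseline
        self.latched = None

    def _ema(self, prev, x):
        return x if prev is None else self.alpha * x + (1 - self.alpha) * prev

    def proxy_real_gap(self):
        """Hacking Gap: in-loop gain minus held-out gain since the baseline."""
        if self.in_loop is None:
            return 0.0
        return (self.in_loop - self.base_in_loop) - (self.heldout - self.base_heldout)

    def update(self, round_idx, in_loop_reward, heldout_score, kl_to_init=None):
        if self.latched is not None:      # latched: a fired verdict never clears
            return self.latched
        prev_in, prev_held = self.in_loop, self.heldout
        self.in_loop = self._ema(prev_in, in_loop_reward)
        self.heldout = self._ema(prev_held, heldout_score)
        if kl_to_init is not None:
            self.kl = self._ema(self.kl, kl_to_init)
        if prev_in is None:
            self.base_in_loop, self.base_heldout = self.in_loop, self.heldout
        else:
            rising = self.in_loop > prev_in
            declining = self.heldout < prev_held
            self.streak = self.streak + 1 if (rising and declining) else 0
        self.steps += 1
        if self.steps < self.min_steps:   # warm-up: no tripwire may fire yet
            return SketchStatus(fire=False, round_idx=round_idx)
        reason = None
        if self.kl is not None and self.kl > self.kl_hard_stop:   # (b), checked first
            reason = "kl_hard_stop"
        elif self.streak >= self.decline_patience:                # (a)
            reason = "collapse_in_the_act"
        elif self.proxy_real_gap() > self.max_gap:                # (c)
            reason = "proxy_real_gap"
        if reason is None:
            return SketchStatus(fire=False, round_idx=round_idx)
        self.latched = SketchStatus(fire=True, reason=reason, round_idx=round_idx)
        return self.latched


# Hypothetical run: proxy reward climbs while the held-out score decays.
# min_steps is shortened and the gap limit relaxed to isolate condition (a).
guard = SketchGuard(min_steps=2, max_proxy_real_gap=1.0)
for i in range(10):
    status = guard.update(i, in_loop_reward=0.5 + 0.05 * i,
                          heldout_score=0.5 - 0.02 * i, kl_to_init=0.01)
    if status.fire:
        break
print(i, status.reason)  # fires at i == 3 with reason "collapse_in_the_act"
```

In this sketch the streak needs three consecutive checkpoints where both EMAs move in opposite directions, so the earliest possible fire for condition (a) is the fourth `update` call, regardless of how small `min_steps` is.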

### HeldoutSplit disjointness discipline (design-of-record)

The `heldout_score` fed to the guard MUST come from a **disjoint held-out eval
pool**: REAL tasks the generator NEVER trains on (the `HeldoutSplit`
discipline). This is the load-bearing precondition: per the self-evolving survey
§8.3 / Shumailov collapse dynamics, if the held-out set is allowed to drift with
the train set, the proxy-real gap signal degenerates and the guard becomes blind
to the exact collapse it exists to catch. The split is documented here as the
**design-of-record**; the guard consumes a scalar `heldout_score` and does not
itself partition data. The caller is responsible for keeping the split disjoint
and never feeding held-out tasks back into the generator.
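
Because the guard leaves partitioning to the caller, one way to honor the discipline is a deterministic, ID-based split plus a leak check before tasks are handed to the generator. The sketch below is an assumption about how a caller might do this; `is_heldout`, `split_tasks`, `assert_disjoint`, the 10% fraction, and the task-ID format are all hypothetical and not part of the framework.

```python
import hashlib


def is_heldout(task_id: str, heldout_fraction: float = 0.1) -> bool:
    """Assign a task to the held-out pool by hashing its ID.

    Hashing keeps the assignment stable across runs and machines, so a task
    cannot migrate between pools when the task list is reshuffled or extended.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < heldout_fraction


def split_tasks(task_ids, heldout_fraction=0.1):
    """Partition real task IDs into (train, heldout) lists."""
    train, heldout = [], []
    for tid in task_ids:
        (heldout if is_heldout(tid, heldout_fraction) else train).append(tid)
    return train, heldout


def assert_disjoint(generator_inputs, heldout):
    """Raise if any held-out task is about to be fed back to the generator."""
    leaked = set(generator_inputs) & set(heldout)
    if leaked:
        raise ValueError(f"held-out tasks leaked into the generator: {sorted(leaked)}")


task_ids = [f"task-{i:04d}" for i in range(1000)]   # hypothetical IDs
train, heldout = split_tasks(task_ids)
assert_disjoint(train, heldout)   # passes: the two pools never overlap
```

Calling `assert_disjoint` on every batch handed to the generator turns the documented discipline into a runtime check, at the cost of one set intersection per generation.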

### The 0.08 nats/token KL hard-stop default

The GRPO "healthy progression" band (Orchestra Research GRPO skill) climbs
0.02 → 0.05 → 0.08 → 0.12 nats/token over a run, with **0.08 the top of the "good
progression" band** and already inside the lower half of the code-generation
drift zone (0.05–0.15 per-token; >0.5 is "diverging too much"). So 0.08
nats/token is a sound hard-stop default. `calibrate_kl_threshold(baseline_kls, factor=3.0)`
lets a run adapt the ceiling to its own KL scale ("record baseline KL over the
first ~100 steps, set max to 3× that"), but with a **safety clamp**: calibration
may only ever TIGHTEN the stop (`min(3×baseline, current)`), never loosen it past
the documented collapse band, so a noisy / already-drifting baseline cannot raise
the ceiling above 0.08.
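
The tighten-only clamp can be illustrated as follows. This is a sketch under stated assumptions, not the framework's function: it takes "baseline" to be the mean of `baseline_kls`, and the `current` parameter and the `_sketch` suffix are hypothetical names introduced for the example.

```python
from statistics import mean


def calibrate_kl_threshold_sketch(baseline_kls, factor=3.0, current=0.08):
    """Return a KL hard stop that is never looser than `current`.

    Assumption: the baseline is the mean token-level KL (nats/token) recorded
    over the early steps of the run.
    """
    if not baseline_kls:
        raise ValueError("baseline_kls must be non-empty")
    proposed = factor * mean(baseline_kls)
    return min(proposed, current)   # safety clamp: tighten only


# A quiet baseline tightens the stop (3 x 0.01 is about 0.03).
tight = calibrate_kl_threshold_sketch([0.010, 0.012, 0.008])
# A drifting baseline would propose 3 x 0.055 = 0.165, which the clamp
# rejects in favor of the existing 0.08 ceiling.
clamped = calibrate_kl_threshold_sketch([0.05, 0.06])
print(round(tight, 3), clamped)
```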

> **UNITS GOTCHA (load-bearing).** `kl_to_init` is **token-mean KL in
> nats/token**, matching
> `composer_replication.integrations.altered_minds.kl_logging.token_mean_kl`.
> It is NOT comparable to a sequence-level / sequence-summed KL (whose healthy
> band is ~0.05–10). Passing a sequence-summed KL into the per-token hard stop
> will fire it instantly.
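
The size of the mismatch is easy to see with numbers. The sketch below uses a hypothetical 200-token completion with a uniform per-token KL; the two helper functions are illustrative and are not the framework's `token_mean_kl`.

```python
def token_mean_kl_sketch(per_token_kl):
    """Mean KL per token, in nats/token (the unit the hard stop expects)."""
    return sum(per_token_kl) / len(per_token_kl)


def sequence_sum_kl_sketch(per_token_kl):
    """KL summed over the whole sequence, in nats (the wrong unit here)."""
    return sum(per_token_kl)


KL_HARD_STOP = 0.08                     # nats/token, the default from this ADR
per_token = [0.05] * 200                # hypothetical completion, 200 tokens

mean_kl = token_mean_kl_sketch(per_token)      # about 0.05: below the stop
summed_kl = sequence_sum_kl_sketch(per_token)  # about 10: far above the stop

print(mean_kl > KL_HARD_STOP, summed_kl > KL_HARD_STOP)  # False True
```

The same policy is healthy under the token-mean reading and roughly two orders of magnitude over the limit under the summed reading, which is why the unit has to be fixed before the value reaches the guard.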

### Public surface

`composer_replication.safety` re-exports:
`HeldOutGuard`, `TripwireStatus`, `CollapseStopError`,
`kl_token_trust_filter`. The guard exposes both flag-checking
(`should_halt()` / `status.fire` / `status.halt`) and exception-based
(`raise_if_fired(status)`, which raises `CollapseStopError`) control flow, so a
trainer loop can use whichever convention it prefers. `kl_token_trust_filter` is
the per-token torchrl-style "KL-Mask" sibling (caller passes
`0.5·(log(π/π_ref))²`; returns True to mask the token). It uses the same 0.08
band and is kept torch-free.
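
The per-token mask can be sketched in pure Python. The real signature of `kl_token_trust_filter` is not shown in this ADR, so the list-in / list-out shape, the `_sketch` suffix, and the `k2_estimate` helper are assumptions; only the `0.5·(log(π/π_ref))²` input and the "True means mask" convention come from the text above.

```python
def k2_estimate(logp_policy: float, logp_ref: float) -> float:
    """Per-token KL proxy: 0.5 * (log pi - log pi_ref)^2."""
    log_ratio = logp_policy - logp_ref
    return 0.5 * log_ratio ** 2


def kl_token_trust_filter_sketch(k2_values, threshold=0.08):
    """Return one flag per token; True means the token should be masked."""
    return [k2 > threshold for k2 in k2_values]


# Hypothetical log-probabilities for three tokens under the policy and the
# reference model; the log-ratios are about 0.1, 0.3 and 0.5.
logp_policy = [-1.0, -1.0, -1.0]
logp_ref = [-1.1, -1.3, -1.5]

k2 = [k2_estimate(p, r) for p, r in zip(logp_policy, logp_ref)]
mask = kl_token_trust_filter_sketch(k2)
print(mask)  # [False, False, True]
```

Only the third token, whose estimate is about 0.125, exceeds the 0.08 band; the first two (about 0.005 and 0.045) stay in the loss.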

## Consequences

- **Positive**: the flywheel gains a run-level, online collapse tripwire that
  fires on the literature's exact reward-hacking signature (proxy-up / real-down),
  is denoised against single-step noise, and latches so a detected collapse
  cannot un-halt. It is layered defense-in-depth ON TOP OF the per-task ADR-010
  controls; neither layer is sufficient alone (per EvilGenie / Catastrophic
  Goodhart).
- **Positive**: pure-Python and CPU-testable. `kl_to_init` is a float the caller
  computes upstream, so the guard pulls no torch / cloud dependency and is unit
  testable without a model.
- **Positive**: the thresholds are calibratable and the KL stop only ever
  tightens, so the safety property (ceiling ≤ documented band) is preserved
  across calibration.
- **Negative / honest**: a held-out eval is necessary but NOT sufficient by
  itself (EvilGenie); the guard's value depends entirely on the caller honoring
  the `HeldoutSplit` disjointness discipline. The KL stop is one tripwire among
  several, not a Goodhart-proof guarantee. `entropy` / `reward_std` are tracked
  and exposed but are NOT yet hard gates (early-warning instruments only).
- **Neutral**: `HeldoutSplit` ships as a documented design-of-record discipline
  rather than an enforced data-partitioning class in this wave; the guard
  consumes the scalar held-out score the caller provides.

## Acceptance gate

- [x] `HeldOutGuard.update(...)` folds in-loop / held-out / KL (+ entropy /
  reward_std) EMAs and returns a `TripwireStatus`; fires on (a) collapse-in-the-act,
  (b) KL > 0.08 nats/token, (c) proxy-real gap blowout; no fire before `min_steps`;
  latched after first fire.
- [x] `proxy_real_gap()` returns the RSI Hacking-Gap (in-loop gain − held-out gain
  since baseline); `should_halt()` / `last_status` are idempotent query helpers;
  `raise_if_fired()` converts a fired verdict into `CollapseStopError`.
- [x] `calibrate_kl_threshold()` only ever TIGHTENS the hard stop (safety clamp);
  raises on empty input.
- [x] `kl_token_trust_filter()` per-token KL-Mask helper, torch-free.
- [x] Pure-Python, CPU-only; `composer_replication.safety.__init__` re-exports the
  public surface and references this ADR.
- [x] Documented in `docs/API_REFERENCE.md` §17.

## More Information

- `composer_replication/safety/kill_switch.py`: the implementation, with the
  primary-source citations inline.
- ADR-010 (FeatureDeletion datagen): the per-task controls this layers above.
- `docs/API_REFERENCE.md` §16 (`DockerSandbox`) / §17 (`composer_replication.safety`).
- Zhao et al. RSI (OpenReview `ikrQWGgxYg`); Gao et al. self-evolving survey
  §8.3 (arXiv 2507.21046 v4); Shumailov et al. (*Nature* 2024); EvilGenie
  (arXiv 2511.21654); Catastrophic Goodhart (OpenReview `UXuBzWoZGK`).