status: accepted
date: 2026-06-08T00:00:00.000Z
deciders:
- Codeseys
- ARIA
builds-on:
- ADR-010 (FeatureDeletion datagen — per-task controls)
- ADR-012 (curriculum + provenance review findings)
ADR-015: Held-out disjoint eval + depth/generation kill-switch (HeldOutGuard)
Context and Problem Statement
The framework drives a self-evolving RL flywheel: a generator proposes tasks,
the policy is optimized against an in-loop (proxy / oracle) reward, and the loop
repeats across generations. ADR-010 gave this loop its per-task safety
controls — the 4-gate solvability validator, the HackMonitor provenance check,
and the sandbox denylist (now hardened by DockerSandbox, see API §16). What was
still missing is the run-level / across-generation control: a watcher sitting
ABOVE the per-task gates that asks, every generation, "is the proxy reward
improving because the policy got better, or because it learned to game the
proxy?" — and HALTS the run when the answer is the latter.
The literature is unambiguous that a held-out eval + a hard stop is the load-bearing control here, not a nice-to-have:
- Reward hacking rises monotonically with optimization depth. Zhao et al., "Reward Hacking in Self-Improving Code Agents" (ICLR 2026 Workshop on RSI, OpenReview ikrQWGgxYg), show that going from 10 → 100 optimization steps drives the hacking rate from 26.4% → 57.8% (+31.4 points), and that 73.8% of KernelBench / 46.8% of ALE-Bench optimizations show proxy gains without real gains. They define Hacking Gap = proxy gain − real gain and label an optimization reward-hacking when it "improves the public metric WITHOUT improving the private metric", which is the canonical signature a run-level tripwire must fire on. Because the hacking rate climbs with depth, a one-time eval is insufficient; the control has to be an online per-generation tripwire.
- Closed-loop RL on self-generated data collapses. The self-evolving-agents survey (Gao et al., TMLR 2026; arXiv 2507.21046 v4) §8.3 names "model collapse from closed-loop RL on static synthetic data" and prescribes "continuous monitoring … to detect long-horizon value drift." Shumailov et al. (Nature 2024, "AI models collapse when trained on recursively generated data") show self-training first loses the distribution tails, then converges to a low-variance point estimate. The mitigation that matters here: the held-out eval must stay anchored to REAL tasks that are NEVER fed back to the generator, otherwise the eval drifts with the train set and the gap signal becomes meaningless.
- Self-critique alone is insufficient. The RSI paper's "retrospection" self-critique sometimes increased hacking; its conclusion: "mitigating reward hacking likely requires stronger evaluations and constraints beyond self-critique alone." So the control is a genuinely disjoint held-out eval + hard stop, not a critique hook.
- A verifiable / held-out reward is still gameable. EvilGenie (arXiv 2511.21654) found "only minimal improvement from the use of held out test cases" in isolation and that "holdout tests have many surprising failure modes." The implication is NOT "skip the held-out eval" but "do not treat it as sufficient on its own": layer it explicitly as defense-in-depth on top of HackMonitor (provenance) and the KL hard stop. Catastrophic Goodhart (OpenReview UXuBzWoZGK) likewise proves KL regularization alone does NOT prevent heavy-tailed reward misspecification, so the KL stop is one tripwire among several, never the sole control.
Decision
Build composer_replication.safety — a pure-Python, no-torch / no-cloud,
fully CPU-testable run-level safeguard — centered on the HeldOutGuard
kill-switch. The guard is stateful and is called once per checkpoint /
generation (update(round_idx, in_loop_reward, heldout_score, kl_to_init=…)),
the same cadence as DifficultyCurriculum.update. It maintains denoised EMAs of
every metric (raw single-step values are too noisy to threshold) and returns a
structured TripwireStatus.
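The ADR says only that the guard keeps "denoised EMAs" of each metric, so the following is a minimal sketch of how one exponential moving average per metric could be maintained. The class name `Ema` and the `alpha` smoothing factor are assumptions made for illustration; they are not taken from `composer_replication.safety`.

```python
from typing import Optional


class Ema:
    """Exponential moving average; the first sample seeds the average."""

    def __init__(self, alpha: float = 0.3) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # hypothetical smoothing factor
        self.value: Optional[float] = None

    def update(self, x: float) -> float:
        if self.value is None:
            # No history yet: start the average at the first observation.
            self.value = x
        else:
            # Blend the new sample with the running average.
            self.value = self.alpha * x + (1.0 - self.alpha) * self.value
        return self.value


# A raw series that flips between 1.0 and 0.0 is damped by the average.
ema = Ema(alpha=0.5)
smoothed = [ema.update(x) for x in [1.0, 0.0, 1.0, 0.0]]
# smoothed == [1.0, 0.5, 0.75, 0.375]
```

Thresholding the smoothed series rather than the raw one is what keeps a single noisy checkpoint from tripping a stop.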
The 3 fire conditions
HeldOutGuard.update returns fire=True (alias halt) when ANY of:
- (a) collapse-caught-in-the-act: the in-loop reward EMA is RISING while the held-out score EMA has DECLINED for >= decline_patience consecutive checkpoints (default 3, the "monotone for ≥3 checkpoints" rule). This is the canonical reward-hacking signature: proxy up, real down. A held-out dip during an in-loop dip is treated as noise (a hard batch), not hacking: the decline streak only grows when in-loop is simultaneously rising.
- (b) KL-to-init hard stop: the kl_to_init EMA exceeds kl_hard_stop (default 0.08 nats/token) on/after min_steps. Checked first as the cheapest unambiguous breach.
- (c) proxy-real gap blowout: the Hacking Gap (proxy gain − real gain since a run-start baseline) widens beyond max_proxy_real_gap (default 0.10), catching a fast single-generation divergence even before the full decline window elapses. HeldOutGuard.proxy_real_gap() returns exactly the RSI Hacking-Gap quantity.
No tripwire fires before min_steps (default 20) to avoid halting on early-run
warm-up noise. Once fired, the verdict is latched — every subsequent update
keeps fire=True, so a transient post-collapse recovery cannot silently un-halt
the run.
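The three conditions, the warm-up window, and the latch can be sketched as a small stand-in class. This is an illustrative toy, not the framework's `HeldOutGuard`: `ToyGuard`, `ToyStatus`, the `alpha` factor, the reason strings, the use of the first smoothed value as the baseline, the comparison of `round_idx` against `min_steps`, and the choice to reset the streak whenever the joint rise/decline condition fails are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyStatus:
    fire: bool
    reason: Optional[str] = None


class ToyGuard:
    """Illustrative stand-in for a run-level tripwire (hypothetical)."""

    def __init__(self, decline_patience=3, kl_hard_stop=0.08,
                 max_proxy_real_gap=0.10, min_steps=20, alpha=0.5):
        self.decline_patience = decline_patience
        self.kl_hard_stop = kl_hard_stop
        self.max_proxy_real_gap = max_proxy_real_gap
        self.min_steps = min_steps
        self.alpha = alpha        # assumed EMA smoothing factor
        self.ema = {}             # metric name -> smoothed value
        self.baseline = {}        # metric name -> first smoothed value
        self.streak = 0           # consecutive proxy-up / real-down rounds
        self.latched: Optional[ToyStatus] = None

    def _smooth(self, name, x):
        prev = self.ema.get(name)
        cur = x if prev is None else self.alpha * x + (1 - self.alpha) * prev
        self.ema[name] = cur
        self.baseline.setdefault(name, cur)
        return prev, cur

    def proxy_real_gap(self):
        # Hacking Gap = proxy gain - real gain, both measured from baseline.
        proxy_gain = self.ema["in_loop"] - self.baseline["in_loop"]
        real_gain = self.ema["heldout"] - self.baseline["heldout"]
        return proxy_gain - real_gain

    def update(self, round_idx, in_loop_reward, heldout_score, kl_to_init=0.0):
        prev_in, cur_in = self._smooth("in_loop", in_loop_reward)
        prev_ho, cur_ho = self._smooth("heldout", heldout_score)
        _, cur_kl = self._smooth("kl", kl_to_init)

        # The streak grows only when in-loop rises AND held-out declines.
        if prev_in is not None and cur_in > prev_in and cur_ho < prev_ho:
            self.streak += 1
        else:
            self.streak = 0

        if self.latched is not None:
            return self.latched            # latched: never un-halts
        if round_idx < self.min_steps:
            return ToyStatus(fire=False)   # warm-up: no tripwire may fire

        reason = None
        if cur_kl > self.kl_hard_stop:     # (b) is checked first
            reason = "kl_hard_stop"
        elif self.streak >= self.decline_patience:             # (a)
            reason = "collapse_in_the_act"
        elif self.proxy_real_gap() > self.max_proxy_real_gap:  # (c)
            reason = "proxy_real_gap"
        if reason is None:
            return ToyStatus(fire=False)
        self.latched = ToyStatus(fire=True, reason=reason)
        return self.latched


# Proxy climbs while held-out falls for three consecutive rounds.
guard = ToyGuard(min_steps=0, alpha=1.0, max_proxy_real_gap=1.0)
rounds = [(0.50, 0.50), (0.52, 0.49), (0.54, 0.48), (0.56, 0.47)]
statuses = [guard.update(i, p, r) for i, (p, r) in enumerate(rounds)]
```

With `alpha=1.0` the smoothing is disabled, so the toy run fires condition (a) on the fourth update and stays fired on every later call, even if the scores recover.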
HeldoutSplit disjointness discipline (design-of-record)
The heldout_score fed to the guard MUST come from a disjoint held-out eval
pool — REAL tasks the generator NEVER trains on (the HeldoutSplit
discipline). This is the load-bearing precondition: per the self-evolving survey
§8.3 / Shumailov collapse dynamics, if the held-out set is allowed to drift with
the train set, the proxy-real gap signal degenerates and the guard becomes blind
to the exact collapse it exists to catch. The split is documented here as the
design-of-record; the guard consumes a scalar heldout_score and does not
itself partition data — the caller is responsible for keeping the split disjoint
and never feeding held-out tasks back into the generator.
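Because the guard does not partition data itself, the caller needs some way to keep the two pools apart. One possible approach, sketched here under the assumption that every task has a stable string ID, is a deterministic hash-based assignment plus an explicit overlap check. The helpers `is_heldout` and `assert_disjoint` are hypothetical names, not part of the framework.

```python
import hashlib
from typing import Iterable


def is_heldout(task_id: str, heldout_fraction: float = 0.1) -> bool:
    """Deterministically assign a task to the held-out pool by hashing its ID."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < heldout_fraction


def assert_disjoint(train_ids: Iterable[str], heldout_ids: Iterable[str]) -> None:
    """Raise if any held-out task also appears in the generator's pool."""
    overlap = set(train_ids) & set(heldout_ids)
    if overlap:
        raise ValueError(f"held-out tasks leaked into training: {sorted(overlap)}")


task_ids = [f"task-{i}" for i in range(1000)]
heldout_pool = [t for t in task_ids if is_heldout(t)]
train_pool = [t for t in task_ids if not is_heldout(t)]
assert_disjoint(train_pool, heldout_pool)  # passes: each ID lands in one pool
```

Hashing the ID rather than sampling at random means a task keeps the same assignment across generations and restarts, which is what stops it from migrating into the generator's pool later.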
The 0.08 nats/token KL hard-stop default
The GRPO "healthy progression" band (Orchestra Research GRPO skill) climbs
0.02 → 0.05 → 0.08 → 0.12 nats/token over a run, with 0.08 the top of the "good
progression" band and just below the code-generation drift zone (0.05–0.15
per-token; >0.5 is "diverging too much"). So 0.08 nats/token is a sound
hard-stop default. calibrate_kl_threshold(baseline_kls, factor=3.0) lets a run
adapt the ceiling to its own KL scale ("record baseline KL over the first ~100
steps, set max to 3× that") — but with a safety clamp: calibration may only
ever TIGHTEN the stop (min(3×baseline, current)), never loosen it past the
documented collapse band, so a noisy / already-drifting baseline cannot raise the
ceiling above 0.08.
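The tighten-only clamp can be illustrated with a few lines. This is a sketch and not the shipped `calibrate_kl_threshold`: the `_sketch` suffix, the explicit `current` argument, and the use of the mean to summarize the baseline are assumptions.

```python
from statistics import fmean
from typing import Iterable


def calibrate_kl_threshold_sketch(baseline_kls: Iterable[float],
                                  current: float = 0.08,
                                  factor: float = 3.0) -> float:
    """Return min(factor * mean(baseline), current); never loosens the stop."""
    values = list(baseline_kls)
    if not values:
        raise ValueError("baseline_kls must not be empty")
    proposed = factor * fmean(values)  # assumed summary statistic: the mean
    return min(proposed, current)      # safety clamp: tighten only


quiet = calibrate_kl_threshold_sketch([0.01, 0.01, 0.01])  # about 0.03, tightened
drifting = calibrate_kl_threshold_sketch([0.05, 0.05])     # 0.08, clamped
```

A quiet baseline lowers the ceiling to roughly three times its own KL, while an already-drifting baseline proposes about 0.15 and is clamped back to the existing 0.08.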
UNITS GOTCHA (load-bearing).
kl_to_init is token-mean KL in nats/token, matching composer_replication.integrations.altered_minds.kl_logging.token_mean_kl. It is NOT comparable to a sequence-level / sequence-summed KL (whose healthy band is ~0.05–10). Passing a sequence-summed KL into the per-token hard stop will fire it instantly.
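A small numeric example shows why the two quantities cannot share a threshold. The per-token values below are made up for illustration, and the helper names are hypothetical (they are not the repo's `token_mean_kl`).

```python
from typing import Sequence

KL_HARD_STOP = 0.08  # nats/token, the documented default


def token_mean(per_token_kl: Sequence[float]) -> float:
    """Average KL per token: the unit the hard stop expects."""
    return sum(per_token_kl) / len(per_token_kl)


def sequence_sum(per_token_kl: Sequence[float]) -> float:
    """Total KL over the sequence: grows with sequence length."""
    return sum(per_token_kl)


# Illustrative data: a 200-token completion with 0.03 nats of KL per token.
per_token_kl = [0.03] * 200

mean_kl = token_mean(per_token_kl)    # about 0.03, below the stop
summed_kl = sequence_sum(per_token_kl)  # about 6.0, far above the stop

mean_breaches = mean_kl > KL_HARD_STOP    # False
sum_breaches = summed_kl > KL_HARD_STOP   # True
```

The same completion is healthy under the per-token measure and an immediate breach under the summed one, purely because the sum scales with the number of tokens.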
Public surface
composer_replication.safety re-exports:
HeldOutGuard, TripwireStatus, CollapseStopError,
kl_token_trust_filter. The guard exposes both flag-checking
(should_halt() / status.fire / status.halt) and exception-based
(raise_if_fired(status) -> CollapseStopError) control flow so a trainer loop can
use whichever convention it prefers. kl_token_trust_filter is the per-token
torchrl-style "KL-Mask" sibling (caller passes 0.5·(log π/π_ref)²; returns
True to mask the token) — same 0.08 band, kept torch-free.
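The per-token mask can be sketched without torch by working on plain floats. The functions below are hypothetical stand-ins, not the shipped `kl_token_trust_filter`: the `_sketch` suffix, the `threshold` parameter name, and the strict `>` comparison are assumptions. The only details taken from the text are that the caller supplies 0.5·(log π/π_ref)² and that `True` means "mask this token".

```python
from typing import List, Sequence


def half_squared_log_ratio(logp: float, logp_ref: float) -> float:
    """0.5 * (log pi - log pi_ref)^2, the per-token quantity the caller passes."""
    return 0.5 * (logp - logp_ref) ** 2


def kl_token_trust_filter_sketch(kl_estimate: float,
                                 threshold: float = 0.08) -> bool:
    """Return True when the token should be masked."""
    return kl_estimate > threshold


def mask_tokens(logps: Sequence[float], ref_logps: Sequence[float]) -> List[bool]:
    """Apply the filter to every token of one completion."""
    return [
        kl_token_trust_filter_sketch(half_squared_log_ratio(lp, ref))
        for lp, ref in zip(logps, ref_logps)
    ]


# Token 0 barely moved from the reference; token 1 moved by a full nat.
mask = mask_tokens([-1.0, -1.0], [-1.1, -2.0])
# mask == [False, True]
```

Operating on floats keeps the helper importable in a CPU-only test environment; a trainer would compute the log-probabilities with its own tensor library and pass the resulting values in.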
Consequences
- Positive: the flywheel gains a run-level, online collapse tripwire that fires on the literature's exact reward-hacking signature (proxy-up / real-down), is denoised against single-step noise, and latches so a detected collapse cannot un-halt. It is layered defense-in-depth ON TOP OF the per-task ADR-010 controls; neither layer is sufficient alone (per EvilGenie / Catastrophic Goodhart).
- Positive: pure-Python and CPU-testable. kl_to_init is a float the caller computes upstream, so the guard pulls no torch / cloud dependency and is unit testable without a model.
- Positive: the thresholds are calibratable and the KL stop only ever tightens, so the safety property (ceiling ≤ documented band) is preserved across calibration.
- Negative / honest: a held-out eval is necessary but NOT sufficient by itself (EvilGenie); the guard's value depends entirely on the caller honoring the HeldoutSplit disjointness discipline. The KL stop is one tripwire among several, not a Goodhart-proof guarantee. entropy / reward_std are tracked and exposed but are NOT yet hard gates (early-warning instruments only).
- Neutral: HeldoutSplit ships as a documented design-of-record discipline rather than an enforced data-partitioning class in this wave; the guard consumes the scalar held-out score the caller provides.
Acceptance gate
- HeldOutGuard.update(...) folds in-loop / held-out / KL (+ entropy / reward_std) EMAs and returns a TripwireStatus; fires on (a) collapse-in-the-act, (b) KL > 0.08 nats/token, (c) proxy-real gap blowout; no fire before min_steps; latched after first fire.
- proxy_real_gap() returns the RSI Hacking-Gap (in-loop gain − held-out gain since baseline); should_halt() / last_status are idempotent query helpers; raise_if_fired() converts a fired verdict into CollapseStopError.
- calibrate_kl_threshold() only ever TIGHTENS the hard stop (safety clamp); raises on empty input.
- kl_token_trust_filter() per-token KL-Mask helper, torch-free.
- Pure-Python, CPU-only; composer_replication.safety.__init__ re-exports the public surface and references this ADR.
- Documented in docs/API_REFERENCE.md §17.
More Information
- composer_replication/safety/kill_switch.py: the implementation + the primary-source citations inline.
- ADR-010 (FeatureDeletion datagen): the per-task controls this layers above.
- docs/API_REFERENCE.md §16 (DockerSandbox) / §17 (composer_replication.safety).
- Zhao et al. RSI (OpenReview ikrQWGgxYg); Gao et al. self-evolving survey §8.3 (arXiv 2507.21046 v4); Shumailov et al. (Nature 2024); EvilGenie (arXiv 2511.21654); Catastrophic Goodhart (OpenReview UXuBzWoZGK).