[
{
"id": "rec-1",
"category": "break-paragraph",
"severity": "high",
"current": "RLVR-trained models systematically shortcut extensional verifiers, with shortcut prevalence *rising with task complexity and inference-time compute*; and monitors trained on synthetic hacks *fail to generalize* to in-the-wild hacking, so a `HackMonitor` validated on constructed examples is exactly the one likely to miss the real thing [29][30][31]. Cursor itself observed Composer 2.5 reverse-engineering a leftover type-check cache and decompiling Java bytecode to recover deleted signatures [1]. The oracle *bounds* the hack surface",
"recommended": "RLVR-trained models systematically shortcut extensional verifiers, with shortcut prevalence *rising with task complexity and inference-time compute*; and monitors trained on synthetic hacks *fail to generalize* to in-the-wild hacking, so a `HackMonitor` validated on constructed examples is exactly the one likely to miss the real thing [29][30][31]. Cursor itself observed Composer 2.5 reverse-engineering a leftover type-check cache and decompiling Java bytecode to recover deleted signatures [1].\n\nThe oracle *bounds* the hack surface",
"rationale": "Line 79 is the longest paragraph in the report (~1880 chars); breaking at the Cursor-example sentence boundary splits a dense wall into two scannable units."
},
{
"id": "rec-2",
"category": "break-paragraph",
"severity": "high",
"current": "Self-distillation in the inner loop is, in this configuration, a *stabilizer* and not only a collapse risk: SDFT shows on-policy self-distillation from demonstrations reduces catastrophic forgetting and lets a single model accumulate skills sequentially — the opposite of model collapse — and Channel-2 SDPO is exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses [36]. But the repo's own ADR-013 warns",
"recommended": "Self-distillation in the inner loop is, in this configuration, a *stabilizer* and not only a collapse risk: SDFT shows on-policy self-distillation from demonstrations reduces catastrophic forgetting and lets a single model accumulate skills sequentially — the opposite of model collapse — and Channel-2 SDPO is exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses [36].\n\nBut the repo's own ADR-013 warns",
"rationale": "Line 103 (~1670 chars) is a wall mixing the stabilizer claim, the amplification caveat, and the flywheel argument; breaking after the SDFT point isolates the first claim."
},
{
"id": "rec-3",
"category": "break-paragraph",
"severity": "high",
"current": "every working SWE flywheel optimizes a true execution oracle (Socratic-SWE +7.8 over three iters beating self-play at equal compute [10]; DeepSWE +20 Pass@1 in 200 RL steps on sparse 0/1 reward; SWE-RL 41% generalizing OOD [37]). Most collapse stories require a proxy or self-judged verifier",
"recommended": "every working SWE flywheel optimizes a true execution oracle (Socratic-SWE +7.8 over three iters beating self-play at equal compute [10]; DeepSWE +20 Pass@1 in 200 RL steps on sparse 0/1 reward; SWE-RL 41% generalizing OOD [37]).\n\nMost collapse stories require a proxy or self-judged verifier",
"rationale": "Further splits the remaining tail of the overlong line-103 paragraph at the working-flywheel / collapse-stories pivot so each half reads as one idea."
},
{
"id": "rec-4",
"category": "break-paragraph",
"severity": "high",
"current": "alignment-gated structure over naive train-on-all [55]. The content side is trainable:",
"recommended": "alignment-gated structure over naive train-on-all [55].\n\nThe content side is trainable:",
"rationale": "Line 27 (~1670 chars) packs four distinct facts into one paragraph; breaking before the trainable-content fact separates the anti-emergence evidence from the pro-training evidence."
},
{
"id": "rec-5",
"category": "break-paragraph",
"severity": "high",
"current": "branch factor × sandbox cold-start [6]. The layered posture: **gVisor",
"recommended": "branch factor × sandbox cold-start [6].\n\nThe layered posture: **gVisor",
"rationale": "Line 179 (~1616 chars) is a §8 wall; breaking before the layered-isolation discussion separates the framing sentence from the three-tier detail."
},
{
"id": "rec-6",
"category": "break-paragraph",
"severity": "high",
"current": "many small vLLM pods share a GPU [45]. One hosting fact feeds the platform choice:",
"recommended": "many small vLLM pods share a GPU [45].\n\nOne hosting fact feeds the platform choice:",
"rationale": "Splits the remaining tail of the overlong line-179 §8 paragraph at the hosting-fact pivot, separating sandbox/GPU sizing from the TRL-vs-VeRL engine choice."
},
{
"id": "rec-7",
"category": "break-paragraph",
"severity": "high",
"current": "the predicted `tool_error` kind — never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15]. The latent-motion line carries the same discipline into 2026:",
"recommended": "the predicted `tool_error` kind — never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15].\n\nThe latent-motion line carries the same discipline into 2026:",
"rationale": "Line 27 also runs the MuZero/Dreamer design discipline into the latent-motion result; breaking before the 2026 latent-motion sentence relieves the second half of this overlong paragraph."
},
{
"id": "rec-8",
"category": "break-paragraph",
"severity": "medium",
"current": "RL on the token's *placement* teaches the *governance* that is the real bottleneck [11].",
"recommended": "RL on the token's *placement* teaches the *governance* that is the real bottleneck [11].\n",
"rationale": "Line 33 (~1407 chars) runs the SDPO carrier and deliberate-token mechanisms together; inserting a break after the placement sentence separates the two mechanisms."
},
{
"id": "rec-9",
"category": "break-paragraph",
"severity": "medium",
"current": "that absence is the delta. So \"multi-model Monte-Carlo tree-of-work\" means, concretely:",
"recommended": "that absence is the delta.\n\nSo \"multi-model Monte-Carlo tree-of-work\" means, concretely:",
"rationale": "Line 19 (~1320 chars) is a §1 wall; breaking before the concrete restatement separates the repo-primitive mapping from the definitional summary."
},
{
"id": "rec-10",
"category": "break-paragraph",
"severity": "medium",
"current": "*structured/selective negatives beat both raw train-on-all and positives-only pruning.* The verdict: **train on all surviving branches, typed and routed by signal, never as raw negative policy gradient.**",
"recommended": "*structured/selective negatives beat both raw train-on-all and positives-only pruning.*\n\nThe verdict: **train on all surviving branches, typed and routed by signal, never as raw negative policy gradient.**",
"rationale": "Separates the bracketing observation from the verdict so the §4 headline verdict stands out before the numbered routing list."
},
{
"id": "rec-11",
"category": "break-paragraph",
"severity": "medium",
"current": "what makes divergence-gating mandatory [6]. The gating pays for itself:",
"recommended": "what makes divergence-gating mandatory [6].\n\nThe gating pays for itself:",
"rationale": "Line 191 (~1112 chars) is the dense Cost paragraph; breaking before the gating-savings sentence separates the cost problem from the mitigation."
},
{
"id": "rec-12",
"category": "break-paragraph",
"severity": "medium",
"current": "real for the *unguarded* version [8]. The escape is not better replay;",
"recommended": "real for the *unguarded* version [8].\n\nThe escape is not better replay;",
"rationale": "Line 17 (~1169 chars) is a §1 wall; breaking before the escape sentence separates the critique from the design response."
},
{
"id": "rec-13",
"category": "break-paragraph",
"severity": "medium",
"current": "for that turn only [1]. The frontier-variance curriculum is a homeostatic selection regulator,",
"recommended": "for that turn only [1].\n\nThe frontier-variance curriculum is a homeostatic selection regulator,",
"rationale": "Line 51 (~1316 chars) joins the mutation point and the curriculum point; breaking before the curriculum sentence separates two distinct GA-mapping claims."
},
{
"id": "rec-14",
"category": "break-paragraph",
"severity": "low",
"current": "and the trainer need *zero* changes, and `ModalSpawnExecutor` is the working existence proof [41].",
"recommended": "and the trainer need *zero* changes, and `ModalSpawnExecutor` is the working existence proof [41].\n",
"rationale": "Line 165 (~1021 chars) runs the ADR-005 framing and the AWS S3 mapping together; a trailing break after the existence-proof sentence eases the §8 lede paragraph."
},
{
"id": "rec-15",
"category": "bold-keyterms",
"severity": "high",
"current": "This upgrade from teacher-plurality to execution-oracle fitness is **the single most important change** and the one the corpus most strongly supports.",
"recommended": "This upgrade from **teacher-plurality to execution-oracle fitness** is **the single most important change** and the one the corpus most strongly supports.",
"rationale": "Bolds the load-bearing term \"teacher-plurality to execution-oracle fitness\" so a skimmer sees the report's central upgrade."
},
{
"id": "rec-16",
"category": "bold-keyterms",
"severity": "high",
"current": "A literal per-turn N-way tree is O(N^D) and economically fatal — ungated, a branching trace prices around $64 versus $0.98 flat [6].",
"recommended": "A literal per-turn N-way tree is **O(N^D)** and economically fatal — ungated, a branching trace prices around **$64 versus $0.98 flat** [6].",
"rationale": "Bolds the cost-blowup complexity and the headline price figures a skimmer needs to grasp why divergence-gating is mandatory."
},
{
"id": "rec-17",
"category": "bold-keyterms",
"severity": "high",
"current": "so collapse to a single rollout — turning O(N^D) into roughly O(N · decision-points) [6].",
"recommended": "so collapse to a single rollout — turning O(N^D) into roughly **O(N · decision-points)** [6].",
"rationale": "Bolds the target complexity after gating, the key quantitative payoff of the divergence-gated design."
},
{
"id": "rec-18",
"category": "bold-keyterms",
"severity": "high",
"current": "policy.\"** \"Prune versus train-on-all\" is a false binary.",
"recommended": "policy.\"** **\"Prune versus train-on-all\" is a false binary.**",
"rationale": "Bolds the §4 reframe conclusion so the skimmer catches the central thesis that the prune/train-on-all dichotomy is false."
},
{
"id": "rec-19",
"category": "bold-keyterms",
"severity": "medium",
"current": "and reaches 50.40% on SWE-bench Verified after three iterations [10].",
"recommended": "and reaches **50.40% on SWE-bench Verified** after three iterations [10].",
"rationale": "Bolds the headline Socratic-SWE pass-rate so the closest published analogue's result is scannable."
},
{
"id": "rec-20",
"category": "bold-keyterms",
"severity": "medium",
"current": "reaching 65.8% on SWE-bench Verified — crucially training *on all* trajectories for the world-model head",
"recommended": "reaching **65.8% on SWE-bench Verified** — crucially training *on all* trajectories for the world-model head",
"rationale": "Bolds the CWM existence-proof pass-rate, a key statistic supporting train-on-all for the world-model head."
},
{
"id": "rec-21",
"category": "bold-keyterms",
"severity": "medium",
"current": "across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes causally blind (~1e-8) *while achieving 92% lower prediction error* [18].",
"recommended": "across 2,695 networks **mean causal fidelity is 0.49** (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes **causally blind (~1e-8) *while achieving 92% lower prediction error*** [18].",
"rationale": "Bolds the two decisive predictive-causal-gap statistics that justify measuring foresight rather than next-state accuracy."
},
{
"id": "rec-22",
"category": "bold-keyterms",
"severity": "medium",
"current": "is the kill ablation: if it is ≈0, the token is a no-op and is cut",
"recommended": "is **the kill ablation**: if it is ≈0, the token is a no-op and is cut",
"rationale": "Bolds \"the kill ablation\" so the skimmer registers Foresight@k as the decisive cut criterion for the world-model head."
},
{
"id": "rec-23",
"category": "bold-keyterms",
"severity": "medium",
"current": "DeepSWE 42.2% Pass@1, 59% with test-time scaling, from pure outcome RL — stronger-teacher SFT *hurt* [43]",
"recommended": "**DeepSWE 42.2% Pass@1**, 59% with test-time scaling, from pure outcome RL — stronger-teacher SFT *hurt* [43]",
"rationale": "Bolds the DeepSWE headline figure, the incumbent baseline every later phase must beat at equal compute."
},
{
"id": "rec-24",
"category": "bold-keyterms",
"severity": "medium",
"current": "The strongest argument that this is buildable is that the substrate already exists — roughly nine-tenths of it.",
"recommended": "The strongest argument that this is buildable is that the substrate already exists — **roughly nine-tenths of it**.",
"rationale": "Bolds the reuse-fraction claim that anchors the entire §6 reuse-vs-build ledger."
},
{
"id": "rec-25",
"category": "bold-keyterms",
"severity": "medium",
"current": "whether the divergence-gated tree beats an equal-budget outcome-only GRPO baseline on long-horizon tasks — and it has never been run.",
"recommended": "whether the **divergence-gated tree beats an equal-budget outcome-only GRPO baseline on long-horizon tasks** — and **it has never been run**.",
"rationale": "Bolds the program's single most important unrun experiment so the skimmer catches the central open question of §7."
},
{
"id": "rec-26",
"category": "bold-keyterms",
"severity": "medium",
"current": "This is the single biggest architectural payoff of the object-store design on Kubernetes.",
"recommended": "This is **the single biggest architectural payoff** of the object-store design on Kubernetes.",
"rationale": "Bolds the headline architectural claim that gang scheduling is unneeded for inter-replica DiLoCo sync."
},
{
"id": "rec-27",
"category": "bold-keyterms",
"severity": "low",
"current": "First, `strip_thinking` must be `False`: ~67% of real Claude Code error-recovery turns are pure thinking, and stripping them yields empty SDPO masks that silently collapse two-thirds of the channel's supervision sites",
"recommended": "First, `strip_thinking` must be `False`: **~67% of real Claude Code error-recovery turns are pure thinking**, and stripping them yields empty SDPO masks that silently collapse two-thirds of the channel's supervision sites",
"rationale": "Bolds the 67%-thinking statistic that makes strip_thinking=False a load-bearing repo configuration fact."
},
{
"id": "rec-28",
"category": "bold-keyterms",
"severity": "low",
"current": "Calibration (ECE/Brier on the predicted-outcome head) is primary, because the documented failure is over-confidence; next-state accuracy is a secondary diagnostic.",
"recommended": "**Calibration (ECE/Brier on the predicted-outcome head) is primary**, because the documented failure is over-confidence; next-state accuracy is a secondary diagnostic.",
"rationale": "Bolds the primary-measurement decision (calibration over next-state accuracy), a load-bearing methodological choice in §2."
},
{
"id": "rec-29",
"category": "split-sentence",
"severity": "medium",
"current": "The divergence tree has a rigorous backbone: sibling A and B from a shared parent reaching different *executed* outcomes is a model-free Monte-Carlo counterfactual credit estimate, low-variance because the shared parent differences out the baseline — a group-relative/leave-one-out argument (Tree-GRPO [44]) — which the executed-sibling structure then approximates non-parametrically for the stronger, hindsight-conditioned variant that learned counterfactual-credit methods (CCA [33]) achieve with a learned hindsight model, and min-form/bottleneck-localized because the credit-bearing step is the earliest node where sibling subtrees separate [33].",
"recommended": "The divergence tree has a rigorous backbone: sibling A and B from a shared parent reaching different *executed* outcomes is a model-free Monte-Carlo counterfactual credit estimate, low-variance because the shared parent differences out the baseline — a group-relative/leave-one-out argument (Tree-GRPO [44]). The executed-sibling structure then approximates non-parametrically the stronger, hindsight-conditioned variant that learned counterfactual-credit methods (CCA [33]) achieve with a learned hindsight model, and it is min-form/bottleneck-localized because the credit-bearing step is the earliest node where sibling subtrees separate [33].",
"rationale": "This ~80-word run-on survived polish; splitting at the Tree-GRPO clause yields two readable sentences without losing any clause."
}
]