composer-replication-framework/research/readability-recommendations.json
Baladithya Balamurugan
Wave 1: fix 8 failing tests + unblock Docker E2E + dep/doc debt
c11cf49
[
{
"id": "rec-1",
"category": "break-paragraph",
"severity": "high",
"current": "RLVR-trained models systematically shortcut extensional verifiers, with shortcut prevalence *rising with task complexity and inference-time compute*; and monitors trained on synthetic hacks *fail to generalize* to in-the-wild hacking, so a `HackMonitor` validated on constructed examples is exactly the one likely to miss the real thing [29][30][31]. Cursor itself observed Composer 2.5 reverse-engineering a leftover type-check cache and decompiling Java bytecode to recover deleted signatures [1]. The oracle *bounds* the hack surface",
"recommended": "RLVR-trained models systematically shortcut extensional verifiers, with shortcut prevalence *rising with task complexity and inference-time compute*; and monitors trained on synthetic hacks *fail to generalize* to in-the-wild hacking, so a `HackMonitor` validated on constructed examples is exactly the one likely to miss the real thing [29][30][31]. Cursor itself observed Composer 2.5 reverse-engineering a leftover type-check cache and decompiling Java bytecode to recover deleted signatures [1].\n\nThe oracle *bounds* the hack surface",
"rationale": "Line 79 is the longest paragraph in the report (~1880 chars); breaking at the Cursor-example sentence boundary splits a dense wall into two scannable units."
},
{
"id": "rec-2",
"category": "break-paragraph",
"severity": "high",
"current": "Self-distillation in the inner loop is, in this configuration, a *stabilizer* and not only a collapse risk: SDFT shows on-policy self-distillation from demonstrations reduces catastrophic forgetting and lets a single model accumulate skills sequentially — the opposite of model collapse — and Channel-2 SDPO is exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses [36]. But the repo's own ADR-013 warns",
"recommended": "Self-distillation in the inner loop is, in this configuration, a *stabilizer* and not only a collapse risk: SDFT shows on-policy self-distillation from demonstrations reduces catastrophic forgetting and lets a single model accumulate skills sequentially — the opposite of model collapse — and Channel-2 SDPO is exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses [36].\n\nBut the repo's own ADR-013 warns",
"rationale": "Line 103 (~1670 chars) is a wall mixing the stabilizer claim, the amplification caveat, and the flywheel argument; breaking after the SDFT point isolates the first claim."
},
{
"id": "rec-3",
"category": "break-paragraph",
"severity": "high",
"current": "every working SWE flywheel optimizes a true execution oracle (Socratic-SWE +7.8 over three iters beating self-play at equal compute [10]; DeepSWE +20 Pass@1 in 200 RL steps on sparse 0/1 reward; SWE-RL 41% generalizing OOD [37]). Most collapse stories require a proxy or self-judged verifier",
"recommended": "every working SWE flywheel optimizes a true execution oracle (Socratic-SWE +7.8 over three iters beating self-play at equal compute [10]; DeepSWE +20 Pass@1 in 200 RL steps on sparse 0/1 reward; SWE-RL 41% generalizing OOD [37]).\n\nMost collapse stories require a proxy or self-judged verifier",
"rationale": "Further splits the remaining tail of the overlong line-103 paragraph at the working-flywheel / collapse-stories pivot so each half reads as one idea."
},
{
"id": "rec-4",
"category": "break-paragraph",
"severity": "high",
"current": "alignment-gated structure over naive train-on-all [55]. The content side is trainable:",
"recommended": "alignment-gated structure over naive train-on-all [55].\n\nThe content side is trainable:",
"rationale": "Line 27 (~1670 chars) packs four distinct facts into one paragraph; breaking before the trainable-content fact separates the anti-emergence evidence from the pro-training evidence."
},
{
"id": "rec-5",
"category": "break-paragraph",
"severity": "high",
"current": "branch factor × sandbox cold-start [6]. The layered posture: **gVisor",
"recommended": "branch factor × sandbox cold-start [6].\n\nThe layered posture: **gVisor",
"rationale": "Line 179 (~1616 chars) is a §8 wall; breaking before the layered-isolation discussion separates the framing sentence from the three-tier detail."
},
{
"id": "rec-6",
"category": "break-paragraph",
"severity": "high",
"current": "many small vLLM pods share a GPU [45]. One hosting fact feeds the platform choice:",
"recommended": "many small vLLM pods share a GPU [45].\n\nOne hosting fact feeds the platform choice:",
"rationale": "Splits the remaining tail of the overlong line-179 §8 paragraph at the hosting-fact pivot, separating sandbox/GPU sizing from the TRL-vs-VeRL engine choice."
},
{
"id": "rec-7",
"category": "break-paragraph",
"severity": "high",
"current": "the predicted `tool_error` kind — never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15]. The latent-motion line carries the same discipline into 2026:",
"recommended": "the predicted `tool_error` kind — never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15].\n\nThe latent-motion line carries the same discipline into 2026:",
"rationale": "Line 27 also runs the MuZero/Dreamer design discipline into the latent-motion result; breaking before the 2026 latent-motion sentence relieves the second half of this overlong paragraph."
},
{
"id": "rec-8",
"category": "break-paragraph",
"severity": "medium",
"current": "RL on the token's *placement* teaches the *governance* that is the real bottleneck [11].",
"recommended": "RL on the token's *placement* teaches the *governance* that is the real bottleneck [11].\n",
"rationale": "Line 33 (~1407 chars) runs the SDPO carrier and deliberate-token mechanisms together; inserting a break after the placement sentence separates the two mechanisms."
},
{
"id": "rec-9",
"category": "break-paragraph",
"severity": "medium",
"current": "that absence is the delta. So \"multi-model Monte-Carlo tree-of-work\" means, concretely:",
"recommended": "that absence is the delta.\n\nSo \"multi-model Monte-Carlo tree-of-work\" means, concretely:",
"rationale": "Line 19 (~1320 chars) is a §1 wall; breaking before the concrete restatement separates the repo-primitive mapping from the definitional summary."
},
{
"id": "rec-10",
"category": "break-paragraph",
"severity": "medium",
"current": "*structured/selective negatives beat both raw train-on-all and positives-only pruning.* The verdict: **train on all surviving branches, typed and routed by signal, never as raw negative policy gradient.**",
"recommended": "*structured/selective negatives beat both raw train-on-all and positives-only pruning.*\n\nThe verdict: **train on all surviving branches, typed and routed by signal, never as raw negative policy gradient.**",
"rationale": "Separates the bracketing observation from the verdict so the §4 headline verdict stands out before the numbered routing list."
},
{
"id": "rec-11",
"category": "break-paragraph",
"severity": "medium",
"current": "what makes divergence-gating mandatory [6]. The gating pays for itself:",
"recommended": "what makes divergence-gating mandatory [6].\n\nThe gating pays for itself:",
"rationale": "Line 191 (~1112 chars) is the dense Cost paragraph; breaking before the gating-savings sentence separates the cost problem from the mitigation."
},
{
"id": "rec-12",
"category": "break-paragraph",
"severity": "medium",
"current": "real for the *unguarded* version [8]. The escape is not better replay;",
"recommended": "real for the *unguarded* version [8].\n\nThe escape is not better replay;",
"rationale": "Line 17 (~1169 chars) is a §1 wall; breaking before the escape sentence separates the critique from the design response."
},
{
"id": "rec-13",
"category": "break-paragraph",
"severity": "medium",
"current": "for that turn only [1]. The frontier-variance curriculum is a homeostatic selection regulator,",
"recommended": "for that turn only [1].\n\nThe frontier-variance curriculum is a homeostatic selection regulator,",
"rationale": "Line 51 (~1316 chars) joins the mutation point and the curriculum point; breaking before the curriculum sentence separates two distinct GA-mapping claims."
},
{
"id": "rec-14",
"category": "break-paragraph",
"severity": "low",
"current": "and the trainer need *zero* changes, and `ModalSpawnExecutor` is the working existence proof [41].",
"recommended": "and the trainer need *zero* changes, and `ModalSpawnExecutor` is the working existence proof [41].\n",
"rationale": "Line 165 (~1021 chars) runs the ADR-005 framing and the AWS S3 mapping together; a trailing break after the existence-proof sentence eases the §8 lede paragraph."
},
{
"id": "rec-15",
"category": "bold-keyterms",
"severity": "high",
"current": "This upgrade from teacher-plurality to execution-oracle fitness is **the single most important change** and the one the corpus most strongly supports.",
"recommended": "This upgrade from **teacher-plurality to execution-oracle fitness** is **the single most important change** and the one the corpus most strongly supports.",
"rationale": "Bolds the load-bearing term \"teacher-plurality to execution-oracle fitness\" so a skimmer sees the report's central upgrade."
},
{
"id": "rec-16",
"category": "bold-keyterms",
"severity": "high",
"current": "A literal per-turn N-way tree is O(N^D) and economically fatal — ungated, a branching trace prices around $64 versus $0.98 flat [6].",
"recommended": "A literal per-turn N-way tree is **O(N^D)** and economically fatal — ungated, a branching trace prices around **$64 versus $0.98 flat** [6].",
"rationale": "Bolds the cost-blowup complexity and the headline price figures a skimmer needs to grasp why divergence-gating is mandatory."
},
{
"id": "rec-17",
"category": "bold-keyterms",
"severity": "high",
"current": "so collapse to a single rollout — turning O(N^D) into roughly O(N · decision-points) [6].",
"recommended": "so collapse to a single rollout — turning O(N^D) into roughly **O(N · decision-points)** [6].",
"rationale": "Bolds the target complexity after gating, the key quantitative payoff of the divergence-gated design."
},
{
"id": "rec-18",
"category": "bold-keyterms",
"severity": "high",
"current": "policy.\"** \"Prune versus train-on-all\" is a false binary.",
"recommended": "policy.\"** **\"Prune versus train-on-all\" is a false binary.**",
"rationale": "Bolds the §4 reframe conclusion so the skimmer catches the central thesis that the prune/train-on-all dichotomy is false."
},
{
"id": "rec-19",
"category": "bold-keyterms",
"severity": "medium",
"current": "and reaches 50.40% on SWE-bench Verified after three iterations [10].",
"recommended": "and reaches **50.40% on SWE-bench Verified** after three iterations [10].",
"rationale": "Bolds the headline Socratic-SWE pass-rate so the closest published analogue's result is scannable."
},
{
"id": "rec-20",
"category": "bold-keyterms",
"severity": "medium",
"current": "reaching 65.8% on SWE-bench Verified — crucially training *on all* trajectories for the world-model head",
"recommended": "reaching **65.8% on SWE-bench Verified** — crucially training *on all* trajectories for the world-model head",
"rationale": "Bolds the CWM existence-proof pass-rate, a key statistic supporting train-on-all for the world-model head."
},
{
"id": "rec-21",
"category": "bold-keyterms",
"severity": "medium",
"current": "across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes causally blind (~1e-8) *while achieving 92% lower prediction error* [18].",
"recommended": "across 2,695 networks **mean causal fidelity is 0.49** (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes **causally blind (~1e-8) *while achieving 92% lower prediction error*** [18].",
"rationale": "Bolds the two decisive predictive-causal-gap statistics that justify measuring foresight rather than next-state accuracy."
},
{
"id": "rec-22",
"category": "bold-keyterms",
"severity": "medium",
"current": "is the kill ablation: if it is ≈0, the token is a no-op and is cut",
"recommended": "is **the kill ablation**: if it is ≈0, the token is a no-op and is cut",
"rationale": "Bolds \"the kill ablation\" so the skimmer registers Foresight@k as the decisive cut criterion for the world-model head."
},
{
"id": "rec-23",
"category": "bold-keyterms",
"severity": "medium",
"current": "DeepSWE 42.2% Pass@1, 59% with test-time scaling, from pure outcome RL — stronger-teacher SFT *hurt* [43]",
"recommended": "**DeepSWE 42.2% Pass@1**, 59% with test-time scaling, from pure outcome RL — stronger-teacher SFT *hurt* [43]",
"rationale": "Bolds the DeepSWE headline figure, the incumbent baseline every later phase must beat at equal compute."
},
{
"id": "rec-24",
"category": "bold-keyterms",
"severity": "medium",
"current": "The strongest argument that this is buildable is that the substrate already exists — roughly nine-tenths of it.",
"recommended": "The strongest argument that this is buildable is that the substrate already exists — **roughly nine-tenths of it**.",
"rationale": "Bolds the reuse-fraction claim that anchors the entire §6 reuse-vs-build ledger."
},
{
"id": "rec-25",
"category": "bold-keyterms",
"severity": "medium",
"current": "whether the divergence-gated tree beats an equal-budget outcome-only GRPO baseline on long-horizon tasks — and it has never been run.",
"recommended": "whether the **divergence-gated tree beats an equal-budget outcome-only GRPO baseline on long-horizon tasks** — and **it has never been run**.",
"rationale": "Bolds the program's single most important unrun experiment so the skimmer catches the central open question of §7."
},
{
"id": "rec-26",
"category": "bold-keyterms",
"severity": "medium",
"current": "This is the single biggest architectural payoff of the object-store design on Kubernetes.",
"recommended": "This is **the single biggest architectural payoff** of the object-store design on Kubernetes.",
"rationale": "Bolds the headline architectural claim that gang scheduling is unneeded for inter-replica DiLoCo sync."
},
{
"id": "rec-27",
"category": "bold-keyterms",
"severity": "low",
"current": "First, `strip_thinking` must be `False`: ~67% of real Claude Code error-recovery turns are pure thinking, and stripping them yields empty SDPO masks that silently collapse two-thirds of the channel's supervision sites",
"recommended": "First, `strip_thinking` must be `False`: **~67% of real Claude Code error-recovery turns are pure thinking**, and stripping them yields empty SDPO masks that silently collapse two-thirds of the channel's supervision sites",
"rationale": "Bolds the 67%-thinking statistic that makes strip_thinking=False a load-bearing repo configuration fact."
},
{
"id": "rec-28",
"category": "bold-keyterms",
"severity": "low",
"current": "Calibration (ECE/Brier on the predicted-outcome head) is primary, because the documented failure is over-confidence; next-state accuracy is a secondary diagnostic.",
"recommended": "**Calibration (ECE/Brier on the predicted-outcome head) is primary**, because the documented failure is over-confidence; next-state accuracy is a secondary diagnostic.",
"rationale": "Bolds the primary-measurement decision (calibration over next-state accuracy), a load-bearing methodological choice in §2."
},
{
"id": "rec-29",
"category": "split-sentence",
"severity": "medium",
"current": "The divergence tree has a rigorous backbone: sibling A and B from a shared parent reaching different *executed* outcomes is a model-free Monte-Carlo counterfactual credit estimate, low-variance because the shared parent differences out the baseline — a group-relative/leave-one-out argument (Tree-GRPO [44]) — which the executed-sibling structure then approximates non-parametrically for the stronger, hindsight-conditioned variant that learned counterfactual-credit methods (CCA [33]) achieve with a learned hindsight model, and min-form/bottleneck-localized because the credit-bearing step is the earliest node where sibling subtrees separate [33].",
"recommended": "The divergence tree has a rigorous backbone: sibling A and B from a shared parent reaching different *executed* outcomes is a model-free Monte-Carlo counterfactual credit estimate, low-variance because the shared parent differences out the baseline — a group-relative/leave-one-out argument (Tree-GRPO [44]). The executed-sibling structure then approximates non-parametrically the stronger, hindsight-conditioned variant that learned counterfactual-credit methods (CCA [33]) achieve with a learned hindsight model, and it is min-form/bottleneck-localized because the credit-bearing step is the earliest node where sibling subtrees separate [33].",
"rationale": "This ~80-word run-on survived polish; splitting at the Tree-GRPO clause yields two readable sentences without losing any clause."
}
]