composer-replication-framework / research /readability-recommendations.json

Baladithya Balamurugan

	[
	{
	"id": "rec-1",
	"category": "break-paragraph",
	"severity": "high",
	"current": "RLVR-trained models systematically shortcut extensional verifiers, with shortcut prevalence rising with task complexity and inference-time compute; and monitors trained on synthetic hacks fail to generalize to in-the-wild hacking, so a `HackMonitor` validated on constructed examples is exactly the one likely to miss the real thing [29][30][31]. Cursor itself observed Composer 2.5 reverse-engineering a leftover type-check cache and decompiling Java bytecode to recover deleted signatures [1]. The oracle bounds the hack surface",
	"recommended": "RLVR-trained models systematically shortcut extensional verifiers, with shortcut prevalence rising with task complexity and inference-time compute; and monitors trained on synthetic hacks fail to generalize to in-the-wild hacking, so a `HackMonitor` validated on constructed examples is exactly the one likely to miss the real thing [29][30][31]. Cursor itself observed Composer 2.5 reverse-engineering a leftover type-check cache and decompiling Java bytecode to recover deleted signatures [1].\n\nThe oracle bounds the hack surface",
	"rationale": "Line 79 is the longest paragraph in the report (~1880 chars); breaking at the Cursor-example sentence boundary splits a dense wall into two scannable units."
	},
	{
	"id": "rec-2",
	"category": "break-paragraph",
	"severity": "high",
	"current": "Self-distillation in the inner loop is, in this configuration, a stabilizer and not only a collapse risk: SDFT shows on-policy self-distillation from demonstrations reduces catastrophic forgetting and lets a single model accumulate skills sequentially — the opposite of model collapse — and Channel-2 SDPO is exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses [36]. But the repo's own ADR-013 warns",
	"recommended": "Self-distillation in the inner loop is, in this configuration, a stabilizer and not only a collapse risk: SDFT shows on-policy self-distillation from demonstrations reduces catastrophic forgetting and lets a single model accumulate skills sequentially — the opposite of model collapse — and Channel-2 SDPO is exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses [36].\n\nBut the repo's own ADR-013 warns",
	"rationale": "Line 103 (~1670 chars) is a wall mixing the stabilizer claim, the amplification caveat, and the flywheel argument; breaking after the SDFT point isolates the first claim."
	},
	{
	"id": "rec-3",
	"category": "break-paragraph",
	"severity": "high",
	"current": "every working SWE flywheel optimizes a true execution oracle (Socratic-SWE +7.8 over three iters beating self-play at equal compute [10]; DeepSWE +20 Pass@1 in 200 RL steps on sparse 0/1 reward; SWE-RL 41% generalizing OOD [37]). Most collapse stories require a proxy or self-judged verifier",
	"recommended": "every working SWE flywheel optimizes a true execution oracle (Socratic-SWE +7.8 over three iters beating self-play at equal compute [10]; DeepSWE +20 Pass@1 in 200 RL steps on sparse 0/1 reward; SWE-RL 41% generalizing OOD [37]).\n\nMost collapse stories require a proxy or self-judged verifier",
	"rationale": "Further splits the remaining tail of the overlong line-103 paragraph at the working-flywheel / collapse-stories pivot so each half reads as one idea."
	},
	{
	"id": "rec-4",
	"category": "break-paragraph",
	"severity": "high",
	"current": "alignment-gated structure over naive train-on-all [55]. The content side is trainable:",
	"recommended": "alignment-gated structure over naive train-on-all [55].\n\nThe content side is trainable:",
	"rationale": "Line 27 (~1670 chars) packs four distinct facts into one paragraph; breaking before the trainable-content fact separates the anti-emergence evidence from the pro-training evidence."
	},
	{
	"id": "rec-5",
	"category": "break-paragraph",
	"severity": "high",
	"current": "branch factor × sandbox cold-start [6]. The layered posture: **gVisor",
	"recommended": "branch factor × sandbox cold-start [6].\n\nThe layered posture: **gVisor",
	"rationale": "Line 179 (~1616 chars) is a §8 wall; breaking before the layered-isolation discussion separates the framing sentence from the three-tier detail."
	},
	{
	"id": "rec-6",
	"category": "break-paragraph",
	"severity": "high",
	"current": "many small vLLM pods share a GPU [45]. One hosting fact feeds the platform choice:",
	"recommended": "many small vLLM pods share a GPU [45].\n\nOne hosting fact feeds the platform choice:",
	"rationale": "Splits the remaining tail of the overlong line-179 §8 paragraph at the hosting-fact pivot, separating sandbox/GPU sizing from the TRL-vs-VeRL engine choice."
	},
	{
	"id": "rec-7",
	"category": "break-paragraph",
	"severity": "high",
	"current": "the predicted `tool_error` kind — never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15]. The latent-motion line carries the same discipline into 2026:",
	"recommended": "the predicted `tool_error` kind — never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15].\n\nThe latent-motion line carries the same discipline into 2026:",
	"rationale": "Line 27 also runs the MuZero/Dreamer design discipline into the latent-motion result; breaking before the 2026 latent-motion sentence relieves the second half of this overlong paragraph."
	},
	{
	"id": "rec-8",
	"category": "break-paragraph",
	"severity": "medium",
	"current": "RL on the token's placement teaches the governance that is the real bottleneck [11].",
	"recommended": "RL on the token's placement teaches the governance that is the real bottleneck [11].\n\n",
	"rationale": "Line 33 (~1407 chars) runs the SDPO carrier and deliberate-token mechanisms together; inserting a break after the placement sentence separates the two mechanisms."
	},
	{
	"id": "rec-9",
	"category": "break-paragraph",
	"severity": "medium",
	"current": "that absence is the delta. So \"multi-model Monte-Carlo tree-of-work\" means, concretely:",
	"recommended": "that absence is the delta.\n\nSo \"multi-model Monte-Carlo tree-of-work\" means, concretely:",
	"rationale": "Line 19 (~1320 chars) is a §1 wall; breaking before the concrete restatement separates the repo-primitive mapping from the definitional summary."
	},
	{
	"id": "rec-10",
	"category": "break-paragraph",
	"severity": "medium",
	"current": "structured/selective negatives beat both raw train-on-all and positives-only pruning. The verdict: train on all surviving branches, typed and routed by signal, never as raw negative policy gradient.",
	"recommended": "structured/selective negatives beat both raw train-on-all and positives-only pruning.\n\nThe verdict: train on all surviving branches, typed and routed by signal, never as raw negative policy gradient.",
	"rationale": "Separates the bracketing observation from the verdict so the §4 headline verdict stands out before the numbered routing list."
	},
	{
	"id": "rec-11",
	"category": "break-paragraph",
	"severity": "medium",
	"current": "what makes divergence-gating mandatory [6]. The gating pays for itself:",
	"recommended": "what makes divergence-gating mandatory [6].\n\nThe gating pays for itself:",
	"rationale": "Line 191 (~1112 chars) is the dense Cost paragraph; breaking before the gating-savings sentence separates the cost problem from the mitigation."
	},
	{
	"id": "rec-12",
	"category": "break-paragraph",
	"severity": "medium",
	"current": "real for the unguarded version [8]. The escape is not better replay;",
	"recommended": "real for the unguarded version [8].\n\nThe escape is not better replay;",
	"rationale": "Line 17 (~1169 chars) is a §1 wall; breaking before the escape sentence separates the critique from the design response."
	},
	{
	"id": "rec-13",
	"category": "break-paragraph",
	"severity": "medium",
	"current": "for that turn only [1]. The frontier-variance curriculum is a homeostatic selection regulator,",
	"recommended": "for that turn only [1].\n\nThe frontier-variance curriculum is a homeostatic selection regulator,",
	"rationale": "Line 51 (~1316 chars) joins the mutation point and the curriculum point; breaking before the curriculum sentence separates two distinct GA-mapping claims."
	},
	{
	"id": "rec-14",
	"category": "break-paragraph",
	"severity": "low",
	"current": "and the trainer need zero changes, and `ModalSpawnExecutor` is the working existence proof [41].",
	"recommended": "and the trainer need zero changes, and `ModalSpawnExecutor` is the working existence proof [41].\n\n",
	"rationale": "Line 165 (~1021 chars) runs the ADR-005 framing and the AWS S3 mapping together; a trailing break after the existence-proof sentence eases the §8 lede paragraph."
	},
	{
	"id": "rec-15",
	"category": "bold-keyterms",
	"severity": "high",
	"current": "This upgrade from teacher-plurality to execution-oracle fitness is the single most important change and the one the corpus most strongly supports.",
	"recommended": "This upgrade from **teacher-plurality to execution-oracle fitness** is the single most important change and the one the corpus most strongly supports.",
	"rationale": "Bolds the load-bearing term \"teacher-plurality to execution-oracle fitness\" so a skimmer sees the report's central upgrade."
	},
	{
	"id": "rec-16",
	"category": "bold-keyterms",
	"severity": "high",
	"current": "A literal per-turn N-way tree is O(N^D) and economically fatal — ungated, a branching trace prices around $64 versus $0.98 flat [6].",
	"recommended": "A literal per-turn N-way tree is **O(N^D)** and economically fatal — ungated, a branching trace prices around **$64 versus $0.98 flat** [6].",
	"rationale": "Bolds the cost-blowup complexity and the headline price figures a skimmer needs to grasp why divergence-gating is mandatory."
	},
	{
	"id": "rec-17",
	"category": "bold-keyterms",
	"severity": "high",
	"current": "so collapse to a single rollout — turning O(N^D) into roughly O(N · decision-points) [6].",
	"recommended": "so collapse to a single rollout — turning O(N^D) into roughly **O(N · decision-points)** [6].",
	"rationale": "Bolds the target complexity after gating, the key quantitative payoff of the divergence-gated design."
	},
	{
	"id": "rec-18",
	"category": "bold-keyterms",
	"severity": "high",
	"current": "policy.\"** \"Prune versus train-on-all\" is a false binary.",
	"recommended": "policy.\" \"Prune versus train-on-all\" is a false binary.**",
	"rationale": "Bolds the §4 reframe conclusion so the skimmer catches the central thesis that the prune/train-on-all dichotomy is false."
	},
	{
	"id": "rec-19",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "and reaches 50.40% on SWE-bench Verified after three iterations [10].",
	"recommended": "and reaches **50.40%** on SWE-bench Verified after three iterations [10].",
	"rationale": "Bolds the headline Socratic-SWE pass-rate so the closest published analogue's result is scannable."
	},
	{
	"id": "rec-20",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "reaching 65.8% on SWE-bench Verified — crucially training on all trajectories for the world-model head",
	"recommended": "reaching **65.8%** on SWE-bench Verified — crucially training on all trajectories for the world-model head",
	"rationale": "Bolds the CWM existence-proof pass-rate, a key statistic supporting train-on-all for the world-model head."
	},
	{
	"id": "rec-21",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes causally blind (~1e-8) while achieving 92% lower prediction error [18].",
	"recommended": "across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes **causally blind (~1e-8) while achieving 92% lower prediction error** [18].",
	"rationale": "Bolds the two decisive predictive-causal-gap statistics that justify measuring foresight rather than next-state accuracy."
	},
	{
	"id": "rec-22",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "is the kill ablation: if it is ≈0, the token is a no-op and is cut",
	"recommended": "is **the kill ablation**: if it is ≈0, the token is a no-op and is cut",
	"rationale": "Bolds \"the kill ablation\" so the skimmer registers Foresight@k as the decisive cut criterion for the world-model head."
	},
	{
	"id": "rec-23",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "DeepSWE 42.2% Pass@1, 59% with test-time scaling, from pure outcome RL — stronger-teacher SFT hurt [43]",
	"recommended": "DeepSWE **42.2% Pass@1**, 59% with test-time scaling, from pure outcome RL — stronger-teacher SFT hurt [43]",
	"rationale": "Bolds the DeepSWE headline figure, the incumbent baseline every later phase must beat at equal compute."
	},
	{
	"id": "rec-24",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "The strongest argument that this is buildable is that the substrate already exists — roughly nine-tenths of it.",
	"recommended": "The strongest argument that this is buildable is that the substrate already exists — **roughly nine-tenths of it**.",
	"rationale": "Bolds the reuse-fraction claim that anchors the entire §6 reuse-vs-build ledger."
	},
	{
	"id": "rec-25",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "whether the divergence-gated tree beats an equal-budget outcome-only GRPO baseline on long-horizon tasks — and it has never been run.",
	"recommended": "whether the divergence-gated tree beats an equal-budget outcome-only GRPO baseline on long-horizon tasks — and it has never been run.",
	"rationale": "Bolds the program's single most important unrun experiment so the skimmer catches the central open question of §7."
	},
	{
	"id": "rec-26",
	"category": "bold-keyterms",
	"severity": "medium",
	"current": "This is the single biggest architectural payoff of the object-store design on Kubernetes.",
	"recommended": "This is the single biggest architectural payoff of the object-store design on Kubernetes.",
	"rationale": "Bolds the headline architectural claim that gang scheduling is unneeded for inter-replica DiLoCo sync."
	},
	{
	"id": "rec-27",
	"category": "bold-keyterms",
	"severity": "low",
	"current": "First, `strip_thinking` must be `False`: ~67% of real Claude Code error-recovery turns are pure thinking, and stripping them yields empty SDPO masks that silently collapse two-thirds of the channel's supervision sites",
	"recommended": "First, `strip_thinking` must be `False`: **~67% of real Claude Code error-recovery turns are pure thinking**, and stripping them yields empty SDPO masks that silently collapse two-thirds of the channel's supervision sites",
	"rationale": "Bolds the 67%-thinking statistic that makes strip_thinking=False a load-bearing repo configuration fact."
	},
	{
	"id": "rec-28",
	"category": "bold-keyterms",
	"severity": "low",
	"current": "Calibration (ECE/Brier on the predicted-outcome head) is primary, because the documented failure is over-confidence; next-state accuracy is a secondary diagnostic.",
	"recommended": "Calibration (ECE/Brier on the predicted-outcome head) is primary, because the documented failure is over-confidence; next-state accuracy is a secondary diagnostic.",
	"rationale": "Bolds the primary-measurement decision (calibration over next-state accuracy), a load-bearing methodological choice in §2."
	},
	{
	"id": "rec-29",
	"category": "split-sentence",
	"severity": "medium",
	"current": "The divergence tree has a rigorous backbone: sibling A and B from a shared parent reaching different executed outcomes is a model-free Monte-Carlo counterfactual credit estimate, low-variance because the shared parent differences out the baseline — a group-relative/leave-one-out argument (Tree-GRPO [44]) — which the executed-sibling structure then approximates non-parametrically for the stronger, hindsight-conditioned variant that learned counterfactual-credit methods (CCA [33]) achieve with a learned hindsight model, and min-form/bottleneck-localized because the credit-bearing step is the earliest node where sibling subtrees separate [33].",
	"recommended": "The divergence tree has a rigorous backbone: sibling A and B from a shared parent reaching different executed outcomes is a model-free Monte-Carlo counterfactual credit estimate, low-variance because the shared parent differences out the baseline — a group-relative/leave-one-out argument (Tree-GRPO [44]). The executed-sibling structure then approximates non-parametrically the stronger, hindsight-conditioned variant that learned counterfactual-credit methods (CCA [33]) achieve with a learned hindsight model, and it is min-form/bottleneck-localized because the credit-bearing step is the earliest node where sibling subtrees separate [33].",
	"rationale": "This ~80-word run-on survived polish; splitting at the Tree-GRPO clause yields two readable sentences without losing any clause."
	}
	]