Baladithya Balamurugan

Wave 21: deep-read critical review — 8 source clusters re-read, findings verified


Deep-Read: World-Model / Deliberation Literature — Critical Review

Cluster 6 / Novel-Extension Guard
Reviewer: critical pipeline subagent
Date: 2026-06-09
Sources fetched: MuZero (1911.08265), DreamerV3 (2301.04104), CWM (2510.02387), Chain-of-World (2603.03195), foresight-governance (2601.03905), From-Word-to-World (2512.18832), Predictive-Causal Gap (2605.05029), Reasoning-Tool-Compete/DART (2602.00994), Myopic Planning (2605.06840), Negative Gradient/LLD (2505.18830), RAFT (2504.11343), Near-miss negatives (2503.14391).
Final report reviewed: research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md sections 2–4.

Executive Summary

The report's world-model / deliberation section (§2–4) is well-structured and intellectually honest about uncertainty, but contains five factual misreadings of primary sources, three overclaims dressed as conditional commitments, and two omissions that materially weaken the evidence base.

The most serious finding is a CWM misread: the paper does NOT "train on all trajectories for the world-model head, reserving success-filtering only for the RL reward" in the sense the report implies. CWM uses a mid-training stage architecture, not an auxiliary head on a policy network, and the "train on all" decision applies to a separate, structurally distinct training phase, not an add-on loss riding the policy head. The report imports this result as license for an "aux head trains on all" design that the source does not demonstrate.

The other misreadings are: (a) CWM's 65.8% score requires test-time scaling and is not the base score; (b) the Chain-of-World (2603.03195) paper is a robotics/embodied VLA paper, not an SWE paper; (c) the foresight-governance paper (2601.03905) is VLM/VQA, not SWE; (d) the Predictive-Causal Gap paper (2605.05029) is a single-author preprint with linear-Gaussian proofs and a small Duffing-GRU sweep, yet the report presents the SWE mixed-timescale argument as if it were a theorem about the proposed system, which it is not.

The overclaims are: the report commits to "parameter isolation eliminates the interference risk" when 2602.00994 shows interference on parameter-isolated LoRA modules too; the report treats Foresight@k as a standard metric when it is a proposed construct with no published baseline; and the "two hard prune gates resolve the central question" framing obscures that neither gate addresses the predictive-causal gap the report itself invokes.

The omissions are: no paper in the cluster studies next-state-prediction as an auxiliary loss on a policy network for software engineering tasks, which is the exact configuration proposed, so the evidentiary basis for the aux-loss design rests entirely on analogical transfer from CWM (different architecture) and MuZero/Dreamer (different domain). The report acknowledges uncertainty but does not flag this as the null-evidence zone it is.

Section 2: World-Model Goal — Source-by-Source Findings

Finding 2.1 — CWM (arXiv:2510.02387): Misread of "trains on all" + score overclaim [CRITICAL]

What the report says (§2, line 33):

"Meta's Code World Model mid-trains a 32B model on observation-action trajectories to predict next program state, reaching 65.8% on SWE-bench Verified — crucially training on all trajectories for the world-model head, reserving success-filtering only for the RL reward [13]."

What the source actually says:

The CWM paper uses a three-phase training pipeline (pre-training → mid-training → post-training/RL). The "train on all" decision is a mid-training data decision for an entire separate training stage, verbatim:

"Because our goal with the ForagerAgent data is to learn a comprehensive world model of agentic interactions with code environments, we do not filter trajectories based on whether they succeed at bug or issue resolution." (Section 2.2, ForagerAgent)

This is not an auxiliary loss added to a policy head. CWM is mid-trained as a general-purpose next-state predictor in a dedicated training phase, separate from RL, using 3M ForagerAgent trajectories from 10.2k images. The world-modeling capability is baked into the base model before policy optimization begins. The RL stage (Section 5.3.1) then applies success-filtered rewards on top of a model that already has world-modeling capability from mid-training.

Why this matters for the proposed design: The report uses this citation as support for "a world-model aux head can train on all branches during RL." CWM does NOT support this. CWM supports "dedicate a mid-training stage to train-on-all dynamics learning before RL." These are architecturally distinct: CWM's world-modeling capability is in the base weights, not in an auxiliary head receiving gradients simultaneously with RL. The interference risk (2602.00994) the report cites elsewhere applies precisely to the simultaneous-gradient version CWM does NOT use.

Score overclaim: The 65.8% SWE-bench Verified score requires test-time scaling (multiple candidates + ranking). CWM's base score (single attempt, no retry) is lower. The report does not make this distinction. The verbatim from the CWM abstract: "it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling)." The base score "is computed with a single attempt per instance (no retries, majority voting, or parallel candidates), averaged over multiple runs." CWM's figure-2 caption explicitly notes this gap. Citing "65.8%" without the "(with test-time scaling)" qualifier is a misread of the headline number.

Verdict: Overclaim on two dimensions. The aux-head-on-policy design does not have CWM support. The score should be cited as "65.8% with test-time scaling" or the base score should be stated alongside.

Finding 2.2 — Chain-of-World (arXiv:2603.03195): Wrong domain attribution [HIGH]

What the report says (§2, line 35):

"The latent-motion line carries the same discipline into 2026: factorize dynamics into a compact latent and predict the consequential terminal state, not the full frame [53]."

The source list cites [53] as: "Chain of World: World Model Thinking in Latent Motion — arXiv:2603.03195 (CVPR 2026; disentangled latent-motion world model predicts terminal state instead of reconstructing redundant background)."

What the source actually says:

CoWVLA (Chain-of-World VLA) is a Vision-Language-Action model for robotics. It is submitted to CVPR 2026 under cs.CV. The authors are from Li Auto, Harbin Institute of Technology, and BAAI. The experimental benchmarks are robotic simulation benchmarks (manipulator tasks: grasping cups, placing objects). The architecture uses a video VAE to extract latent motion from physical-world video frames and predicts the terminal visual frame of a robot arm action segment.

This paper has no relevance to software engineering or LLM-based code agents. Its "compact latent, predict terminal state, not full frame" insight is about pixel-space robot dynamics. The analogy to "don't reconstruct the full next repo state, predict the decision-relevant delta" is the report author's inference, not a claim the source makes. Using it as a citation for SWE world modeling is a domain transfer that the source does not support.

Verdict: Improper citation. The paper should either be removed or explicitly noted as "robotics analogy, not SWE evidence."

Finding 2.3 — Foresight Governance Paper (arXiv:2601.03905): Domain-transfer not flagged adequately [MEDIUM]

What the report says (§2, line 29):

"handed a world model as a tool, agents invoke it under 1% of the time, misuse it ~15%, degrade when forced, and consult it less as they grow more capable — the bottleneck is foresight governance [11]"

What the source actually says:

The abstract is verbatim: "Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced."

The report's numbers are accurate. However, the paper's experimental domain is "agentic and VQA tasks" — Vision-Language Models (VLMs) on visual question answering. The degradation figures are from VLM/VQA settings, not SWE task settings. The paper itself notes the bottleneck is foresight governance in those domains.

The report acknowledges this in §7: "the world-model-as-tool foresight result [11] is VLM/VQA." However, in §2 where this result is deployed as a structural argument for "it does not emerge from scale," the domain caveat is absent. A reader of §2 alone would not know this is VLM evidence being applied to SWE.

Verdict: The numbers are accurately quoted, but the domain caveat is deferred to §7 and absent from §2 where the argument is made. Minor but addressable.

Finding 2.4 — Predictive-Causal Gap (arXiv:2605.05029): Scope overstated [MEDIUM]

What the report says (§2, line 37):

"across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes causally blind (~1e-8) while achieving 92% lower prediction error [18]. A SWE repo is exactly mixed-timescale..."

What the source actually says:

The paper is a single-author preprint (Kejun Liu, single affiliation) studying linear-Gaussian dynamics with a theorem and a nonlinear Duffing-GRU sweep. The 2,695-network count and the fidelity numbers are accurate. The theorem proves the gap for linear-Gaussian systems.

The SWE-specific generalization ("A SWE repo is exactly mixed-timescale") is the report's inference, not the paper's claim. The paper does state implications for "world models" in general, but its empirical evidence is limited to linear-Gaussian dynamics and the Duffing oscillator nonlinear extension. There is no SWE experiment, no code model, no NLP experiment.

Further, the paper uses "operational grounding" as a partial mitigation ("operational grounding — restricting the loss to system observables — partially suppresses the gap"). The report correctly notes "The value-equivalent target reduces but, by the theorem, never eliminates the gap" — this is accurate per the abstract. But the "never eliminates" is the theorem's conclusion for linear-Gaussian systems; the practical magnitude of the gap for an LLM trained on code SWE trajectories is unknown.

Verdict: The report uses this paper's impossibility theorem to argue against an aux-head configuration that CWM does not use anyway (see Finding 2.1). The theorem is real and relevant, but applying it as if it proves the SWE aux-head will fail is an extrapolation the paper does not support. It is best used as a risk flag, not a structural argument.

Finding 2.5 — MuZero (arXiv:1911.08265) and DreamerV3 (arXiv:2301.04104): The "value-equivalent" translation is plausible but not direct evidence [MEDIUM]

What the report says (§2, line 33):

"MuZero and Dreamer add the design discipline: learn the value-equivalent latent — predict reward, value, the signed FAIL_TO_PASS delta, the predicted tool_error kind — never reconstruct the full state [14][15]."

What the sources say:

MuZero (arXiv:1911.08265) learns a latent model that predicts reward, policy, and value function for MCTS planning in board games and Atari. DreamerV3 (arXiv:2301.04104) learns a RSSM world model that predicts compact latent states for imagined rollouts. Both papers operate in fully observable, discrete/continuous control domains with clear reward signals.

The "value-equivalent latent" framing is from Schrittwieser et al. and is accurately invoked. However, neither paper has experiments in NLP, code generation, or multi-step software engineering. The translation from "predict reward/value/policy in Atari" to "predict FAIL_TO_PASS delta + tool_error kind for SWE" is the report's design inference.

This is not a misread — it is analogical reasoning from RL theory, which is legitimate. But the report presents these as "design discipline" rather than "analogical design inspiration," which is a subtle overclaim. MuZero and Dreamer provide no direct evidence that their latent-representation principle transfers to transformer-based LLM policy training on code.

Verdict: Sound analogy but presented as established principle. The report should note: MuZero/Dreamer motivate the value-equivalent design direction; they do not demonstrate it works in the LLM-policy-training regime.

Section 3: Aux-Head as "Second SDPO Mode" — Overclaim Analysis

Finding 3.1 — "Parameter isolation eliminates the interference risk" is overclaimed [HIGH]

What the report says (§2, line 39):

"three 2026 results pull the other way, hard, and they are why the aux loss must be a separate head and an ablation. First, interference: 'Reasoning and Tool-use Compete in Agentic RL' shows training reasoning and tool-use into one parameter set induces misaligned gradients, and decoupling into separate adapters (DART) beats every joint baseline across thirteen benchmarks [16] — stacking a next-state head onto the same policy head is exactly the configuration it indicts."

The report then concludes (§2, line 39): the solution is a "parameter-isolated head or adapter."

What arXiv:2602.00994 actually shows:

DART decouples reasoning and tool-use into separate LoRA modules — but these LoRA modules share the same base model weights (frozen). The gradient interference is between the two LoRA adapters, not just between a head and a base. The paper's solution is to use disjoint parameter sets for the two capabilities. This means parameter isolation to a separate LoRA module does reduce interference, but the report's implication that a "parameter-isolated head" fully eliminates the problem is not the paper's finding.

Specifically: DART's separate LoRA modules still share the frozen base, and the paper's ablation shows that even with LoRA decoupling there is residual interference. "Approaches the 2-Agent upper bound" (abstract) means it does not fully close the gap. The 2-Agent upper bound is the theoretical ceiling achieved by having two separate models; separate LoRA modules on the same base do not reach it.

Verdict: The report correctly identifies that joint parameter training is the risk and that isolation helps. The overclaim is that isolation eliminates the risk. The source shows isolation reduces interference but does not eliminate it. The correct framing: "parameter isolation substantially reduces but does not eliminate gradient interference; a fully separate model achieves the upper bound."

Finding 3.2 — "Foresight@k" is proposed, not standard [MEDIUM]

What the report says (§2, line 43):

"Foresight@k — the lift in terminal pass-fraction when the deliberation token is allowed versus suppressed, sampling fixed — is the kill ablation: if it is ≈0, the token is a no-op and is cut [11][2]."

What the sources say:

Neither arXiv:2601.03905 nor any other cited source defines or uses "Foresight@k" as a metric. The term appears to be coined by the report. This is fine — defining a novel metric is reasonable — but the report presents it as a standard metric with source citations, which is misleading.

Verdict: The metric is the report's proposal. The citations [11][2] do not define or use Foresight@k. The report should explicitly mark this as a proposed metric ("we define Foresight@k as...") rather than implying it is established.
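A minimal sketch of how the proposed metric could be computed, following the report's own wording (lift in terminal pass-fraction with the deliberation token allowed versus suppressed, sampling held fixed). The function names `pass_fraction` and `foresight_at_k` and the input layout (k boolean outcomes per task, per condition) are assumptions made for illustration; no cited source defines them.

```python
from typing import Sequence


def pass_fraction(outcomes: Sequence[Sequence[bool]]) -> float:
    """Mean fraction of passing samples, averaged over tasks."""
    per_task = [sum(task) / len(task) for task in outcomes]
    return sum(per_task) / len(per_task)


def foresight_at_k(
    allowed: Sequence[Sequence[bool]],
    suppressed: Sequence[Sequence[bool]],
) -> float:
    """Pass-fraction with the token allowed minus pass-fraction with it suppressed.

    Hypothetical layout: allowed[i] and suppressed[i] hold the k terminal
    outcomes for task i, drawn with identical sampling settings.
    """
    if len(allowed) != len(suppressed):
        raise ValueError("both conditions must cover the same tasks")
    for a, s in zip(allowed, suppressed):
        if len(a) != len(s):
            raise ValueError("each task needs the same k in both conditions")
    return pass_fraction(allowed) - pass_fraction(suppressed)


# Two tasks, k = 4 samples each (made-up outcomes, for illustration only).
allowed = [[True, True, False, True], [False, True, False, False]]
suppressed = [[True, False, False, True], [False, True, False, False]]
lift = foresight_at_k(allowed, suppressed)
```

A lift near zero would correspond to the report's "no-op" case. Because the metric has no published baseline, any threshold for "near zero" would have to be set by the report's authors.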

Finding 3.3 — The aux-loss-as-"second SDPO mode" claim is architecturally creative but unsupported [MEDIUM]

What the report says (§2, line 41):

"A 'predict-the-outcome' target is the same shape: splice the realized post-action observation [...] into the teacher context as the privileged info, and distill the student toward the distribution it would have had if it had foreseen that outcome. [...] Because the teacher is stop-grad, a wrong predicted-outcome hint is bounded-bad..."

This is presented as follows: the aux objective is a "second SDPO mode" riding generalized_jsd_loss, not a new loss term.

Assessment of the claim:

The architectural argument is internally consistent and clever. SDPO is hint-conditioned distillation; if the hint is the realized observation, then distillation toward what the model would have predicted with that hint is "predict the next state." The stop-grad safety argument is valid.

However: the synthesis note in the vault (latent-what-if-deliberation) accurately identifies this as the report's design inference, not something any source supports directly. The gap-fill note (gap-fill-counter-evidence) explicitly identifies the missing ablation: "No SWE-specific next-state-head null result exists yet — that exact ablation is the cheapest decisive experiment we could run ourselves." The final report presents this design as a committed design with supporting evidence, when the supporting evidence is analogical (CWM uses a different architecture; MuZero/Dreamer operate in different domains).

Verdict: The design argument is sound but the evidentiary claim is overstated. The source base for "aux-next-state-loss as second SDPO mode improves SWE agent performance" is zero. All sources are analogical or domain-different. The report should state this explicitly: "no prior work has tested this exact configuration; the nearest existence proof is CWM's mid-training architecture, which is structurally different."
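To make the shape of the distillation target concrete, here is a sketch of one common definition of a generalized Jensen-Shannon divergence between a teacher and a student next-token distribution. This is an assumption for illustration: the project's own generalized_jsd_loss is not shown in this review and may be defined differently, and the name `generalized_jsd_sketch` is hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def generalized_jsd_sketch(
    teacher: Sequence[float],
    student: Sequence[float],
    beta: float = 0.5,
) -> float:
    """beta * KL(teacher || m) + (1 - beta) * KL(student || m), m the mixture.

    In an autodiff framework the teacher distribution would be wrapped in a
    stop-gradient so that only the student is updated. Plain floats carry no
    gradient, so that step is implicit here.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    mixture = [beta * t + (1.0 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl_divergence(teacher, mixture) + (1.0 - beta) * kl_divergence(
        student, mixture
    )


# Teacher conditioned on the realized observation vs. unconditioned student
# (made-up two-token distributions).
disjoint = generalized_jsd_sketch([1.0, 0.0], [0.0, 1.0])
identical = generalized_jsd_sketch([0.3, 0.7], [0.3, 0.7])
```

At beta = 0.5 the divergence is bounded above by log 2, which is one way to read the report's "bounded-bad" remark. Whether that bound matters in practice for SWE trajectories is untested.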

Section 4: Prune-vs-Train-on-All — Source Fidelity Check

Finding 4.1 — RAFT [2504.11343]: The report's characterization is accurate [OK]

What the report says: "RAFT/rejection-sampling is competitive with GRPO at far less complexity, and GRPO's advantage comes from discarding all-wrong prompts — a pruning move [25]."

What the source says (arXiv:2504.11343 abstract): "a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields competitive performance than GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization."

Verdict: The report's characterization closely tracks the abstract. No issue.
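The two filtering moves named in the abstract can be sketched side by side. This is an illustration under assumed data structures (a dict from prompt to a list of (response, reward) pairs), not code from the RAFT paper; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, float]  # (response text, scalar reward)


def raft_filter(groups: Dict[str, List[Sample]]) -> Dict[str, List[str]]:
    """Rejection sampling: keep only positively rewarded responses."""
    kept: Dict[str, List[str]] = {}
    for prompt, samples in groups.items():
        positives = [resp for resp, reward in samples if reward > 0]
        if positives:
            kept[prompt] = positives
    return kept


def drop_all_wrong_prompts(
    groups: Dict[str, List[Sample]]
) -> Dict[str, List[Sample]]:
    """Discard prompts whose every response is incorrect; keep the rest whole."""
    return {
        prompt: samples
        for prompt, samples in groups.items()
        if any(reward > 0 for _, reward in samples)
    }


# Made-up rollouts: one mixed prompt, one prompt with no correct response.
rollouts = {
    "fix-issue-1": [("patch A", 1.0), ("patch B", 0.0)],
    "fix-issue-2": [("patch C", 0.0), ("patch D", 0.0)],
}
raft_kept = raft_filter(rollouts)
pruned = drop_all_wrong_prompts(rollouts)
```

Both functions drop "fix-issue-2" entirely. They differ only on the mixed prompt, where the second keeps the incorrect response alongside the correct one.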

Finding 4.2 — Negative Gradient / LLD [2505.18830]: Characterization is accurate [OK]

What the report says: "the 'squeezing'/lazy-likelihood-displacement pathology, where the likelihood of correct responses barely rises or even drops under blanket per-token penalties [26]."

What the source says (arXiv:2505.18830 abstract): "we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training [...] identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength."

Verdict: Accurately characterized. No issue.

Finding 4.3 — Near-Miss Negatives [2503.14391]: Characterization is accurate but domain is MCQA [OK with caveat]

What the report says: "positives-only training structurally cannot decrease the likelihood of plausible-but-wrong near-misses [27]."

What the source says (arXiv:2503.14391 abstract): "while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them."

The experimental setting is multiple-choice QA benchmarks (MCQA), not SWE. The report acknowledges this in §7: "the near-miss-calibration result [27] is MCQA." In §4 itself, this caveat is absent.

Verdict: Accurately quoted, domain gap present but not flagged in §4.

Finding 4.4 — CWM "trains on all" re-enters §4 without correcting §2 [HIGH]

What the report says (§4, lines 82-83):

"World-model next-state target — the single best foresight lever [...]; no policy penalty at all (§2) [13][27]." "The head is therefore not a bolt-on; it is the mechanism that makes train-on-all safe for the policy, because it relocates failed-branch signal off the policy gradient."

This invokes CWM [13] again as the model for "aux next-state head trains on all." As established in Finding 2.1, CWM's "train on all" is in the mid-training stage on a model that is NOT simultaneously receiving policy-gradient updates. CWM's design does not demonstrate that a simultaneously-trained aux head receiving failed-branch signal is safe for the policy.

Verdict: The §4 argument inherits the §2 misread. The "trains on all" CWM citation does not support the aux-head-on-policy configuration. This is the report's most consequential misread because it is the load-bearing justification for the "failed branch → world model head (safe)" two-harvest design.
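For reference, the routing the report proposes (every branch feeds the world-model target, only resolved branches feed the policy) can be sketched as a data split. The `Branch` type and `route_branches` function are hypothetical. The sketch shows only the routing, not evidence that it is safe when both objectives are trained simultaneously, which is the point this finding disputes.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Branch:
    branch_id: str
    resolved: bool  # whether the branch ended in a passing terminal state


def route_branches(
    branches: Sequence[Branch],
) -> Tuple[List[Branch], List[Branch]]:
    """Split branches into (world-model set, policy set).

    World-model set: all branches, no success filter (train-on-all).
    Policy set: resolved branches only (success-filtered).
    """
    world_model_set = list(branches)
    policy_set = [b for b in branches if b.resolved]
    return world_model_set, policy_set


# Made-up search tree with one resolved and two failed branches.
tree = [Branch("b0", True), Branch("b1", False), Branch("b2", False)]
wm_set, policy_set = route_branches(tree)
```

In CWM the two sets are consumed in separate training stages. The report's design would consume them in the same RL run, and no cited source tests that configuration.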

Missing Evidence / What the Sources Do NOT Say

Missing 1: No paper tests next-state-prediction as aux loss on a policy network for SWE

The entire cluster (MuZero, Dreamer, CWM, Chain-of-World, 2512.18832) provides zero direct evidence for the proposed configuration: an auxiliary next-state-prediction objective appended to a policy network (as a separate head/adapter) during RL on software engineering tasks.

MuZero: game-playing RL with a separate planning model
DreamerV3: latent world model where policy is trained inside the world model's imagined rollouts — structurally opposite to "add aux head to existing policy"
CWM: dedicated mid-training stage, not aux head during RL
Chain-of-World: robotics, not SWE
2512.18832 ("From Word to World"): tests prompting and SFT for next-state prediction, not RL training with aux head

The evidentiary gap is total for the specific proposed configuration. The report should acknowledge: "There is no published ablation of an aux next-state loss on a policy LLM during code RL. CWM is the existence proof for mid-training dynamics; MuZero/Dreamer motivate the value-equivalent latent target. The specific aux-head-during-RL design is ours to test."

Missing 2: The DART paper's scope is narrow

DART (2602.00994) is on retrieval-augmented QA and NL2SQL — not SWE, not multi-step agent tasks with long-horizon tool use. The interference result is between two capabilities (reasoning vs tool-use) in a shared LoRA. Applying it to "next-state-prediction head vs policy head" is another analogical transfer that the source does not make.

Missing 3: The Myopic Planning paper (2605.06840) is not verified in the vault with full content

The vault note for 2605.06840 only has the abstract. The report cites specific causal pruning findings. The full paper is not in the vault as fetched content. The abstract confirms the causal CoT-pruning direction is described, but the specific intervention details ("causal CoT-pruning intervention confirms move selection is driven by shallow depth-1 nodes") are derived from the paper's full text, which was not independently verified from the source. This should be checked directly.

What the Literature DOES Support (for balance)

Explicit dynamics training (mid-training or SFT) beats zero-shot prompting: 2512.18832 demonstrates SFT on trajectories lifts ALFWorld/SciWorld accuracy to 99%/98%. CWM demonstrates mid-training on dynamics produces a strong SWE base. These are real, direct endorsements of some form of dynamics training.
Train-on-all for world model, filter for RL reward: CWM explicitly does this, and the motivating argument ("comprehensive world model") is stated in the paper. The design principle is supported, just not via aux head during RL.
Value-equivalent / decision-relevant targets: MuZero's design principle — predict only what matters (reward, value, policy) — is well-established RL theory. Its application to code (predict FAIL_TO_PASS delta, not full repo state) is a sound architectural translation.
Structured negatives fix near-miss calibration positives-only cannot: 2503.14391 supports this in MCQA. The SWE transfer is an inference.
Foresight governance bottleneck: 2601.03905 clearly identifies this bottleneck in VLM/VQA. The principle is general.
Simultaneous gradient interference: 2602.00994 is a real empirical finding in agentic RL. The prescriptive consequence (use parameter isolation) is supported.

Verdict on Report Commitments (§2–4)

| Commitment in final report | Source support | Verdict |
| --- | --- | --- |
| CWM "trains on all" for world-model head | CWM trains on all in mid-training stage, not during RL aux-head | MISREAD: cite correctly as mid-training |
| CWM reaches "65.8% SWE-bench Verified" | Correct but with test-time scaling; base score is lower | INCOMPLETE: add qualifier |
| Chain-of-World supports "predict terminal state, not full frame" for code | CoWVLA is robotics VLA, not SWE | WRONG DOMAIN: remove or note as robotics analogy |
| Predictive-Causal Gap proves SWE repo is dangerous for aux loss | Theorem is linear-Gaussian; SWE application is author inference | OVERSTATED: demote to risk flag |
| Parameter isolation eliminates interference risk | DART shows isolation reduces, not eliminates | OVERCLAIM: soften to "substantially reduces" |
| Foresight@k is the kill ablation | Metric is proposed by report, not a standard metric | MARK AS PROPOSED METRIC |
| Aux loss as "second SDPO mode" has evidentiary support | No source tests this configuration | ZERO DIRECT EVIDENCE: flag as null-evidence design proposal |
| "Two hard prune gates resolve the central question" | Gates don't address predictive-causal gap | INTERNAL INCONSISTENCY: the theorem is invoked, then resolved by a design that doesn't address it |
| MuZero/Dreamer provide "design discipline" for SWE aux head | They motivate value-equivalent targets; no LLM-SWE experiments | OVERCLAIM: demote to "analogical motivation" |

Recommended Corrections

§2, CWM citation: Rewrite as: "Meta's CWM mid-trains a 32B model in a dedicated pre-RL stage on 3M observation-action trajectories without success filtering, reaching 65.8% on SWE-bench Verified with test-time scaling. The train-on-all decision is for the mid-training dynamics stage, not an auxiliary head during RL." Remove the implication that CWM licenses aux-head-on-policy train-on-all.
§2, Chain-of-World [53]: Flag explicitly as robotics VLA. Either remove or rewrite as: "In the robotics domain, CoWVLA (CVPR 2026) demonstrates the same latent-terminal-state design for embodied agents, providing design-level motivation for the analogous SWE architecture."
§2, Predictive-Causal Gap: Rewrite as a risk flag: "The Predictive-Causal Gap theorem (linear-Gaussian dynamics; 2,695 networks) establishes that predictive objectives can be accurate and causally blind simultaneously. Applied to SWE by analogy, a next-state head could improve token-level prediction while failing to learn decision-relevant dynamics. This is a structural risk, not a demonstrated outcome for LLM SWE training."
§2, Parameter isolation: Change "parameter-isolated head or adapter, never fused into the policy head [16]" to "parameter-isolated head or adapter substantially reduces gradient interference (DART reduces but does not fully close the interference gap to the 2-Agent upper bound [16])."
§2, Foresight@k: Add "(we define this metric; it has no published baseline)" at first use.
§2, null-evidence flag: Add a box or explicit paragraph: "Direct evidence gap: no published paper has tested an auxiliary next-state-prediction objective as an add-on loss during RL on a code policy network. All cited support is analogical transfer from: (a) mid-training architectures (CWM), (b) dedicated world-model planning systems (MuZero, Dreamer), or (c) non-SWE domains (Chain-of-World, 2512.18832). The proposed aux-head-during-RL design is a research hypothesis requiring the P4 ablation in §4, not a design with established support."
§4, CWM [13] re-citation: Correct to: "CWM's mid-training precedent motivates train-on-all dynamics learning; direct evidence for the aux-head variant of this design during RL is absent."
§2, foresight domain caveat: Bring the "VLM/VQA" caveat from §7 into §2 at first use of 2601.03905.

What These Sources Collectively Say About (a), (b), (c)

Question (a): Does next-state-prediction auxiliary head help agent policies (vs hurt via gradient interference)?

Direct evidence: None in SWE or code domain. CWM does it in mid-training, not as aux head. DART shows simultaneous parameter optimization on reasoning+tool-use interferes; parameter isolation helps substantially. 2512.18832 shows SFT (not RL with aux head) helps next-state prediction transfer. The honest position: unknown for the specific proposed configuration; plausible but untested.

Question (b): Does training on failure trajectories help/hurt and under what routing?

Direct evidence for routing: 2503.14391 (MCQA near-miss: negatives help near-miss calibration, positives-only cannot decrease plausible-wrong likelihood). 2505.18830 (raw uniform negative gradient destabilizes). 2504.11343 (positives-only RAFT is competitive on pass@1). Together: raw negatives hurt pass@1, structured negatives fix calibration, SWE-specific ablation unrun. CWM's mid-training: all trajectories useful for dynamics, not for policy. The "two-harvest" design (negatives to world model, not policy) is consistent with all sources but not tested in the proposed configuration.

Question (c): Do "deliberation tokens" / think-before-act distillation have support?

Partial support: CWM discusses "reasoning about environment feedback to improve agentic code generation" as future work and shows an early prototype (Figure 5) of trace-conditioned reasoning. 2512.18832 shows SFT on trajectories enables explicit next-state reasoning. But "deliberation token" as a trainable gate with RL on placement is not in any source. 2605.06840 suggests CoT deliberation content is generated but causally ignored by the model, a direct challenge to the governance-RL-on-token-placement idea. 2601.03905 shows that even with explicit simulation access, agents rarely invoke it and can degrade when it is enforced. The think-before-act idea has motivational support but no direct SWE ablation, and one result (myopic planning) challenges whether the token's content would be consumed.

End of findings. Full-length file: /Users/baladita/Documents/DevBox/composer-replication-framework/research/deepread/06-worldmodel.md