composer-replication-framework/research/critic-findings-width.json

Baladithya Balamurugan

	{
	"critic": "width",
	"findings": [
	{
	"severity": "high",
	"section": "## 7. What the Literature Says (and Where It Pushes Back)",
	"issue": "The process-vs-outcome cluster is the empirical backbone of the entire §4 prune-vs-train-on-all argument, yet its two foundational papers are named only once, with the WRONG citations, and have no source IDs of their own. Line 161 reads 'process supervision genuinely beats outcome on reasoning traces (Let's Verify, Uesato) [19][27]', but [19] is the 'LLM-Based World Models' paper (arXiv:2411.08794) and [27] is 'How Much Do LLMs Learn From Negative Examples' (arXiv:2503.14391). Neither is Let's Verify or Uesato. Both foundational papers have dedicated, on-point vault notes (Lightman et al. 2305.20050, 'Let's Verify Step by Step': PRM beats ORM on MATH, releases PRM800K; Uesato et al. 2211.14275: first head-to-head process-vs-outcome on GSM8K, process feedback cuts reasoning error 14.0%→3.4%) that carry zero source IDs anywhere in the report. The single sentence the skeptic-rebuttal rests on mis-attributes its evidence.",
	"fix": "Add two new Sources entries: '[49] Let's Verify Step by Step (Lightman et al.) — arXiv:2305.20050 (PRM process supervision beats ORM on MATH; releases PRM800K)' and '[50] Solving math word problems with process- and outcome-based feedback (Uesato et al.) — arXiv:2211.14275 (first process-vs-outcome head-to-head; process feedback cuts reasoning error 14.0%→3.4% at final-answer parity)'. Then change the line-161 citation from '(Let's Verify, Uesato) [19][27]' to '(Let's Verify [49]; Uesato [50] — process feedback cuts reasoning error 14.0%→3.4% at final-answer parity)' so the named papers carry their own IDs.",
	"anchor_quote": "process supervision genuinely beats outcome on reasoning traces (Let's Verify, Uesato), the world-model field is split"
	},
	{
	"severity": "high",
	"section": "## 1. What We Are Actually Building: From Multi-Teacher Replay to a Counterfactual Tree of Work",
	"issue": "SWE-Search (arXiv:2410.20285) is the single closest published analogue to the proposed system — MCTS over repository-level SWE tasks with per-node value estimation, backtracking, and self-feedback — and the report names it ('SWE-Search expands nodes with one policy') but gives it no source ID and never engages its central, decision-relevant findings: a 23% relative SWE-bench improvement across five models from search ALONE, and the explicit result that performance scales with inference-time compute 'without requiring larger models or additional training data.' That is a sharper version of Pushback 3's skeptic case (does the tree's gain need training at all, or is it just test-time search?) and a direct input to the §7 paradigm table, yet the vault note is completely unused. Leaving the closest prior art uncited weakens the 'claim the synthesis, not the parts' provenance argument.",
	"fix": "Add a Sources entry '[51] SWE-Search: Enhancing Software Agents with MCTS and Iterative Refinement — arXiv:2410.20285 (23% relative SWE-bench gain from search alone, single policy, scales with inference-time compute, no extra training)'. Tag the existing §1 mention 'SWE-Search expands nodes with one policy [51]', and add one clause to Pushback 3 (§7) noting SWE-Search already shows per-node SWE search helps at TEST time without training — so the tree must justify the marginal value of folding that search into TRAINING, not just adding search.",
	"anchor_quote": "SWE-Search expands nodes with one policy; Symphony does heterogeneous-LM planning"
	},
	{
	"severity": "medium",
	"section": "## 3. The Genetic-Algorithm Framing — Where It Holds and Where It Breaks",
	"issue": "Symphony (arXiv:2601.22623, NeurIPS 2025) is the strongest pro-heterogeneity result in the vault — a heterogeneous-LM MCTS planner whose explicit thesis is that single-agent MCTS yields 'insufficient diversity among generated branches' and that a heterogeneous LM pool 'enhances rollout diversity and facilitates more effective exploration,' outperforming SOTA when given API models. The report names Symphony once in §1 with no source ID, then builds Pushback 1 (heterogeneity-is-a-hypothesis) almost entirely on the anti-heterogeneity sources [21][22][23], leaving the heterogeneity-as-ablation framing under-steelmanned. Symphony is precisely the source that says the system's distinctive choice (different model per node) buys exploration diversity — the very 'anti-collapse diversity' the report concedes survives (safeguard 4) but does not source on the capability side.",
	"fix": "Add a Sources entry '[52] SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous LM Assembly — arXiv:2601.22623 (NeurIPS 2025; single-agent MCTS gives insufficient branch diversity; heterogeneous LM pool improves rollout diversity and exploration)'. In §3 Pushback 2 (heterogeneity), add a sentence acknowledging Symphony [52] as the counter-result that frames heterogeneity's surviving justification (exploration/branch diversity), so the ablation is set up as a genuine two-sided question rather than a near-foregone demotion.",
	"anchor_quote": "Symphony does heterogeneous-LM planning"
	},
	{
	"severity": "medium",
	"section": "## 2. The World-Model Goal: Training Latent What-If Deliberation",
	"issue": "Section 2 grounds the latent-deliberation 'value-equivalent / never reconstruct the full state' argument on MuZero [14] and Dreamer [15] (both pre-LLM RL) but leaves the most on-point 2026 vault note — Chain of World (arXiv:2603.03195, CVPR 2026) — entirely uncited. Chain of World is precisely a 'World Model Thinking in Latent Motion' paradigm that factorizes dynamics into a disentangled latent and predicts terminal state rather than reconstructing redundant background — the exact value-equivalent-latent point the report wants to make for SWE ('predict the signed FAIL_TO_PASS delta, never reconstruct the full token sea'). prompt-decomposition.json explicitly lists 'Chain of World' as a required field of the world-model literature cluster, so its absence is a coverage miss against the decomposition.",
	"fix": "Add a Sources entry '[53] Chain of World: World Model Thinking in Latent Motion — arXiv:2603.03195 (CVPR 2026; disentangled latent-motion world model predicts terminal state instead of reconstructing redundant background)'. In §2, append to the MuZero/Dreamer value-equivalent sentence a clause: 'and the latent-motion line carries the same discipline into 2026 — factorize dynamics into a compact latent and predict the consequential terminal state, not the full frame [53]', tying the embodied result to the SWE next-state-delta target.",
	"anchor_quote": "never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15]"
	},
	{
	"severity": "medium",
	"section": "## 4. The Central Question: Prune Bad Branches vs Train on All Branches",
	"issue": "The report cites EvilGenie [30] for two things: hacking prevalence ('explicit hardcoding / test-file edits by Codex and Claude Code') and the finding that held-out unit tests give only 'minimal improvement' in detection (line 79). It then declares the disjoint held-out eval 'the most load-bearing safeguard.' But EvilGenie's other headline finding is that an LLM judge is HIGHLY EFFECTIVE at detecting reward hacking in unambiguous cases. The report omits this LLM-judge-is-strong half, which is decision-relevant: it suggests a cheaper, validated hack DETECTOR (distinct from a learned reward) that the report's own safeguard framing ('learned verifier allowed only at test-time selection') would permit. Omitting it makes the held-out eval look like the only option when the source the report already cites offers a complementary one.",
	"fix": "In §4 (line 79) after 'with held-out tests giving only minimal detection improvement', add: '— while in the same study an LLM judge proved highly effective at flagging unambiguous hacks, suggesting an offline LLM-judge hack-detector (never a training reward) as a cheaper complement to the held-out gate [30].' This uses the already-cited [30] note's second finding without adding a source.",
	"anchor_quote": "with held-out tests giving only minimal detection improvement"
	},
	{
	"severity": "low",
	"section": "## 8. Implementing on AWS EKS (Primary)",
	"issue": "The SWE-rebench / Nebius infrastructure vault note (behind-swe-rebench, a 26KB substantive source on production SWE-task collection + eval-at-scale, evaluating thousands of SWE instances per hour with distributed container orchestration on TractoAI) is unused. It is the most directly-relevant existence proof for the report's central infra claim that mass SWE-task sandboxing/eval is an established distributed pattern — and for §6's task-construction discussion (mining (problem, test-set) pairs from resolved GitHub issues, exactly the FeatureDeletionEnv substrate-inversion pattern). The EKS section leans on DeepSWE's 512-container limit [43] but omits the one note built specifically around scaling SWE-task execution infrastructure.",
	"fix": "Add a Sources entry '[54] Behind SWE-rebench: infrastructure to collect/evaluate SWE tasks at scale — nebius.com (distributed container orchestration evaluating thousands of SWE instances/hour; (problem,test-set) pairs mined from resolved GitHub issues)'. In §8's data-plane or throughput discussion, add one clause citing [54] as production evidence that thousands-per-hour distributed SWE-task execution is an established pattern, reinforcing the 'EKS-primary is cheap to adopt' claim.",
	"anchor_quote": "DeepSWE itself ran rollout collection on Kubernetes with a Cluster Autoscaler over 1000+ CPU cores [45][43]."
	}
	],
	"overall": "The report is unusually wide and dense: 48 source IDs, all 11 required section headings present, and genuinely deep engagement with the hardest clusters. Reward-hacking-with-verifiable-rewards (EvilGenie/RLVR-gaming/synthetic-trajectories, all cited [29][30][31]), self-evolving-collapse (survey §8.3 [38], RSI [29], Self-Play-SWE-RL [8]), the negatives/credit-assignment cluster ([25][26][27][28][33]), and the EKS sandbox/SWE-MiniSandbox/KubeRay/verl/HyperPod stack ([41][42][45][46][48]) are all well-used. Width is therefore strong, not weak. The findings are targeted, not scattershot. The single highest-value gap is a citation MISMATCH: the process-vs-outcome backbone (Let's Verify, Lightman et al. 2305.20050; Uesato et al. 2211.14275), which underpins all of §4, is named once with the wrong source IDs, and the two foundational vault notes carry no IDs at all. The second is the omission of SWE-Search (2410.20285), the closest published per-turn-MCTS-on-SWE prior art, named but uncited and unengaged on its sharpest point (search helps at test time without training). The remaining four are smaller: an under-steelmanned pro-heterogeneity source (Symphony), the most on-point 2026 latent world-model note (Chain of World, also a decomposition required-field miss), a half-used EvilGenie finding (LLM-judge detector), and an unused SWE-rebench infra note. Fixing the two high-severity items (both surgical Source-list additions plus a one-line citation correction) materially strengthens the report's evidentiary spine; the rest are optional polish. No structural rework needed."
	}