composer-replication-framework/research/critic-findings-dialectic.json

Baladithya Balamurugan

Wave 1: fix 8 failing tests + unblock Docker E2E + dep/doc debt

c11cf49 25 days ago

	{
	"critic": "dialectic",
	"overall": "The report engages all six mandated skeptic disconfirmers (single-agent>=multi 2604.02460; aux interference 2602.00994; myopic 2605.06840; predictive-causal gap 2605.05029; oracle-gamed EvilGenie/RLVR-hacking; outcome-only DeepSWE/SWE-RL) and renders the two contested flourishes (heterogeneity, world-model aux loss) as explicit pre-registered ablation arms with stated falsifiers and flip-conditions. Provenance is clean: Channel 3, the tree, and FeatureDeletionEnv bug injection are correctly attributed to the framework's own additions, never to Cursor (Ch1 Dr.GRPO + Ch2 SDPO) and never to Socratic-SWE (which the report correctly notes does NOT inject bugs). The central prune-vs-train-on-all question is committed (typed train-on-all under two hard gates), not hedged. The heterogeneity axis (§3/§7 Pushback 1) is solid and faithful to the counter-evidence note, with the equal-compute control arm and the surviving anti-collapse justification both correct. The findings below are not about missing disconfirmers but about (a) one in-repo counter-position the report straw-manned toward optimism, (b) a categorical claim its own DeepSWE source contradicts, (c) asymmetric domain-transfer skepticism applied to the pro side but not to load-bearing non-SWE disconfirmers it relies on, (d) a directly-SWE disconfirmer present in the corpus but never cited, and (e) a numerical misread. Few, high-quality.",
	"findings": [
	{
	"severity": "high",
	"section": "5. Pipeline Shape: Two Loops, Not Two Phases",
	"issue": "The report straw-mans its own repo's counter-position. It claims self-distillation 'in this configuration, [is] a stabilizer and not only a collapse risk' citing SDFT, and treats Channel-2 SDPO as 'exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses.' But the repo's own ADR-013 (read in adr-decision-backbone note) states the opposite about THIS exact channel: 'SDPO against the altered model's own hint-conditioned forward pass is the channel most likely to AMPLIFY the distortion' and is 'an experimental intervention, not a benign stabilizer' (teacher==student-family; if hints add no independent info the optimum is to imitate the altered conditional, sharpening a soft bias into a hard preference). The report cites the optimistic external SDFT result while omitting the pessimistic in-repo finding on the very same mechanism, leaving the 'stabilizer' framing one-sided.",
	"fix": "Add a clause acknowledging the repo's own counter-position: e.g. after 'is exactly that on-policy, demonstration-conditioned regime' add '— though the repo's own ADR-013 warns the same SDPO channel is the one most likely to AMPLIFY an existing distortion when the teacher is same-family and the hint adds no independent information, so the stabilizer claim holds only when the privileged-information conditioning carries genuine new signal (the per-turn JSD signal-presence gate of §4).'",
	"anchor_quote": "Self-distillation in the inner loop is, in this configuration, a stabilizer and not only a collapse risk"
	},
	{
	"severity": "high",
	"section": "5. Pipeline Shape: Two Loops, Not Two Phases",
	"issue": "Categorical overclaim contradicted by the report's own cited source. The report asserts a clean dichotomy: 'every working SWE flywheel optimizes a true execution oracle ...; every collapse story requires a proxy or self-judged verifier.' But DeepSWE [43] — cited approvingly two sentences later and throughout — documents near-collapse on a TRUE 0/1 execution oracle from positives alone: 'LLM agents may stumble upon correct patches and pass all tests without knowing. Training with these positives reinforces undesired behaviors ... leading to collapse,' which is precisely why DeepSWE needed compact filtering. So a true execution oracle did NOT prevent a collapse mode; positives on a real oracle produced it. The 'every collapse story requires a proxy' claim is falsified by the report's own evidence base.",
	"fix": "Soften the dichotomy to acknowledge the positives-on-a-true-oracle collapse mode: e.g. change 'every collapse story requires a proxy or self-judged verifier' to 'most collapse stories require a proxy or self-judged verifier — though even a true execution oracle can collapse if positives reinforce accidental passes (DeepSWE's compact-filtering motivation [43]), which is a further argument for the per-turn signal gate and submit-gated credit.'",
	"anchor_quote": "every collapse story requires a proxy or self-judged verifier"
	},
	{
	"severity": "medium",
	"section": "7. What the Literature Says (and Where It Pushes Back)",
	"issue": "Asymmetric domain-transfer skepticism. The report disarms the pro-simulation cluster with 'none of those is a SWE-pass-rate result at equal compute — they are calibration, reasoning-trace, and non-SWE results.' But the report applies no such discount to two load-bearing disconfirmers that are equally non-SWE: the anti-emergence 'killer fact' (§2, [11] 2601.03905) is a vision-language-model agentic+VQA study, and 'the single most decisive result for this project' (§4, [27] 2503.14391) is a multiple-choice-QA Likra study, not SWE and not the DPO/GRPO regime in use. The same 'not a SWE-pass-rate result at equal compute' burden the report imposes on the pro side should be acknowledged for these anti-side pillars, or the symmetry argument is one-directional.",
	"fix": "Add a one-clause symmetry caveat where the burden-shift is stated, e.g. after 'they are calibration, reasoning-trace, and non-SWE results' add '(the same domain-transfer caveat applies to the anti-side pillars — the world-model-as-tool foresight result [11] is VLM/VQA and the near-miss-calibration result [27] is MCQA — which is why the SWE-specific P0-P6 ablation, not the imported literature, is the actual decider).'",
	"anchor_quote": "none of those is a SWE-pass-rate result at equal compute"
	},
	{
	"severity": "medium",
	"section": "2. The World-Model Goal: Training Latent What-If Deliberation",
	"issue": "The anti-emergence case rests on a non-SWE study while a directly-on-domain SWE disconfirmer in the same corpus is never cited. The 'killer fact against emergence' [11] (2601.03905) is built on vision-language models over 'agentic and visual question answering tasks.' The corpus contains 2604.12147 (Plan Compliance in Autonomous Programming Agents, 16,991 SWE-agent trajectories on SWE-bench Verified + Pro across GPT-5 mini / DeepSeek-R1-V3 / Devstral) — flagged by the corpus-critic as 'the single most on-domain piece of evidence' that SWE agents fall back on memorized workflows and that a subpar/misaligned plan hurts MORE than no plan. It directly supports the report's selective-curriculum-over-naive-train-on-all thesis yet is absent from the citation list (no [49]; sources end at [48]). Grounding the anti-emergence and selective-structure arguments on a VLM/VQA study when a direct SWE result is available weakens the section.",
	"fix": "Cite 2604.12147 in §2 (and/or §4) alongside [11]: e.g. after the foresight-governance sentence add 'and in SWE specifically, a study of 16,991 SWE-agent trajectories on SWE-bench finds agents revert to internalized workflows and that a misaligned plan hurts more than no plan — direct on-domain support for selective, alignment-gated structure over naive train-on-all [49].' Add the source to the Sources list.",
	"anchor_quote": "handed a world model as a tool, agents invoke it under 1% of the time"
	},
	{
	"severity": "medium",
	"section": "2. The World-Model Goal: Training Latent What-If Deliberation",
	"issue": "Numerical misread of the predictive-causal gap. The report says 'across 2,695 networks mean causal fidelity collapses toward ~1e-8 at high dimension while achieving 92% lower prediction error.' Per the source (the-predictive-causal-gap note), the MEAN causal fidelity across the 2,695 configurations is 0.49 (only 2.5% exceed 0.70); the ~1e-8 ('causally blind') figure and the 92%-lower-prediction-error figure are the high-dimension N=100 extreme, not the 2,695-network mean. Coupling '2,695 networks mean causal fidelity' with '~1e-8' conflates the corpus mean with the worst-case dimension and overstates the typical-case magnitude.",
	"fix": "Split the two statistics: e.g. 'across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes causally blind (~1e-8) while achieving 92% lower prediction error.'",
	"anchor_quote": "across 2,695 networks mean causal fidelity collapses toward ~1e-8 at high dimension while achieving 92% lower prediction error"
	},
	{
	"severity": "low",
	"section": "4. The Central Question: Prune Bad Branches vs Train on All Branches",
	"issue": "Under-engaged tension in the oracle-cleanliness argument. The report uses EvilGenie [30] to argue held-out tests are weak ('held-out tests giving only minimal detection improvement') and simultaneously makes the disjoint held-out eval 'the most load-bearing safeguard' (§4, §5 safeguard #2). EvilGenie's own finding is that the held-out-test method gave minimal improvement while the LLM JUDGE was 'highly effective at detecting reward hacking in unambiguous cases' — yet safeguard #1 forbids a learned/self-judged verifier in the training reward and the report leans on held-out eval. The report should reconcile why the safeguard it most relies on is the detector EvilGenie found weakest, and whether the LLM-judge detector (allowed only at test-time selection per safeguard #1) belongs in the monitoring stack.",
	"fix": "Add a reconciling clause where EvilGenie is cited: e.g. 'EvilGenie found held-out tests weak as a detector but the LLM judge effective — so the held-out eval here is load-bearing as a drift TRIPWIRE (proxy-minus-realeval gain) rather than a per-trajectory hack detector, and an LLM-judge monitor is admissible for offline flagging though never as the training reward (safeguard #1).'",
	"anchor_quote": "with held-out tests giving only minimal detection improvement"
	}
	]
	}