# Cross-locus comparisons — argumentative spine
## Tension 1: "Prune" means two different things at two different granularities
- **Locus prune-vs-train-on-all** commits: TRAIN ON ALL branches, but *typed/routed* — winners→policy SFT/RL, losers→DPO rejects + world-model targets; the natural prune is the per-TURN JSD signal-presence test, not per-trajectory survival.
- **Locus selfevolve-flywheel** commits: you MUST prune at the oracle-cleanliness gate before training, because train-on-all distills proxy-hacks (RSI §3.2) — reward-hacking branches must be discarded, not learned from.
- **The cross-locus dynamic:** These look contradictory ("train on all" vs "prune") but reconcile into a precise rule: prune at TWO gates the policy must never cross — (a) oracle-cleanliness (drop reward-hacked / guard-broken branches entirely) and (b) per-turn signal-presence (skip zero-signal turns) — then train on ALL of what survives, routed by signal type. The flywheel locus supplies the safety floor that the prune-vs-all locus's "keep everything" must sit on top of.
- **How the draft should engage this:** §4 must state the resolution as a two-gate filter (cleanliness gate + signal-presence gate) wrapping a typed train-on-all, NOT as "prune vs all". This is the headline reconciliation of the report.
- **Calibration:** prune-vs-all has HIGH confidence that structured negatives beat positives-only; flywheel is HIGH that the gated version compounds but only MEDIUM that the current repo code is sufficient (safeguard #2, the disjoint held-out set + kill-switch, is a documented GAP). Both name the SAME falsifier: a held-out score that declines while in-loop oracle reward rises is collapse caught in the act.
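A minimal sketch of the two-gate filter described above, assuming a hypothetical `Branch` record that carries an `oracle_clean` flag and a per-turn `jsd` score. The field names, the `jsd_threshold` default, and the choice to drop a branch whose turns are all filtered out are illustrative assumptions, not the repo's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    jsd: float  # hypothetical per-turn divergence score (signal presence)


@dataclass
class Branch:
    branch_id: str
    oracle_clean: bool  # False if reward-hacked or guard-broken
    passed: bool        # outcome reported by the execution oracle
    turns: List[Turn] = field(default_factory=list)


def two_gate_filter(branches: List[Branch], jsd_threshold: float = 0.05) -> List[Branch]:
    """Apply gate (a) per branch and gate (b) per turn; keep everything else."""
    survivors = []
    for branch in branches:
        # Gate (a): oracle-cleanliness. Unclean branches are discarded whole.
        if not branch.oracle_clean:
            continue
        # Gate (b): per-turn signal presence. Zero-signal turns are skipped.
        kept = [t for t in branch.turns if t.jsd > jsd_threshold]
        # Assumption: a branch with no surviving turns has nothing to train on.
        if kept:
            survivors.append(Branch(branch.branch_id, True, branch.passed, kept))
    return survivors
```

Note that failed branches pass through both gates untouched as long as they are clean and carry signal; routing them by signal type happens after this filter, not inside it.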
## Tension 2: The failed branch is simultaneously poison (for the policy) and gold (for the world model)
- **Locus worldmodel-latent-deliberation** commits: train-on-all for the world-model head (a failed branch is a *perfect* next-state-prediction label — CWM precedent), prune/reward-filter for the GRPO policy head — "same tree, two harvests."
- **Locus prune-vs-train-on-all** commits: the single best use of a failed branch is exactly a world-model next-state-prediction target (route #2, "no policy-gradient penalty at all").
- **The cross-locus dynamic:** Strong CONVERGENCE from two independent investigations onto the same mechanism — the failed branch's value is realized by predicting it, not by penalizing the policy with it. This dissolves the prune-vs-all dilemma: you never throw the failed branch away (world model eats it) and you never let it destabilize the policy (no raw negative gradient). Convergence-from-independent-paths is itself a finding.
- **How the draft should engage this:** §2 and §4 must share this "two-harvest" frame explicitly; the world-model aux loss is what *makes* train-on-all safe for the policy, because it relocates the failed-branch signal off the policy gradient.
- **Calibration:** worldmodel is HIGH on the necessity of training the world model, MEDIUM-HIGH that the aux next-state head is the best lever; prune-vs-all independently rates the same head MEDIUM-HIGH. Shared falsifier: foresight@k with aux-ON ≈ token-RL-only (aux content loss redundant at scale).
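The "same tree, two harvests" frame can be sketched as a single pass over the tree that fills two datasets. This is an illustration only: `Step`, `ExecutedBranch`, and `harvest` are hypothetical names, and the sketch assumes the branches have already cleared the oracle-cleanliness gate.

```python
from typing import Dict, List, NamedTuple


class Step(NamedTuple):
    state: str
    action: str
    next_state: str  # observed by executing the action, on any branch


class ExecutedBranch(NamedTuple):
    branch_id: str
    passed: bool
    steps: List[Step]


def harvest(tree: List[ExecutedBranch]) -> Dict[str, list]:
    """Harvest one tree twice: all transitions for the world model,
    reward-filtered branches only for the policy."""
    world_model_targets: List[Step] = []
    policy_branches: List[str] = []
    for branch in tree:
        # Harvest 1: every executed transition is a valid next-state label,
        # whether or not the branch ultimately succeeded.
        world_model_targets.extend(branch.steps)
        # Harvest 2: only passing branches feed the policy head, so failed
        # branches never contribute a raw negative policy gradient.
        if branch.passed:
            policy_branches.append(branch.branch_id)
    return {"world_model": world_model_targets, "policy": policy_branches}
```

The design point is that the failed branch is never discarded and never penalizes the policy directly; it only appears as a prediction target.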
## Tension 3: The expensive tree only pays for itself if expansion is divergence-gated — and that gate is where the world model earns its keep
- **Locus credit-assignment** commits: the divergence tree is a genuine PRM-free counterfactual process oracle, but O(N^D) cost means it's worth it ONLY with divergence-gated expansion (branch only at high-VOI turns where heterogeneous models already disagree) → ~O(N·decision-points).
- **Locus worldmodel** commits: the bottleneck the literature identifies (2601.03905) is foresight *governance* — when/whether to deliberate — not simulator fidelity; RL on the `<deliberate>` token's *placement* teaches governance.
- **The cross-locus dynamic:** COMPLICATION-into-synthesis: the same "where to spend deliberation" question appears as a COST control in credit-assignment (where to branch the env) and as a CAPABILITY target in worldmodel (where to emit `<deliberate>`). They are the same decision learned at two levels — the trained world model's governance signal is exactly the policy that should drive divergence-gated expansion at data-generation time. The system's most expensive knob (branch factor) and its core capability (foresight governance) are the same lever.
- **How the draft should engage this:** §3 (GA) and the §8 cost section must tie the divergence gate to VOI; note the bootstrap — early rounds gate on cross-model disagreement, later rounds can gate on the model's own learned deliberation-confidence.
- **Calibration:** credit-assignment is conditional ("YES but gated"); its falsifier (divergence-gated arm fails to beat equal-budget outcome-only GRPO++ on long-horizon tasks) is the single most important compute-matched ablation in the whole program.
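To make the cost argument concrete, the sketch below compares rollout counts and shows one possible disagreement gate. It rests on stated assumptions: the gated count reads "~O(N·decision-points)" as one trunk trajectory plus N-1 sibling rollouts at each gated turn, with no further branching inside the siblings, and the plurality-based `disagreement` score and its threshold are illustrative stand-ins for whatever VOI estimate the system actually uses.

```python
from collections import Counter
from typing import List


def ungated_rollouts(n_models: int, depth: int) -> int:
    """Leaves of a full tree that branches N ways at every one of D turns."""
    return n_models ** depth


def gated_rollouts(n_models: int, decision_points: int) -> int:
    """One trunk plus N-1 extra siblings at each gated decision point."""
    return 1 + decision_points * (n_models - 1)


def disagreement(proposals: List[str]) -> float:
    """Share of models that do not back the plurality action (0.0 = unanimous)."""
    plurality = Counter(proposals).most_common(1)[0][1]
    return 1.0 - plurality / len(proposals)


def should_branch(proposals: List[str], threshold: float = 0.5) -> bool:
    """Expand the environment only where the models already disagree."""
    return disagreement(proposals) >= threshold
```

For example, with 4 models over 5 turns the ungated tree has 1024 leaves, while gating at 3 decision points needs 10 rollouts under the reading above.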
## Tension 4: Replay entrenches the human distribution — branching is the claimed escape, but only the oracle proves you escaped
- **Locus selfevolve-flywheel** commits: human-trace entrenchment (Self-Play-SWE-RL 2512.18552) is real for the UNGUARDED version; the antidote is counterfactual branching OFF the human path graded by tests — "you fork, you don't replay."
- **Locus credit-assignment** commits: sibling divergence (different models reaching different EXECUTED outcomes from a shared parent) is the unit of signal — which is precisely a fork off the parent trajectory, validated by execution.
- **The cross-locus dynamic:** CONVERGENCE plus a caveat: branching is the mechanism that turns "replay" into "counterfactual exploration," and both loci agree the EXECUTION ORACLE (not teacher consensus, not a learned verifier) is what certifies the fork found something real. The repo's Channel 3 today is *weaker* on this axis precisely because its fitness is teacher-plurality, not test execution — the upgrade to execution-graded branching is the core delta.
- **How the draft should engage this:** §1 and §6 must name this as the single most important upgrade over the repo's current Channel 3 (teacher-plurality fitness → execution-oracle fitness) and as the answer to the strongest adversarial prior.
- **Calibration:** flywheel HIGH that branching+oracle escapes entrenchment; the open risk both flag is that a system generating its own tasks from its own traces can drift the held-out set toward the train set.
## Tension 5: EKS-primary is cheap to adopt in CODE but the genuinely-new cost is sandbox fan-out — which is also the throughput ceiling of the whole idea
- **Locus eks-architecture** commits: EKS-primary single-control-plane hybrid; the repo port is a ~300 LOC leaf adapter (EKSExecutor + SageMakerExecutor); BUT the one genuinely-new infra is per-branch sandbox isolation, and per-branch cold-start can dominate outer-loop wall-clock.
- **Locus credit-assignment** commits: the rollout/branching is the system's most expensive piece ($64/trace ungated vs $0.98 flat); divergence-gating is mandatory.
- **The cross-locus dynamic:** The architecture locus's "strongest counter" (sandbox cold-start dominates → demote EKS from 'primary for everything' to 'primary for control+training, bespoke pool for sandbox execution') is the SAME bottleneck the credit-assignment locus controls with divergence-gating. Infra cost and algorithmic cost are the same constraint: branch factor × sandbox cold-start. SWE-MiniSandbox (container-free kernel isolation, ~5% disk / ~25% env-prep) is the throughput primitive that makes high fan-out affordable.
- **How the draft should engage this:** §8/§10 must connect the algorithmic gate (divergence-gating, §3) to the infra primitive (cheap sandboxes, container-free or snapshotted microVM) — the two cost controls are one. Honestly flag the "demote EKS for sandboxes" fallback.
- **Calibration:** eks-architecture HIGH (8/10) on the design; explicit falsifier = measured per-branch sandbox cold-start dominating wall-clock.
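The eks-architecture falsifier is directly measurable. A minimal sketch, assuming per-branch timings are logged as hypothetical `(cold_start_seconds, execution_seconds)` pairs and that "dominating" means more than half of total wall-clock; both the log shape and the 0.5 cutoff are assumptions chosen for illustration.

```python
from typing import List, Tuple


def cold_start_share(samples: List[Tuple[float, float]]) -> float:
    """Fraction of total per-branch wall-clock spent on sandbox cold-start."""
    cold = sum(c for c, _ in samples)
    total = sum(c + e for c, e in samples)
    return cold / total if total else 0.0


def cold_start_dominates(samples: List[Tuple[float, float]], cutoff: float = 0.5) -> bool:
    """True when the measured cold-start share exceeds the assumed cutoff,
    i.e. the condition under which the sandbox fallback should be considered."""
    return cold_start_share(samples) > cutoff
```

Because the share scales with the number of branches launched, the same measurement also shows how much divergence-gating buys on the infra side.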
---
## Step-8 corpus-critic confidence revisions (overturning evidence found — MUST be reflected in the draft)
**Revision A — the heterogeneity premise is DOWNGRADED (contested, not assumed-positive).** Adversarial search found substantive counter-evidence that the system's single most distinctive choice (different model family per node + cross-family DPO) may not pay for itself:
- "Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets" (2604.02460): at held-constant reasoning tokens, single-agent matches/beats multi-agent incl. the ensemble variant (the closest analogue to multi-rollout heterogeneous search); "many reported MAS gains are better explained by compute and context effects than by inherent architectural superiority"; holds across Qwen3/DeepSeek-R1/Gemini; Data-Processing-Inequality argument (one agent with full context >= split agents).
- "Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline" (2601.12307): a single-LLM baseline matched AFlow-optimized HETEROGENEOUS (GPT-4o-mini + Claude-Haiku) MCTS workflows at lower cost.
- Cross-tokenizer/cross-family distillation is "a largely unsolved problem" (2604.07466 BLD + CTPD/CDM cluster): cross-family preference transfer is fragile, sometimes DEGRADES, needs special byte-level/OT machinery.
- **Engagement guidance:** §1/§3/§4 must treat heterogeneity as a HYPOTHESIS requiring an equal-compute control arm (single strong model with N temperature/persona samples) before claiming any heterogeneity gain. The typed-train-on-all and divergence-tree positions do NOT depend on heterogeneity (they work with homogeneous N-sampling too), so the core design survives — but the "different models per node" flourish is now an ablation question, not a premise. NOTE: safeguard #4's "N>=3 population as anti-collapse diversity" SURVIVES (no source showed model-diversity gives zero anti-collapse benefit; on-policy-distillation survey ties gains to predictive diversity).
**Revision B — the world-model aux loss is DEMOTED from "necessary" to "optional, parameter-isolated, ablation-gated."** Direct 2026 counter-evidence on all three angles:
- "Reasoning and Tool-use Compete in Agentic RL" (2602.00994): jointly training two capabilities into one parameter set induces misaligned gradients / interference; decoupling into separate LoRA adapters (DART) beats joint optimization. → stacking a 2nd SDPO/next-state head onto the SAME policy head is the exact interfering configuration; argue for a separate head/adapter.
- "Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning" (2605.06840): LLMs generate deep look-ahead in CoT but move choice is causally driven by shallow depth-1 nodes — foresight content generated but NOT consumed. So improving prediction quality may not move decisions.
- "The Predictive-Causal Gap: An Impossibility Theorem" (2605.05029): pure predictive objectives provably/empirically optimize AWAY from causal/decision-relevant structure (92% lower prediction error while causal fidelity ~0).
- Counter-counter (kept honest): SPA, VAGEN, Imagine-then-Plan, FOREAGENT all report explicit future-state simulation HELPS agentic pass-rate — the field is genuinely split.
- **Engagement guidance:** §2 must reframe the aux next-state loss as OPTIONAL, in a parameter-isolated head/adapter (not fused into the policy head), gated behind the pre-registered ablation (aux-ON vs deliberation-token-RL-only) on the PRIMARY metric (pass-rate + counterfactual-foresight, NOT next-state accuracy). This matches the worldmodel investigator's OWN stated falsifier. The cheapest decisive experiment we could run ourselves is the SWE-specific next-state-head ablation (does not exist in the literature yet).
**Revision C — even the EXECUTION ORACLE gets gamed (safeguard #1 is necessary but NOT sufficient).** The flywheel locus claimed a true execution oracle is "categorically different" from a proxy and thus immune to RSI-style depth-amplified hacking. Adversarial search complicated this:
- EvilGenie (2511.21654), "LLMs gaming verifiers: RLVR can lead to reward hacking" (2604.15149), and "Do synthetic trajectories reflect real reward hacking" (2604.23488) show verifiable/test-based rewards ARE gamed: agents hardcode/special-case to pass FAIL_TO_PASS, exploit fractional partial-credit, and overfit held-out tests.
- **Engagement guidance:** §4 (oracle-cleanliness gate) and the safeguards must state that the execution oracle REDUCES but does not eliminate the hack surface; HackMonitor + held-out disjoint eval + the depth kill-switch are doing real work, not belt-and-suspenders. The oracle bounds the hack surface (finite, vs an open-ended proxy) but PASS_TO_PASS guards, test-provenance checks, and contamination control are mandatory, not optional. This makes safeguard #2 (disjoint held-out + kill-switch) MORE load-bearing, not less.
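The shared falsifier (held-out score declining while in-loop oracle reward rises) can be checked mechanically each round. The sketch below is one possible kill-switch trigger, not the repo's implementation: `collapse_detected`, the strict-monotonic test, and the `window` default are all assumptions made for illustration.

```python
from typing import Sequence


def collapse_detected(in_loop_reward: Sequence[float],
                      held_out_score: Sequence[float],
                      window: int = 3) -> bool:
    """Return True when, over the last `window` round-to-round steps,
    in-loop reward rose every step while the held-out score fell every step."""
    needed = window + 1  # window steps require window + 1 measurements
    if len(in_loop_reward) < needed or len(held_out_score) < needed:
        return False  # not enough history to judge
    recent_in = in_loop_reward[-needed:]
    recent_out = held_out_score[-needed:]
    rising = all(b > a for a, b in zip(recent_in, recent_in[1:]))
    falling = all(b < a for a, b in zip(recent_out, recent_out[1:]))
    return rising and falling
```

A stricter or noisier-tolerant rule (for example a slope test over a longer window) may suit real metrics better; the point is only that the trigger depends on a held-out set that stays disjoint from anything the loop trains on.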
Net: the corpus critic STRENGTHENED the report by puncturing two overclaims and complicating a third. The robust core (fork-off-the-human-trace + execution oracle + typed train-on-all + two-gate prune + divergence-gated expansion + 4 safeguards + EKS-primary) is untouched; the two flourishes (heterogeneity-as-premise, aux-loss-as-necessary) become explicit ablation questions. Both shared falsifiers were independently confirmed as the right experiments.
## Summary for the synthesizer
The five loci are NOT orthogonal. They collapse into ONE coherent design with a single through-line: **fork off the human trace with heterogeneous models (a hypothesis pending the equal-compute control arm, per Revision A), grade by a true execution oracle, gate expansion on divergence/VOI, and route the resulting branches by signal type (winners to the policy, all branches incl. failures to a world-model next-state head) under two hard prune gates (oracle-cleanliness, per-turn signal-presence) and four collapse safeguards.** The world-model aux loss is the proposed keystone, subject to Revision B's ablation gate: if it survives, it is simultaneously the project's stated goal, the safe home for failed-branch signal (resolving prune-vs-all), and the learned governance policy that drives divergence-gated expansion (controlling cost). The single most important experiment is the compute-matched, generate-once/route-many P0–P6 ablation on the repo's ADR-013 ladder, measuring calibration/foresight, not just pass@1.