Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
Browse files8 Sonnet readers re-fetched every primary source (Composer 2.5 blog verbatim,
Composer 2 techreport 2603.24477 full HTML, SWE-smith/SWE-Gym/R2E-Gym/SWE-bench,
SDPO/OPSD, GRPO family + Comedy-of-Estimators, DiLoCo line, world-model cluster
+ anti-evidence, LATS/ToT/rStar/Tree-GRPO/SWE-Search/Symphony/Socratic-SWE,
TRL/verl/SkyRL live docs, SWE-MiniSandbox) and cross-checked the repo's research
notes + vision docs against them. 2 adversarial critics (fidelity, design);
every P0/P1 finding independently verified against code + source quotes — 0
refuted. (Feasibility critic stalled on capacity; flagged unreviewed.)
Top verified findings:
- SDPO "mathematically the same" claim is WRONG (Cursor cites it as background;
ours is a third, blog-inspired design) — to be corrected.
- Envisioned tree pipeline had 4 structural breaks: seed-trace/oracle
disjointness, NO rollout harness (SFT corpus had no producer), divergence
gate uncomputable (whitespace-stub normalizer), no Sandbox.fork().
- Zero benchmark decontamination anywhere; no secrets/PII gate; golden_diff
serialization leak; two unreconciled S3 contracts.
- Buy-vs-build inversion: pip install swesmith ships what we planned to build;
its PR Mirror ≡ our gold-patch reversion and is VALIDATED best-of-five by
its ablation (Table 5) — the engine for "point at a repo".
- Fabricated numbers circulating as Cursor-stated (69.3%, "24 generators",
85% compute); CWM misread (mid-training, not RL-time aux head); Streaming
DiLoCo citation doubly wrong; cost figures mislabeled.
13-synthesis-architecture.md: the Stage-0 "point at a repo -> corpus" pipeline
(swesmith buy + rollout harness + ingest gates + canonical trajectory IR +
reconciled contract + acceptance probe, ~900 LOC) and the staged vision ladder.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- research/deepread/00-grounding.md +255 -0
- research/deepread/01-cursor-artifacts.md +326 -0
- research/deepread/02-swe-task-synthesis.md +340 -0
- research/deepread/03-sdpo-opsd.md +556 -0
- research/deepread/04-grpo-family.md +421 -0
- research/deepread/05-diloco.md +309 -0
- research/deepread/06-worldmodel.md +298 -0
- research/deepread/07-trace-replay-tree.md +291 -0
- research/deepread/08-rl-infra.md +292 -0
- research/deepread/10-critic-fidelity.md +169 -0
- research/deepread/11-critic-design.md +180 -0
- research/deepread/12-verified-findings.md +131 -0
- research/deepread/13-synthesis-architecture.md +197 -0
|
@@ -0,0 +1,255 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Grounding Map: Dataset-Generation Pipeline — What Exists vs What Is Claimed vs What Is Envisioned
|
| 2 |
+
|
| 3 |
+
**Agent:** REPO-GROUNDING
|
| 4 |
+
**Date:** 2026-06-09
|
| 5 |
+
**Scope:** composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py,
|
| 6 |
+
hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5,
|
| 7 |
+
research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md,
|
| 8 |
+
docs/BACKLOG_RESOLUTION_2026-06-09.md
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
## (1) Exact Current Dataset-Generation Capability
|
| 13 |
+
|
| 14 |
+
### FeatureDeletionTask schema (`datagen/schema.py`)
|
| 15 |
+
|
| 16 |
+
Six load-bearing fields and what produces each today:
|
| 17 |
+
|
| 18 |
+
| Field | Type | Producer today | Notes |
|
| 19 |
+
|---|---|---|---|
|
| 20 |
+
| `task_id` | `str` | `SweBenchAdapter.to_task()` — copied from `instance["instance_id"]` or `instance["task_id"]` | `"unknown"` if missing |
|
| 21 |
+
| `repo` | `str` | `instance["repo"]` via `SweBenchAdapter.to_task()` | e.g. `"getmoto/moto"` |
|
| 22 |
+
| `base_commit` | `str` | `instance["base_commit"]` | no code to `git checkout` this commit exists today |
|
| 23 |
+
| `broken_image` | `str` | `SweBenchAdapter.image_for(instance)` — either `instance["docker_image"]` (SWE-rebench) or the conventional `swebench/sweb.eval.x86_64.{iid}:latest` | This tag is a **pre-built SWE-bench eval image**; no code in the repo pulls or builds these images |
|
| 24 |
+
| `fail_to_pass` | `tuple[str,...]` | `_as_tuple(instance["FAIL_TO_PASS"])` — handles JSON-encoded string OR list | validated non-empty in `__post_init__` |
|
| 25 |
+
| `pass_to_pass` | `tuple[str,...]` | `_as_tuple(instance["PASS_TO_PASS"])` | may be empty |
|
| 26 |
+
| `test_command` | `str` | `SweBenchAdapter.default_test_command` = `"python -m pytest -q"` | hardcoded; not read from instance |
|
| 27 |
+
| `deleted_symbols` | `tuple[str,...]` | **never populated by SweBenchAdapter** — hardcoded `()` in every substrate inversion | the monitor can't do symbol-provenance checks without this |
|
| 28 |
+
| `golden_diff` | `str` | `instance["patch"]` | held out of repr; used only by validator |
|
| 29 |
+
| `granularity` | `str` | hardcoded `"feature"` in `SweBenchAdapter.to_task()` | CREATE-half escalation (function→file→feature) not wired to anything |
|
| 30 |
+
| `difficulty_prior` | `float` | `instance["difficulty"]` if present (SWE-rebench) else `0.5` | |
|
| 31 |
+
| `upstream_license` | `str` | `instance["license_name"]` | copyleft filter in `is_redistributable()` strips GPL/AGPL/LGPL |
|
| 32 |
+
|
| 33 |
+
### What SweBenchAdapter actually does and does NOT do
|
| 34 |
+
|
| 35 |
+
`SweBenchAdapter.to_task(instance: dict)` is a **pure schema inversion** — it takes one SWE-bench-shaped dict and maps it to a `FeatureDeletionTask`. It does NOT:
|
| 36 |
+
- Pull or build a Docker image
|
| 37 |
+
- Apply the gold patch in reverse (`git apply -R`)
|
| 38 |
+
- Run any tests
|
| 39 |
+
- Discover test node IDs
|
| 40 |
+
- Populate `deleted_symbols` (always empty)
|
| 41 |
+
- Escalate `granularity` beyond the static `"feature"`
|
| 42 |
+
|
| 43 |
+
The broken-repo Docker image is **assumed to exist pre-built** (the SWE-bench project publishes these images; SWE-rebench carries its own `docker_image` field). The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented `[~]` gate in ADR-010 — implemented in concept (the 4-gate validator interface exists, `scrub_tree` is built, `LocalSubprocessSandbox` and `DockerSandbox` are built) but there is no code in the repo that actually clones a repo, runs `git apply -R <gold_patch>`, builds a Docker image, and pushes it to a registry.
|
| 44 |
+
|
| 45 |
+
### What FeatureDeletionEnv does during training (`datagen/env.py`)
|
| 46 |
+
|
| 47 |
+
- `reset(task)` — boots the sandbox (by image tag), returns a text prompt listing failing tests. The prompt exposes `task.repo`, `task.fail_to_pass`, `task.test_command` but NEVER `golden_diff` or `deleted_symbols`.
|
| 48 |
+
- `step(action)` — delegates to `sandbox.exec(action)`, returning observation text; grades on `submit` or turn limit.
|
| 49 |
+
- `_grade()` — runs `sandbox.run_tests(test_command, fail_to_pass + pass_to_pass)`, computes pass-fraction over `fail_to_pass`, gates to 0 if `pass_to_pass` guard is broken OR `HackMonitor.flag()` fires.
|
| 50 |
+
- `reward_fn(prompts, completions, *, task_id, **kwargs)` — TRL `RewardFunc` face; dispatches through `reset`/`step`; feeds fractional credit (not binary) to `DifficultyCurriculum.update`.
|
| 51 |
+
|
| 52 |
+
### Safeguards implemented
|
| 53 |
+
|
| 54 |
+
- `scrub_tree(workdir)` — physically removes `__pycache__`, `.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, `*.pyc/.pyo/.class` before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
|
| 55 |
+
- `SANDBOX_DENYLIST` — blocks `find`, `strings`, `unzip`, `jar`, `javap`, decompilers, `git`. First-token-only check; bypassable via `sh -c "..."`. Documented as defense-in-depth, NOT the wall.
|
| 56 |
+
- `HackMonitor.flag()` — layer 1: substring scan for cache/decompiler signatures in trajectory actions (not in `submit_patch`). Layer 2: patch-provenance — if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat `"__py"+"cache__"` obfuscation), flags the trajectory.
|
| 57 |
+
- `DockerSandbox` — `network_mode='none'`, `read_only=True`, `cap_drop=['ALL']`, `no-new-privileges`, `pids_limit=256`, `mem_limit=1g`, optional gVisor `runtime='runsc'`.
|
| 58 |
+
|
| 59 |
+
### What ingestion/claude_code.py can ingest today
|
| 60 |
+
|
| 61 |
+
`ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]`:
|
| 62 |
+
- Input: Claude Code session JSONL at `~/.claude/projects/<encoded>/<sessionId>.jsonl`
|
| 63 |
+
- Output: one `TraceState` per assistant TURN (`state_id`, `messages`, `student_action`)
|
| 64 |
+
- Skips: subagent files (`agent-` prefix), sidechain records (`isSidechain: True`), `summary` / `attachment` / `queue-operation` / `file-history-snapshot` records
|
| 65 |
+
- `student_action`: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if `strip_thinking=True`)
|
| 66 |
+
- `tool_error` flag: structurally set on `user` messages where any `tool_result` block has `is_error: true` — this is the SDPO error-site detection signal
|
| 67 |
+
- `state_id`: `f"{path.stem}::{state_idx:04d}"`
|
| 68 |
+
- Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL
|
| 69 |
+
|
| 70 |
+
---
|
| 71 |
+
|
| 72 |
+
## (2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)
|
| 73 |
+
|
| 74 |
+
From `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §5/§8/§10.
|
| 75 |
+
|
| 76 |
+
1. **Seed trace ingestion (Stage a):** `ClaudeCodeIngester.ingest()` over `s3://composer-datagen-386931836011-us-west-2/raw/claude_code/**/*.jsonl` → Parquet at `traces/v1/run_id=<id>/part-*.parquet` via AWS Glue 5.0 Spark ETL job (`glue_ingest_job.py`, ~80 LOC, NOT YET BUILT).
|
| 77 |
+
2. **Schema inversion (Stage c1):** `SweBenchAdapter.to_task()` per SWE-bench row → `FeatureDeletionTask` JSONL at `tasks/v1/run_id=<id>/manifest.jsonl` (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (`is_redistributable()`) applied here.
|
| 78 |
+
3. **N-teacher replay (Stage b):** `teacher_replay.replay_trace()` generalized from flat OpenRouter to `BedrockBatchTeacherPool` — write one shared `replay/v1/run_id=<id>/input/states.jsonl`, submit one `CreateModelInvocationJob` per teacher, write `.jsonl.out` per teacher to `replay/v1/.../teacher=<slug>/`. An EMR Serverless aggregation step joins all N outputs by `state_id` → `list[TeacherCallResult]`. (`teacher_replay_bedrock.py`, ~180 LOC, NOT YET BUILT).
|
| 79 |
+
4. **Multi-model tree expansion (the core delta — NOT BUILT):** A `tree_controller.py` (~250–350 LOC, design-only) that, for each `TraceState` node, fires N models, applies each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branches again from the new state, grades leaves with `_grade()`. Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see step 8).
|
| 80 |
+
5. **Sandbox materialization + 4-gate validation (Stage c2):** AWS Batch array jobs on EC2 Spot, one child per task. Each child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task in the S3 manifest, boots `DockerSandbox`/`LocalSubprocessSandbox`, runs `validator.validate_task()` (4 gates), writes `task_grades/v1/run_id=<id>/<task_id>.json`. (`datagen/aws/batch_validate.py`, ~120 LOC, NOT YET BUILT).
|
| 81 |
+
6. **DPO pair extraction + normalization (Stage d):** `extract_dpo_pairs()` (already built in `teacher_replay.py`) on the fan-in of teacher outputs → `DPOPair` rows → `DJNormalizer` data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → `corpus/v1/run_id=<id>/dpo/part-*.parquet` and `corpus/sft/part-*.parquet`. (`replaysim/emr_normalize_job.py`, ~100 LOC, NOT YET BUILT).
|
| 82 |
+
7. **Orchestration:** AWS Step Functions Standard Workflow: `Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda)`. (`infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`, ~250 LOC IaC, NOT YET BUILT).
|
| 83 |
+
8. **S3 typed dataset contract (full set):**
|
| 84 |
+
- `raw/claude_code/**/*.jsonl` — input seed traces
|
| 85 |
+
- `traces/v1/run_id=<id>/part-*.parquet` — TraceState rows (Stage a output)
|
| 86 |
+
- `tasks/v1/run_id=<id>/manifest.jsonl` — FeatureDeletionTask rows (Stage c1 output)
|
| 87 |
+
- `tasks/golden/run_id=<id>/` — golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
|
| 88 |
+
- `replay/v1/run_id=<id>/input/states.jsonl` — shared Bedrock batch input
|
| 89 |
+
- `replay/v1/run_id=<id>/teacher=<slug>/*.jsonl.out` — per-teacher Bedrock batch output
|
| 90 |
+
- `task_grades/v1/run_id=<id>/<task_id>.json` — validator + _grade() results
|
| 91 |
+
- `corpus/v1/run_id=<id>/sft/part-*.parquet` — clean winning trajectories (SFT-first floor)
|
| 92 |
+
- `corpus/v1/run_id=<id>/dpo/part-*.parquet` — DPO pairs (normalized DPOPair)
|
| 93 |
+
- `dpo_pairs/` — divergence-derived DPO pairs from the tree (sibling winners vs losers)
|
| 94 |
+
- `rl_task_pool/` — FeatureDeletionTask registry + DifficultyCurriculum priors
|
| 95 |
+
- `divergence_pairs/` — divergence-annotated nodes (where sibling next-action distributions forked)
|
| 96 |
+
- `wm_tuples/` — (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
|
| 97 |
+
- `holdout/` — disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
|
| 98 |
+
- `diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt` — DiLoCo outer-sync (already used by existing allreduce.py)
|
| 99 |
+
- `manifests/run_id=<id>.json` — run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)
|
| 100 |
+
9. **SFT-first stage:** Read `sft_corpus/` (clean `_grade()` gate-1 passing trajectories), run `compose_loss` with `alpha_sdpo=0, beta_replay=0` (reduces to `_lm_response_ce` — next-token CE masked to response tokens), write `ckpt_sft/`. (`pipeline/sft_floor.py`, ~60 LOC, NOT YET BUILT).
|
| 101 |
+
10. **Inner RL loop:** `ComposerReplicationTrainer` (trl.GRPOTrainer subclass) on `rl_task_pool/` with `FeatureDeletionEnv.reward_fn`; `total = grpo + α·sdpo + β·trace_replay_dpo`; DiLoCo outer-sync via S3; `HeldOutGuard` kill-switch now wired (Wave 3).
|
| 102 |
+
11. **Flywheel:** Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.
|
| 103 |
+
|
| 104 |
+
---
|
| 105 |
+
|
| 106 |
+
## (3) Unbuilt Components the Vision Depends On
|
| 107 |
+
|
| 108 |
+
Every item below is design-only or a skeleton; none has real production code.
|
| 109 |
+
|
| 110 |
+
| Component | File Estimate | Source | Status |
|
| 111 |
+
|---|---|---|---|
|
| 112 |
+
| `datagen/tree_controller.py` — the core delta: env-step between branches, `_grade()` at leaves, divergence-gated expansion, six typed S3 prefix writes | ~250–350 LOC | design-F1, final_report §1/§5/§6 | **0% built** — no file exists |
|
| 113 |
+
| `SiblingBootstrapGenerator` in `hint_generator.py` — select max-reward sibling → emit "a working approach looks like: …" → feed `ctx_teacher` splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | **0% built** — not a class in hint_generator.py at all |
|
| 114 |
+
| `pipeline/s3_layout.py` — typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | **0% built** — no `pipeline/` directory exists |
|
| 115 |
+
| `pipeline/sft_floor.py` — SFT-first driver: read `sft_corpus/`, run TRL SFTTrainer or `compose_loss` `alpha=beta=0`, write `ckpt_sft/` | ~60 LOC | design-F1 §2 / design-F5 d | **0% built** |
|
| 116 |
+
| `teacher_replay_bedrock.py` — `BedrockBatchTeacherPool`: submit one Bedrock `CreateModelInvocationJob` per teacher, poll, parse `.jsonl.out` back into `list[TeacherCallResult]` | ~180 LOC | design-F2 §b | **0% built** |
|
| 117 |
+
| `datagen/aws/batch_validate.py` — AWS Batch array-child entrypoint: read `BATCH_JOB_ARRAY_INDEX` → manifest line → `DockerSandbox` + `validator` + `_grade()` → write `task_grades/` | ~120 LOC | design-F2 §c2 | **0% built** — `datagen/aws/` subdirectory does not exist |
|
| 118 |
+
| `datagen/aws/glue_ingest_job.py` — Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; write `traces/` Parquet | ~80 LOC | design-F2 §a | **0% built** |
|
| 119 |
+
| `replaysim/emr_normalize_job.py` — EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; write `corpus/dpo/` + `corpus/sft/` Parquet | ~100 LOC | design-F2 §d | **0% built** |
|
| 120 |
+
| `datagen/aws/s3_contract.py` — S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers, `recordId==state_id` join helpers, `schema_version`/`split` column injection | ~120 LOC | design-F2 §contract | **0% built** |
|
| 121 |
+
| `infra/datagen_stepfunctions.json` + `infra/datagen_stack.py` — Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | **0% built** — `infra/` directory does not exist |
|
| 122 |
+
| `trainer/composer_trainer.py` world-model head — parameter-isolated next-state adapter + `<deliberate>` token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | **0% built** — grep confirms no `world_model`/`WorldModel`/`next_state_head`/`<deliberate>` anywhere in `composer_replication/` |
|
| 123 |
+
| Broken-repo image builder — code to clone a repo at `base_commit`, apply `git apply -R <golden_diff>`, run `scrub_tree`, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | **0% built** — there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch |
|
| 124 |
+
| `EKSExecutor` (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | **skeleton built** — `eks.py` is a functional executor (IndexedJob dispatch) but the Argo outer-loop controller is 0% |
|
| 125 |
+
| `verl AsyncServer` backend for tool-heavy tree | — | final_report §8 | **0% built** — design note only |
|
| 126 |
+
| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | — | design-F5 §Tier 4 | **0% built** |
|
| 127 |
+
|
| 128 |
+
---
|
| 129 |
+
|
| 130 |
+
## (4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code
|
| 131 |
+
|
| 132 |
+
The `SweBenchAdapter` is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:
|
| 133 |
+
|
| 134 |
+
### Break 1: `broken_image` assumes a pre-built SWE-bench image exists
|
| 135 |
+
|
| 136 |
+
`SweBenchAdapter.image_for()` returns either `instance["docker_image"]` (SWE-rebench) or the convention `swebench/sweb.eval.x86_64.{iid}:latest`. For an arbitrary OSS repo there is no such image. A fresh repo would need:
|
| 137 |
+
- Clone at `base_commit`
|
| 138 |
+
- Install the project's Python/Java/etc. toolchain
|
| 139 |
+
- Apply `git apply -R <golden_diff>` to manufacture the broken state
|
| 140 |
+
- Run `scrub_tree()` to strip caches
|
| 141 |
+
- Build a Docker image that encapsulates this broken state
|
| 142 |
+
- Push the image to a registry accessible by `DockerSandbox.boot()`
|
| 143 |
+
|
| 144 |
+
None of this code exists. `DockerSandbox.boot(image)` raises `RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.")` if the image is absent.
|
| 145 |
+
|
| 146 |
+
### Break 2: `test_command` is hardcoded
|
| 147 |
+
|
| 148 |
+
`SweBenchAdapter.default_test_command = "python -m pytest -q"`. A fresh repo may use `make test`, `npm test`, `cargo test`, `mvn verify`, or any other test runner. There is no test-discovery logic anywhere in the repo.
|
| 149 |
+
|
| 150 |
+
### Break 3: `fail_to_pass` and `pass_to_pass` require pre-existing test labels
|
| 151 |
+
|
| 152 |
+
SWE-bench instances ship with `FAIL_TO_PASS` and `PASS_TO_PASS` as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived — e.g., via coverage analysis or AST-reachability. `FeatureDeletionTask.__post_init__` raises `ValueError` if `fail_to_pass` is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.
|
| 153 |
+
|
| 154 |
+
### Break 4: `deleted_symbols` is never populated
|
| 155 |
+
|
| 156 |
+
`SweBenchAdapter` hardcodes `deleted_symbols=()`. The `HackMonitor._patch_provenance_hack()` check (`monitor.py:157-182`) skips the symbol-reappearance test if `deleted_symbols` is empty — so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.
|
| 157 |
+
|
| 158 |
+
### Break 5: No copyleft scrub for arbitrary repos
|
| 159 |
+
|
| 160 |
+
`is_redistributable()` reads `upstream_license` from `instance["license_name"]`. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.
|
| 161 |
+
|
| 162 |
+
### Break 6: No env setup for non-Python repos
|
| 163 |
+
|
| 164 |
+
`LocalSubprocessSandbox.run_tests` runs `subprocess.run(cmd, shell=True, ...)` against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. `DockerSandbox` depends on a pre-baked image that already has the environment. A fresh Python repo would need `pip install -e .` run inside the image, and a non-Python repo would need a completely different image and test runner.
|
| 165 |
+
|
| 166 |
+
---
|
| 167 |
+
|
| 168 |
+
## (5) What ingestion/claude_code.py Can Ingest Today
|
| 169 |
+
|
| 170 |
+
`ClaudeCodeIngester.ingest(path)` handles exactly one format: **Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionId>.jsonl`.
|
| 171 |
+
|
| 172 |
+
Supported record types handled:
|
| 173 |
+
- `type="user"` — string content or list of text/tool_result blocks → OpenAI-style user message; `tool_error` structural flag set if any `tool_result` block has `is_error: true`
|
| 174 |
+
- `type="assistant"` — list of text/thinking/tool_use blocks → one `TraceState` with `student_action` (full blocks including thinking) and `messages` (history, optionally with thinking stripped)
|
| 175 |
+
|
| 176 |
+
Record types silently skipped:
|
| 177 |
+
- `type="summary"` — Claude Code conversation summary records
|
| 178 |
+
- `type="attachment"`, `"queue-operation"`, `"file-history-snapshot"`, `"last-prompt"`, `"system"` — auxiliary records
|
| 179 |
+
- `isSidechain: True` records — subagent traces (skipped in v0.1 per ADR-002)
|
| 180 |
+
- Files starting with `agent-` — subagent session files by naming convention
|
| 181 |
+
|
| 182 |
+
Structural features:
|
| 183 |
+
- `state_id = f"{path.stem}::{state_idx:04d}"` — stable within-session identifier
|
| 184 |
+
- `strip_thinking` flag (default True) — strips `[THINKING] ...` lines from the teacher-facing `messages` history but keeps them in `student_action`
|
| 185 |
+
- Injects synthetic system prompt at `messages[0]` (`"You are a senior software engineer..."`)
|
| 186 |
+
- Version check: warns on schema version outside `2.x.x`
|
| 187 |
+
|
| 188 |
+
NOT handled by this ingester:
|
| 189 |
+
- OpenHands trajectory format (planned for v0.2 per ADR-002)
|
| 190 |
+
- SWE-smith trajectories (planned for v0.2)
|
| 191 |
+
- Cline VS Code export
|
| 192 |
+
- Aider chat history
|
| 193 |
+
- SWE-bench leaderboard trajectory submissions
|
| 194 |
+
- Any binary or non-JSONL format
|
| 195 |
+
|
| 196 |
+
---
|
| 197 |
+
|
| 198 |
+
## Critical Cross-Checks: What the Repo Claims vs What Exists
|
| 199 |
+
|
| 200 |
+
### Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")
|
| 201 |
+
**What the blog says (COMPOSER_RECIPE_MAPPING.md):** "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests."
|
| 202 |
+
**What the repo does:** Inverts *existing* SWE-bench-shaped instances — reverts their gold patch. There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones. The ADR correctly scopes this as "Option A — invert OSS substrates" vs "Option B — greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was *explicitly rejected*.
|
| 203 |
+
|
| 204 |
+
### Claim 2: "25× synthetic data"
|
| 205 |
+
**What the blog says:** Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2).
|
| 206 |
+
**What the repo has:** A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the *training run*; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.
|
| 207 |
+
|
| 208 |
+
### Claim 3: "Dynamic difficulty curriculum — select for AND create harder tasks"
|
| 209 |
+
**What Composer 2.5 says:** "We both select for and create harder tasks dynamically throughout the run."
|
| 210 |
+
**What the repo has:** The SELECT-FOR half: `DifficultyCurriculum` with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. `granularity` is set statically to `"feature"` for all SweBenchAdapter tasks; no escalation logic exists.
|
| 211 |
+
|
| 212 |
+
### Claim 4: `deleted_symbols` enables AST-provenance monitoring
|
| 213 |
+
**What ADR-010 says:** "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads.
|
| 214 |
+
**Reality:** `deleted_symbols=()` on every `SweBenchAdapter`-derived task (line 81 in substrates.py: hardcoded empty tuple). `HackMonitor._patch_provenance_hack()` returns False immediately when `deleted_symbols` is empty (`reappeared = [s for s in deleted_symbols if s and s in patch]` → empty list). The provenance layer of the monitor is a dead code path for all currently-generable tasks.
|
| 215 |
+
|
| 216 |
+
### Claim 5: The tree controller and world-model head are part of the system
|
| 217 |
+
**What design docs say:** "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table).
|
| 218 |
+
**Reality:** The tree controller is 0/0 — no file, no function, no class. Confirmed by exhaustive grep: no `SiblingBootstrap`, `world_model`, `WorldModel`, `next_state_head`, `tree_controller`, `MCTS`, `deliberate_token` anywhere in `composer_replication/`. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) is entirely design.
|
| 219 |
+
|
| 220 |
+
### Claim 6: The broken-repo image is manufactured by the pipeline
|
| 221 |
+
**What design-F2 says:** Step c2 involves "pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()`, run the test command, confirm FAIL_TO_PASS actually fails."
|
| 222 |
+
**Reality:** This describes what SHOULD happen in the Batch array child. No such code is written. `SweBenchAdapter.image_for()` returns a string tag; that tag must be pre-pulled on the host before `DockerSandbox.boot()` can use it (`RuntimeError` on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.
|
| 223 |
+
|
| 224 |
+
---
|
| 225 |
+
|
| 226 |
+
## Summary of Unbuilt vs Built
|
| 227 |
+
|
| 228 |
+
### BUILT and tested (production-ready CPU, Docker-gated where noted):
|
| 229 |
+
- `FeatureDeletionTask` schema + `FeatureDeletionEnv` (reset/step/_grade/reward_fn)
|
| 230 |
+
- `SweBenchAdapter` schema inversion (pure dict transform)
|
| 231 |
+
- `FakeSandbox`, `LocalSubprocessSandbox`, `DockerSandbox` (hardware-gated e2e green in Wave 1/2)
|
| 232 |
+
- `scrub_tree()` primary reward-hack control
|
| 233 |
+
- `HackMonitor` (signature + patch-provenance, obfuscation-resistant)
|
| 234 |
+
- `DifficultyCurriculum` (SELECT-FOR half + effort tilt)
|
| 235 |
+
- `validate_task()` 4-gate solvability validator
|
| 236 |
+
- `ClaudeCodeIngester` (Claude Code JSONL only)
|
| 237 |
+
- `behavior_rewards.py` — `c_length`, `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (Wave 20)
|
| 238 |
+
- `kl_in_reward.py` — k1-in-reward path opt-in (Wave 20)
|
| 239 |
+
- `HeldOutGuard` + `HeldoutSplit` + wired into trainer (Wave 2/3)
|
| 240 |
+
- `EKSExecutor` skeleton + `SageMakerExecutor` skeleton (Wave 2)
|
| 241 |
+
|
| 242 |
+
### DESIGN-ONLY (no code):
|
| 243 |
+
- Tree controller (`datagen/tree_controller.py`)
|
| 244 |
+
- `SiblingBootstrapGenerator` in `hint_generator.py`
|
| 245 |
+
- `pipeline/s3_layout.py`, `pipeline/sft_floor.py`
|
| 246 |
+
- `teacher_replay_bedrock.py` (BedrockBatchTeacherPool)
|
| 247 |
+
- `datagen/aws/batch_validate.py`, `datagen/aws/glue_ingest_job.py`, `datagen/aws/s3_contract.py`
|
| 248 |
+
- `replaysim/emr_normalize_job.py`
|
| 249 |
+
- `infra/datagen_stepfunctions.json`, `infra/datagen_stack.py`
|
| 250 |
+
- World-model next-state head in trainer
|
| 251 |
+
- Argo Workflows outer-loop controller
|
| 252 |
+
- Broken-repo image builder (clone → git apply -R → build → push)
|
| 253 |
+
- CREATE half of difficulty curriculum (mint harder tasks during run)
|
| 254 |
+
- SFT-first training stage
|
| 255 |
+
- Offline LLM-judge hack monitor
|
|
@@ -0,0 +1,326 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-Read: Cursor Artifacts — Composer 2.5 Blog + Composer 2 Technical Report
|
| 2 |
+
|
| 3 |
+
> **Written:** 2026-06-09.
|
| 4 |
+
> **Primary sources fetched directly:**
|
| 5 |
+
> - Composer 2.5 blog — full body retrieved from vault note `introducing-composer-25-cursor` (sourced from `https://cursor.com/blog/composer-2-5`).
|
| 6 |
+
> - Composer 2 Technical Report — full HTML body retrieved from `https://arxiv.org/html/2603.24477` (arXiv:2603.24477 v2, 26 Mar 2026), stored as vault note `composer-2-technical-report` (~11 900 words, all sections including appendices).
|
| 7 |
+
> **Repo notes cross-checked:** `research/01-composer-2.5.md`, `research/09-composer-blog-delta-2026.md`, `research/10-composer2-techreport-mining.md`, `research/06-feature-deletion-datagen.md`, `research/07-sdpo-hint-generator.md`.
|
| 8 |
+
> **Tag conventions:** `[SOURCE-VERBATIM]` = quote taken verbatim from the fetched primary source. `[REPO-CLAIM]` = what the repo's notes assert. `[FINDING]` = discrepancy or new fact. `[CONFIRM]` = repo claim verified against primary.
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
## PART 1 — What the primary sources actually say (verbatim extractions)
|
| 13 |
+
|
| 14 |
+
### 1A. Synthetic data — every word the 2.5 blog says
|
| 15 |
+
|
| 16 |
+
The entirety of the "Synthetic data" section of the Composer 2.5 blog (fetched verbatim):
|
| 17 |
+
|
| 18 |
+
> [SOURCE-VERBATIM] *"During RL training, Composer's coding ability improves substantially to the point where it begins to get most training problems correct. To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run. Composer 2.5 is trained with 25x more synthetic tasks than Composer 2.*
|
| 19 |
+
>
|
| 20 |
+
> *We use a range of approaches for creating synthetic tasks that are grounded in real codebases. For example, one synthetic approach is feature deletion. For these tasks the agent is given a codebase with a large set of tests, and asked to delete code and files in such a way that the codebase remains functional while specific testable features are removed. The synthetic task is to reimplement the feature, and the tests are used as a verifiable reward.*
|
| 21 |
+
>
|
| 22 |
+
> *One downstream consequence of large scale synthetic task creation is that it can cause unexpected reward hacking. As the model became more adept, Composer 2.5 was able to find increasingly sophisticated workarounds to solve the task at hand. In one example, the model found a leftover Python type-checking cache and reverse-engineered the format to find a deleted function signature. In another, it was able to find and decompile Java bytecode to reconstruct a third-party API. We were able to find and diagnose these problems using agentic monitoring tools, but they demonstrate the increasing care necessary for large scale RL."*
|
| 23 |
+
|
| 24 |
+
**Critical observations from this text:**
|
| 25 |
+
|
| 26 |
+
1. **The "25x more synthetic tasks" is relative to Composer 2**, not an absolute count. No absolute count is given in any source. The Composer 2 report does not mention synthetic tasks at all (it uses real-problem distributions).
|
| 27 |
+
|
| 28 |
+
2. **Feature deletion is explicitly described as a TWO-PHASE task:** Phase 1 ("task construction") involves a deleter that *"deletes code and files in such a way that the codebase remains functional while specific testable features are removed."* Phase 2 is the reimplementation task for the agent under training. The blog does NOT say who/what performs the deletion (human? model? program?).
|
| 29 |
+
|
| 30 |
+
3. **"A range of approaches"** — the blog explicitly says feature deletion is "one synthetic approach" among a "range." No other generators are named anywhere in either source. The total number of generators is completely unspecified.
|
| 31 |
+
|
| 32 |
+
4. **"Select for and create harder tasks dynamically throughout the run"** — this is a SINGLE sentence that bundles two distinct operations: (a) online difficulty filtering/selection ("select for"), and (b) active task generation ("create"). Neither operation's implementation is described.
|
| 33 |
+
|
| 34 |
+
5. **"Agentic monitoring tools"** — the only stated mitigation for reward hacking. Absolutely no technical detail about these tools is given. No reward penalties, no static analysis, no sandbox specifications.
|
| 35 |
+
|
| 36 |
+
6. **The dynamic curriculum signal is implicit.** The mechanism that drives "select for harder" is unstated. The Composer 2 report (§4) provides the only concrete handle: *"In later stages of training, we use simple heuristics—such as number of turns and thinking tokens of rollouts—to upsample increasingly harder data points."* This is from Composer **2**, not 2.5 — but it is the only stated heuristic.
|
| 37 |
+
|
| 38 |
+
### 1B. Targeted RL with textual feedback — every word the 2.5 blog says
|
| 39 |
+
|
| 40 |
+
Full "Targeted RL with textual feedback" section (verbatim):
|
| 41 |
+
|
| 42 |
+
> [SOURCE-VERBATIM] *"Credit assignment during RL is becoming an increasingly difficult challenge as rollouts can span hundreds of thousands of tokens. When a reward is computed over an entire rollout, it may be hard for the model to tell which specific decision helped or hurt the outcome. This is especially limiting when we want to discourage a localized behavior, such as a bad tool call, a confusing explanation, or a style violation. The final reward can tell us that something went wrong, but it is a noisy signal for where it went wrong.*
|
| 43 |
+
>
|
| 44 |
+
> *To address this, we trained Composer 2.5 with targeted textual feedback.[footnote 1] The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better. For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's. This gives us a localized training signal for the behavior we want to change, while still retaining the broader RL objective over the full trajectory.*
|
| 45 |
+
>
|
| 46 |
+
> *As an illustration of the text feedback process, consider a long rollout that includes a tool call error where the model attempts to call a tool that is not available. During the rollout, the model will receive a "Tool not found" error and continue making additional valid tool calls. The fact that it hit one error in the process of hundreds of tool calls will have a minimal impact on its final reward.*
|
| 47 |
+
>
|
| 48 |
+
> *With text feedback, we can target this specific mistake by inserting a hint in the context of the problematic turn, such as "Reminder: Available tools…" with a list of available tools. This hint changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement. For that turn only, we then update the student weights towards to the new probabilities.*
|
| 49 |
+
>
|
| 50 |
+
> *During the Composer 2.5 run, we applied this method to a variety of model behaviors, from coding style to model communication."*
|
| 51 |
+
|
| 52 |
+
**Footnote 1** (verbatim list from the blog's closing line): "For more background on this approach see Self-Distillation Enables Continual Learning, Reinforcement Learning via Self-Distillation, and Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models."
|
| 53 |
+
|
| 54 |
+
**Critical observations:**
|
| 55 |
+
|
| 56 |
+
1. **The teacher = hint-conditioned forward pass of the SAME weights.** The blog says "use the resulting model distribution as a teacher" after inserting the hint into the local context. This is not a separate model. The weights do not change for the teacher pass.
|
| 57 |
+
|
| 58 |
+
2. **Student = policy with the ORIGINAL context.** The student is trainable; the teacher is stop-gradient.
|
| 59 |
+
|
| 60 |
+
3. **The KL is applied "for that turn only"** — not over the full trajectory. This is a LOCALIZED loss.
|
| 61 |
+
|
| 62 |
+
4. **HOW HINTS ARE CONSTRUCTED is never stated.** The blog says "we construct a short hint describing the desired improvement." The construction mechanism (heuristic templates? LLM judge? human?) is completely absent.
|
| 63 |
+
|
| 64 |
+
5. **Behavior targets explicitly named:** "coding style," "model communication," and implicitly "tool use" (from the example). The blog also separately mentions "effort calibration" as a behavioral improvement. That is four distinct behavior classes.
|
| 65 |
+
|
| 66 |
+
6. **The interaction with the main RL objective is stated:** "while still retaining the broader RL objective over the full trajectory." The SDPO loss is ADDITIVE to the main reward — it does not replace it.
|
| 67 |
+
|
| 68 |
+
### 1C. RL algorithm — from the Composer 2 Technical Report (§4.1)
|
| 69 |
+
|
| 70 |
+
This is NOT in the 2.5 blog. The RL algorithm is fully specified only in the Composer 2 report:
|
| 71 |
+
|
| 72 |
+
> [SOURCE-VERBATIM] *"We use a policy gradient algorithm with multiple samples per prompt and a fixed group size. We operate in the single-epoch regime, i.e., the same prompt is never trained on twice. We utilize Adam as our underlying optimizer and update the full parameter set."*
|
| 73 |
+
|
| 74 |
+
> [SOURCE-VERBATIM] *"As in Dr. GRPO, we found that it is crucial to minimize the bias in the gradients that can arise from transforming the underlying advantage. Following this work, we remove the length standardization term from GRPO as it introduces a length bias. We do not normalize group advantages by their standard deviation, as it results in the degenerate case where small behavioral differences get massively upweighted within a group where every rollout achieves equal correctness."*
|
| 75 |
+
|
| 76 |
+
> [SOURCE-VERBATIM] *"We did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length."*
|
| 77 |
+
|
| 78 |
+
KL regularization (§4.1):
|
| 79 |
+
|
| 80 |
+
> [SOURCE-VERBATIM] *"Similar to prior work, we use a Kullback–Leibler divergence for regularization, KL(q‖p) = E_{x~q}[−log r(x)], r(x)=p(x)/q(x). Many open-source implementations of RL estimate KL with the estimator k3=(r−1)−log r... However, Amini et al. shows that the variance increases drastically as p and q diverge. See Figure 4: for large KL values, the variance of the estimate is extremely large. Therefore, we use the standard estimator k1=−log r instead."*
|
| 81 |
+
|
| 82 |
+
Dynamic curriculum (§4):
|
| 83 |
+
|
| 84 |
+
> [SOURCE-VERBATIM] *"In later stages of training, we use simple heuristics—such as number of turns and thinking tokens of rollouts—to upsample increasingly harder data points."*
|
| 85 |
+
|
| 86 |
+
Agent behavior shaping (§4.2):
|
| 87 |
+
|
| 88 |
+
> [SOURCE-VERBATIM] *"For behavior and communication, we apply an array of auxiliary rewards to ensure the model provides a good experience. These include rewards for coding style, communication, and product-specific penalties for poor tool calls, such as creating to-do list items and then leaving them unfinished. During RL training, we monitor the model for emergent behaviors and occasionally introduce additional behavior rewards as needed. For example, we observed that the model would start to leave long chains-of-thought in comments or collapse to using the terminal tool only."*
|
| 89 |
+
|
| 90 |
+
Nonlinear length penalty (§4.2):
|
| 91 |
+
|
| 92 |
+
> [SOURCE-VERBATIM] *"C_length{k,q}(x) = ((1 + kx)^{1−q} − 1) / (k(1−q)), where k and q are hyperparameters which define the curvature of the penalty, and the input x is a weighted combination of thinking tokens, tool calling tokens, tool output tokens, final message tokens, number of tool calls, and number of turns of a rollout."*
|
| 93 |
+
|
| 94 |
+
Self-summarization (§4.1):
|
| 95 |
+
|
| 96 |
+
> [SOURCE-VERBATIM] *"Each training rollout can involve multiple generations chained together by summaries, rather than a single prompt–response pair. We use the final reward for all tokens produced by the model in the chain. This upweights both the agent responses in good trajectories and also the self-summarizations that made them work."*
|
| 97 |
+
|
| 98 |
+
### 1D. Infrastructure (Anyrun) — Composer 2 Technical Report (§6.2)
|
| 99 |
+
|
| 100 |
+
> [SOURCE-VERBATIM] *"Environments are run on top of Anyrun, an internal compute platform built for running untrusted code at scale. This is the same compute platform that powers Cloud Agents and Automations in the Cursor product."*
|
| 101 |
+
|
| 102 |
+
> [SOURCE-VERBATIM] *"Within a cluster, a distributed set of Anyrun managers schedule pods, scale cloud compute provisioned across multiple regions, and perform state reconciliation to manage hundreds of thousands of pods per cluster. Each pod is a dedicated Firecracker VM capable of running a full development environment, including a browser and GUI for computer use."*
|
| 103 |
+
|
| 104 |
+
> [SOURCE-VERBATIM] *"Each Anyrun cluster is capable of scheduling more than 500 pods per second."*
|
| 105 |
+
|
| 106 |
+
Weight sync (§6.2):
|
| 107 |
+
|
| 108 |
+
> [SOURCE-VERBATIM] *"Every training step, we synchronize updated weights to the inference engine by uploading to a shared S3 bucket. To minimize transfer size, we use delta compression: each rank caches its previous upload and transmits only the diff against the new weights. Because RL updates are small, even with full-parameter training these diffs compress to a handful of gigabytes for the 1T-parameter model."*
|
| 109 |
+
|
| 110 |
+
Inference partner:
|
| 111 |
+
|
| 112 |
+
> [SOURCE-VERBATIM] *"We partner with Fireworks AI to run RL inference."*
|
| 113 |
+
|
| 114 |
+
### 1E. Sharded Muon and dual mesh HSDP — from the 2.5 blog only
|
| 115 |
+
|
| 116 |
+
The full text of the "Sharded Muon and dual mesh HSDP" section (verbatim):
|
| 117 |
+
|
| 118 |
+
> [SOURCE-VERBATIM] *"For continued pretraining, we use Muon with distributed orthogonalization. After forming the momentum update, we run Newton-Schulz at the model's natural granularity: per attention head for attention projections, and per expert for stacked MoE weights.*
|
| 119 |
+
>
|
| 120 |
+
> *The main cost is orthogonalizing expert weights. For sharded parameters, we batch same-shaped tensors, all-to-all shards into complete matrices, run Newton-Schulz, then all-to-all the result back to the original sharded layout. These transfers are asynchronous: while one task is waiting on communication, the optimizer runtime advances other Muon tasks, overlapping network and compute. This is equivalent to full-matrix Muon, but keeps the shard group busy; on the 1T model, optimizer step time is 0.2s.*
|
| 121 |
+
>
|
| 122 |
+
> *This interacts closely with how we use HSDP for MoE models. HSDP forms multiple FSDP replicas and all-reduces gradients across corresponding shards. We use separate HSDP layouts for non-expert and expert weights: non-expert weights are comparatively small, so their FSDP groups can stay narrow, often within a node or rack, while expert weights hold most of the parameters and most of the Muon compute, so they use a wider expert sharding mesh.*
|
| 123 |
+
>
|
| 124 |
+
> *Keeping these layouts separate also lets independent parallelism dimensions overlap: CP=2 and EP=8 can run on 8 GPUs instead of requiring 16 in a single shared mesh. This avoids wide communication for small non-expert state while spreading expert optimizer work over many GPUs."*
|
| 125 |
+
|
| 126 |
+
**Critical observation:** Muon + HSDP is described in the 2.5 blog as being used for **continued pretraining**. The Composer 2 report (§6.1) describes a different sharding scheme for Composer 2 (FSDP + decoupled EP + CP, with AdamW/Adam, no Muon). These are two different systems for two different model versions.
|
| 127 |
+
|
| 128 |
+
---
|
| 129 |
+
|
| 130 |
+
## PART 2 — Critical discrepancies between the repo's research notes and the primary sources
|
| 131 |
+
|
| 132 |
+
### FINDING-1: research/01 claims benchmarks not in either blog
|
| 133 |
+
|
| 134 |
+
**REPO-CLAIM (research/01):**
|
| 135 |
+
> *"CursorBench 69.3%, Terminal-Bench 2.0 parity"* — attributed to the 2.5 blog.
|
| 136 |
+
> *"On Cursor's internal CursorBench, Composer 2.5 scored 69.3% (or ~61-63% depending on the specific benchmark version cited)"*
|
| 137 |
+
|
| 138 |
+
**SOURCE FACT:** The **2.5 blog contains NO benchmark numbers at all.** The 2.5 blog is a brief, methods-focused post with zero numerical results. Benchmark numbers only appear in the **Composer 2 technical report** (Table 1): Composer 2 = 61.3 / 73.7 / 61.7. The 69.3% figure appears nowhere in either primary source.
|
| 139 |
+
|
| 140 |
+
**SEVERITY:** High. The 69.3% figure and "Terminal-Bench 2.0 parity" are fabrications or secondary-source-only claims presented as if Cursor-stated. The audit note at the top of research/01 correctly flags this as "[NOT in the Cursor blog]," which means this file's own audit notice is accurate — but the body of the file still asserts these numbers as though Cursor-stated. Any pipeline design that uses these numbers as performance targets is using unverified figures.
|
| 141 |
+
|
| 142 |
+
### FINDING-2: The "25x more synthetic tasks" claim is correctly reported but its scope is misread
|
| 143 |
+
|
| 144 |
+
**REPO-CLAIM (research/01, research/06):** Accurately quotes the 25x figure. research/06 correctly identifies it as Composer 2.5 vs Composer 2.
|
| 145 |
+
|
| 146 |
+
**SOURCE FACT:** [CONFIRM] The 2.5 blog says verbatim: *"Composer 2.5 is trained with 25x more synthetic tasks than Composer 2."* The absolute number is NEVER stated. The Composer 2 report gives **no figure** for how many synthetic tasks Composer 2 used — it describes only "a problem distribution that reflects the most common use cases" with a category histogram (Fig. 3). There is therefore no baseline from which "25x more" can be translated to an absolute count.
|
| 147 |
+
|
| 148 |
+
**FINDING:** research/06 says "Reaching a '25×-spirit' pool of ~50k–60k tasks" but this number has no grounding in the primary sources. With no Composer 2 synthetic-task count, 25x is not convertible to an absolute. **The 50k–60k estimate is entirely [EXTRAPOLATED]** and is not sourced from the primary documents.
|
| 149 |
+
|
| 150 |
+
### FINDING-3: Feature deletion task structure — the "two agents" framing is inferred, not stated
|
| 151 |
+
|
| 152 |
+
**REPO-CLAIM (research/06, §1):** Interprets feature deletion as a "two-agent / two-phase structure the blog implies" with a distinct "deleter" and a "reimplementation agent."
|
| 153 |
+
|
| 154 |
+
**SOURCE FACT:** The blog says *"the agent is given a codebase with a large set of tests, and **asked to delete code and files** in such a way that the codebase remains functional while specific testable features are removed."* The subject "the agent" is grammatically the SAME agent doing deletion and then the synthetic task is reimplementation. The blog does NOT cleanly describe two separate agents — it could be interpreted as: (1) an agent tasked to do the deletion [two-phase construction], or (2) the blog is describing the construction process generally. The "deleter model vs. program" question (flagged as an open question in research/06) is therefore genuinely open — and the current analysis of "two agents" may be over-reading the blog's grammar.
|
| 155 |
+
|
| 156 |
+
**RECOMMENDATION:** Treat the deleter as unknown. The programmatic AST-deletion approach in research/06 is well-reasoned but is [EXTRAPOLATED]. The blog could equally describe a model-driven deletion step.
|
| 157 |
+
|
| 158 |
+
### FINDING-4: The "other 24 generators" claim is hallucinated
|
| 159 |
+
|
| 160 |
+
**REPO-CLAIM (research/01, §2):** *"Feature Deletion + 24 unnamed generators"* — a specific count of 24 additional generators.
|
| 161 |
+
|
| 162 |
+
**SOURCE FACT:** The blog says: *"We use a **range of approaches** for creating synthetic tasks."* "Range" has no numeric content. Neither the blog nor the Composer 2 report names ANY other generator beyond feature deletion. "24 unnamed generators" is an unsourced extrapolation that has been transcribed as if it were a blog claim.
|
| 163 |
+
|
| 164 |
+
**SEVERITY:** Medium. This affects how the dataset-pipeline scope is scoped. The correct statement is: "feature deletion is the one named generator; the total count and names of others are unknown."
|
| 165 |
+
|
| 166 |
+
### FINDING-5: The RL algorithm in research/01 is labeled [EXTRAPOLATED] — now resolved
|
| 167 |
+
|
| 168 |
+
**REPO-CLAIM (research/01):** *"RL Algorithm: Use a PPO or GRPO variant, modified for long-horizon sparse rewards"* — labeled as [EXTRAPOLATED] by the audit note. This was correct at the time of writing.
|
| 169 |
+
|
| 170 |
+
**SOURCE FACT (Composer 2 Technical Report, §4.1):** [CONFIRM] The algorithm is now known from the primary source: **Dr. GRPO-style** (Liu et al., arXiv:2503.20783). Specific modifications:
|
| 171 |
+
- Remove the length-standardization term.
|
| 172 |
+
- Do NOT normalize group advantages by their standard deviation.
|
| 173 |
+
- k1 KL estimator (−log r), not k3.
|
| 174 |
+
- Adam optimizer, single-epoch regime, full-parameter update.
|
| 175 |
+
- Overlong masking EXPLICITLY REJECTED.
|
| 176 |
+
|
| 177 |
+
research/10 correctly captures all of this as [REPORT-VERIFIED]. The issue is that research/01 has never been updated to reflect the Composer 2 findings.
|
| 178 |
+
|
| 179 |
+
### FINDING-6: "85% of total compute is post-training" — confirmed as community consensus, not Cursor-stated
|
| 180 |
+
|
| 181 |
+
**REPO-CLAIM (research/01, §Overview):** *"roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training and Reinforcement Learning (RL) pipeline."*
|
| 182 |
+
|
| 183 |
+
**SOURCE FACT:** Neither the 2.5 blog nor the Composer 2 technical report contains any compute budget percentages. The 2.5 blog contains NO compute cost figures. The Composer 2 report mentions only that the production training job "spanned 3 GPU regions + 4 CPU regions." This 85% figure is community speculation and should not be stated as a fact in any pipeline design document.
|
| 184 |
+
|
| 185 |
+
### FINDING-7: Sharded Muon is a 2.5 CPT optimization, NOT a Composer 2 optimization
|
| 186 |
+
|
| 187 |
+
**REPO-CLAIM (research/01):** Presents Muon as a Composer 2.5 feature; also discusses alongside "Dual Mesh HSDP."
|
| 188 |
+
|
| 189 |
+
**SOURCE FACT from the 2.5 blog:** Muon + HSDP is explicitly for "continued pretraining" only. The blog section is titled "Sharded Muon and dual mesh HSDP" and begins: "For continued pretraining, we use Muon."
|
| 190 |
+
|
| 191 |
+
**SOURCE FACT from Composer 2 Technical Report (§6.1):** Composer 2 uses **AdamW** for CPT and **Adam** for RL. No Muon. Sharding is FSDP + decoupled EP + Context Parallelism (CP). "HSDP" does not appear in the Composer 2 report — the relevant construct is FSDP + CP.
|
| 192 |
+
|
| 193 |
+
**FINDING:** The 2.5 blog's "HSDP" formulation differs architecturally from Composer 2's FSDP+CP system. They are described as separate evolutionary steps. research/01's treatment of these as the same system is correct for the evolution, but care must be taken not to conflate Composer 2 sharding details with the 2.5 blog's Muon+HSDP description.
|
| 194 |
+
|
| 195 |
+
### FINDING-8: "Anyrun" — correct source attribution now confirmed
|
| 196 |
+
|
| 197 |
+
**REPO-CLAIM (research/01, audit note):** "Anyrun environment harness with LSP/file-I/O/terminal — likely from the Composer 2 report, not 2.5."
|
| 198 |
+
|
| 199 |
+
**SOURCE FACT:** [CONFIRM] Anyrun is mentioned in BOTH sources. The 2.5 blog does not name Anyrun explicitly; the Composer 2 Technical Report (§6.2) provides full internals. research/09 already correctly flagged and confirmed this. The audit note in research/01 is accurate.
|
| 200 |
+
|
| 201 |
+
### FINDING-9: The hint-generation mechanism remains the #1 reproducibility gap — research/07 correctly frames this
|
| 202 |
+
|
| 203 |
+
**REPO-CLAIM (research/07):** Correctly identifies hint generation as fully unstated by Cursor. Proposes a layered (a)→(b)→(c)→(f) generator taxonomy.
|
| 204 |
+
|
| 205 |
+
**SOURCE FACT:** [CONFIRM] Neither the 2.5 blog nor the Composer 2 Technical Report contains any description of how hints are constructed. The Composer 2 report does not even contain the hint mechanism (it is a 2.5-only feature; Composer 2 uses auxiliary scalar rewards instead). The entire hint-generation design in research/07 is well-reasoned extrapolation from SDPO/OPSD papers, appropriately labeled [EXTRAPOLATED].
|
| 206 |
+
|
| 207 |
+
### FINDING-10: Targeted textual feedback is confirmed ABSENT from Composer 2
|
| 208 |
+
|
| 209 |
+
**REPO-CLAIM (research/10):** Correctly states "The Composer 2 technical report contains NO hint-generation / teacher-student textual-feedback mechanism."
|
| 210 |
+
|
| 211 |
+
**SOURCE FACT (Composer 2 Technical Report, §4.2):** [CONFIRM] Composer 2 uses ONLY auxiliary scalar rewards for behavior shaping, plus the nonlinear length penalty. The targeted-textual-feedback method is exclusively a Composer 2.5 contribution. This is a critical boundary for dataset-pipeline design: anything built to replicate Composer 2.5's SDPO layer cannot be validated against the Composer 2 report.
|
| 212 |
+
|
| 213 |
+
### FINDING-11: The CPT→RL causal claim is in the Composer 2 report, not any blog
|
| 214 |
+
|
| 215 |
+
**REPO-CLAIM (research/09, §1):** Correctly flags "cross-entropy loss is predictive of downstream RL performance" as coming from the Composer 2 report.
|
| 216 |
+
|
| 217 |
+
**SOURCE FACT (Composer 2 Technical Report, §3.1):** [CONFIRM] Verbatim: *"cross-entropy loss is indeed predictive of downstream RL performance"* — demonstrated via Qwen3-Coder-30B-A3B at three log-spaced CPT compute levels. This is the empirical justification for doing CPT at all and is a Composer 2 finding, not stated in the 2.5 blog.
|
| 218 |
+
|
| 219 |
+
**IMPLICATION for pipeline design:** Starting from Qwen3-Coder-7B or similar already-code-tuned models as proposed in research/06 is supported by this finding. However, note that the Composer 2 team's CPT recipe (3-phase: 32k bulk → 256k long-context extension → SFT) is not replicable without knowing the data mix percentages and token counts, which remain unstated.
|
| 220 |
+
|
| 221 |
+
### FINDING-12: Dynamic curriculum implementation — only Composer 2 provides a concrete handle
|
| 222 |
+
|
| 223 |
+
**REPO-CLAIM (research/09, research/06):** The 2.5 blog says "select for and create harder tasks dynamically throughout the run" — interpreted as an online curriculum.
|
| 224 |
+
|
| 225 |
+
**SOURCE FACT:** The 2.5 blog gives ZERO implementation detail for the curriculum. The ONLY concrete curriculum handle in either primary source is from the **Composer 2 Technical Report** (§4): *"In later stages of training, we use simple heuristics—such as number of turns and thinking tokens of rollouts—to upsample increasingly harder data points."*
|
| 226 |
+
|
| 227 |
+
**FINDING:** This is a Composer 2 mechanism being mapped onto the 2.5 description. Whether Composer 2.5 uses the same heuristic (turns + thinking tokens) or a different one is unknown. The PassRateCurriculum in research/06 uses pass-rate as the difficulty signal, which is a reasonable [EXTRAPOLATED] alternative to Cursor's stated turns/thinking-tokens heuristic, but is NOT what Cursor described.
|
| 228 |
+
|
| 229 |
+
### FINDING-13: Reward structure — test pass-fraction vs. correctness/succinctness/principles
|
| 230 |
+
|
| 231 |
+
**REPO-CLAIM (research/06):** Designs reward as "test pass fraction (|FAIL_TO_PASS passing| / |FAIL_TO_PASS|), masked by the hack monitor."
|
| 232 |
+
|
| 233 |
+
**SOURCE FACT (Composer 2 Technical Report, §2):** *"a reward is given based on the code's correctness, succinctness, and conformance to software engineering principles."* The Composer 2 reward is multi-dimensional — not purely test pass-fraction. The "succinctness" and "conformance to software engineering principles" components are only partially captured by the length penalty; they imply additional rubrics or reward signals beyond pass/fail.
|
| 234 |
+
|
| 235 |
+
**FINDING:** For Feature Deletion specifically, test pass-fraction is the natural verifiable reward and is consistent with the 2.5 blog's description. But the broader reward structure for non-feature-deletion RL tasks includes quality and style components not reducible to test pass rates.
|
| 236 |
+
|
| 237 |
+
### FINDING-14: Self-summarization is present in Composer 2, not mentioned in 2.5 blog
|
| 238 |
+
|
| 239 |
+
**REPO-CLAIM (research/01):** Does not mention self-summarization.
|
| 240 |
+
|
| 241 |
+
**SOURCE FACT (Composer 2 Technical Report, §4.1):** Self-summarization is described as introduced in Composer 1.5 and carried into Composer 2. The 2.5 blog does not mention it at all. It is therefore likely carried into 2.5 but is not described as a new 2.5 feature. Research notes should treat this as a "likely continued from Composer 2" feature, not a 2.5 innovation.
|
| 242 |
+
|
| 243 |
+
### FINDING-15: MoE router replay — a critical infrastructure detail not in any repo note
|
| 244 |
+
|
| 245 |
+
**SOURCE FACT (Composer 2 Technical Report, §6.2):** *"during inference, the engine returns the selected expert indices for every token at every MoE layer, and during the training forward pass the router's expert assignment is overridden to match."* Extended by filtering replayed experts below a plausibility threshold. This addresses the MoE numerical mismatch between inference and training forward passes.
|
| 246 |
+
|
| 247 |
+
**FINDING:** This detail is critical for any MoE-base RL implementation (relevant if the framework uses Kimi K2.5 or another MoE). It is described in research/10 correctly, but it is absent from research/01 and the design docs. Any replication plan that assumes "just run GRPO on a MoE" will encounter numerical divergence issues that this router-replay mechanism is specifically designed to solve.
|
| 248 |
+
|
| 249 |
+
### FINDING-16: Base model selection — Composer 2 used specific criteria that precluded agentic benchmarks
|
| 250 |
+
|
| 251 |
+
**SOURCE FACT (Composer 2 Technical Report, Appendix B):** *"We intentionally do not consider coding agent benchmarks when testing base models. We find that such benchmarks are less predictive of final performance, as agentic and long-horizon capabilities can drastically change during the RL stage."* Base model was selected on: FreshBench (factual coding knowledge), State Tracking (LoCoDiff-style), and internal codebase perplexity.
|
| 252 |
+
|
| 253 |
+
**FINDING:** This is an important methodological point absent from research/01. When the replication framework selects a base model (currently proposed as Qwen3-Coder variants), the selection should be made on intrinsic knowledge benchmarks (perplexity on domain code, state tracking) rather than SWE-bench or similar agentic benchmarks — consistent with Cursor's own methodology.
|
| 254 |
+
|
| 255 |
+
---
|
| 256 |
+
|
| 257 |
+
## PART 3 — What the sources DO NOT say (reproducibility gaps)
|
| 258 |
+
|
| 259 |
+
These are the load-bearing unknown quantities for building the dataset-generation pipeline:
|
| 260 |
+
|
| 261 |
+
| Gap | What's stated | What's missing |
|
| 262 |
+
|---|---|---|
|
| 263 |
+
| **Hint construction** | "we construct a short hint" | HOW — templates? LLM judge? human? learned? |
|
| 264 |
+
| **Synthetic generator inventory** | "a range of approaches… for example, one… is feature deletion" | Any other generator name, count, or weighting |
|
| 265 |
+
| **Synthetic task absolute count** | "25x more than Composer 2" | Absolute number; Composer 2 baseline count |
|
| 266 |
+
| **Dynamic curriculum implementation** | Composer 2: "turns + thinking tokens to upsample"; 2.5: "select for and create" | 2.5's specific curriculum signal; whether pass-rate is used |
|
| 267 |
+
| **Feature deletion: deletion agent** | "the agent… asked to delete code and files" | Whether "the agent" is a model, program, or human pipeline |
|
| 268 |
+
| **Feature deletion: target selection heuristic** | "delete code and files in such a way that the codebase remains functional while specific testable features are removed" | HOW deletion targets are selected from a repo |
|
| 269 |
+
| **Feature deletion: languages** | Python (type-check cache) and Java (bytecode) implied by reward-hack examples | No explicit language list |
|
| 270 |
+
| **Reward-hacking mitigations** | "agentic monitoring tools" | No technical specification |
|
| 271 |
+
| **CPT data mix** | "large code-dominated data mix"; "3 phases" (32k bulk → 256k → SFT) | Token counts, percentages, data sources |
|
| 272 |
+
| **Behavioral reward signals** | Scalar "rewards for coding style, communication" (Composer 2); 2.5 extends to hint-distillation | No RM architecture, no rubric spec |
|
| 273 |
+
| **Muon optimizer for CPT** | "For continued pretraining, we use Muon with distributed orthogonalization" | Learning rate, hyperparameters, data ordering during CPT |
|
| 274 |
+
| **SpaceXAI / Colossus 2 model** | "training a significantly larger model from scratch, using 10x more total compute" | Everything; this is a future model not Composer 2.5 |
|
| 275 |
+
|
| 276 |
+
---
|
| 277 |
+
|
| 278 |
+
## PART 4 — Implications for building the dataset-generation pipeline
|
| 279 |
+
|
| 280 |
+
Based on the primary-source analysis, the following are the **only verifiably-sourced facts** the pipeline design can be grounded on:
|
| 281 |
+
|
| 282 |
+
### What IS in the sources and can be directly implemented:
|
| 283 |
+
|
| 284 |
+
1. **Feature deletion as described:** give agent a test-covered repo → delete a testable feature (keeping repo otherwise functional) → reward = passing the tests. The blog's formulation implies the deletion maintains a `PASS_TO_PASS` guard (the "codebase remains functional" requirement). This is the exact formulation in research/06 and is faithful.
|
| 285 |
+
|
| 286 |
+
2. **Verifiable test-based reward:** "the tests are used as a verifiable reward." Test pass-fraction is the correct scalar. No golden patch is needed at reward time.
|
| 287 |
+
|
| 288 |
+
3. **Online curriculum — turns + thinking-token upsampling (Composer 2 handle):** The only stated heuristic. Difficulty = rollout length / thinking-token count. This is simpler than the pass-rate curriculum in research/06 but is the only Cursor-sourced signal.
|
| 289 |
+
|
| 290 |
+
4. **Targeted textual feedback — exact mechanism:** (a) construct hint, (b) insert into local context, (c) teacher = hint-conditioned forward pass of same weights (stop-grad), (d) student = policy without hint, (e) on-policy KL loss on student for that turn only, (f) stack on top of the main RL objective.
|
| 291 |
+
|
| 292 |
+
5. **Dr. GRPO modifications (Composer 2):** Remove length-standardization, remove std-norm advantage normalization, use k1 KL (−log r), Adam optimizer, single-epoch regime.
|
| 293 |
+
|
| 294 |
+
6. **Auxiliary scalar rewards for behavior (Composer 2):** coding style, communication, tool-call quality penalties. These are the Composer 2 behavior-shaping mechanism; in 2.5 they are SUPPLEMENTED (not replaced) by the hint-distillation channel.
|
| 295 |
+
|
| 296 |
+
7. **Reward = correctness + succinctness + SE principles (Composer 2):** Multi-dimensional, not just test pass. The nonlinear length penalty C_length is the explicit succinctness component.
|
| 297 |
+
|
| 298 |
+
### What CANNOT be directly sourced and requires [EXTRAPOLATION]:
|
| 299 |
+
|
| 300 |
+
1. The hint construction mechanism — all of research/07 is [EXTRAPOLATED] (well-reasoned but not Cursor-stated).
|
| 301 |
+
2. Any synthetic generator beyond feature deletion.
|
| 302 |
+
3. The absolute scale of the task bank (50k, 60k, or any other number).
|
| 303 |
+
4. The pass-rate curriculum (a reasonable design choice but not what Cursor describes for Composer 2 or 2.5).
|
| 304 |
+
5. The CPT data mix specifics.
|
| 305 |
+
6. The reward-hacking detection tooling specifics.
|
| 306 |
+
|
| 307 |
+
---
|
| 308 |
+
|
| 309 |
+
## PART 5 — Verdict on the repo's existing research notes
|
| 310 |
+
|
| 311 |
+
| Note | Accuracy vs. primary sources |
|
| 312 |
+
|---|---|
|
| 313 |
+
| `research/01-composer-2.5.md` | Body is substantially from secondary sources. Audit note at top is accurate in flagging its own errors. The benchmark numbers (69.3%, Terminal-Bench parity) are not in either primary source. The "24 unnamed generators" claim is hallucinated. Should NOT be used directly as design input; use `research/09` and `research/10` instead. |
|
| 314 |
+
| `research/09-composer-blog-delta-2026.md` | High accuracy. All verbatim quotes verified against fetched blog. Deltas correctly identified. One minor gap: does not flag that the "25x" baseline count (Composer 2 synthetic task count) is also unstated in the Composer 2 report. |
|
| 315 |
+
| `research/10-composer2-techreport-mining.md` | High accuracy. All [REPORT-VERIFIED] claims confirmed against the full HTML. Corrections (optimizer Adam not Muon; FSDP+CP not HSDP; hint mechanism absent) are correct. The "k1 in reward KL" claim (also referenced in commit bd37412) is verified against §4.1. |
|
| 316 |
+
| `research/06-feature-deletion-datagen.md` | Well-constructed design brief. Correctly labeled [BLOG-VERIFIED] vs [EXTRAPOLATED]. One concern: the "50k–60k tasks" scale estimate has no primary-source grounding. The PassRateCurriculum (pass-rate as difficulty) is [EXTRAPOLATED] and departs from Cursor's stated heuristic (turns + thinking tokens). Otherwise faithful. |
|
| 317 |
+
| `research/07-sdpo-hint-generator.md` | Well-constructed. Correctly identifies the open question. The taxonomy (a)→(b)→(c)→(d)→(f) is well-grounded in SDPO/OPSD papers. The "successful-sibling bootstrap" as a fallback is directly grounded in SDPO's ablation of "sample solution" feedback type. The single issue: the layered design is predicated on the collator's existing `hint_generator` hook, which should be verified in the actual codebase before committing to this design. |
|
| 318 |
+
|
| 319 |
+
---
|
| 320 |
+
|
| 321 |
+
## Sources
|
| 322 |
+
|
| 323 |
+
- **Composer 2.5 blog** — full body, vault note `introducing-composer-25-cursor`, URL `https://cursor.com/blog/composer-2-5` (fetched 2026-06-09).
|
| 324 |
+
- **Composer 2 Technical Report** — full HTML body, vault note `composer-2-technical-report`, URL `https://arxiv.org/html/2603.24477` (arXiv:2603.24477 v2, 26 Mar 2026). 11 904 words, all sections including appendices.
|
| 325 |
+
- **arXiv abstract note** — vault note `260324477-composer-2-technical-report`, cross-checks author list and abstract wording.
|
| 326 |
+
- **Prior internal notes** — `research/01`, `research/09`, `research/10`, `research/06`, `research/07` as cross-reference for repo claims.
|
|
@@ -0,0 +1,340 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-Read: SWE Task Synthesis Prior Art — Critical Review
|
| 2 |
+
**Cluster:** Open-source task-synthesis literature ("point at a repo, build a dataset")
|
| 3 |
+
**Date:** 2026-06-09
|
| 4 |
+
**Reviewer:** Claude (critical-review pipeline, subagent)
|
| 5 |
+
**Sources fetched:** SWE-smith HTML (arXiv:2504.21798, 26k words), SWE-Gym HTML (arXiv:2412.21139, 10k words), R2E-Gym HTML (arXiv:2504.07164, 14k words), SWE-bench abstract (arXiv:2310.06770), SWE-MiniSandbox (arXiv:2602.11210), SWE-RL (arXiv:2502.18449), DeepSWE blog (Together AI), SkyRL GitHub, SWE-smith HF dataset page, SWE-smith GitHub README.
|
| 6 |
+
**Repo artifacts compared:** `composer_replication/datagen/substrates.py` (SweBenchAdapter), `research/06-feature-deletion-datagen.md`, `docs/adrs/ADR-010-feature-deletion-datagen.md`.
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## 1. SWE-smith (arXiv:2504.21798) — Deep Read
|
| 11 |
+
|
| 12 |
+
### 1.1 What it actually does (task synthesis mechanics)
|
| 13 |
+
|
| 14 |
+
SWE-smith's core insight is stated verbatim in §2: "Conceptually, this is a simple inversion of SWE-bench's approach, which instead prioritizes identifying task instances, and then attempts to build an environment for each." SWE-smith **builds an execution environment first**, then synthesizes many tasks within it. This is the opposite of SWE-bench's PR-mining approach.
|
| 15 |
+
|
| 16 |
+
**Environment construction (§2.1):**
|
| 17 |
+
1. Target the top-5000 most-downloaded PyPI packages by stars (≥1000 stars), excluding the 12 SWE-bench test repos.
|
| 18 |
+
2. Run SWE-agent on the latest commit for ≤100 steps to auto-install + run test suite.
|
| 19 |
+
3. Manually verify installation instructions and check >80% tests pass.
|
| 20 |
+
4. Build one Docker image per repo (not per task — this is the key scalability win).
|
| 21 |
+
5. Total: 128 repos selected, ~7 min human labor per repo for step 2 correction, ~1 min for test parser.
|
| 22 |
+
6. Total human labor: ~20 hours to create the entire 50k dataset.
|
| 23 |
+
|
| 24 |
+
**Four bug-synthesis strategies (Table 1 from the paper):**
|
| 25 |
+
|
| 26 |
+
| Strategy | Yield % | # Instances | Cost/instance | Median F2P | Median Lines Edited |
|
| 27 |
+
|---|---|---|---|---|---|
|
| 28 |
+
| LM Modify | 56.0% | 17,887 | 0.38¢ | 4 | 3 |
|
| 29 |
+
| LM Rewrite | 35.0% | 4,173 | 3.93¢ | 4 | 24 |
|
| 30 |
+
| Procedural | 40.2% | 15,641 | 0.00¢ | 7 | 5 |
|
| 31 |
+
| Combine (Cross-bug) | 96.9% | 10,092 | 0.00¢ | 15 | 11 |
|
| 32 |
+
| PR Mirror (Invert PRs) | 33.8% | 2,344 | 5.53¢ | 3 | 14 |
|
| 33 |
+
| **Total** | **50.1%** | **50,137** | **2.32¢** | | |
|
| 34 |
+
|
| 35 |
+
**Key detail on bug strategies:**
|
| 36 |
+
|
| 37 |
+
- **LM Modify:** Prompt LM to "introduce errant modifications" to a function. Yield 56% (LM doesn't always break a test). Input: full function.
|
| 38 |
+
- **LM Rewrite:** Given only function header + docstring, ask LM to re-implement. Yield 35% (lower because it's not explicitly asked to break anything). Generates longer changes (24 lines median). Input: signature only, so LM can't see the original.
|
| 39 |
+
- **Procedural Modification (§B.2):** Zero-cost AST transformations. 13 types (see §B.2 verbatim):
|
| 40 |
+
- Class: Remove Functions (removes method(s) + references); Remove Parent (removes base class); Shuffle Methods.
|
| 41 |
+
- Control: Invert If/Else (inverts if-else bodies).
|
| 42 |
+
- Flow: Shuffle Lines (shuffles lines of function).
|
| 43 |
+
- Expressions: Change Constants (±1 to numeric); Break Chains (removes operators); Swap Operands; Change Operator (+→-).
|
| 44 |
+
- Removal: Remove Loops (for/while); Remove Conditionals (if); Remove Assignments; Remove Wrappers (try/with).
|
| 45 |
+
- Applied with a `likelihood` parameter so modifications don't make tasks too hard. Filtered by function/class criteria (complexity min/max gates).
|
| 46 |
+
- **Combine Bugs (§B.3):** Combine already-validated bugs from same file or module into a multi-bug task. 96.9% yield because each component already passes validation. Creates complex multi-site tasks.
|
| 47 |
+
- **PR Mirror (§B.4):** For each PR in the repo's history, prompt LM to "undo" the PR's changes in the current codebase (NOT checkout the base commit — this is the key difference from SWE-bench). LM rewrites each affected file. Expensive (rewrites whole files) and lower yield (33.8%). Most reflective of SWE-bench distribution (PR Mirror trajectories are the best training data per Table 5).
|
| 48 |
+
|
| 49 |
+
**Issue text generation (§2.1, last subsection):** Provide LM with (the diff, source code of a random F2P test, test execution output), ask for GitHub-style issue text with reproduction code. Empirically as good as real issues (28.2% vs 28.0% on SWE-bench Verified in Table 5).
|
| 50 |
+
|
| 51 |
+
**Validation:** Apply bug patch, run test suite, keep only bugs that break ≥1 test. Time limit: 2 minutes per test run (bugs causing infinite loops discarded).
|
| 52 |
+
|
| 53 |
+
**Scale and storage (Table 2 in §2.2 comparison):**
|
| 54 |
+
- SWE-smith: 50k tasks, 128 repos, 295 GB environments.
|
| 55 |
+
- SWE-Gym: 2.4k tasks, 11 repos, 6 TB environments.
|
| 56 |
+
- R2E-Gym subset: 4.6k tasks, 10 repos, 4 TB environments.
|
| 57 |
+
- SWE-fixer (Xie et al. 2025a): 115k tasks, 856 repos — but NO execution environments.
|
| 58 |
+
- SWE-bench-train: 19k tasks, 37 repos — NO executable environments.
|
| 59 |
+
|
| 60 |
+
The one Docker image per repo (vs. SWE-bench's one image per task) is the mechanism: estimated 500x storage reduction at 50k tasks. SWE-bench at 50k would need 50-150 TB; SWE-smith uses 295 GB.
|
| 61 |
+
|
| 62 |
+
**Training results (§3-4):**
|
| 63 |
+
- SWE-agent-LM-32B (Qwen 2.5 Coder 32B fine-tuned on 5,016 SWE-smith trajectories) achieves 40.2% on SWE-bench Verified.
|
| 64 |
+
- Expert model for trajectory collection: claude-3-7-sonnet-20250219 + SWE-agent, ≤75 steps, $2 cost limit.
|
| 65 |
+
- PR Mirror and LM Rewrite trajectories produce the best-performing models; LM Modify has a steep drop-off.
|
| 66 |
+
- GRPO-style RL on SWE-smith is explicitly mentioned in the SWE-smith GitHub README: "Perform GRPO style reinforcement learning using SkyRL."
|
| 67 |
+
|
| 68 |
+
**Licensing (Table 6, §A.4):**
|
| 69 |
+
- 128 repos cover Apache 2.0 (majority), BSD 2/3-Clause, MIT, GNU GPLv3, LGPL v2.1 and v3, ISC.
|
| 70 |
+
- GPLv3 repos: Cog-Creators/Red-DiscordBot and adrienverge/yamllint. LGPL: chardet/chardet, paramiko/paramiko, pylint-dev/astroid, Knio/dominate.
|
| 71 |
+
- Paper states: "We inspected the repositories with custom licenses and confirmed they allowed for the use cases exercised in our work." (Note: does NOT explicitly address whether derivative diffs are redistributable — they exercise the bugs, not redistribute.)
|
| 72 |
+
- **Dataset HF license:** The SWE-smith HF dataset page (`SWE-bench/SWE-smith`) shows **59,136 rows** (note: larger than the 50,137 paper reports — likely includes additional variants). The page says "We will no longer actively update this dataset. Recommend the language-specific `SWE-bench/SWE-smith-[lang]` datasets."
|
| 73 |
+
- **Toolkit license: MIT** (confirmed from GitHub).
|
| 74 |
+
|
| 75 |
+
**`pip install` / API availability:**
|
| 76 |
+
- The GitHub README shows a Python API: `from swesmith.profiles import registry; rp = registry.get_from_inst(task); container = rp.get_container(task)` — returns a Docker container with the task initialized.
|
| 77 |
+
- Install: `pip install swesmith` from source (requires Docker, Ubuntu 22.04.4 LTS, does NOT support Windows or MacOS officially).
|
| 78 |
+
- The toolkit is a full pipeline: create environments → synthesize task instances → keep tasks that break ≥1 unit test → generate issue text.
|
| 79 |
+
|
| 80 |
+
---
|
| 81 |
+
|
| 82 |
+
## 2. SWE-Gym (arXiv:2412.21139) — Deep Read
|
| 83 |
+
|
| 84 |
+
### 2.1 What it actually provides
|
| 85 |
+
|
| 86 |
+
SWE-Gym is **not a synthesis pipeline** — it is an existing dataset of 2,438 real-world task instances from 11 Python repos with pre-built executable environments. It is the "first training environment" for SWE agents (per abstract).
|
| 87 |
+
|
| 88 |
+
**Collection method (§3 of paper):** Same approach as SWE-bench (mine PRs), but:
|
| 89 |
+
- Applied to 11 **different** repos from SWE-bench's 12 repos (deliberately non-overlapping for train/test separation).
|
| 90 |
+
- Per-task Docker images (6 TB total — this is the bottleneck SWE-smith solved).
|
| 91 |
+
- 66,894 raw task instances in SWE-Gym-Raw (no executable envs); 2,438 have full envs.
|
| 92 |
+
|
| 93 |
+
**Table 2 statistics (SWE-Gym vs SWE-bench):**
|
| 94 |
+
- Average gold patch: 69.8 lines edited, 2.5 files, 4.1 functions (much larger than SWE-bench's 32.8/1.7/3.0).
|
| 95 |
+
- Average F2P tests: 10.0 (vs SWE-bench 9.0).
|
| 96 |
+
- Average total tests: 760.8 (vs SWE-bench 132.5) — much more test coverage per instance.
|
| 97 |
+
- Average codebase: 971 non-test files, 340k lines.
|
| 98 |
+
|
| 99 |
+
**Training results:**
|
| 100 |
+
- 491 trajectories from GPT-4o and Claude 3.5 Sonnet → Qwen 2.5 Coder 32B fine-tuned → +12.3%/+13.6% on SWE-bench Lite/Verified.
|
| 101 |
+
- With verifier (OSRM): +11.4% more → 32.0% on Verified, 26.0% on Lite.
|
| 102 |
+
- Scaling: performance still improving at 491 trajectories, no saturation yet.
|
| 103 |
+
|
| 104 |
+
**License:** SWE-Gym tooling is MIT. Per-instance instances inherit upstream repo licenses (paper does not provide a per-instance breakdown, unlike SWE-rebench).
|
| 105 |
+
|
| 106 |
+
**Availability:** `SWE-Gym/SWE-Gym` on HuggingFace. Docker images hosted on Docker Hub as `xingyaoww/sweb.eval.*`. Pre-built images reduce env-build cost to near-zero for adopters.
|
| 107 |
+
|
| 108 |
+
---
|
| 109 |
+
|
| 110 |
+
## 3. R2E-Gym (arXiv:2504.07164) — Deep Read
|
| 111 |
+
|
| 112 |
+
### 3.1 The SweGen pipeline — the most novel test synthesis approach
|
| 113 |
+
|
| 114 |
+
R2E-Gym's key contribution for this cluster is **SweGen**: a method to generate executable training environments **without human-written issues or unit tests**, directly from commits.
|
| 115 |
+
|
| 116 |
+
**SweGen procedure (§2):**
|
| 117 |
+
1. **Repo selection:** Use SEART GitHub search to find Python repos with many commits.
|
| 118 |
+
2. **Commit curation:** Extract commit history; filter with rule-based + LLM heuristics for "interesting" code changes.
|
| 119 |
+
3. **Build scripts:** Semi-manually collect dependency pins and installation procedures per commit.
|
| 120 |
+
4. **Test collection:** Use existing tests from the commit to identify F2P cases (failing before the commit, passing after — the natural oracle).
|
| 121 |
+
5. **Test generation for commits without tests:** Synthesize F2P test cases using an LLM when no test exists. Appended test generation details in Appendix A (not fully extracted, but confirmed it exists).
|
| 122 |
+
6. **Backtranslation for problem statements:** Instead of human GitHub issues, use LLM to backtranslate the commit diff into an issue-style problem statement. Key insight: include the F2P test execution trace in the backtranslation prompt to generate specific, non-generic statements. 27.8% vs 28.0% (synthetic vs real issues) — statistically indistinguishable.
|
| 123 |
+
7. **Decontamination:** Remove repos overlapping with SWE-bench test set → R2E-Gym-Subset (4,578 tasks, 10 repos).
|
| 124 |
+
|
| 125 |
+
**Scale:** 8,135 tasks total, from more repos than SWE-Gym. SweGen enables 2.5x more problems than PR-based collection alone. But 4 TB environments for the subset alone — still a storage problem.
|
| 126 |
+
|
| 127 |
+
**Oracle quality:**
|
| 128 |
+
- When commits lack existing tests: LLM synthesizes them. The paper claims "Real vs Synthetic: Real 28.0%, Synthetic 27.8%" — essentially identical performance. This is a strong result for synthetic oracle quality.
|
| 129 |
+
- However, the paper does NOT quantify what % of commits had no existing tests (and thus needed synthesized tests) — this is a gap that ADR-010 does not flag.
|
| 130 |
+
|
| 131 |
+
**Training results (Table 3):**
|
| 132 |
+
- R2E-Gym-trained 32B model: 34.4% on SWE-bench Verified (vs SWE-Gym 20.6%, +13.8%).
|
| 133 |
+
- With hybrid verifier (execution-based + execution-free): 51% on SWE-bench Verified — SOTA for open-weight models at paper time.
|
| 134 |
+
- DeepSWE (R2E-Gym + GRPO, Qwen3-32B): 42.2% Pass@1, 59% with test-time scaling.
|
| 135 |
+
|
| 136 |
+
**License:** Apache 2.0 for tooling (R2E-Gym GitHub). Per-instance upstream licenses inherited.
|
| 137 |
+
|
| 138 |
+
**API:** Available on HuggingFace at `R2E-Gym/R2E-Gym-V1` and `R2E-Gym/R2E-Gym-Subset`. Docker images hosted. The code at `r2e-gym.github.io` is the generation pipeline — not a clean `pip install swesmith`-style API, but usable.
|
| 139 |
+
|
| 140 |
+
---
|
| 141 |
+
|
| 142 |
+
## 4. SWE-bench (arXiv:2310.06770) — Schema Reference
|
| 143 |
+
|
| 144 |
+
The canonical task schema (ICLR 2024):
|
| 145 |
+
- Input: `(codebase, issue_description)` at `base_commit`.
|
| 146 |
+
- Output: `git diff` patch resolving the issue.
|
| 147 |
+
- Oracle: run unit tests from `test_patch` against the patched codebase.
|
| 148 |
+
- Fields: `FAIL_TO_PASS` (tests that must go red→green), `PASS_TO_PASS` (tests that must stay green).
|
| 149 |
+
- Construction: mine GitHub issues → find corresponding PRs → filter for PRs that modify test files and resolve the issue → build per-version Docker environments.
|
| 150 |
+
- **The hard part:** building per-version Docker environments is the most costly step (~hundreds of hours human labor at scale). SWE-smith's core contribution is eliminating this by using one image per repo at HEAD.
|
| 151 |
+
- **The oracle property:** FAIL_TO_PASS is a constructive oracle — the tests already existed and were known to exercise the feature (they were in the PR). This is what makes it "verifiable for free."
|
| 152 |
+
- Scale: 2,294 eval instances, 12 Python repos. Training split: 19,008 (no exec environments).
|
| 153 |
+
- License: CC-BY-4.0 for the dataset.
|
| 154 |
+
|
| 155 |
+
---
|
| 156 |
+
|
| 157 |
+
## 5. Newer Work (2025-2026)
|
| 158 |
+
|
| 159 |
+
### 5.1 SWE-MiniSandbox (arXiv:2602.11210, Feb 2026)
|
| 160 |
+
|
| 161 |
+
Container-free RL training for SWE agents. Key result: kernel-level isolation (not Docker) reduces disk usage to ~5% of container-based pipelines and setup time to ~25%. Evaluation performance comparable to container baseline. Not a dataset synthesis method, but directly relevant to the sandbox cost problem ADR-010 acknowledges (Docker required, CPU-pool generation cost). **MISSED by research/06 and ADR-010** — this was published after the main research was done (Feb 2026).
|
| 162 |
+
|
| 163 |
+
### 5.2 SWE-RL (arXiv:2502.18449, NeurIPS 2025 Main Track)
|
| 164 |
+
|
| 165 |
+
RL on real GitHub software evolution data (issues, PRs, code history). No synthetic bug injection — uses rule-based reward (similarity score between ground truth and LLM-generated solution). Llama3-SWE-RL-70B: 41.0% on SWE-bench Verified. Key finding: RL on SWE data transfers to 5 out-of-domain tasks (math, code reasoning, general language). Uses 11M training pairs from open-source software evolution. Reward = similarity (not binary test pass) — this is weaker than the test-execution oracle in SWE-smith/R2E-Gym.
|
| 166 |
+
|
| 167 |
+
### 5.3 DeepSWE (Together AI + Agentica, Jul 2025 blog)
|
| 168 |
+
|
| 169 |
+
Qwen3-32B + pure RL (GRPO via rLLM framework) on R2E-Gym environments. 4,500 SWE tasks, 64 H100s for 6 days. Achieves 42.2% Pass@1, 59% with test-time scaling (16 attempts). **This is the closest living demonstration of the Composer-style "RL on SWE tasks" recipe using an open-source substrate.**
|
| 170 |
+
|
| 171 |
+
### 5.4 SkyRL (NovaSky-AI, 2025)
|
| 172 |
+
|
| 173 |
+
Explicitly mentioned in SWE-smith README: "Perform GRPO style reinforcement learning using SkyRL." SkyRL is a full-stack RL library (Berkeley Sky Computing Lab) with SWE-bench environments in `skyrl-gym`. The GitHub shows direct integration: SkyRL + SWE-smith = GRPO on SWE-smith tasks. This is exactly the missing integration link for this repo's `reward_fn` + RL loop design.
|
| 174 |
+
|
| 175 |
+
### 5.5 SWE-fixer (Xie et al. 2025a, referenced in SWE-smith Table 2)
|
| 176 |
+
|
| 177 |
+
115k instances, 856 repos — largest by task count. But **NO execution environments** — no Docker, no test execution. SWE-fixer trained SWE-fixer-72B achieving 32.8% on SWE-bench Verified. The comparison in SWE-smith Table 2 shows SWE-fixer has zero executable environments, meaning it relies on string-similarity rewards, not test-execution verification. **Not a direct competitor for this repo's verifiable-reward approach.**
|
| 178 |
+
|
| 179 |
+
---
|
| 180 |
+
|
| 181 |
+
## 6. Comparison Against the Repo's Current State
|
| 182 |
+
|
| 183 |
+
### 6.1 What `substrates.py` (SweBenchAdapter) actually does
|
| 184 |
+
|
| 185 |
+
`substrates.py` implements **schema inversion only**: it takes an existing SWE-bench-shaped instance dict and converts it to a `FeatureDeletionTask` by reversing the gold patch. It does NOT:
|
| 186 |
+
- Synthesize new tasks from an arbitrary repo.
|
| 187 |
+
- Call SWE-smith's API or any synthesis engine.
|
| 188 |
+
- Implement AST/procedural deletion (Path B from research/06).
|
| 189 |
+
- Build execution environments for new repos.
|
| 190 |
+
- Generate issue text.
|
| 191 |
+
|
| 192 |
+
The adapter only handles the "adopt existing substrates" half of the design space. This is correct for the ADR-010 v0.0-v0.1 scope, but it's important to be explicit about what is NOT yet built.
|
| 193 |
+
|
| 194 |
+
### 6.2 What research/06 and ADR-010 claim vs. what the sources actually say
|
| 195 |
+
|
| 196 |
+
**Correct claims (verified):**
|
| 197 |
+
|
| 198 |
+
1. `[research/06 §4, ADR-010]` "SWE-bench FAIL_TO_PASS / PASS_TO_PASS schema is the universal schema across SWE-bench, SWE-Gym, R2E-Gym, SWE-rebench." **VERIFIED.** All four use this schema. SWE-smith also uses it (per §A.1: "FAIL_TO_PASS: The unit tests that break when the test suite is run with the bug patch applied. PASS_TO_PASS: The unit tests that do not break.").
|
| 199 |
+
|
| 200 |
+
2. `[research/06 §4]` "R2E-Gym: 8.1K executable envs." **VERIFIED.** Paper abstract states "more than 8.1K tasks."
|
| 201 |
+
|
| 202 |
+
3. `[research/06 §4]` "SWE-rebench: 21,336 tasks, 3,468 repos, CC-BY-4.0." **UNVERIFIED here** (SWE-rebench paper arXiv:2505.20411 not directly fetched — only the Nebius infrastructure blog post was available). The infrastructure blog describes the pipeline but does not give the exact 21,336 count. Cannot confirm from fetched sources.
|
| 203 |
+
|
| 204 |
+
4. `[research/06 §2, ADR-010]` "Online pass-rate curriculum: 'select for and create harder tasks dynamically.'" **VERIFIED verbatim** in the Cursor blog quote in research/06 §1.
|
| 205 |
+
|
| 206 |
+
5. `[research/06 §3]` "Python bytecode/type-check cache recovery is a real reward-hacking vector." **VERIFIED verbatim** in the Cursor blog (quoted in §1 of research/06).
|
| 207 |
+
|
| 208 |
+
6. `[ADR-010]` "SWE-smith env-construction: ~20h human labor for 128 repos." **VERIFIED.** Paper §2.1: "Creating SWE-smith took one author ~20h of human labor."
|
| 209 |
+
|
| 210 |
+
7. `[ADR-010]` "SWE-smith costs $1,360 to create." **VERIFIED.** Paper §2.2: "$1360 to create ($1000 to generate bugs, $160 for auto repo installation, $200 to generate issues for 10K bugs)."
|
| 211 |
+
|
| 212 |
+
**Errors and overclaims in the repo's research notes:**
|
| 213 |
+
|
| 214 |
+
1. **`[research/06 §4, ADR-010]` "SWE-Gym: 2.4k tasks, 11 repos, 6 TB."** PARTIALLY CORRECT. SWE-Gym has 2,438 tasks, 11 repos. The 6 TB figure for environments is cited from the SWE-smith Table 2 comparison. However, research/06 states "SWE-Gym (arXiv:2412.21139, ICML 2025)" — the venue claim is correct (ICML 2025 per the paper's header).
|
| 215 |
+
|
| 216 |
+
2. **`[research/06 §4]` "R2E-Gym-Subset = non-overlapping w/ SWE-bench."** VERIFIED but requires precision: R2E-Gym-Subset (4,578 tasks, 10 repos) is non-overlapping with SWE-bench TEST SET repos. The full R2E-Gym (8,135 tasks) may overlap with SWE-bench training repos. The paper decontaminates against the test set only.
|
| 217 |
+
|
| 218 |
+
3. **`[research/06 §4, ADR-010]` "OpenHands/Nemotron-SWE-v1: 59K agent trajectories."** The HF dataset page for `SWE-bench/SWE-smith` shows 59,136 rows — this is the SWE-smith dataset (with updates), NOT the Nemotron dataset. Research/06 correctly cites `nvidia/Nemotron-SWE-v1` separately. But the 59K figure is ambiguous: SWE-smith also has ~59k rows on HF. The Nemotron dataset is a different thing. This is a minor labeling confusion but not wrong in substance.
|
| 219 |
+
|
| 220 |
+
4. **`[research/06 §4]` "SWE-Gym purpose-built for training (train split, not a held-out benchmark → no contamination worry)"** — OVERCLAIM. SWE-Gym paper itself states its 11 repos are separate from SWE-bench's 12, but the decontamination is at the repo level, not task level. Using SWE-Gym for Feature Deletion and then evaluating on SWE-bench should be safe, but the claim "no contamination worry" is stronger than the paper asserts.
|
| 221 |
+
|
| 222 |
+
5. **`[research/06 §5, Path B]` "Coverage-mapped AST deletion using libcst — select deletion targets by coverage selectivity."** This is original to the repo and NOT from any of the fetched papers. SWE-smith's procedural approach uses random AST transforms without coverage-guided targeting. R2E-Gym uses real commits (not synthetic deletions). The coverage-selectivity framing is a valid engineering idea but is `[EXTRAPOLATED]` — research/06 correctly tags it `[EXTRAPOLATED]` but the ADR text promotes it to a stated capability without that tag. **ADR-010 should make clearer that Path B (coverage-mapped AST deletion) is not from any prior work — it would need to be built from scratch.**
|
| 223 |
+
|
| 224 |
+
6. **`[ADR-010 §2, Decision Drivers]` "Reuse existing verified OSS substrates... they already guarantee test-exercises-the-code via FAIL_TO_PASS."** PARTIALLY OVERCLAIM. SWE-smith (§2.1) explicitly notes that yield rates are limited because some bug candidates "did not actually introduce relevant issues" or "lack test coverage for the change." The FAIL_TO_PASS guarantee is not automatic — it is the *output* of the validation step (test execution), not something inherited from the schema. The ADR implies the guarantee is pre-built; in reality it requires running the test suite to confirm it.
|
| 225 |
+
|
| 226 |
+
7. **`[ADR-010 post-review]` "OPEN: Gate 2 ('deletion breaks the feature') does not verify reachability."** This self-identified open item is correct and important. The sources confirm: SWE-smith's validation only checks that some test breaks (not which code causes it). R2E-Gym also checks test pass/fail without coverage verification. This is a systemic gap in the field, not just in the repo.
|
| 227 |
+
|
| 228 |
+
**Missing from the repo's research notes:**
|
| 229 |
+
|
| 230 |
+
1. **SWE-smith's PR Mirror strategy is the most important one for this repo.** Per Table 5 of the paper: PR Mirror trajectories produce the best-performing models (trajectories from PR mirrors > LM Rewrite ≈ Procedural > LM Modify). The repo's feature deletion via gold-patch reversion (SweBenchAdapter.to_task) is exactly analogous to SWE-smith's PR Mirror strategy. The paper's ablation directly validates the repo's core approach — but neither research/06 nor ADR-010 cite this result.
|
| 231 |
+
|
| 232 |
+
2. **SWE-smith's "one Docker image per repo" architecture.** Research/06 and ADR-010 discuss reusing per-instance Docker images from SWE-Gym/SWE-rebench. SWE-smith shows that the correct architecture for new repos is one image per repo (not per task), which is 500x more storage-efficient. If the repo wants to extend beyond the existing substrates, this is the right architecture. Not mentioned.
|
| 233 |
+
|
| 234 |
+
3. **SWE-smith `pip install swesmith` — it is a real, usable API.** The GitHub README shows that `rp.get_container(task)` returns a Docker container with the task initialized. The repo could `pip install swesmith` to get task synthesis capabilities today, rather than building a separate synthesis pipeline. ADR-010 discusses options A/B/C but does not mention that Option A ("invert OSS substrates") could be powered by the `swesmith` toolkit directly.
|
| 235 |
+
|
| 236 |
+
4. **R2E-Gym's LLM-synthesized tests.** When commits lack existing tests, R2E-Gym synthesizes them with an LLM. Research/06 describes this capability as needing separate building ("the hard part: Composer deletes features where tests exist; SWE-smith generates bugs and validates against existing tests; R2E-Gym synthesizes tests"), but research/06 does not propose actually synthesizing tests — it only uses existing tests from substrates. The gap: if we want to point at an arbitrary repo (the user's stated goal), most arbitrary repos will have commits without comprehensive test coverage. SweGen's test synthesis capability (confirmed equivalent to real tests at 27.8% vs 28.0%) is directly relevant and not mentioned in ADR-010.
|
| 237 |
+
|
| 238 |
+
5. **SkyRL + SWE-smith = working GRPO pipeline.** The SWE-smith README explicitly says "Perform GRPO style reinforcement learning using SkyRL." This is a working open-source stack (SkyRL is MIT licensed, 2k GitHub stars). ADR-010's `reward_fn` design reinvents what SkyRL + SWE-smith already provide. Not evaluated in the ADR.
|
| 239 |
+
|
| 240 |
+
6. **SWE-MiniSandbox (arXiv:2602.11210, 2026).** Container-free RL at 5% disk / 25% setup time of Docker. Directly addresses ADR-010's acknowledged "sandbox/Docker cost" concern. Not considered — was published after ADR-010.
|
| 241 |
+
|
| 242 |
+
7. **DeepSWE: living demonstration of "RL + SWE-smith/R2E-Gym."** Qwen3-32B + GRPO on 4,500 R2E-Gym tasks = 42.2% Pass@1 / 59% with TTS. This is published evidence that the core architecture in ADR-010 (RL + feature-deletion env + GRPO reward_fn) works at scale in the open. Not cited.
|
| 243 |
+
|
| 244 |
+
---
|
| 245 |
+
|
| 246 |
+
## 7. ADOPT vs BUILD — Concrete Recommendation
|
| 247 |
+
|
| 248 |
+
### 7.1 What we can `pip install` TODAY
|
| 249 |
+
|
| 250 |
+
**`pip install swesmith` (MIT, from source, requires Docker):**
|
| 251 |
+
- Provides: environment construction from arbitrary GitHub repos, all 4 bug synthesis strategies (LM Modify, LM Rewrite, Procedural with 13 transform types, PR Mirror/Combine), issue text generation, task validation.
|
| 252 |
+
- What the repo currently builds manually in `composer_replication/datagen/` is a subset of what `swesmith` provides.
|
| 253 |
+
- **Recommendation:** Use `swesmith` as the synthesis engine for new repos rather than rebuilding. The repo's `SweBenchAdapter` can remain as the inversion layer for pre-existing substrates.
|
| 254 |
+
|
| 255 |
+
**`datasets.load_dataset("SWE-bench/SWE-smith")` (dataset available, MIT toolkit, mixed upstream licenses):**
|
| 256 |
+
- 59k tasks, 128 repos, 295 GB environments (vs. 50,137 in paper — dataset has grown since publication).
|
| 257 |
+
- All tasks have executable Docker environments via `swesmith.profiles.registry`.
|
| 258 |
+
- **This is the largest immediately usable Feature-Deletion dataset**: every SWE-smith task is a `(broken_repo, FAIL_TO_PASS, PASS_TO_PASS)` tuple. The "patch" field IS the gold diff (already reversed in the task construction). `SweBenchAdapter.to_task()` works on SWE-smith instances unchanged.
|
| 259 |
+
- **License risk:** GPLv3 repos (2 repos: Red-DiscordBot, yamllint) are present. The existing `is_redistributable()` copyleft filter in `substrates.py` would catch these. Apache/BSD/MIT majority is clean.
|
| 260 |
+
|
| 261 |
+
**`datasets.load_dataset("R2E-Gym/R2E-Gym-V1")` (Apache-2.0 toolkit, mixed upstream licenses):**
|
| 262 |
+
- 8,135 tasks, SweGen-synthesized. Includes synthesized tests for commits without existing tests.
|
| 263 |
+
- **Key advantage over SWE-smith for this repo's "point at any repo" vision:** SweGen works from commits, not PRs, and synthesizes tests when none exist. This is the mechanism that unlocks arbitrary repos.
|
| 264 |
+
- Pre-built Docker environments. Direct integration with OpenHands scaffold.
|
| 265 |
+
|
| 266 |
+
**What needs to be built (not available as pip install):**
|
| 267 |
+
|
| 268 |
+
1. **Coverage-guided deletion target selection (research/06 Path B).** No existing library does coverage-selectivity-based AST deletion. `libcst` for AST manipulation is available but the targeting logic (which function to delete based on test coverage and selectivity) is novel. This is the "hard part" for arbitrary-repo synthesis without prior substrates.
|
| 269 |
+
|
| 270 |
+
2. **The online difficulty curriculum.** `DifficultyCurriculum` in `composer_replication/datagen/curriculum.py` is correctly identified as needing to be built. SWE-smith/R2E-Gym/SWE-Gym do NOT provide this.
|
| 271 |
+
|
| 272 |
+
3. **Anti-reward-hacking sandbox.** The `LocalSubprocessSandbox` + `SANDBOX_DENYLIST` + `HackMonitor` in the repo are custom implementations. No OSS library provides this. SWE-MiniSandbox (arXiv:2602.11210) provides container-free isolation but not the semantic hack detection.
|
| 273 |
+
|
| 274 |
+
4. **TRL `reward_fn` adapter wired to test execution.** Not provided by any of the surveyed toolkits (SkyRL has its own non-TRL RL stack; SWE-smith's GRPO support is via SkyRL, not TRL GRPOTrainer).
|
| 275 |
+
|
| 276 |
+
### 7.2 Integration architecture recommendation
|
| 277 |
+
|
| 278 |
+
```
|
| 279 |
+
[Adopt] swesmith toolkit → environment construction for new repos
|
| 280 |
+
[Adopt] SWE-smith dataset → 59k pre-built Feature-Deletion tasks (via SweBenchAdapter)
|
| 281 |
+
[Adopt] R2E-Gym-Subset → 4.6k tasks with SweGen synthetic tests
|
| 282 |
+
[Adopt] SWE-Gym → 2.4k tasks, clean train split, per-task Docker images
|
| 283 |
+
[Adopt] SWE-rebench → scale (21k) + difficulty priors (cold-start p̂)
|
| 284 |
+
[Build] DifficultyCurriculum (unique to this recipe)
|
| 285 |
+
[Build] LocalSubprocessSandbox + HackMonitor (unique to this recipe)
|
| 286 |
+
[Build] TRL reward_fn adapter (SkyRL uses veRL, not TRL)
|
| 287 |
+
[Build] Coverage-guided Path B synthesis (unique, unlocks arbitrary repos)
|
| 288 |
+
[Consider] SWE-MiniSandbox for container-free RL at scale (2026, 5% disk overhead)
|
| 289 |
+
```
|
| 290 |
+
|
| 291 |
+
The current `substrates.py` correctly handles the [Adopt] paths. What is missing is the `swesmith` integration for new-repo synthesis (the "25x more synthetic tasks" vision requires going beyond existing substrates).
|
| 292 |
+
|
| 293 |
+
---
|
| 294 |
+
|
| 295 |
+
## 8. Critical Flags Summary
|
| 296 |
+
|
| 297 |
+
| Flag | Severity | Location | Issue |
|
| 298 |
+
|---|---|---|---|
|
| 299 |
+
| **Missing: swesmith API exists** | HIGH | ADR-010 Option A | `pip install swesmith` provides what Option A describes building. The ADR does not evaluate it as a dependency. |
|
| 300 |
+
| **Missing: SkyRL+SWE-smith = working GRPO stack** | HIGH | ADR-010 §Decision | SkyRL (MIT) + SWE-smith already implements GRPO on SWE tasks. The repo's TRL `reward_fn` reinvents this without acknowledging the prior art. |
|
| 301 |
+
| **Missing: SweGen synthesizes tests for commitless repos** | HIGH | research/06 §4, ADR-010 | R2E-Gym's test synthesis approach (commits without tests → LLM synthesizes F2P tests, equivalent to real tests) is the key mechanism for "point at any repo" synthesis. Not discussed. |
|
| 302 |
+
| **Overclaim: FAIL_TO_PASS 'guaranteed'** | MEDIUM | ADR-010 §Decision Drivers | The guarantee requires running tests; it is not inherited from the schema. SWE-smith validation explicitly filters out candidates with no test coverage. |
|
| 303 |
+
| **Overclaim: Path B (coverage-AST deletion) is built** | MEDIUM | ADR-010 capability list | Path B is `[EXTRAPOLATED]` in research/06 but the ADR implies it's implemented. It is not; `substrates.py` only does Path A (gold-patch reversion). |
|
| 304 |
+
| **Missing: SWE-MiniSandbox (2026)** | MEDIUM | ADR-010 §Negative consequences | Container-free RL at 5% disk / 25% setup overhead addresses the ADR's Docker cost concern. Published Feb 2026, after ADR-010. |
|
| 305 |
+
| **Missing: DeepSWE as validation** | LOW | research/06 | Provides evidence that GRPO + R2E-Gym = 42% Pass@1 (the core recipe) works. Not cited. |
|
| 306 |
+
| **Correct: PR Mirror = Feature Deletion** | CONFIRM | research/06 §5 Path A | SWE-smith ablations directly validate the repo's core approach. The best SWE-smith training data (PR Mirror) is exactly gold-patch-reversion Feature Deletion. |
|
| 307 |
+
| **Correct: SWE-smith costs ~$1360 for 50k tasks** | CONFIRM | ADR-010 §7 cost model | Cost model in research/06 is consistent with SWE-smith's reported $1360 for 50k tasks ($0.027/task vs research/06's estimate of $0.02-$0.10/task). |
|
| 308 |
+
| **Correct: ADR-010 OPEN items are honest** | CONFIRM | ADR-010 post-review | The "gate 2 does not prove reachability" open item correctly identifies a gap that exists across ALL surveyed work (SWE-smith, R2E-Gym, SWE-Gym all have this gap). |
|
| 309 |
+
|
| 310 |
+
---
|
| 311 |
+
|
| 312 |
+
## 9. Key Numbers for Architecture Reference
|
| 313 |
+
|
| 314 |
+
| Paper | Tasks | Repos | Env Size | Oracle source | Test-exec? | License |
|
| 315 |
+
|---|---|---|---|---|---|---|
|
| 316 |
+
| SWE-bench | 2,294 eval / 19,008 train | 12 | ~per-instance large | Human PRs | eval only | CC-BY-4.0 |
|
| 317 |
+
| SWE-Gym | 2,438 | 11 | 6 TB | Human PRs | YES | MIT (tooling) |
|
| 318 |
+
| R2E-Gym | 8,135 (4,578 subset) | 13 | 4 TB | Commits + LLM tests | YES | Apache-2.0 |
|
| 319 |
+
| SWE-smith | 50,137 (59,136 HF) | 128 | 295 GB | LM/AST bug injection | YES | MIT (toolkit) |
|
| 320 |
+
| SWE-rebench | ~21k | 3,468 | per-instance | Human PRs (automated) | YES | CC-BY-4.0 |
|
| 321 |
+
| SWE-fixer | 115k | 856 | none | Human PRs | NO | - |
|
| 322 |
+
|
| 323 |
+
SWE-smith's 295 GB for 50k tasks vs 6 TB for 2.4k (SWE-Gym) = the per-repo Docker image architecture is a 500x storage efficiency win. This is the right architecture for any new synthesis beyond existing substrates.
|
| 324 |
+
|
| 325 |
+
---
|
| 326 |
+
|
| 327 |
+
## 10. Sources
|
| 328 |
+
|
| 329 |
+
1. **SWE-smith HTML full text** — `research/notes/bugs-scaling-data-for-software-engineering-agents.md` (26,442 words, arxiv.org/html/2504.21798)
|
| 330 |
+
2. **SWE-Gym HTML full text** — `research/notes/training-software-engineering-agents-and-verifiers-with-swe-gym.md` (10,865 words, arxiv.org/html/2412.21139)
|
| 331 |
+
3. **R2E-Gym HTML full text** — `research/notes/r2e-gym-scaling-open-weights-software-engineering-agents-with-procedural-synthet.md` (14,810 words, arxiv.org/html/2504.07164)
|
| 332 |
+
4. **SWE-bench abstract** — `research/notes/231006770-swe-bench-can-language-models-resolve-real-world-github-issues.md`
|
| 333 |
+
5. **SWE-smith GitHub README** — `research/notes/github-swe-benchswe-smith-neurips-2025-db-spotlight-scaling-data-for-swe-agents.md`
|
| 334 |
+
6. **SWE-smith HF dataset page** — `research/notes/swe-benchswe-smith-datasets-at-hugging-face.md`
|
| 335 |
+
7. **SkyRL GitHub** — `research/notes/github-novasky-aiskyrl-skyrl-a-modular-full-stack-rl-library-for-llms-github.md`
|
| 336 |
+
8. **SWE-RL abstract** — `research/notes/250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft.md`
|
| 337 |
+
9. **SWE-MiniSandbox abstract** — `research/notes/260211210-swe-minisandbox-container-free-reinforcement-learning-for-building-sof.md`
|
| 338 |
+
10. **DeepSWE blog** — `research/notes/deepswe-training-a-fully-open-sourced-state-of-the-art-coding-agent-by-scaling-r.md`
|
| 339 |
+
11. **SWE-rebench infrastructure blog** — `research/notes/behind-swe-rebench-infrastructure-to-collect-massive-datasets-of-swe-tasks-and-e.md`
|
| 340 |
+
12. **Repo artifacts:** `composer_replication/datagen/substrates.py`, `research/06-feature-deletion-datagen.md`, `docs/adrs/ADR-010-feature-deletion-datagen.md`
|
|
@@ -0,0 +1,556 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-Read: SDPO / OPSD — Critical Audit of Cluster 2
|
| 2 |
+
|
| 3 |
+
**Date**: 2026-06-09
|
| 4 |
+
**Sources fetched from primary HTML**:
|
| 5 |
+
- SDPO: arXiv:2601.20802v2 (Hübotter et al., ETH Zürich / MIT / Stanford)
|
| 6 |
+
— note `reinforcement-learning-via-self-distillation-2` (148 K body)
|
| 7 |
+
- OPSD: arXiv:2601.18734v3 (Zhao et al., ICML 2026)
|
| 8 |
+
— note `self-distilled-reasoner-on-policy-self-distillation-for-large-language-models-2` (52 K body)
|
| 9 |
+
- OPSD code repo: github.com/siyan-zhao/OPSD (README + key args)
|
| 10 |
+
- SDPO code repo: github.com/lasgroup/SDPO (listed in abstract; fetch returned empty body)
|
| 11 |
+
|
| 12 |
+
**Repo files audited**: `composer_replication/opsd.py`, `composer_replication/trainer/composer_trainer.py::_compute_sdpo_loss`, `composer_replication/hint_generator.py`, `research/07-sdpo-hint-generator.md`, `research/11-sdpo-alignment-indices.md`, `docs/adrs/ADR-007`, `ADR-008`, `ADR-009`, `ADR-011`.
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
## 1. What the primary sources actually say
|
| 17 |
+
|
| 18 |
+
### 1.1 SDPO (arXiv:2601.20802v2) — core method
|
| 19 |
+
|
| 20 |
+
**Exact loss (Eq. 1)**:
|
| 21 |
+
|
| 22 |
+
```
|
| 23 |
+
L_SDPO(θ) := Σ_t KL( π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})) )
|
| 24 |
+
```
|
| 25 |
+
|
| 26 |
+
KL direction is **forward KL — KL(student ‖ teacher)**, i.e. `KL(π_θ || q_θ)`. The
|
| 27 |
+
student is in the first argument. This is the "reverse KL from the teacher's
|
| 28 |
+
perspective" but forward from the student's perspective (student wants to match teacher).
|
| 29 |
+
The paper writes it `KL(π_θ ‖ stopgrad(q_θ))`.
|
| 30 |
+
|
| 31 |
+
**Stability improvements (§2.3)**:
|
| 32 |
+
1. Regularized teacher: EMA of student params OR interpolation between current teacher
|
| 33 |
+
and initial teacher `q_{θ_ref}`. The paper calls these "trust-region" and "EMA"
|
| 34 |
+
teachers (Table 4). Non-regularized teacher (`q_θ`, the live student) diverges.
|
| 35 |
+
2. Symmetric JSD: the paper adopts JSD as the distillation loss for stability —
|
| 36 |
+
citing Agarwal et al. 2024 on-policy distillation. The pseudocode (Fig 14) calls
|
| 37 |
+
this `divergence(logprobs_student, logprobs_teacher)` with no fixed default — the
|
| 38 |
+
paper reports using JSD.
|
| 39 |
+
|
| 40 |
+
**Top-K approximation (Appendix A.3)**:
|
| 41 |
+
The paper approximates the full-vocab KL with top-K tokens of the **student** distribution:
|
| 42 |
+
```
|
| 43 |
+
L_SDPO ≈ Σ_t [ Σ_{ŷ_t ∈ topK(π_θ)} π_θ(ŷ_t|x,y_{<t}) · log(π_θ(ŷ_t) / q_θ(ŷ_t))
|
| 44 |
+
+ tail_term ]
|
| 45 |
+
```
|
| 46 |
+
The tail term aggregates the remaining probability mass. Default K=100.
|
| 47 |
+
|
| 48 |
+
**Teacher context construction (Table 2, verbatim)**:
|
| 49 |
+
```
|
| 50 |
+
User: {prompt}
|
| 51 |
+
Correct solution: {successful_previous_rollout} (skipped if unavailable)
|
| 52 |
+
The following is feedback from your unsuccessful earlier attempt:
|
| 53 |
+
{environment_output} (skipped if no env output or if solved)
|
| 54 |
+
Correctly solve the original question.
|
| 55 |
+
Assistant: {original_response} (the student's original attempt, for log-prob re-eval)
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
Critical nuance: the `original_response` is placed in the teacher context so the
|
| 59 |
+
model can re-evaluate log-probs of `y` under the teacher. **The student's original
|
| 60 |
+
attempt is always appended as the response the teacher evaluates** — this is how both
|
| 61 |
+
student and teacher evaluate log-probs of the same token sequence `y`.
|
| 62 |
+
|
| 63 |
+
Token alignment: **there is no shift / hint-insertion at an "error turn."** The teacher
|
| 64 |
+
sees `[prompt, feedback, original_response]` and both student and teacher evaluate
|
| 65 |
+
log-probs of the SAME `original_response` tokens. No additional tokens are inserted
|
| 66 |
+
into the response; the prefix is longer for the teacher (it has feedback), but the
|
| 67 |
+
response token sequence evaluated is identical.
|
| 68 |
+
|
| 69 |
+
**Three feedback types ablated (§4.6, Table 6)**:
|
| 70 |
+
1. **Sample solution** (`f = own solution`): a successful sibling rollout from the GRPO
|
| 71 |
+
group; always student-generated (no expert model). Teacher accuracy: 42.4%.
|
| 72 |
+
2. **Environment output** (`f = output`): runtime errors, failing unit tests, etc.
|
| 73 |
+
Teacher accuracy: 32.5%.
|
| 74 |
+
3. **Student's original attempt** (`f = y`): the repo calls out that including the
|
| 75 |
+
original attempt in the feedback (not just in the response slot) **reduces teacher
|
| 76 |
+
diversity** (biases teacher toward student; "Same output": 30% vs ~10–13%).
|
| 77 |
+
4. **Combined** (`f = output + own solution`): best trained student accuracy (48.3%).
|
| 78 |
+
Excluding `f = y` (the original attempt as part of conditioning) is key.
|
| 79 |
+
|
| 80 |
+
**Failure modes reported**:
|
| 81 |
+
- Non-regularized teacher (`q_θ`) diverges / training collapses (Table 4: 36.1% vs
|
| 82 |
+
50.6% for trust-region teacher).
|
| 83 |
+
- Performance depends on model in-context learning ability: SDPO underperforms GRPO on
|
| 84 |
+
weaker models (Qwen3-0.6B); hybrid SDPO+GRPO (λ=0.9 GRPO + 0.1 SDPO) is more
|
| 85 |
+
robust (§4.5).
|
| 86 |
+
- Uninformative or misleading environment feedback: SDPO cannot learn from it.
|
| 87 |
+
- SDPO adds small compute overhead (additional forward for log-prob re-computation of
|
| 88 |
+
teacher context); minor for large models, non-negligible for small models.
|
| 89 |
+
- Including the student's own attempt in the teacher conditioning (not just as the
|
| 90 |
+
response to re-evaluate) reduces diversity; the correct template excludes it from
|
| 91 |
+
the conditioning prefix.
|
| 92 |
+
|
| 93 |
+
**SDPO operates over the full rollout, not at isolated "error turns"**. The loss
|
| 94 |
+
sums over all tokens `t` in the response `y`. There is no error-site detection
|
| 95 |
+
step in the SDPO paper.
|
| 96 |
+
|
| 97 |
+
### 1.2 OPSD (arXiv:2601.18734v3) — core method
|
| 98 |
+
|
| 99 |
+
**Exact loss (Eq. 6–8)**:
|
| 100 |
+
```
|
| 101 |
+
L_OPSD(θ) = E_{(x,y*)~S} [ E_{ŷ~p_S(·|x)} [ D(p_T ‖ p_S)(ŷ|x) ] ]
|
| 102 |
+
where
|
| 103 |
+
D(p_T ‖ p_S)(ŷ|x) := (1/|ŷ|) Σ_{n=1}^{|ŷ|} D( p_T(·|x, y*, ŷ_{<n}) ‖ p_S(·|x, ŷ_{<n}) )
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
**Divergence D** can be: forward KL, reverse KL, or JSD_β. The paper defines:
|
| 107 |
+
```
|
| 108 |
+
JSD_β(p_T ‖ p_S) = β·KL(p_T ‖ m) + (1-β)·KL(p_S ‖ m)
|
| 109 |
+
m = β·p_T + (1-β)·p_S
|
| 110 |
+
```
|
| 111 |
+
|
| 112 |
+
**Direction convention**: `D(p_T ‖ p_S)` — teacher in first arg, student in second. For the JSD:
|
| 113 |
+
- β=0 → KL(p_S ‖ m) (approaches pure KL(p_S ‖ p_T) as m→p_T; forward KL w.r.t. teacher)
|
| 114 |
+
- β=1 → KL(p_T ‖ m) (approaches pure KL(p_T ‖ p_S); reverse KL w.r.t. teacher / forward KL w.r.t. student)
|
| 115 |
+
|
| 116 |
+
The GKD paper (arXiv:2306.13649) that OPSD cites defines JSD_β with the **same
|
| 117 |
+
convention**: `JSD_β(p ‖ q) = β·KL(p||M) + (1-β)·KL(q||M)`.
|
| 118 |
+
|
| 119 |
+
**Per-token pointwise clipping**: OPSD introduces this explicitly:
|
| 120 |
+
```
|
| 121 |
+
D_clip^(f)(p_T ‖ p_S) = (1/|ŷ|) Σ_n Σ_v min(l_{n,v}^(f), τ)
|
| 122 |
+
where l_{n,v}^(f) = p_T(v|·) · f( p_S(v|·) / p_T(v|·) )
|
| 123 |
+
```
|
| 124 |
+
This clips per vocab-entry contribution. Default τ=0.05 (from README: `--jsd_token_clip 0.05`).
|
| 125 |
+
Non-thinking mode results in README use 1e-7 (Qwen3-8B) and 1e-6 (Qwen3-4B, 1.7B).
|
| 126 |
+
|
| 127 |
+
**Teacher context**: `p_T(·|x, y*, ŷ_{<n})` — teacher sees the ground-truth answer `y*`
|
| 128 |
+
(a reference CoT / verified reasoning trace from the dataset) prepended to the problem,
|
| 129 |
+
then evaluates the student's prefix `ŷ_{<n}`. Same token sequence for both distributions
|
| 130 |
+
evaluated at each step `n`.
|
| 131 |
+
|
| 132 |
+
**`--reason_first` flag (from GitHub README)**: Prepend an explicit rationalization to the
|
| 133 |
+
teacher context before distillation. This is OPSD's self-introspection lever: the teacher
|
| 134 |
+
is first asked to rationalize why `y*` is correct, then that rationalization is folded into
|
| 135 |
+
the conditioning. Not the main results configuration; requires `--use_peft`.
|
| 136 |
+
|
| 137 |
+
**Results**: On Qwen3-1.7B (AIME24/25/HMMT25), OPSD +OPSD vs. base: 37.1% → 43.4%
|
| 138 |
+
(Avg@12). Outperforms GRPO (37.7%) and SFT (35.8%). Token-efficient: generation capped
|
| 139 |
+
at 1024 tokens vs. GRPO's 16K.
|
| 140 |
+
|
| 141 |
+
---
|
| 142 |
+
|
| 143 |
+
## 2. Audit: `composer_replication/opsd.py` — "byte-for-byte OPSD parity" claim
|
| 144 |
+
|
| 145 |
+
### 2.1 JSD formula — CORRECT, with a subtle direction note
|
| 146 |
+
|
| 147 |
+
The code implements:
|
| 148 |
+
```python
|
| 149 |
+
JSD = β·KL(teacher||M) + (1-β)·KL(student||M)
|
| 150 |
+
M = logsumexp([ log p_student + log(1-β), log p_teacher + log(β) ])
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
The OPSD paper (Eq. 7) defines:
|
| 154 |
+
```
|
| 155 |
+
JSD_β(p_T ‖ p_S) = β·KL(p_T‖m) + (1-β)·KL(p_S‖m)
|
| 156 |
+
```
|
| 157 |
+
where `m = β·p_T + (1-β)·p_S`.
|
| 158 |
+
|
| 159 |
+
The code `kl_teacher = F.kl_div(mixture_log_probs, teacher_log_probs, ...)` uses
|
| 160 |
+
PyTorch semantics where `F.kl_div(input=log_q, target=log_p, log_target=True)`
|
| 161 |
+
computes `KL(p||q) = Σ p(x)·(log p(x) - log q(x))`. So `kl_teacher` computes
|
| 162 |
+
`KL(teacher||mixture)` and `kl_student` computes `KL(student||mixture)`.
|
| 163 |
+
|
| 164 |
+
The final JSD: `β·kl_teacher + (1-β)·kl_student` = `β·KL(teacher||M) + (1-β)·KL(student||M)`.
|
| 165 |
+
|
| 166 |
+
This matches the OPSD paper's `JSD_β(p_T ‖ p_S)` exactly. **CORRECT.**
|
| 167 |
+
|
| 168 |
+
### 2.2 β convention docstring — INVERTED vs. both papers
|
| 169 |
+
|
| 170 |
+
The `opsd.py` docstring says:
|
| 171 |
+
```
|
| 172 |
+
β = 0 → KL(teacher || student) (reverse KL — mode-covering for student)
|
| 173 |
+
β = 1 → KL(student || teacher) (forward KL — mode-seeking for student)
|
| 174 |
+
```
|
| 175 |
+
|
| 176 |
+
From the OPSD GitHub README:
|
| 177 |
+
> `--beta`: Interpolation weight for the JSD mixture distribution.
|
| 178 |
+
> **Beta=0 means forward KL and 1 means reverse KL.**
|
| 179 |
+
|
| 180 |
+
The repo docstring has the β=0 and β=1 labels **swapped** relative to the OPSD upstream.
|
| 181 |
+
When β=0: `JSD_0 = 0·KL(teacher||M) + 1·KL(student||M)`. In the limit (degenerate β=0),
|
| 182 |
+
M approaches p_teacher, so this approaches `KL(student||teacher)` — which is **forward KL**
|
| 183 |
+
(student → teacher), **mode-seeking for student**. The README says "Beta=0 means forward KL"
|
| 184 |
+
which matches this analysis.
|
| 185 |
+
|
| 186 |
+
The code *implementation* is correct (the formula computes the right mixture). The *docstring*
|
| 187 |
+
labels β=0 as "reverse KL" and β=1 as "forward KL", which contradicts both the upstream README
|
| 188 |
+
and the mathematical analysis. This is a documentation error, not a numerical error.
|
| 189 |
+
|
| 190 |
+
**VERDICT**: Implementation is numerically correct. Docstring direction labels are inverted.
|
| 191 |
+
|
| 192 |
+
### 2.3 `reduction="batchmean"` behavior — MINOR DIVERGENCE from upstream
|
| 193 |
+
|
| 194 |
+
The repo `opsd.py` comment says:
|
| 195 |
+
> "batchmean" matches upstream OPSD: divides by `mask.sum()` when labels are given,
|
| 196 |
+
> else by the leading dim of jsd (= batch size).
|
| 197 |
+
|
| 198 |
+
The OPSD paper (Algorithm 1) normalizes by `|ŷ|` (sequence length, token-mean):
|
| 199 |
+
```
|
| 200 |
+
ℓ(x,y*) = D(p_T‖p_S)(ŷ|x) = (1/|ŷ|) Σ_n D(...)
|
| 201 |
+
L_OPSD(θ) = (1/|B|) Σ_{(x,y*)∈B} ℓ(x,y*)
|
| 202 |
+
```
|
| 203 |
+
|
| 204 |
+
The repo divides by `mask.sum()` (number of valid/masked tokens in the batch), which is
|
| 205 |
+
equivalent to OPSD's normalization only when every example has the same number of
|
| 206 |
+
error-turn tokens. When batch sizes vary (real training), this differs from the paper's
|
| 207 |
+
per-sequence average followed by batch average. In practice this difference is negligible
|
| 208 |
+
for stability, but it is technically not byte-for-byte OPSD parity on the reduction.
|
| 209 |
+
|
| 210 |
+
**VERDICT**: The `reduction="batchmean"` logic is borrowed from the OPSD upstream code
|
| 211 |
+
(which uses the same `mask.sum()` convention). The docstring's "matches upstream" claim
|
| 212 |
+
is accurate for the code, but the code diverges from the paper's stated per-sequence
|
| 213 |
+
normalization. Not a material issue.
|
| 214 |
+
|
| 215 |
+
### 2.4 `token_clip` parameter — CORRECT semantics, but per-token vs. per-(token,vocab) distinction
|
| 216 |
+
|
| 217 |
+
The repo implements `token_clip` as a **per-position** JSD clip:
|
| 218 |
+
```python
|
| 219 |
+
jsd = jsd.clamp(max=token_clip) # jsd shape is (B, T, V) or (n_valid, V)
|
| 220 |
+
```
|
| 221 |
+
|
| 222 |
+
The OPSD paper's pointwise clipping (Section 3.3) clips **per-(position, vocab-entry)**:
|
| 223 |
+
`min(l_{n,v}^(f), τ)` for each vocab entry v at each position n.
|
| 224 |
+
|
| 225 |
+
The upstream OPSD code (`--jsd_token_clip`) appears to apply the same per-(position,vocab)
|
| 226 |
+
clip. The repo's clamp on the jsd tensor before reduction would clip the full-vocab
|
| 227 |
+
contribution per position (since jsd has shape (B,T,V) before masking) — this is
|
| 228 |
+
equivalent to per-(position,vocab) clipping, which is correct.
|
| 229 |
+
|
| 230 |
+
**VERDICT**: Implementation appears correct. The parameter name (`token_clip`) is slightly
|
| 231 |
+
misleading (it clips per-token-vocab-entry, not just per-token), but the semantics match.
|
| 232 |
+
|
| 233 |
+
---
|
| 234 |
+
|
| 235 |
+
## 3. Critical structural mismatch: Composer/repo framing vs. SDPO mechanics
|
| 236 |
+
|
| 237 |
+
### 3.1 ERROR-TURN MASKING — NOT IN SDPO
|
| 238 |
+
|
| 239 |
+
The repo implements SDPO as an error-turn-masked loss:
|
| 240 |
+
- `_compute_sdpo_loss` applies JSD only at `error-turn tokens` (via `sdpo_loss_mask`).
|
| 241 |
+
- The data collator detects error sites in a trace and constructs a teacher context
|
| 242 |
+
with a hint inserted at the error turn (`ctx_teacher = ctx_student + hint`).
|
| 243 |
+
- The hint shifts teacher response tokens right, requiring explicit alignment indices
|
| 244 |
+
(ADR-011).
|
| 245 |
+
|
| 246 |
+
**The SDPO paper has no error-turn masking.** SDPO applies the KL loss to ALL tokens `t`
|
| 247 |
+
in the rollout response:
|
| 248 |
+
> "L_SDPO(θ) := Σ_t KL(π_θ(·|x, y_{<t}) ‖ stopgrad(π_θ(·|x, f, y_{<t})))"
|
| 249 |
+
|
| 250 |
+
The SDPO teacher context includes the full feedback; both student and teacher evaluate
|
| 251 |
+
log-probs of the **same response tokens** `y`. There is no "hint inserted into the
|
| 252 |
+
response" — the feedback is in the conditioning prefix, not intercalated into the
|
| 253 |
+
response sequence. Therefore the teacher response tokens are **not shifted** and token
|
| 254 |
+
alignment is trivially preserved: both contexts evaluate the same sequence `y`.
|
| 255 |
+
|
| 256 |
+
**The repo's architecture (hint at error turn → response token shift → alignment indices)**
|
| 257 |
+
is an interpretation of Composer 2.5's "hint" mechanism, not a feature of SDPO. SDPO's
|
| 258 |
+
feedback is in the prompt/conditioning context; it does not intercalate text into the
|
| 259 |
+
middle of a response.
|
| 260 |
+
|
| 261 |
+
**VERDICT**: The repo's error-turn-masking design is a reasonable extension of the
|
| 262 |
+
Composer blog's described mechanism ("insert hint at error turn") but is **NOT**
|
| 263 |
+
SDPO as described in the paper. The Composer blog's mechanism is itself not fully
|
| 264 |
+
described and may or may not match SDPO mechanics.
|
| 265 |
+
|
| 266 |
+
### 3.2 TEACHER CONTEXT — CRITICAL DIFFERENCE
|
| 267 |
+
|
| 268 |
+
**SDPO teacher context** (Table 2):
|
| 269 |
+
```
|
| 270 |
+
[prompt, feedback_f, original_response_y]
|
| 271 |
+
```
|
| 272 |
+
The teacher evaluates log-probs of `original_response_y` given `[prompt, feedback_f]`.
|
| 273 |
+
Teacher prefix = `[prompt, feedback_f]`. Response = `y` (same as student). No hint is
|
| 274 |
+
inserted *into* `y`.
|
| 275 |
+
|
| 276 |
+
**Repo teacher context**:
|
| 277 |
+
```
|
| 278 |
+
ctx_teacher = ctx_student + hint_at_error_turn
|
| 279 |
+
```
|
| 280 |
+
The hint is *intercalated* into the response sequence at an error turn. Teacher prefix
|
| 281 |
+
= student prefix. Response = `y_before_error + hint + y_after_error`. Teacher response
|
| 282 |
+
tokens are LONGER than student response tokens and SHIFTED.
|
| 283 |
+
|
| 284 |
+
This is architecturally different from SDPO. The alignment problem (ADR-008, ADR-011)
|
| 285 |
+
arises precisely because the repo's teacher context design inserts hint text into the
|
| 286 |
+
response, which SDPO does not do.
|
| 287 |
+
|
| 288 |
+
**VERDICT**: The repo's teacher context construction is a novel design inspired by the
|
| 289 |
+
Composer blog. It is not what SDPO does. The ADR-008 "trust-gap" and the entire
|
| 290 |
+
ADR-011 alignment index complexity are artifacts of this departure from SDPO, not
|
| 291 |
+
corrections to SDPO.
|
| 292 |
+
|
| 293 |
+
### 3.3 OPSD vs. SDPO as the loss source
|
| 294 |
+
|
| 295 |
+
The repo header in `opsd.py` says the loss is:
|
| 296 |
+
> "lifted from siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss (MIT)"
|
| 297 |
+
|
| 298 |
+
And:
|
| 299 |
+
> "SDPO paper: Hübotter et al. … formalizes the same loss as Composer 2.5's
|
| 300 |
+
> 'Targeted RL with Textual Feedback.'"
|
| 301 |
+
|
| 302 |
+
These are TWO DIFFERENT methods with related but not identical losses:
|
| 303 |
+
|
| 304 |
+
- **OPSD loss** (Eq. 7–8): `JSD_β(p_T ‖ p_S)` with teacher having `y*` (ground truth).
|
| 305 |
+
Normalization: per-sequence average then batch average. Pointwise vocab clipping.
|
| 306 |
+
Training runs ~100 steps. Fixed teacher (initial checkpoint, not live).
|
| 307 |
+
|
| 308 |
+
- **SDPO loss** (Eq. 1): `KL(π_θ(·|x,y_{<t}) ‖ stopgrad(π_θ(·|x,f,y_{<t})))` where
|
| 309 |
+
KL is applied per-position over the full response. The paper adopts JSD as a stability
|
| 310 |
+
improvement (§2.3) but the base formulation is reverse KL (student ‖ teacher).
|
| 311 |
+
Teacher is regularized via EMA or trust-region. No per-vocab clipping in the paper
|
| 312 |
+
(top-K approximation instead).
|
| 313 |
+
|
| 314 |
+
The repo correctly implements the OPSD JSD formula (which SDPO also uses for stability).
|
| 315 |
+
The claim "verified port of siyan-zhao/OPSD::OPSDTrainer.generalized_jsd_loss" is
|
| 316 |
+
accurate for the loss kernel. The claim "Composer 2.5's 'Targeted RL with Textual Feedback'"
|
| 317 |
+
is an assertion that Composer uses the same loss — this is not confirmed anywhere in the
|
| 318 |
+
Cursor blog or Composer 2 tech report.
|
| 319 |
+
|
| 320 |
+
---
|
| 321 |
+
|
| 322 |
+
## 4. Audit: `_compute_sdpo_loss` in `composer_trainer.py`
|
| 323 |
+
|
| 324 |
+
### 4.1 Gradient flow — CORRECT
|
| 325 |
+
|
| 326 |
+
```python
|
| 327 |
+
student_logits = model(input_ids=inputs["input_ids"]).logits
|
| 328 |
+
with torch.no_grad():
|
| 329 |
+
teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
|
| 330 |
+
```
|
| 331 |
+
|
| 332 |
+
Teacher is `no_grad` — matches SDPO's `stopgrad(π_θ(·|x,f,y_{<t}))`. Student has
|
| 333 |
+
gradient. Correct.
|
| 334 |
+
|
| 335 |
+
### 4.2 Alignment index machinery — NECESSARY GIVEN THE DESIGN, BUT NOT FROM SDPO
|
| 336 |
+
|
| 337 |
+
The `student_response_idx` / `teacher_response_idx` machinery (ADR-011) is needed
|
| 338 |
+
because the hint is inserted into the teacher response sequence. This complexity does
|
| 339 |
+
not exist in SDPO or OPSD because those methods never insert text into the response.
|
| 340 |
+
The repo's `strict_sdpo_alignment` guard is a correct defense against the problem it
|
| 341 |
+
has created for itself.
|
| 342 |
+
|
| 343 |
+
### 4.3 Batch-level masking — CORRECT for the repo's error-turn interpretation
|
| 344 |
+
|
| 345 |
+
The loss is masked to error-turn tokens only (`aligned_labels` with -100 elsewhere).
|
| 346 |
+
This means the SDPO channel only trains on error recovery tokens, not the full rollout.
|
| 347 |
+
SDPO trains on the full rollout. For Composer's intent (correcting error turns), the
|
| 348 |
+
masking is reasonable, but it produces a loss that is more like a targeted distillation
|
| 349 |
+
at error sites than SDPO's full-rollout advantage assignment.
|
| 350 |
+
|
| 351 |
+
---
|
| 352 |
+
|
| 353 |
+
## 5. Audit: `research/07-sdpo-hint-generator.md` — Accuracy check
|
| 354 |
+
|
| 355 |
+
### 5.1 Three feedback types from SDPO paper — CORRECTLY REPORTED
|
| 356 |
+
|
| 357 |
+
research/07 correctly identifies the three types:
|
| 358 |
+
1. Sample solution (successful sibling rollout)
|
| 359 |
+
2. Environment output (runtime errors)
|
| 360 |
+
3. Student's original attempt
|
| 361 |
+
|
| 362 |
+
The paper (Table 6 results, which research/07 did NOT have access to) shows:
|
| 363 |
+
- Best configuration: `f = output + own solution` (48.3% accuracy)
|
| 364 |
+
- Including `f = y` (original attempt as conditioning, not as response) **hurts diversity**
|
| 365 |
+
and slightly reduces final accuracy (44.5% vs 48.3%)
|
| 366 |
+
|
| 367 |
+
Research/07 correctly notes the sibling rollout is "always generated by the student, not
|
| 368 |
+
an expert model" — confirmed in the paper: "We emphasize that these sample solutions are
|
| 369 |
+
always generated by the student, as in GRPO, and do not require an expert model."
|
| 370 |
+
|
| 371 |
+
### 5.2 "Successful sibling rollout as implicit feedback" claim — CORRECTLY REPORTED
|
| 372 |
+
|
| 373 |
+
The abstract: "SDPO also outperforms baselines in standard RLVR environments that only
|
| 374 |
+
return scalar feedback by using successful rollouts as implicit feedback for failed attempts."
|
| 375 |
+
|
| 376 |
+
Research/07 cites this correctly and uses it as the basis for the `SiblingBootstrapGenerator`.
|
| 377 |
+
|
| 378 |
+
### 5.3 OPSD `--reason_first` flag — CORRECTLY DESCRIBED
|
| 379 |
+
|
| 380 |
+
The OPSD README confirms: `--reason_first False: Prepend an explicit rationalization to
|
| 381 |
+
the teacher context before distillation.` Research/07 correctly calls this "OPSD's own
|
| 382 |
+
knob for same-model introspection."
|
| 383 |
+
|
| 384 |
+
### 5.4 `--jsd_token_clip default 0.05` — CORRECTLY CITED
|
| 385 |
+
|
| 386 |
+
Confirmed from OPSD README: `--jsd_token_clip 0.05` is the default.
|
| 387 |
+
|
| 388 |
+
---
|
| 389 |
+
|
| 390 |
+
## 6. Audit: `SiblingBootstrapGenerator` — Is it supported by the papers?
|
| 391 |
+
|
| 392 |
+
The repo's `hint_generator.py` sketch (lines 319–331) and `research/07` §6.3:
|
| 393 |
+
```python
|
| 394 |
+
class SiblingBootstrapGenerator:
|
| 395 |
+
def generate(self, ctx):
|
| 396 |
+
sibs = ctx.get("sibling_rollouts") or []
|
| 397 |
+
winners = [s for s in sibs if s.get("reward", 0.0) > 0.0]
|
| 398 |
+
if not winners:
|
| 399 |
+
return None
|
| 400 |
+
best = max(winners, key=lambda s: s["reward"])
|
| 401 |
+
snippet = (best.get("solution_excerpt") or "")[:200]
|
| 402 |
+
return ("Reminder: a working approach for this task looks like:\n"
|
| 403 |
+
f"{snippet}\nAdapt this to the current step.")
|
| 404 |
+
```
|
| 405 |
+
|
| 406 |
+
**What the SDPO paper actually does** (Table 2 template):
|
| 407 |
+
```
|
| 408 |
+
Correct solution: {successful_previous_rollout}
|
| 409 |
+
```
|
| 410 |
+
The successful rollout is passed as the full solution (or relevant excerpt) in the
|
| 411 |
+
teacher context. The teacher then evaluates log-probs of the student's original
|
| 412 |
+
response given this context.
|
| 413 |
+
|
| 414 |
+
**Key difference**: In SDPO, the sibling solution goes into the teacher's conditioning
|
| 415 |
+
prefix. The teacher does not generate a new hint; it just re-evaluates the student's
|
| 416 |
+
response log-probs with the solution visible. In the repo, the sibling solution is
|
| 417 |
+
used to *generate a hint string* that gets inserted into the response sequence.
|
| 418 |
+
|
| 419 |
+
This is an extrapolation beyond what the SDPO paper supports. SDPO's "successful rollout
|
| 420 |
+
as implicit feedback" mechanism does NOT:
|
| 421 |
+
1. Generate a "Reminder: a working approach..." hint string.
|
| 422 |
+
2. Insert text into the student's response sequence.
|
| 423 |
+
3. Require error-turn detection.
|
| 424 |
+
|
| 425 |
+
The SDPO sibling mechanism IS:
|
| 426 |
+
1. Condition the teacher on the full successful solution.
|
| 427 |
+
2. Re-evaluate ALL student response token log-probs under that teacher.
|
| 428 |
+
3. Apply the KL loss across the entire response.
|
| 429 |
+
|
| 430 |
+
**VERDICT**: The `SiblingBootstrapGenerator` as sketched is an extrapolation from SDPO's
|
| 431 |
+
mechanism, not a faithful implementation of it. The paper supports using a sibling rollout
|
| 432 |
+
as teacher conditioning context; it does not support generating a textual hint from it
|
| 433 |
+
to splice into the response. The Composer blog's "hint" framing is the source of this
|
| 434 |
+
architectural decision; SDPO is cited as inspiration but is not the mechanism.
|
| 435 |
+
|
| 436 |
+
Research/07 acknowledges this at several points ("A working approach looks like: …" in
|
| 437 |
+
the class comment vs the actual SDPO template) but does not flag it as a divergence — it
|
| 438 |
+
presents the sibling-bootstrap hint approach as if it naturally follows from SDPO.
|
| 439 |
+
|
| 440 |
+
---
|
| 441 |
+
|
| 442 |
+
## 7. Audit: `research/11-sdpo-alignment-indices.md`
|
| 443 |
+
|
| 444 |
+
### 7.1 Problem correctly identified
|
| 445 |
+
|
| 446 |
+
ADR-011 correctly identifies that inserting a hint into the teacher context shifts teacher
|
| 447 |
+
response tokens right. The alignment indices machinery (`_mask_to_padded_indices`,
|
| 448 |
+
`student_response_idx`, `teacher_response_idx`, sentinel handling) is a sound engineering
|
| 449 |
+
solution to the problem the repo's design creates.
|
| 450 |
+
|
| 451 |
+
### 7.2 Root cause attribution — MISLEADING
|
| 452 |
+
|
| 453 |
+
ADR-011 and the trainer comments frame the alignment problem as an SDPO issue that the
|
| 454 |
+
papers fail to address ("the exact trust-gap flagged in ADR-008"). This is not accurate.
|
| 455 |
+
The alignment problem does not exist in SDPO or OPSD because those methods never insert
|
| 456 |
+
text into the response sequence. The alignment problem is entirely self-created by the
|
| 457 |
+
repo's decision to implement the Composer blog's "hint at error turn" as a text insertion
|
| 458 |
+
into the teacher's response sequence.
|
| 459 |
+
|
| 460 |
+
---
|
| 461 |
+
|
| 462 |
+
## 8. Audit: ADR-007, ADR-008, ADR-009 — Key claims
|
| 463 |
+
|
| 464 |
+
### ADR-007 — JSD as "the kernel of SDPO arXiv:2601.20802"
|
| 465 |
+
|
| 466 |
+
The ADR says `generalized_jsd_loss` is "verified port of siyan-zhao/OPSD, the kernel of
|
| 467 |
+
SDPO arXiv:2601.20802." This telescopes two papers. The JSD is the kernel of OPSD
|
| 468 |
+
(the Zhao et al. paper, 2601.18734). SDPO (Hübotter et al., 2601.20802) uses JSD as a
|
| 469 |
+
**stability improvement** over the base KL loss; the primary SDPO loss is the KL. Both
|
| 470 |
+
papers use the same JSD formula (citing GKD paper 2306.13649). The conflation is not
|
| 471 |
+
consequential for the loss code but creates confusion in documentation.
|
| 472 |
+
|
| 473 |
+
### ADR-008 — "SDPO needs full vocabulary logits"
|
| 474 |
+
|
| 475 |
+
Confirmed. SDPO Appendix A.3 discusses top-K approximation of the KL because "naively
|
| 476 |
+
computing the KL divergence between student and teacher requires holding full logits of
|
| 477 |
+
both models in memory." The repo's claim about needing full logits is correct; PRIME-RL's
|
| 478 |
+
log-probs-only interface is correctly identified as incompatible with the SDPO channel.
|
| 479 |
+
|
| 480 |
+
### ADR-008 — Dr. GRPO as the Composer algorithm
|
| 481 |
+
|
| 482 |
+
This is sourced from research/10 (Composer 2 tech report mining). Not audited here (out
|
| 483 |
+
of scope for this cluster).
|
| 484 |
+
|
| 485 |
+
### ADR-009 — "How Cursor generates that hint is unstated"
|
| 486 |
+
|
| 487 |
+
Confirmed true. The Composer 2 tech report (arXiv:2603.24477) is cited as unread in
|
| 488 |
+
research/07 §8 and ADR-009. ADR-009 correctly acknowledges the open question.
|
| 489 |
+
|
| 490 |
+
---
|
| 491 |
+
|
| 492 |
+
## 9. Summary of findings
|
| 493 |
+
|
| 494 |
+
| Claim | Source | Verdict |
|
| 495 |
+
|---|---|---|
|
| 496 |
+
| JSD formula in opsd.py is numerically correct | OPSD Eq. 7 | CORRECT |
|
| 497 |
+
| β=0 = "reverse KL" in docstring | OPSD README: "β=0 = forward KL" | INVERTED label |
|
| 498 |
+
| "byte-for-byte OPSD parity" | OPSD code | Mostly correct; β direction label wrong; reduction differs from paper's per-sequence normalization; otherwise matches upstream code |
|
| 499 |
+
| Error-turn masking is from SDPO | SDPO paper | FALSE — SDPO applies loss to full rollout, no error-turn detection |
|
| 500 |
+
| Teacher context = ctx_student + hint_at_error_turn | SDPO paper | FALSE — SDPO teacher = [prompt, feedback, student_response]; feedback is in prefix, not intercalated |
|
| 501 |
+
| SiblingBootstrapGenerator follows from SDPO "successful rollout as implicit feedback" | SDPO §4.6 | EXTRAPOLATION — SDPO conditions teacher on full solution; repo generates a hint string and inserts it into response sequence |
|
| 502 |
+
| Alignment indices machinery (ADR-011) addresses SDPO misalignment | SDPO paper | MISLEADING — problem is self-created by hint-insertion design; does not exist in SDPO |
|
| 503 |
+
| SDPO needs full vocabulary logits (ADR-008) | SDPO Appendix A.3 | CORRECT |
|
| 504 |
+
| Three feedback types in research/07 | SDPO §4.6 | CORRECTLY REPORTED |
|
| 505 |
+
| --jsd_token_clip default 0.05 | OPSD README | CORRECT |
|
| 506 |
+
| --reason_first flag | OPSD README | CORRECTLY DESCRIBED |
|
| 507 |
+
| "Successful rollouts as implicit feedback" claim | SDPO abstract | CORRECTLY CITED |
|
| 508 |
+
| Teacher is stop-grad, student has gradient | SDPO Eq. 1 | CORRECT in opsd.py and composer_trainer.py |
|
| 509 |
+
|
| 510 |
+
---
|
| 511 |
+
|
| 512 |
+
## 10. Recommendations
|
| 513 |
+
|
| 514 |
+
1. **Fix the β docstring** in `opsd.py` to match the OPSD upstream convention:
|
| 515 |
+
β=0 → forward KL (KL(student‖teacher)), β=1 → reverse KL (KL(teacher‖student)).
|
| 516 |
+
|
| 517 |
+
2. **Clarify the architectural departure from SDPO** in `composer_trainer.py` docstring
|
| 518 |
+
and `research/07`: the repo implements a Composer-blog-inspired error-turn hint
|
| 519 |
+
injection, which is an extension beyond SDPO. SDPO uses the feedback in the prompt
|
| 520 |
+
prefix and evaluates the full response; the repo intercalates text into the response.
|
| 521 |
+
|
| 522 |
+
3. **Reconsider framing of `SiblingBootstrapGenerator`**: it is an original design choice,
|
| 523 |
+
not an SDPO mechanism. The SDPO "sibling as implicit feedback" mechanism would look like:
|
| 524 |
+
build a teacher context `[prompt, successful_sibling_rollout, original_response]` and
|
| 525 |
+
apply KL over the whole original response — without generating a hint string or
|
| 526 |
+
error-turn detection. This would be simpler and more faithful to SDPO.
|
| 527 |
+
|
| 528 |
+
4. **Teacher regularization is not implemented**: the SDPO paper shows a non-regularized
|
| 529 |
+
teacher diverges (Table 4: 36.1% vs. 50.6%). The repo's teacher is the live model
|
| 530 |
+
weights at each step with no EMA or trust-region regularization. For production SDPO
|
| 531 |
+
runs this is a gap. (The `sdpo_jsd_beta` default of 0.5 uses symmetric JSD which is
|
| 532 |
+
one of SDPO's stability improvements, but the teacher regularization is absent.)
|
| 533 |
+
|
| 534 |
+
5. **SDPO's original attempt placement**: the paper includes the student's original
|
| 535 |
+
response as the sequence being log-prob-evaluated (i.e., the "response" slot in the
|
| 536 |
+
teacher context). The repo's collator instead masks specific error-turn tokens within
|
| 537 |
+
a modified response. These are architecturally different. The paper-accurate approach
|
| 538 |
+
would re-evaluate log-probs of the entire original response under the hint-conditioned
|
| 539 |
+
teacher, not just the tokens after the error.
|
| 540 |
+
|
| 541 |
+
6. **Failure mode from SDPO paper**: the strongest limitation is model capability
|
| 542 |
+
dependence — SDPO underperforms GRPO on weak models (Qwen3-0.6B); SDPO+GRPO with
|
| 543 |
+
λ=0.9 is recommended for weaker base models. This is not documented in the repo's
|
| 544 |
+
SDPO usage guidance.
|
| 545 |
+
|
| 546 |
+
---
|
| 547 |
+
|
| 548 |
+
## 11. What the papers do NOT say (repo-claimed but unconfirmed in sources)
|
| 549 |
+
|
| 550 |
+
- That Composer 2.5's "Targeted RL with Textual Feedback" is SDPO specifically (the
|
| 551 |
+
Cursor blog does not cite SDPO; it describes a mechanism consistent with SDPO but
|
| 552 |
+
the connection is an inference, not a citation).
|
| 553 |
+
- That error-turn masking is part of SDPO.
|
| 554 |
+
- That the repo's hint-at-error-turn teacher context is the SDPO mechanism.
|
| 555 |
+
- That the alignment index problem (ADR-011) is an issue in SDPO.
|
| 556 |
+
- How Cursor generates the hint (confirmed absent in all Cursor artifacts).
|
|
@@ -0,0 +1,421 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-Read: GRPO Objective Family — Critical Review
|
| 2 |
+
**Cluster:** Policy-optimization objective family
|
| 3 |
+
**Date:** 2026-06-09
|
| 4 |
+
**Reviewer:** Claude Code subagent (critical-review pipeline)
|
| 5 |
+
**Sources fetched and read (full HTML bodies):**
|
| 6 |
+
- arXiv:2402.03300 — DeepSeekMath / original GRPO (Shao et al.)
|
| 7 |
+
- arXiv:2503.20783 — Dr.GRPO "Understanding R1-Zero-Like Training" (Liu et al.)
|
| 8 |
+
- arXiv:2503.14476 — DAPO (Yu et al., ByteDance/Tsinghua)
|
| 9 |
+
- arXiv:2507.18071 — GSPO (Zheng et al., Qwen/Alibaba)
|
| 10 |
+
- arXiv:2506.13585 — CISPO / MiniMax-M1 (MiniMax)
|
| 11 |
+
- arXiv:2512.21852 — "A Comedy of Estimators" (Shah, Obando-Ceron et al.) — **abstract only** (HTML 404; PDF binary; full text unavailable via hyperresearch; findings below rely on the abstract + cross-refs in the repo)
|
| 12 |
+
- arXiv:2603.24477 — Composer 2 Technical Report (full body via `research/notes/composer-2-technical-report.md`)
|
| 13 |
+
- **Repo files read:** `composer_replication/trainer/composer_trainer.py`, `composer_replication/trainer/kl_in_reward.py`, `docs/adrs/ADR-014-policy-optimization-objective-menu.md`, `research/10-composer2-techreport-mining.md`, `research/notes/channel-1-drgrpo-base-po-objective-menu-*.md`, `research/design-F5-fidelity-audit.md`
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## 1. Exact loss-math delta: each objective vs original GRPO
|
| 18 |
+
|
| 19 |
+
### 1.1 Original GRPO (arXiv:2402.03300, Shao et al. 2024)
|
| 20 |
+
|
| 21 |
+
**Loss (Eq. 11 in the DeepSeekMath paper):**
|
| 22 |
+
|
| 23 |
+
```
|
| 24 |
+
J_GRPO(θ) = E_{q, {o_i}~π_old}
|
| 25 |
+
(1/G) Σ_i (1/|o_i|) Σ_t { min[r_{i,t} · Â_{i,t}, clip(r_{i,t}, 1-ε, 1+ε)·Â_{i,t}]
|
| 26 |
+
- β · D_KL[π_θ || π_ref] }
|
| 27 |
+
|
| 28 |
+
Â_{i,t} = (R(q,o_i) - mean({R_j})) / std({R_j})
|
| 29 |
+
|
| 30 |
+
r_{i,t} = π_θ(o_{i,t}|q,o_{i,<t}) / π_old(o_{i,t}|q,o_{i,<t}) # token-level IS ratio
|
| 31 |
+
```
|
| 32 |
+
|
| 33 |
+
Key design choices:
|
| 34 |
+
- **Per-sequence length normalization** (`1/|o_i|` inside the sum) — averages loss over tokens *within* each response
|
| 35 |
+
- **Advantage = group-mean-centered, then divided by group std** (std-normalization)
|
| 36 |
+
- **KL in the LOSS** (not in the reward): `β · D_KL[π_θ||π_ref]` added as a per-token penalty inside the surrogate
|
| 37 |
+
- KL estimated via the k3 (Schulman) estimator: `D_KL ≈ π_ref/π_θ - log(π_ref/π_θ) - 1`, i.e., `exp(ref_logp - logp) - (ref_logp - logp) - 1` (verified at DeepSeekMath full text L1723-1826, and `composer_trainer.py` docstring referencing trl grpo_trainer.py ~L2513)
|
| 38 |
+
|
| 39 |
+
### 1.2 Dr. GRPO (arXiv:2503.20783, Liu et al., Sea AI Lab)
|
| 40 |
+
|
| 41 |
+
**Exact delta from GRPO:**
|
| 42 |
+
|
| 43 |
+
```
|
| 44 |
+
CHANGE 1: Remove 1/|o_i| length normalization
|
| 45 |
+
→ loss is now sum over tokens (not mean), normalized only by batch/group size 1/G
|
| 46 |
+
→ in implementation: masked_mean divides by MAX_TOKENS (a constant), not mask.sum(axis=dim)
|
| 47 |
+
(Listing 1, Dr.GRPO paper)
|
| 48 |
+
|
| 49 |
+
CHANGE 2: Remove std(R) normalization from advantage
|
| 50 |
+
Â_{i,t}^{Dr.GRPO} = R(q,o_i) - mean({R_j}) ← NO /std
|
| 51 |
+
```
|
| 52 |
+
|
| 53 |
+
**Why (paper's argument, Sec. 3.1–3.2):**
|
| 54 |
+
- Length normalization introduces *response-level length bias*: for positive advantages, shorter correct responses get larger gradient; for negative advantages, longer wrong responses are penalized less → policy artificially lengthens incorrect responses
|
| 55 |
+
- Std-normalization introduces *question-level difficulty bias*: easy/hard questions (std≈0) get up-weighted in the batch regardless of their value
|
| 56 |
+
|
| 57 |
+
**Empirical result:** Dr.GRPO achieves better token efficiency and prevents "wild length growth" for incorrect responses; accuracy is maintained or improved (Appendix C ablation, Fig. 9 3-seed comparison).
|
| 58 |
+
|
| 59 |
+
**Connection to RLOO:** The unbiased advantage `Â_{i,t} = R(q,o_i) - mean({R_j})` is proportional to the REINFORCE Leave-One-Out estimator: `G/(G-1) · Â_{i,t} = Â^{RLOO}_{i,t}` (App. A). This is a principled unbiased baseline without a value model.
|
| 60 |
+
|
| 61 |
+
### 1.3 DAPO (arXiv:2503.14476, Yu et al., ByteDance)
|
| 62 |
+
|
| 63 |
+
**Four changes from GRPO. Exact deltas:**
|
| 64 |
+
|
| 65 |
+
**DAPO-1: Clip-Higher (Decoupled Clip)**
|
| 66 |
+
```
|
| 67 |
+
PPO/GRPO: clip(r, 1-ε, 1+ε) # symmetric, same bound both directions
|
| 68 |
+
DAPO: clip(r, 1-ε_low, 1+ε_high) where ε_high > ε_low
|
| 69 |
+
→ paper: ε_low=0.2, ε_high=0.28
|
| 70 |
+
```
|
| 71 |
+
Purpose: allow larger upward updates on positive-advantage responses (more entropy diversity) while keeping the lower bound conservative (no degradation on negative responses).
|
| 72 |
+
|
| 73 |
+
**DAPO-2: Dynamic Sampling (filters zero-gradient groups)**
|
| 74 |
+
```
|
| 75 |
+
If all G responses in a group have reward=0 or reward=1, the group gradient is zero
|
| 76 |
+
→ DAPO filters these "degenerate" groups and resamples until a mixed-reward group appears
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
**DAPO-3: Token-Level Policy Gradient Loss**
|
| 80 |
+
DAPO normalizes loss by total tokens across the batch (not per-sequence). Effectively same as Dr.GRPO's length-unbiased formulation but motivated differently (stability in long-CoT).
|
| 81 |
+
|
| 82 |
+
**DAPO-4: Overlong Reward Shaping (masking truncated responses)**
|
| 83 |
+
```
|
| 84 |
+
Default: assign punitive reward to truncated (overlong) responses
|
| 85 |
+
DAPO: instead mask the loss on truncated responses entirely
|
| 86 |
+
→ paper calls this "Overlong Filtering" (Section 3.4)
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
**DAPO KL treatment:** `beta=0.0` �� KL is REMOVED entirely. No reference policy.
|
| 90 |
+
|
| 91 |
+
**repo's DAPO preset (`PO_OBJECTIVES["dapo"]`):**
|
| 92 |
+
```python
|
| 93 |
+
{
|
| 94 |
+
"loss_type": "dapo",
|
| 95 |
+
"scale_rewards": "none",
|
| 96 |
+
"epsilon": 0.2,
|
| 97 |
+
"epsilon_high": 0.28, # correct per paper
|
| 98 |
+
"mask_truncated_completions": True, # correct = DAPO's overlong filter
|
| 99 |
+
"beta": 0.0, # correct: KL removed
|
| 100 |
+
"importance_sampling_level": "token",
|
| 101 |
+
"num_iterations": 1,
|
| 102 |
+
}
|
| 103 |
+
```
|
| 104 |
+
**Assessment: ACCURATE to the paper.** The `epsilon_high=0.28` matches the paper's recommended value. `mask_truncated_completions=True` maps to DAPO's "Overlong Filtering." `beta=0.0` is correct — DAPO removes KL.
|
| 105 |
+
|
| 106 |
+
**Composer 2's rejection of DAPO overlong masking:** The Composer 2 technical report (research/notes/composer-2-technical-report.md, L362) states verbatim:
|
| 107 |
+
> "We did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length. Our self-summary system (discussed below) also limits the occurrence of these cases in practice."
|
| 108 |
+
|
| 109 |
+
The repo's research/10-composer2-techreport-mining.md records this correctly at line 51: "DAPO overlong-rollout masking explicitly TRIED and REJECTED ('did not see benefits at small scale')."
|
| 110 |
+
|
| 111 |
+
**VERDICT on repo's research/10 claim about DAPO overlong masking:** VERIFIED. The Composer 2 report does explicitly try and reject DAPO overlong masking. The claim "not adopted at small scale" is accurate.
|
| 112 |
+
|
| 113 |
+
### 1.4 GSPO (arXiv:2507.18071, Zheng et al., Qwen Team, Alibaba)
|
| 114 |
+
|
| 115 |
+
**Key insight:** GRPO's instability for large/MoE models stems from token-level importance sampling weights accumulating high variance over long sequences. The IS weight at each token position is a single sample from a distribution — when the policy updates, MoE routing changes, making token-level ratios for the same token in two different forward passes unreliable.
|
| 116 |
+
|
| 117 |
+
**GSPO objective:**
|
| 118 |
+
```
|
| 119 |
+
J_GSPO(θ) = E_{x, {y_i}~π_old} [ (1/G) Σ_i min(s_i(θ)·Â_i, clip(s_i(θ), 1-ε, 1+ε)·Â_i) ]
|
| 120 |
+
|
| 121 |
+
where s_i(θ) = (π_θ(y_i|x) / π_old(y_i|x))^{1/|y_i|}
|
| 122 |
+
= exp((1/|y_i|) Σ_t log(π_θ(y_{i,t}|x,y_{i,<t}) / π_old(y_{i,t}|x,y_{i,<t})))
|
| 123 |
+
|
| 124 |
+
Â_i = (r(x,y_i) - mean({r_j})) / std({r_j}) # advantage still uses std-norm!
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
**Key delta from GRPO:**
|
| 128 |
+
- IS ratio is the **geometric mean** of per-token ratios = the |y_i|-th root of the sequence-level ratio
|
| 129 |
+
- Clipping is done at the **sequence level** (one clip decision per response), not per token
|
| 130 |
+
- Tokens within an unclipped response contribute equally to the gradient (no per-token IS weighting)
|
| 131 |
+
- Advantage still uses std-normalization (unlike Dr.GRPO) — GSPO does NOT remove std-norm
|
| 132 |
+
|
| 133 |
+
**Gradient analysis:** GSPO token-level gradient = `s_i(θ) · (1/|y_i|) Σ_t Â_i · ∇_θ log π_θ(y_{i,t})`. The scalar `s_i(θ)` is the same for all tokens in a response — stable, does not accumulate per-token noise.
|
| 134 |
+
|
| 135 |
+
**MoE benefit (Section 5.3):** With MoE models, expert activations change between π_old and π_θ forward passes (~10% of experts per layer differ after one update for Qwen3-30B-A3B). This makes token-level IS ratios volatile. GSPO's sequence-level ratio depends only on the sequence likelihood — stable even when some expert routing changes. Eliminates need for "Routing Replay" strategy.
|
| 136 |
+
|
| 137 |
+
**repo's GSPO preset:**
|
| 138 |
+
```python
|
| 139 |
+
"gspo": {
|
| 140 |
+
"loss_type": "grpo", # trl has no literal "gspo"
|
| 141 |
+
"scale_rewards": "group", # ← std-norm kept (correct for GSPO!)
|
| 142 |
+
"importance_sampling_level": "sequence",
|
| 143 |
+
"num_iterations": 1,
|
| 144 |
+
}
|
| 145 |
+
```
|
| 146 |
+
**Assessment: CORRECT** — TRL expresses GSPO as grpo-loss + sequence IS, and `scale_rewards="group"` (std-norm) matches the GSPO paper. The comment in `composer_trainer.py:818` accurately notes "trl has no literal 'gspo'."
|
| 147 |
+
|
| 148 |
+
**ONE SUBTLE ISSUE:** The GSPO paper's clipping ranges are of a **different magnitude** than GRPO. GSPO uses `ε ≈ 3e-4` (left) and `4e-4` (right) vs GRPO's `0.2/0.27`. The repo's GSPO preset does not set custom clipping ranges — it uses whatever TRL's default is for `loss_type="grpo"`. The `epsilon` and `epsilon_high` are likely to be GRPO-scale defaults (0.2), which would be dramatically wrong for GSPO's geometric-mean ratio. **This is an unaddressed subtlety** — the GSPO paper explicitly states the clipping ranges "differ by orders of magnitude" between GSPO and GRPO (Section 4.1, Fig. 2). The repo's GSPO preset needs custom `epsilon`/`epsilon_high` at GSPO scale (e.g., 3e-4/4e-4).
|
| 149 |
+
|
| 150 |
+
### 1.5 CISPO (arXiv:2506.13585, MiniMax-M1)
|
| 151 |
+
|
| 152 |
+
**CISPO key idea:** Rather than clipping the PPO surrogate objective (which zeroes gradient for out-of-range IS weights), CISPO clips the IS weight itself as a **stop-gradient coefficient** — every token always contributes to gradient.
|
| 153 |
+
|
| 154 |
+
**CISPO objective:**
|
| 155 |
+
```
|
| 156 |
+
J_CISPO(θ) = E_{q, {o_i}~π_old} [
|
| 157 |
+
(1/Σ_i|o_i|) Σ_i Σ_t sg(clip(r_{i,t}(θ), 1-ε_low^IS, 1+ε_high^IS)) · Â_{i,t} · log π_θ(o_{i,t}|q,o_{i,<t})
|
| 158 |
+
]
|
| 159 |
+
```
|
| 160 |
+
|
| 161 |
+
where `sg(·)` = stop-gradient. So the IS weight `r̂_{i,t}` is a **detached constant coefficient** on log π, not a differentiable term.
|
| 162 |
+
|
| 163 |
+
**Key deltas from GRPO/PPO:**
|
| 164 |
+
1. **No min(r·A, clip(r,...)·A)** surrogate — just `r̂(detached) · log π` (pure weighted REINFORCE)
|
| 165 |
+
2. IS weight is clipped but stop-grad'd — zeroing a token's IS weight sets its gradient contribution to zero; NOT clipping means every token contributes, even far-from-policy ones
|
| 166 |
+
3. In MiniMax-M1's implementation: `ε_low^IS` is not applied (no lower bound); only `ε_high^IS` is tuned (~5 in their experiments, which is "ScaleRL" range)
|
| 167 |
+
4. **KL removed**: no KL penalty, beta=0
|
| 168 |
+
5. Advantage uses group-relative std-norm (like original GRPO, unlike Dr.GRPO)
|
| 169 |
+
|
| 170 |
+
**Why it's more efficient than GRPO/DAPO (paper's argument):** GRPO/DAPO zero gradients for tokens whose IS ratio falls outside the clip window. In long-response reasoning, MANY tokens get zeroed. CISPO avoids this — the IS weight is merely down-weighted, never zeroed. Result: 2× speedup vs DAPO empirically on Qwen2.5-32B math (Fig. 2 of MiniMax-M1 report).
|
| 171 |
+
|
| 172 |
+
**repo's CISPO preset:**
|
| 173 |
+
```python
|
| 174 |
+
"cispo": {
|
| 175 |
+
"loss_type": "cispo",
|
| 176 |
+
"scale_rewards": "none", # ← Dr.GRPO regime, no std-norm
|
| 177 |
+
"epsilon_high": 5.0, # ← matches ScaleRL range
|
| 178 |
+
"importance_sampling_level": "token",
|
| 179 |
+
"num_iterations": 1,
|
| 180 |
+
}
|
| 181 |
+
```
|
| 182 |
+
|
| 183 |
+
**Assessment: MOSTLY CORRECT but two issues:**
|
| 184 |
+
1. The MiniMax-M1 paper uses `scale_rewards` with group std-norm (the paper's advantage formula includes std-normalization). Setting `scale_rewards="none"` deviates from the paper — though Dr.GRPO argument applies here too, this is a deliberate design choice not explicitly sourced from the CISPO paper itself.
|
| 185 |
+
2. The KL treatment is not explicitly set to `beta=0` in the CISPO preset. DAPO sets `beta=0.0` explicitly, but CISPO doesn't — this relies on TRL's default. Should be made explicit to match the paper ("There is no KL penalty term in CISPO similar to other recent works" — MiniMax-M1 L1170-1174).
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
## 2. Comedy of Estimators (arXiv:2512.21852) — what it claims, what the repo says
|
| 190 |
+
|
| 191 |
+
**Critical limitation:** The HTML experimental version returns 404; PDF is binary. Only the abstract and cross-references within the repo's research notes are available for full-text analysis. The abstract (v3, 2026-03-18) states:
|
| 192 |
+
|
| 193 |
+
> "we observe that, in on-policy settings: (1) estimator configurations with **biased gradients can result in training instabilities**; and (2) using estimator configurations resulting in **unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks**."
|
| 194 |
+
> "We also investigate the performance resulting from different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training resulting from asynchronous setups."
|
| 195 |
+
|
| 196 |
+
**Models tested (from abstract):** Qwen2.5-7B, Llama-3.1-8B-Instruct, Qwen3-4B-Instruct-2507.
|
| 197 |
+
|
| 198 |
+
**What the repo claims about it:**
|
| 199 |
+
|
| 200 |
+
The repo (`research/design-F5-fidelity-audit.md:10`, `kl_in_reward.py` docstring, `composer_trainer.py:79`) claims:
|
| 201 |
+
> "arXiv:2512.21852 ('A Comedy of Estimators') — k1-in-reward improves OOD generalization; k3-in-reward can collapse."
|
| 202 |
+
|
| 203 |
+
**ACCURACY ASSESSMENT:** The abstract supports the general claim that biased-gradient configurations cause instabilities and unbiased ones improve OOD performance. However the specific claim "k3-in-reward can collapse" requires verification against the full paper text — the abstract doesn't explicitly say which estimator is biased in which placement. The repo's characterization is **plausible but partially unverified** (full text unavailable). The abstract's claim about "unbiased gradients → better OOD" supports the repo's k1-in-reward motivation, but the specific k1-vs-k3 ranking and which placement (in-reward vs in-loss) is correct cannot be confirmed from the abstract alone.
|
| 204 |
+
|
| 205 |
+
**What the repo says about verl:** `kl_in_reward.py:12` claims "verl adopted k1-in-reward as its *only* reverse-KL option." This cannot be verified against source without reading verl's codebase — but the research/design-F5-fidelity-audit.md records it as a reference alongside TRL issue #4967 (also not read directly).
|
| 206 |
+
|
| 207 |
+
**SCALE / TASK CAVEAT:** The Comedy of Estimators was tested on 7B/8B models on reasoning tasks. Composer 2 runs a 1.04T/32B-active MoE on agentic coding. Generalization of 7B math findings to 1T MoE coding is an extrapolation the repo does not explicitly flag.
|
| 208 |
+
|
| 209 |
+
---
|
| 210 |
+
|
| 211 |
+
## 3. k1-in-reward implementation: faithfulness audit
|
| 212 |
+
|
| 213 |
+
### 3.1 What the Composer 2 report specifies
|
| 214 |
+
|
| 215 |
+
Composer 2 Technical Report §4.1 (verified verbatim, `research/notes/composer-2-technical-report.md:442-497`):
|
| 216 |
+
> "Many open-source implementations of RL estimate KL with the estimator k3 = (r−1) − log r... However, Amini et al. shows [that] the variance increases drastically as p and q diverge... Therefore, we use the standard estimator **k1 = −log r** instead."
|
| 217 |
+
|
| 218 |
+
This is KL applied **in the reward** (PPO-style), not in the loss:
|
| 219 |
+
> "Similar to prior work... we use a Kullback–Leibler divergence for regularization, KL(q||p) = E_{x~q}[-log r(x)]"
|
| 220 |
+
|
| 221 |
+
So the Composer 2 choice is:
|
| 222 |
+
- **k1 estimator**: `KL_i ≈ Σ_t (logp_{i,t} - ref_logp_{i,t})` (= `-log r` where r = π_ref/π_θ... wait, need to check sign)
|
| 223 |
+
|
| 224 |
+
**Sign convention note:** In the Composer 2 report, `r(x) = p(x)/q(x)` where p=π_ref (reference) and q=π_θ (current). So `−log r = −log(π_ref/π_θ) = logp - ref_logp` per token. This is the KL(π_θ||π_ref) direction.
|
| 225 |
+
|
| 226 |
+
**In the repo's kl_in_reward.py:**
|
| 227 |
+
```python
|
| 228 |
+
# k1 estimator:
|
| 229 |
+
per_token = (policy_logps - ref_logps) * completion_mask # logp - ref_logp
|
| 230 |
+
return per_token.sum(dim=-1)
|
| 231 |
+
```
|
| 232 |
+
And `apply_kl_in_reward` subtracts `coef*(KL_i - group_mean(KL))` from advantages.
|
| 233 |
+
|
| 234 |
+
**Sign check:** `adv'_i = adv_i - coef*(KL_i - mean(KL))`. If KL_i is large (policy has drifted far from reference), the advantage is reduced — penalizing divergence. This matches the standard RLHF KL penalty direction. ✓
|
| 235 |
+
|
| 236 |
+
**The penalty direction in original GRPO/PPO-RLHF:** The standard KL-in-reward is:
|
| 237 |
+
`reward'_i = reward_i - coef * KL(π_θ||π_ref)`, where KL ≥ 0.
|
| 238 |
+
|
| 239 |
+
With k1: `KL ≈ Σ_t (logp - ref_logp)`. If policy is very close to reference, `logp ≈ ref_logp`, KL ≈ 0. If policy drifts positive (puts more mass on this response), `logp > ref_logp`, KL > 0, penalty increases. This is the reverse KL direction (π_θ||π_ref), appropriate for preventing drift. ✓
|
| 240 |
+
|
| 241 |
+
### 3.2 The algebra: fold-then-baseline identity
|
| 242 |
+
|
| 243 |
+
The `kl_in_reward.py` docstring and `apply_kl_in_reward` implement:
|
| 244 |
+
```
|
| 245 |
+
adv'_i = adv_i - coef*(KL_i - group_mean(KL))
|
| 246 |
+
= [R_i - group_mean(R)] - coef*[KL_i - group_mean(KL)]
|
| 247 |
+
= [R_i - coef*KL_i] - group_mean([R - coef*KL])
|
| 248 |
+
= reward'_i - group_mean(reward')
|
| 249 |
+
```
|
| 250 |
+
This identity is exact under a **linear** (group-mean) baseline with **no std-normalization** (`scale_rewards="none"`). The docstring correctly states this requires the Dr.GRPO regime.
|
| 251 |
+
|
| 252 |
+
**With std-normalization (vanilla GRPO):** the identity breaks because `group_std(R - coef*KL) ≠ group_std(R) - coef*group_std(KL)` in general. `validate_kl_in_reward_config` correctly raises if `scale_rewards` is not in {none, false}.
|
| 253 |
+
|
| 254 |
+
**FAITHFUL IMPLEMENTATION? YES** for the on-policy Dr.GRPO regime. The identity algebra is mathematically sound and unit-tested.
|
| 255 |
+
|
| 256 |
+
### 3.3 Deferred penalty case (aligned non-vLLM)
|
| 257 |
+
|
| 258 |
+
The `_generate_and_score_completions` override handles two cases:
|
| 259 |
+
1. **vLLM path**: `old_per_token_logps` is available at scoring time → apply penalty immediately, fold into advantages
|
| 260 |
+
2. **Aligned non-vLLM path**: `old_per_token_logps` may be None at scoring time → stash `_kl_in_reward_applied=False`, defer to `_grpo_loss_kl_in_reward` which gets `per_token_logps` from the model forward pass
|
| 261 |
+
|
| 262 |
+
**Issue in case 2:** The deferred path re-runs the model forward pass inside `torch.no_grad()` (`_grpo_loss_kl_in_reward:308-322`) to get `policy_logps`. This is the **sampling-time policy** (old policy), but by the time `_compute_loss` is called, gradient updates may have already been applied (if using multiple inner iterations). With `num_iterations=1` (the Dr.GRPO default), there's exactly one update epoch per batch, so this is fine. With `num_iterations>1`, the deferred path would compute KL against the *updated* policy instead of the *sampling-time* old policy — a subtle correctness bug. The `validate_kl_in_reward_config` does NOT guard against `num_iterations>1`. **This is a missing precondition.**
|
| 263 |
+
|
| 264 |
+
### 3.4 Per-token vs per-sequence: where the KL penalty lands
|
| 265 |
+
|
| 266 |
+
The repo computes the KL penalty as a **per-sequence sum** (not per-token mean):
|
| 267 |
+
```python
|
| 268 |
+
per_token = (policy_logps - ref_logps) * completion_mask
|
| 269 |
+
return per_token.sum(dim=-1) # ← SUM, not mean
|
| 270 |
+
```
|
| 271 |
+
|
| 272 |
+
Then `apply_kl_in_reward` subtracts `coef * (KL_i - group_mean(KL))` from the **scalar advantage** (one number per sequence).
|
| 273 |
+
|
| 274 |
+
This is the correct "in-reward" treatment: the penalty is on the total sequence reward, not added token-by-token to the loss as verl's per-token path would. The per-token sum equals the total sequence KL contribution.
|
| 275 |
+
|
| 276 |
+
**Compared to verl:** The original GRPO/PPO approach adds `β * KL_per_token` inside the per-token surrogate loss. The repo instead adjusts the advantage (scalar) by the summed KL. The gradient signal is different:
|
| 277 |
+
- In-loss (per-token): every token gets a gradient from the KL term
|
| 278 |
+
- In-reward (advantage adjustment): the KL modifies the *advantage*, which then scales all token gradients uniformly
|
| 279 |
+
|
| 280 |
+
The "in-reward" vs "in-loss" distinction affects whether the KL gradient flows through the IS ratio or not. In the in-reward approach (the repo's), the KL is folded into the advantage before the surrogate is computed — so the surrogate objective is `min(r_t * (A + KL_correction), clip(...) * (A + KL_correction))`. In the in-loss approach (TRL's native k3), the KL is added directly as `β * k3_per_token` outside the surrogate clip.
|
| 281 |
+
|
| 282 |
+
---
|
| 283 |
+
|
| 284 |
+
## 4. Findings on the repo's research notes accuracy
|
| 285 |
+
|
| 286 |
+
### 4.1 research/10: Dr.GRPO claims — ACCURATE
|
| 287 |
+
|
| 288 |
+
The mining note (`research/10-composer2-techreport-mining.md:43-55`) states:
|
| 289 |
+
- "Built on Dr. GRPO [34=Liu et al., 2503.20783]" ✓ (verified against Composer 2 report §4.1)
|
| 290 |
+
- "Remove the length-standardization term" ✓
|
| 291 |
+
- "Do NOT normalize group advantages by std" ✓
|
| 292 |
+
- "k1 estimator (-log r)" ✓ (verbatim from report §4.1 L488-497)
|
| 293 |
+
- "DAPO overlong-rollout masking explicitly tried and rejected" ✓ (verbatim from report L362)
|
| 294 |
+
|
| 295 |
+
### 4.2 research/10: DAPO overlong masking rejection — ACCURATELY ATTRIBUTED
|
| 296 |
+
|
| 297 |
+
The note correctly attributes the DAPO overlong masking rejection to small-scale experiments. The Composer 2 report states "at small scale" — this qualifier is preserved in research/10. ACCURATE.
|
| 298 |
+
|
| 299 |
+
### 4.3 The k1-in-reward claim for verl — PARTIALLY UNVERIFIED
|
| 300 |
+
|
| 301 |
+
`kl_in_reward.py:12` claims: "verl adopted k1-in-reward as its *only* reverse-KL option." This appears in the docstring as motivation but has not been verified against verl source code in this review. The claim is consistent with verl's documented `kl_penalty="kl"` option (= `logp - ref_logp` = k1) but verl also supports `kl_penalty="low_var_kl"` (= k3). **Claiming it's the "only" option is likely an overclaim** — verl supports both k1 and k3 (see `research/notes/eks-primary-sagemaker-hybrid-architecture-and-the-minimal-repo-delta.md:58` which mentions `kl_coef=0.001` without specifying which estimator). **This claim should be softened to "verl defaults to / recommends k1-in-reward."**
|
| 302 |
+
|
| 303 |
+
### 4.4 Comedy of Estimators claim — SUPPORTED BUT IMPRECISE
|
| 304 |
+
|
| 305 |
+
The repo claims the Comedy of Estimators paper shows "k1-in-reward improves OOD generalization; k3-in-reward can collapse." The abstract says "biased gradient configurations → instabilities; unbiased → better OOD." The repo's characterization maps the abstract's claim to specific estimators (k1 vs k3) and specific placements (in-reward vs in-loss). This mapping is directionally correct based on the mathematical argument (k1-in-reward has unbiased gradient; k3-in-loss has biased gradient for the in-reward objective) but the paper's own empirical scope (7B/8B models, reasoning tasks) is not mentioned in the repo's framing.
|
| 306 |
+
|
| 307 |
+
**Specific gap:** The Comedy of Estimators paper (v3 2026-03-18) was updated after the repo's k1-in-reward implementation (Wave 20, 2026-06-09). The v3 abstract now includes Qwen3-4B-Instruct-2507, suggesting the empirical results were expanded. Whether the v3 conclusions changed from v1 is unknown (no full text).
|
| 308 |
+
|
| 309 |
+
### 4.5 TRL k3 in production — CORRECTLY DIAGNOSED AND GUARDED
|
| 310 |
+
|
| 311 |
+
The repo's honest finding (`composer_trainer.py:708-716`, `make_dr_grpo_config` docstring):
|
| 312 |
+
> "GRPOTrainer._compute_loss uses the k3 estimator `exp(ref_logp - logp) - (ref_logp - logp) - 1` (trl/trainer/grpo_trainer.py ~L2513), NOT the k1 estimator. k3 is Schulman's low-variance, always-non-negative KL approximation; k1 is its unbiased but higher-variance counterpart."
|
| 313 |
+
|
| 314 |
+
This is correct. The original GRPO paper (DeepSeekMath) explicitly uses the k3 estimator in the loss:
|
| 315 |
+
```
|
| 316 |
+
D_KL[π_θ||π_ref] = π_ref(o_{i,t}|q,o_{i,<t}) / π_θ(o_{i,t}|q,o_{i,<t})
|
| 317 |
+
- log(π_ref(o_{i,t}|q,o_{i,<t}) / π_θ(o_{i,t}|q,o_{i,<t})) - 1
|
| 318 |
+
```
|
| 319 |
+
(DeepSeekMath full text L1826, verified in this review)
|
| 320 |
+
|
| 321 |
+
The comment "k3 = k1 + O((Δlogp)²)" is also correct mathematically (Taylor expansion), and the repo correctly notes the delta is small for r≈1.
|
| 322 |
+
|
| 323 |
+
---
|
| 324 |
+
|
| 325 |
+
## 5. Issues, Gaps, and Open Items
|
| 326 |
+
|
| 327 |
+
### ISSUE 1: GSPO preset missing GSPO-scale clipping ranges — **UNADDRESSED, LOW-MEDIUM RISK**
|
| 328 |
+
|
| 329 |
+
The GSPO paper (arXiv:2507.18071, Section 5.1) explicitly states:
|
| 330 |
+
> "In GSPO, we set the left and right clipping ranges in Equation (5) to **3e-4 and 4e-4**, respectively. We compare against GRPO... and set the left and right clipping ranges in Equation (2) to **0.2 and 0.27**... Note that we observe a difference of **two orders of magnitude** in the fractions of clipped tokens between GSPO and GRPO."
|
| 331 |
+
|
| 332 |
+
The repo's GSPO preset does not set `epsilon` or `epsilon_high`, so it inherits TRL defaults (likely 0.2). This means the GSPO preset would actually behave more like a GRPO run because the clipping range is 2 orders of magnitude too wide — almost nothing gets clipped, making the sequence-level ratio identical to no-clip REINFORCE. **The GSPO preset is architecturally correct but operationally wrong without custom epsilon values.**
|
| 333 |
+
|
| 334 |
+
**Fix needed:**
|
| 335 |
+
```python
|
| 336 |
+
"gspo": {
|
| 337 |
+
"loss_type": "grpo",
|
| 338 |
+
"scale_rewards": "group",
|
| 339 |
+
"importance_sampling_level": "sequence",
|
| 340 |
+
"epsilon": 3e-4, # ← ADD: GSPO-scale clipping
|
| 341 |
+
"epsilon_high": 4e-4, # ← ADD: GSPO-scale clipping
|
| 342 |
+
"num_iterations": 1,
|
| 343 |
+
}
|
| 344 |
+
```
|
| 345 |
+
|
| 346 |
+
### ISSUE 2: CISPO preset missing explicit beta=0 — **MINOR, EASY FIX**
|
| 347 |
+
|
| 348 |
+
The MiniMax-M1 paper explicitly states no KL penalty in CISPO. The `cispo` preset doesn't set `beta=0.0` explicitly (unlike the `dapo` preset which does). If TRL defaults beta to something non-zero, the CISPO behavior would diverge from the paper. Should add `"beta": 0.0` to the `cispo` preset for safety.
|
| 349 |
+
|
| 350 |
+
### ISSUE 3: kl_in_reward deferred path and num_iterations>1 — **CORRECTNESS BUG, LOW RISK IN PRACTICE**
|
| 351 |
+
|
| 352 |
+
As noted in §3.3, the deferred penalty path (`_kl_in_reward_applied=False`) computes KL against the model's current weights at `_compute_loss` time. With `num_iterations>1`, this is no longer the sampling-time policy. `validate_kl_in_reward_config` does not guard against `num_iterations>1`. Since all current presets use `num_iterations=1`, this is currently moot but should be added as a precondition check.
|
| 353 |
+
|
| 354 |
+
**Fix needed in `validate_kl_in_reward_config`:**
|
| 355 |
+
```python
|
| 356 |
+
# Add after existing checks:
|
| 357 |
+
num_iterations = kwargs.get("num_iterations", 1)
|
| 358 |
+
if int(num_iterations) > 1:
|
| 359 |
+
raise ValueError(
|
| 360 |
+
"kl_in_reward=True requires num_iterations=1: with >1 inner update epochs, "
|
| 361 |
+
"the deferred-penalty path computes KL against the updated policy, not the "
|
| 362 |
+
"sampling-time old policy. This is incorrect."
|
| 363 |
+
)
|
| 364 |
+
```
|
| 365 |
+
|
| 366 |
+
### ISSUE 4: CISPO scale_rewards="none" vs paper's std-norm — **DOCUMENTED DESIGN CHOICE, NOT A BUG**
|
| 367 |
+
|
| 368 |
+
The CISPO paper uses group std-norm in the advantage. The repo sets `scale_rewards="none"` for CISPO. This is a deliberate choice to keep consistency with the Dr.GRPO regime. It is not a mistake but should be explicitly documented in the preset comment — currently the comment says "eps_max≈5 (ScaleRL)" but doesn't explain the std-norm deviation from the paper.
|
| 369 |
+
|
| 370 |
+
### ISSUE 5: "verl only reverse-KL option" overclaim in docstring — **DOCUMENTATION INACCURACY**
|
| 371 |
+
|
| 372 |
+
The `kl_in_reward.py:12` docstring claims "verl adopted k1-in-reward as its *only* reverse-KL option." This appears to be an overclaim — verl supports both `kl_penalty="kl"` (k1) and `kl_penalty="low_var_kl"` (k3-like). The phrase "only" should be removed or qualified.
|
| 373 |
+
|
| 374 |
+
### ISSUE 6: Comedy of Estimators task/scale scope not flagged — **MINOR DOCUMENTATION GAP**
|
| 375 |
+
|
| 376 |
+
The repo uses the Comedy of Estimators paper as motivation for k1-in-reward on a 1T MoE agentic system, but the paper's empirical scope is 7B/8B models on math reasoning. This extrapolation is not flagged anywhere in the codebase. Should add a caveat in the docstring noting the scale/task gap.
|
| 377 |
+
|
| 378 |
+
---
|
| 379 |
+
|
| 380 |
+
## 6. Summary Table
|
| 381 |
+
|
| 382 |
+
| Paper | Loss math delta vs GRPO | KL treatment | repo accuracy |
|
| 383 |
+
|---|---|---|---|
|
| 384 |
+
| GRPO (2402.03300) | Baseline: 1/|o_i| length-norm + std-norm advantage | k3 in loss, beta=0.04 | Reference |
|
| 385 |
+
| Dr.GRPO (2503.20783) | Remove 1/|o_i| AND /std(R) | k3 in loss (TRL) OR k1-in-reward (opt-in) | ACCURATE |
|
| 386 |
+
| DAPO (2503.14476) | Decoupled clip-higher (ε_high>ε_low) + overlong filter + dynamic sampling | beta=0 (KL removed) | ACCURATE (ε_high=0.28 correct) |
|
| 387 |
+
| GSPO (2507.18071) | Sequence-level IS ratio (geomean); sequence-level clip | KL not primary focus | ARCHITECTURALLY CORRECT but OPERATIONALLY WRONG (missing GSPO-scale epsilon 3e-4/4e-4) |
|
| 388 |
+
| CISPO (2506.13585) | Detached clipped IS weight as scalar coefficient on log π; every token keeps gradient | beta=0 (no KL) | MOSTLY CORRECT; missing explicit beta=0; scale_rewards deviation from paper (deliberate) |
|
| 389 |
+
| Comedy of Estimators (2512.21852) | KL estimator comparison: k1 (unbiased) vs k3 (biased-gradient in some placements) → unbiased → better OOD | k1-in-reward recommended | SUPPORTED but full-text unavailable; scale/task scope not flagged |
|
| 390 |
+
|
| 391 |
+
---
|
| 392 |
+
|
| 393 |
+
## 7. Load-Bearing Verbatim Quotes
|
| 394 |
+
|
| 395 |
+
**Dr.GRPO paper (arXiv:2503.20783, Sec. 3.2):**
|
| 396 |
+
> "To avoid the aforementioned optimization bias in GRPO, we propose to simply remove the `1/|o_i|` and `std({R(q,o_1),...,R(q,o_G)})` normalization terms."
|
| 397 |
+
|
| 398 |
+
**Composer 2 Technical Report (§4.1, verified via research/notes/composer-2-technical-report.md:442-497):**
|
| 399 |
+
> "Many open-source implementations of RL estimate KL with the estimator k3 = (r−1) − log r... the variance increases drastically as p and q diverge... Therefore, we use the standard estimator **k1 = −log r** instead."
|
| 400 |
+
> "We did not see benefits with overlong masking at small scale and opted not to mask rollouts that exceed the maximum sequence length."
|
| 401 |
+
|
| 402 |
+
**GSPO paper (arXiv:2507.18071, Sec. 5.1):**
|
| 403 |
+
> "Note that we observe a difference of **two orders of magnitude** in the fractions of clipped tokens between GSPO and GRPO (while adjusting the clipping ranges does not alter the disparity in magnitude)."
|
| 404 |
+
|
| 405 |
+
**CISPO paper (arXiv:2506.13585, Sec. 3.1):**
|
| 406 |
+
> "Rather than clipping the token updates as in PPO/GRPO, we instead clip the importance sampling weight in Eq. 3 to stabilize training... **CISPO... always leverages all tokens for gradient computations.**"
|
| 407 |
+
> "There is no KL penalty term in CISPO similar to other recent works."
|
| 408 |
+
|
| 409 |
+
---
|
| 410 |
+
|
| 411 |
+
## 8. Priority Action Items
|
| 412 |
+
|
| 413 |
+
1. **(HIGH)** Fix GSPO preset: add `"epsilon": 3e-4, "epsilon_high": 4e-4`. Without GSPO-scale clipping, the sequence IS ratio will rarely be clipped and the behavior degrades to no-clip sequence-level REINFORCE — not what Qwen's GSPO experiments demonstrate.
|
| 414 |
+
|
| 415 |
+
2. **(MEDIUM)** Add `"beta": 0.0` to CISPO preset for explicitness.
|
| 416 |
+
|
| 417 |
+
3. **(LOW)** Add `num_iterations=1` guard to `validate_kl_in_reward_config` for the deferred-penalty correctness.
|
| 418 |
+
|
| 419 |
+
4. **(LOW)** Soften the "verl *only* reverse-KL option" claim in `kl_in_reward.py:12` to "verl defaults to / recommends k1-in-reward."
|
| 420 |
+
|
| 421 |
+
5. **(LOW)** Add a docstring note to the k1-in-reward motivation flagging the Comedy of Estimators paper was tested at 7B/8B math scale, not at 1T MoE agentic scale — the extrapolation is reasonable but should be disclosed.
|
|
@@ -0,0 +1,309 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-read critical review: DiLoCo / Streaming DiLoCo family
|
| 2 |
+
**Date:** 2026-06-09
|
| 3 |
+
**Sources read verbatim:**
|
| 4 |
+
- arXiv:2311.08105 (DiLoCo, Douillard et al. 2023/2024, HTML full text, 9503 words)
|
| 5 |
+
- arXiv:2501.18512 (Streaming DiLoCo, Douillard et al. 2025, HTML full text, 19297 words)
|
| 6 |
+
- arXiv:2502.12996 (Eager Updates, Kale, Douillard, Donchev 2025, abstract)
|
| 7 |
+
- torchft/local_sgd.py (live main branch, fetched 2026-06-10)
|
| 8 |
+
|
| 9 |
+
**Repo artefacts cross-checked:**
|
| 10 |
+
- `composer_replication/diloco/__init__.py`
|
| 11 |
+
- `composer_replication/diloco/serverless/allreduce.py`
|
| 12 |
+
- `docs/adrs/ADR-003-diloco-impl.md`
|
| 13 |
+
- `docs/adrs/ADR-005-serverless-diloco.md`
|
| 14 |
+
- `research/02-diloco-family.md`
|
| 15 |
+
- `research/design-F4-decoupled-diloco-s3.md`
|
| 16 |
+
|
| 17 |
+
---
|
| 18 |
+
|
| 19 |
+
## 1. Primary-source extraction: DiLoCo (arXiv:2311.08105)
|
| 20 |
+
|
| 21 |
+
### 1.1 Algorithm
|
| 22 |
+
|
| 23 |
+
Algorithm 1 (paper §2) is:
|
| 24 |
+
|
| 25 |
+
1. Outer step t = 1…T
|
| 26 |
+
2. Each worker i: θ_i^(t) ← θ^(t-1) (re-initialized from global params)
|
| 27 |
+
3. H inner steps: θ_i^(t) ← InnerOpt(θ_i^(t), ∇L) using AdamW
|
| 28 |
+
4. Outer gradient: **Δ^(t) = (1/k) Σ_i ( θ^(t-1) − θ_i^(t) )**
|
| 29 |
+
5. Outer update: θ^(t) ← OuterOpt(θ^(t-1), Δ^(t))
|
| 30 |
+
|
| 31 |
+
**Sign convention (paper, verbatim):** "the delta in parameters space…is computed per worker…Δ^(t) ← (1/k) Σ_i (θ^(t-1) − θ_i^(t))". This is θ_initial minus θ_local — the **negative of the local update direction**.
|
| 32 |
+
|
| 33 |
+
### 1.2 Outer optimizer hyperparameters (paper §3.2 ablation)
|
| 34 |
+
|
| 35 |
+
Paper reports tuning outer optimizer across SGD, SGDM, Nesterov, Adam. Results in Table 5:
|
| 36 |
+
|
| 37 |
+
> "the setting with outer learning rate equal to **0.7** and outer momentum equal to **0.9** is very robust, and it is adopted for all our experiments throughout."
|
| 38 |
+
|
| 39 |
+
Chosen values (bold in Table 5): **outer LR = 0.7, outer momentum = 0.9, Nesterov.**
|
| 40 |
+
|
| 41 |
+
### 1.3 Sync frequency H
|
| 42 |
+
|
| 43 |
+
Figure 4 sweeps H ∈ {50, 100, 250, **500**, 1000, 2000}. Main experiment default: **H = 500**.
|
| 44 |
+
|
| 45 |
+
> "communicating more frequently than H = 500 steps leads to diminishing returns. Moreover, the performance degradation is very mild up to H = 1000 steps."
|
| 46 |
+
|
| 47 |
+
H = 2000 causes meaningful degradation. H = 500 chosen as best trade-off.
|
| 48 |
+
|
| 49 |
+
### 1.4 Heterogeneous / unreliable workers
|
| 50 |
+
|
| 51 |
+
Section 3.1 "Adaptive compute pool": workers can join/leave; final perplexity tracks total compute budget regardless of schedule. "Models quality is affected by the total amount of compute, but not as much by how such computed is allocated over time."
|
| 52 |
+
|
| 53 |
+
Section 3.1 "Asynchronous Communication": outer gradients dropped with probability up to 50%. At 50% drop rate, perplexity degrades only 2.1% relative to perfect communication.
|
| 54 |
+
|
| 55 |
+
**Limitation (§5):** "The version of DiLoCo presented here assumes that all workers are **homogeneous**. However, in practice workers might operate at wildly different speed." Heterogeneous/async DiLoCo is explicitly listed as future work.
|
| 56 |
+
|
| 57 |
+
### 1.5 What the original paper does NOT cover
|
| 58 |
+
|
| 59 |
+
- No scaling laws for optimal H vs model size.
|
| 60 |
+
- No quantization (mentioned pruning in appendix: 50% sign-pruning costs only +0.39% PPL).
|
| 61 |
+
- Models only up to 400M params.
|
| 62 |
+
- All experiments start from a 24k-step pretrained checkpoint (cold-start is studied but as secondary).
|
| 63 |
+
- No measurement of real wall-clock with actual network constraints.
|
| 64 |
+
|
| 65 |
+
---
|
| 66 |
+
|
| 67 |
+
## 2. Primary-source extraction: Streaming DiLoCo (arXiv:2501.18512)
|
| 68 |
+
|
| 69 |
+
### 2.1 Three contributions
|
| 70 |
+
|
| 71 |
+
1. **Fragment streaming (§2.2):** Partition model into P fragments. Each fragment syncs every H steps, but fragments are staggered (offsets t_p). Peak bandwidth reduced by factor |p|/L (fragment size/total layers).
|
| 72 |
+
|
| 73 |
+
2. **Overlapping communication with computation (§2.3):** τ parameter (inner overlap delay). Fragment's allreduce initiated at step t, results applied τ steps later. Workers keep training during those τ steps. For heterogeneous workers: use per-worker τ values. Robust to τ up to ~5 inner steps.
|
| 74 |
+
|
| 75 |
+
3. **Low-precision outer gradients (§2.4):** Outer gradients (what's communicated, not optimizer state) quantized to **FP4 = E3M0** (1 sign bit, 3 exponent bits, 0 mantissa bits). Accumulation in FP32. No regression found at 4-bit. Applied at send time, before allreduce.
|
| 76 |
+
|
| 77 |
+
**Combined result:** "reducing required bandwidth by two orders of magnitude" (abstract). Table 1 shows Data-Parallel 441 TB vs Streaming DiLoCo 1.10 TB (≈400× total bits reduction for 1B model).
|
| 78 |
+
|
| 79 |
+
### 2.2 Outer hyperparameters in Streaming DiLoCo
|
| 80 |
+
|
| 81 |
+
> "The main hyperparameter of DiLoCo is its outer learning rate; we tuned it to be optimal at small scale at **0.4**, and kept it fixed across all scales."
|
| 82 |
+
|
| 83 |
+
Streaming paper uses **outer LR = 0.4**, not 0.7. Momentum: not explicitly restated; inherits from original paper's Nesterov momentum = 0.9.
|
| 84 |
+
|
| 85 |
+
H values in experiments: **H = 30 and H = 100** (not H = 500 from the original paper).
|
| 86 |
+
|
| 87 |
+
### 2.3 Fragment scheduling
|
| 88 |
+
|
| 89 |
+
Algorithm 2: condition `t - t_p mod H == 0` to decide which fragment syncs at step t. Fragmented offsets allow continuous streaming. "As we increase model scale, the fragment definition…is maintained, which means that larger models have more fragments."
|
| 90 |
+
|
| 91 |
+
### 2.4 Heterogeneous workers
|
| 92 |
+
|
| 93 |
+
§3.3.2 (Overlapping with slack between workers): per-worker τ_m handles execution speed differences. "the loss degradation is limited under a delay of up to **5 inner steps**." Above τ ≈ 5, degradation increases.
|
| 94 |
+
|
| 95 |
+
### 2.5 Memory overhead
|
| 96 |
+
|
| 97 |
+
66% more memory than Data-Parallel (outer parameters copy + Nesterov state). For Streaming: only active fragment's outer state needs to be in HBM; rest can be on CPU. For 100B model with 3-layer fragment out of 108 layers: ~2% additional memory.
|
| 98 |
+
|
| 99 |
+
### 2.6 "Liu et al. 2024a" citation in Streaming paper
|
| 100 |
+
|
| 101 |
+
The Streaming paper references "Liu et al. (2024a)" = Bo Liu, Rachita Chhaparia, Arthur Douillard, Satyen Kale, Andrei A. Rusu, Jiajun Shen, Arthur Szlam, Marc'Aurelio Ranzato. "Asynchronous local-sgd training for language modeling." arXiv:2401.09135. This is cited in the context of per-worker slack for heterogeneous workers.
|
| 102 |
+
|
| 103 |
+
---
|
| 104 |
+
|
| 105 |
+
## 3. Primary-source extraction: torchft/local_sgd.py
|
| 106 |
+
|
| 107 |
+
### 3.1 Sign convention (verbatim from `_save_grads`)
|
| 108 |
+
|
| 109 |
+
```python
|
| 110 |
+
def _save_grads(self) -> None:
|
| 111 |
+
with torch.no_grad():
|
| 112 |
+
for name, p in self._model_fragment.named_parameters():
|
| 113 |
+
...
|
| 114 |
+
pseudogradient = (
|
| 115 |
+
self.original_parameters[name].to(p.device) - local_param
|
| 116 |
+
)
|
| 117 |
+
self._grads[name] = pseudogradient
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
`original_parameters[name]` = θ_initial, `local_param` = θ_local.
|
| 121 |
+
**torchft computes: pseudogradient = θ_initial − θ_local** (same sign as paper's Δ).
|
| 122 |
+
|
| 123 |
+
### 3.2 Outer optimizer application chain (verbatim from `perform_sync`)
|
| 124 |
+
|
| 125 |
+
```
|
| 126 |
+
1. _save_local_parameters() # saves θ_local into _local_parameters
|
| 127 |
+
2. restore_parameters() # p.data ← θ_initial (from original_parameters)
|
| 128 |
+
3. _set_grads() # p.grad ← averaged_pseudogradient
|
| 129 |
+
4. _outer_optimizer.step() # SGD Nesterov: p.data ← θ_initial - lr*Nesterov(avg_pseudograd)
|
| 130 |
+
5. save_parameters() # original_parameters ← p.data (θ_outer_updated)
|
| 131 |
+
6. _merge_parameters() # p.data ← alpha*p.data + (1-alpha)*_local_parameters
|
| 132 |
+
# For alpha=0.0 (vanilla): p.data ← θ_local
|
| 133 |
+
```
|
| 134 |
+
|
| 135 |
+
Post-sync state: `p.data = θ_local`, `original_parameters = θ_outer_updated`.
|
| 136 |
+
Next outer round: `restore_parameters()` sets `p.data = θ_outer_updated`; H inner steps produce `θ_local_next`.
|
| 137 |
+
Pseudograd_next = θ_outer_updated − θ_local_next. **This is faithful to Algorithm 1.**
|
| 138 |
+
|
| 139 |
+
### 3.3 Fragment rotation logic
|
| 140 |
+
|
| 141 |
+
```python
|
| 142 |
+
def _current_fragment(self) -> int:
|
| 143 |
+
step = self._manager.current_step()
|
| 144 |
+
return step % len(self._fragments)
|
| 145 |
+
```
|
| 146 |
+
|
| 147 |
+
For vanilla (1 fragment): always fragment 0. For Streaming with P fragments: round-robin by `step % P`.
|
| 148 |
+
|
| 149 |
+
`start_quorum()` is called at `step = sync_every - fragment_sync_delay` (prepare_sync time).
|
| 150 |
+
`current_step()` is read at `step = sync_every` (perform_sync time).
|
| 151 |
+
|
| 152 |
+
### 3.4 `_use_async_quorum` constraint
|
| 153 |
+
|
| 154 |
+
`DiLoCo.__init__` raises `ValueError` if `manager._use_async_quorum` is truthy. Must be False for synchronous quorum (as in the paper's design).
|
| 155 |
+
|
| 156 |
+
---
|
| 157 |
+
|
| 158 |
+
## 4. Repo correctness findings
|
| 159 |
+
|
| 160 |
+
### F1. Sign convention — CORRECT
|
| 161 |
+
|
| 162 |
+
`__init__.py` lines 17–38 documents the convention: "pseudograd = θ_initial - θ_local (per torchft's `_save_grads()`)". The code's outer SGD uses standard `p.data ← p.data - lr * grad` where `p.data = θ_initial` (restored before outer step) and `grad = avg_pseudograd = avg(θ_initial - θ_local)`. The arithmetic is correct and matches Algorithm 1. The sign-convention test in `spikes/008-streaming-diloco/tests/test_diloco_smoke.py` is appropriate insurance.
|
| 163 |
+
|
| 164 |
+
**No bug. No mischaracterization.**
|
| 165 |
+
|
| 166 |
+
### F2. Outer optimizer hyperparams — MINOR DIVERGENCE
|
| 167 |
+
|
| 168 |
+
Repo default: `outer_lr=0.7, outer_momentum=0.9, nesterov=True` (lines 69–72 of `__init__.py`). These match the **original DiLoCo paper** (§3.2, Table 5 bold values).
|
| 169 |
+
|
| 170 |
+
The **Streaming DiLoCo paper** uses `outer_lr=0.4` tuned at small scale. The repo uses original-paper values, which is appropriate for v0.1 (vanilla DiLoCo). But if someone enables Streaming by setting `fragment_sync_delay>0` and `model_fragments=[f0,...,fN]`, they will use lr=0.7 where the Streaming paper used lr=0.4. This is not a correctness bug but may underperform Streaming DiLoCo in practice.
|
| 171 |
+
|
| 172 |
+
**Recommendation:** Add a note in `make_diloco_outer_loop` docstring: "Streaming DiLoCo (Douillard et al. 2025) tunes outer_lr=0.4; the default 0.7 is optimal for vanilla DiLoCo."
|
| 173 |
+
|
| 174 |
+
### F3. Default sync_every=100 vs paper's H=500
|
| 175 |
+
|
| 176 |
+
Repo default: `sync_every=100` (line 72 of `__init__.py`).
|
| 177 |
+
|
| 178 |
+
Original DiLoCo paper main experiment: **H=500**. OpenDiLoCo: H=125. Streaming paper: H=30 or H=100.
|
| 179 |
+
|
| 180 |
+
The default H=100 is consistent with the Streaming paper and OpenDiLoCo but does NOT match the original paper's "main experiment default." The ADR-003 docstring says "Default hyperparams (DiLoCo paper §3.2): outer_lr = 0.7, outer_momentum = 0.9, Nesterov" — this is correct for lr/momentum but the paper's main default H is 500, not 100.
|
| 181 |
+
|
| 182 |
+
**Not a correctness bug** (H=100 is within the paper's tested range and performs well). But the docstring claiming it cites "DiLoCo paper §3.2" for the H default is misleading — §3.2 chooses H=500.
|
| 183 |
+
|
| 184 |
+
**ADR-005** says "H = 500-1000 inner steps" — this is consistent with the original paper's claimed range, though the practical default in code is 100.
|
| 185 |
+
|
| 186 |
+
### F4. Author misattribution of Streaming DiLoCo — INCORRECT
|
| 187 |
+
|
| 188 |
+
`__init__.py` line 8: `"Streaming DiLoCo (Liu et al. 2025)"`.
|
| 189 |
+
|
| 190 |
+
`design-F4-decoupled-diloco-s3.md` line 109: `"Streaming DiLoCo (Liu et al. 2025, 'Eager Updates for Overlapped Communication in DiLoCo', arXiv:2501.18512)"`.
|
| 191 |
+
|
| 192 |
+
**All three facts here are wrong:**
|
| 193 |
+
- arXiv:2501.18512 is authored by **Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, et al.** — not "Liu et al."
|
| 194 |
+
- The title of arXiv:2501.18512 is "**Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch**" — not "Eager Updates for Overlapped Communication in DiLoCo."
|
| 195 |
+
- "Eager Updates for Overlapped Communication and Computation in DiLoCo" is a **separate paper**, arXiv:2502.12996, by Satyen Kale, Arthur Douillard, Yanislav Donchev — also not "Liu et al."
|
| 196 |
+
- "Liu et al. (2024a)" is the correct citation for the **Async Local-SGD** paper (arXiv:2401.09135), not Streaming DiLoCo.
|
| 197 |
+
|
| 198 |
+
The confusion appears to be: design-F4 imported "Liu et al. 2025" from a different context (possibly confusing with Dr.GRPO or another Liu et al. paper), attached it to the wrong arXiv ID, and applied the wrong title. The `__init__.py` propagated the wrong author name from design-F4.
|
| 199 |
+
|
| 200 |
+
**Correct attributions:**
|
| 201 |
+
- Vanilla DiLoCo: Douillard et al. 2023/2024, arXiv:2311.08105.
|
| 202 |
+
- Streaming DiLoCo: **Douillard et al. 2025**, arXiv:2501.18512.
|
| 203 |
+
- Eager Updates: Kale, Douillard, Donchev 2025, arXiv:2502.12996 (companion/workshop paper with noted text overlap with 2501.18512).
|
| 204 |
+
|
| 205 |
+
**Files to fix:** `composer_replication/diloco/__init__.py` line 8; `research/design-F4-decoupled-diloco-s3.md` line 109.
|
| 206 |
+
|
| 207 |
+
### F5. "Streaming degrades to vanilla" — IMPRECISE BUT DEFENSIBLE
|
| 208 |
+
|
| 209 |
+
`design-F4-decoupled-diloco-s3.md` line 116 states:
|
| 210 |
+
|
| 211 |
+
> "prepare_sync blocks for the full S3 rendezvous and `fragment_sync_delay` buys zero overlap — **Streaming degrades to vanilla**, correctly but without the comm/compute overlap benefit."
|
| 212 |
+
|
| 213 |
+
This is directionally true but imprecise. With synchronous `ObjectStoreAllReduce`:
|
| 214 |
+
|
| 215 |
+
- **What is PRESERVED:** fragment streaming (per-fragment partial sync, multiple fragments synced in staggered rotation). A 4-fragment model still syncs only 1/4 of parameters per outer step boundary; peak bandwidth is still reduced.
|
| 216 |
+
- **What is LOST:** the τ (fragment_sync_delay) overlap benefit — inner training steps can no longer run while the allreduce is in flight because the allreduce blocks.
|
| 217 |
+
|
| 218 |
+
So the correct characterization is: "synchronous allreduce loses **overlap** (Contribution 2 of Streaming DiLoCo), but **fragment partial sync** (Contribution 1) still works. fragment_sync_delay=0 with multiple fragments still gives partial-sync bandwidth savings."
|
| 219 |
+
|
| 220 |
+
The statement "degrades to vanilla" implies full-model-sync-at-H, which is not what happens for multi-fragment configurations. However, for the current Spike 008 configuration (single fragment, fragment_sync_delay=0), this degrades to exactly vanilla DiLoCo, which makes the statement accidentally correct for that specific case. The broader claim for multi-fragment Streaming is imprecise.
|
| 221 |
+
|
| 222 |
+
**design-F4 correctly notes the fix (non-blocking PUT returning deferred Work) and correctly defers it to Phase 5.** The characterization is acceptable as long as readers understand it applies to overlap loss, not to fragment streaming.
|
| 223 |
+
|
| 224 |
+
### F6. Quantization — NOT IMPLEMENTED, CORRECTLY OMITTED
|
| 225 |
+
|
| 226 |
+
`MockManager.allreduce` signature: `def allreduce(self, tensor, **_kwargs)` — the `should_quantize` keyword is silently absorbed by `**_kwargs` and passed to `ObjectStoreAllReduce`, which has no quantization logic. FP32 tensors are serialized directly via `torch.save`.
|
| 227 |
+
|
| 228 |
+
The Streaming paper's FP4 (E3M0) outer gradient quantization is not implemented. This is appropriate for v0.1 vanilla scope. The paper shows: E3M0 quantization cuts communicated bits by 8× (FP32→FP4) with no regression. The framework loses this efficiency benefit but is algorithmically correct.
|
| 229 |
+
|
| 230 |
+
**research/02** line 244 incorrectly describes Streaming DiLoCo as using "FP16 outer state" for compression. The paper uses **FP4 outer gradients** (what is transmitted), NOT FP16 optimizer state. These are different things. FP16 is used by OpenDiLoCo (separately) to cut payload 2×; Streaming DiLoCo uses FP4 = 8× reduction.
|
| 231 |
+
|
| 232 |
+
### F7. H range and sync cadence claims — MINOR INACCURACY in research/02
|
| 233 |
+
|
| 234 |
+
`research/02` Table §2 says Streaming DiLoCo uses "Continuous partial" sync frequency. This is accurate. But the bandwidth reduction column "~100× peak BW + frequency" is underestimated. The paper's Table 1 shows ≈400× total bits reduction (not 100×) for a 1B model. The "two orders of magnitude" phrasing in the abstract means ≥100×; the actual measured result is ≥400×.
|
| 235 |
+
|
| 236 |
+
### F8. Heterogeneous worker handling — PARTIALLY MISSING in research/02
|
| 237 |
+
|
| 238 |
+
`research/02` says Streaming DiLoCo has "better tolerance since communication is continuous, but still synchronous." This misses the per-worker τ mechanism the Streaming paper introduces. The paper explicitly shows: with τ_1=1, varying τ_2 up to 5 inner steps shows robust degradation curve. This is a first-class mechanism for heterogeneous workers, not just an incidental property.
|
| 239 |
+
|
| 240 |
+
The correct characterization: Streaming DiLoCo with per-worker τ tolerates timing heterogeneity of up to ~5 inner steps without significant degradation.
|
| 241 |
+
|
| 242 |
+
---
|
| 243 |
+
|
| 244 |
+
## 5. ADR-003 correctness summary
|
| 245 |
+
|
| 246 |
+
| Claim | Verdict |
|
| 247 |
+
|---|---|
|
| 248 |
+
| torchft computes pseudogradient = θ_initial − θ_local | CORRECT (verified from source) |
|
| 249 |
+
| Outer optimizer sign: no negation needed in our wrapper | CORRECT (SGD subtracts, pseudograd is already "subtract away from θ_local") |
|
| 250 |
+
| fragment_sync_delay > 0 requires CUDA streams | CORRECT (torchft uses `torch.Stream` for overlap; without a stream, overlap is serial) |
|
| 251 |
+
| Spike 008 uses vanilla (single fragment, delay=0) | CORRECT |
|
| 252 |
+
| Streaming is "a configuration-flag away" | CORRECT (same API, just different params) |
|
| 253 |
+
| torchft is Meta-maintained BSD-3 | CORRECT |
|
| 254 |
+
|
| 255 |
+
| Gap | Verdict |
|
| 256 |
+
|---|---|
|
| 257 |
+
| Default H=100 attributed to "DiLoCo paper §3.2" | MISLEADING — paper's main default is H=500; H=100 comes from OpenDiLoCo / Streaming paper range |
|
| 258 |
+
| "Liu et al. 2025" for Streaming DiLoCo | WRONG — should be Douillard et al. 2025 |
|
| 259 |
+
|
| 260 |
+
---
|
| 261 |
+
|
| 262 |
+
## 6. ADR-005 correctness summary
|
| 263 |
+
|
| 264 |
+
| Claim | Verdict |
|
| 265 |
+
|---|---|
|
| 266 |
+
| DiLoCo outer sync is once per H=500-1000 inner steps | CORRECT for original paper; Streaming paper uses H=30-100 |
|
| 267 |
+
| Pseudo-gradient size ~2 GB for 1B model in bf16 | CORRECT: 1B params × 2 bytes/param = 2 GB |
|
| 268 |
+
| Object-store rendezvous is bandwidth-efficient | CORRECT — well-reasoned; consistent with paper's communication profile |
|
| 269 |
+
|
| 270 |
+
---
|
| 271 |
+
|
| 272 |
+
## 7. DiLoCo scaling laws
|
| 273 |
+
|
| 274 |
+
No dedicated DiLoCo scaling-laws paper exists as of search date. The Streaming DiLoCo paper (§3.2.1) shows scaling experiments from 35M to 4B parameters but does not derive scaling laws for optimal H or outer optimizer settings. Table 4 of the original paper shows modest scaling (60M, 150M, 400M). DiLoCoX (arXiv:2506.21263) scales to 107B but is a different framework.
|
| 275 |
+
|
| 276 |
+
**The repo makes no explicit scaling-law claims for DiLoCo**, so no finding here.
|
| 277 |
+
|
| 278 |
+
---
|
| 279 |
+
|
| 280 |
+
## 8. Action items (priority-ordered)
|
| 281 |
+
|
| 282 |
+
1. **[HIGH] Fix author attribution in two files:**
|
| 283 |
+
- `composer_replication/diloco/__init__.py` line 8: change `"Streaming DiLoCo (Liu et al. 2025)"` to `"Streaming DiLoCo (Douillard et al. 2025)"`.
|
| 284 |
+
- `research/design-F4-decoupled-diloco-s3.md` line 109: correct from `"Liu et al. 2025, 'Eager Updates…', arXiv:2501.18512"` to `"Douillard et al. 2025, 'Streaming DiLoCo…', arXiv:2501.18512"`.
|
| 285 |
+
|
| 286 |
+
2. **[MEDIUM] Add outer_lr note for Streaming:**
|
| 287 |
+
- `make_diloco_outer_loop` docstring: note that Streaming DiLoCo (2501.18512) uses outer_lr=0.4 tuned at small scale, while the default 0.7 is optimal for vanilla.
|
| 288 |
+
|
| 289 |
+
3. **[MEDIUM] Fix research/02 compression claim:**
|
| 290 |
+
- Line 244: "FP16 outer state" → "FP4 (E3M0) outer gradients" (what is communicated; accumulation stays FP32).
|
| 291 |
+
- Bandwidth reduction: "~100× peak BW + frequency" → "≥400× total bits + peak BW reduction" to match Table 1 of the Streaming paper.
|
| 292 |
+
|
| 293 |
+
4. **[LOW] Clarify "degrades to vanilla" in design-F4:**
|
| 294 |
+
- The current text is accurate for the v0.1 single-fragment case. For multi-fragment configurations, tighten to: "fragment partial-sync bandwidth savings are preserved; only the τ overlap benefit is lost with synchronous allreduce."
|
| 295 |
+
|
| 296 |
+
5. **[LOW] Fix H=100 default source attribution:**
|
| 297 |
+
- `__init__.py` docstring says "Default hyperparams (DiLoCo paper §3.2)" — add that §3.2 uses H=500 for main experiments; H=100 matches Streaming/OpenDiLoCo range.
|
| 298 |
+
|
| 299 |
+
6. **[FUTURE] Phase-5 upgrade per design-F4:**
|
| 300 |
+
- Non-blocking `allreduce_async` in `ObjectStoreAllReduce` to realize genuine τ overlap on S3. ~60 LOC, deferred post-Phase-4.
|
| 301 |
+
|
| 302 |
+
---
|
| 303 |
+
|
| 304 |
+
## 9. What the papers say that the repo does NOT cover (gaps, not bugs)
|
| 305 |
+
|
| 306 |
+
- **FP4 quantization of outer gradients** (Streaming paper §2.4): not implemented. Worth noting as Phase-5 item alongside the async allreduce.
|
| 307 |
+
- **Per-worker τ for heterogeneous devices** (Streaming paper §3.3.2): the API supports different `fragment_sync_delay` values but there is no orchestration layer to set per-replica τ based on observed step latency. This is more of an RL orchestration concern than a DiLoCo wrapper concern.
|
| 308 |
+
- **Decoupled DiLoCo** (DeepMind blog 2025, Gemma 4): a Pathways-style async variant not in the literature yet; no implementation expected.
|
| 309 |
+
- **Async outer gradient application** (arXiv:2401.09135, Liu et al. 2024a = Async Local-SGD): delayed Nesterov (DN) optimizer + Dynamic Local Updates (DyLU) for heterogeneous workers. Not needed for v0.1 but relevant if serverless executors have highly variable step times.
|
|
@@ -0,0 +1,298 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-Read: World-Model / Deliberation Literature — Critical Review
|
| 2 |
+
**Cluster 6 / Novel-Extension Guard**
|
| 3 |
+
**Reviewer:** critical pipeline subagent
|
| 4 |
+
**Date:** 2026-06-09
|
| 5 |
+
**Sources fetched:** MuZero (1911.08265), DreamerV3 (2301.04104), CWM (2510.02387), Chain-of-World (2603.03195), foresight-governance (2601.03905), From-Word-to-World (2512.18832), Predictive-Causal Gap (2605.05029), Reasoning-Tool-Compete/DART (2602.00994), Myopic Planning (2605.06840), Negative Gradient/LLD (2505.18830), RAFT (2504.11343), Near-miss negatives (2503.14391).
|
| 6 |
+
**Final report reviewed:** `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` sections 2–4.
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## Executive Summary
|
| 11 |
+
|
| 12 |
+
The report's world-model / deliberation section (§2–4) is well-structured and intellectually honest about uncertainty, but contains five factual misreadings of primary sources, three overclaims dressed as conditional commitments, and two omissions that materially weaken the evidence base. The most serious finding is a CWM misread: the paper does NOT "train on all trajectories for the world-model head, reserving success-filtering only for the RL reward" in the sense the report implies — CWM uses a *mid-training stage* architecture, not an auxiliary head on a policy network, and the "train on all" decision applies to a separate, structurally distinct training phase, not an add-on loss riding the policy head. The report imports this result as license for an "aux head trains on all" design that the source does not demonstrate. The other misreadings are: (a) CWM's 65.8% score requires test-time scaling and is not the base score; (b) the Chain-of-World (2603.03195) paper is a robotics/embodied VLA paper, not an SWE paper; (c) the foresight-governance paper (2601.03905) is VLM/VQA, not SWE; (d) the Predictive-Causal Gap paper (2605.05029) is a single-author preprint with linear-Gaussian proofs and a small Duffing-GRU sweep — the report presents it as if the SWE mixed-timescale argument is a theorem about the proposed system, which it is not. The overclaims are: the report commits "parameter isolation eliminates the interference risk" when 2602.00994 shows interference on parameter-isolated LoRA modules too; the report treats Foresight@k as a standard metric when it is a proposed construct with no published baseline; and the "two hard prune gates resolve the central question" framing obscures that neither gate addresses the predictive-causal gap the report itself invokes. The omissions are: no paper in the cluster studies next-state-prediction as an auxiliary loss on a policy network for software engineering tasks — the exact configuration proposed — so the evidentiary basis for the aux-loss design rests entirely on analogical transfer from CWM (different architecture) and MuZero/Dreamer (different domain). The report acknowledges uncertainty but does not flag this as the null-evidence zone it is.
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
## Section 2: World-Model Goal — Source-by-Source Findings
|
| 17 |
+
|
| 18 |
+
### Finding 2.1 — CWM (arXiv:2510.02387): Misread of "trains on all" + score overclaim [CRITICAL]
|
| 19 |
+
|
| 20 |
+
**What the report says (§2, line 33):**
|
| 21 |
+
> "Meta's Code World Model mid-trains a 32B model on observation-action trajectories to predict next program state, reaching **65.8% on SWE-bench Verified** — crucially training *on all* trajectories for the world-model head, reserving success-filtering only for the RL reward [13]."
|
| 22 |
+
|
| 23 |
+
**What the source actually says:**
|
| 24 |
+
|
| 25 |
+
The CWM paper uses a three-phase training pipeline (pre-training → mid-training → post-training/RL). The "train on all" decision is a mid-training data decision for an entire separate training stage, verbatim:
|
| 26 |
+
|
| 27 |
+
> "Because our goal with the ForagerAgent data is to learn a comprehensive world model of agentic interactions with code environments, **we do not filter trajectories based on whether they succeed at bug or issue resolution**." (Section 2.2, ForagerAgent)
|
| 28 |
+
|
| 29 |
+
This is *not* an auxiliary loss added to a policy head. CWM is mid-trained as a *general purpose* next-state predictor in a dedicated training phase, separate from RL, using 3M ForagerAgent trajectories from 10.2k images. The world modeling capability is baked into the base model before policy optimization begins. The RL stage (Section 5.3.1) then applies success-filtered rewards *on top of* a model that already has world-modeling capability from mid-training.
|
| 30 |
+
|
| 31 |
+
**Why this matters for the proposed design:** The report uses this citation as support for "a world-model aux head can train on all branches during RL." CWM does NOT support this. CWM supports "dedicate a mid-training stage to train-on-all dynamics learning before RL." These are architecturally distinct: CWM's world-modeling capability is in the base weights, not in an auxiliary head receiving gradients simultaneously with RL. The interference risk (2602.00994) the report cites elsewhere applies precisely to the simultaneous-gradient version CWM does NOT use.
|
| 32 |
+
|
| 33 |
+
**Score overclaim:** The 65.8% SWE-bench Verified score requires test-time scaling (multiple candidates + ranking). CWM's base score (single attempt, no retry) is lower. The report does not make this distinction. The verbatim from the CWM abstract: "it reaches pass@1 scores of **65.8% on SWE-bench Verified (with test-time scaling)**." The base score "is computed with a single attempt per instance (no retries, majority voting, or parallel candidates), averaged over multiple runs." CWM's figure-2 caption explicitly notes this gap. Citing "65.8%" without the "(with test-time scaling)" qualifier is a misread of the headline number.
|
| 34 |
+
|
| 35 |
+
**Verdict:** Overclaim on two dimensions. The aux-head-on-policy design does not have CWM support. The score should be cited as "65.8% with test-time scaling" or the base score should be stated alongside.
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
### Finding 2.2 — Chain-of-World (arXiv:2603.03195): Wrong domain attribution [HIGH]
|
| 40 |
+
|
| 41 |
+
**What the report says (§2, line 35):**
|
| 42 |
+
> "The latent-motion line carries the same discipline into 2026: factorize dynamics into a compact latent and predict the consequential terminal state, not the full frame [53]."
|
| 43 |
+
|
| 44 |
+
The source list cites [53] as: "Chain of World: World Model Thinking in Latent Motion — arXiv:2603.03195 (CVPR 2026; disentangled latent-motion world model predicts terminal state instead of reconstructing redundant background)."
|
| 45 |
+
|
| 46 |
+
**What the source actually says:**
|
| 47 |
+
|
| 48 |
+
CoWVLA (Chain-of-World VLA) is a Vision-Language-Action model for **robotics**. It is submitted to CVPR 2026 under cs.CV. The authors are from Li Auto, Harbin Institute of Technology, and BAAI. The experimental benchmarks are robotic simulation benchmarks (manipulator tasks: grasping cups, placing objects). The architecture uses a video VAE to extract latent motion from physical-world video frames and predicts the terminal visual frame of a robot arm action segment.
|
| 49 |
+
|
| 50 |
+
This paper has **no relevance to software engineering or LLM-based code agents**. Its "compact latent, predict terminal state, not full frame" insight is about pixel-space robot dynamics. The analogy to "don't reconstruct the full next repo state, predict the decision-relevant delta" is the report author's inference, not a claim the source makes. Using it as a citation for SWE world modeling is a domain transfer that the source does not support.
|
| 51 |
+
|
| 52 |
+
**Verdict:** Improper citation. The paper should either be removed or explicitly noted as "robotics analogy, not SWE evidence."
|
| 53 |
+
|
| 54 |
+
---
|
| 55 |
+
|
| 56 |
+
### Finding 2.3 — Foresight Governance Paper (arXiv:2601.03905): Domain-transfer not flagged adequately [MEDIUM]
|
| 57 |
+
|
| 58 |
+
**What the report says (§2, line 29):**
|
| 59 |
+
> "handed a world model as a tool, agents invoke it under 1% of the time, misuse it ~15%, *degrade* when forced, and consult it *less* as they grow more capable — the bottleneck is foresight *governance* [11]"
|
| 60 |
+
|
| 61 |
+
**What the source actually says:**
|
| 62 |
+
|
| 63 |
+
The abstract is verbatim: "Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), **frequently misuse predicted rollouts (approximately 15%)**, and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced."
|
| 64 |
+
|
| 65 |
+
The report's numbers are accurate. However, the paper's experimental domain is "agentic and VQA tasks" — Vision-Language Models (VLMs) on visual question answering. The degradation figures are from VLM/VQA settings, not SWE task settings. The paper itself notes the bottleneck is foresight governance in those domains.
|
| 66 |
+
|
| 67 |
+
The report acknowledges this in §7: "the world-model-as-tool foresight result [11] is VLM/VQA." However, in §2 where this result is deployed as a structural argument for "it does not emerge from scale," the domain caveat is absent. A reader of §2 alone would not know this is VLM evidence being applied to SWE.
|
| 68 |
+
|
| 69 |
+
**Verdict:** The numbers are accurately quoted, but the domain caveat is deferred to §7 and absent from §2 where the argument is made. Minor but addressable.
|
| 70 |
+
|
| 71 |
+
---
|
| 72 |
+
|
| 73 |
+
### Finding 2.4 — Predictive-Causal Gap (arXiv:2605.05029): Scope overstated [MEDIUM]
|
| 74 |
+
|
| 75 |
+
**What the report says (§2, line 37):**
|
| 76 |
+
> "across 2,695 networks **mean causal fidelity is 0.49** (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes **causally blind (~1e-8) *while achieving 92% lower prediction error*** [18]. A SWE repo is exactly mixed-timescale..."
|
| 77 |
+
|
| 78 |
+
**What the source actually says:**
|
| 79 |
+
|
| 80 |
+
The paper is a single-author preprint (Kejun Liu, single affiliation) studying linear-Gaussian dynamics with a theorem and a nonlinear Duffing-GRU sweep. The 2,695-network count and the fidelity numbers are accurate. The theorem proves the gap for linear-Gaussian systems.
|
| 81 |
+
|
| 82 |
+
The SWE-specific generalization ("A SWE repo is exactly mixed-timescale") is the report's inference, not the paper's claim. The paper does state implications for "world models" in general, but its empirical evidence is limited to linear-Gaussian dynamics and the Duffing oscillator nonlinear extension. There is no SWE experiment, no code model, no NLP experiment.
|
| 83 |
+
|
| 84 |
+
Further, the paper uses "operational grounding" as a partial mitigation ("operational grounding — restricting the loss to system observables — partially suppresses the gap"). The report correctly notes "The value-equivalent target reduces but, by the theorem, never eliminates the gap" — this is accurate per the abstract. But the "never eliminates" is the theorem's conclusion for linear-Gaussian systems; the practical magnitude of the gap for an LLM trained on code SWE trajectories is unknown.
|
| 85 |
+
|
| 86 |
+
**Verdict:** The report uses this paper's impossibility theorem to argue against an aux-head configuration that CWM does not use anyway (see Finding 2.1). The theorem is real and relevant, but applying it as if it proves the SWE aux-head will fail is an extrapolation the paper does not support. It is best-used as a risk flag, not a structural argument.
|
| 87 |
+
|
| 88 |
+
---
|
| 89 |
+
|
| 90 |
+
### Finding 2.5 — MuZero (arXiv:1911.08265) and DreamerV3 (arXiv:2301.04104): The "value-equivalent" translation is plausible but not direct evidence [MEDIUM]
|
| 91 |
+
|
| 92 |
+
**What the report says (§2, line 33):**
|
| 93 |
+
> "MuZero and Dreamer add the design discipline: learn the *value-equivalent* latent — predict reward, value, the signed `FAIL_TO_PASS` delta, the predicted `tool_error` kind — never reconstruct the full state [14][15]."
|
| 94 |
+
|
| 95 |
+
**What the sources say:**
|
| 96 |
+
|
| 97 |
+
MuZero (arXiv:1911.08265) learns a latent model that predicts reward, policy, and value function for MCTS planning in board games and Atari. DreamerV3 (arXiv:2301.04104) learns a RSSM world model that predicts compact latent states for imagined rollouts. Both papers operate in fully observable, discrete/continuous control domains with clear reward signals.
|
| 98 |
+
|
| 99 |
+
The "value-equivalent latent" framing is from Schrittwieser et al. and is accurately invoked. However, neither paper has experiments in NLP, code generation, or multi-step software engineering. The translation from "predict reward/value/policy in Atari" to "predict FAIL_TO_PASS delta + tool_error kind for SWE" is the report's design inference.
|
| 100 |
+
|
| 101 |
+
This is not a misread — it is analogical reasoning from RL theory, which is legitimate. But the report presents these as "design discipline" rather than "analogical design inspiration," which is a subtle overclaim. MuZero and Dreamer provide no direct evidence that their latent-representation principle transfers to transformer-based LLM policy training on code.
|
| 102 |
+
|
| 103 |
+
**Verdict:** Sound analogy but presented as established principle. The report should note: MuZero/Dreamer motivate the value-equivalent design direction; they do not demonstrate it works in the LLM-policy-training regime.
|
| 104 |
+
|
| 105 |
+
---
|
| 106 |
+
|
| 107 |
+
## Section 3: Aux-Head as "Second SDPO Mode" — Overclaim Analysis
|
| 108 |
+
|
| 109 |
+
### Finding 3.1 — "Parameter isolation eliminates the interference risk" is overclaimed [HIGH]
|
| 110 |
+
|
| 111 |
+
**What the report says (§2, line 39):**
|
| 112 |
+
> "three 2026 results pull the other way, hard, and they are why the aux loss must be a *separate* head and an *ablation*. First, interference: 'Reasoning and Tool-use Compete in Agentic RL' shows training reasoning and tool-use into one parameter set induces misaligned gradients, and decoupling into separate adapters (DART) beats every joint baseline across thirteen benchmarks [16] — stacking a next-state head onto the *same* policy head is exactly the configuration it indicts."
|
| 113 |
+
|
| 114 |
+
The report then concludes (§2, line 39): the solution is a "parameter-isolated head or adapter."
|
| 115 |
+
|
| 116 |
+
**What arXiv:2602.00994 actually shows:**
|
| 117 |
+
|
| 118 |
+
DART decouples reasoning and tool-use into **separate LoRA modules** — but these LoRA modules share the same base model weights (frozen). The gradient interference is between the two LoRA adapters, not just between a head and a base. The paper's solution is to use *disjoint parameter sets* for the two capabilities. This means parameter isolation to a separate LoRA module does reduce interference, but the report's implication that a "parameter-isolated head" fully eliminates the problem is not the paper's finding.
|
| 119 |
+
|
| 120 |
+
Specifically: DART's separate LoRA modules still share the frozen base, and the paper's ablation shows even with LoRA decoupling, there is residual interference. "Approaches the 2-Agent upper bound" (abstract) means it does not fully close the gap. The "two-Agent upper bound" is the theoretical ceiling achieved by having two separate models — separate LoRA on the same base does not achieve this.
|
| 121 |
+
|
| 122 |
+
**Verdict:** The report correctly identifies that joint parameter training is the risk and that isolation helps. The overclaim is that isolation *eliminates* the risk. The source shows isolation *reduces* interference but does not eliminate it. The correct framing: "parameter isolation substantially reduces but does not eliminate gradient interference; a fully separate model achieves the upper bound."
|
| 123 |
+
|
| 124 |
+
---
|
| 125 |
+
|
| 126 |
+
### Finding 3.2 — "Foresight@k" is proposed, not standard [MEDIUM]
|
| 127 |
+
|
| 128 |
+
**What the report says (§2, line 43):**
|
| 129 |
+
> "**Foresight@k** — the lift in terminal pass-fraction when the deliberation token is allowed versus suppressed, sampling fixed — is **the kill ablation**: if it is ≈0, the token is a no-op and is cut [11][2]."
|
| 130 |
+
|
| 131 |
+
**What the sources say:**
|
| 132 |
+
|
| 133 |
+
Neither arXiv:2601.03905 nor any other cited source defines or uses "Foresight@k" as a metric. The term appears to be coined by the report. This is fine — defining a novel metric is reasonable — but the report presents it as a standard metric with source citations, which is misleading.
|
| 134 |
+
|
| 135 |
+
**Verdict:** The metric is the report's proposal. The citations [11][2] do not define or use Foresight@k. The report should explicitly mark this as a proposed metric ("we define Foresight@k as...") rather than implying it is established.
|
| 136 |
+
|
| 137 |
+
---
|
| 138 |
+
|
| 139 |
+
### Finding 3.3 — The aux-loss-as-"second SDPO mode" claim is architecturally creative but unsupported [MEDIUM]
|
| 140 |
+
|
| 141 |
+
**What the report says (§2, line 41):**
|
| 142 |
+
> "A 'predict-the-outcome' target is the same shape: splice the *realized post-action observation* [...] into the teacher context as the privileged info, and distill the student toward the distribution it would have had if it had foreseen that outcome. [...] Because the teacher is stop-grad, a wrong predicted-outcome hint is *bounded-bad*..."
|
| 143 |
+
|
| 144 |
+
This is presented as follows: the aux objective is a "second SDPO mode" riding `generalized_jsd_loss`, not a new loss term.
|
| 145 |
+
|
| 146 |
+
**Assessment of the claim:**
|
| 147 |
+
|
| 148 |
+
The architectural argument is internally consistent and clever. SDPO is hint-conditioned distillation; if the hint is the realized observation, then distillation toward what the model would have predicted with that hint is "predict the next state." The stop-grad safety argument is valid.
|
| 149 |
+
|
| 150 |
+
However: the synthesis note in the vault (`latent-what-if-deliberation`) accurately identifies this as the report's *design inference*, not something any source supports directly. The gap-fill note (`gap-fill-counter-evidence`) explicitly identifies the missing ablation: "No SWE-specific next-state-head null result exists yet — that exact ablation is the cheapest decisive experiment we could run ourselves." The final report presents this design as a committed design with supporting evidence, when the supporting evidence is analogical (CWM uses a different architecture; MuZero/Dreamer operate in different domains).
|
| 151 |
+
|
| 152 |
+
**Verdict:** The design argument is sound but the evidentiary claim is overclaimed. The source base for "aux-next-state-loss as second SDPO mode improves SWE agent performance" is zero. All sources are analogical or domain-different. The report should state this explicitly: "no prior work has tested this exact configuration; the nearest existential proof is CWM's mid-training architecture, which is structurally different."
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
|
| 156 |
+
## Section 4: Prune-vs-Train-on-All — Source Fidelity Check
|
| 157 |
+
|
| 158 |
+
### Finding 4.1 — RAFT [2504.11343]: The report's characterization is accurate [OK]
|
| 159 |
+
|
| 160 |
+
**What the report says:** "RAFT/rejection-sampling is competitive with GRPO at far less complexity, and GRPO's advantage comes from *discarding all-wrong prompts* — a pruning move [25]."
|
| 161 |
+
|
| 162 |
+
**What the source says (arXiv:2504.11343 abstract):** "a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields competitive performance than GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization."
|
| 163 |
+
|
| 164 |
+
**Verdict:** The report's characterization is verbatim-accurate. No issue.
|
| 165 |
+
|
| 166 |
+
---
|
| 167 |
+
|
| 168 |
+
### Finding 4.2 — Negative Gradient / LLD [2505.18830]: Characterization is accurate [OK]
|
| 169 |
+
|
| 170 |
+
**What the report says:** "the 'squeezing'/lazy-likelihood-displacement pathology, where the likelihood of *correct* responses barely rises or even drops under blanket per-token penalties [26]."
|
| 171 |
+
|
| 172 |
+
**What the source says (arXiv:2505.18830 abstract):** "we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training [...] identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength."
|
| 173 |
+
|
| 174 |
+
**Verdict:** Accurately characterized. No issue.
|
| 175 |
+
|
| 176 |
+
---
|
| 177 |
+
|
| 178 |
+
### Finding 4.3 — Near-Miss Negatives [2503.14391]: Characterization is accurate but domain is MCQA [OK with caveat]
|
| 179 |
+
|
| 180 |
+
**What the report says:** "positives-only training structurally cannot decrease the likelihood of plausible-but-wrong near-misses [27]."
|
| 181 |
+
|
| 182 |
+
**What the source says (arXiv:2503.14391 abstract):** "while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them."
|
| 183 |
+
|
| 184 |
+
The experimental setting is multiple-choice QA benchmarks (MCQA), not SWE. The report acknowledges this in §7: "the near-miss-calibration result [27] is MCQA." In §4 itself, this caveat is absent.
|
| 185 |
+
|
| 186 |
+
**Verdict:** Accurately quoted, domain gap present but not flagged in §4.
|
| 187 |
+
|
| 188 |
+
---
|
| 189 |
+
|
| 190 |
+
### Finding 4.4 — CWM "trains on all" re-enters §4 without correcting §2 [HIGH]
|
| 191 |
+
|
| 192 |
+
**What the report says (§4, lines 82-83):**
|
| 193 |
+
> "World-model next-state target — the single best foresight lever [...]; no policy penalty at all (§2) [13][27]."
|
| 194 |
+
> "The head is therefore not a bolt-on; it is the mechanism that makes train-on-all *safe* for the policy, because it relocates failed-branch signal off the policy gradient."
|
| 195 |
+
|
| 196 |
+
This invokes CWM [13] again as the model for "aux next-state head trains on all." As established in Finding 2.1, CWM's "train on all" is in the mid-training stage on a model that is NOT simultaneously receiving policy-gradient updates. CWM's design does not demonstrate that a simultaneously-trained aux head receiving failed-branch signal is safe for the policy.
|
| 197 |
+
|
| 198 |
+
**Verdict:** The §4 argument inherits the §2 misread. The "trains on all" CWM citation does not support the aux-head-on-policy configuration. This is the report's most consequential misread because it is the load-bearing justification for the "failed branch → world model head (safe)" two-harvest design.
|
| 199 |
+
|
| 200 |
+
---
|
| 201 |
+
|
| 202 |
+
## Missing Evidence / What the Sources Do NOT Say
|
| 203 |
+
|
| 204 |
+
### Missing 1: No paper tests next-state-prediction as aux loss on a policy network for SWE
|
| 205 |
+
|
| 206 |
+
The entire cluster (MuZero, Dreamer, CWM, Chain-of-World, 2512.18832) provides zero direct evidence for the proposed configuration: an auxiliary next-state-prediction objective appended to a policy network (as a separate head/adapter) during RL on software engineering tasks.
|
| 207 |
+
|
| 208 |
+
- MuZero: game-playing RL with a separate planning model
|
| 209 |
+
- DreamerV3: latent world model where policy is trained *inside* the world model's imagined rollouts — structurally opposite to "add aux head to existing policy"
|
| 210 |
+
- CWM: dedicated mid-training stage, not aux head during RL
|
| 211 |
+
- Chain-of-World: robotics, not SWE
|
| 212 |
+
- 2512.18832 ("From Word to World"): tests prompting and SFT for next-state prediction, not RL training with aux head
|
| 213 |
+
|
| 214 |
+
**The evidentiary gap is total for the specific proposed configuration.** The report should acknowledge: "There is no published ablation of an aux next-state loss on a policy LLM during code RL. CWM is the existence proof for mid-training dynamics; MuZero/Dreamer motivate the value-equivalent latent target. The specific aux-head-during-RL design is ours to test."
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
### Missing 2: The DART paper's scope is narrow
|
| 219 |
+
|
| 220 |
+
DART (2602.00994) is on retrieval-augmented QA and NL2SQL — not SWE, not multi-step agent tasks with long-horizon tool use. The interference result is between two capabilities (reasoning vs tool-use) in a shared LoRA. Applying it to "next-state-prediction head vs policy head" is another analogical transfer that the source does not make.
|
| 221 |
+
|
| 222 |
+
---
|
| 223 |
+
|
| 224 |
+
### Missing 3: The Myopic Planning paper (2605.06840) is not verified in the vault with full content
|
| 225 |
+
|
| 226 |
+
The vault note for 2605.06840 only has the abstract. The report cites specific causal pruning findings. The full paper is not in the vault as fetched content. The abstract confirms the causal CoT-pruning direction is described, but the specific intervention details ("causal CoT-pruning intervention confirms move selection is driven by shallow depth-1 nodes") are derived from the paper's full text, which was not independently verified from the source. This should be checked directly.
|
| 227 |
+
|
| 228 |
+
---
|
| 229 |
+
|
| 230 |
+
## What the Literature DOES Support (for balance)
|
| 231 |
+
|
| 232 |
+
1. **Explicit dynamics training (mid-training or SFT) beats zero-shot prompting**: 2512.18832 demonstrates SFT on trajectories lifts ALFWorld/SciWorld accuracy to 99%/98%. CWM demonstrates mid-training on dynamics produces a strong SWE base. These are real, direct endorsements of *some form* of dynamics training.
|
| 233 |
+
|
| 234 |
+
2. **Train-on-all for world model, filter for RL reward**: CWM explicitly does this, and the motivating argument ("comprehensive world model") is stated in the paper. The design principle is supported, just not via aux head during RL.
|
| 235 |
+
|
| 236 |
+
3. **Value-equivalent / decision-relevant targets**: MuZero's design principle — predict only what matters (reward, value, policy) — is well-established RL theory. Its application to code (predict FAIL_TO_PASS delta, not full repo state) is a sound architectural translation.
|
| 237 |
+
|
| 238 |
+
4. **Structured negatives fix near-miss calibration positives-only cannot**: 2503.14391 supports this in MCQA. The SWE transfer is an inference.
|
| 239 |
+
|
| 240 |
+
5. **Foresight governance bottleneck**: 2601.03905 clearly identifies this bottleneck in VLM/VQA. The principle is general.
|
| 241 |
+
|
| 242 |
+
6. **Simultaneous gradient interference**: 2602.00994 is a real empirical finding in agentic RL. The prescriptive consequence (use parameter isolation) is supported.
|
| 243 |
+
|
| 244 |
+
---
|
| 245 |
+
|
| 246 |
+
## Verdict on Report Commitments (§2–4)
|
| 247 |
+
|
| 248 |
+
| Commitment in final report | Source support | Verdict |
|
| 249 |
+
|---|---|---|
|
| 250 |
+
| CWM "trains on all" for world-model head | CWM trains on all in mid-training stage, not during RL aux-head | **MISREAD — cite correctly as mid-training** |
|
| 251 |
+
| CWM reaches "65.8% SWE-bench Verified" | Correct but with test-time scaling; base score is lower | **INCOMPLETE — add qualifier** |
|
| 252 |
+
| Chain-of-World supports "predict terminal state, not full frame" for code | CoWVLA is robotics VLA, not SWE | **WRONG DOMAIN — remove or note as robotics analogy** |
|
| 253 |
+
| Predictive-Causal Gap proves SWE repo is dangerous for aux loss | Theorem is linear-Gaussian; SWE application is author inference | **OVERSTATED — demote to risk flag** |
|
| 254 |
+
| Parameter isolation eliminates interference risk | DART shows isolation reduces, not eliminates | **OVERCLAIM — soften to "substantially reduces"** |
|
| 255 |
+
| Foresight@k is the kill ablation | Metric is proposed by report, not a standard metric | **MARK AS PROPOSED METRIC** |
|
| 256 |
+
| Aux loss as "second SDPO mode" has evidentiary support | No source tests this configuration | **ZERO DIRECT EVIDENCE — flag as null-evidence design proposal** |
|
| 257 |
+
| "Two hard prune gates resolve the central question" | Gates don't address predictive-causal gap | **INTERNAL INCONSISTENCY — the theorem is invoked then resolved by a design that doesn't address it** |
|
| 258 |
+
| MuZero/Dreamer provide "design discipline" for SWE aux head | They motivate value-equivalent targets; no LLM-SWE experiments | **OVERCLAIM — demote to "analogical motivation"** |
|
| 259 |
+
|
| 260 |
+
---
|
| 261 |
+
|
| 262 |
+
## Recommended Corrections
|
| 263 |
+
|
| 264 |
+
1. **§2, CWM citation**: Rewrite as: "Meta's CWM mid-trains a 32B model in a dedicated pre-RL stage on 3M observation-action trajectories without success filtering, reaching 65.8% on SWE-bench Verified with test-time scaling. The train-on-all decision is for the mid-training dynamics stage, not an auxiliary head during RL." Remove the implication that CWM licenses aux-head-on-policy train-on-all.
|
| 265 |
+
|
| 266 |
+
2. **§2, Chain-of-World [53]**: Flag explicitly as robotics VLA. Either remove or rewrite as: "In the robotics domain, CoWVLA (CVPR 2026) demonstrates the same latent-terminal-state design for embodied agents, providing design-level motivation for the analogous SWE architecture."
|
| 267 |
+
|
| 268 |
+
3. **§2, Predictive-Causal Gap**: Rewrite as a risk flag: "The Predictive-Causal Gap theorem (linear-Gaussian dynamics; 2695 networks) establishes that predictive objectives can be accurate and causally blind simultaneously. Applied to SWE by analogy, a next-state head could improve token-level prediction while failing to learn decision-relevant dynamics. This is a structural risk, not a demonstrated outcome for LLM SWE training."
|
| 269 |
+
|
| 270 |
+
4. **§2, Parameter isolation**: Change "parameter-isolated head or adapter, never fused into the policy head [16]" to "parameter-isolated head or adapter substantially reduces gradient interference (DART reduces but does not fully close the interference gap to the 2-Agent upper bound [16])."
|
| 271 |
+
|
| 272 |
+
5. **§2, Foresight@k**: Add "(we define this metric; it has no published baseline)" at first use.
|
| 273 |
+
|
| 274 |
+
6. **§2, null-evidence flag**: Add a box or explicit paragraph: "Direct evidence gap: no published paper has tested an auxiliary next-state-prediction objective as an add-on loss during RL on a code policy network. All cited support is analogical transfer from: (a) mid-training architectures (CWM), (b) dedicated world-model planning systems (MuZero, Dreamer), or (c) non-SWE domains (Chain-of-World, 2512.18832). The proposed aux-head-during-RL design is a research hypothesis requiring the P4 ablation in §4, not a design with established support."
|
| 275 |
+
|
| 276 |
+
7. **§4, CWM [13] re-citation**: Correct to: "CWM's mid-training precedent motivates train-on-all dynamics learning; direct evidence for the aux-head variant of this design during RL is absent."
|
| 277 |
+
|
| 278 |
+
8. **§3, foresight domain caveat**: Bring the "VLM/VQA" caveat from §7 into §2 at first use of 2601.03905.
|
| 279 |
+
|
| 280 |
+
---
|
| 281 |
+
|
| 282 |
+
## What These Sources Collectively Say About (a), (b), (c)
|
| 283 |
+
|
| 284 |
+
**Question (a): Does next-state-prediction auxiliary head help agent policies (vs hurt via gradient interference)?**
|
| 285 |
+
|
| 286 |
+
Direct evidence: **None in SWE or code domain.** CWM does it in mid-training, not as aux head. DART shows simultaneous parameter optimization on reasoning+tool-use interferes; parameter isolation helps substantially. 2512.18832 shows SFT (not RL with aux head) helps next-state prediction transfer. The honest position: unknown for the specific proposed configuration; plausible but untested.
|
| 287 |
+
|
| 288 |
+
**Question (b): Does training on failure trajectories help/hurt and under what routing?**
|
| 289 |
+
|
| 290 |
+
Direct evidence for routing: 2503.14391 (MCQA near-miss: negatives help near-miss calibration, positives-only cannot decrease plausible-wrong likelihood). 2505.18830 (raw uniform negative gradient destabilizes). 2504.11343 (positives-only RAFT is competitive on pass@1). Together: **raw negatives hurt pass@1, structured negatives fix calibration, SWE-specific ablation unrun.** CWM's mid-training: all trajectories useful for dynamics, not for policy. The "two-harvest" design (negatives to world model, not policy) is consistent with all sources but not tested in the proposed configuration.
|
| 291 |
+
|
| 292 |
+
**Question (c): Do "deliberation tokens" / think-before-act distillation have support?**
|
| 293 |
+
|
| 294 |
+
Partial support: CWM discusses "reasoning about environment feedback to improve agentic code generation" as future work and shows early prototype (Figure 5) of trace-conditioned reasoning. 2512.18832 shows SFT on trajectories enables explicit next-state reasoning. But "deliberation token" as a trainable gate with RL on placement is not in any source. 2605.06840 suggests CoT deliberation content is generated but causally ignored by the model — a direct challenge to the governance-RL-on-token-placement idea. 2601.03905 shows even explicit simulation access fails. The think-before-act idea has motivational support but no direct SWE ablation and one result (myopic planning) that challenges whether the token's content would be consumed.
|
| 295 |
+
|
| 296 |
+
---
|
| 297 |
+
|
| 298 |
+
*End of findings. Full-length file: `/Users/baladita/Documents/DevBox/composer-replication-framework/research/deepread/06-worldmodel.md`*
|
|
@@ -0,0 +1,291 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-Read Critical Review: Trace-Replay / Multi-Teacher Distillation / MCTS-for-LLM-Agents
|
| 2 |
+
## Cluster 07 — Channel 3 + Tree-of-Work
|
| 3 |
+
|
| 4 |
+
**Reviewer:** agent (Claude Sonnet 4.6)
|
| 5 |
+
**Date:** 2026-06-10
|
| 6 |
+
**Scope:** `composer_replication/teacher_replay.py`, `research/05-trace-replay-distillation.md`, `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §§1,3,6,7,10, `docs/adrs/ADR-002-trace-source.md`, `docs/adrs/ADR-013-lma-integration-channel-ladder.md`
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## 0. TL;DR
|
| 11 |
+
|
| 12 |
+
Channel 3 (`teacher_replay.py`) and the tree-of-work design are **neither pure novelty nor pure reinvention**. The specific combination — frozen-trace replay at every decision step across N heterogeneous frontier models, consensus-disagreement-as-DPO-signal, divergence-gated recursive branching with an execution oracle, and world-model auxiliary loss — is genuinely novel synthesis. But the repo's research note (`research/05`) **overclaims** novelty in three specific ways, the final report's cost estimates are **internally inconsistent**, the "closest precedent" identification (rStar) is partially wrong, and there is a **missing piece** (agent-trace distillation at code-repo scale already exists in published form at direct overlap) that neither `research/05` nor the final report adequately cites. This review quotes sources verbatim, identifies every misread, and gives a verdict on each cost figure.
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
## 1. What Was Fetched / Read
|
| 17 |
+
|
| 18 |
+
Primary sources fetched and read:
|
| 19 |
+
- **LATS** (arXiv:2310.04406, Zhou et al. 2023): Language Agent Tree Search — MCTS over agent reasoning/acting/planning steps with external env feedback and LM-valued nodes.
|
| 20 |
+
- **rStar** (arXiv:2408.06195, Qi et al. 2024): mutual self-play two-SLM MCTS for math reasoning; generator+discriminator role split.
|
| 21 |
+
- **rStar-Math** (arXiv:2501.04519, Guan et al. 2025): MCTS-driven step-verified data synthesis for math SLMs; trains PPM (process preference model) from tree rollouts.
|
| 22 |
+
- **Tree-of-Thoughts** (arXiv:2305.10601, Yao et al. 2023): foundational ToT framework, BFS/DFS over thought trees, LM self-evaluation.
|
| 23 |
+
- **SWE-Search** (arXiv:2410.20285, Antoniades et al. 2024): MCTS over SWE-bench tasks; hybrid LM value function, Value Agent, Discriminator Agent; 23% relative gain over no-MCTS; scales with inference-time compute.
|
| 24 |
+
- **Tree-GRPO** (arXiv:2509.21240, Ji et al. 2025/ICLR 2026): tree-structured group-relative policy optimization; shared-prefix branching; proves intra-tree group-relative advantage == step-level DPO.
|
| 25 |
+
- **Socratic-SWE** (arXiv:2606.07412, Xiao et al. 2026): closed-loop self-evolving SWE agent; trace-derived Agent Skill Registry; 4-gate verifier; solver-gradient alignment reward; 50.40% SWE-bench Verified after 3 iters.
|
| 26 |
+
- **Symphony** (arXiv:2601.22623, Zhu et al. 2026/NeurIPS 2025): heterogeneous LM pool in MCTS planning; argues single-agent MCTS has insufficient branch diversity; heterogeneous pool improves rollout diversity.
|
| 27 |
+
- **SWE-Gym** (arXiv:2412.21139, Pan et al. 2024/ICML 2025): 2,438 real-world Python task environments; inference-time scaling via trained verifiers; 32.0%/26.0% on SWE-bench Verified/Lite.
|
| 28 |
+
- **Code World Model MCTS** (arXiv:2405.15383, Dainese et al. 2024): MCTS-guided code world model generation; GIF-MCTS; NeurIPS 2024.
|
| 29 |
+
- **Tree Search for LM Agents** (arXiv:2407.01476, Koh et al. 2024): best-first tree search for web agents; 39.7% relative gain on VisualWebArena with GPT-4o; effective at realistic tasks.
|
| 30 |
+
|
| 31 |
+
Repo files read in full:
|
| 32 |
+
- `composer_replication/teacher_replay.py` (281 LOC)
|
| 33 |
+
- `research/05-trace-replay-distillation.md`
|
| 34 |
+
- `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` (§§1–10, full)
|
| 35 |
+
- `docs/adrs/ADR-002-trace-source.md`
|
| 36 |
+
- `docs/adrs/ADR-013-lma-integration-channel-ladder.md`
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## 2. What the Literature Actually Says
|
| 41 |
+
|
| 42 |
+
### 2a. Replaying a trace prefix across N heterogeneous models then branching — has anyone done it?
|
| 43 |
+
|
| 44 |
+
**Short answer: The exact frozen-trace multi-teacher replay mechanism is not directly published. But the components are deeply covered and combinatorially close work exists.**
|
| 45 |
+
|
| 46 |
+
The relevant lineage:
|
| 47 |
+
|
| 48 |
+
1. **Tree-of-Thoughts / LATS / SWE-Search** all expand nodes via a policy (one model, sometimes self-evaluated) and do NOT fix a human-collected trace then replay each step with multiple heterogeneous external models. The branching is forward-generative, not retrospective-replay.
|
| 49 |
+
|
| 50 |
+
2. **rStar** (the closest analogue cited in `research/05`) uses two SLMs in fixed roles (generator + discriminator) at trajectory-level evaluation. The discriminator evaluates the full trajectory, not each step position independently. Quote from the abstract: "another SLM, with capabilities similar to the target SLM, acts as a **discriminator to verify each trajectory** generated by the target SLM." This is trajectory-level verdict, not per-step replay with N teachers.
|
| 51 |
+
|
| 52 |
+
3. **rStar-Math** does per-step MCTS rollouts, but these are single-policy forward rollouts (not fixed-trace, not multi-teacher). The PPM then scores steps. No external teacher heterogeneity.
|
| 53 |
+
|
| 54 |
+
4. **SWE-Gym** trains verifiers on agent trajectories and scales inference via them, but the training data collection does not use multi-teacher per-step replay.
|
| 55 |
+
|
| 56 |
+
5. **AgentTrek** (cited in `research/05`) uses "guided replay demonstrations" from web tutorials — semantically similar name but operationally different: it is single-teacher demonstration collection, not multi-teacher disagreement harvesting.
|
| 57 |
+
|
| 58 |
+
**Verdict on (a):** Nobody has published the exact frozen-trace-prefix → N-teacher parallel replay → disagreement-as-DPO mechanism at the agentic-step level. The combination is genuinely novel. However, `research/05`'s claim that "No published work systematically freezes a trace at step t, replays that exact state with N different teachers, harvests variance as per-step supervision" is **accurate** — but `research/05` then goes on to undercite the work that comes closest, which is more than rStar.
|
| 59 |
+
|
| 60 |
+
### 2b. Execution-oracle-graded tree search over real repos at scale — what does the literature actually say and what does it cost?
|
| 61 |
+
|
| 62 |
+
**Published execution-oracle tree search at repo scale:**
|
| 63 |
+
|
| 64 |
+
- **SWE-Search** (arXiv:2410.20285): MCTS over SWE-bench using a hybrid value function (LLM-estimated numerical + qualitative). **Critical nuance:** the value function is LLM-estimated, NOT a true execution oracle. It uses a "Value Agent" that provides qualitative feedback. Full pytest runs are expensive; SWE-Search's value is LM-evaluated fitness, not test-suite-pass-fraction. The 23% relative gain is impressive but it is LM-valued MCTS, not execution-oracle MCTS.
|
| 65 |
+
|
| 66 |
+
- **SWE-Gym verifiers** (arXiv:2412.21139): trains a verifier (a model that learns to score trajectory quality) on trajectories, then uses it for inference-time selection. The verifier *learns* execution outcome but is not the oracle itself during search.
|
| 67 |
+
|
| 68 |
+
- **Socratic-SWE** (arXiv:2606.07412): uses a 4-gate verifier — Format/Grounding/Execution/Semantics — for task validation, NOT as the tree search value function. The training reward is `Valid(τ)·cos(g_τ, G_val)` — solver-gradient alignment, NOT test-suite pass-fraction. The execution gate is a solvability filter, not a step-level oracle.
|
| 69 |
+
|
| 70 |
+
**Only the repo's own `FeatureDeletionEnv._grade()` (and CWM arXiv:2510.02387, which trains a model on execution trajectories) use raw test-suite pass-fraction as the MCTS/RL reward at training time.** The final report correctly identifies this as the critical differentiator.
|
| 71 |
+
|
| 72 |
+
**Cost data for execution-oracle tree search:**
|
| 73 |
+
|
| 74 |
+
The literature does NOT provide a clean "$X per trace" figure for execution-oracle MCTS at real repo scale. The closest:
|
| 75 |
+
- DeepSWE (cited as [43] in the final report) ran 64 H100s for six days on 0/1 sparse reward outcome-RL with no branching tree — making the per-trace cost of a *flat* outcome-RL run tractable but providing no per-branch cost.
|
| 76 |
+
- SWE-rebench infrastructure (nebius.com, [54]) evaluates thousands of SWE instances/hour via distributed container orchestration — providing a throughput floor but not a per-branch dollar figure.
|
| 77 |
+
|
| 78 |
+
### 2c. Cross-model disagreement as a branch gate — is it a useful signal?
|
| 79 |
+
|
| 80 |
+
**Tree-GRPO** (arXiv:2509.21240, ICLR 2026) provides the strongest theoretical backing: it proves that intra-tree group-relative advantage estimation is *equivalent* to step-level direct preference learning. This directly validates the repo's claim that "sibling divergence → DPO signal" is principled. Quote from abstract: "we demonstrate that the objective of intra-tree level group relative policy optimization is **equivalent to that of step-level direct preference learning**."
|
| 81 |
+
|
| 82 |
+
However, this is about **same-model** intra-tree branching (shared prefixes, different continuations), not heterogeneous-model branching. The equivalence proof does not cover cross-family disagreement.
|
| 83 |
+
|
| 84 |
+
**Symphony** (arXiv:2601.22623, NeurIPS 2025) argues explicitly that "single-agent MCTS yields insufficient branch diversity and a heterogeneous LM pool improves rollout diversity and exploration." This endorses cross-model disagreement as a diversity mechanism — but note: Symphony tests this in planning tasks, not repo-level SWE.
|
| 85 |
+
|
| 86 |
+
**Counterpoint:** arXiv:2604.02460 (Single-Agent LLMs Outperform Multi-Agent, Tran & Kiela 2026) makes the data-processing inequality argument that under equal compute, single-agent is information-theoretically superior. The SWE-specific result is unrun (as the final report correctly acknowledges).
|
| 87 |
+
|
| 88 |
+
### 2d. DPO-pair extraction from sibling branches
|
| 89 |
+
|
| 90 |
+
Tree-GRPO (arXiv:2509.21240) provides the clearest theoretical backing for this being well-founded. The repo's `extract_dpo_pairs` implements a simpler heuristic (Counter over normalized actions, break-on-first-pair per state) — this is a sound but weaker version of what Tree-GRPO formalizes. The repo's implementation does not implement the intra-tree advantage weighting Tree-GRPO derives.
|
| 91 |
+
|
| 92 |
+
---
|
| 93 |
+
|
| 94 |
+
## 3. Critical Review of `research/05-trace-replay-distillation.md`
|
| 95 |
+
|
| 96 |
+
### Finding 05-R1: Overclaim on rStar as "closest precedent" (MISREAD)
|
| 97 |
+
|
| 98 |
+
`research/05` Section "The Closest Published Precedent" frames rStar as the closest precedent and describes it as having "multi-model step-level evaluation." This is partially wrong:
|
| 99 |
+
|
| 100 |
+
- rStar's discriminator evaluates **full trajectories**, not individual steps. Quote from rStar abstract: "acts as a discriminator to **verify each trajectory**." The per-step MCTS action expansion in rStar uses the *generator* SLM alone; the discriminator only votes on completed trajectories.
|
| 101 |
+
- The correct characterization: rStar is close because it uses two models with agreement as signal, but the granularity is trajectory-level, not step-level.
|
| 102 |
+
|
| 103 |
+
**More accurate closest precedent:** Tree-GRPO (arXiv:2509.21240) is closer to the DPO-extraction mechanism. `research/05` does not cite Tree-GRPO at all — this paper was published September 2025 and is directly relevant.
|
| 104 |
+
|
| 105 |
+
### Finding 05-R2: Missing citation — Tree-GRPO (MISS)
|
| 106 |
+
|
| 107 |
+
`research/05` does not cite Tree-GRPO (arXiv:2509.21240), which:
|
| 108 |
+
- Proves intra-tree group-relative advantage == step-level DPO (exactly what `extract_dpo_pairs` is trying to compute informally).
|
| 109 |
+
- Shows that shared-prefix branching is a principled way to generate process supervision from outcome reward alone.
|
| 110 |
+
- Is published at ICLR 2026, directly relevant.
|
| 111 |
+
|
| 112 |
+
`research/05` was presumably written before Tree-GRPO's acceptance was known, but this is now a material gap. The final report (§7, footnote [44]) cites Tree-GRPO correctly — so this was caught downstream but never propagated back to `research/05`.
|
| 113 |
+
|
| 114 |
+
### Finding 05-R3: Missing citation — SWE-Search (MISS)
|
| 115 |
+
|
| 116 |
+
`research/05` does not cite SWE-Search (arXiv:2410.20285), which applies MCTS specifically to SWE-bench tasks (the exact domain), achieves 23% relative improvement, and demonstrates that the value signal scales with inference-time compute. This is the most directly relevant published tree-search SWE result. The final report cites it as [51] — again, a gap that exists only in `research/05`.
|
| 117 |
+
|
| 118 |
+
### Finding 05-R4: AgentTrek description is misleading (MISREAD)
|
| 119 |
+
|
| 120 |
+
`research/05` describes AgentTrek as "Large-scale multimodal trajectory dataset from web tutorials. Guided replay demonstrations." The characterization "Guided replay demonstrations" is misleading — it implies similarity to multi-teacher replay. AgentTrek generates demonstrations by following web tutorial steps with a single agent; there is no replay of a frozen trace across multiple models. The "connection" note ("Demonstrates feasibility of guided/counterfactual replay") overstates the analogy.
|
| 121 |
+
|
| 122 |
+
### Finding 05-R5: "Trace-Freezing + Multi-Teacher Replay: Novel" claim is directionally right but mis-attributed (PARTIAL OVERCLAIM)
|
| 123 |
+
|
| 124 |
+
Section "What IS Novel" states: "No published work systematically freezes a trace at step t, replays that exact state with N different teachers, harvests variance as per-step supervision." This specific combination IS genuinely novel. But the claim "Step-Level Multi-Teacher Preference Data: Traditional multi-teacher: response-level preferences; PRMs: single-teacher step evaluation; Gap: No multi-teacher per-step comparison" understates what Tree-GRPO achieves: Tree-GRPO generates multiple continuations from a shared prefix (a frozen state) and extracts step-level preference signal — which is mechanically very close to what Channel 3 does (though with a single model rather than heterogeneous teachers).
|
| 125 |
+
|
| 126 |
+
### Finding 05-R6: Cost estimates are internally consistent in `research/05` but inconsistent with the final report (FLAG)
|
| 127 |
+
|
| 128 |
+
`research/05` states the "Baseline" cost as:
|
| 129 |
+
```
|
| 130 |
+
$0.008/step × 1000 × 8 = $64 per trace
|
| 131 |
+
```
|
| 132 |
+
This is "8 teachers × 1000 steps" ungated. The final report's footnote [6] says "~$0.98 flat vs ~$64 ungated" — the $0.98 figure comes from Spike 001's actual measurement on 50 synthetic states with 3 teachers (reported in `teacher_replay.py` docstring: "Verified economic floor (✅ spike 001): $0.98 mean per-trace cost ungated, $0.30/trace projected with VOI gating.").
|
| 133 |
+
|
| 134 |
+
**The inconsistency:** `research/05` constructs the $64 figure from 8 teachers × 1000 steps, while `teacher_replay.py` uses DEFAULT_TEACHERS with only 3 teachers. The final report uses $0.98 (3 teachers, real trace) and $64 (8 teachers, 1000 steps) in the same sentence, implying $64 is the cost of "the tree" not "flat Channel 3 with 8 teachers." This is confusing:
|
| 135 |
+
|
| 136 |
+
- $0.98 per trace = real Spike 001 measurement, N=3 teachers, ~50 synthetic states
|
| 137 |
+
- $64 per trace = constructed estimate from research/05, N=8 teachers, 1000 steps, flat (no branching)
|
| 138 |
+
- A true branching tree at depth D with N branches is O(N^D) which is *worse* than $64
|
| 139 |
+
|
| 140 |
+
The final report says "a true branching tree is O(N^D), strictly worse than either flat figure" (§10), which is correct — but the $64 figure is NOT a "branching tree" cost, it is a flat all-steps N=8 estimate. Calling it "ungated tree" in the final report (§10: "~$64 ungated tree") is a misnomer. It should be "$64 flat 8-teacher 1000-step, vs O(N^D) for a true tree."
|
| 141 |
+
|
| 142 |
+
---
|
| 143 |
+
|
| 144 |
+
## 4. Critical Review of the Final Report (§§1,3,6,7,10 of `final_report_socratic-mcts-swe-worldmodel-8f6dea.md`)
|
| 145 |
+
|
| 146 |
+
### Finding FR-R1: The "flat (depth-1)" characterization of Channel 3 is accurate (CORRECT)
|
| 147 |
+
|
| 148 |
+
§1: "It is flat (depth-1): the teachers never take their candidate action or continue, so the 'tree' is a set of depth-1 stars hung off a pre-existing linear human trace." This is exactly right — confirmed by reading `replay_trace()` (lines 178-188) and `extract_dpo_pairs()` (lines 206-262).
|
| 149 |
+
|
| 150 |
+
### Finding FR-R2: "Fitness is teacher plurality, not execution" characterization is accurate (CORRECT)
|
| 151 |
+
|
| 152 |
+
§1: "its fitness is teacher plurality, not execution: selection is a Counter over normalized actions, nothing runs against a test suite." Confirmed by reading `_normalize_action()` and `extract_dpo_pairs()` — the pairing logic is purely textual counter-based with whitespace normalization. There is no execution oracle in Channel 3's current form.
|
| 153 |
+
|
| 154 |
+
**Critical gap:** The normalization in `_normalize_action()` (lines 196-203) is very weak — it only normalizes whitespace and lowercases. For real agentic traces with structured tool calls, this will produce near-zero agreement (two tool calls that are semantically identical but differently formatted will count as disagreements). The docstring acknowledges this: "For real agentic traces, this should parse the tool call (name + args) and return a canonical form." This is a **known unimplemented critical path** for the channel to produce any real signal.
|
| 155 |
+
|
| 156 |
+
### Finding FR-R3: The Socratic-SWE description has a materially wrong claim (MISREAD, HIGH SEVERITY)
|
| 157 |
+
|
| 158 |
+
§1 describes Socratic-SWE: "Socratic-SWE (2606.07412) is the closest published analogue — a closed-loop self-evolving SWE trainer that distills traces into skills, generates targeted repair tasks, gates them through a four-stage execution validator, scores by solver-gradient alignment, and reaches 50.40% on SWE-bench Verified after three iterations. But it generates repair tasks; it does NOT inject bugs."
|
| 159 |
+
|
| 160 |
+
This is accurate as far as it goes. However, the final report then says: "Bug injection — manufacturing a broken state by reverting a gold patch and rewarding re-derivation — is the repo's own FeatureDeletionEnv, the analogue of Cursor's 'feature deletion'."
|
| 161 |
+
|
| 162 |
+
The implicit contrast — "they do repair tasks, we do feature deletion / bug injection" — is valid. But the report slightly overstates the contrast by not noting that Socratic-SWE's task generation pipeline is *also* execution-validated (Format/Grounding/Execution/Semantics gate) and also targets agent-specific weaknesses (via skill distillation). The gap is real (no bug injection, no per-step branching) but smaller than the report implies.
|
| 163 |
+
|
| 164 |
+
**Bigger issue:** The claim "it does NOT inject bugs" deserves a citation to the Socratic-SWE paper to confirm this is an architectural choice and not an oversight. Quote from the Socratic-SWE abstract: "Existing synthetic data methods **typically create tasks through fixed mutation or bug-injection** procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that **reuses the agent's historical solving traces** as a source of training signal." So Socratic-SWE explicitly positions against bug injection. The final report correctly notes this but doesn't quote it, which weakens the citation.
|
| 165 |
+
|
| 166 |
+
### Finding FR-R4: "heterogeneous-per-turn branching + execution-oracle fitness + divergence-derived textual feedback + world-model aux loss — is the novel synthesis" claim is correctly hedged (CORRECT)
|
| 167 |
+
|
| 168 |
+
§1: "The multi-model per-turn tree is a recombination whose ingredients all exist (SWE-Search expands nodes with one policy [51]; Symphony does heterogeneous-LM planning [52]; Channel 3 queries N teachers, flat) but whose specific combination [...] is the novel synthesis. Claim the synthesis, not the parts." This is well-calibrated. The component-level citations are verified correct.
|
| 169 |
+
|
| 170 |
+
**One gap:** The paper arXiv:2407.01476 ("Tree Search for Language Model Agents," Koh et al. 2024) is in the vault and applies best-first tree search to realistic web agent tasks. This is another component-level precedent that should be in the synthesis inventory but is not cited in §1 or §7.
|
| 171 |
+
|
| 172 |
+
### Finding FR-R5: The cost estimate claim "$0.98 mean per-trace cost ungated" needs qualification (POTENTIAL OVERCLAIM)
|
| 173 |
+
|
| 174 |
+
The $0.98 figure appears both in `teacher_replay.py`'s docstring ("Verified economic floor (✅ spike 001): $0.98 mean per-trace cost ungated") and in the final report (§10, [6]).
|
| 175 |
+
|
| 176 |
+
However, Spike 001 used 50 **hand-crafted synthetic states** (ADR-002: "Spike 001 used 50 hand-crafted synthetic states for the cost-floor measurement"). Real Claude Code sessions have 125–2,830 tool_use messages per session (ADR-002). If a "trace" is a full session with ~1,400 states (median), the actual cost of flat N=3 replay would be roughly:
|
| 177 |
+
|
| 178 |
+
```
|
| 179 |
+
1,400 steps × 3 teachers × (prompt_tokens × cost_in + completion_tokens × cost_out)
|
| 180 |
+
```
|
| 181 |
+
|
| 182 |
+
The DEFAULT_TEACHERS pricing in `teacher_replay.py` at 200 max_tokens:
|
| 183 |
+
- claude-opus-4.7: $15/M in + $75/M out → at 2k prompt tokens + 200 completion: ~$0.045/call
|
| 184 |
+
- gpt-5: $1.25/M in + $10/M out → ~$0.004/call
|
| 185 |
+
- deepseek-v4-pro: $1.10/M in + $4.40/M out → ~$0.003/call
|
| 186 |
+
|
| 187 |
+
Total per-step: ~$0.052
|
| 188 |
+
For 1,400 steps: **~$73 per session** (not $0.98).
|
| 189 |
+
|
| 190 |
+
The $0.98 figure was correct for 50 synthetic states with short messages. At real session scale, the cost is one to two orders of magnitude higher. The docstring's "Verified economic floor" language suggests the $0.98 is the per-trace figure for the spike's trace definition (50 steps), not per full-session. The final report's §10 is careful about this distinction but `teacher_replay.py`'s docstring is misleading: it should say "per 50-step trace (spike 001 benchmark)," not just "per-trace."
|
| 191 |
+
|
| 192 |
+
### Finding FR-R6: The $64 "ungated tree" terminology is a misnomer (FLAG, see 05-R6 above)
|
| 193 |
+
|
| 194 |
+
§10: "flat Channel-3 replay is ~$0.98/trace at N=3 and ~$64/trace at the eight-teacher × thousand-step scale, both flat O(N·T); a true branching tree is O(N^D), strictly worse than either flat figure."
|
| 195 |
+
|
| 196 |
+
The statement is mathematically correct but the label "~$64 ungated tree" (used in the final report's in-text references) is misleading because $64 is a flat (non-branching) extrapolation from `research/05`. It is not a measured figure, and the "eight teachers" scenario (N=8) does not correspond to any actual configuration in the codebase (DEFAULT_TEACHERS has N=3). The label should be "$64 flat-replay extrapolation at N=8, T=1000, unmeasured."
|
| 197 |
+
|
| 198 |
+
### Finding FR-R7: The heterogeneity pushback is correctly represented (CORRECT)
|
| 199 |
+
|
| 200 |
+
§7 cites arXiv:2604.02460 (Tran & Kiela 2026) and arXiv:2601.12307 (Rethinking multi-agent) correctly. Both papers are in the vault with correct titles/years. The data-processing-inequality framing for single-agent efficiency is correctly attributed.
|
| 201 |
+
|
| 202 |
+
**One gap:** The final report does not note that arXiv:2604.02460 studies multi-hop *reasoning* tasks (QA), not SWE tasks. The domain transfer is uncertain. This is the kind of caveat that belongs in §7's pushback discussion.
|
| 203 |
+
|
| 204 |
+
### Finding FR-R8: The _normalize_action issue is a silent failure mode not surfaced in the final report (MISS)
|
| 205 |
+
|
| 206 |
+
The final report discusses the $0.98→$64 cost gap and the O(N^D) blowup, but does not flag that `extract_dpo_pairs` will produce near-zero pairs on real agentic traces because `_normalize_action` only normalizes whitespace. On real Claude Code traces with structured `tool_use` blocks (JSON-formatted), two semantically identical "run bash command X" steps formatted slightly differently will count as disagreements, and two structurally different calls with identical normalized strings will count as agreements. The pair-extraction logic is likely to produce mostly noise on real traces without a proper tool-call parser.
|
| 207 |
+
|
| 208 |
+
This is acknowledged in the code comment ("For real agentic traces, this should parse the tool call (name + args) and return a canonical form") but is not surfaced as a risk in the final report's §10 failure modes section.
|
| 209 |
+
|
| 210 |
+
### Finding FR-R9: The world-model aux loss discussion correctly cites the key interference and causal-gap papers (CORRECT)
|
| 211 |
+
|
| 212 |
+
§2 cites arXiv:2602.00994 (Reasoning and Tool-use Compete in Agentic RL) and arXiv:2605.05029 (The Predictive-Causal Gap) correctly. Both are in the vault with correct titles. The conclusion — parameter-isolated head, ablation-gated — is appropriate given the evidence.
|
| 213 |
+
|
| 214 |
+
### Finding FR-R10: The "strip_thinking=False" critical path claim is not verified by any cited paper (UNVERIFIED)
|
| 215 |
+
|
| 216 |
+
§6: "strip_thinking must be False: ~67% of real Claude Code error-recovery turns are pure thinking." The 67% figure appears to come from internal repo analysis (ADR-002 / claude_code.py) rather than any external source. This is likely accurate (it's an empirical measurement from real sessions) but should be marked as an internal finding, not stated as if it has external validation.
|
| 217 |
+
|
| 218 |
+
---
|
| 219 |
+
|
| 220 |
+
## 5. Is the Channel-3/Tree Design Novel or Reinventing Something That Exists?
|
| 221 |
+
|
| 222 |
+
### Novel (good):
|
| 223 |
+
1. **Frozen-trace per-step replay across N heterogeneous frontier models with consensus-DPO harvesting**: No published paper does exactly this. The closest are Tree-GRPO (shared-prefix same-model branching) and rStar (two-model trajectory verification). The heterogeneous-teacher-consensus mechanism applied to a frozen captured trace is original.
|
| 224 |
+
|
| 225 |
+
2. **Combining execution-oracle fitness with multi-model disagreement for divergence-gated branching**: SWE-Search uses LM-estimated value (not execution oracle). Socratic-SWE uses execution validation as a filter, not as the MCTS fitness. Using test-suite pass-fraction as the tree's value function is the most meaningful differentiator.
|
| 226 |
+
|
| 227 |
+
3. **Two-loop architecture (outer MCTS/datagen + inner RL trainer) with the world-model head as the bridge**: The specific coupling where failed branches become world-model training targets (not policy gradient targets) is a design synthesis not found in any single paper.
|
| 228 |
+
|
| 229 |
+
### Reinventing (needs citation):
|
| 230 |
+
|
| 231 |
+
1. **MCTS over real code repositories**: SWE-Search (arXiv:2410.20285) does exactly this and is 18 months old. The repo cites it correctly as [51] in the final report but `research/05` doesn't mention it. The "23% relative improvement" from SWE-Search should inform the expected gain from adding search (whether at training or inference time).
|
| 232 |
+
|
| 233 |
+
2. **Tree-structured DPO signal from shared prefix branches**: Tree-GRPO (arXiv:2509.21240) formalizes exactly this. The repo's `extract_dpo_pairs` is a weaker/informal version. The repo should acknowledge that it is implementing a version of what Tree-GRPO proves, and should consider whether the formal Tree-GRPO objective would be stronger than the current Counter-based heuristic.
|
| 234 |
+
|
| 235 |
+
3. **Per-step process supervision from MCTS**: rStar-Math (arXiv:2501.04519) and Math-Shepherd do this for math. The adaptation to code/SWE is novel, but the core mechanism is not. `research/05` does acknowledge this in the PRM section, but understates how directly rStar-Math applies (it performs MCTS rollouts to generate step-verified training data, which is structurally similar to what the tree-of-work aims to do).
|
| 236 |
+
|
| 237 |
+
4. **World-model head trained on agent trajectories**: CWM (arXiv:2510.02387) trains a 32B model on observation-action trajectories for next-state prediction and achieves 65.8% on SWE-bench Verified. This is directly cited in the final report as [13] and correctly described. No reinvention issue.
|
| 238 |
+
|
| 239 |
+
---
|
| 240 |
+
|
| 241 |
+
## 6. Cost Estimate Plausibility
|
| 242 |
+
|
| 243 |
+
| Estimate | Source | Plausibility | Issues |
|
| 244 |
+
|---|---|---|---|
|
| 245 |
+
| **$0.98/trace ungated** | Spike 001 measurement, 50 synthetic states, N=3 | Accurate *for 50-step traces*. Misleading at full-session scale (~1400 steps → ~$73 with same model prices). | `teacher_replay.py` docstring should specify the scope. |
|
| 246 |
+
| **$64/trace flat** | `research/05` constructed estimate, N=8 teachers, T=1000 steps | Plausible *as an extrapolation* for a specific 8-teacher 1000-step scenario. Not measured, not a codebase config. | Calling it "ungated tree" is a misnomer. DEFAULT_TEACHERS has N=3, not N=8. |
|
| 247 |
+
| **O(N·decision-points) with divergence gating** | Final report §3, citing research/05 | Plausible as a *theoretical claim* if gating is very aggressive (VOI/entropy > τ fires only at a small fraction of steps). The 60-80% reduction estimate from "PRM literature" is from `research/05`'s analysis, not a specific paper citation. | No empirical measurement of what fraction of real agentic trace steps are "high-uncertainty." The claim needs a specific PRM citation. |
|
| 248 |
+
| **$0.30/trace with VOI gating** | `teacher_replay.py` docstring, "projected" | Speculative projection from $0.98 × ~0.3 reduction. Not measured. | The 70% reduction rate should cite a specific paper or be labeled "estimated." |
|
| 249 |
+
|
| 250 |
+
**Overall cost plausibility verdict:** The $0.98 and $64 figures are internally consistent as upper/lower bounds for specific parameterizations, but neither reflects the actual cost of running Channel 3 on real full-session Claude Code traces (which would be far higher). The divergence-gating reduction estimate (60-80%) is reasonable but unverified for agentic traces specifically. The final report correctly identifies the O(N^D) combinatorial blowup as the dominant cost concern for a true branching tree — this framing is sound.
|
| 251 |
+
|
| 252 |
+
---
|
| 253 |
+
|
| 254 |
+
## 7. Summary of Findings
|
| 255 |
+
|
| 256 |
+
| ID | Severity | Type | Location | Finding |
|
| 257 |
+
|---|---|---|---|---|
|
| 258 |
+
| 05-R1 | Medium | MISREAD | research/05 | rStar is trajectory-level verdict, not per-step replay — overclaims closeness |
|
| 259 |
+
| 05-R2 | High | MISS | research/05 | Tree-GRPO (arXiv:2509.21240) not cited; directly formalizes the DPO-from-branching mechanism |
|
| 260 |
+
| 05-R3 | High | MISS | research/05 | SWE-Search (arXiv:2410.20285) not cited; most direct execution-oracle-adjacent MCTS-on-SWE precedent |
|
| 261 |
+
| 05-R4 | Low | MISREAD | research/05 | AgentTrek "guided/counterfactual replay" connection overstated |
|
| 262 |
+
| 05-R5 | Low | PARTIAL OVERCLAIM | research/05 | Tree-GRPO's shared-prefix branching is closer to the step-level preference extraction than rStar |
|
| 263 |
+
| 05-R6 | Medium | INCONSISTENCY | research/05 + teacher_replay.py | $64 "8-teacher flat" extrapolation mislabeled as "ungated tree"; N=8 not a codebase config |
|
| 264 |
+
| FR-R1 | — | CORRECT | final_report §1 | Depth-1 / flat characterization of Channel 3 is accurate |
|
| 265 |
+
| FR-R2 | — | CORRECT | final_report §1 | Teacher-plurality vs execution-oracle distinction is accurate |
|
| 266 |
+
| FR-R3 | Medium | MISREAD (mild) | final_report §1 | Socratic-SWE contrast slightly overstated; execution-gating detail understated |
|
| 267 |
+
| FR-R4 | — | CORRECT | final_report §1 | Novel synthesis vs ingredient-level precedents is well-calibrated |
|
| 268 |
+
| FR-R5 | High | POTENTIAL OVERCLAIM | final_report §10 + teacher_replay.py | $0.98/trace docstring applies to 50-step synthetic traces; at real full-session scale (~1400 steps), cost would be ~$73/session |
|
| 269 |
+
| FR-R6 | Medium | FLAG | final_report §10 | "$64 ungated tree" label is misleading; it is a flat extrapolation with N=8, T=1000 |
|
| 270 |
+
| FR-R7 | — | CORRECT | final_report §7 | Heterogeneity pushback cited correctly |
|
| 271 |
+
| FR-R8 | High | MISS | final_report §10 | `_normalize_action` whitespace-only normalization is a critical silent failure path on real agentic traces; not in §10 failure modes |
|
| 272 |
+
| FR-R9 | — | CORRECT | final_report §2 | World-model interference and causal-gap papers cited correctly |
|
| 273 |
+
| FR-R10 | Low | UNVERIFIED | final_report §6 | 67% of error-recovery turns are pure thinking: internal empirical claim, should not be stated as externally validated |
|
| 274 |
+
|
| 275 |
+
---
|
| 276 |
+
|
| 277 |
+
## 8. Actionable Recommendations
|
| 278 |
+
|
| 279 |
+
1. **Fix `_normalize_action` before any real-trace replay**: The whitespace-only normalization is the most likely silent failure path. Implement tool-call parsing that canonicalizes (tool_name, args) tuples. This is acknowledged in the code comment but not in any ADR or the final report's failure mode list. **Add to ADR-002 consequences or open a new ADR.**
|
| 280 |
+
|
| 281 |
+
2. **Correct `teacher_replay.py` docstring**: Change "Verified economic floor (✅ spike 001): $0.98 mean per-trace cost ungated" to "Verified economic floor for 50-step synthetic-state benchmark traces (Spike 001): $0.98 mean per-trace. Full Claude Code sessions (~1,400 tool-use steps) would cost ~$70–80 flat ungated at DEFAULT_TEACHERS pricing."
|
| 282 |
+
|
| 283 |
+
3. **Cite Tree-GRPO in `research/05`** and consider adopting its formal objective in `extract_dpo_pairs` rather than the Counter heuristic. The Tree-GRPO advantage estimator (intra-tree group-relative) is a stronger principled basis for the DPO pair extraction.
|
| 284 |
+
|
| 285 |
+
4. **Add SWE-Search to `research/05`** and note that its 23% relative gain represents the expected floor for adding any tree search to a SWE agent. The tree-of-work must beat this at equal compute on a per-training-dollar basis, not just at inference time.
|
| 286 |
+
|
| 287 |
+
5. **Standardize the $64 figure's label**: In the final report's in-text citations and in `research/05`, consistently label it "flat N=8 T=1000 extrapolation (not measured, not a codebase config)" rather than "ungated tree."
|
| 288 |
+
|
| 289 |
+
6. **Add the `_normalize_action` failure mode to §10's failure modes list** in any future revision of the final report.
|
| 290 |
+
|
| 291 |
+
7. **The tree-of-work core claim stands**: The combination of execution-oracle fitness + frozen-trace multi-teacher replay + divergence-gated recursive branching + typed signal routing (world-model head for failed branches, contrastive DPO for near-misses) is a genuine novel synthesis. The final report's pre-registered ablation design (P0–P6 axis + heterogeneity control) is the right scientific vehicle — do not abandon it based on the misreads identified here, which are correction-level, not foundation-level problems.
|
|
@@ -0,0 +1,292 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Deep-Read: RL Infra & Frameworks — Critical Findings
|
| 2 |
+
**Cluster 8 of the dataset-pipeline review series**
|
| 3 |
+
**Reviewer:** automated critical pipeline, 2026-06-09
|
| 4 |
+
**Primary sources fetched:** TRL GRPOTrainer live docs (v1.5.1, huggingface.co/docs/trl/en/grpo_trainer), vLLM co-locate blog (June 2025, huggingface.co/blog/vllm-colocate), verl GitHub README (main branch, verl-project/verl — 21.9k stars), SWE-MiniSandbox paper (arXiv:2602.11210v5), secure EKS sandboxes article (AWS Builder Center). Repo files inspected: `research/04-verl-trl.md`, `research/03-monarch-torchforge-openenv.md`, `docs/adrs/ADR-006-rl-frameworks.md`, `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md`, `composer_replication/datagen/env.py`, `composer_replication/trainer/composer_trainer.py`, `composer_replication/recipes/prime_rl/composer_loss.py`.
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## 1. TRL GRPOTrainer: Does It Actually Support Multi-Turn Agentic Rollouts?
|
| 9 |
+
|
| 10 |
+
### 1.1 What the live TRL docs say (as of TRL v1.5.1, fetched 2026-06-10)
|
| 11 |
+
|
| 12 |
+
The TRL GRPOTrainer docs (primary source, not the repo's research/04 note written in May 2025) describe **two distinct agentic mechanisms**:
|
| 13 |
+
|
| 14 |
+
**Mechanism A — `tools` parameter:** Pass a list of Python callables. GRPOTrainer runs a tool-call loop. Quote from docs: "GRPO supports agent training through the `tools` argument in `GRPOTrainer`." The loop has a hard cap `max_tool_calling_iterations` (default: unlimited, stops on no-tool-call response or `max_model_length`). Each tool call is synchronous — the training GPU waits.
|
| 15 |
+
|
| 16 |
+
**Mechanism B — `environment_factory` parameter:** Pass a callable that creates environment instances. "GRPOTrainer creates one environment instance per rollout and exposes the environment's public methods as tools." Requires `transformers>=5.2.0`. Marked **experimental**: "This feature is experimental and may change or be removed at any time without prior notice." The `reset()` method can return a string that gets appended to the last user message. `rollout_func` is similarly experimental.
|
| 17 |
+
|
| 18 |
+
**Mechanism C — `rollout_func` (custom rollout):** A callable that receives prompts and the trainer, returns `{"prompt_ids", "completion_ids", "logprobs"}`. Also experimental. This is the escape hatch for fully custom multi-turn generation.
|
| 19 |
+
|
| 20 |
+
**Key constraint confirmed from primary source:** TRL has **no async GPU-decoupled agent loop**. The docs explicitly state the training-inference mismatch and handle it via Truncated Importance Sampling (`vllm_importance_sampling_correction=True` by default), not by async GPU handoff. When a tool call is executing, the GPU waits. This is not a flaw in the repo's research note — `research/04-verl-trl.md` correctly identified this gap — but the docs now show TRL has partially closed the *multi-turn* gap via `tools` / `environment_factory`.
|
| 21 |
+
|
| 22 |
+
### 1.2 What `research/04-verl-trl.md` claims vs. primary source
|
| 23 |
+
|
| 24 |
+
| Claim in research/04 | Primary source (TRL docs, 2026) | Verdict |
|
| 25 |
+
|---|---|---|
|
| 26 |
+
| "TRL does NOT have an async GPU-decoupled agent loop" | Confirmed | CORRECT |
|
| 27 |
+
| "OpenEnv integration (October 2025)" | Confirmed; `environment_factory` + TRL's OpenEnv guide | CORRECT |
|
| 28 |
+
| "VLM support" | Confirmed — tools can return `list` of content blocks incl. images | CORRECT |
|
| 29 |
+
| "GRPOTrainer supports multi-step agentic rollouts" (04:173) | Confirmed via `tools` + `environment_factory` | CORRECT |
|
| 30 |
+
| TRL v1.0 released March 2026 | Confirmed; docs show versions v1.0.0 through v1.5.1 | CORRECT |
|
| 31 |
+
| Default `loss_type` is `"dapo"` | **CONFIRMED from source**: `loss_type: str = 'dapo'` in GRPOConfig | CORRECT |
|
| 32 |
+
| Default `scale_rewards` is... | **CONFIRMED: default is `"group"`** (not `False`/`"none"`) | CORRECT |
|
| 33 |
+
|
| 34 |
+
### 1.3 Critical discovery: TRL's default is DAPO, not GRPO
|
| 35 |
+
|
| 36 |
+
The TRL GRPOConfig shows `loss_type = 'dapo'` as the default. ADR-008 claims to configure `loss_type="dr_grpo"` to match Composer 2.5. The source confirms `"dr_grpo"` is a valid value (uses `max_completion_length` as the constant denominator). **This is consistent with ADR-008's decision.**
|
| 37 |
+
|
| 38 |
+
However: ADR-008 states "KL estimator (k1 vs k3) is not configured or asserted" as an OPEN item. The TRL docs show `beta=0.0` as default (KL term disabled). If `beta=0`, the k1/k3 distinction is moot — there is no KL term in the loss at all. The ADR-008 open item is therefore **low-priority when beta=0** (the current default). If the repo ever enables beta>0 to use k1 KL-in-loss (distinct from the k1-in-reward path the trainer already implements), the open item becomes relevant.
|
| 39 |
+
|
| 40 |
+
### 1.4 `scale_rewards` drift assertion
|
| 41 |
+
|
| 42 |
+
ADR-008 checks `str(cfg.scale_rewards).lower() in ("none","false")`. Primary source confirms `scale_rewards` accepts: `True`/`"group"` (default), `"batch"`, `False`/`"none"`. The check is correct.
|
| 43 |
+
|
| 44 |
+
---
|
| 45 |
+
|
| 46 |
+
## 2. The Colocate-vLLM Blog: What It Actually Says
|
| 47 |
+
|
| 48 |
+
Primary source: "No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL" (June 3, 2025, huggingface.co/blog/vllm-colocate).
|
| 49 |
+
|
| 50 |
+
**What the blog confirms:**
|
| 51 |
+
- Co-locate mode (`vllm_mode="colocate"`) runs training and vLLM in the same process, sharing GPUs. No REST API overhead.
|
| 52 |
+
- Speedups measured: 1.43× for 1.5B model, 1.35× for 7B model, 1.73× for 7B with TP>1. For 72B model (Qwen2.5-Math-72B): **co-locate is ~1.26× faster than plain TRL with 4 fewer GPUs**.
|
| 53 |
+
- vLLM sleep mode (level 2) is **not yet merged into TRL upstream** (as of the blog date) due to a segfault on exit (vLLM issue #16993). The docs now show `vllm_enable_sleep_mode` as a parameter, implying it was eventually merged, but the blog notes a real production bug.
|
| 54 |
+
- FSDP + co-locate + LoRA has known issues: "GRPO + FSDP + LoRA + VLLM colocate" doesn't work; DeepSpeed ZeRO-3 is the recommended path. FSDP1 has a NaN bug with co-located vLLM (issue #14443).
|
| 55 |
+
|
| 56 |
+
**What `research/04` says:** Does not cite the co-locate blog specifically but correctly describes the "NO GPU left behind" feature (June 2025 update, row in §2.7 table). No material misread.
|
| 57 |
+
|
| 58 |
+
**What the repo's SageMaker smoke recipe uses:** The SageMaker GRPO smoke (from git history context) uses `use_vllm=False` for initial tests, which is fine — the co-locate mode requires enough GPU memory for both model and vLLM, and a single g5.2xlarge (1× A10G, 24 GB) may not accommodate it.
|
| 59 |
+
|
| 60 |
+
---
|
| 61 |
+
|
| 62 |
+
## 3. VeRL: Agentic Mode and the AsyncServer
|
| 63 |
+
|
| 64 |
+
Primary source: verl GitHub README (main branch, fetched 2026-06-10). Key finding from README:
|
| 65 |
+
|
| 66 |
+
> "[2026/05] uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl."
|
| 67 |
+
> "[2026/01] transfer_queue, fully_async_policy, one_step_off_policy ... are kept under verl/experimental since they are planned to be merged into the main library."
|
| 68 |
+
> Feature list includes: "Multi-turn with tool calling", "Sandbox Fusion Integration", "SGLang, verl, OpenBMB and Tsinghua University: Pioneering End-to-End Multi-Turn RLHF"
|
| 69 |
+
|
| 70 |
+
The verl README confirms multi-turn tool-calling exists and `uni-agent` was released May 2026 as a unified agent framework. The `AsyncServer`/`AgentLoop` architecture described in `research/04` is consistent with what the README describes, though the README doesn't use those exact terms. The experimental async features (`fully_async_policy`, `transfer_queue`) are available but not yet in main.
|
| 71 |
+
|
| 72 |
+
**What `research/04` claims about VeRL agentic support:**
|
| 73 |
+
- "First-class agentic RL support" with `AsyncServer`/`AgentLoop` — the README confirms the direction but notes these are under `verl/experimental`. The research/04 characterization of "first-class" **slightly overclaims** what is in the stable API; the full async path is experimental.
|
| 74 |
+
- `SandboxFusionTool` — mentioned in the README as a documented integration ("Sandbox Fusion Integration" link). Consistent.
|
| 75 |
+
- "Multi-turn tokenisation: noted as complex; naive concatenation of per-turn token IDs can introduce distribution drift" — confirmed by the README community blog link "When Reasoning Models Break Tokenization: The Hidden Complexity of Multiturn Training."
|
| 76 |
+
|
| 77 |
+
---
|
| 78 |
+
|
| 79 |
+
## 4. PRIME-RL in the Repo
|
| 80 |
+
|
| 81 |
+
### 4.1 What ADR-006 claims
|
| 82 |
+
|
| 83 |
+
ADR-006 claims PRIME-RL ships a `CustomLossConfig` with `import_path` for dropping in a Python loss function, exposing `LossInputs` with `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask`. It was used to train INTELLECT-1 (10B, 30 nodes) and INTELLECT-2 (32B QwQ).
|
| 84 |
+
|
| 85 |
+
### 4.2 What `composer_replication/recipes/prime_rl/composer_loss.py` confirms
|
| 86 |
+
|
| 87 |
+
The code reads (lines 21-28):
|
| 88 |
+
```python
|
| 89 |
+
@dataclass
|
| 90 |
+
class LossInputs:
|
| 91 |
+
trainer_logprobs: Float[Tensor, ' seq']
|
| 92 |
+
inference_logprobs: Float[Tensor, ' seq']
|
| 93 |
+
teacher_logprobs: Float[Tensor, ' seq'] | None
|
| 94 |
+
advantages: Float[Tensor, ' seq']
|
| 95 |
+
loss_mask: Bool[Tensor, ' seq']
|
| 96 |
+
```
|
| 97 |
+
This is marked as "verified against PrimeIntellect-ai/prime-rl `src/prime_rl/trainer/rl/loss.py` lines 13-22." The code correctly raises `NotImplementedError` when `alpha_sdpo > 0` (logits not available, only log-probs). **This is a real constraint, not a placeholder.**
|
| 98 |
+
|
| 99 |
+
### 4.3 The DPPO upstream loss — a subtle accuracy point
|
| 100 |
+
|
| 101 |
+
The `composer_loss.py` reproduces the upstream DPPO loss verbatim (lines 40-60 of the file). It uses:
|
| 102 |
+
```python
|
| 103 |
+
probs_diff = exp(trainer_logprobs) - exp(inference_logprobs) # probability-space diff
|
| 104 |
+
```
|
| 105 |
+
This is notably **not** a log-ratio but a probability-space difference gating the drop/keep mask. This is PRIME-RL's design, not a repo mistake. But it means the DPPO channel is more like PRIME-RL's INTELLECT-style training than standard GRPO — the repo's framing in ADR-006 as "channels 1+3" needs to be understood in that context: channel 1 is DPPO-shaped (probability-gated policy update), not raw GRPO.
|
| 106 |
+
|
| 107 |
+
---
|
| 108 |
+
|
| 109 |
+
## 5. The Key Question: Is TRL's Single-Submit `reward_fn` a Dead End for Multi-Turn?
|
| 110 |
+
|
| 111 |
+
### 5.1 What `env.py::reward_fn` actually does
|
| 112 |
+
|
| 113 |
+
```python
|
| 114 |
+
def reward_fn(self, prompts, completions, *, task_id, **kwargs) -> list[float]:
|
| 115 |
+
...
|
| 116 |
+
for comp, tid in zip(completions, task_id):
|
| 117 |
+
task = self.registry[tid]
|
| 118 |
+
self.reset(task)
|
| 119 |
+
if self._replay is not None:
|
| 120 |
+
res = self._replay(self, comp)
|
| 121 |
+
else:
|
| 122 |
+
res = self.step({"type": "submit"}) # <-- single submit
|
| 123 |
+
rewards.append(res.reward)
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
**The fallback path** (no `_replay` function) treats the entire completion as a single submit — this is an outcome reward on a single-turn completion. This is what the unit tests exercise and what a standard TRL `reward_funcs` call would do.
|
| 127 |
+
|
| 128 |
+
**The intended multi-turn path** is `_replay`: a callable that takes `(env, completion)` and drives multi-turn turns by parsing the agent's encoded tool-call history from the `completion` string. This is a **custom deserializer** that replays the agent turns and grades at the end.
|
| 129 |
+
|
| 130 |
+
### 5.2 Is this a dead end for multi-turn RL?
|
| 131 |
+
|
| 132 |
+
**For current TRL integration: partly yes, mostly no.**
|
| 133 |
+
|
| 134 |
+
The single-submit fallback IS a dead end for genuine multi-turn RL credit assignment — it cannot grade intermediate tool-call steps. But there are two viable paths:
|
| 135 |
+
|
| 136 |
+
**Path A — `_replay` + `rollout_func` (TRL experimental):** The `rollout_func` parameter in GRPOTrainer can drive multi-turn generation externally (running the env's `step()` loop), serialize the full trajectory into `completion` tokens, then call `reward_fn` which uses `_replay` to deserialize and grade. This makes the `reward_fn` the grader, not the rollout driver. This works in TRL **today** but requires the experimental `rollout_func` interface.
|
| 137 |
+
|
| 138 |
+
**Path B — `environment_factory` (TRL experimental):** Pass `FeatureDeletionEnv` (or an adapter) as `environment_factory`. GRPOTrainer calls `reset()` and then uses the env's public methods as tools. The `reward_fn` is replaced by a reward function that reads `environments[i].reward` after generation. This is the more principled path for true multi-turn RL and is what TRL's `environment_factory` was designed for. It requires `transformers>=5.2.0` and is still experimental.
|
| 139 |
+
|
| 140 |
+
**Path C — VeRL's AgentLoop:** For the tree-of-work (multiple parallel rollout branches per prompt, credit assigned at trajectory end), VeRL's `AsyncServer`+`AgentLoop` is architecturally the right fit. Each branch is a coroutine; GPU is not blocked during `sandbox.exec()` calls. The repo acknowledges this in `research/04` §5.3 recommendation.
|
| 141 |
+
|
| 142 |
+
### 5.3 The honest migration path
|
| 143 |
+
|
| 144 |
+
The current TRL single-submit `reward_fn` is:
|
| 145 |
+
- **Correct** for the Phase 1 use case: offline dataset generation where the model produces a diff and we grade it. This is the "GRPO on completions" paradigm.
|
| 146 |
+
- **Insufficient** for genuine multi-turn RL over FeatureDeletionEnv episodes, especially the tree-of-work vision where the model takes tool-call steps, explores branches, and gets rewards at trajectory end.
|
| 147 |
+
|
| 148 |
+
**Migration path (in order of complexity):**
|
| 149 |
+
|
| 150 |
+
1. **Immediate (low cost):** Use TRL's `environment_factory` with `FeatureDeletionEnv` as the adapter. The env's `step()` becomes a tool. Grade via `reward_funcs` reading `env.reward`. Marked experimental but low integration cost. This supports genuine multi-turn GRPO with the current TRL host.
|
| 151 |
+
|
| 152 |
+
2. **Medium term (single-GPU scale):** Implement `rollout_func` that drives the env loop directly, returns serialized trajectories with log-probs. Full control over multi-turn; TRL handles the GRPO update.
|
| 153 |
+
|
| 154 |
+
3. **Scale-out (multi-GPU, async, tree-of-work):** Migrate to VeRL's `AgentLoop`. The `FeatureDeletionEnv` maps onto verl's `SandboxFusionTool` protocol. The tree-of-work branching requires N parallel rollout workers per prompt, which VeRL's asyncio architecture supports and TRL's synchronous loop does not.
|
| 155 |
+
|
| 156 |
+
**The tree-of-work IS multi-turn.** The vision in `framework/composer-replication-framework.md` of a "multi-model Monte-Carlo tree-of-work" requires:
|
| 157 |
+
- Many concurrent rollout branches per prompt
|
| 158 |
+
- Reward propagated back through the tree (not just at leaf)
|
| 159 |
+
- Asynchronous sandbox execution without blocking GPU
|
| 160 |
+
|
| 161 |
+
None of these are provided by TRL's current `GRPOTrainer` (even with `tools`/`environment_factory`). VeRL's experimental `fully_async_policy` + `AgentLoop` is the right substrate. The repo's `research/04` correctly identifies this but the ADR layer has not formally acknowledged this migration requirement.
|
| 162 |
+
|
| 163 |
+
---
|
| 164 |
+
|
| 165 |
+
## 6. Sandboxing for Code Execution at Scale
|
| 166 |
+
|
| 167 |
+
### 6.1 What the secure-EKS article says (primary source)
|
| 168 |
+
|
| 169 |
+
> "gVisor added negligible launch latency... handles isolation for most agent workloads."
|
| 170 |
+
> "Cold start was around 5 seconds per sandbox" for Kata+Firecracker.
|
| 171 |
+
> "EKS Managed Node Groups do not work yet: they override the CPU Options stanza needed for nested virtualization, forcing the use of self-managed node groups."
|
| 172 |
+
> "Managed sandbox platforms skip Kubernetes entirely. E2B and Vercel Sandbox provision Firecracker microVMs directly... sandbox creation in under a second, versus ~5 seconds Kata with Firecracker on EKS."
|
| 173 |
+
|
| 174 |
+
### 6.2 What SWE-MiniSandbox (arXiv:2602.11210v5) adds
|
| 175 |
+
|
| 176 |
+
Abstract: "SWE-MiniSandbox lowers disk usage to approximately 5% of that required by container-based pipelines and reduces environment preparation time to about 25% of the container baseline." Uses "kernel-level mechanisms" (not containers) with "lightweight environment pre-caching." Empirical performance comparable to container-based pipelines on SWE-bench-style tasks.
|
| 177 |
+
|
| 178 |
+
This is directly relevant to the repo's `FeatureDeletionEnv`/`Sandbox` design. The current `sandbox.py` uses `LocalSubprocessSandbox` (plain subprocess, Docker-gated for real tests) — essentially no isolation for the subprocess case. For production RL training at scale with multiple rollout workers, the SWE-MiniSandbox approach (kernel-level isolation without per-task container builds) could reduce env setup from minutes to seconds.
|
| 179 |
+
|
| 180 |
+
### 6.3 What the repo's sandbox.py actually provides
|
| 181 |
+
|
| 182 |
+
`sandbox.py` defines:
|
| 183 |
+
- `LocalSubprocessSandbox` — runs commands via `subprocess.run` in the repo tree. No container, no kernel isolation. The security model relies on the denylist + cache scrub (commented as "INSUFFICIENT as a primary control").
|
| 184 |
+
- `DockerSandbox` (in `docker_sandbox.py`) — real isolation, referenced in tests.
|
| 185 |
+
|
| 186 |
+
**The gap:** For RL training at scale (many parallel rollout workers), neither `LocalSubprocessSandbox` nor per-task Docker containers are adequate:
|
| 187 |
+
- Subprocess: no isolation, reward hacking possible via Python import tricks (acknowledged in code comments).
|
| 188 |
+
- Docker: isolation is good, but per-task container boot is slow (typical: 2–5s without pre-warming), and at 8 rollouts × N prompts × G generations = hundreds of container launches per training step.
|
| 189 |
+
|
| 190 |
+
The repo's `research/review-sandbox.json` presumably tracks this; the production path requires pre-warmed sandbox pools (gVisor RuntimeClass for speed, Kata/Firecracker for stronger isolation).
|
| 191 |
+
|
| 192 |
+
---
|
| 193 |
+
|
| 194 |
+
## 7. Misreads, Overclaims, and Gaps in Repo Research
|
| 195 |
+
|
| 196 |
+
### 7.1 OVERCLAIM: VeRL "first-class" agentic RL
|
| 197 |
+
|
| 198 |
+
`research/04` §1.5: "VeRL has first-class agentic RL support" and describes `AsyncServer`/`AgentLoop` as stable. The verl README (main branch, 2026-06-10) shows:
|
| 199 |
+
- `transfer_queue`, `fully_async_policy`, `one_step_off_policy` are kept under `verl/experimental` — "planned to be merged into the main library."
|
| 200 |
+
- `uni-agent` (May 2026) provides the higher-level agent framework, but it's a separate release on top of verl, not part of the stable library.
|
| 201 |
+
|
| 202 |
+
The agentic async path **exists** but is experimental in the stable API. The characterization as "first-class" is slightly ahead of the actual maturity. The repo should note this when recommending VeRL for the tree-of-work.
|
| 203 |
+
|
| 204 |
+
### 7.2 MISS: TRL now has `environment_factory` + `tools` for multi-turn
|
| 205 |
+
|
| 206 |
+
`research/04` (written May 2025) describes TRL as having only synchronous reward functions. The live TRL docs (v1.5.1, 2026) show `environment_factory` and `tools` for genuine multi-turn generation loops. These are experimental but available. The research/04 comparison matrix says "TRL: agentic tool-calling RL ⚠️ (blocking)" — this remains accurate (it IS blocking), but misses that TRL now provides the `environment_factory` interface which would allow `FeatureDeletionEnv` to drive multi-turn episodes inside GRPOTrainer without a custom `rollout_func`. This is not mentioned in any ADR.
|
| 207 |
+
|
| 208 |
+
### 7.3 MISS: TRL default `loss_type="dapo"`, NOT `"grpo"` or `"dr_grpo"`
|
| 209 |
+
|
| 210 |
+
ADR-008 correctly targets `loss_type="dr_grpo"`, but research/04 (the background research) does not explicitly state that TRL 1.x defaults to DAPO loss. The live docs confirm this. The drift assertion in `make_dr_grpo_config` is the right mitigation.
|
| 211 |
+
|
| 212 |
+
### 7.4 GAP: `scale_rewards` default is `"group"`, not `True`
|
| 213 |
+
|
| 214 |
+
The GRPOConfig shows `scale_rewards: str = 'group'` (a string, not a bool). ADR-008's assertion `str(cfg.scale_rewards).lower() in ("none","false")` correctly handles both the old bool (`False`) and new string (`"none"`) forms. But the docs show `True` and `"group"` are equivalent (both mean group-level std scaling). The assertion is correct; this is a documentation note, not a bug.
|
| 215 |
+
|
| 216 |
+
### 7.5 GAP: KL estimator — TRL default is k3, not k1
|
| 217 |
+
|
| 218 |
+
The TRL docs show the KL approximator formula:
|
| 219 |
+
```
|
| 220 |
+
D_KL[π_θ || π_ref] = π_ref/π_θ - log(π_ref/π_θ) - 1
|
| 221 |
+
```
|
| 222 |
+
This is the **k3 estimator** (Schulman et al., 2020). ADR-008 claims Composer 2.5 uses k1 (`-log r = log π_θ/π_ref`). The ADR notes this as an OPEN item. Since `beta=0.0` by default in TRL, the KL term is disabled and this doesn't affect training unless `beta>0`. However, `composer_trainer.py` implements the k1-in-reward path via `kl_in_reward.py` — this is a separate mechanism from TRL's in-loss KL. Verify: the k1-in-reward path computes `log(π_θ/π_ref)` at reward time and folds it into advantages, while TRL's in-loss k3 term (when beta>0) would add a different term. If both are enabled simultaneously, they would double-count KL. The safe configuration is: **k1-in-reward active, beta=0 (TRL in-loss KL disabled)**. The code appears to do this but there's no explicit assertion that `beta=0` when `kl_in_reward=True`.
|
| 223 |
+
|
| 224 |
+
### 7.6 MISS: `num_iterations=1` narrowed claim
|
| 225 |
+
|
| 226 |
+
ADR-008 acknowledges that `num_iterations=1` controls GRPO inner-loop reuse, not dataset-level epochs. Primary source confirms: "Number of iterations per batch (denoted as μ in the algorithm)." The ADR's narrowed claim is correct.
|
| 227 |
+
|
| 228 |
+
### 7.7 MISS: TRL default optimizer is `adamw_torch_fused`, not `adam`
|
| 229 |
+
|
| 230 |
+
ADR-008 has an OPEN item: "Adam is claimed but `optim` is not set." The GRPOConfig docs show:
|
| 231 |
+
```
|
| 232 |
+
optim: transformers.training_args.OptimizerNames | str = 'adamw_torch_fused'
|
| 233 |
+
```
|
| 234 |
+
Default is `adamw_torch_fused` (AdamW with fused CUDA kernel), not plain `adam`. If Composer 2.5 uses Adam (without weight decay), the ADR's open item remains relevant: set `weight_decay=0.0` and `optim="adam"` explicitly to match. The default AdamW has weight decay (though `weight_decay=0.0` is already the GRPOConfig default, making it numerically equivalent to Adam in this specific case). However, `adamw_torch_fused` ≠ `adam` in terms of the optimizer implementation; to be precise, set `optim="adamw_8bit"` or `optim="paged_adamw_8bit"` (memory efficient) or just `optim="adam_torch"` if plain Adam is intended.
|
| 235 |
+
|
| 236 |
+
---
|
| 237 |
+
|
| 238 |
+
## 8. verl `uni-agent` — A New Development Not in Research/04
|
| 239 |
+
|
| 240 |
+
The verl README (May 2026): "uni-agent is released: a unified agent framework to build, run, and train LLM agents at scale, built on top of verl." This is a post-cutoff development (research/04 was written May 2025) that the repo has not incorporated. `uni-agent` could be the production-ready path for multi-turn agentic RL with verl, potentially superseding the lower-level `AgentLoop`/`AsyncServer` integration that ADR-006 contemplates.
|
| 241 |
+
|
| 242 |
+
**Implication for the repo:** Before committing to a custom VeRL `AgentLoop` integration, evaluate whether `uni-agent` already provides the FeatureDeletionEnv integration pattern out of the box. This could significantly reduce the engineering surface.
|
| 243 |
+
|
| 244 |
+
---
|
| 245 |
+
|
| 246 |
+
## 9. Sandboxing Recommendation Gaps
|
| 247 |
+
|
| 248 |
+
### 9.1 The `SWE-MiniSandbox` approach is not referenced anywhere in the repo
|
| 249 |
+
|
| 250 |
+
arXiv:2602.11210 (Feb 2026) directly addresses the production gap: container-free RL training with 5% disk usage and 25% env setup time vs containers. The paper's "kernel-level mechanisms" (likely Linux namespaces + cgroups without a full container runtime) with pre-caching is directly applicable to `FeatureDeletionEnv` at scale. The repo's sandbox design doesn't reference this work.
|
| 251 |
+
|
| 252 |
+
### 9.2 The repo's `docker_sandbox.py` is production-blocking for RL at scale
|
| 253 |
+
|
| 254 |
+
Per-task Docker container boots at GRPO scale (G=8 completions/prompt, B=4 per-device batch, many workers) means O(B*G) = O(32) container launches per training step. Without pre-warming or snapshot-based fast boot, this is the dominant latency. The gVisor RuntimeClass approach (negligible overhead, per the AWS article) or SWE-MiniSandbox's kernel-namespace approach are both faster alternatives.
|
| 255 |
+
|
| 256 |
+
---
|
| 257 |
+
|
| 258 |
+
## 10. Summary of Critical Findings
|
| 259 |
+
|
| 260 |
+
| Finding | Severity | Affected Files |
|
| 261 |
+
|---|---|---|
|
| 262 |
+
| VeRL async agent loop is EXPERIMENTAL, not "first-class" stable | MEDIUM — overclaim | research/04 §1.5, ADR-006 |
|
| 263 |
+
| TRL `environment_factory` (multi-turn) not in any ADR | MEDIUM — miss | ADR-008, env.py |
|
| 264 |
+
| k1-in-reward + beta=0 assertion missing (double-KL risk if beta>0 ever set) | MEDIUM — correctness gap | composer_trainer.py, ADR-008 |
|
| 265 |
+
| `optim` default is `adamw_torch_fused`, not `adam` | LOW — fidelity gap | ADR-008 OPEN item |
|
| 266 |
+
| TRL `loss_type` defaults to `"dapo"` (not GRPO), correctly handled | INFO — confirmed correct | ADR-008, make_dr_grpo_config |
|
| 267 |
+
| `env.py::reward_fn` single-submit path is dead end for tree-of-work | HIGH — architecture gap | env.py, no ADR exists |
|
| 268 |
+
| `uni-agent` (verl, May 2026) not evaluated — may supersede custom AgentLoop | MEDIUM — miss | ADR-006 |
|
| 269 |
+
| SWE-MiniSandbox approach not referenced (5% disk, 25% setup time) | MEDIUM — miss | sandbox.py, docker_sandbox.py |
|
| 270 |
+
| EKS Managed Node Groups incompatible with Kata+Firecracker (nested virt) | INFO — production gotcha | no ADR |
|
| 271 |
+
|
| 272 |
+
---
|
| 273 |
+
|
| 274 |
+
## 11. Migration Path for Multi-Turn Agentic RL (Honest Assessment)
|
| 275 |
+
|
| 276 |
+
The current repo architecture (TRL `reward_fn` with single-submit fallback) is:
|
| 277 |
+
|
| 278 |
+
**Phase 1 — GRPO on completions (current):** The model generates a single diff/completion, the reward function grades it. This is viable, shippable, and correct. The TRL host is appropriate. No migration needed for this phase.
|
| 279 |
+
|
| 280 |
+
**Phase 2 — Multi-turn FeatureDeletionEnv (agentic GRPO):** The model takes tool-call steps (bash, file edits, test runs). Reward at trajectory end. Migration options in order:
|
| 281 |
+
|
| 282 |
+
1. **TRL `environment_factory` adapter (experimental, weeks of work):** Wrap `FeatureDeletionEnv` as a TRL environment. Methods become tools. Blocking GPU during sandbox execution — OK for small scale (≤8 GPUs), not for high parallelism.
|
| 283 |
+
|
| 284 |
+
2. **TRL `rollout_func` (experimental, 1–2 weeks):** Custom rollout that drives the env loop, serializes trajectories. Full control; TRL handles GRPO update.
|
| 285 |
+
|
| 286 |
+
3. **VeRL AsyncServer + `FeatureDeletionEnv` as SandboxFusionTool adapter (2–4 weeks):** GPU not blocked during sandbox calls. Required for tree-of-work fan-out at scale. The repo's ADR-006/ADR-008 have this on the roadmap but it's not implemented.
|
| 287 |
+
|
| 288 |
+
**Phase 3 — Tree-of-Work (MCTS, multi-branch):** This REQUIRES verl's async architecture. TRL cannot support N parallel branches per prompt with GPU-efficient execution. The `uni-agent` framework on top of verl should be evaluated first before building a custom AgentLoop integration.
|
| 289 |
+
|
| 290 |
+
---
|
| 291 |
+
|
| 292 |
+
*Sources: TRL docs fetched 2026-06-10 (huggingface.co/docs/trl/en/grpo_trainer); vLLM co-locate blog (huggingface.co/blog/vllm-colocate); verl README (github.com/volcengine/verl/blob/main/README.md); SWE-MiniSandbox arXiv:2602.11210v5; AWS Builder Center EKS sandboxes article. Repo files: research/04, research/03, ADR-006, ADR-008, env.py, composer_trainer.py, recipes/prime_rl/composer_loss.py.*
|
|
@@ -0,0 +1,169 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Critic: FIDELITY-TO-SOURCES
|
| 2 |
+
**Reviewer:** fidelity critic (critical-review pipeline)
|
| 3 |
+
**Date:** 2026-06-10
|
| 4 |
+
**Primary sources re-verified this pass:** Composer 2.5 blog full body (vault note `introducing-composer-25-cursor`, re-read verbatim), deep-reads `00-grounding.md` and `01`–`08` (all read in full), plus direct re-reads of `docs/COMPOSER_RECIPE_MAPPING.md`, `research/01-composer-2.5.md`, `research/06-feature-deletion-datagen.md`, `docs/adrs/ADR-010`, `composer_replication/opsd.py`, `composer_replication/teacher_replay.py`, `composer_replication/trainer/kl_in_reward.py`, `composer_replication/diloco/__init__.py`, `framework/composer-replication-framework.md`, `docs/VISION_VALIDATION.md`.
|
| 5 |
+
|
| 6 |
+
Severity scale: **P0** = claim must be corrected (factually wrong vs primary source); **P1** = needs a caveat/qualifier (true-ish but overclaims or hides scope); **P2** = wording/precision fix.
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## P0 — claims that must be corrected
|
| 11 |
+
|
| 12 |
+
**1. [P0] Benchmark numbers (69.3%, Terminal-Bench 2.0 parity) presented as Cursor-stated — they appear in NO primary source.**
|
| 13 |
+
- *Source fact:* The Composer 2.5 blog (re-fetched, full body) contains **zero benchmark numbers**. The only numerics are pricing ($0.50/$2.50, $3.00/$15.00), "25x more synthetic tasks," "10x more total compute," "0.2s" optimizer step, CP=2/EP=8. The Composer 2 tech report's Table 1 has 61.3/73.7/61.7 for Composer **2** (deep-read 01, FINDING-1).
|
| 14 |
+
- *Repo claim:* `research/01-composer-2.5.md:63-64`: "Composer 2.5 scored 69.3% … On public agentic benchmarks like *Terminal-Bench 2.0*, it hit 69.3%. On *SWE-bench Multilingual*, it achieved parity with or slightly surpassed OpenAI's GPT-5.5."
|
| 15 |
+
- *Problem:* The file's own audit notice (line 8) flags this, but the body still asserts the numbers as fact, and the audit note says only "not in the 2.5 blog" — it does not say the numbers are unverified in ANY primary source. Anyone citing §Performance gets fabricated/secondary numbers.
|
| 16 |
+
- *Correction:* Strike or rewrite §"Performance Characteristics" to: "Neither the 2.5 blog nor the Composer 2 technical report publishes Composer 2.5 benchmark numbers. Figures like 69.3% circulate in secondary commentary (e.g., the Lushbinary developer-guide blog in the vault) and must not be used as replication targets."
|
| 17 |
+
|
| 18 |
+
**2. [P0] "Feature Deletion + 24 other (unnamed) generators" — the count 24 is invented.**
|
| 19 |
+
- *Source fact (blog, verbatim):* "We use a **range of approaches** for creating synthetic tasks that are grounded in real codebases. For example, **one** synthetic approach is feature deletion." No count, no other names, anywhere.
|
| 20 |
+
- *Repo claim:* `docs/COMPOSER_RECIPE_MAPPING.md:75` (mapping row b): "Feature Deletion + 24 other (unnamed) generators"; `research/06-feature-deletion-datagen.md:330`: "The other ~24 generators"; propagated into `research/09` line 23 ("captured 'Feature Deletion + 24 unnamed generators'").
|
| 21 |
+
- *Problem:* "24" appears to be a back-formation from "25x" — but 25x is a **task-count multiplier vs Composer 2**, not a generator count. Two unrelated quantities have been conflated.
|
| 22 |
+
- *Correction:* "Feature deletion is the only named generator. The blog says 'a range of approaches'; the number, names, and weighting of other generators are unknown. The '25x' figure counts synthetic *tasks* relative to Composer 2 (whose baseline count is also unstated), not generators."
|
| 23 |
+
|
| 24 |
+
**3. [P0] SDPO declared "mathematically the same" as Composer 2.5's mechanism — the equivalence is asserted nowhere by Cursor, and SDPO's published mechanics materially differ.**
|
| 25 |
+
- *Source facts:* (a) The blog cites the three self-distillation papers only as "For more **background** on this approach see…" (footnote 1, verbatim). (b) SDPO (arXiv:2601.20802, Eq. 1) applies the KL **over the full rollout response** with feedback in the **conditioning prefix**; it has **no error-turn detection, no hint intercalated into the response, and requires a regularized (EMA/trust-region) teacher** (deep-read 03 §1.1, §3.1–3.2). The blog's mechanism is turn-localized ("For that turn only, we then update the student weights…") — closer to the repo's implementation than to SDPO.
|
| 26 |
+
- *Repo claims:* `docs/COMPOSER_RECIPE_MAPPING.md:25`: "This is **mathematically the same** as Composer's targeted-textual-feedback method"; same doc §Citations: SDPO is "The direct formalization of Composer's hint-distill"; `composer_replication/opsd.py:13-14`: "arXiv:2601.20802 (formalizes the same loss as Composer 2.5's 'Targeted RL with Textual Feedback')".
|
| 27 |
+
- *Problem:* This is the load-bearing identification the whole Channel-2 architecture rests on ("Lift the OPSD/SDPO loss directly… exact same mechanism Cursor uses" — mapping doc §Implementation handles). Cursor never says SDPO is its mechanism; and the repo's actual implementation (error-turn masking + hint spliced into the teacher's response sequence + ADR-011 alignment indices) is **neither** SDPO **nor** confirmed Composer — it is a blog-inspired original design (deep-read 03 verdicts: "FALSE — SDPO applies loss to full rollout"; "FALSE — feedback is in prefix, not intercalated").
|
| 28 |
+
- *Correction:* "Cursor cites SDPO/OPSD as *background*. SDPO is the closest published formalization of a hint-conditioned same-weights teacher with an on-policy distillation KL, but it differs from the blog's described mechanism in localization (full-rollout vs per-turn) and from our implementation in teacher-context construction (feedback-in-prefix vs hint-at-error-turn). Our SDPO channel is a blog-faithful, SDPO-*inspired* design — not a port of SDPO and not a confirmed match to Cursor's internal method."
|
| 29 |
+
|
| 30 |
+
**4. [P0] Streaming DiLoCo misattributed: wrong authors AND wrong title attached to arXiv:2501.18512.**
|
| 31 |
+
- *Source fact:* arXiv:2501.18512 is "Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch," **Douillard et al. 2025**. "Eager Updates for Overlapped Communication and Computation in DiLoCo" is a *different* paper (arXiv:2502.12996, Kale/Douillard/Donchev). "Liu et al." is the Async Local-SGD paper (arXiv:2401.09135) (deep-read 05, F4).
|
| 32 |
+
- *Repo claims:* `composer_replication/diloco/__init__.py:8`: "Streaming DiLoCo (Liu et al. 2025)"; `research/design-F4-decoupled-diloco-s3.md:109`: "Liu et al. 2025, 'Eager Updates for Overlapped Communication in DiLoCo', arXiv:2501.18512".
|
| 33 |
+
- *Correction:* Fix both files to "Streaming DiLoCo (Douillard et al. 2025, arXiv:2501.18512)"; cite Eager Updates separately as Kale et al. 2025, arXiv:2502.12996.
|
| 34 |
+
|
| 35 |
+
**5. [P0] CWM cited as licensing "train-on-all for the world-model aux head during RL" — CWM does this in a separate mid-training stage, not as an aux head riding policy gradients.**
|
| 36 |
+
- *Source fact (CWM, arXiv:2510.02387 §2.2, verbatim):* "Because our goal with the ForagerAgent data is to learn a comprehensive world model … **we do not filter trajectories based on whether they succeed**." That is a **mid-training data decision** for a dedicated pre-RL stage; the world-modeling capability is in the base weights before RL begins (deep-read 06, Finding 2.1, 4.4).
|
| 37 |
+
- *Repo claim:* `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §2: CWM "crucially training *on all* trajectories **for the world-model head**, reserving success-filtering only for the RL reward"; §4 re-uses it as the justification that the aux head "makes train-on-all *safe* for the policy."
|
| 38 |
+
- *Problem:* This is the single load-bearing citation for the two-harvest design (failed branches → world-model head). The cited architecture is structurally different; the simultaneous-gradient configuration the repo proposes is exactly the one the also-cited interference paper (2602.00994) warns about.
|
| 39 |
+
- *Correction:* "CWM mid-trains on unfiltered trajectories in a dedicated pre-RL stage; it provides an existence proof for train-on-all *dynamics learning*, not for an auxiliary next-state head trained simultaneously with RL. No published work tests our exact configuration."
|
| 40 |
+
|
| 41 |
+
---
|
| 42 |
+
|
| 43 |
+
## P1 — claims needing caveats / qualifiers
|
| 44 |
+
|
| 45 |
+
**6. [P1] CWM's "65.8% on SWE-bench Verified" cited without the test-time-scaling qualifier.**
|
| 46 |
+
- *Source:* CWM abstract: "pass@1 scores of 65.8% on SWE-bench Verified **(with test-time scaling)**"; the single-attempt base score is lower (deep-read 06, Finding 2.1).
|
| 47 |
+
- *Correction:* Always cite as "65.8% with test-time scaling" or give the base score alongside.
|
| 48 |
+
|
| 49 |
+
**7. [P1] Chain-of-World (arXiv:2603.03195) used as evidence for SWE world-modeling — it is a robotics VLA paper.**
|
| 50 |
+
- *Source:* CoWVLA: latent-motion world model for robot manipulation, CVPR 2026, cs.CV (deep-read 06, Finding 2.2). No SWE/code content.
|
| 51 |
+
- *Repo claim:* final report §2 deploys it for "predict the consequential terminal state" in the SWE design.
|
| 52 |
+
- *Correction:* Keep only as explicitly-flagged robotics analogy or remove.
|
| 53 |
+
|
| 54 |
+
**8. [P1] "85% of total compute is post-training" still stated as fact in the research/01 body.**
|
| 55 |
+
- *Source:* Neither blog nor tech report contains any compute-budget ratio (deep-read 01, FINDING-6).
|
| 56 |
+
- *Repo claim:* `research/01-composer-2.5.md:14`: "roughly 85% of the total compute budget for Composer 2.5 was spent on Cursor's proprietary post-training…". The header audit note flags it, but the body asserts it unhedged.
|
| 57 |
+
- *Correction:* Inline-tag the sentence: "[community speculation — not in any Cursor source]".
|
| 58 |
+
|
| 59 |
+
**9. [P1] The repo's implemented "feature deletion" is gold-patch reversion over pre-labeled SWE datasets — not the blog's mechanism — yet downstream docs say it "matches the recipe."**
|
| 60 |
+
- *Source fact (blog, verbatim):* "the agent is given a codebase with a large set of tests, and **asked to delete code and files** in such a way that the codebase remains functional while specific testable features are removed." The published mechanism is (likely agentic) *deletion from a functional, test-covered codebase*. The repo's `SweBenchAdapter` instead reverts human PR gold patches of pre-packaged instances — the analogue of SWE-smith's **PR Mirror** strategy, a *third* construction the blog never describes (deep-reads 00 Claim 1, 02 §6.1).
|
| 61 |
+
- *Repo claims:* ADR-010 is honest about Option A vs B, but its Consequences section says the subsystem delivers the recipe; `docs/COMPOSER_RECIPE_MAPPING.md:42` paraphrases the blog as "take a repo with passing tests, delete some code" (passive — drops the agentic deleter and the "remains functional" constraint); the project framing ("point-at-a-repo feature-deletion task synthesis") attributes to the blog a phrase ("point at a repo") that appears nowhere in it.
|
| 62 |
+
- *Correction:* "We implement the *inversion analogue* of feature deletion (revert gold patch of an existing SWE instance — SWE-smith's PR-Mirror shape, which SWE-smith's ablations show yields the best training data). The blog's actual generator — deleting code/files from a live, functional codebase so that targeted tests fail — is unbuilt (Option B / Path B). 'Point at a repo' is our reconstruction, not blog language."
|
| 63 |
+
|
| 64 |
+
**10. [P1] ADR-010: "Online difficulty gate matches the actual recipe" — only half the recipe, and with a different signal.**
|
| 65 |
+
- *Source facts:* Blog: "we both **select for and create** harder tasks dynamically throughout the run" — two operations. The only Cursor-stated difficulty signal is Composer 2's "number of turns and thinking tokens of rollouts—to upsample increasingly harder data points" (tech report §4) — and that is a Composer-2 mechanism mapped onto 2.5 (deep-read 01, FINDING-12).
|
| 66 |
+
- *Repo state:* `DifficultyCurriculum` implements only the SELECT half, primarily on a pass-rate frontier (extrapolated, not Cursor-stated); the CREATE half (minting harder tasks mid-run) is 0% built; `granularity` is hardcoded `"feature"` (grounding doc, Claim 3). Wave 20's effort tilt on turns/think-tokens partially aligns with the stated heuristic.
|
| 67 |
+
- *Correction:* "The select-for half is built (pass-rate frontier [EXTRAPOLATED] + turns/think-token effort tilt [Composer-2-sourced]); the create-harder half of the published recipe is not implemented."
|
| 68 |
+
|
| 69 |
+
**11. [P1] ADR-010 decision driver: substrates "already guarantee test-exercises-the-code via FAIL_TO_PASS."**
|
| 70 |
+
- *Source fact:* SWE-smith (§2.1) shows the F2P property is the *output of running validation* (overall yield ~50.1%; candidates lacking test coverage are filtered out by execution), not a schema-inherited guarantee; no surveyed dataset verifies *reachability* of the deleted code from the failing tests (deep-read 02, §6.2 item 6; ADR-010's own OPEN item concedes Gate 2 doesn't verify reachability).
|
| 71 |
+
- *Correction:* "F2P labels are trustworthy because the upstream pipelines *executed* the tests; any new inversion we mint must re-run gates 1–3 to re-establish the property, and reachability remains unverified across the field."
|
| 72 |
+
|
| 73 |
+
**12. [P1] Mapping doc states the KL direction as `KL(teacher || student)` — unsupported by the blog and opposite to the paper the doc equates it to.**
|
| 74 |
+
- *Source facts:* Blog: "an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's" — direction unspecified. SDPO Eq. 1: `KL(π_θ ‖ stopgrad(q_θ))` — **student first** (deep-read 03 §1.1).
|
| 75 |
+
- *Repo claim:* `docs/COMPOSER_RECIPE_MAPPING.md:20`: "Loss = on-policy KL divergence: `KL( teacher_logits_at_turn_t || student_logits_at_turn_t )`".
|
| 76 |
+
- *Correction:* "Direction unspecified in the blog; SDPO uses KL(student‖teacher) (adopting JSD for stability); our implementation uses the OPSD generalized JSD."
|
| 77 |
+
|
| 78 |
+
**13. [P1] `opsd.py` β-convention docstring is inverted vs the upstream it claims byte-parity with.**
|
| 79 |
+
- *Source:* OPSD README: "Beta=0 means forward KL and 1 means reverse KL." Repo docstring labels β=0 "reverse KL" / β=1 "forward KL." Code is numerically correct; the labels would misconfigure anyone choosing β by docstring (deep-read 03 §2.2).
|
| 80 |
+
- *Correction:* Swap the labels to match upstream.
|
| 81 |
+
|
| 82 |
+
**14. [P1] SDPO-fidelity claim without SDPO's teacher regularization.**
|
| 83 |
+
- *Source:* SDPO Table 4: non-regularized live-weights teacher **diverges** (36.1% vs 50.6% with trust-region teacher); EMA/trust-region regularization is a core stability component (deep-read 03 §1.1, rec. 4). Also §4.5: SDPO underperforms GRPO on weak models; λ=0.9 GRPO + 0.1 SDPO hybrid recommended.
|
| 84 |
+
- *Repo state:* teacher = live weights each step, no EMA/trust-region; no usage guidance on the weak-model failure mode.
|
| 85 |
+
- *Correction:* Document both gaps wherever the SDPO channel is described as paper-faithful; implement or explicitly defer teacher regularization.
|
| 86 |
+
|
| 87 |
+
**15. [P1] `SiblingBootstrapGenerator` framed as following from SDPO's "successful rollouts as implicit feedback" — it is an extrapolation.**
|
| 88 |
+
- *Source:* SDPO's sibling mechanism puts the successful sibling rollout **into the teacher's conditioning prefix** and applies the KL over the *entire* original response. It does not generate a "Reminder: a working approach looks like…" hint string, insert text into the response, or detect error turns (deep-read 03 §6).
|
| 89 |
+
- *Repo claim:* `research/07` + `hint_generator.py` sketch present the sibling-hint design as the natural SDPO-supported fallback.
|
| 90 |
+
- *Correction:* "SDPO supports sibling-as-conditioning-context; our hint-string splice at error turns is our own Composer-blog-inspired variant. The paper-faithful alternative (sibling in prefix, full-response KL) is simpler and should be the ablation baseline."
|
| 91 |
+
|
| 92 |
+
**16. [P1] "$0.98 mean per-trace cost ungated — verified economic floor" hides the trace definition; real sessions cost ~2 orders of magnitude more.**
|
| 93 |
+
- *Source:* Spike 001 measured 50 hand-crafted synthetic states at N=3 teachers. Real Claude Code sessions have 125–2,830 tool-use messages (ADR-002); at DEFAULT_TEACHERS pricing a ~1,400-step session is ~$70–80 flat (deep-read 07, FR-R5).
|
| 94 |
+
- *Repo claims:* `teacher_replay.py` docstring ("Verified economic floor … $0.98 mean per-trace cost ungated"); `docs/VISION_VALIDATION.md` ("$0.98/trace verified … economic floor is established").
|
| 95 |
+
- *Correction:* "Verified floor for 50-step synthetic-state benchmark traces; full-session flat replay is ~$70–80 ungated; $0.30/trace VOI-gated figure is a projection, not a measurement."
|
| 96 |
+
|
| 97 |
+
**17. [P1] "$64 ungated tree" is mislabeled — it is a flat N=8, T=1000 extrapolation, not a tree and not a codebase config.**
|
| 98 |
+
- *Source:* research/05 constructs $0.008 × 1000 × 8 = $64 flat; DEFAULT_TEACHERS is N=3; a true branching tree is O(N^D), strictly worse (deep-read 07, 05-R6/FR-R6).
|
| 99 |
+
- *Correction:* Label it "flat 8-teacher × 1000-step extrapolation (unmeasured)" everywhere; never "tree."
|
| 100 |
+
|
| 101 |
+
**18. [P1] `kl_in_reward.py`: "verl adopted k1-in-reward as its *only* reverse-KL option" — overclaim.**
|
| 102 |
+
- *Source:* verl supports `kl_penalty="kl"` (k1) **and** `kl_penalty="low_var_kl"` (k3-family) (deep-read 04 §4.3).
|
| 103 |
+
- *Correction:* "verl defaults to / recommends k1-in-reward" — drop "only."
|
| 104 |
+
|
| 105 |
+
**19. [P1] Comedy of Estimators (arXiv:2512.21852) used as "k1-in-reward improves OOD; k3-in-reward can collapse" — full text never read, and the 7B/8B-math→1T-MoE-agentic extrapolation is unflagged.**
|
| 106 |
+
- *Source:* Only the abstract was obtainable (HTML 404); it supports "biased gradients → instability; unbiased → better OOD" but does not state the specific estimator-placement ranking the repo asserts (deep-read 04 §2, §4.4).
|
| 107 |
+
- *Correction:* Soften to "consistent with the abstract's biased-vs-unbiased finding; specific k1/k3-placement ranking unverified (full text unavailable); empirical scope is 7B/8B reasoning models."
|
| 108 |
+
|
| 109 |
+
**20. [P1] GSPO preset claims to implement GSPO but inherits GRPO-scale clipping — two orders of magnitude off the paper's settings.**
|
| 110 |
+
- *Source:* GSPO paper §5.1: clipping ranges 3e-4/4e-4, "a difference of two orders of magnitude in the fractions of clipped tokens between GSPO and GRPO." The repo preset sets no epsilon → TRL defaults (~0.2) → effectively unclipped sequence-level REINFORCE (deep-read 04, ISSUE 1).
|
| 111 |
+
- *Correction:* Add `epsilon=3e-4, epsilon_high=4e-4` to the preset or annotate it "architecture-only; not operationally GSPO without GSPO-scale epsilons." (Companion: CISPO preset should set `beta=0.0` explicitly per MiniMax-M1, and document its deliberate `scale_rewards="none"` deviation from the paper's std-norm.)
|
| 112 |
+
|
| 113 |
+
**21. [P1] research/05's novelty framing: rStar named "closest precedent" (misread) while the actually-closest works are uncited.**
|
| 114 |
+
- *Source facts:* rStar's discriminator verifies **full trajectories**, not per-step states ("acts as a discriminator to verify each trajectory" — abstract). Tree-GRPO (arXiv:2509.21240) proves intra-tree group-relative advantage ≡ step-level DPO — the formal version of `extract_dpo_pairs`; SWE-Search (arXiv:2410.20285) is MCTS on SWE-bench itself. Neither is cited in research/05; `framework/composer-replication-framework.md:17` still says "Closest precedent: rStar-Math … Multi-teacher *frozen-trace replay* is open territory" (deep-read 07, 05-R1/R2/R3).
|
| 115 |
+
- *Correction:* The novelty claim (frozen-trace × N heterogeneous teachers × disagreement-DPO) survives, but the provenance section must cite Tree-GRPO and SWE-Search as the nearest neighbors and correct the rStar granularity description.
|
| 116 |
+
|
| 117 |
+
**22. [P1] "Roughly nine-tenths of it is reuse" (final report §6) conflates design-reuse with build status.**
|
| 118 |
+
- *Source:* Exhaustive grep confirms tree controller, SiblingBootstrapGenerator, world-model head, `<deliberate>` token, pipeline/, infra/, broken-image builder are **0% built** (grounding doc §3, Claim 5).
|
| 119 |
+
- *Correction:* "Nine-tenths of the *recipe-replication* layer reuses existing code; the framework's own additions (tree, world-model head, AWS datagen pipeline) are entirely design-stage."
|
| 120 |
+
|
| 121 |
+
**23. [P1] World-model aux-head design presented with citation support that is wholly analogical; "parameter isolation eliminates the interference risk" overclaims DART.**
|
| 122 |
+
- *Source facts:* No paper in the cluster tests a next-state aux loss on a policy LLM during code RL — the evidentiary gap for the exact proposed configuration is total (deep-read 06, Missing 1). DART (2602.00994) shows separate LoRA modules *reduce* interference but do not reach the 2-Agent upper bound (Finding 3.1); its domain is RAG-QA/NL2SQL, not SWE.
|
| 123 |
+
- *Correction:* Add an explicit null-evidence flag to §2 of the final report and to any ADR that inherits it; change "eliminates" → "substantially reduces"; the P4 ablation is a research hypothesis test, not a validation of established results.
|
| 124 |
+
|
| 125 |
+
**24. [P1] VeRL "first-class agentic RL support" — the async path is experimental.**
|
| 126 |
+
- *Source:* verl README: `fully_async_policy`, `transfer_queue` live under `verl/experimental`; `uni-agent` (May 2026) is a separate layer (deep-read 08 §7.1).
|
| 127 |
+
- *Correction:* "VeRL's agentic/async path exists but is experimental; evaluate `uni-agent` before committing to a custom AgentLoop."
|
| 128 |
+
|
| 129 |
+
---
|
| 130 |
+
|
| 131 |
+
## P2 — wording / precision
|
| 132 |
+
|
| 133 |
+
**25. [P2] research/01 §5: "During **post-training**, Cursor employs Sharded Muon and Dual Mesh HSDP."** Blog (verbatim): "For **continued pretraining**, we use Muon…". Fix the stage attribution (the mapping doc already has it right). Also do not conflate with Composer 2's FSDP+CP/AdamW system (deep-read 01, FINDING-7).
|
| 134 |
+
|
| 135 |
+
**26. [P2] research/06's "two-agent / two-phase structure the blog implies."** The blog's grammar has one "the agent … asked to delete"; whether the deleter is a model, program, or pipeline is unstated. research/06 already lists this as an open question (line 329) — align the §1 framing with it: "deleter unknown; blog grammar suggests an agent" (deep-read 01, FINDING-3).
|
| 136 |
+
|
| 137 |
+
**27. [P2] research/06 "~50k–60k tasks" as the "25×-spirit" pool.** No Composer-2 baseline count exists, so 25x is not convertible to an absolute. The "spirit" hedge is present; add "[EXTRAPOLATED — no primary-source baseline]" at the number (deep-read 01, FINDING-2).
|
| 138 |
+
|
| 139 |
+
**28. [P2] Grounding doc Claim 1 presents a paraphrase as a blog quote.** `research/deepread/00-grounding.md` Claim 1: "What the blog says (COMPOSER_RECIPE_MAPPING.md): 'take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests.'" — that sentence is the mapping doc's paraphrase, not blog text. Use the real blog sentence when quoting "what the blog says."
|
| 140 |
+
|
| 141 |
+
**29. [P2] research/02 DiLoCo compression details.** "FP16 outer state" → Streaming DiLoCo quantizes **outer gradients to FP4 (E3M0)**, FP32 accumulation; bandwidth claim "~100×" → measured ≈400× total-bits reduction (Table 1) (deep-read 05, F6/F7).
|
| 142 |
+
|
| 143 |
+
**30. [P2] DiLoCo H default source attribution.** `diloco/__init__.py` docstring credits "DiLoCo paper §3.2" for defaults, but §3.2's main-experiment H is 500; the repo's `sync_every=100` matches the Streaming/OpenDiLoCo range. Also note Streaming's outer_lr=0.4 vs the vanilla 0.7 default (deep-read 05, F2/F3).
|
| 144 |
+
|
| 145 |
+
**31. [P2] "Foresight@k" cited as if sourced.** The metric is coined by the final report; citations [11][2] do not define it. Mark "(we define this metric)" at first use (deep-read 06, Finding 3.2).
|
| 146 |
+
|
| 147 |
+
**32. [P2] SWE-rebench "21,336 tasks" count never verified against the SWE-rebench paper** (only the Nebius infra blog was fetched); and the 59k-row figure is ambiguous between SWE-smith-on-HF and Nemotron-SWE-v1 (deep-read 02, §6.2 items 3 and unverified item 3). Tag both counts "[UNVERIFIED-COUNT]" until arXiv:2505.20411 is read.
|
| 148 |
+
|
| 149 |
+
**33. [P2] `_normalize_action` whitespace-only normalization is acknowledged in code but absent from every risk list.** On real tool-call traces, Channel-3 pair extraction will be mostly noise; the final report's §10 failure modes and ADR-002 consequences should both name it (deep-read 07, FR-R8). (Fidelity angle: the "DPO-pair extractor, 7 unit tests" line in VISION_VALIDATION implies more readiness than the known-stub normalizer supports.)
|
| 150 |
+
|
| 151 |
+
---
|
| 152 |
+
|
| 153 |
+
## Confirmed faithful (for balance — no action)
|
| 154 |
+
|
| 155 |
+
- The verbatim 25x sentence, feature-deletion paragraph, reward-hacking anecdotes (Python type-check cache, Java bytecode), and "agentic monitoring tools" are quoted accurately in `research/06`, `research/09`, ADR-010, and COMPOSER_RECIPE_MAPPING — re-verified against the fetched blog body this pass.
|
| 156 |
+
- Dr.GRPO claims in `research/10` (length-norm removal, no std-norm, k1 estimator, DAPO overlong masking tried-and-rejected, Adam, single-epoch) — all verified verbatim against the Composer 2 report (deep-read 04 §4.1–4.2).
|
| 157 |
+
- The k1-in-reward fold-then-baseline algebra and its Dr.GRPO-regime precondition are mathematically sound (deep-read 04 §3.2); one latent `num_iterations>1` guard is missing (engineering, not fidelity).
|
| 158 |
+
- Channel 3's depth-1/flat and teacher-plurality-not-execution self-descriptions in the final report are accurate (deep-read 07, FR-R1/R2).
|
| 159 |
+
- Channel-3 and tree-of-work provenance is honestly labeled "NOVEL — our addition, not part of Cursor's recipe" in `framework/composer-replication-framework.md`, VISION_VALIDATION, and the spike layer — the provenance boundary between "Composer's recipe" and "our additions" is consistently drawn. The residual issue is nearest-neighbor citation completeness (finding 21), not provenance dishonesty.
|
| 160 |
+
- ADR-010's OPEN items (Gate-2 reachability, deleted_symbols emptiness) are honest self-flags that match what the sources show is an unsolved field-wide gap.
|
| 161 |
+
- TRL claims (default `loss_type="dapo"`, `scale_rewards` handling, drift assertions in `make_dr_grpo_config`) verified against live docs (deep-read 08 §1).
|
| 162 |
+
|
| 163 |
+
---
|
| 164 |
+
|
| 165 |
+
## Severity totals
|
| 166 |
+
|
| 167 |
+
- **P0: 5** (findings 1–5)
|
| 168 |
+
- **P1: 19** (findings 6–24)
|
| 169 |
+
- **P2: 9** (findings 25–33)
|
|
@@ -0,0 +1,180 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Design Critic: Architecture of the Envisioned Dataset-Generation Pipeline
|
| 2 |
+
|
| 3 |
+
**Reviewer:** DESIGN CRITIC (critical-review pipeline, subagent)
|
| 4 |
+
**Date:** 2026-06-10
|
| 5 |
+
**Inputs:** `research/deepread/00-grounding.md` + `01`–`08`, `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, `docs/adrs/ADR-010`, and direct reads of `composer_replication/datagen/{env,substrates,validator,sandbox}.py`, `teacher_replay.py`, `safety/holdout.py`, `hint_generator.py`.
|
| 6 |
+
**Scope:** Attack the ARCHITECTURE of the envisioned dataset pipeline — buy-vs-build, the S3 contract, tree-controller implementability, package placement, the minimal end-to-end pipeline, and components no design mentions.
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## P0 — Foundational architecture defects (the design as drawn cannot produce the promised artifact)
|
| 11 |
+
|
| 12 |
+
### 1. The tree-of-work's seed traces and its execution oracle operate on DISJOINT data — a chicken-and-egg no design resolves. **[P0]**
|
| 13 |
+
|
| 14 |
+
F1's diagram and the final report (§5: *"ingest a seed trace, expand the divergence-gated tree across N models, execute every branch in a sandbox, grade leaves by `_grade()`"*) compose two components that cannot currently meet:
|
| 15 |
+
|
| 16 |
+
- Seed traces come from `ClaudeCodeIngester` over `~/.claude/projects/**.jsonl` — sessions against the user's **local working directories** at arbitrary commits, with no Docker image, no pinned dependencies, no `FAIL_TO_PASS` labels, no `FeatureDeletionTask`.
|
| 17 |
+
- The execution oracle (`FeatureDeletionEnv.step()`/`_grade()`) requires `task.broken_image` (a frozen container) and pre-identified test node IDs (`__post_init__` raises on empty `fail_to_pass`).
|
| 18 |
+
|
| 19 |
+
You cannot `env.step()` a Claude Code trace action: there is no environment that materializes the repo state the trace was recorded against. The grounding map's Breaks 1–6 document this for "arbitrary OSS repo" but nobody applies it to the trace-replay-tree, where it is fatal: **every branch of the tree needs an executable sandbox, and Claude Code traces have none.** The only traces that are tree-expandable are traces collected *inside* FeatureDeletionEnv episodes — which do not exist because nothing in the repo runs an agent inside the env (see Finding 2).
|
| 20 |
+
|
| 21 |
+
**Recommendation:** Invert the bootstrap order. Phase 1 of the tree must seed from **env-grounded traces** (agent rollouts on FeatureDeletionTask episodes, where reset state is reproducible by `task_id`), not Claude Code sessions. Demote Claude Code traces to (a) flat Channel-3 replay (DPO text pairs, no execution) and (b) SFT style data — uses that need no oracle. Document this split in a new ADR; the current F1 diagram is misleading and will misdirect the build.
|
| 22 |
+
|
| 23 |
+
### 2. No agent rollout harness exists — the SFT corpus has no producer. **[P0]**
|
| 24 |
+
|
| 25 |
+
The "SFT-first competence floor" (F1 §2, final report §5) reads `sft_corpus/` = "clean winning trajectories." Trace the producers: `teacher_replay` emits **single next-actions** per frozen state, not episodes. `env.reward_fn`'s fallback treats the whole completion as one `submit` (08 §5.1 calls this "a dead end for genuine multi-turn"). The tree controller is 0% built. **Nothing in the repo, built or designed, drives a multi-turn agent loop (LLM → tool call → `sandbox.exec` → observation → LLM) to completion and serializes the trajectory.** SWE-smith collected its 5k SFT trajectories with SWE-agent + Claude; SWE-Gym with OpenHands. Every design doc skips this component; F2's four stages (ingest/replay/validate/normalize) produce tasks and DPO pairs but *no SFT trajectories at all* — yet F2's stage (d) claims to write `corpus/sft/`.
|
| 26 |
+
|
| 27 |
+
**Recommendation:** Add an explicit `rollout harness` component to the pipeline: adopt **mini-swe-agent or SWE-agent** (battle-tested, supports any API model) as the expert-trajectory collector against `FeatureDeletionEnv` tasks, with `_grade()==1.0` + `HackMonitor`-clean as the SFT admission filter. ~200–400 LOC of adapter, not a new agent. This is the single highest-priority build item — without it the minimal pipeline (Finding 16) cannot terminate in an SFT corpus.
|
| 28 |
+
|
| 29 |
+
### 3. Divergence-gating is not computable from what the current components emit. **[P0]**
|
| 30 |
+
|
| 31 |
+
The tree's economics depend entirely on the divergence gate (final report §3: it turns O(N^D) into "O(N · decision-points)"). The gate needs "pre-expansion divergence between sibling next-action distributions." But:
|
| 32 |
+
|
| 33 |
+
- Teacher APIs (OpenRouter, Bedrock batch) return **text**, not distributions. Bedrock batch (`CreateModelInvocationJob`) does not return logprobs usable for a cross-model divergence measure (and cross-model token distributions live in different vocabularies anyway — KL between them is undefined without a common action space).
|
| 34 |
+
- The only equality/divergence measure in the codebase is `_normalize_action()` — whitespace-collapse + lowercase. Deepread 07 (FR-R8, HIGH) already established this produces "mostly noise on real traces": semantically identical tool calls in different JSON formatting count as disagreement, so the gate would fire on nearly every node, **silently degrading the tree to the O(N^D) ungated cost the gate exists to prevent.** The cost-control mechanism and the known-broken normalizer are the same component.
|
| 35 |
+
|
| 36 |
+
**Recommendation:** Define divergence over a **canonical action algebra**, not text: parse every candidate into `(tool_name, normalized_args)` (AST-normalize code args, path-normalize file args), and gate on (i) tool-name disagreement, (ii) arg-level edit distance over the canonical form, with an escalation tier that asks a cheap judge model only when (i)/(ii) is ambiguous. Build and *unit-test the gate's firing rate on 5 real traces* (expected: fires on <20% of steps) **before** writing `tree_controller.py`. The tool-call parser is the prerequisite, not a polish item.
|
| 37 |
+
|
| 38 |
+
### 4. The tree requires a sandbox fork/snapshot primitive that neither the Sandbox API nor any design has. **[P0]**
|
| 39 |
+
|
| 40 |
+
`tree_controller.py` (design: "apply each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branch again from the new state"). Branching N ways from a mid-episode state requires N **independent copies of the mutated working tree**. The `Sandbox` protocol has `boot/exec/run_tests/trajectory` — no `fork()`, no `snapshot()/restore()`. `DockerSandbox` boots one container per episode with `read_only=True` rootfs. The two realizable options are architecturally very different and nobody has chosen:
|
| 41 |
+
|
| 42 |
+
1. **Replay-from-root:** re-boot the image and re-execute the action prefix for every branch — cost O(depth) sandbox-execs per node, multiplying the already-dominant sandbox cost (final report §10 names per-branch sandbox isolation "the throughput ceiling of the whole idea").
|
| 43 |
+
2. **Filesystem snapshot fork:** overlayfs/`docker commit`/CRIU per node — fast but pins the design to one isolation backend and conflicts with gVisor/Kata choices in F2 stage (c).
|
| 44 |
+
|
| 45 |
+
**Recommendation:** Extend the `Sandbox` protocol with `fork() -> Sandbox` *now*, implement it as overlayfs-upper-dir copy for `LocalSubprocessSandbox` and `docker commit`+boot for `DockerSandbox`, and measure fork latency in a spike before committing to tree depth >1. If fork costs >5s, the honest fallback is depth-1 trees (N candidate actions, each executed one step + graded by a cheap proxy) — which is most of the DPO value at a fraction of the machinery.
|
| 46 |
+
|
| 47 |
+
### 5. Zero benchmark decontamination anywhere in the pipeline. **[P0]**
|
| 48 |
+
|
| 49 |
+
The pipeline trains on SWE-bench-family substrates and the program will inevitably report SWE-bench Verified numbers (every comparison point in the research notes — DeepSWE 42.2%, SWE-smith 40.2%, Socratic-SWE 50.4% — is SWE-bench Verified). The substrates have **different and partial** decontamination policies: R2E-Gym decontaminates only its *Subset* against SWE-bench test repos (deepread 02 §6.2.2: "The full R2E-Gym (8,135 tasks) may overlap"); SWE-rebench spans 3,468 repos with no stated guarantee; the deepread 02 review explicitly flags research/06's "no contamination worry" as an OVERCLAIM. `HeldoutSplit` guards only the *internal* train/holdout partition — it has no concept of an external benchmark. No design doc (F1, F2, ADR-010) contains the word "decontamination."
|
| 50 |
+
|
| 51 |
+
**Recommendation:** Add a mandatory `decontaminate(tasks, benchmark)` gate at stage (c1), enforced at three levels: (i) repo-level — drop any task whose `repo` appears in SWE-bench Lite/Verified/full or the chosen eval suite; (ii) instance-level — drop tasks whose `golden_diff` or `fail_to_pass` node IDs match an eval instance; (iii) record the decontamination manifest (benchmark name + version hash + drop count) in `manifests/run_id.json`. ~50 LOC + a pinned eval-instance index. This must exist before the *first* corpus is generated, because retro-filtering a published corpus is a credibility event.
|
| 52 |
+
|
| 53 |
+
### 6. Buy-vs-build: the designs build the one component that is buyable, and skip the integration deepread 02 already identified. **[P0]**
|
| 54 |
+
|
| 55 |
+
ADR-010 chose "Option A: invert OSS substrates" and rejected "Option B: greenfield repo scraping" — correct in 2026-05. But the user's new ask ("point at a repo") **is Option B**, and the design response (grounding map: "Broken-repo image builder — clone, `git apply -R`, scrub, build, push — unspecified, 0% built") is to hand-build exactly what `pip install swesmith` (MIT) ships: environment construction from arbitrary GitHub repos (~7 min human/repo, one image per repo = 500× storage win over per-task images), four bug-synthesis strategies, validation, issue-text generation, and `rp.get_container(task)` returning a booted container. Deepread 02 flagged this at HIGH ("the ADR does not evaluate it as a dependency") and noted SWE-smith's **PR Mirror** strategy — the exact gold-patch-reversion mechanic of `SweBenchAdapter` — produces the *best* training data in SWE-smith's own ablation (Table 5). The repo's core approach is validated by prior art it doesn't cite, and its missing half is shipped by a library it doesn't depend on.
|
| 56 |
+
|
| 57 |
+
**Recommendation:** Commit a revised ADR: **[BUY]** swesmith for env-construction + bug synthesis on new repos; **[BUY]** SWE-smith 59k / R2E-Gym-Subset 4.6k / SWE-Gym 2.4k datasets through the existing `SweBenchAdapter` (02 confirms it works on SWE-smith instances unchanged); **[BUILD]** only the genuinely novel pieces — `DifficultyCurriculum`, `HackMonitor`/scrub, decontamination, the rollout harness (Finding 2), and (later, ablation-gated) coverage-guided deletion targeting. The "point-at-a-repo" feature becomes a ~100 LOC `SweSmithProfileAdapter` instead of an unspecified image-builder subsystem. Budget reality check from 02: SWE-smith built 50k tasks for **$1,360 + 20 human-hours**; any in-house builder must beat that.
|
| 58 |
+
|
| 59 |
+
---
|
| 60 |
+
|
| 61 |
+
## P1 — Significant design errors (buildable, but will mislead or bite)
|
| 62 |
+
|
| 63 |
+
### 7. There are TWO unreconciled S3 contracts; the "6-prefix contract" is neither complete nor singular. **[P1]**
|
| 64 |
+
|
| 65 |
+
F1 commits six prefixes (`sft_corpus/ dpo_pairs/ rl_task_pool/ divergence_pairs/ wm_tuples/ holdout/` under `runs/<run_id>/`). F2 commits a different layout (`traces/v1/run_id=<id>/`, `tasks/v1/...`, `replay/v1/...`, `task_grades/v1/...`, `corpus/v1/run_id=<id>/{sft,dpo}/`, `manifests/`). The grounding map (§2 step 8) pastes **both** into one list — so `corpus/v1/.../dpo/` and `dpo_pairs/` both exist for the same artifact, with different partitioning conventions (Hive `run_id=` vs path `runs/<id>/`), different versioning (F2 prefixes carry `v1`, F1 prefixes carry none), and two different buckets named across the docs (`amazon-sagemaker-...` in F1, `composer-datagen-...` in F2). Additionally: `divergence_pairs/` and `dpo_pairs/` describe one lineage (divergence-annotated nodes → extracted pairs) split across two prefixes, inviting drift; there is no prefix for quarantined/retired tasks (the curriculum produces them) nor for eval results.
|
| 66 |
+
|
| 67 |
+
**Recommendation:** Write `s3_contract.py` FIRST and make both design docs subordinate to it. One layout: `s3://<bucket>/<contract_version>/run_id=<id>/<artifact>/` where artifact ∈ {traces, tasks, tasks_golden, replay, grades, corpus_sft, corpus_dpo, wm_tuples, holdout, quarantine, manifest.json} — every artifact versioned by the single top-level `contract_version`, every prefix owned by exactly one writer stage, `divergence_pairs` folded into `corpus_dpo` as a `provenance` column rather than a sibling prefix. Until this file exists, no AWS stage should be built (they would each encode one of the two divergent layouts).
|
| 68 |
+
|
| 69 |
+
### 8. `golden_diff` leaks into the policy-visible manifest: `repr=False` is not serialization-exclusion. **[P1]**
|
| 70 |
+
|
| 71 |
+
F2 (open questions) correctly demands `golden_diff` live in a deny-by-default `tasks/golden/` prefix. But stage (c1) writes "FeatureDeletionTask rows" to `tasks/v1/run_id=<id>/manifest.jsonl`, and any naive serializer (`dataclasses.asdict`, `json.dumps(vars(task))`) **includes `golden_diff`** — `field(repr=False)` only affects `__repr__`. The Batch validator children legitimately need the gold diff (Gate 4), but `rl_task_pool/` (read by the training env, whose prompt renderer carefully hides golden) would carry it in plaintext one `json.loads` away from any reward-hacking trajectory that reads its own task manifest. The safeguard exists in the prompt renderer and nowhere in the storage contract.
|
| 72 |
+
|
| 73 |
+
**Recommendation:** In `s3_contract.py`, define two explicit serializers — `to_policy_row(task)` (drops `golden_diff` AND `deleted_symbols`) and `to_validator_row(task)` (full) — and make the policy-row writer the *only* code path that can populate `rl_task_pool/`. Add a unit test asserting `"golden_diff" not in json` for policy rows. Enforce the prefix split with bucket policy, not convention.
|
| 74 |
+
|
| 75 |
+
### 9. The F2 architecture is a five-service orchestration for a pipeline that has never run once locally. **[P1]**
|
| 76 |
+
|
| 77 |
+
F2 commits Glue 5.0 + Bedrock batch + EMR Serverless + AWS Batch + Step Functions + Lambda + CDK (~250 LOC IaC) before a single task has been validated end-to-end on a laptop (ADR-010's own post-review: the gates passed "against FakeSandbox materializers"; the Docker e2e is still `[~]`). Every stage's per-service rationale in F2 is individually sound, but the composition is premature: the corpus that matters first is O(10²–10³) tasks (SWE-smith's full 50k cost 20 human-hours and one machine), which is a **single-node workload**. The Step Functions DAG also has no idempotency/restart semantics, no run-level budget envelope (only `teacher_replay`'s in-process `max_total_usd`), and a 24h Bedrock-batch SLA in the middle of what should be a same-day iteration loop during development.
|
| 78 |
+
|
| 79 |
+
**Recommendation:** Build the pipeline as **stage functions with a local driver first** (`python -m composer_replication.pipeline.run --stage all --tasks 200`), with S3 used only as a dumb artifact store via the `s3_contract.py` writers. Promote individual stages to managed services only when a measured bottleneck demands it, in this order: (1) AWS Batch for sandbox validation (the only genuinely parallel-heavy stage), (2) Bedrock batch when replay volume × cost crosses the 50%-discount break-even, (3) Step Functions only when >1 unattended run/week. Glue and EMR Serverless are likely never needed at this corpus scale — F2's own "live caveat" already concedes the ingester "is not intrinsically Spark-shaped."
|
| 80 |
+
|
| 81 |
+
### 10. No secrets/PII gate at trace ingest — raw Claude Code sessions go to S3 verbatim. **[P1]**
|
| 82 |
+
|
| 83 |
+
F2 stage (a) uploads `~/.claude/projects/**.jsonl` to `raw/claude_code/` and Parquet-izes them. Claude Code session files contain the user's local file contents, env-var echoes, API keys in tool outputs, internal hostnames, and proprietary code from *whatever repos the user worked on* — none of which passed any license, secrets, or PII filter. The copyleft filter (`is_redistributable`) applies only to SWE-substrate tasks, not to traces. The flywheel then trains on this and (per the publications/ directory) the corpus is intended to be shareable.
|
| 84 |
+
|
| 85 |
+
**Recommendation:** Insert a mandatory scrub stage between ingest and storage: secrets scanning (gitleaks/trufflehog rule pack over message contents), path anonymization, and a per-session allowlist (only sessions from designated repos enter the corpus). Record scrub stats in the run manifest. This is ~1 day of work and belongs in `ClaudeCodeIngester` itself so no unscrubbbed `TraceState` can exist downstream.
|
| 86 |
+
|
| 87 |
+
### 11. No canonical trajectory IR — three trace shapes are about to become five. **[P1]**
|
| 88 |
+
|
| 89 |
+
Today: Claude Code JSONL → `TraceState` (messages + `student_action` as serialized block-list). Planned: Bedrock `.jsonl.out` rows, tree-controller branch trajectories, rollout-harness episodes (Finding 2), OpenHands/SWE-smith trajectories (ADR-002 v0.2). Each design names its own shape; `_normalize_action`'s whitespace hack is the symptom of the missing abstraction. Without one normalized trajectory schema, every pairwise consumer (DPO extractor, SFT formatter, wm_tuple writer, replay submitter) needs format-specific code, and the "trace format normalization" cost grows quadratically.
|
| 90 |
+
|
| 91 |
+
**Recommendation:** Define a `CanonicalTrajectory` schema (list of `Turn{role, content, tool_calls: [(name, canonical_args)], tool_results, error_kind}`) in `datagen/schema.py` as the single internal currency; every ingester is a `X -> CanonicalTrajectory` adapter, and `extract_dpo_pairs`/SFT formatting/wm_tuples consume only it. This also gives the divergence gate (Finding 3) its action algebra for free.
|
| 92 |
+
|
| 93 |
+
### 12. The flywheel has no cross-generation dedup — it will feed itself duplicates. **[P1]**
|
| 94 |
+
|
| 95 |
+
The flywheel (F1: "improved student generates the next round's seed traces") loops the corpus into itself. Dedup today: data-juicer `document_deduplicator` is per-batch only; F2 adds Spark `dropDuplicates` *within one run's* normalize stage. Nothing dedups **across `run_id`s**, so generation N+1's SFT corpus will contain near-copies of generation N's winning trajectories (same tasks, similar solutions), compounding each cycle — a known self-training collapse accelerant. The held-out guard detects collapse after the fact; dedup prevents its cheapest cause.
|
| 96 |
+
|
| 97 |
+
**Recommendation:** Add MinHash/LSH near-dup dedup keyed on `(task_id, canonical_action_sequence)` across all prior run manifests at corpus-write time, plus a per-task cap on retained trajectories (e.g., ≤K winners per task across all generations). Store the dedup index alongside `manifests/`.
|
| 98 |
+
|
| 99 |
+
### 13. License handling is field-deep, not repo-deep — and absent for the "point-at-a-repo" path. **[P1]**
|
| 100 |
+
|
| 101 |
+
`is_redistributable()` lowercases `instance["license_name"]` and substring-matches `("gpl","agpl","lgpl")`. For arbitrary repos there is no `license_name` (grounding Break 5); for SWE-smith instances the license lives in the toolkit's repo profiles, not the instance dict (02 §1: 2 GPLv3 + 4 LGPL repos in SWE-smith would need mapping); the substring check also misclassifies (e.g., "GPL-2.0-with-classpath-exception" semantics, dual-licensed repos) and ignores the difference between *using* a repo for training and *redistributing* derivative diffs (02 notes SWE-smith only claims the former). Repo-ingest is exactly where this must run, and no design places it there.
|
| 102 |
+
|
| 103 |
+
**Recommendation:** At repo ingest, run SPDX detection (`licensee`/`askalono`) on the cloned tree, store the SPDX id + detection confidence on the task, and split policy into two explicit gates: `trainable(license)` (permissive + most copyleft OK for internal training) and `redistributable(license)` (permissive only) — applied at corpus-*publish* time, not generation time, so copyleft repos still contribute non-redistributed training signal. Keep the existing function as the redistribution gate, fix it to exact-SPDX matching.
|
| 104 |
+
|
| 105 |
+
### 14. `wm_tuples/` ("ALL branches incl. failures") is an unbounded write path serving an ablation-gated consumer. **[P1]**
|
| 106 |
+
|
| 107 |
+
The world-model head — the sole consumer of `wm_tuples/` — has, per deepread 06, **zero direct evidence** for its configuration ("no published paper has tested an auxiliary next-state-prediction objective during RL on a code policy") and is explicitly an ablation arm, not a premise. Yet the S3 contract gives it the highest-volume prefix in the system (every `env.step()` observation of every branch, including all failures — observations are full test logs/file contents), with no size estimate, no sampling policy, no retention/lifecycle rule in any design. The pipeline's storage architecture is load-bearing on its most speculative research bet.
|
| 108 |
+
|
| 109 |
+
**Recommendation:** Make wm_tuple emission opt-in per run (`emit_wm_tuples: bool` in the run config, default off until the P4 ablation is scheduled), store observations as content-addressed blobs with dedup (test logs repeat massively), and attach an S3 lifecycle rule (expire after N days unless pinned by an ablation manifest). Do not let the typed-routing elegance ("failed branch is gold for the world model") force eager collection before the consumer exists.
|
| 110 |
+
|
| 111 |
+
### 15. The "create harder tasks dynamically" half of the curriculum is absent from every pipeline design — yet it is the blog's actual mechanism. **[P1]**
|
| 112 |
+
|
| 113 |
+
Deepread 01 confirms the verbatim Composer claim: "we both select for **and create** harder tasks dynamically throughout the run." The repo has SELECT-FOR (`DifficultyCurriculum`); the CREATE half (escalate deletion granularity function→file→feature, combine bugs, multi-feature targets minted *during* the run) appears in no design — `granularity` is hardcoded `"feature"`, and F1/F2's outer loop has no stage that takes curriculum state as *input* to task synthesis. Meanwhile SWE-smith's **Combine-Bugs** strategy (96.9% yield, zero cost, 15 median F2P — 02 §1) is precisely a CREATE-half mechanism sitting in the buyable toolkit.
|
| 114 |
+
|
| 115 |
+
**Recommendation:** Add a `task_escalation` stage to the outer loop contract now (input: curriculum pass-rates per task; output: new combined/escalated task candidates into the validation queue), implemented first as SWE-smith Combine-Bugs over already-validated per-repo bugs. Even if deferred, the *stage boundary* must exist in the orchestration design, or the pipeline hard-codes a select-only curriculum and the eventual retrofit will break the Step Functions/driver topology.
|
| 116 |
+
|
| 117 |
+
---
|
| 118 |
+
|
| 119 |
+
## P2 — Real gaps, lower blast radius
|
| 120 |
+
|
| 121 |
+
### 16. Where the pipeline should live: extend the monorepo with isolated extras; do NOT split a package. **[P2]**
|
| 122 |
+
|
| 123 |
+
F1/F2 propose `composer_replication/pipeline/`, `composer_replication/datagen/aws/`, and root `infra/`. This is broadly right; a separate datagen package/repo would be wrong now: the shared dataclasses (`FeatureDeletionTask`, `TraceState`, `DPOPair`, future `CanonicalTrajectory`) ARE the contract, and splitting them across repos forces schema version-pinning with zero external consumers. The risks to manage are dependency bleed (boto3/pyspark/docker must not enter the training image) and entrypoint importability.
|
| 124 |
+
**Recommendation:** (a) `composer_replication/datagen/` stays the pure library (schema, env, validator, monitor, curriculum — no cloud deps); (b) new `composer_replication/pipeline/` holds stage drivers + `s3_contract.py`, all cloud/Spark imports lazy and gated behind a `[pipeline]` extra; (c) `infra/` at repo root for IaC only; (d) CI check that `import composer_replication.datagen` succeeds with no extras installed. Revisit a package split only when a second project actually consumes the datagen library.
|
| 125 |
+
|
| 126 |
+
### 17. No state/node ID contract for tree-generated states. **[P2]**
|
| 127 |
+
|
| 128 |
+
`state_id = f"{path.stem}::{idx:04d}"` is unique only within one session file; tree-controller branches, rollout-harness episodes, and Bedrock `recordId` joins all need globally unique, lineage-encoding IDs (parent pointer, branch index, run_id), and `recordId==state_id` is named "the universal join key" without ever specifying the tree extension.
|
| 129 |
+
**Recommendation:** Commit `node_id = {run_id}/{trace_id}/{path-from-root as branch indices}` in `s3_contract.py`; parent derivable by truncation, collision-free by construction.
|
| 130 |
+
|
| 131 |
+
### 18. No dataset versioning, cards, or reproducibility pins. **[P2]**
|
| 132 |
+
|
| 133 |
+
`manifests/run_id.json` carries counts/cost/lineage but no: substrate dataset revision hashes (HF datasets mutate — SWE-smith grew from 50,137 to 59,136 rows post-publication, per 02), swesmith/toolchain versions, Docker image digests (tags like `:latest` in `image_for()` are mutable!), prompt-template hashes, or a generated dataset card (composition, license mix, decontamination statement, known limitations). Irreproducible corpora are unpublishable and undebuggable.
|
| 134 |
+
**Recommendation:** Pin image **digests** not tags in `FeatureDeletionTask.broken_image`; record substrate `(dataset_id, revision)` and generator git SHA in the manifest; auto-generate a dataset card per run from the manifest (the HF `datasets` card template is fine). ~100 LOC total.
|
| 135 |
+
|
| 136 |
+
### 19. DiLoCo rendezvous traffic co-located in the dataset bucket. **[P2]**
|
| 137 |
+
|
| 138 |
+
Both F1 and F2 put `diloco/rendezvous/` (hot, high-frequency, delete-heavy training sync) in the same bucket as the immutable corpus. This entangles IAM (training nodes get write access to a bucket containing `tasks/golden/`), lifecycle policies, and cost attribution.
|
| 139 |
+
**Recommendation:** Separate bucket (or at minimum a separate top-level prefix with its own bucket policy statement and aggressive expiry), keeping the dataset bucket append-only for everything except `quarantine/`.
|
| 140 |
+
|
| 141 |
+
### 20. No corpus-quality acceptance metric — the pipeline has no definition of "done/good." **[P2]**
|
| 142 |
+
|
| 143 |
+
Every stage has pass/fail gates for *tasks*, but nothing measures whether the resulting *corpus* is any good before it consumes GPU budget. SWE-Gym/SWE-smith both validated with small SFT probes (491 and 5,016 trajectories respectively, measurable lift on held-out).
|
| 144 |
+
**Recommendation:** Define the corpus acceptance test as part of the pipeline: SFT a small model (e.g., Qwen3-Coder-7B, LoRA) on each new corpus generation and require a measurable delta on the internal holdout (and decontaminated SWE-bench subset) before the corpus is promoted to `status=accepted` in its manifest. This is the dataset analogue of the trainer's `HeldOutGuard`.
|
| 145 |
+
|
| 146 |
+
### 21. Orchestration restart semantics and budget envelopes are unspecified. **[P2]**
|
| 147 |
+
|
| 148 |
+
F2's Step Functions design has `.sync` integrations but no statement of stage idempotency (re-running stage (b) after partial Bedrock job failure double-writes `replay/`?), no run-level cost ceiling (the only budget control in the system is `replay_trace`'s in-process `max_total_usd=5.0`), and no poison-task quarantine path at the orchestration level (a task that wedges Batch children retries forever).
|
| 149 |
+
**Recommendation:** Make every stage write-once per `(run_id, stage, attempt)` with a completion marker object; add a `budget_usd` field to the run manifest enforced by the driver before each paid stage; route repeatedly-failing array indices to `quarantine/` after `retryStrategy` exhaustion.
|
| 150 |
+
|
| 151 |
+
---
|
| 152 |
+
|
| 153 |
+
## The MINIMAL pipeline (one pointed-at repo → real SFT corpus), and what the full vision adds in what order
|
| 154 |
+
|
| 155 |
+
**Minimal (Stage 0 — local, no new AWS services, ~1–2 weeks):**
|
| 156 |
+
1. `pip install swesmith`; build the repo profile (env image, test parser) — *buy*, ~7 min human + automation [Finding 6].
|
| 157 |
+
2. Synthesize candidate tasks: PR Mirror first (best data per SWE-smith Table 5 = the repo's own gold-patch-reversion mechanic), procedural/Combine as volume fallback — *buy*.
|
| 158 |
+
3. SPDX license gate at clone + benchmark decontamination check (repo ∉ eval suite) — *build*, ~80 LOC [Findings 5, 13].
|
| 159 |
+
4. 4-gate `validate_task()` in `DockerSandbox` (exists) + `scrub_tree` — *have*; wire the swesmith container into `Sandbox.boot` — ~50 LOC.
|
| 160 |
+
5. Expert trajectory collection: mini-swe-agent/SWE-agent + a frontier model over validated tasks, `$cap` per task — *adopt + adapt*, ~200–400 LOC [Finding 2]. **This is the critical missing component.**
|
| 161 |
+
6. Admission filter: `_grade()==1.0` + `HackMonitor` clean + `pass_to_pass` guard — *have*.
|
| 162 |
+
7. Format to messages-schema SFT rows via `CanonicalTrajectory`; MinHash dedup; `HeldoutSplit` with `check_content=True`; write Parquet + manifest + dataset card via `s3_contract.py` — *build*, ~250 LOC [Findings 7, 11, 12, 18].
|
| 163 |
+
|
| 164 |
+
Total new code ≈ 600–900 LOC plus two adopted dependencies. Output: a real, decontaminated, license-clean, deduped, carded SFT corpus from one repo — and, as a free byproduct, the env-grounded traces the tree needs (Finding 1).
|
| 165 |
+
|
| 166 |
+
**Then, strictly in this order:**
|
| 167 |
+
- **Stage 1:** AWS Batch array validation + Bedrock batch replay when volume justifies (Finding 9); DPO channel on env-grounded traces after the tool-call parser fixes `_normalize_action` (Finding 3).
|
| 168 |
+
- **Stage 2:** Depth-1 tree (N candidates, one env-step each, oracle-graded) — requires `Sandbox.fork()` spike (Finding 4); divergence-gated depth>1 only after the gate's firing rate is measured.
|
| 169 |
+
- **Stage 3:** Curriculum CREATE-half via Combine-Bugs (Finding 15); flywheel with cross-generation dedup (Finding 12); `wm_tuples` emission only when the P4 ablation is scheduled (Finding 14).
|
| 170 |
+
- **Stage 4:** Step Functions/Argo orchestration once runs are routine (Finding 21).
|
| 171 |
+
|
| 172 |
+
---
|
| 173 |
+
|
| 174 |
+
## Severity tally
|
| 175 |
+
|
| 176 |
+
| Severity | Count | Findings |
|
| 177 |
+
|---|---|---|
|
| 178 |
+
| **P0** | 6 | 1 (trace/oracle disjointness), 2 (no rollout harness), 3 (divergence gate uncomputable), 4 (no sandbox fork), 5 (no decontamination), 6 (buy-vs-build inversion) |
|
| 179 |
+
| **P1** | 9 | 7 (two S3 contracts), 8 (golden_diff serialization leak), 9 (premature 5-service orchestration), 10 (no secrets/PII gate), 11 (no trajectory IR), 12 (no cross-generation dedup), 13 (license gate too shallow), 14 (wm_tuples unbounded/speculative), 15 (CREATE-half absent) |
|
| 180 |
+
| **P2** | 6 | 16 (package placement), 17 (node ID contract), 18 (versioning/cards), 19 (bucket co-location), 20 (corpus acceptance metric), 21 (restart/budget semantics) |
|
|
@@ -0,0 +1,131 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Verified Findings — Consolidated Critical Review (Verification Agent)
|
| 2 |
+
|
| 3 |
+
**Date:** 2026-06-09
|
| 4 |
+
**Inputs verified:** `research/deepread/10-critic-fidelity.md`, `research/deepread/11-critic-design.md`, cross-checked against repo code, the deep-reads (`00`–`08`), and the fetched Composer 2.5 blog body (`research/notes/introducing-composer-25-cursor.md`).
|
| 5 |
+
|
| 6 |
+
> **MISSING INPUT:** `research/deepread/09-critic-feasibility.md` did not exist at verification time
|
| 7 |
+
> (polled for ~12 minutes; directory contains only 00–08, 10, 11). The feasibility critic's findings
|
| 8 |
+
> are therefore **unverified and not consolidated here**. If that file appears later, a delta
|
| 9 |
+
> verification pass is required before its P0s are acted on.
|
| 10 |
+
|
| 11 |
+
**Method:** every P0 (and every P1 naming a specific file/claim) was re-checked against the actual
|
| 12 |
+
repo file at the cited line, the verbatim blog text, or the deep-read's primary-source quotes.
|
| 13 |
+
Verdicts: **CONFIRMED** / **REFUTED** / **OVERSTATED** (true core, inflated framing).
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## Part 1 — Verification ledger
|
| 18 |
+
|
| 19 |
+
### Fidelity critic (10) — P0s
|
| 20 |
+
|
| 21 |
+
| # | Claim | Verdict | Evidence |
|
| 22 |
+
|---|---|---|---|
|
| 23 |
+
| F-1 | Benchmark numbers (69.3%, Terminal-Bench 2.0 parity) asserted as Cursor-stated; in no primary source | **CONFIRMED** | `research/01-composer-2.5.md:63-64` asserts "it hit 69.3%" under "Cursor claims…"; fetched blog body contains zero benchmark numbers (only "not well captured by existing benchmarks", note line 26). Header audit note flags but body asserts unhedged. |
|
| 24 |
+
| F-2 | "Feature Deletion + 24 other (unnamed) generators" — count invented | **CONFIRMED** | Blog verbatim (note line 50): "a range of approaches… one synthetic approach is feature deletion" — no count. `docs/COMPOSER_RECIPE_MAPPING.md:75`, `research/06:330` ("The other ~24 generators"), `research/09:23` all carry the fabricated 24. |
|
| 25 |
+
| F-3 | SDPO declared "mathematically the same" as Composer's mechanism — unsupported identification | **CONFIRMED** | `docs/COMPOSER_RECIPE_MAPPING.md:25` ("mathematically the same"), `:151` ("direct formalization"); `opsd.py:13-14` ("formalizes the same loss as Composer 2.5's…"). Blog cites the papers only as "For more background on this approach see" (note lines 67/148). Deep-read 03 (lines 93, 252, 280-281, 308): SDPO loss is full-rollout with feedback-in-prefix; repo's design is turn-localized hint-splice — neither SDPO nor confirmed Composer. |
|
| 26 |
+
| F-4 | Streaming DiLoCo misattributed (wrong authors AND wrong title on arXiv:2501.18512) | **CONFIRMED** | `composer_replication/diloco/__init__.py:8` "Streaming DiLoCo (Liu et al. 2025)"; `research/design-F4-decoupled-diloco-s3.md:109` attaches "Eager Updates…" title to 2501.18512. Deep-read 05 F4: 2501.18512 = Douillard et al., "Streaming DiLoCo with overlapping communication"; Eager Updates = arXiv:2502.12996 (Kale et al.). |
|
| 27 |
+
| F-5 | CWM cited as licensing train-on-all for an RL-time aux head; CWM does it in a separate mid-training stage | **CONFIRMED** | Final report line 33: "crucially training *on all* trajectories for the world-model head, reserving success-filtering only for the RL reward [13]" (+ ref line 258). Deep-read 06 quotes CWM §2.2 verbatim ("we do not filter trajectories…") and documents the three-phase pipeline: world-modeling is in base weights pre-RL, not an aux head riding policy gradients. |
|
| 28 |
+
|
| 29 |
+
### Fidelity critic (10) — P1s naming specific files/claims
|
| 30 |
+
|
| 31 |
+
| # | Claim | Verdict | Evidence |
|
| 32 |
+
|---|---|---|---|
|
| 33 |
+
| F-6 | CWM "65.8%" cited without test-time-scaling qualifier | **CONFIRMED** | Final report line 33 cites bare "65.8% on SWE-bench Verified"; CWM abstract per deep-read 06: "65.8% … (with test-time scaling)". |
|
| 34 |
+
| F-7 | Chain-of-World (2603.03195) is robotics VLA, used for SWE | **CONFIRMED** (via deep-read 06 finding 2.2) | Deep-read 06 exec summary: "robotics/embodied VLA paper, not an SWE paper". |
|
| 35 |
+
| F-8 | "85% post-training compute" stated as fact in research/01 body | **CONFIRMED** | `research/01-composer-2.5.md:14` asserts it unhedged; header (line 5) flags it as "community consensus, not Cursor-stated" — body never inline-tags. |
|
| 36 |
+
| F-9 | Implemented "feature deletion" is gold-patch reversion, not the blog's delete-from-functional-codebase mechanism | **CONFIRMED** | Blog verbatim (note line 50): agent "asked to delete code and files… codebase remains functional". `substrates.py` SweBenchAdapter reverts gold patches of pre-labeled instances; mapping doc line ~42 paraphrase drops the agentic deleter. "Point at a repo" appears nowhere in the blog. |
|
| 37 |
+
| F-10 | Difficulty curriculum: only SELECT half built; CREATE half 0% | **CONFIRMED** | `curriculum.py` header quotes the blog's "select for and create"; `substrates.py:80` hardcodes `granularity="feature"`; no task-escalation stage anywhere. (= Design #15; merged below.) |
|
| 38 |
+
| F-12 | Mapping doc states `KL(teacher‖student)` — direction unsupported by blog, opposite of SDPO Eq. 1 | **CONFIRMED** | `COMPOSER_RECIPE_MAPPING.md` (~line 19): "KL( teacher_logits_at_turn_t \|\| student_logits_at_turn_t )". Blog: "moves the student's token probabilities toward the teacher's" — directionless. SDPO Eq. 1 per deep-read 03 line 308: `KL(π_θ ‖ stopgrad(·))`, student first. |
|
| 39 |
+
| F-13 | `opsd.py` β-convention docstring inverted vs upstream | **CONFIRMED** | `opsd.py:55-59` labels β=0 "reverse KL"; OPSD README per deep-read 03 (§2.2, line 168 "INVERTED", table line 497): "Beta=0 means forward KL". Code numerically correct; labels swapped. |
|
| 40 |
+
| F-14 | SDPO channel lacks the paper's EMA/trust-region teacher regularization | **CONFIRMED** | Deep-read 03 lines 32-34, 81-82: non-regularized live teacher diverges (Table 4: 36.1% vs 50.6%); repo teacher = live weights, no EMA. |
|
| 41 |
+
| F-16 | "$0.98/trace verified economic floor" hides trace definition; real sessions ~2 OOM more | **CONFIRMED** | `teacher_replay.py:7-8` unqualified docstring; `VISION_VALIDATION.md:63,76`; `ADR-002:60` (125→2,830 tool-use messages/session); deep-read 07 FR-R5: ~$73/full session. (VISION_VALIDATION's Objection 2 partially self-flags; the docstring does not.) |
|
| 42 |
+
| F-17 | "$64 ungated tree" is a flat N=8×T=1000 extrapolation, not a tree | **CONFIRMED** | `research/05:256` ($0.008×1000×8 flat math); `comparisons.md:33` and final report line 66 label it the branching/"ungated" cost. True tree is O(N^D), strictly worse. |
|
| 43 |
+
| F-18 | `kl_in_reward.py`: verl k1 "*only* reverse-KL option" overclaim | **CONFIRMED** | `kl_in_reward.py:12` verbatim; deep-read 04 §4.3: verl also supports `kl_penalty="low_var_kl"` (k3-family). |
|
| 44 |
+
| F-20 | GSPO preset inherits GRPO-scale clipping — 2 OOM off the paper | **CONFIRMED** | `composer_trainer.py` "gspo" preset (~line 814) sets no epsilon → TRL default ~0.2; GSPO paper §5.1: 3e-4/4e-4. Companion confirmed: "cispo" preset omits explicit `beta`. |
|
| 45 |
+
| F-21 | rStar named "closest precedent" (misread granularity); Tree-GRPO + SWE-Search uncited | **CONFIRMED** | `framework/composer-replication-framework.md:17-18` ("Closest precedent: rStar-Math… open territory"); grep: zero hits for Tree-GRPO/2509.21240/SWE-Search/2410.20285 in `research/05`. |
|
| 46 |
+
| F-22 | "Nine-tenths is reuse" conflates design-reuse with build status | **CONFIRMED** | Final report line 133 ("the substrate already exists — roughly nine-tenths of it"); grounding doc Claim 5: tree, wm-head, pipeline/, infra/ are 0% built. |
|
| 47 |
+
| F-23 | "Parameter isolation eliminates interference" overclaims DART; aux-head evidence wholly analogical | **CONFIRMED** | Deep-read 06: 2602.00994 shows interference persists on isolated LoRA modules; "evidentiary gap is total for the specific proposed configuration". |
|
| 48 |
+
| F-24 | VeRL "first-class agentic RL" — async path is experimental | **CONFIRMED** | Deep-read 08 lines 66-73: `fully_async_policy`/`transfer_queue` under `verl/experimental`; "slightly overclaims". |
|
| 49 |
+
|
| 50 |
+
P2s (F-25…F-33) spot-checked where cheap; no refutations found. F-33 (`_normalize_action` absent from risk lists) merges into Design #3.
|
| 51 |
+
|
| 52 |
+
### Design critic (11) — P0s
|
| 53 |
+
|
| 54 |
+
| # | Claim | Verdict | Evidence |
|
| 55 |
+
|---|---|---|---|
|
| 56 |
+
| D-1 | Tree seed traces (Claude Code) and execution oracle (FeatureDeletionEnv) operate on disjoint data | **CONFIRMED** | `design-F1:18,70-74` seeds the tree from `ingestion/claude_code.py` TraceStates (local `~/.claude/projects` JSONL — no image, no F2P); `env.py:59` requires `task.broken_image`; `schema.py:35-38` `__post_init__` raises on empty `fail_to_pass`. No design reconciles them. |
|
| 57 |
+
| D-2 | No agent rollout harness exists — SFT corpus has no producer | **CONFIRMED** (minor nuance) | Deep-read 08 §5.1: `reward_fn` fallback = single submit, "dead end for genuine multi-turn"; `teacher_replay` emits single next-actions; F1 line 136 names "tree controller" (0% built) as the sft_corpus producer; F2 stage (d) writes `corpus/sft` with no trajectory-generating stage upstream. Nuance: the tree-controller *design* does call `env.step()`, but expands counterfactual branches from (unexecutable, per D-1) frozen traces — it is not an episode-completing rollout loop. |
|
| 58 |
+
| D-3 | Divergence gating not computable from what components emit | **CONFIRMED** | `teacher_replay.py:195-203`: `_normalize_action` = whitespace-collapse + lowercase, docstring self-admits skeleton status; teachers return text (no cross-model distributions); deep-read 07 FR-R8 (HIGH): "mostly noise on real traces" → gate fires everywhere → silent O(N^D). |
|
| 59 |
+
| D-4 | Sandbox lacks fork/snapshot primitive the tree requires | **CONFIRMED** | `sandbox.py:99-106` Protocol = boot/exec/run_tests/trajectory/is_command_allowed only; grep "fork\|snapshot" → only "fork-bomb guard"; `docker_sandbox.py:15,157` `read_only=True` rootfs. |
|
| 60 |
+
| D-5 | Zero benchmark decontamination anywhere | **CONFIRMED** | grep "decontamin/contamination" across `composer_replication/`, `docs/`, F1, F2, ADR-010 → zero hits; `holdout.py:171` `HeldoutSplit` is internal-split-only; deep-read 02 flags R2E-Gym full-set overlap risk. |
|
| 61 |
+
| D-6 | Buy-vs-build inversion: repo hand-builds what swesmith ships; PR-Mirror validation uncited | **CONFIRMED** | No `swesmith` in `pyproject.toml` or ADR-010 (grep empty); deep-read 02 lines 230, 251, 306: PR Mirror ≡ repo's gold-patch reversion, best training data per SWE-smith Table 5, "$1,360 + 20 human-hours" verified — "neither research/06 nor ADR-010 cite this result". |
|
| 62 |
+
|
| 63 |
+
### Design critic (11) — P1s naming specific files/claims
|
| 64 |
+
|
| 65 |
+
| # | Claim | Verdict | Evidence |
|
| 66 |
+
|---|---|---|---|
|
| 67 |
+
| D-7 | Two unreconciled S3 contracts | **CONFIRMED** (one sub-point softened) | F1 lines 86-93 (`runs/<id>/sft_corpus/ dpo_pairs/ …`) vs F2 lines 133-143 (`traces/v1/run_id=<id>`, `corpus/v1/...{sft,dpo}`); grounding step 8 (lines 91-93) pastes both — `corpus/v1/.../dpo/` AND `dpo_pairs/` coexist. Nuance: F2:129 offers "reuse the existing amazon-sagemaker-… **or** a dedicated composer-datagen-…" — two bucket *options*, not a hard contradiction. |
|
| 68 |
+
| D-8 | `golden_diff` leaks via naive serialization; `repr=False` ≠ exclusion | **CONFIRMED** | `schema.py:29` `field(default="", repr=False)`; F2:100,135 writes "FeatureDeletionTask rows" to `tasks/v1/.../manifest.jsonl`; `dataclasses.asdict`/`vars()` include repr=False fields. Only `env.py:73` (prompt renderer) hides it. |
|
| 69 |
+
| D-9 | Five-service AWS orchestration before one local end-to-end run | **CONFIRMED** | F2 names Glue/EMR Serverless/AWS Batch/Bedrock batch/Step Functions/Lambda (46 hits); ADR-010:99-103 concedes gates passed "against FakeSandbox", Docker e2e `[~]`. |
|
| 70 |
+
| D-10 | No secrets/PII gate at trace ingest | **CONFIRMED** | F2:29 uploads raw `~/.claude/projects/**.jsonl` to `raw/claude_code/`; grep secrets/PII/gitleaks/trufflehog in F2 → zero; `scrub_tree` strips caches, not secrets; `is_redistributable` applies to tasks only. |
|
| 71 |
+
| D-12 | No cross-generation dedup in the flywheel | **CONFIRMED** | F2:114-118: data-juicer `document_deduplicator` per-batch + Spark `dropDuplicates` within one run; `parent_run_id` (F2:151) threads lineage but nothing dedups across runs. |
|
| 72 |
+
| D-13 | License gate is field-deep substring matching, absent for repo-ingest | **CONFIRMED** | `substrates.py:86-90`: `lic = task.upstream_license.lower(); return not any(c in lic for c in _COPYLEFT)`; no SPDX detection at clone path; trainable-vs-redistributable distinction absent. |
|
| 73 |
+
| D-14 | `wm_tuples/` highest-volume prefix serving an ablation-gated, zero-direct-evidence consumer | **CONFIRMED** | F1:90 "wm_tuples/ ← … ALL branches"; deep-read 06: "no published paper has tested an auxiliary next-state-prediction objective during RL on a code policy"; no sampling/lifecycle policy in any design. |
|
| 74 |
+
| D-15 | CREATE-half of curriculum absent from every pipeline design | **CONFIRMED** | Same evidence as F-10; merged below as one finding. |
|
| 75 |
+
| D-11 | No canonical trajectory IR (3 trace shapes → 5) | **CONFIRMED** | `_normalize_action` stub is the symptom; Claude Code JSONL/TraceState, Bedrock `.jsonl.out`, planned tree + rollout + OpenHands shapes each named in separate docs with no shared schema (no `datagen/schema.py` trajectory type exists). |
|
| 76 |
+
|
| 77 |
+
D-16…D-21 (P2): spot-checked — `state_id` format confirmed (`claude_code.py:181`), `diloco_rendezvous/` co-located in F1's run layout (F1:93). No refutations.
|
| 78 |
+
|
| 79 |
+
**Verification summary: 0 REFUTED, 2 minor nuances (D-2, D-7), everything else CONFIRMED.** Both critics were accurate against source; no fabrication detected.
|
| 80 |
+
|
| 81 |
+
---
|
| 82 |
+
|
| 83 |
+
## Part 2 — FINAL consolidated finding list (ranked by severity × confidence, deduplicated)
|
| 84 |
+
|
| 85 |
+
### Tier 1 — P0, high confidence, load-bearing for the dataset-pipeline build
|
| 86 |
+
|
| 87 |
+
1. **[V1] The Channel-2 identity claim is wrong: SDPO is *background*, not Composer's mechanism — and the repo's implementation is a third design.** Cursor's blog cites OPSD/SDPO only as "for more background"; SDPO's published loss is full-rollout with feedback-in-prefix and a *regularized* teacher; the repo's turn-localized hint-splice (live-weights teacher, no EMA) is its own blog-inspired variant. Fix `COMPOSER_RECIPE_MAPPING.md:25,151`, `opsd.py:13-14`; also fix the KL-direction line (F-12), the inverted β docstring (F-13), and document/implement teacher regularization (F-14). *(Fidelity F-3 + F-12 + F-13 + F-14.)*
|
| 88 |
+
|
| 89 |
+
2. **[V2] The envisioned tree-of-work pipeline cannot execute as drawn: seed traces and the execution oracle are disjoint, no rollout harness exists to produce the SFT corpus, the divergence gate is uncomputable from text + a whitespace normalizer, and the Sandbox has no fork primitive.** Four independently confirmed structural breaks that compose into one verdict: Phase 1 must seed from env-grounded rollouts (adopt mini-swe-agent/SWE-agent as the trajectory collector), demote Claude Code traces to flat Channel-3/SFT-style uses, build a canonical tool-call action algebra before any `tree_controller.py`, and spike `Sandbox.fork()` before committing to depth>1. *(Design D-1 + D-2 + D-3 + D-4 + fidelity F-33.)*
|
| 90 |
+
|
| 91 |
+
3. **[V3] Zero benchmark decontamination anywhere in code or designs.** The pipeline trains on SWE-bench-family substrates and will be scored on SWE-bench Verified; substrates have partial/divergent decontamination; `HeldoutSplit` is internal-only; the word does not appear in F1/F2/ADR-010. Must exist before the first corpus is generated. *(Design D-5.)*
|
| 92 |
+
|
| 93 |
+
4. **[V4] Buy-vs-build inversion: the repo plans to hand-build what `pip install swesmith` ships, while its own core mechanic (gold-patch reversion) is SWE-smith's PR Mirror — validated as the *best* training data by an ablation the repo never cites.** The user's "point-at-a-repo" ask is ~100 LOC of swesmith adapter, not an unspecified image-builder subsystem ($1,360 + 20 human-hours is the budget bar to beat). Relatedly, the implemented "feature deletion" is the inversion *analogue*, not the blog's delete-from-functional-codebase mechanism — and downstream docs should say so. *(Design D-6 + fidelity F-9.)*
|
| 94 |
+
|
| 95 |
+
### Tier 2 — P0 fidelity corrections (high confidence, citation/claim integrity)
|
| 96 |
+
|
| 97 |
+
5. **[V5] Fabricated quantities circulating as Cursor-stated facts:** (a) benchmark numbers 69.3%/Terminal-Bench parity appear in no primary source (`research/01:63-64`); (b) "24 other generators" is a back-formation from the 25x *task* multiplier (`COMPOSER_RECIPE_MAPPING.md:75`, `research/06:330`, `research/09:23`); (c) "85% post-training compute" asserted unhedged in the research/01 body. Strike or inline-tag all three. *(Fidelity F-1 + F-2 + F-8.)*
|
| 98 |
+
|
| 99 |
+
6. **[V6] World-model aux-head's load-bearing citation is a misread: CWM trains-on-all in a dedicated *mid-training* stage, not as an aux head riding RL policy gradients; 65.8% requires the test-time-scaling qualifier; Chain-of-World is robotics; "parameter isolation eliminates interference" overclaims DART.** The exact proposed configuration has zero published evidence — keep it an ablation arm and gate `wm_tuples/` emission (highest-volume prefix, most speculative consumer) behind the scheduled ablation. *(Fidelity F-5 + F-6 + F-7 + F-23 + design D-14.)*
|
| 100 |
+
|
| 101 |
+
7. **[V7] Streaming DiLoCo citation is doubly wrong** — arXiv:2501.18512 is Douillard et al., "Streaming DiLoCo with overlapping communication"; "Eager Updates…" is arXiv:2502.12996 (Kale et al.); "Liu et al." is the Async Local-SGD paper. Fix `diloco/__init__.py:8` and `design-F4:109`. Low blast radius, unambiguous. *(Fidelity F-4.)*
|
| 102 |
+
|
| 103 |
+
### Tier 3 — P1, high confidence (will bite during the build)
|
| 104 |
+
|
| 105 |
+
8. **[V8] Two unreconciled S3 contracts (F1 vs F2) + `golden_diff` serialization leak.** Write `s3_contract.py` first with one layout and two explicit serializers (`to_policy_row` drops `golden_diff`/`deleted_symbols`; unit-test the absence); fold `divergence_pairs` into `corpus_dpo` provenance; separate the DiLoCo rendezvous prefix/bucket. *(Design D-7 + D-8 + D-19.)*
|
| 106 |
+
|
| 107 |
+
9. **[V9] No secrets/PII scrub at Claude Code trace ingest** — raw sessions (file contents, keys in tool outputs, proprietary code) go to S3 verbatim; the only license filter applies to tasks, and it is a lowercase substring match with no SPDX detection and no trainable-vs-redistributable split for the repo-ingest path. *(Design D-10 + D-13.)*
|
| 108 |
+
|
| 109 |
+
10. **[V10] The blog's CREATE-half of the dynamic curriculum ("select for **and create** harder tasks") is absent from code and from every pipeline design** — `granularity` hardcoded `"feature"`, no escalation stage in the outer loop. Cheapest first implementation: SWE-smith Combine-Bugs (96.9% yield) as a `task_escalation` stage; at minimum reserve the stage boundary in the orchestration contract. *(Fidelity F-10 ≡ design D-15.)*
|
| 110 |
+
|
| 111 |
+
11. **[V11] Cost claims mislabeled in ways that distort planning:** "$0.98/trace verified floor" is a 50-step synthetic-state benchmark (real sessions ~$70–80 flat at 125–2,830 steps/session); "$64 ungated tree" is a flat 8-teacher×1000-step extrapolation, not a tree (true tree is O(N^D)). Fix `teacher_replay.py:7-8`, `VISION_VALIDATION.md`, `comparisons.md:33`, final report §10/§3. *(Fidelity F-16 + F-17.)*
|
| 112 |
+
|
| 113 |
+
12. **[V12] Premature five-service AWS orchestration + flywheel hygiene gaps:** F2 commits Glue/EMR/Batch/Bedrock-batch/Step-Functions before one local e2e run (ADR-010 gates passed against FakeSandbox); no cross-generation dedup (self-training collapse accelerant); no canonical trajectory IR; no restart/budget semantics. Build a local stage-driver first; add MinHash cross-run dedup + `CanonicalTrajectory` schema. *(Design D-9 + D-12 + D-11 + D-21.)*
|
| 114 |
+
|
| 115 |
+
### Tier 4 — P1/P2 precision fixes (confirmed, lower blast radius)
|
| 116 |
+
|
| 117 |
+
13. **[V13] Preset/config drift vs papers:** GSPO preset missing 3e-4/4e-4 epsilons (operationally not GSPO); CISPO preset should set `beta=0.0` explicitly; `kl_in_reward.py:12` "only reverse-KL option" → "default/recommended"; Comedy-of-Estimators citation rests on abstract-only. *(Fidelity F-20 + F-18 + F-19.)*
|
| 118 |
+
|
| 119 |
+
14. **[V14] Provenance/novelty bookkeeping:** cite Tree-GRPO (2509.21240) and SWE-Search (2410.20285) as nearest neighbors and fix the rStar granularity misread (`framework/…:17`, `research/05`); re-scope "nine-tenths reuse" to the recipe-replication layer (tree/wm-head/pipeline are 0% built); note verl async path is experimental. *(Fidelity F-21 + F-22 + F-24.)*
|
| 120 |
+
|
| 121 |
+
15. **[V15] Remaining P2 wording fixes from the fidelity critic (F-25…F-33)** — stage attribution of Muon, deleter-unknown framing, 25x non-convertibility, grounding-doc paraphrase-as-quote, DiLoCo FP4/H-default details, Foresight@k "(we define this)", SWE-rebench unverified counts — all consistent with the deep-reads; apply as a batch documentation pass. Plus design P2s: node-ID contract, image-digest pinning + dataset cards, corpus acceptance probe (small-model SFT delta before GPU spend).
|
| 122 |
+
|
| 123 |
+
---
|
| 124 |
+
|
| 125 |
+
## Coverage note
|
| 126 |
+
|
| 127 |
+
The design critic's minimal-pipeline proposal (swesmith + SPDX/decontamination gates + SWE-agent
|
| 128 |
+
rollout harness + canonical trajectory schema + `s3_contract.py`, ≈600–900 LOC) is consistent with
|
| 129 |
+
every confirmed finding above and is the recommended Stage-0 build order. The feasibility critic
|
| 130 |
+
(09) was never produced; its absence means cost/timeline feasibility claims in the designs have had
|
| 131 |
+
only the fidelity-angle checks (V11) — treat any standalone feasibility assertions as unreviewed.
|
|
@@ -0,0 +1,197 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Synthesis: Critical-Review Verdict + The Dataset-Generation Pipeline Architecture
|
| 2 |
+
|
| 3 |
+
> **Date:** 2026-06-09. **Inputs:** the 8 deep-reads (`01`–`08`), grounding map (`00`),
|
| 4 |
+
> verified findings (`12`), design-critic minimal pipeline (`11` §MINIMAL).
|
| 5 |
+
> **Provenance:** 8 source-cluster readers re-fetched and re-read every primary source
|
| 6 |
+
> (Composer 2.5 blog verbatim, Composer 2 techreport arXiv:2603.24477 full HTML,
|
| 7 |
+
> SWE-smith/SWE-Gym/R2E-Gym/SWE-bench full texts, SDPO/OPSD, Dr.GRPO/DAPO/GSPO/CISPO +
|
| 8 |
+
> Comedy-of-Estimators, DiLoCo/Streaming-DiLoCo, MuZero/Dreamer/CWM + the anti-evidence
|
| 9 |
+
> cluster, LATS/ToT/rStar/Tree-GRPO/SWE-Search/Symphony/Socratic-SWE, TRL/verl/SkyRL live
|
| 10 |
+
> docs, SWE-MiniSandbox); 2 adversarial critics' findings were then independently
|
| 11 |
+
> verified line-by-line — **0 refuted**.
|
| 12 |
+
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
+
## Part A — The critical-review verdict in one page
|
| 16 |
+
|
| 17 |
+
**The recipe-replication layer (Channels 1+2 mechanics, env, safeguards) is solid and
|
| 18 |
+
mostly faithful. The *story we tell about it* has confirmed fidelity defects, and the
|
| 19 |
+
*envisioned dataset pipeline* has four structural breaks that would have made it
|
| 20 |
+
unbuildable as drawn.** The good news: every break has a cheap, evidence-backed fix, and
|
| 21 |
+
the biggest fix is a *buy* (SWE-smith), not a build.
|
| 22 |
+
|
| 23 |
+
### What survived adversarial review intact
|
| 24 |
+
- Feature-deletion-by-gold-patch-reversion as the task mechanic — **independently
|
| 25 |
+
validated** by SWE-smith's ablation (its "PR Mirror" strategy *is* our mechanic and
|
| 26 |
+
produces the **best** training data of its five strategies, Table 5).
|
| 27 |
+
- The execution-oracle reward (`_grade()` masked pass-fraction), the 4-gate validator,
|
| 28 |
+
`scrub_tree` as primary anti-hack control, fractional-credit curriculum (SELECT half).
|
| 29 |
+
- The k1-in-reward KL fix (Composer-2 §4.1 verbatim confirms k1 = −log r in reward,
|
| 30 |
+
citing the same variance argument) and the behavior-rewards bank (§4.2 verbatim).
|
| 31 |
+
- The DiLoCo-over-S3 substrate (math verified against torchft; live-S3 validated).
|
| 32 |
+
- The honest-provenance discipline itself: Channel 3 + tree are OUR additions, not
|
| 33 |
+
Cursor's — re-confirmed.
|
| 34 |
+
|
| 35 |
+
### The four structural breaks in the envisioned pipeline (all CONFIRMED)
|
| 36 |
+
1. **Seed-trace/oracle disjointness.** The tree was drawn growing off Claude Code traces,
|
| 37 |
+
but those traces have no executable environment (no `broken_image`, no
|
| 38 |
+
`fail_to_pass`) — `FeatureDeletionEnv` literally cannot `reset()` on them. The tree
|
| 39 |
+
must seed from **env-grounded rollouts**, which don't exist yet because…
|
| 40 |
+
2. **No rollout harness.** Nothing in the repo runs an agent loop against
|
| 41 |
+
`FeatureDeletionEnv` to completion. The SFT corpus has **no producer**. This is the
|
| 42 |
+
single highest-priority build item.
|
| 43 |
+
3. **Divergence gate uncomputable.** `_normalize_action` is a whitespace-collapse stub;
|
| 44 |
+
teachers return free text; there is no tool-call action algebra to compare. Ungated,
|
| 45 |
+
the tree is O(N^D).
|
| 46 |
+
4. **No `Sandbox.fork()`.** Branching from a mid-trajectory state requires state
|
| 47 |
+
cloning the Sandbox protocol doesn't have.
|
| 48 |
+
|
| 49 |
+
### The five worst fidelity defects (all CONFIRMED, now to be corrected)
|
| 50 |
+
1. "SDPO is mathematically the same as Composer's mechanism" — **wrong**; Cursor cites
|
| 51 |
+
SDPO/OPSD only as *background*; SDPO's published loss is full-rollout with
|
| 52 |
+
feedback-in-prefix + EMA-regularized teacher; ours is a turn-localized hint-splice
|
| 53 |
+
with a live teacher. It is a *third design* (blog-inspired), and must be labeled so.
|
| 54 |
+
2. Fabricated numbers circulating as Cursor-stated: 69.3% CursorBench / Terminal-Bench
|
| 55 |
+
parity (in no primary source), "24 other generators" (back-formed from "25x"),
|
| 56 |
+
"85% post-training compute" (community speculation).
|
| 57 |
+
3. CWM misread: it trains-on-all in a separate **mid-training stage**, not as an aux
|
| 58 |
+
head riding RL gradients; its 65.8% requires test-time scaling; Chain-of-World is a
|
| 59 |
+
**robotics** paper. The world-model head has **zero** direct published evidence for
|
| 60 |
+
the exact proposed configuration — it stays an ablation arm.
|
| 61 |
+
4. Streaming DiLoCo citation doubly wrong (2501.18512 = Douillard et al. "Streaming
|
| 62 |
+
DiLoCo"; "Eager Updates" = 2502.12996 Kale et al.).
|
| 63 |
+
5. Cost figures mislabeled: "$0.98/trace" is a 50-state synthetic benchmark (real
|
| 64 |
+
sessions ≈ $70–80 flat); "$64 ungated tree" is a flat 8×1000 extrapolation, not a
|
| 65 |
+
tree.
|
| 66 |
+
|
| 67 |
+
### Missing pieces no design mentioned (all now in the architecture)
|
| 68 |
+
**Benchmark decontamination** (zero mentions anywhere — training substrates overlap
|
| 69 |
+
SWE-bench eval repos), **secrets/PII scrub at trace ingest**, **SPDX license detection
|
| 70 |
+
at repo ingest**, **canonical trajectory IR**, **cross-generation dedup**,
|
| 71 |
+
**`golden_diff` serialization leak** (`repr=False` does not survive `asdict()`),
|
| 72 |
+
**corpus acceptance probe**, **two unreconciled S3 contracts**.
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
## Part B — The architecture: "point at a repo → enhanced dataset"
|
| 77 |
+
|
| 78 |
+
**The committed answer to the user's question: yes — point at an open-source repo and
|
| 79 |
+
build the dataset. The engine is SWE-smith (buy), the trajectories come from a rollout
|
| 80 |
+
harness (build), and our vision enhancements (oracle-graded multi-candidate divergence,
|
| 81 |
+
typed signal routing) layer on top as Stage 1+ — strictly after Stage 0 produces one
|
| 82 |
+
real corpus end-to-end locally.**
|
| 83 |
+
|
| 84 |
+
### Why SWE-smith is the engine (the buy-vs-build verdict)
|
| 85 |
+
- `pip install swesmith` (MIT) ships: env construction from arbitrary GitHub repos
|
| 86 |
+
(one Docker image **per repo**, 500× more storage-efficient than per-task), five
|
| 87 |
+
bug-synthesis strategies (LM Modify 56% yield / LM Rewrite 35% / Procedural-AST 40%
|
| 88 |
+
at $0 / Combine 96.9% at $0 / PR Mirror 33.8%), issue-text generation, and
|
| 89 |
+
validation-by-test-execution. 50k tasks for **$1,360 + 20 human-hours** total.
|
| 90 |
+
- Its **PR Mirror ≡ our gold-patch reversion** — and its ablation shows PR Mirror
|
| 91 |
+
trajectories train the best models. The repo's core mechanic is *independently
|
| 92 |
+
validated*; what we were about to hand-build is exactly what the toolkit ships.
|
| 93 |
+
- Its **Combine-Bugs** (96.9% yield, $0) is the cheapest implementation of the blog's
|
| 94 |
+
"create harder tasks" CREATE-half — multi-bug escalation for free.
|
| 95 |
+
- R2E-Gym's SweGen covers the "commits without tests" case (LLM-synthesized F2P tests,
|
| 96 |
+
27.8% vs 28.0% — indistinguishable from real) — the fallback when a pointed-at repo
|
| 97 |
+
has thin test coverage. Adopt as data (R2E-Gym-Subset), not code, for now.
|
| 98 |
+
- Composer-2 reward nuance (from the techreport): reward = correctness + succinctness +
|
| 99 |
+
SE-principles. Our pass-fraction is the correctness core; `behavior_rewards.py`
|
| 100 |
+
(Wave 20) already carries the succinctness/style components.
|
| 101 |
+
|
| 102 |
+
### The Stage-0 pipeline (local, no new AWS services)
|
| 103 |
+
|
| 104 |
+
```
|
| 105 |
+
point at repo URL (or HF substrate, or trace dir)
|
| 106 |
+
│
|
| 107 |
+
┌──────────────────────────▼───────────────────────────┐
|
| 108 |
+
│ 1. INGEST GATE (datagen/repo_gate.py) │
|
| 109 |
+
│ SPDX license detect → trainable/redistributable │
|
| 110 |
+
│ tier; BENCHMARK DECONTAMINATION (repo ∉ SWE-bench/ │
|
| 111 |
+
│ Verified/Lite/Multimodal eval repo lists) │
|
| 112 |
+
└──────────────────────────┬───────────────────────────┘
|
| 113 |
+
▼
|
| 114 |
+
┌──────────────────────────────────────────────────────┐
|
| 115 |
+
│ 2. TASK SYNTHESIS (buy: swesmith) │
|
| 116 |
+
│ swesmith profile → env image → PR-Mirror first, │
|
| 117 |
+
│ Combine-Bugs for escalation (CREATE-half), 13 │
|
| 118 |
+
│ procedural AST transforms for volume │
|
| 119 |
+
│ → SwesmithAdapter.to_task() → FeatureDeletionTask │
|
| 120 |
+
└──────────────────────────┬───────────────────────────┘
|
| 121 |
+
▼
|
| 122 |
+
┌──────────────────────────────────────────────────────┐
|
| 123 |
+
│ 3. VALIDATE (have): 4-gate validate_task() in │
|
| 124 |
+
│ DockerSandbox + scrub_tree │
|
| 125 |
+
└──────────────────────────┬───────────────────────────┘
|
| 126 |
+
▼
|
| 127 |
+
┌──────────────────────────────────────────────────────┐
|
| 128 |
+
│ 4. ROLLOUT HARNESS (build — the critical missing │
|
| 129 |
+
│ component): agent loop over FeatureDeletionEnv │
|
| 130 |
+
│ (prompt → act → env.step → … → submit → _grade()), │
|
| 131 |
+
│ pluggable policy (frontier API / local model), │
|
| 132 |
+
│ $cap per task. Output: CanonicalTrajectory. │
|
| 133 |
+
└──────────────────────────┬───────────────────────────┘
|
| 134 |
+
▼
|
| 135 |
+
┌──────────────────────────────────────────────────────┐
|
| 136 |
+
│ 5. ADMIT + TYPE (have + build): _grade()==1.0 + │
|
| 137 |
+
│ HackMonitor-clean + PASS_TO_PASS → sft/; │
|
| 138 |
+
│ near-misses → dpo candidates; failures → withheld │
|
| 139 |
+
│ (wm_tuples only when the P4 ablation is scheduled) │
|
| 140 |
+
└──────────────────────────┬───────────────────────────┘
|
| 141 |
+
▼
|
| 142 |
+
┌───────────────────────────────���──────────────────────┐
|
| 143 |
+
│ 6. CORPUS (build): CanonicalTrajectory → messages- │
|
| 144 |
+
│ schema SFT rows via to_policy_row() (golden_diff │
|
| 145 |
+
│ PROVABLY absent — unit-tested); MinHash dedup │
|
| 146 |
+
│ (cross-generation aware); HeldoutSplit; secrets │
|
| 147 |
+
│ scrub; ONE reconciled S3/local layout + manifest │
|
| 148 |
+
│ + dataset card (pipeline/s3_contract.py) │
|
| 149 |
+
└──────────────────────────┬───────────────────────────┘
|
| 150 |
+
▼
|
| 151 |
+
7. ACCEPTANCE PROBE: small-model SFT (LoRA) on each corpus
|
| 152 |
+
generation; require measurable holdout delta before
|
| 153 |
+
promote-to-accepted. (the dataset analogue of HeldOutGuard)
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
**Free byproduct:** step 4's trajectories are *env-grounded* — exactly the seed nodes
|
| 157 |
+
the tree-of-work needs, fixing structural break #1 without extra work. Claude Code
|
| 158 |
+
traces are demoted to flat Channel-3 / SFT-style uses (their honest capability).
|
| 159 |
+
|
| 160 |
+
### What the vision enhancements add, strictly in order
|
| 161 |
+
- **Stage 1 — DPO channel + tool-call algebra:** parse tool calls into a canonical
|
| 162 |
+
action form (`ToolCall(name, normalized_args)`), fix `_normalize_action`, extract
|
| 163 |
+
DPO pairs from env-grounded multi-candidate rollouts (N samples per state, each
|
| 164 |
+
env-stepped + graded — depth-1, no fork needed). Bedrock batch / AWS Batch only when
|
| 165 |
+
volume justifies.
|
| 166 |
+
- **Stage 2 — depth-1 tree → divergence gate measurement:** N candidates per decision
|
| 167 |
+
point, one env-step each, oracle-graded. Measure the divergence gate's firing rate on
|
| 168 |
+
real traces BEFORE building depth>1. `Sandbox.fork()` spike gates depth>1.
|
| 169 |
+
- **Stage 3 — curriculum CREATE-half (Combine-Bugs escalation), flywheel with
|
| 170 |
+
cross-generation dedup, wm_tuples emission only when P4 ablation scheduled.**
|
| 171 |
+
- **Stage 4 — Step Functions/Argo orchestration once runs are routine.**
|
| 172 |
+
|
| 173 |
+
### Build manifest (Stage 0, ~900 LOC + 1 adopted dep)
|
| 174 |
+
|
| 175 |
+
| # | Module | What | ~LOC |
|
| 176 |
+
|---|---|---|---|
|
| 177 |
+
| 1 | `datagen/repo_gate.py` | SPDX license detection (LICENSE/classifier heuristics) + trainable-vs-redistributable tiers + benchmark-decontamination list (SWE-bench{,-Lite,-Verified,-Multimodal}+SWE-Gym eval repos) + gate verdict dataclass | ~150 |
|
| 178 |
+
| 2 | `datagen/swesmith_adapter.py` | swesmith task instance → `FeatureDeletionTask` (mirror of SweBenchAdapter; handles swesmith's image naming, F2P/P2P, strategy provenance field) + optional thin synthesis driver behind `[swesmith]` extra | ~120 |
|
| 179 |
+
| 3 | `datagen/trajectory.py` | `CanonicalTrajectory` IR: steps of (obs, ToolCall-or-text action, result, error flag), provenance, grade; adapters: ClaudeCode TraceState→IR; IR→SFT messages; IR→`to_policy_row()` (golden_diff/deleted_symbols PROVABLY dropped) | ~180 |
|
| 180 |
+
| 4 | `datagen/rollout_harness.py` | `RolloutPolicy` protocol (pluggable: API model / local / scripted-fake); `collect_trajectory(env, task, policy, max_turns, budget)` loop; admission filter (`_grade()==1.0` + monitor-clean + guard) | ~220 |
|
| 181 |
+
| 5 | `pipeline/s3_contract.py` | THE single reconciled layout (file:// + s3:// via fsspec): `runs/<run_id>/{tasks,traj,corpus_sft,corpus_dpo,holdout,quarantine}/` + run manifest (counts, cost, lineage, schema_version, parent_run_id) + dataset card writer | ~200 |
|
| 182 |
+
| 6 | `pipeline/dedup.py` | MinHash near-dup detection over SFT rows; cross-generation aware (accepts prior-run signature file) | ~120 |
|
| 183 |
+
| 7 | `pipeline/build_corpus.py` | the local stage-driver wiring 1→6: `build_corpus(source, out, policy, budget)`; source = swesmith repo / SWE-* substrate rows / trace dir | ~180 |
|
| 184 |
+
| 8 | Fidelity-fix batch | research/01 + COMPOSER_RECIPE_MAPPING strikes/tags; opsd.py β + "same loss" claims; diloco citation; kl_in_reward wording; teacher_replay cost docstring; ADR-016 records all of Part A | docs |
|
| 185 |
+
|
| 186 |
+
Everything testable CPU-only with fakes (swesmith synthesis itself needs Docker+Linux —
|
| 187 |
+
the adapter and driver are tested on fixture instances, the live path is
|
| 188 |
+
`skipif`-gated like the existing Docker e2e).
|
| 189 |
+
|
| 190 |
+
### Pre-registered falsifiers (kept from the original vision, now evidence-bounded)
|
| 191 |
+
- Heterogeneous-N-models vs single-model-N-samples at equal compute (Symphony vs
|
| 192 |
+
data-processing-inequality — genuinely two-sided, SWE-specific answer unrun).
|
| 193 |
+
- World-model aux head: build only if the P0–P6 ladder's P4/P6 beat P0–P3 on
|
| 194 |
+
foresight+calibration (CWM supports mid-training train-on-all, NOT an RL-time aux
|
| 195 |
+
head — the proposed configuration is in a null-evidence zone).
|
| 196 |
+
- Tree depth>1: build only if the measured divergence-gate firing rate makes
|
| 197 |
+
O(N·decision-points) real, and only after `Sandbox.fork()` exists.
|