# Grounding Map: Dataset-Generation Pipeline — What Exists vs What Is Claimed vs What Is Envisioned
**Agent:** REPO-GROUNDING
**Date:** 2026-06-09
**Scope:** composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py,
hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5,
research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md,
docs/BACKLOG_RESOLUTION_2026-06-09.md
---
## (1) Exact Current Dataset-Generation Capability
### FeatureDeletionTask schema (`datagen/schema.py`)
The schema has twelve fields. What produces each today:
| Field | Type | Producer today | Notes |
|---|---|---|---|
| `task_id` | `str` | `SweBenchAdapter.to_task()` — copied from `instance["instance_id"]` or `instance["task_id"]` | `"unknown"` if missing |
| `repo` | `str` | `instance["repo"]` via `SweBenchAdapter.to_task()` | e.g. `"getmoto/moto"` |
| `base_commit` | `str` | `instance["base_commit"]` | no code to `git checkout` this commit exists today |
| `broken_image` | `str` | `SweBenchAdapter.image_for(instance)` — either `instance["docker_image"]` (SWE-rebench) or the conventional `swebench/sweb.eval.x86_64.{iid}:latest` | This tag is a **pre-built SWE-bench eval image**; no code in the repo pulls or builds these images |
| `fail_to_pass` | `tuple[str,...]` | `_as_tuple(instance["FAIL_TO_PASS"])` — handles JSON-encoded string OR list | validated non-empty in `__post_init__` |
| `pass_to_pass` | `tuple[str,...]` | `_as_tuple(instance["PASS_TO_PASS"])` | may be empty |
| `test_command` | `str` | `SweBenchAdapter.default_test_command` = `"python -m pytest -q"` | hardcoded; not read from instance |
| `deleted_symbols` | `tuple[str,...]` | **never populated by SweBenchAdapter** — hardcoded `()` in every substrate inversion | the monitor can't do symbol-provenance checks without this |
| `golden_diff` | `str` | `instance["patch"]` | held out of repr; used only by validator |
| `granularity` | `str` | hardcoded `"feature"` in `SweBenchAdapter.to_task()` | CREATE-half escalation (function→file→feature) not wired to anything |
| `difficulty_prior` | `float` | `instance["difficulty"]` if present (SWE-rebench) else `0.5` | |
| `upstream_license` | `str` | `instance["license_name"]` | copyleft filter in `is_redistributable()` strips GPL/AGPL/LGPL |
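
The table can be read as a frozen dataclass. The sketch below is a minimal stand-in, not the repo's actual class: the name `FeatureDeletionTaskSketch`, the field defaults, and the exact error message are assumptions made for illustration. It shows the two behaviours the table calls out: `golden_diff` is excluded from `repr`, and an empty `fail_to_pass` is rejected in `__post_init__`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureDeletionTaskSketch:
    """Illustrative stand-in for the schema in the table above (hypothetical name)."""

    task_id: str
    repo: str
    base_commit: str
    broken_image: str
    fail_to_pass: tuple[str, ...]
    pass_to_pass: tuple[str, ...] = ()
    test_command: str = "python -m pytest -q"
    deleted_symbols: tuple[str, ...] = ()
    # repr=False keeps the gold patch out of anything built from repr(), such as logs.
    golden_diff: str = field(default="", repr=False)
    granularity: str = "feature"
    difficulty_prior: float = 0.5
    upstream_license: str = ""

    def __post_init__(self) -> None:
        # A task with no failing tests gives the agent nothing to fix.
        if not self.fail_to_pass:
            raise ValueError("fail_to_pass must be non-empty")


task = FeatureDeletionTaskSketch(
    task_id="demo-1",
    repo="getmoto/moto",
    base_commit="abc123",
    broken_image="swebench/sweb.eval.x86_64.demo-1:latest",
    fail_to_pass=("tests/test_x.py::test_feature",),
    golden_diff="--- a/x.py\n+++ b/x.py\n",
)
```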
### What SweBenchAdapter actually does and does NOT do
`SweBenchAdapter.to_task(instance: dict)` is a **pure schema inversion** — it takes one SWE-bench-shaped dict and maps it to a `FeatureDeletionTask`. It does NOT:
- Pull or build a Docker image
- Apply the gold patch in reverse (`git apply -R`)
- Run any tests
- Discover test node IDs
- Populate `deleted_symbols` (always empty)
- Escalate `granularity` beyond the static `"feature"`
The broken-repo Docker image is **assumed to exist pre-built** (the SWE-bench project publishes these images; SWE-rebench carries its own `docker_image` field). The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented `[~]` gate in ADR-010 — implemented in concept (the 4-gate validator interface exists, `scrub_tree` is built, `LocalSubprocessSandbox` and `DockerSandbox` are built) but there is no code in the repo that actually clones a repo, runs `git apply -R <gold_patch>`, builds a Docker image, and pushes it to a registry.
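
To make "pure schema inversion" concrete, the following sketch maps one SWE-bench-shaped dict to the task fields using only dictionary lookups. The function names (`as_tuple`, `image_for`, `to_task_fields`) and the handling of missing keys are assumptions; the real adapter may differ in detail. No Docker, git, or test execution happens anywhere in it.

```python
import json
from typing import Any


def as_tuple(value: Any) -> tuple[str, ...]:
    """Accept a JSON-encoded list string, a real list or tuple, or None."""
    if value is None:
        return ()
    if isinstance(value, str):
        value = json.loads(value) if value.strip() else []
    return tuple(str(v) for v in value)


def image_for(instance: dict[str, Any]) -> str:
    """Prefer an explicit docker_image, else fall back to the conventional tag."""
    explicit = instance.get("docker_image")
    if explicit:
        return explicit
    iid = instance.get("instance_id") or instance.get("task_id") or "unknown"
    return f"swebench/sweb.eval.x86_64.{iid}:latest"


def to_task_fields(instance: dict[str, Any]) -> dict[str, Any]:
    """Pure dict-to-dict mapping: nothing is pulled, built, or run."""
    return {
        "task_id": instance.get("instance_id") or instance.get("task_id") or "unknown",
        "repo": instance["repo"],
        "base_commit": instance["base_commit"],
        "broken_image": image_for(instance),
        "fail_to_pass": as_tuple(instance.get("FAIL_TO_PASS")),
        "pass_to_pass": as_tuple(instance.get("PASS_TO_PASS")),
        "test_command": "python -m pytest -q",  # hardcoded, not read from the instance
        "deleted_symbols": (),  # never populated today
        "golden_diff": instance.get("patch", ""),
        "granularity": "feature",
        "difficulty_prior": float(instance.get("difficulty", 0.5)),
        "upstream_license": instance.get("license_name", ""),
    }


fields = to_task_fields({
    "instance_id": "getmoto__moto-1",
    "repo": "getmoto/moto",
    "base_commit": "abc123",
    "FAIL_TO_PASS": '["tests/test_a.py::test_x"]',  # JSON-encoded string form
    "PASS_TO_PASS": ["tests/test_a.py::test_y"],    # plain list form
    "patch": "diff --git a/x.py b/x.py\n",
})
```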
### What FeatureDeletionEnv does during training (`datagen/env.py`)
- `reset(task)` — boots the sandbox (by image tag), returns a text prompt listing failing tests. The prompt exposes `task.repo`, `task.fail_to_pass`, `task.test_command` but NEVER `golden_diff` or `deleted_symbols`.
- `step(action)` — delegates to `sandbox.exec(action)`, returning observation text; grades on `submit` or turn limit.
- `_grade()` — runs `sandbox.run_tests(test_command, fail_to_pass + pass_to_pass)`, computes pass-fraction over `fail_to_pass`, gates to 0 if `pass_to_pass` guard is broken OR `HackMonitor.flag()` fires.
- `reward_fn(prompts, completions, *, task_id, **kwargs)` — TRL `RewardFunc` face; dispatches through `reset`/`step`; feeds fractional credit (not binary) to `DifficultyCurriculum.update`.
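
The grading rule described above reduces to a small pure function. This is a sketch under stated assumptions: `results` is a hypothetical mapping from test node ID to pass/fail, and the function name `grade` is illustrative rather than the repo's `_grade()` signature.

```python
from typing import Mapping, Sequence


def grade(
    results: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
    hack_flagged: bool,
) -> float:
    """Fractional credit over fail_to_pass, gated to 0 by either guard."""
    if hack_flagged:
        return 0.0
    # Any regression in the pass_to_pass guard zeroes the reward.
    if any(not results.get(test, False) for test in pass_to_pass):
        return 0.0
    if not fail_to_pass:
        return 0.0
    passed = sum(1 for test in fail_to_pass if results.get(test, False))
    return passed / len(fail_to_pass)


results = {"t::a": True, "t::b": False, "t::keep": True}
partial = grade(results, ["t::a", "t::b"], ["t::keep"], hack_flagged=False)
```

Returning a fraction rather than a binary pass is what lets the curriculum see partial progress on a task.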
### Safeguards implemented
- `scrub_tree(workdir)` — physically removes `__pycache__`, `.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, `*.pyc/.pyo/.class` before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
- `SANDBOX_DENYLIST` — blocks `find`, `strings`, `unzip`, `jar`, `javap`, decompilers, `git`. First-token-only check; bypassable via `sh -c "..."`. Documented as defense-in-depth, NOT the wall.
- `HackMonitor.flag()` — layer 1: substring scan for cache/decompiler signatures in trajectory actions (not in `submit_patch`). Layer 2: patch-provenance — if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat `"__py"+"cache__"` obfuscation), flags the trajectory.
- `DockerSandbox` — `network_mode='none'`, `read_only=True`, `cap_drop=['ALL']`, `no-new-privileges`, `pids_limit=256`, `mem_limit=1g`, optional gVisor `runtime='runsc'`.
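
The first-token limitation of the denylist is easy to demonstrate. The sketch below is an assumption about how such a check might look, not the repo's implementation: the set contents follow the list above (decompilers omitted), and treating an unparseable command as blocked is a choice made here for illustration.

```python
import shlex

# Subset of the documented denylist; decompiler names are omitted in this sketch.
SANDBOX_DENYLIST = frozenset({"find", "strings", "unzip", "jar", "javap", "git"})


def is_blocked(command: str) -> bool:
    """Block a command only when its first shell token is on the denylist."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: refuse rather than guess (assumption).
        return True
    return bool(tokens) and tokens[0] in SANDBOX_DENYLIST


direct = is_blocked("git log --all")
wrapped = is_blocked('sh -c "git log --all"')
```

Here `direct` is blocked while `wrapped` is not, because the first token of the wrapped form is `sh`. This is why `scrub_tree` has to be the primary control: once the artifacts are physically gone, a bypassed denylist has nothing to reach.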
### What ingestion/claude_code.py can ingest today
`ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]`:
- Input: Claude Code session JSONL at `~/.claude/projects/<encoded>/<sessionId>.jsonl`
- Output: one `TraceState` per assistant TURN (`state_id`, `messages`, `student_action`)
- Skips: subagent files (`agent-` prefix), sidechain records (`isSidechain: True`), `summary` / `attachment` / `queue-operation` / `file-history-snapshot` records
- `student_action`: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if `strip_thinking=True`)
- `tool_error` flag: structurally set on `user` messages where any `tool_result` block has `is_error: true` — this is the SDPO error-site detection signal
- `state_id`: `f"{path.stem}::{state_idx:04d}"`
- Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL
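
The skip rules and the `state_id` format can be sketched as a generator over one session file. This is illustrative only: the function name `iter_assistant_states`, the zero-based index, and the shape of the yielded dict are assumptions, and the real ingester also builds `messages` and `student_action`.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator

SKIPPED_TYPES = {"summary", "attachment", "queue-operation", "file-history-snapshot"}


def iter_assistant_states(path: Path) -> Iterator[dict]:
    """Yield one entry per main-thread assistant record, with a stable state_id."""
    if path.name.startswith("agent-"):
        return  # subagent session file, skipped by naming convention
    state_idx = 0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("isSidechain") or record.get("type") in SKIPPED_TYPES:
                continue
            if record.get("type") == "assistant":
                yield {"state_id": f"{path.stem}::{state_idx:04d}", "record": record}
                state_idx += 1


records = [
    {"type": "summary"},
    {"type": "user"},
    {"type": "assistant"},
    {"type": "assistant", "isSidechain": True},
    {"type": "assistant"},
]
with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / "sess1.jsonl"
    session.write_text("\n".join(json.dumps(r) for r in records), encoding="utf-8")
    state_ids = [s["state_id"] for s in iter_assistant_states(session)]
    agent_file = Path(tmp) / "agent-xyz.jsonl"
    agent_file.write_text(json.dumps({"type": "assistant"}), encoding="utf-8")
    agent_ids = list(iter_assistant_states(agent_file))
```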
---
## (2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)
From `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §5/§8/§10.
1. **Seed trace ingestion (Stage a):** `ClaudeCodeIngester.ingest()` over `s3://composer-datagen-386931836011-us-west-2/raw/claude_code/**/*.jsonl` → Parquet at `traces/v1/run_id=<id>/part-*.parquet` via AWS Glue 5.0 Spark ETL job (`glue_ingest_job.py`, ~80 LOC, NOT YET BUILT).
2. **Schema inversion (Stage c1):** `SweBenchAdapter.to_task()` per SWE-bench row → `FeatureDeletionTask` JSONL at `tasks/v1/run_id=<id>/manifest.jsonl` (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (`is_redistributable()`) applied here.
3. **N-teacher replay (Stage b):** `teacher_replay.replay_trace()` generalized from flat OpenRouter to `BedrockBatchTeacherPool` — write one shared `replay/v1/run_id=<id>/input/states.jsonl`, submit one `CreateModelInvocationJob` per teacher, write `.jsonl.out` per teacher to `replay/v1/.../teacher=<slug>/`. An EMR Serverless aggregation step joins all N outputs by `state_id` → `list[TeacherCallResult]`. (`teacher_replay_bedrock.py`, ~180 LOC, NOT YET BUILT).
4. **Multi-model tree expansion (the core delta — NOT BUILT):** A `tree_controller.py` (~250–350 LOC, design-only) that, for each `TraceState` node, fires N models, applies each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branches again from the new state, grades leaves with `_grade()`. Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see step 8).
5. **Sandbox materialization + 4-gate validation (Stage c2):** AWS Batch array jobs on EC2 Spot, one child per task. Each child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task in the S3 manifest, boots `DockerSandbox`/`LocalSubprocessSandbox`, runs `validator.validate_task()` (4 gates), writes `task_grades/v1/run_id=<id>/<task_id>.json`. (`datagen/aws/batch_validate.py`, ~120 LOC, NOT YET BUILT).
6. **DPO pair extraction + normalization (Stage d):** `extract_dpo_pairs()` (already built in `teacher_replay.py`) on the fan-in of teacher outputs → `DPOPair` rows → `DJNormalizer` data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → `corpus/v1/run_id=<id>/dpo/part-*.parquet` and `corpus/v1/run_id=<id>/sft/part-*.parquet`. (`replaysim/emr_normalize_job.py`, ~100 LOC, NOT YET BUILT).
7. **Orchestration:** AWS Step Functions Standard Workflow: `Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda)`. (`infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`, ~250 LOC IaC, NOT YET BUILT).
8. **S3 typed dataset contract (full set):**
- `raw/claude_code/**/*.jsonl` — input seed traces
- `traces/v1/run_id=<id>/part-*.parquet` — TraceState rows (Stage a output)
- `tasks/v1/run_id=<id>/manifest.jsonl` — FeatureDeletionTask rows (Stage c1 output)
- `tasks/golden/run_id=<id>/` — golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
- `replay/v1/run_id=<id>/input/states.jsonl` — shared Bedrock batch input
- `replay/v1/run_id=<id>/teacher=<slug>/*.jsonl.out` — per-teacher Bedrock batch output
- `task_grades/v1/run_id=<id>/<task_id>.json` — validator + _grade() results
- `corpus/v1/run_id=<id>/sft/part-*.parquet` — clean winning trajectories (SFT-first floor)
- `corpus/v1/run_id=<id>/dpo/part-*.parquet` — DPO pairs (normalized DPOPair)
- `dpo_pairs/` — divergence-derived DPO pairs from the tree (sibling winners vs losers)
- `rl_task_pool/` — FeatureDeletionTask registry + DifficultyCurriculum priors
- `divergence_pairs/` — divergence-annotated nodes (where sibling next-action distributions forked)
- `wm_tuples/` — (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
- `holdout/` — disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
- `diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt` — DiLoCo outer-sync (already used by existing allreduce.py)
- `manifests/run_id=<id>.json` — run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)
9. **SFT-first stage:** Read `sft_corpus/` (clean `_grade()` gate-1 passing trajectories), run `compose_loss` with `alpha_sdpo=0, beta_replay=0` (reduces to `_lm_response_ce` — next-token CE masked to response tokens), write `ckpt_sft/`. (`pipeline/sft_floor.py`, ~60 LOC, NOT YET BUILT).
10. **Inner RL loop:** `ComposerReplicationTrainer` (trl.GRPOTrainer subclass) on `rl_task_pool/` with `FeatureDeletionEnv.reward_fn`; `total = grpo + α·sdpo + β·trace_replay_dpo`; DiLoCo outer-sync via S3; `HeldOutGuard` kill-switch now wired (Wave 3).
11. **Flywheel:** Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.
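
Steps 9 and 10 share one loss composition. The following pure-Python sketch shows how setting both coefficients to zero leaves only the response-masked cross-entropy. It is an assumption-laden illustration, not the repo's `compose_loss`: the function names are hypothetical, logits are plain nested lists, and row `i` is assumed to be already aligned to predict `targets[i]`.

```python
import math
from typing import Sequence


def response_masked_ce(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    response_mask: Sequence[int],
) -> float:
    """Mean next-token cross-entropy over positions where response_mask is 1."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, response_mask):
        if not keep:
            continue  # prompt tokens contribute nothing
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


def compose_total(
    base: float, sdpo: float, replay_dpo: float, alpha_sdpo: float, beta_replay: float
) -> float:
    """total = base + alpha * sdpo + beta * trace_replay_dpo."""
    return base + alpha_sdpo * sdpo + beta_replay * replay_dpo


# Uniform logits over a 4-token vocabulary; the first position is prompt.
ce = response_masked_ce([[0.0] * 4] * 3, [1, 2, 3], [0, 1, 1])
sft_floor = compose_total(ce, sdpo=5.0, replay_dpo=7.0, alpha_sdpo=0.0, beta_replay=0.0)
```

In the RL loop the `base` term would be the GRPO loss instead of the cross-entropy, with non-zero coefficients on the other two terms.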
---
## (3) Unbuilt Components the Vision Depends On
Every item below is design-only or a skeleton; none has real production code.
| Component | File Estimate | Source | Status |
|---|---|---|---|
| `datagen/tree_controller.py` — the core delta: env-step between branches, `_grade()` at leaves, divergence-gated expansion, six typed S3 prefix writes | ~250–350 LOC | design-F1, final_report §1/§5/§6 | **0% built** — no file exists |
| `SiblingBootstrapGenerator` in `hint_generator.py` — select max-reward sibling → emit "a working approach looks like: …" → feed `ctx_teacher` splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | **0% built** — not a class in hint_generator.py at all |
| `pipeline/s3_layout.py` — typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | **0% built** — no `pipeline/` directory exists |
| `pipeline/sft_floor.py` — SFT-first driver: read `sft_corpus/`, run TRL SFTTrainer or `compose_loss` `alpha=beta=0`, write `ckpt_sft/` | ~60 LOC | design-F1 §2 / design-F5 d | **0% built** |
| `teacher_replay_bedrock.py` — `BedrockBatchTeacherPool`: submit one Bedrock `CreateModelInvocationJob` per teacher, poll, parse `.jsonl.out` back into `list[TeacherCallResult]` | ~180 LOC | design-F2 §b | **0% built** |
| `datagen/aws/batch_validate.py` — AWS Batch array-child entrypoint: read `AWS_BATCH_JOB_ARRAY_INDEX` → manifest line → `DockerSandbox` + `validator` + `_grade()` → write `task_grades/` | ~120 LOC | design-F2 §c2 | **0% built** — `datagen/aws/` subdirectory does not exist |
| `datagen/aws/glue_ingest_job.py` — Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; write `traces/` Parquet | ~80 LOC | design-F2 §a | **0% built** |
| `replaysim/emr_normalize_job.py` — EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; write `corpus/dpo/` + `corpus/sft/` Parquet | ~100 LOC | design-F2 §d | **0% built** |
| `datagen/aws/s3_contract.py` — S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers, `recordId==state_id` join helpers, `schema_version`/`split` column injection | ~120 LOC | design-F2 §contract | **0% built** |
| `infra/datagen_stepfunctions.json` + `infra/datagen_stack.py` — Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | **0% built** — `infra/` directory does not exist |
| `trainer/composer_trainer.py` world-model head — parameter-isolated next-state adapter + `<deliberate>` token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | **0% built** — grep confirms no `world_model`/`WorldModel`/`next_state_head`/`<deliberate>` anywhere in `composer_replication/` |
| Broken-repo image builder — code to clone a repo at `base_commit`, apply `git apply -R <golden_diff>`, run `scrub_tree`, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | **0% built** — there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch |
| `EKSExecutor` (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | **skeleton built** — `eks.py` is a functional executor (IndexedJob dispatch) but the Argo outer-loop controller is 0% |
| `verl AsyncServer` backend for tool-heavy tree | — | final_report §8 | **0% built** — design note only |
| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | — | design-F5 §Tier 4 | **0% built** |
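
As a rough sense of scale for the contract and array-child rows above, the key layout and the "array index = manifest line number" lookup fit in a few lines. This is a sketch, assuming hypothetical helper names (`manifest_key`, `grade_key`, `task_for_array_child`); the S3 read is replaced by an in-memory list so that nothing here touches AWS.

```python
import json
import os
from typing import Mapping, Sequence


def manifest_key(run_id: str) -> str:
    return f"tasks/v1/run_id={run_id}/manifest.jsonl"


def grade_key(run_id: str, task_id: str) -> str:
    return f"task_grades/v1/run_id={run_id}/{task_id}.json"


def task_for_array_child(
    manifest_lines: Sequence[str], environ: Mapping[str, str] = os.environ
) -> dict:
    """Pick this child's task: AWS Batch sets AWS_BATCH_JOB_ARRAY_INDEX from 0."""
    index = int(environ["AWS_BATCH_JOB_ARRAY_INDEX"])
    return json.loads(manifest_lines[index])


lines = [json.dumps({"task_id": f"task-{i}"}) for i in range(3)]
child_task = task_for_array_child(lines, environ={"AWS_BATCH_JOB_ARRAY_INDEX": "2"})
output_key = grade_key("r42", child_task["task_id"])
```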
---
## (4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code
The `SweBenchAdapter` is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:
### Break 1: `broken_image` assumes a pre-built SWE-bench image exists
`SweBenchAdapter.image_for()` returns either `instance["docker_image"]` (SWE-rebench) or the convention `swebench/sweb.eval.x86_64.{iid}:latest`. For an arbitrary OSS repo there is no such image. A fresh repo would need:
- Clone at `base_commit`
- Install the project's Python/Java/etc. toolchain
- Apply `git apply -R <golden_diff>` to manufacture the broken state
- Run `scrub_tree()` to strip caches
- Build a Docker image that encapsulates this broken state
- Push the image to a registry accessible by `DockerSandbox.boot()`
None of this code exists. `DockerSandbox.boot(image)` raises `RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.")` if the image is absent.
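
The missing builder can be outlined as an ordered command plan. The sketch below only constructs argv lists and executes nothing. It rests on several assumptions: the function name is hypothetical, `patch_file` is an absolute path, the working tree contains a Dockerfile that installs the toolchain, and `base_commit` is a commit at which the golden diff is already applied (otherwise the reverse apply would fail).

```python
from pathlib import Path


def broken_image_plan(
    repo_url: str, base_commit: str, patch_file: Path, workdir: Path, image_tag: str
) -> list[list[str]]:
    """Return host commands, in order, for manufacturing a broken-repo image.

    A caller would pass each argv to subprocess.run and call the repo's
    scrub_tree(workdir) between the revert and the build.
    """
    return [
        ["git", "clone", repo_url, str(workdir)],
        ["git", "-C", str(workdir), "checkout", base_commit],
        # Reverse-apply the gold patch to delete the feature.
        ["git", "-C", str(workdir), "apply", "-R", str(patch_file)],
        # scrub_tree(workdir) belongs here: it removes .git, so it must follow the git steps.
        ["docker", "build", "-t", image_tag, str(workdir)],
        ["docker", "push", image_tag],
    ]


plan = broken_image_plan(
    "https://example.invalid/org/repo.git",  # placeholder URL
    "abc123",
    Path("/tmp/gold.diff"),
    Path("/tmp/work"),
    "registry.example.invalid/broken/repo:abc123",  # placeholder tag
)
```

The ordering matters more than the individual commands: the scrub has to run after the last git operation and before the image is frozen.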
### Break 2: `test_command` is hardcoded
`SweBenchAdapter.default_test_command = "python -m pytest -q"`. A fresh repo may use `make test`, `npm test`, `cargo test`, `mvn verify`, or any other test runner. There is no test-discovery logic anywhere in the repo.
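
One possible shape for test discovery is a marker-file heuristic. The sketch below is purely hypothetical (nothing like it exists in the repo): it guesses a command from the filenames at the repository root and falls back to the current hardcoded default. A `Makefile` is assumed to have a `test` target, which a real implementation would need to verify.

```python
# Ordered (marker file, command) pairs; the first match wins.
TEST_COMMAND_MARKERS = (
    ("Cargo.toml", "cargo test"),
    ("package.json", "npm test"),
    ("pom.xml", "mvn verify"),
    ("Makefile", "make test"),  # assumes a `test` target exists
)
DEFAULT_TEST_COMMAND = "python -m pytest -q"


def guess_test_command(root_filenames: set[str]) -> str:
    """Guess a test command from marker files found at the repository root."""
    for marker, command in TEST_COMMAND_MARKERS:
        if marker in root_filenames:
            return command
    return DEFAULT_TEST_COMMAND


rust_cmd = guess_test_command({"Cargo.toml", "README.md"})
python_cmd = guess_test_command({"pyproject.toml", "README.md"})
```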
### Break 3: `fail_to_pass` and `pass_to_pass` require pre-existing test labels
SWE-bench instances ship with `FAIL_TO_PASS` and `PASS_TO_PASS` as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived — e.g., via coverage analysis or AST-reachability. `FeatureDeletionTask.__post_init__` raises `ValueError` if `fail_to_pass` is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.
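
A simpler alternative to coverage or AST analysis is to derive the labels empirically from two test runs, one on the intact tree and one on the broken tree. The sketch below assumes each run is summarized as a hypothetical mapping from test node ID to pass/fail; the function name is illustrative. A test that disappears in the broken run (for example, because of a collection error) is counted as failing.

```python
from typing import Mapping


def derive_labels(
    intact: Mapping[str, bool], broken: Mapping[str, bool]
) -> tuple[tuple[str, ...], tuple[str, ...]]:
    """Split tests that pass on the intact tree by their outcome on the broken tree."""
    fail_to_pass = tuple(
        sorted(t for t, ok in intact.items() if ok and not broken.get(t, False))
    )
    pass_to_pass = tuple(
        sorted(t for t, ok in intact.items() if ok and broken.get(t, False))
    )
    return fail_to_pass, pass_to_pass


intact_run = {"t::feature": True, "t::other": True, "t::flaky": False, "t::gone": True}
broken_run = {"t::feature": False, "t::other": True, "t::flaky": False}
f2p, p2p = derive_labels(intact_run, broken_run)
```

Tests that already fail on the intact tree are dropped from both sets, since they cannot show whether the deletion broke anything.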
### Break 4: `deleted_symbols` is never populated
`SweBenchAdapter` hardcodes `deleted_symbols=()`. The `HackMonitor._patch_provenance_hack()` check (`monitor.py:157-182`) skips the symbol-reappearance test if `deleted_symbols` is empty — so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.
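
For Python sources, a first approximation of the required AST analysis is a set difference over top-level definitions. This sketch is hypothetical and deliberately narrow: it compares a single file before and after deletion, and it ignores methods, nested definitions, and module-level assignments.

```python
import ast


def top_level_symbols(source: str) -> set[str]:
    """Names of top-level functions and classes defined in a module."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }


def find_deleted_symbols(intact_source: str, broken_source: str) -> tuple[str, ...]:
    """Symbols present in the intact module but missing from the broken one."""
    return tuple(sorted(top_level_symbols(intact_source) - top_level_symbols(broken_source)))


intact_src = "def keep():\n    return 1\n\ndef parse_config(path):\n    return path\n\nclass Loader:\n    pass\n"
broken_src = "def keep():\n    return 1\n"
deleted = find_deleted_symbols(intact_src, broken_src)
```

Emitting bare names keeps the output compatible with a substring-style reappearance check on the submitted patch.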
### Break 5: No copyleft scrub for arbitrary repos
`is_redistributable()` reads `upstream_license` from `instance["license_name"]`. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.
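
Once a license string is available, the filter itself is small. The sketch below is an assumption about the filter's shape, not the repo's `is_redistributable()`: it matches on substrings of the license name, and it treats a missing license as not redistributable, which is a conservative choice made here rather than documented behaviour.

```python
# "GPL" also matches AGPL and LGPL names; all three are listed for readability.
COPYLEFT_MARKERS = ("AGPL", "LGPL", "GPL")


def is_redistributable_sketch(license_name: str) -> bool:
    """Reject copyleft licenses and (by assumption) unknown ones."""
    normalized = license_name.strip().upper()
    if not normalized:
        return False
    return not any(marker in normalized for marker in COPYLEFT_MARKERS)


checks = {
    name: is_redistributable_sketch(name)
    for name in ("MIT", "Apache-2.0", "GPL-3.0", "LGPL-2.1", "AGPL-3.0", "")
}
```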
### Break 6: No env setup for non-Python repos
`LocalSubprocessSandbox.run_tests` runs `subprocess.run(cmd, shell=True, ...)` against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. `DockerSandbox` depends on a pre-baked image that already has the environment. A fresh Python repo would need `pip install -e .` run inside the image, and a non-Python repo would need a completely different image and test runner.
---
## (5) What ingestion/claude_code.py Can Ingest Today
`ClaudeCodeIngester.ingest(path)` handles exactly one format: **Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionId>.jsonl`.
Supported record types handled:
- `type="user"` — string content or list of text/tool_result blocks → OpenAI-style user message; `tool_error` structural flag set if any `tool_result` block has `is_error: true`
- `type="assistant"` — list of text/thinking/tool_use blocks → one `TraceState` with `student_action` (full blocks including thinking) and `messages` (history, optionally with thinking stripped)
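
The structural `tool_error` check on a user message can be sketched as follows. The function name is hypothetical; the block shape (`type: "tool_result"` with an `is_error` boolean) follows the description above.

```python
from typing import Any


def has_tool_error(content: Any) -> bool:
    """True if any tool_result block in a user message is marked is_error."""
    if isinstance(content, str):
        return False  # plain-string user content carries no tool results
    return any(
        isinstance(block, dict)
        and block.get("type") == "tool_result"
        and block.get("is_error") is True
        for block in content
    )


failed = has_tool_error([
    {"type": "text", "text": "ran the tests"},
    {"type": "tool_result", "tool_use_id": "tu_1", "content": "exit 1", "is_error": True},
])
clean = has_tool_error([{"type": "tool_result", "tool_use_id": "tu_2", "content": "ok"}])
```

Because the flag comes from the record structure rather than from parsing error text, it does not depend on how any particular tool phrases its failures.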
Record types silently skipped:
- `type="summary"` — Claude Code conversation summary records
- `type="attachment"`, `"queue-operation"`, `"file-history-snapshot"`, `"last-prompt"`, `"system"` — auxiliary records
- `isSidechain: True` records — subagent traces (skipped in v0.1 per ADR-002)
- Files starting with `agent-` — subagent session files by naming convention
Structural features:
- `state_id = f"{path.stem}::{state_idx:04d}"` — stable within-session identifier
- `strip_thinking` flag (default True) — strips `[THINKING] ...` lines from the teacher-facing `messages` history but keeps them in `student_action`
- Injects synthetic system prompt at `messages[0]` (`"You are a senior software engineer..."`)
- Version check: warns on schema version outside `2.x.x`
NOT handled by this ingester:
- OpenHands trajectory format (planned for v0.2 per ADR-002)
- SWE-smith trajectories (planned for v0.2)
- Cline VS Code export
- Aider chat history
- SWE-bench leaderboard trajectory submissions
- Any binary or non-JSONL format
---
## Critical Cross-Checks: What the Repo Claims vs What Exists
### Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")
**What the blog says (COMPOSER_RECIPE_MAPPING.md):** "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests."
**What the repo does:** Inverts *existing* SWE-bench-shaped instances — reverts their gold patch. There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones. The ADR correctly scopes this as "Option A — invert OSS substrates" vs "Option B — greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was *explicitly rejected*.
### Claim 2: "25× synthetic data"
**What the blog says:** Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2).
**What the repo has:** A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the *training run*; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.
### Claim 3: "Dynamic difficulty curriculum — select for AND create harder tasks"
**What Composer 2.5 says:** "We both select for and create harder tasks dynamically throughout the run."
**What the repo has:** The SELECT-FOR half: `DifficultyCurriculum` with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. `granularity` is set statically to `"feature"` for all SweBenchAdapter tasks; no escalation logic exists.
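
The SELECT-FOR weighting can be illustrated in isolation. This sketch covers only the p̂(1−p̂) term: retire/quarantine thresholds and the effort tilt are omitted, the function names are hypothetical, and the uniform fallback when every weight is zero is an assumption.

```python
from typing import Mapping


def frontier_weight(p_hat: float) -> float:
    """p_hat * (1 - p_hat): zero for always-solved or never-solved tasks."""
    return p_hat * (1.0 - p_hat)


def sampling_probs(pass_rates: Mapping[str, float]) -> dict[str, float]:
    """Normalize frontier weights into a sampling distribution over tasks."""
    weights = {task: frontier_weight(p) for task, p in pass_rates.items()}
    total = sum(weights.values())
    if total == 0.0:
        # Nothing on the frontier: fall back to uniform (assumption).
        return {task: 1.0 / len(weights) for task in weights} if weights else {}
    return {task: w / total for task, w in weights.items()}


probs = sampling_probs({"easy": 1.0, "mid": 0.5, "hard": 0.0})
```

The weight peaks at a pass rate of 0.5, so sampling concentrates on tasks the policy solves about half the time. This selects among existing tasks only; it cannot produce a harder task than the pool already contains, which is the gap the CREATE half is meant to fill.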
### Claim 4: `deleted_symbols` enables AST-provenance monitoring
**What ADR-010 says:** "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads.
**Reality:** `deleted_symbols=()` on every `SweBenchAdapter`-derived task (line 81 in substrates.py: hardcoded empty tuple). `HackMonitor._patch_provenance_hack()` returns False immediately when `deleted_symbols` is empty (`reappeared = [s for s in deleted_symbols if s and s in patch]` → empty list). The provenance layer of the monitor is a dead code path for all currently-generable tasks.
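
The short-circuit is visible in a reduced version of the check. The sketch below keeps the quoted `reappeared` expression and replaces the trajectory scan with a boolean parameter; the function name and signature are assumptions.

```python
from typing import Sequence


def patch_provenance_hack(
    deleted_symbols: Sequence[str], patch: str, read_cache_artifact: bool
) -> bool:
    """Flag only if a deleted symbol reappears AND a cache artifact was read."""
    reappeared = [s for s in deleted_symbols if s and s in patch]
    return bool(reappeared) and read_cache_artifact


patch_text = "def parse_config(path):\n    return path\n"
with_symbols = patch_provenance_hack(("parse_config",), patch_text, read_cache_artifact=True)
without_symbols = patch_provenance_hack((), patch_text, read_cache_artifact=True)
```

With an empty `deleted_symbols`, the result is `False` even when the trajectory did read a cache artifact, so layer 2 can never fire on adapter-derived tasks.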
### Claim 5: The tree controller and world-model head are part of the system
**What design docs say:** "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table).
**Reality:** The tree controller is 0% built: no file, no function, no class. An exhaustive grep finds no `SiblingBootstrap`, `world_model`, `WorldModel`, `next_state_head`, `tree_controller`, `MCTS`, or `deliberate_token` anywhere in `composer_replication/`. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) exists only as design.
### Claim 6: The broken-repo image is manufactured by the pipeline
**What design-F2 says:** Step c2 involves "pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()`, run the test command, confirm FAIL_TO_PASS actually fails."
**Reality:** This describes what SHOULD happen in the Batch array child. No such code is written. `SweBenchAdapter.image_for()` returns a string tag; that tag must be pre-pulled on the host before `DockerSandbox.boot()` can use it (`RuntimeError` on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.
---
## Summary of Unbuilt vs Built
### BUILT and tested (production-ready CPU, Docker-gated where noted):
- `FeatureDeletionTask` schema + `FeatureDeletionEnv` (reset/step/_grade/reward_fn)
- `SweBenchAdapter` schema inversion (pure dict transform)
- `FakeSandbox`, `LocalSubprocessSandbox`, `DockerSandbox` (hardware-gated e2e green in Wave 1/2)
- `scrub_tree()` primary reward-hack control
- `HackMonitor` (signature + patch-provenance, obfuscation-resistant)
- `DifficultyCurriculum` (SELECT-FOR half + effort tilt)
- `validate_task()` 4-gate solvability validator
- `ClaudeCodeIngester` (Claude Code JSONL only)
- `behavior_rewards.py` — `c_length`, `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (Wave 20)
- `kl_in_reward.py` — k1-in-reward path opt-in (Wave 20)
- `HeldOutGuard` + `HeldoutSplit` + wired into trainer (Wave 2/3)
- `EKSExecutor` skeleton + `SageMakerExecutor` skeleton (Wave 2)
### DESIGN-ONLY (no code):
- Tree controller (`datagen/tree_controller.py`)
- `SiblingBootstrapGenerator` in `hint_generator.py`
- `pipeline/s3_layout.py`, `pipeline/sft_floor.py`
- `teacher_replay_bedrock.py` (BedrockBatchTeacherPool)
- `datagen/aws/batch_validate.py`, `datagen/aws/glue_ingest_job.py`, `datagen/aws/s3_contract.py`
- `replaysim/emr_normalize_job.py`
- `infra/datagen_stepfunctions.json`, `infra/datagen_stack.py`
- World-model next-state head in trainer
- Argo Workflows outer-loop controller
- Broken-repo image builder (clone → git apply -R → build → push)
- CREATE half of difficulty curriculum (mint harder tasks during run)
- SFT-first training stage
- Offline LLM-judge hack monitor