# Grounding Map: Dataset-Generation Pipeline — What Exists vs What Is Claimed vs What Is Envisioned

**Agent:** REPO-GROUNDING  
**Date:** 2026-06-09  
**Scope:** composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py,
hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5,
research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md,
docs/BACKLOG_RESOLUTION_2026-06-09.md

---

## (1) Exact Current Dataset-Generation Capability

### FeatureDeletionTask schema (`datagen/schema.py`)

The schema's twelve fields and what produces each today:

| Field | Type | Producer today | Notes |
|---|---|---|---|
| `task_id` | `str` | `SweBenchAdapter.to_task()` — copied from `instance["instance_id"]` or `instance["task_id"]` | `"unknown"` if missing |
| `repo` | `str` | `instance["repo"]` via `SweBenchAdapter.to_task()` | e.g. `"getmoto/moto"` |
| `base_commit` | `str` | `instance["base_commit"]` | no code to `git checkout` this commit exists today |
| `broken_image` | `str` | `SweBenchAdapter.image_for(instance)` — either `instance["docker_image"]` (SWE-rebench) or the conventional `swebench/sweb.eval.x86_64.{iid}:latest` | This tag is a **pre-built SWE-bench eval image**; no code in the repo pulls or builds these images |
| `fail_to_pass` | `tuple[str,...]` | `_as_tuple(instance["FAIL_TO_PASS"])` — handles JSON-encoded string OR list | validated non-empty in `__post_init__` |
| `pass_to_pass` | `tuple[str,...]` | `_as_tuple(instance["PASS_TO_PASS"])` | may be empty |
| `test_command` | `str` | `SweBenchAdapter.default_test_command` = `"python -m pytest -q"` | hardcoded; not read from instance |
| `deleted_symbols` | `tuple[str,...]` | **never populated by SweBenchAdapter** — hardcoded `()` in every substrate inversion | the monitor can't do symbol-provenance checks without this |
| `golden_diff` | `str` | `instance["patch"]` | held out of repr; used only by validator |
| `granularity` | `str` | hardcoded `"feature"` in `SweBenchAdapter.to_task()` | CREATE-half escalation (function→file→feature) not wired to anything |
| `difficulty_prior` | `float` | `instance["difficulty"]` if present (SWE-rebench) else `0.5` | |
| `upstream_license` | `str` | `instance["license_name"]` | copyleft filter in `is_redistributable()` strips GPL/AGPL/LGPL |
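
The `fail_to_pass` and `pass_to_pass` rows say the adapter accepts either a JSON-encoded string or a list. A minimal sketch of that normalization follows. The name `as_tuple` and the fallback for a bare, non-JSON string are assumptions for illustration, not the repo's actual `_as_tuple`.

```python
from __future__ import annotations

import json
from typing import Iterable, Optional, Union


def as_tuple(value: Optional[Union[str, Iterable[str]]]) -> tuple[str, ...]:
    """Normalize a SWE-bench test-list field to a tuple of test node IDs."""
    if value is None:
        return ()
    if isinstance(value, str):
        text = value.strip()
        if not text:
            return ()
        try:
            # Some datasets ship the list as a JSON-encoded string.
            parsed = json.loads(text)
        except json.JSONDecodeError:
            # Assumption: a bare non-JSON string is a single node ID.
            return (text,)
        if isinstance(parsed, str):
            return (parsed,)
        return tuple(str(item) for item in parsed)
    # Already a list/tuple of node IDs.
    return tuple(str(item) for item in value)
```

Returning a tuple rather than a list keeps the field hashable and immutable, which fits the `tuple[str,...]` type in the table.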

### What SweBenchAdapter actually does and does NOT do

`SweBenchAdapter.to_task(instance: dict)` is a **pure schema inversion** — it takes one SWE-bench-shaped dict and maps it to a `FeatureDeletionTask`. It does NOT:
- Pull or build a Docker image
- Apply the gold patch in reverse (`git apply -R`)
- Run any tests
- Discover test node IDs
- Populate `deleted_symbols` (always empty)
- Escalate `granularity` beyond the static `"feature"`

The broken-repo Docker image is **assumed to exist pre-built** (the SWE-bench project publishes these images; SWE-rebench carries its own `docker_image` field). The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented `[~]` gate in ADR-010 — implemented in concept (the 4-gate validator interface exists, `scrub_tree` is built, `LocalSubprocessSandbox` and `DockerSandbox` are built) but there is no code in the repo that actually clones a repo, runs `git apply -R <gold_patch>`, builds a Docker image, and pushes it to a registry.

### What FeatureDeletionEnv does during training (`datagen/env.py`)

- `reset(task)` — boots the sandbox (by image tag), returns a text prompt listing failing tests. The prompt exposes `task.repo`, `task.fail_to_pass`, `task.test_command` but NEVER `golden_diff` or `deleted_symbols`.
- `step(action)` — delegates to `sandbox.exec(action)`, returning observation text; grades on `submit` or turn limit.
- `_grade()` — runs `sandbox.run_tests(test_command, fail_to_pass + pass_to_pass)`, computes pass-fraction over `fail_to_pass`, gates to 0 if `pass_to_pass` guard is broken OR `HackMonitor.flag()` fires.
- `reward_fn(prompts, completions, *, task_id, **kwargs)` — TRL `RewardFunc` face; dispatches through `reset`/`step`; feeds fractional credit (not binary) to `DifficultyCurriculum.update`.
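
The grading rule described for `_grade()` can be sketched as a pure function. This is an illustration under stated assumptions: the function name `grade`, the `results` mapping from test ID to pass/fail, and the `hack_flagged` boolean are hypothetical stand-ins for the sandbox output and `HackMonitor.flag()`.

```python
from __future__ import annotations

from typing import Mapping, Sequence


def grade(
    results: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
    hack_flagged: bool,
) -> float:
    """Fractional reward over fail_to_pass, gated to 0 by the two guards."""
    if not fail_to_pass:
        raise ValueError("fail_to_pass must be non-empty")
    # Gate 1: a flagged trajectory earns nothing.
    if hack_flagged:
        return 0.0
    # Gate 2: any regression in the pass_to_pass guard earns nothing.
    # A test missing from the results counts as failed.
    if any(not results.get(test, False) for test in pass_to_pass):
        return 0.0
    passed = sum(1 for test in fail_to_pass if results.get(test, False))
    return passed / len(fail_to_pass)
```

The fractional return value (rather than a binary one) is what lets the curriculum receive partial credit, as the `reward_fn` bullet notes.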

### Safeguards implemented

- `scrub_tree(workdir)` — physically removes `__pycache__`, `.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, `*.pyc/.pyo/.class` before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
- `SANDBOX_DENYLIST` — blocks `find`, `strings`, `unzip`, `jar`, `javap`, decompilers, `git`. First-token-only check; bypassable via `sh -c "..."`. Documented as defense-in-depth, NOT the wall.
- `HackMonitor.flag()` — layer 1: substring scan for cache/decompiler signatures in trajectory actions (not in `submit_patch`). Layer 2: patch-provenance — if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat `"__py"+"cache__"` obfuscation), flags the trajectory.
- `DockerSandbox`: `network_mode='none'`, `read_only=True`, `cap_drop=['ALL']`, `no-new-privileges`, `pids_limit=256`, `mem_limit=1g`, optional gVisor `runtime='runsc'`.
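
To make the `SANDBOX_DENYLIST` caveat concrete, here is a minimal sketch of a first-token check and the `sh -c` bypass. The constant `DENYLIST`, the function name, and the choice to refuse unparsable commands are assumptions; the real list also covers decompilers, which are omitted here.

```python
import shlex

# Illustrative subset of the denylist named above.
DENYLIST = frozenset({"find", "strings", "unzip", "jar", "javap", "git"})


def first_token_blocked(command: str) -> bool:
    """Return True when the first shell token is on the denylist."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Assumption: refuse commands whose quoting cannot be parsed.
        return True
    return bool(tokens) and tokens[0] in DENYLIST


# Blocked: the denied tool is the first token.
print(first_token_blocked("find . -name x"))
# Not blocked: the first token is `sh`, so the wrapped `find` slips through.
print(first_token_blocked('sh -c "find . -name x"'))
```

The second call shows why the list is defense-in-depth only: the wrapped command is never inspected, so removing the artifacts with `scrub_tree` has to be the primary control.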

### What ingestion/claude_code.py can ingest today

`ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]`:
- Input: Claude Code session JSONL at `~/.claude/projects/<encoded>/<sessionId>.jsonl`
- Output: one `TraceState` per assistant TURN (`state_id`, `messages`, `student_action`)
- Skips: subagent files (`agent-` prefix), sidechain records (`isSidechain: True`), `summary` / `attachment` / `queue-operation` / `file-history-snapshot` records
- `student_action`: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if `strip_thinking=True`)
- `tool_error` flag: structurally set on `user` messages where any `tool_result` block has `is_error: true` — this is the SDPO error-site detection signal
- `state_id`: `f"{path.stem}::{state_idx:04d}"`
- Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL
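
Two of the ingester behaviors above, the structural `tool_error` flag and the `state_id` format, can be sketched as follows. The helper names are hypothetical, and the assumption that blocks live under `record["message"]["content"]` should be checked against the actual session files.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any


def has_tool_error(record: dict[str, Any]) -> bool:
    """True for a `user` record carrying a tool_result with is_error: true."""
    if record.get("type") != "user":
        return False
    # Assumption: content blocks sit under record["message"]["content"].
    content = record.get("message", {}).get("content")
    if not isinstance(content, list):
        return False  # plain-string content has no tool_result blocks
    return any(
        isinstance(block, dict)
        and block.get("type") == "tool_result"
        and block.get("is_error") is True
        for block in content
    )


def make_state_id(path: Path, state_idx: int) -> str:
    """Session-stable identifier: file stem plus zero-padded turn index."""
    return f"{path.stem}::{state_idx:04d}"
```

Because the flag is read from the record structure rather than from message text, it does not depend on how a tool phrases its error output.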

---

## (2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)

From `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §5/§8/§10.

1. **Seed trace ingestion (Stage a):** `ClaudeCodeIngester.ingest()` over `s3://composer-datagen-386931836011-us-west-2/raw/claude_code/**/*.jsonl` → Parquet at `traces/v1/run_id=<id>/part-*.parquet` via AWS Glue 5.0 Spark ETL job (`glue_ingest_job.py`, ~80 LOC, NOT YET BUILT).
2. **Schema inversion (Stage c1):** `SweBenchAdapter.to_task()` per SWE-bench row → `FeatureDeletionTask` JSONL at `tasks/v1/run_id=<id>/manifest.jsonl` (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (`is_redistributable()`) applied here.
3. **N-teacher replay (Stage b):** `teacher_replay.replay_trace()` generalized from flat OpenRouter to `BedrockBatchTeacherPool`: write one shared `replay/v1/run_id=<id>/input/states.jsonl`, submit one `CreateModelInvocationJob` per teacher, write `.jsonl.out` per teacher to `replay/v1/.../teacher=<slug>/`. An EMR Serverless aggregation step joins all N outputs by `state_id` into `list[TeacherCallResult]`. (`teacher_replay_bedrock.py`, ~180 LOC, NOT YET BUILT).
4. **Multi-model tree expansion (the core delta — NOT BUILT):** A `tree_controller.py` (~250–350 LOC, design-only) that, for each `TraceState` node, fires N models, applies each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branches again from the new state, grades leaves with `_grade()`. Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see step 8).
5. **Sandbox materialization + 4-gate validation (Stage c2):** AWS Batch array jobs on EC2 Spot, one child per task. Each child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task in the S3 manifest, boots `DockerSandbox`/`LocalSubprocessSandbox`, runs `validator.validate_task()` (4 gates), writes `task_grades/v1/run_id=<id>/<task_id>.json`. (`datagen/aws/batch_validate.py`, ~120 LOC, NOT YET BUILT).
6. **DPO pair extraction + normalization (Stage d):** `extract_dpo_pairs()` (already built in `teacher_replay.py`) on the fan-in of teacher outputs → `DPOPair` rows → `DJNormalizer` data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → `corpus/v1/run_id=<id>/dpo/part-*.parquet` and `corpus/sft/part-*.parquet`. (`replaysim/emr_normalize_job.py`, ~100 LOC, NOT YET BUILT).
7. **Orchestration:** AWS Step Functions Standard Workflow: `Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda)`. (`infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`, ~250 LOC IaC, NOT YET BUILT).
8. **S3 typed dataset contract (full set):**
   - `raw/claude_code/**/*.jsonl` — input seed traces
   - `traces/v1/run_id=<id>/part-*.parquet` — TraceState rows (Stage a output)
   - `tasks/v1/run_id=<id>/manifest.jsonl` — FeatureDeletionTask rows (Stage c1 output)
   - `tasks/golden/run_id=<id>/` — golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
   - `replay/v1/run_id=<id>/input/states.jsonl` — shared Bedrock batch input
   - `replay/v1/run_id=<id>/teacher=<slug>/*.jsonl.out` — per-teacher Bedrock batch output
   - `task_grades/v1/run_id=<id>/<task_id>.json` — validator + _grade() results
   - `corpus/v1/run_id=<id>/sft/part-*.parquet` — clean winning trajectories (SFT-first floor)
   - `corpus/v1/run_id=<id>/dpo/part-*.parquet` — DPO pairs (normalized DPOPair)
   - `dpo_pairs/` — divergence-derived DPO pairs from the tree (sibling winners vs losers)
   - `rl_task_pool/` — FeatureDeletionTask registry + DifficultyCurriculum priors
   - `divergence_pairs/` — divergence-annotated nodes (where sibling next-action distributions forked)
   - `wm_tuples/` — (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
   - `holdout/` — disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
   - `diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt` — DiLoCo outer-sync (already used by existing allreduce.py)
   - `manifests/run_id=<id>.json` — run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)
9. **SFT-first stage:** Read `sft_corpus/` (clean `_grade()` gate-1 passing trajectories), run `compose_loss` with `alpha_sdpo=0, beta_replay=0` (reduces to `_lm_response_ce` — next-token CE masked to response tokens), write `ckpt_sft/`. (`pipeline/sft_floor.py`, ~60 LOC, NOT YET BUILT).
10. **Inner RL loop:** `ComposerReplicationTrainer` (trl.GRPOTrainer subclass) on `rl_task_pool/` with `FeatureDeletionEnv.reward_fn`; `total = grpo + α·sdpo + β·trace_replay_dpo`; DiLoCo outer-sync via S3; `HeldOutGuard` kill-switch now wired (Wave 3).
11. **Flywheel:** Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.
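
The manifest contract in steps 2 and 5 (one task per line, array index = line number) can be sketched as the lookup a Batch array child would perform. `task_for_array_child` is a hypothetical name, and the manifest is passed in as text so that the S3 read stays out of the sketch.

```python
from __future__ import annotations

import json
import os
from typing import Any, Mapping


def task_for_array_child(
    manifest_text: str,
    env: Mapping[str, str] = os.environ,
) -> dict[str, Any]:
    """Pick this child's task: the manifest line at the Batch array index."""
    # AWS Batch sets this variable on each child of an array job.
    index = int(env["AWS_BATCH_JOB_ARRAY_INDEX"])
    lines = manifest_text.splitlines()
    if not 0 <= index < len(lines):
        raise IndexError(f"array index {index} outside manifest of {len(lines)} lines")
    return json.loads(lines[index])
```

Keeping the index-to-line mapping positional means the manifest must not be reordered or filtered between task generation and validation; any such change would silently reassign tasks to children.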

---

## (3) Unbuilt Components the Vision Depends On

Every item below is design-only or a skeleton; none has real production code.

| Component | File Estimate | Source | Status |
|---|---|---|---|
| `datagen/tree_controller.py` — the core delta: env-step between branches, `_grade()` at leaves, divergence-gated expansion, six typed S3 prefix writes | ~250–350 LOC | design-F1, final_report §1/§5/§6 | **0% built** — no file exists |
| `SiblingBootstrapGenerator` in `hint_generator.py` — select max-reward sibling → emit "a working approach looks like: …" → feed `ctx_teacher` splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | **0% built** — not a class in hint_generator.py at all |
| `pipeline/s3_layout.py` — typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | **0% built** — no `pipeline/` directory exists |
| `pipeline/sft_floor.py` — SFT-first driver: read `sft_corpus/`, run TRL SFTTrainer or `compose_loss` `alpha=beta=0`, write `ckpt_sft/` | ~60 LOC | design-F1 §2 / design-F5 d | **0% built** |
| `teacher_replay_bedrock.py` (`BedrockBatchTeacherPool`): submit one Bedrock `CreateModelInvocationJob` per teacher, poll, parse `.jsonl.out` back into `list[TeacherCallResult]` | ~180 LOC | design-F2 §b | **0% built** |
| `datagen/aws/batch_validate.py`: AWS Batch array-child entrypoint that reads `AWS_BATCH_JOB_ARRAY_INDEX` → manifest line → `DockerSandbox` + `validator` + `_grade()` → writes `task_grades/` | ~120 LOC | design-F2 §c2 | **0% built**; the `datagen/aws/` subdirectory does not exist |
| `datagen/aws/glue_ingest_job.py` — Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; write `traces/` Parquet | ~80 LOC | design-F2 §a | **0% built** |
| `replaysim/emr_normalize_job.py` — EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; write `corpus/dpo/` + `corpus/sft/` Parquet | ~100 LOC | design-F2 §d | **0% built** |
| `datagen/aws/s3_contract.py` — S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers, `recordId==state_id` join helpers, `schema_version`/`split` column injection | ~120 LOC | design-F2 §contract | **0% built** |
| `infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`: Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | **0% built**; the `infra/` directory does not exist |
| `trainer/composer_trainer.py` world-model head — parameter-isolated next-state adapter + `<deliberate>` token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | **0% built** — grep confirms no `world_model`/`WorldModel`/`next_state_head`/`<deliberate>` anywhere in `composer_replication/` |
| Broken-repo image builder — code to clone a repo at `base_commit`, apply `git apply -R <golden_diff>`, run `scrub_tree`, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | **0% built** — there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch |
| `EKSExecutor` (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | **skeleton built**; `eks.py` is a functional executor (IndexedJob dispatch) but the Argo outer-loop controller is 0% built |
| `verl AsyncServer` backend for tool-heavy tree | — | final_report §8 | **0% built** — design note only |
| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | — | design-F5 §Tier 4 | **0% built** |

---

## (4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code

The `SweBenchAdapter` is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:

### Break 1: `broken_image` assumes a pre-built SWE-bench image exists

`SweBenchAdapter.image_for()` returns either `instance["docker_image"]` (SWE-rebench) or the convention `swebench/sweb.eval.x86_64.{iid}:latest`. For an arbitrary OSS repo there is no such image. A fresh repo would need:
- Clone at `base_commit`
- Install the project's Python/Java/etc. toolchain
- Apply `git apply -R <golden_diff>` to manufacture the broken state
- Run `scrub_tree()` to strip caches
- Build a Docker image that encapsulates this broken state
- Push the image to a registry accessible by `DockerSandbox.boot()`

None of this code exists. `DockerSandbox.boot(image)` raises `RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.")` if the image is absent.
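
The missing builder steps can be sketched as a dry-run command plan. Nothing here is repo code: `broken_image_plan` and its parameters are hypothetical, the toolchain install is assumed to live in a Dockerfile at the repo root, and the function only returns argv lists without executing them.

```python
from __future__ import annotations


def broken_image_plan(
    repo_url: str,
    base_commit: str,
    patch_path: str,
    workdir: str,
    image_tag: str,
) -> list[list[str]]:
    """Return the argv lists a broken-repo image builder would run, in order."""
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", base_commit],
        # Reverse-apply the gold patch to manufacture the broken state.
        # patch_path should be absolute, because -C changes directory first.
        ["git", "-C", workdir, "apply", "-R", patch_path],
        # scrub_tree(workdir) would run here, in-process, before the build.
        # Assumption: a Dockerfile in workdir installs the project toolchain.
        ["docker", "build", "-t", image_tag, workdir],
        ["docker", "push", image_tag],
    ]
```

A caller could pass each entry to `subprocess.run(argv, check=True)`. Returning a plan first keeps the sequence reviewable and testable without network or Docker access.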

### Break 2: `test_command` is hardcoded

`SweBenchAdapter.default_test_command = "python -m pytest -q"`. A fresh repo may use `make test`, `npm test`, `cargo test`, `mvn verify`, or any other test runner. There is no test-discovery logic anywhere in the repo.
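
For illustration only, one simple form of test discovery is a marker-file lookup. This sketch is hypothetical: the marker-to-command table is an assumption built from the runners named above, and `make test` presumes the Makefile defines a `test` target.

```python
from __future__ import annotations

from typing import Iterable

# Assumed mapping from a top-level marker file to a test command.
MARKER_COMMANDS = (
    ("Cargo.toml", "cargo test"),
    ("package.json", "npm test"),
    ("pom.xml", "mvn verify"),
    ("Makefile", "make test"),
)
# The adapter's current hardcoded default.
DEFAULT_TEST_COMMAND = "python -m pytest -q"


def guess_test_command(top_level_files: Iterable[str]) -> str:
    """Pick a test command from the names of files at the repo root."""
    names = set(top_level_files)
    for marker, command in MARKER_COMMANDS:
        if marker in names:
            return command
    return DEFAULT_TEST_COMMAND
```

A lookup like this only selects a runner. It does not verify that the command works or that the toolchain is installed, which is the gap Break 6 describes.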

### Break 3: `fail_to_pass` and `pass_to_pass` require pre-existing test labels

SWE-bench instances ship with `FAIL_TO_PASS` and `PASS_TO_PASS` as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived — e.g., via coverage analysis or AST-reachability. `FeatureDeletionTask.__post_init__` raises `ValueError` if `fail_to_pass` is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.

### Break 4: `deleted_symbols` is never populated

`SweBenchAdapter` hardcodes `deleted_symbols=()`. The `HackMonitor._patch_provenance_hack()` check (`monitor.py:157-182`) skips the symbol-reappearance test if `deleted_symbols` is empty — so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.

### Break 5: No copyleft scrub for arbitrary repos

`is_redistributable()` reads `upstream_license` from `instance["license_name"]`. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.
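
A minimal sketch of the copyleft filter's shape, assuming substring matching on the license name. The repo's `is_redistributable()` may differ, and treating a missing license as non-redistributable is an assumption made here to stay conservative.

```python
from __future__ import annotations

from typing import Optional

# License families the filter strips, per the schema table above.
COPYLEFT_MARKERS = ("AGPL", "LGPL", "GPL")


def is_redistributable(license_name: Optional[str]) -> bool:
    """False for GPL-family licenses and, by assumption, for unknown ones."""
    if not license_name:
        # Assumption: no detected license means do not redistribute.
        return False
    name = license_name.upper()
    return not any(marker in name for marker in COPYLEFT_MARKERS)
```

For an arbitrary repo the input would have to come from a license-detection step first, which is exactly the piece this break identifies as missing.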

### Break 6: No env setup for non-Python repos

`LocalSubprocessSandbox.run_tests` runs `subprocess.run(cmd, shell=True, ...)` against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. `DockerSandbox` depends on a pre-baked image that already has the environment. A fresh Python repo would need `pip install -e .` run inside the image, and a non-Python repo would need a completely different image and test runner.

---

## (5) What ingestion/claude_code.py Can Ingest Today

`ClaudeCodeIngester.ingest(path)` handles exactly one format: **Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionId>.jsonl`.

Supported record types handled:
- `type="user"` — string content or list of text/tool_result blocks → OpenAI-style user message; `tool_error` structural flag set if any `tool_result` block has `is_error: true`
- `type="assistant"` — list of text/thinking/tool_use blocks → one `TraceState` with `student_action` (full blocks including thinking) and `messages` (history, optionally with thinking stripped)

Record types silently skipped:
- `type="summary"` — Claude Code conversation summary records
- `type="attachment"`, `"queue-operation"`, `"file-history-snapshot"`, `"last-prompt"`, `"system"` — auxiliary records
- `isSidechain: True` records — subagent traces (skipped in v0.1 per ADR-002)
- Files starting with `agent-` — subagent session files by naming convention

Structural features:
- `state_id = f"{path.stem}::{state_idx:04d}"` — stable within-session identifier
- `strip_thinking` flag (default True) — strips `[THINKING] ...` lines from the teacher-facing `messages` history but keeps them in `student_action`
- Injects synthetic system prompt at `messages[0]` (`"You are a senior software engineer..."`)
- Version check: warns on schema version outside `2.x.x`

NOT handled by this ingester:
- OpenHands trajectory format (planned for v0.2 per ADR-002)
- SWE-smith trajectories (planned for v0.2)
- Cline VS Code export
- Aider chat history
- SWE-bench leaderboard trajectory submissions
- Any binary or non-JSONL format

---

## Critical Cross-Checks: What the Repo Claims vs What Exists

### Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")
**What the blog says (COMPOSER_RECIPE_MAPPING.md):** "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests."
**What the repo does:** Inverts *existing* SWE-bench-shaped instances — reverts their gold patch. There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones. The ADR correctly scopes this as "Option A — invert OSS substrates" vs "Option B — greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was *explicitly rejected*.

### Claim 2: "25× synthetic data"
**What the blog says:** Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2).
**What the repo has:** A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the *training run*; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.

### Claim 3: "Dynamic difficulty curriculum — select for AND create harder tasks"
**What Composer 2.5 says:** "We both select for and create harder tasks dynamically throughout the run."
**What the repo has:** The SELECT-FOR half: `DifficultyCurriculum` with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. `granularity` is set statically to `"feature"` for all SweBenchAdapter tasks; no escalation logic exists.
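
The p̂(1−p̂) frontier weighting in the SELECT-FOR half can be sketched as below. Function names are hypothetical, the uniform fallback is an assumption, and the retire/quarantine thresholds and effort tilt are left out.

```python
from __future__ import annotations


def frontier_weight(p_hat: float) -> float:
    """Weight p(1-p): zero for always-solved or never-solved tasks, peak at 0.5."""
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    return p_hat * (1.0 - p_hat)


def sampling_probs(pass_rates: dict[str, float]) -> dict[str, float]:
    """Normalize frontier weights into per-task sampling probabilities."""
    weights = {task_id: frontier_weight(p) for task_id, p in pass_rates.items()}
    total = sum(weights.values())
    if total == 0.0:
        # Assumption: fall back to uniform when every weight is zero.
        return {task_id: 1.0 / len(weights) for task_id in weights}
    return {task_id: w / total for task_id, w in weights.items()}
```

The weight is the variance of a Bernoulli outcome with success rate p̂, so sampling concentrates on tasks whose results are least predictable. This selects among existing tasks only; it does nothing to create harder ones.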

### Claim 4: `deleted_symbols` enables AST-provenance monitoring
**What ADR-010 says:** "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads.
**Reality:** `deleted_symbols=()` on every `SweBenchAdapter`-derived task (line 81 in substrates.py: hardcoded empty tuple). `HackMonitor._patch_provenance_hack()` returns False immediately when `deleted_symbols` is empty (`reappeared = [s for s in deleted_symbols if s and s in patch]` → empty list). The provenance layer of the monitor is a dead code path for all currently-generable tasks.
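
The dead code path can be shown with a reduced sketch of the provenance check. Only the `reappeared` expression comes from the source; the function signature and the `read_cache_artifact` boolean are simplifying assumptions.

```python
from __future__ import annotations

from typing import Sequence


def patch_provenance_hack(
    deleted_symbols: Sequence[str],
    patch: str,
    read_cache_artifact: bool,
) -> bool:
    """Flag only if a deleted symbol reappears AND a cache artifact was read."""
    reappeared = [s for s in deleted_symbols if s and s in patch]
    return bool(reappeared) and read_cache_artifact


patch = "def parse_config(path):\n    return {}\n"
# With the adapter's hardcoded empty tuple, the check can never fire.
print(patch_provenance_hack((), patch, read_cache_artifact=True))
# With symbols populated, the same trajectory would be flagged.
print(patch_provenance_hack(("parse_config",), patch, read_cache_artifact=True))
```

The first call returns `False` regardless of what the trajectory did, which is the no-op behavior described above for all `SweBenchAdapter`-derived tasks.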

### Claim 5: The tree controller and world-model head are part of the system
**What design docs say:** "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table).
**Reality:** The tree controller is 0% built: no file, no function, no class. Confirmed by exhaustive grep: no `SiblingBootstrap`, `world_model`, `WorldModel`, `next_state_head`, `tree_controller`, `MCTS`, `deliberate_token` anywhere in `composer_replication/`. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) is entirely design-only.

### Claim 6: The broken-repo image is manufactured by the pipeline
**What design-F2 says:** Step c2 involves "pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()`, run the test command, confirm FAIL_TO_PASS actually fails."
**Reality:** This describes what SHOULD happen in the Batch array child. No such code is written. `SweBenchAdapter.image_for()` returns a string tag; that tag must be pre-pulled on the host before `DockerSandbox.boot()` can use it (`RuntimeError` on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.

---

## Summary of Unbuilt vs Built

### BUILT and tested (production-ready CPU, Docker-gated where noted):
- `FeatureDeletionTask` schema + `FeatureDeletionEnv` (reset/step/_grade/reward_fn)
- `SweBenchAdapter` schema inversion (pure dict transform)
- `FakeSandbox`, `LocalSubprocessSandbox`, `DockerSandbox` (hardware-gated e2e green in Wave 1/2)
- `scrub_tree()` primary reward-hack control
- `HackMonitor` (signature + patch-provenance, obfuscation-resistant)
- `DifficultyCurriculum` (SELECT-FOR half + effort tilt)
- `validate_task()` 4-gate solvability validator
- `ClaudeCodeIngester` (Claude Code JSONL only)
- `behavior_rewards.py`: `c_length`, `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (Wave 20)
- `kl_in_reward.py` — k1-in-reward path opt-in (Wave 20)
- `HeldOutGuard` + `HeldoutSplit` + wired into trainer (Wave 2/3)
- `EKSExecutor` skeleton + `SageMakerExecutor` skeleton (Wave 2)

### DESIGN-ONLY (no code):
- Tree controller (`datagen/tree_controller.py`)
- `SiblingBootstrapGenerator` in `hint_generator.py`
- `pipeline/s3_layout.py`, `pipeline/sft_floor.py`
- `teacher_replay_bedrock.py` (BedrockBatchTeacherPool)
- `datagen/aws/batch_validate.py`, `datagen/aws/glue_ingest_job.py`, `datagen/aws/s3_contract.py`
- `replaysim/emr_normalize_job.py`
- `infra/datagen_stepfunctions.json`, `infra/datagen_stack.py`
- World-model next-state head in trainer
- Argo Workflows outer-loop controller
- Broken-repo image builder (clone → git apply -R → build → push)
- CREATE half of difficulty curriculum (mint harder tasks during run)
- SFT-first training stage
- Offline LLM-judge hack monitor