Baladithya Balamurugan

Wave 21: deep-read critical review — 8 source clusters re-read, findings verified

	# Grounding Map: Dataset-Generation Pipeline — What Exists vs What Is Claimed vs What Is Envisioned

	Agent: REPO-GROUNDING
	Date: 2026-06-09
	Scope: composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py,
	hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5,
	research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md,
	docs/BACKLOG_RESOLUTION_2026-06-09.md

	---

	## (1) Exact Current Dataset-Generation Capability

	### FeatureDeletionTask schema (`datagen/schema.py`)

	The twelve load-bearing fields and what produces each today:

	| Field | Type | Producer today | Notes |
	|---|---|---|---|
	| `task_id` | `str` | `SweBenchAdapter.to_task()`: copied from `instance["instance_id"]` or `instance["task_id"]` | `"unknown"` if missing |
	| `repo` | `str` | `instance["repo"]` via `SweBenchAdapter.to_task()` | e.g. `"getmoto/moto"` |
	| `base_commit` | `str` | `instance["base_commit"]` | no code to `git checkout` this commit exists today |
	| `broken_image` | `str` | `SweBenchAdapter.image_for(instance)`: either `instance["docker_image"]` (SWE-rebench) or the conventional `swebench/sweb.eval.x86_64.{iid}:latest` | This tag is a pre-built SWE-bench eval image; no code in the repo pulls or builds these images |
	| `fail_to_pass` | `tuple[str,...]` | `_as_tuple(instance["FAIL_TO_PASS"])`; handles JSON-encoded string OR list | validated non-empty in `__post_init__` |
	| `pass_to_pass` | `tuple[str,...]` | `_as_tuple(instance["PASS_TO_PASS"])` | may be empty |
	| `test_command` | `str` | `SweBenchAdapter.default_test_command` = `"python -m pytest -q"` | hardcoded; not read from instance |
	| `deleted_symbols` | `tuple[str,...]` | never populated by SweBenchAdapter; hardcoded `()` in every substrate inversion | the monitor can't do symbol-provenance checks without this |
	| `golden_diff` | `str` | `instance["patch"]` | held out of repr; used only by validator |
	| `granularity` | `str` | hardcoded `"feature"` in `SweBenchAdapter.to_task()` | CREATE-half escalation (function→file→feature) not wired to anything |
	| `difficulty_prior` | `float` | `instance["difficulty"]` if present (SWE-rebench) else `0.5` | |
	| `upstream_license` | `str` | `instance["license_name"]` | copyleft filter in `is_redistributable()` strips GPL/AGPL/LGPL |

	### What SweBenchAdapter actually does and does NOT do

	`SweBenchAdapter.to_task(instance: dict)` is a pure schema inversion — it takes one SWE-bench-shaped dict and maps it to a `FeatureDeletionTask`. It does NOT:
	- Pull or build a Docker image
	- Apply the gold patch in reverse (`git apply -R`)
	- Run any tests
	- Discover test node IDs
	- Populate `deleted_symbols` (always empty)
	- Escalate `granularity` beyond the static `"feature"`

	The broken-repo Docker image is assumed to exist pre-built (the SWE-bench project publishes these images; SWE-rebench carries its own `docker_image` field). The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented `[~]` gate in ADR-010. It is implemented only in part: the 4-gate validator interface exists, and `scrub_tree`, `LocalSubprocessSandbox`, and `DockerSandbox` are built, but no code in the repo actually clones a repo, runs `git apply -R <gold_patch>`, builds a Docker image, and pushes it to a registry.
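The inversion itself is small. A minimal sketch follows, assuming a trimmed-down dataclass: `TaskSketch` and `to_task_sketch` are hypothetical names for illustration, not the repo's `FeatureDeletionTask` or `SweBenchAdapter.to_task()`. Only the field mappings listed in the table above are taken from the source; the body of `_as_tuple` is a guess at its behavior.

```python
import json
from dataclasses import dataclass, field


def _as_tuple(value) -> tuple:
    """Accept a JSON-encoded string, a list/tuple, or None; return a tuple of str."""
    if value is None:
        return ()
    if isinstance(value, str):
        value = json.loads(value) if value.strip() else []
    return tuple(str(v) for v in value)


@dataclass(frozen=True)
class TaskSketch:
    # Illustrative subset of the schema fields, not the real class.
    task_id: str
    repo: str
    base_commit: str
    broken_image: str
    fail_to_pass: tuple
    pass_to_pass: tuple = ()
    test_command: str = "python -m pytest -q"
    deleted_symbols: tuple = ()
    golden_diff: str = field(default="", repr=False)  # held out of repr

    def __post_init__(self) -> None:
        if not self.fail_to_pass:
            raise ValueError("fail_to_pass must be non-empty")


def to_task_sketch(instance: dict) -> TaskSketch:
    """Pure dict-to-dataclass mapping: no Docker, no git, no test execution."""
    iid = instance.get("instance_id") or instance.get("task_id") or "unknown"
    image = instance.get("docker_image") or f"swebench/sweb.eval.x86_64.{iid}:latest"
    return TaskSketch(
        task_id=iid,
        repo=instance["repo"],
        base_commit=instance["base_commit"],
        broken_image=image,  # only a string tag; nothing is pulled or built
        fail_to_pass=_as_tuple(instance.get("FAIL_TO_PASS")),
        pass_to_pass=_as_tuple(instance.get("PASS_TO_PASS")),
        golden_diff=instance.get("patch", ""),
    )
```

Nothing in this mapping touches Docker, git, or a test runner, which is why every downstream step depends on the image tag already resolving to a real image.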

	### What FeatureDeletionEnv does during training (`datagen/env.py`)

	- `reset(task)` — boots the sandbox (by image tag), returns a text prompt listing failing tests. The prompt exposes `task.repo`, `task.fail_to_pass`, `task.test_command` but NEVER `golden_diff` or `deleted_symbols`.
	- `step(action)` — delegates to `sandbox.exec(action)`, returning observation text; grades on `submit` or turn limit.
	- `_grade()` — runs `sandbox.run_tests(test_command, fail_to_pass + pass_to_pass)`, computes pass-fraction over `fail_to_pass`, gates to 0 if `pass_to_pass` guard is broken OR `HackMonitor.flag()` fires.
	- `reward_fn(prompts, completions, *, task_id, **kwargs)`: TRL `RewardFunc` face; dispatches through `reset`/`step`; feeds fractional credit (not binary) to `DifficultyCurriculum.update`.
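The grading rule described for `_grade()` can be sketched in a few lines. This is an illustration under assumptions, not the repo's code: `grade_sketch` is a hypothetical name, test results are assumed to arrive as a `{test_id: passed}` mapping, and the hack flag is passed in as a boolean rather than computed.

```python
def grade_sketch(results: dict, fail_to_pass, pass_to_pass, hack_flagged: bool) -> float:
    """Fractional credit over fail_to_pass, gated to 0.0 by either guard."""
    if hack_flagged:
        return 0.0  # monitor fired: no credit regardless of test outcomes
    if not all(results.get(t, False) for t in pass_to_pass):
        return 0.0  # a previously passing test broke: guard violated
    passed = sum(1 for t in fail_to_pass if results.get(t, False))
    # fail_to_pass is validated non-empty by the schema, so this cannot divide by zero.
    return passed / len(fail_to_pass)
```

A test missing from `results` counts as failing here, which is the conservative choice for both the guard and the credit.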

	### Safeguards implemented

	- `scrub_tree(workdir)`: physically removes `__pycache__`, `.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, and `*.pyc`/`*.pyo`/`*.class` files before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
	- `SANDBOX_DENYLIST` — blocks `find`, `strings`, `unzip`, `jar`, `javap`, decompilers, `git`. First-token-only check; bypassable via `sh -c "..."`. Documented as defense-in-depth, NOT the wall.
	- `HackMonitor.flag()` — layer 1: substring scan for cache/decompiler signatures in trajectory actions (not in `submit_patch`). Layer 2: patch-provenance — if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat `"__py"+"cache__"` obfuscation), flags the trajectory.
	- `DockerSandbox` — `network_mode='none'`, `read_only=True`, `cap_drop=['ALL']`, `no-new-privileges`, `pids_limit=256`, `mem_limit=1g`, optional gVisor `runtime='runsc'`.
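Why the denylist is only defense-in-depth can be seen in a few lines. A minimal sketch, assuming the check tokenizes with `shlex` and compares only the first token; `DENYLIST_SKETCH` and `first_token_blocked` are hypothetical names, and the set holds only the commands named above (decompilers omitted).

```python
import shlex

# Hypothetical stand-in for SANDBOX_DENYLIST.
DENYLIST_SKETCH = frozenset({"find", "strings", "unzip", "jar", "javap", "git"})


def first_token_blocked(action: str) -> bool:
    """Return True only when the first shell token is on the denylist."""
    tokens = shlex.split(action)
    return bool(tokens) and tokens[0] in DENYLIST_SKETCH
```

With this check, `first_token_blocked("git log")` is blocked while `first_token_blocked('sh -c "git log"')` is not, because the first token is `sh`. Physically removing the artifacts with `scrub_tree` does not depend on recognizing the command, which is why it is the primary control.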

	### What ingestion/claude_code.py can ingest today

	`ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]`:
	- Input: Claude Code session JSONL at `~/.claude/projects/<encoded>/<sessionId>.jsonl`
	- Output: one `TraceState` per assistant TURN (`state_id`, `messages`, `student_action`)
	- Skips: subagent files (`agent-` prefix), sidechain records (`isSidechain: True`), `summary` / `attachment` / `queue-operation` / `file-history-snapshot` records
	- `student_action`: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if `strip_thinking=True`)
	- `tool_error` flag: structurally set on `user` messages where any `tool_result` block has `is_error: true` — this is the SDPO error-site detection signal
	- `state_id`: `f"{path.stem}::{state_idx:04d}"`
	- Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL
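The two structural signals above, the `tool_error` flag and the `state_id` format, are simple enough to sketch. The helper names `has_tool_error` and `make_state_id` are hypothetical, and the block shape (a dict with `type` and `is_error` keys) is assumed from the description rather than read from the ingester.

```python
from pathlib import Path


def has_tool_error(content) -> bool:
    """True if any tool_result block in a user message carries is_error: true."""
    if isinstance(content, str):
        return False  # plain-string content has no tool_result blocks
    return any(
        isinstance(block, dict)
        and block.get("type") == "tool_result"
        and block.get("is_error") is True
        for block in content
    )


def make_state_id(path: Path, state_idx: int) -> str:
    """Session file stem plus a zero-padded per-session state index."""
    return f"{path.stem}::{state_idx:04d}"
```

Because the flag is derived from block structure rather than from matching error text, it does not depend on how a given tool words its failures.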

	---

	## (2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)

	From `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §5/§8/§10.

	1. Seed trace ingestion (Stage a): `ClaudeCodeIngester.ingest()` over `s3://composer-datagen-386931836011-us-west-2/raw/claude_code/*/.jsonl` → Parquet at `traces/v1/run_id=<id>/part-*.parquet` via AWS Glue 5.0 Spark ETL job (`glue_ingest_job.py`, ~80 LOC, NOT YET BUILT).
	2. Schema inversion (Stage c1): `SweBenchAdapter.to_task()` per SWE-bench row → `FeatureDeletionTask` JSONL at `tasks/v1/run_id=<id>/manifest.jsonl` (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (`is_redistributable()`) applied here.
	3. N-teacher replay (Stage b): `teacher_replay.replay_trace()` generalized from flat OpenRouter to `BedrockBatchTeacherPool` — write one shared `replay/v1/run_id=<id>/input/states.jsonl`, submit one `CreateModelInvocationJob` per teacher, write `.jsonl.out` per teacher to `replay/v1/.../teacher=<slug>/`. An EMR Serverless aggregation step joins all N outputs by `state_id` → `list[TeacherCallResult]`. (`teacher_replay_bedrock.py`, ~180 LOC, NOT YET BUILT).
	4. Multi-model tree expansion (the core delta — NOT BUILT): A `tree_controller.py` (~250–350 LOC, design-only) that, for each `TraceState` node, fires N models, applies each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branches again from the new state, grades leaves with `_grade()`. Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see step 8).
	5. Sandbox materialization + 4-gate validation (Stage c2): AWS Batch array jobs on EC2 Spot, one child per task. Each child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task in the S3 manifest, boots `DockerSandbox`/`LocalSubprocessSandbox`, runs `validator.validate_task()` (4 gates), writes `task_grades/v1/run_id=<id>/<task_id>.json`. (`datagen/aws/batch_validate.py`, ~120 LOC, NOT YET BUILT).
	6. DPO pair extraction + normalization (Stage d): `extract_dpo_pairs()` (already built in `teacher_replay.py`) on the fan-in of teacher outputs → `DPOPair` rows → `DJNormalizer` data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → `corpus/v1/run_id=<id>/dpo/part-*.parquet` and `corpus/sft/part-*.parquet`. (`replaysim/emr_normalize_job.py`, ~100 LOC, NOT YET BUILT).
	7. Orchestration: AWS Step Functions Standard Workflow: `Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda)`. (`infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`, ~250 LOC IaC, NOT YET BUILT).
	8. S3 typed dataset contract (full set):
	- `raw/claude_code/*/.jsonl` — input seed traces
	- `traces/v1/run_id=<id>/part-*.parquet` — TraceState rows (Stage a output)
	- `tasks/v1/run_id=<id>/manifest.jsonl` — FeatureDeletionTask rows (Stage c1 output)
	- `tasks/golden/run_id=<id>/` — golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
	- `replay/v1/run_id=<id>/input/states.jsonl` — shared Bedrock batch input
	- `replay/v1/run_id=<id>/teacher=<slug>/*.jsonl.out` — per-teacher Bedrock batch output
	- `task_grades/v1/run_id=<id>/<task_id>.json` — validator + _grade() results
	- `corpus/v1/run_id=<id>/sft/part-*.parquet` — clean winning trajectories (SFT-first floor)
	- `corpus/v1/run_id=<id>/dpo/part-*.parquet` — DPO pairs (normalized DPOPair)
	- `dpo_pairs/` — divergence-derived DPO pairs from the tree (sibling winners vs losers)
	- `rl_task_pool/` — FeatureDeletionTask registry + DifficultyCurriculum priors
	- `divergence_pairs/` — divergence-annotated nodes (where sibling next-action distributions forked)
	- `wm_tuples/` — (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
	- `holdout/` — disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
	- `diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt` — DiLoCo outer-sync (already used by existing allreduce.py)
	- `manifests/run_id=<id>.json` — run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)
	9. SFT-first stage: Read `sft_corpus/` (clean `_grade()` gate-1 passing trajectories), run `compose_loss` with `alpha_sdpo=0, beta_replay=0` (reduces to `_lm_response_ce` — next-token CE masked to response tokens), write `ckpt_sft/`. (`pipeline/sft_floor.py`, ~60 LOC, NOT YET BUILT).
	10. Inner RL loop: `ComposerReplicationTrainer` (trl.GRPOTrainer subclass) on `rl_task_pool/` with `FeatureDeletionEnv.reward_fn`; `total = grpo + α·sdpo + β·trace_replay_dpo`; DiLoCo outer-sync via S3; `HeldOutGuard` kill-switch now wired (Wave 3).
	11. Flywheel: Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.
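The divergence gate in step 4 is not specified beyond "pre-expansion divergence between sibling next-action distributions". One possible reading is sketched here under loud assumptions: each sibling is summarized as a discrete distribution over candidate actions, divergence is the maximum pairwise Jensen-Shannon divergence, and the 0.2-bit threshold is arbitrary. None of these choices comes from the design docs, and `js_divergence` and `should_expand` are hypothetical names.

```python
import math
from itertools import combinations


def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict) -> float:
        # m[k] > 0 whenever a[k] > 0, so the ratio is always defined.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def should_expand(sibling_dists: list, threshold: float = 0.2) -> bool:
    """Expand a node only if some pair of siblings disagrees enough (assumed rule)."""
    if len(sibling_dists) < 2:
        return False
    worst = max(js_divergence(p, q) for p, q in combinations(sibling_dists, 2))
    return worst >= threshold
```

Gating on divergence before calling `FeatureDeletionEnv.step()` is what would keep the tree from growing as O(N^D): nodes where all teachers agree are not branched.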

	---

	## (3) Unbuilt Components the Vision Depends On

	Every item below is design-only or a skeleton; none has real production code.

	| Component | Size estimate | Source | Status |
	|---|---|---|---|
	| `datagen/tree_controller.py`: the core delta (env-step between branches, `_grade()` at leaves, divergence-gated expansion, six typed S3 prefix writes) | ~250–350 LOC | design-F1, final_report §1/§5/§6 | 0% built; no file exists |
	| `SiblingBootstrapGenerator` in `hint_generator.py`: select max-reward sibling → emit "a working approach looks like: …" → feed `ctx_teacher` splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | 0% built; not a class in hint_generator.py at all |
	| `pipeline/s3_layout.py`: typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | 0% built; no `pipeline/` directory exists |
	| `pipeline/sft_floor.py`: SFT-first driver (read `sft_corpus/`, run TRL SFTTrainer or `compose_loss` `alpha=beta=0`, write `ckpt_sft/`) | ~60 LOC | design-F1 §2 / design-F5 d | 0% built |
	| `teacher_replay_bedrock.py`: `BedrockBatchTeacherPool` (submit one Bedrock `CreateModelInvocationJob` per teacher, poll, parse `.jsonl.out` back into `list[TeacherCallResult]`) | ~180 LOC | design-F2 §b | 0% built |
	| `datagen/aws/batch_validate.py`: AWS Batch array-child entrypoint (read `AWS_BATCH_JOB_ARRAY_INDEX` → manifest line → `DockerSandbox` + `validator` + `_grade()` → write `task_grades/`) | ~120 LOC | design-F2 §c2 | 0% built; `datagen/aws/` subdirectory does not exist |
	| `datagen/aws/glue_ingest_job.py`: Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; write `traces/` Parquet | ~80 LOC | design-F2 §a | 0% built |
	| `replaysim/emr_normalize_job.py`: EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; write `corpus/dpo/` + `corpus/sft/` Parquet | ~100 LOC | design-F2 §d | 0% built |
	| `datagen/aws/s3_contract.py`: S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers, `recordId==state_id` join helpers, `schema_version`/`split` column injection | ~120 LOC | design-F2 §contract | 0% built |
	| `infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`: Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | 0% built; `infra/` directory does not exist |
	| `trainer/composer_trainer.py` world-model head: parameter-isolated next-state adapter + `<deliberate>` token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | 0% built; grep confirms no `world_model`/`WorldModel`/`next_state_head`/`<deliberate>` anywhere in `composer_replication/` |
	| Broken-repo image builder: code to clone a repo at `base_commit`, apply `git apply -R <golden_diff>`, run `scrub_tree`, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | 0% built; there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch |
	| `EKSExecutor` (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | skeleton built; `eks.py` is a functional executor (IndexedJob dispatch) but the Argo outer-loop controller is 0% |
	| `verl AsyncServer` backend for tool-heavy tree | n/a | final_report §8 | 0% built; design note only |
	| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | n/a | design-F5 §Tier 4 | 0% built |
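To make the size of one of these gaps concrete, the task-lookup step of the planned Batch array child is sketched here. It assumes the manifest convention stated in section (2), where the array index equals the 0-based manifest line number. `task_for_array_child` is a hypothetical name, and the manifest is passed in as an iterable of lines so the sketch does not depend on any S3 client.

```python
import json
from typing import Iterable, Mapping


def task_for_array_child(manifest_lines: Iterable[str], env: Mapping[str, str]) -> dict:
    """Return the manifest row whose line number equals this child's array index."""
    # AWS Batch sets this variable on each child of an array job, starting at 0.
    index = int(env["AWS_BATCH_JOB_ARRAY_INDEX"])
    for line_no, line in enumerate(manifest_lines):
        if line_no == index:
            return json.loads(line)
    raise IndexError(f"array index {index} is past the end of the manifest")
```

In a real child the lines would come from the downloaded `manifest.jsonl` and `env` would be `os.environ`; the sandbox boot, validation, and grade write that follow are the parts that remain unbuilt.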

	---

	## (4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code

	The `SweBenchAdapter` is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:

	### Break 1: `broken_image` assumes a pre-built SWE-bench image exists

	`SweBenchAdapter.image_for()` returns either `instance["docker_image"]` (SWE-rebench) or the convention `swebench/sweb.eval.x86_64.{iid}:latest`. For an arbitrary OSS repo there is no such image. A fresh repo would need:
	- Clone at `base_commit`
	- Install the project's Python/Java/etc. toolchain
	- Apply `git apply -R <golden_diff>` to manufacture the broken state
	- Run `scrub_tree()` to strip caches
	- Build a Docker image that encapsulates this broken state
	- Push the image to a registry accessible by `DockerSandbox.boot()`

	None of this code exists. `DockerSandbox.boot(image)` raises `RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.")` if the image is absent.
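The missing builder amounts to an ordered plan. The sketch below only lists the steps as data and executes nothing. It assumes that the checkout is a state where the gold patch reverse-applies cleanly, that a Dockerfile installing the toolchain already exists in the working tree, and that the registry is reachable from the host. `broken_image_plan` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[List[str]]]


def broken_image_plan(repo_url: str, base_commit: str, patch_path: str,
                      image_tag: str, workdir: str = "work") -> List[Step]:
    """Return ordered (label, argv) steps; argv is None for in-process steps."""
    return [
        ("clone", ["git", "clone", repo_url, workdir]),
        ("checkout", ["git", "-C", workdir, "checkout", base_commit]),
        # patch_path should be absolute, because `git -C` changes directory first.
        ("revert", ["git", "-C", workdir, "apply", "-R", patch_path]),
        # scrub_tree(workdir) is a Python call, so there is no argv.
        # It has to run after the revert because it also removes .git.
        ("scrub", None),
        ("build", ["docker", "build", "-t", image_tag, workdir]),
        ("push", ["docker", "push", image_tag]),
    ]
```

The ordering constraint is the useful part: the scrub must come after every git operation and before the image is frozen, otherwise either the revert fails or the history leaks into the image.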

	### Break 2: `test_command` is hardcoded

	`SweBenchAdapter.default_test_command = "python -m pytest -q"`. A fresh repo may use `make test`, `npm test`, `cargo test`, `mvn verify`, or any other test runner. There is no test-discovery logic anywhere in the repo.
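A first approximation to test discovery would be a marker-file lookup. The sketch below is a heuristic under assumptions, not a design from the repo: `guess_test_command` and the marker table are hypothetical, the mapping covers only the runners named above, and `make test` presumes the Makefile actually defines a `test` target.

```python
# (marker file, test command) pairs, checked in order. Hypothetical table.
MARKERS = (
    ("Cargo.toml", "cargo test"),
    ("package.json", "npm test"),
    ("pom.xml", "mvn verify"),
    ("Makefile", "make test"),
)
DEFAULT_TEST_COMMAND = "python -m pytest -q"  # the adapter's current hardcoded value


def guess_test_command(root_filenames) -> str:
    """Pick a test command from the file names found at the repo root."""
    names = set(root_filenames)
    for marker, command in MARKERS:
        if marker in names:
            return command
    return DEFAULT_TEST_COMMAND
```

A lookup like this would still need the 4-gate validator behind it, since a guessed command that runs zero tests looks the same as a broken repo.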

	### Break 3: `fail_to_pass` and `pass_to_pass` require pre-existing test labels

	SWE-bench instances ship with `FAIL_TO_PASS` and `PASS_TO_PASS` as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived — e.g., via coverage analysis or AST-reachability. `FeatureDeletionTask.__post_init__` raises `ValueError` if `fail_to_pass` is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.
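One way to derive the labels without coverage or AST analysis is to run the suite twice and difference the outcomes. This is a swapped-in technique, not the coverage or AST-reachability approach mentioned above, and it assumes per-test results are available as `{test_id: passed}` for both the intact and the broken tree. `derive_labels` is a hypothetical name.

```python
def derive_labels(intact: dict, broken: dict):
    """Split tests that pass on the intact tree by their outcome on the broken tree."""
    fail_to_pass = tuple(sorted(
        t for t, ok in intact.items() if ok and not broken.get(t, False)
    ))
    pass_to_pass = tuple(sorted(
        t for t, ok in intact.items() if ok and broken.get(t, False)
    ))
    return fail_to_pass, pass_to_pass
```

Tests that already fail on the intact tree land in neither set, and a test missing from the broken run (for example after a collection error) is counted as failing. An empty `fail_to_pass` result means the deletion broke nothing observable, which is exactly the case `__post_init__` rejects.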

	### Break 4: `deleted_symbols` is never populated

	`SweBenchAdapter` hardcodes `deleted_symbols=()`. The `HackMonitor._patch_provenance_hack()` check (`monitor.py:157-182`) skips the symbol-reappearance test if `deleted_symbols` is empty — so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.
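Both halves of this break can be shown in a few lines. The first two functions are a minimal sketch of what populating `deleted_symbols` could look like for Python sources, assuming only top-level functions and classes matter and that both versions of the file parse; the names are hypothetical. The third function reuses the reappearance expression quoted in the cross-checks of this document to show why an empty tuple disables the check.

```python
import ast


def top_level_symbols(source: str) -> set:
    """Names of top-level functions and classes in one Python source string."""
    tree = ast.parse(source)
    kinds = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return {node.name for node in tree.body if isinstance(node, kinds)}


def deleted_symbols_between(intact_src: str, broken_src: str) -> tuple:
    """Symbols present in the intact file but absent from the broken one."""
    return tuple(sorted(top_level_symbols(intact_src) - top_level_symbols(broken_src)))


def symbols_reappeared(deleted_symbols, patch: str) -> list:
    # With deleted_symbols == () this list is always empty, so the check never fires.
    return [s for s in deleted_symbols if s and s in patch]
```

Methods, nested definitions, and non-Python languages are outside this sketch and would need their own extraction.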

	### Break 5: No copyleft scrub for arbitrary repos

	`is_redistributable()` reads `upstream_license` from `instance["license_name"]`. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.

	### Break 6: No env setup for non-Python repos

	`LocalSubprocessSandbox.run_tests` runs `subprocess.run(cmd, shell=True, ...)` against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. `DockerSandbox` depends on a pre-baked image that already has the environment. A fresh Python repo would need `pip install -e .` run inside the image, and a non-Python repo would need a completely different image and test runner.

	---

	## (5) What ingestion/claude_code.py Can Ingest Today

	`ClaudeCodeIngester.ingest(path)` handles exactly one format: Claude Code session JSONL at `~/.claude/projects/<encoded>/<sessionId>.jsonl`.

	Supported record types handled:
	- `type="user"` — string content or list of text/tool_result blocks → OpenAI-style user message; `tool_error` structural flag set if any `tool_result` block has `is_error: true`
	- `type="assistant"` — list of text/thinking/tool_use blocks → one `TraceState` with `student_action` (full blocks including thinking) and `messages` (history, optionally with thinking stripped)

	Record types silently skipped:
	- `type="summary"` — Claude Code conversation summary records
	- `type="attachment"`, `"queue-operation"`, `"file-history-snapshot"`, `"last-prompt"`, `"system"` — auxiliary records
	- `isSidechain: True` records — subagent traces (skipped in v0.1 per ADR-002)
	- Files starting with `agent-` — subagent session files by naming convention
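The skip rules reduce to two predicates. A minimal sketch, assuming `type` and `isSidechain` are top-level keys of each JSONL record as the notation above suggests; `SKIPPED_TYPES`, `should_skip_file`, and `should_skip_record` are hypothetical names rather than the ingester's own.

```python
from pathlib import Path

# Auxiliary record types listed above as silently skipped.
SKIPPED_TYPES = frozenset({
    "summary", "attachment", "queue-operation",
    "file-history-snapshot", "last-prompt", "system",
})


def should_skip_file(path: Path) -> bool:
    """Subagent session files are identified by their name prefix."""
    return path.name.startswith("agent-")


def should_skip_record(record: dict) -> bool:
    """Skip sidechain records and auxiliary record types."""
    if record.get("isSidechain") is True:
        return True
    return record.get("type") in SKIPPED_TYPES
```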

	Structural features:
	- `state_id = f"{path.stem}::{state_idx:04d}"` — stable within-session identifier
	- `strip_thinking` flag (default True) — strips `[THINKING] ...` lines from the teacher-facing `messages` history but keeps them in `student_action`
	- Injects synthetic system prompt at `messages[0]` (`"You are a senior software engineer..."`)
	- Version check: warns on schema version outside `2.x.x`

	NOT handled by this ingester:
	- OpenHands trajectory format (planned for v0.2 per ADR-002)
	- SWE-smith trajectories (planned for v0.2)
	- Cline VS Code export
	- Aider chat history
	- SWE-bench leaderboard trajectory submissions
	- Any binary or non-JSONL format

	---

	## Critical Cross-Checks: What the Repo Claims vs What Exists

	### Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")
	What the blog says (COMPOSER_RECIPE_MAPPING.md): "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests."
	What the repo does: Inverts existing SWE-bench-shaped instances — reverts their gold patch. There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones. The ADR correctly scopes this as "Option A — invert OSS substrates" vs "Option B — greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was explicitly rejected.

	### Claim 2: "25× synthetic data"
	What the blog says: Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2).
	What the repo has: A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the training run; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.

	### Claim 3: "Dynamic difficulty curriculum — select for AND create harder tasks"
	What Composer 2.5 says: "We both select for and create harder tasks dynamically throughout the run."
	What the repo has: The SELECT-FOR half: `DifficultyCurriculum` with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. `granularity` is set statically to `"feature"` for all SweBenchAdapter tasks; no escalation logic exists.
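The frontier weighting named above is easy to state as code. This sketch covers only the p̂(1−p̂) term; the retire/quarantine thresholds and the effort tilt are left out, and `frontier_weight` and `sampling_probs` are hypothetical names rather than `DifficultyCurriculum` methods.

```python
def frontier_weight(p_hat: float) -> float:
    """Weight peaks at p_hat = 0.5 and vanishes for always-solved or never-solved tasks."""
    return p_hat * (1.0 - p_hat)


def sampling_probs(pass_rates: dict) -> dict:
    """Normalize frontier weights into a sampling distribution over tasks."""
    if not pass_rates:
        return {}
    weights = {task: frontier_weight(p) for task, p in pass_rates.items()}
    total = sum(weights.values())
    if total == 0.0:
        # Every task is at p_hat 0 or 1: fall back to uniform (an assumption).
        return {task: 1.0 / len(weights) for task in weights}
    return {task: w / total for task, w in weights.items()}
```

This only reweights tasks that already exist, which is the SELECT-FOR half. It cannot raise the ceiling once every remaining task has drifted toward p̂ = 1, which is the case the CREATE half is meant to address.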

	### Claim 4: `deleted_symbols` enables AST-provenance monitoring
	What ADR-010 says: "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads.
	Reality: `deleted_symbols=()` on every `SweBenchAdapter`-derived task (line 81 in substrates.py: hardcoded empty tuple). `HackMonitor._patch_provenance_hack()` returns False immediately when `deleted_symbols` is empty (`reappeared = [s for s in deleted_symbols if s and s in patch]` → empty list). The provenance layer of the monitor is a dead code path for all currently-generable tasks.

	### Claim 5: The tree controller and world-model head are part of the system
	What design docs say: "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table).
	Reality: The tree controller is 0% built (no file, no function, no class). An exhaustive grep confirms this: no `SiblingBootstrap`, `world_model`, `WorldModel`, `next_state_head`, `tree_controller`, `MCTS`, `deliberate_token` anywhere in `composer_replication/`. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) is entirely design.

	### Claim 6: The broken-repo image is manufactured by the pipeline
	What design-F2 says: Step c2 involves "pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()`, run the test command, confirm FAIL_TO_PASS actually fails."
	Reality: This describes what SHOULD happen in the Batch array child. No such code is written. `SweBenchAdapter.image_for()` returns a string tag; that tag must be pre-pulled on the host before `DockerSandbox.boot()` can use it (`RuntimeError` on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.

	---

	## Summary of Unbuilt vs Built

	### BUILT and tested (production-ready CPU, Docker-gated where noted):
	- `FeatureDeletionTask` schema + `FeatureDeletionEnv` (reset/step/_grade/reward_fn)
	- `SweBenchAdapter` schema inversion (pure dict transform)
	- `FakeSandbox`, `LocalSubprocessSandbox`, `DockerSandbox` (hardware-gated e2e green in Wave 1/2)
	- `scrub_tree()` primary reward-hack control
	- `HackMonitor` (signature + patch-provenance, obfuscation-resistant)
	- `DifficultyCurriculum` (SELECT-FOR half + effort tilt)
	- `validate_task()` 4-gate solvability validator
	- `ClaudeCodeIngester` (Claude Code JSONL only)
	- `behavior_rewards.py` — `c_length`, `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (Wave 20)
	- `kl_in_reward.py` — k1-in-reward path opt-in (Wave 20)
	- `HeldOutGuard` + `HeldoutSplit` + wired into trainer (Wave 2/3)
	- `EKSExecutor` skeleton + `SageMakerExecutor` skeleton (Wave 2)

	### DESIGN-ONLY (no code):
	- Tree controller (`datagen/tree_controller.py`)
	- `SiblingBootstrapGenerator` in `hint_generator.py`
	- `pipeline/s3_layout.py`, `pipeline/sft_floor.py`
	- `teacher_replay_bedrock.py` (BedrockBatchTeacherPool)
	- `datagen/aws/batch_validate.py`, `datagen/aws/glue_ingest_job.py`, `datagen/aws/s3_contract.py`
	- `replaysim/emr_normalize_job.py`
	- `infra/datagen_stepfunctions.json`, `infra/datagen_stack.py`
	- World-model next-state head in trainer
	- Argo Workflows outer-loop controller
	- Broken-repo image builder (clone → git apply -R → build → push)
	- CREATE half of difficulty curriculum (mint harder tasks during run)
	- SFT-first training stage
	- Offline LLM-judge hack monitor