Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
2a16b30
# Grounding Map: Dataset-Generation Pipeline — What Exists vs What Is Claimed vs What Is Envisioned
**Agent:** REPO-GROUNDING
**Date:** 2026-06-09
**Scope:** composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py,
hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5,
research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md,
docs/BACKLOG_RESOLUTION_2026-06-09.md
---
## (1) Exact Current Dataset-Generation Capability
### FeatureDeletionTask schema (`datagen/schema.py`)
The schema's twelve fields and what produces each today:
| Field | Type | Producer today | Notes |
|---|---|---|---|
| `task_id` | `str` | `SweBenchAdapter.to_task()` — copied from `instance["instance_id"]` or `instance["task_id"]` | `"unknown"` if missing |
| `repo` | `str` | `instance["repo"]` via `SweBenchAdapter.to_task()` | e.g. `"getmoto/moto"` |
| `base_commit` | `str` | `instance["base_commit"]` | no code to `git checkout` this commit exists today |
| `broken_image` | `str` | `SweBenchAdapter.image_for(instance)` — either `instance["docker_image"]` (SWE-rebench) or the conventional `swebench/sweb.eval.x86_64.{iid}:latest` | This tag is a **pre-built SWE-bench eval image**; no code in the repo pulls or builds these images |
| `fail_to_pass` | `tuple[str,...]` | `_as_tuple(instance["FAIL_TO_PASS"])` — handles JSON-encoded string OR list | validated non-empty in `__post_init__` |
| `pass_to_pass` | `tuple[str,...]` | `_as_tuple(instance["PASS_TO_PASS"])` | may be empty |
| `test_command` | `str` | `SweBenchAdapter.default_test_command` = `"python -m pytest -q"` | hardcoded; not read from instance |
| `deleted_symbols` | `tuple[str,...]` | **never populated by SweBenchAdapter** — hardcoded `()` in every substrate inversion | the monitor can't do symbol-provenance checks without this |
| `golden_diff` | `str` | `instance["patch"]` | held out of repr; used only by validator |
| `granularity` | `str` | hardcoded `"feature"` in `SweBenchAdapter.to_task()` | CREATE-half escalation (function→file→feature) not wired to anything |
| `difficulty_prior` | `float` | `instance["difficulty"]` if present (SWE-rebench) else `0.5` | |
| `upstream_license` | `str` | `instance["license_name"]` | copyleft filter in `is_redistributable()` strips GPL/AGPL/LGPL |
### What SweBenchAdapter actually does and does NOT do
`SweBenchAdapter.to_task(instance: dict)` is a **pure schema inversion** — it takes one SWE-bench-shaped dict and maps it to a `FeatureDeletionTask`. It does NOT:
- Pull or build a Docker image
- Apply the gold patch in reverse (`git apply -R`)
- Run any tests
- Discover test node IDs
- Populate `deleted_symbols` (always empty)
- Escalate `granularity` beyond the static `"feature"`
The broken-repo Docker image is **assumed to exist pre-built** (the SWE-bench project publishes these images; SWE-rebench carries its own `docker_image` field). The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented `[~]` gate in ADR-010. It is implemented only in part: the 4-gate validator interface exists, and `scrub_tree`, `LocalSubprocessSandbox`, and `DockerSandbox` are built, but no code in the repo actually clones a repo, runs `git apply -R <gold_patch>`, builds a Docker image, and pushes it to a registry.
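
To make "pure schema inversion" concrete, here is a minimal sketch, not the repository's code. The `TaskSketch` dataclass, the `to_task_sketch` helper, and the trimmed field set are assumptions for illustration; only the field mappings and defaults come from the table above.

```python
import json
from dataclasses import dataclass, field


def _as_tuple(value) -> tuple:
    """Accept a JSON-encoded string or a list; return a tuple of strings."""
    if not value:
        return ()
    if isinstance(value, str):
        value = json.loads(value)  # e.g. '["tests/test_a.py::test_x"]'
    return tuple(str(v) for v in value)


@dataclass(frozen=True)
class TaskSketch:
    """Trimmed, hypothetical stand-in for FeatureDeletionTask."""
    task_id: str
    repo: str
    base_commit: str
    broken_image: str
    fail_to_pass: tuple
    pass_to_pass: tuple = ()
    test_command: str = "python -m pytest -q"  # hardcoded, not read from the instance
    deleted_symbols: tuple = ()                # never populated by the adapter
    golden_diff: str = field(default="", repr=False)  # held out of repr
    granularity: str = "feature"

    def __post_init__(self):
        if not self.fail_to_pass:
            raise ValueError("fail_to_pass must be non-empty")


def image_for(instance: dict) -> str:
    """Return an image tag only; nothing is pulled or built."""
    iid = instance.get("instance_id") or instance.get("task_id") or "unknown"
    return instance.get("docker_image") or f"swebench/sweb.eval.x86_64.{iid}:latest"


def to_task_sketch(instance: dict) -> TaskSketch:
    """Map one SWE-bench-shaped dict to a task. No I/O, no git, no tests."""
    return TaskSketch(
        task_id=instance.get("instance_id") or instance.get("task_id") or "unknown",
        repo=instance["repo"],
        base_commit=instance["base_commit"],
        broken_image=image_for(instance),
        fail_to_pass=_as_tuple(instance.get("FAIL_TO_PASS")),
        pass_to_pass=_as_tuple(instance.get("PASS_TO_PASS")),
        golden_diff=instance.get("patch", ""),
    )
```

Everything the sketch returns is derived from the input dict, which is why the image tag can name an image that does not exist on the host.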
### What FeatureDeletionEnv does during training (`datagen/env.py`)
- `reset(task)` — boots the sandbox (by image tag), returns a text prompt listing failing tests. The prompt exposes `task.repo`, `task.fail_to_pass`, `task.test_command` but NEVER `golden_diff` or `deleted_symbols`.
- `step(action)` — delegates to `sandbox.exec(action)`, returning observation text; grades on `submit` or turn limit.
- `_grade()` — runs `sandbox.run_tests(test_command, fail_to_pass + pass_to_pass)`, computes pass-fraction over `fail_to_pass`, gates to 0 if `pass_to_pass` guard is broken OR `HackMonitor.flag()` fires.
- `reward_fn(prompts, completions, *, task_id, **kwargs)` — TRL `RewardFunc` face; dispatches through `reset`/`step`; feeds fractional credit (not binary) to `DifficultyCurriculum.update`.
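
The grading rule described for `_grade()` can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in `datagen/env.py`: the function name `grade_sketch` and the `results` mapping (test node ID to pass/fail) are hypothetical.

```python
def grade_sketch(results: dict, fail_to_pass, pass_to_pass, hack_flagged: bool) -> float:
    """Fractional credit over fail_to_pass, gated to 0 by the guards."""
    if hack_flagged:
        return 0.0  # the hack monitor fired
    if not all(results.get(test, False) for test in pass_to_pass):
        return 0.0  # the pass_to_pass guard is broken
    if not fail_to_pass:
        return 0.0  # defensive; the schema already rejects an empty tuple
    passed = sum(1 for test in fail_to_pass if results.get(test, False))
    return passed / len(fail_to_pass)
```

Because the return value is a fraction and not a 0/1 flag, a curriculum that consumes it sees partial progress on multi-test tasks.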
### Safeguards implemented
- `scrub_tree(workdir)` — physically removes `__pycache__`, `.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, `*.pyc/.pyo/.class` before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
- `SANDBOX_DENYLIST` — blocks `find`, `strings`, `unzip`, `jar`, `javap`, decompilers, `git`. First-token-only check; bypassable via `sh -c "..."`. Documented as defense-in-depth, NOT the wall.
- `HackMonitor.flag()` — layer 1: substring scan for cache/decompiler signatures in trajectory actions (not in `submit_patch`). Layer 2: patch-provenance — if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat `"__py"+"cache__"` obfuscation), flags the trajectory.
- `DockerSandbox`: `network_mode='none'`, `read_only=True`, `cap_drop=['ALL']`, `no-new-privileges`, `pids_limit=256`, `mem_limit=1g`, optional gVisor `runtime='runsc'`.
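
The following sketch shows why a first-token-only denylist is defense-in-depth only. It is an illustration, not the repository's implementation: the set name, the function name, and the use of `shlex` for tokenizing are assumptions, and the set holds only the tools named above.

```python
import shlex

# Hypothetical subset of the denylist, limited to tools named in this document.
SANDBOX_DENYLIST_SKETCH = frozenset({"find", "strings", "unzip", "jar", "javap", "git"})


def first_token_denied(command: str) -> bool:
    """Return True if the first shell token of the command is denylisted."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in SANDBOX_DENYLIST_SKETCH
```

A direct `git log` is blocked, but `sh -c "git log"` passes because its first token is `sh`. That gap is the reason `scrub_tree` has to remove the artifacts themselves.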
### What ingestion/claude_code.py can ingest today
`ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]`:
- Input: Claude Code session JSONL at `~/.claude/projects/<encoded>/<sessionId>.jsonl`
- Output: one `TraceState` per assistant TURN (`state_id`, `messages`, `student_action`)
- Skips: subagent files (`agent-` prefix), sidechain records (`isSidechain: True`), `summary` / `attachment` / `queue-operation` / `file-history-snapshot` records
- `student_action`: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if `strip_thinking=True`)
- `tool_error` flag: structurally set on `user` messages where any `tool_result` block has `is_error: true` — this is the SDPO error-site detection signal
- `state_id`: `f"{path.stem}::{state_idx:04d}"`
- Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL
---
## (2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)
From `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §5/§8/§10.
1. **Seed trace ingestion (Stage a):** `ClaudeCodeIngester.ingest()` over `s3://composer-datagen-386931836011-us-west-2/raw/claude_code/**/*.jsonl` → Parquet at `traces/v1/run_id=<id>/part-*.parquet` via AWS Glue 5.0 Spark ETL job (`glue_ingest_job.py`, ~80 LOC, NOT YET BUILT).
2. **Schema inversion (Stage c1):** `SweBenchAdapter.to_task()` per SWE-bench row → `FeatureDeletionTask` JSONL at `tasks/v1/run_id=<id>/manifest.jsonl` (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (`is_redistributable()`) applied here.
3. **N-teacher replay (Stage b):** `teacher_replay.replay_trace()` generalized from flat OpenRouter to `BedrockBatchTeacherPool`: write one shared `replay/v1/run_id=<id>/input/states.jsonl`, submit one `CreateModelInvocationJob` per teacher, write `.jsonl.out` per teacher to `replay/v1/.../teacher=<slug>/`. An EMR Serverless aggregation step joins all N outputs by `state_id` → `list[TeacherCallResult]`. (`teacher_replay_bedrock.py`, ~180 LOC, NOT YET BUILT).
4. **Multi-model tree expansion (the core delta — NOT BUILT):** A `tree_controller.py` (~250–350 LOC, design-only) that, for each `TraceState` node, fires N models, applies each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branches again from the new state, grades leaves with `_grade()`. Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see step 8).
5. **Sandbox materialization + 4-gate validation (Stage c2):** AWS Batch array jobs on EC2 Spot, one child per task. Each child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task in the S3 manifest, boots `DockerSandbox`/`LocalSubprocessSandbox`, runs `validator.validate_task()` (4 gates), writes `task_grades/v1/run_id=<id>/<task_id>.json`. (`datagen/aws/batch_validate.py`, ~120 LOC, NOT YET BUILT).
6. **DPO pair extraction + normalization (Stage d):** `extract_dpo_pairs()` (already built in `teacher_replay.py`) on the fan-in of teacher outputs → `DPOPair` rows → `DJNormalizer` data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → `corpus/v1/run_id=<id>/dpo/part-*.parquet` and `corpus/sft/part-*.parquet`. (`replaysim/emr_normalize_job.py`, ~100 LOC, NOT YET BUILT).
7. **Orchestration:** AWS Step Functions Standard Workflow: `Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda)`. (`infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`, ~250 LOC IaC, NOT YET BUILT).
8. **S3 typed dataset contract (full set):**
- `raw/claude_code/**/*.jsonl` — input seed traces
- `traces/v1/run_id=<id>/part-*.parquet` — TraceState rows (Stage a output)
- `tasks/v1/run_id=<id>/manifest.jsonl` — FeatureDeletionTask rows (Stage c1 output)
- `tasks/golden/run_id=<id>/` — golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
- `replay/v1/run_id=<id>/input/states.jsonl` — shared Bedrock batch input
- `replay/v1/run_id=<id>/teacher=<slug>/*.jsonl.out` — per-teacher Bedrock batch output
- `task_grades/v1/run_id=<id>/<task_id>.json` — validator + _grade() results
- `corpus/v1/run_id=<id>/sft/part-*.parquet` — clean winning trajectories (SFT-first floor)
- `corpus/v1/run_id=<id>/dpo/part-*.parquet` — DPO pairs (normalized DPOPair)
- `dpo_pairs/` — divergence-derived DPO pairs from the tree (sibling winners vs losers)
- `rl_task_pool/` — FeatureDeletionTask registry + DifficultyCurriculum priors
- `divergence_pairs/` — divergence-annotated nodes (where sibling next-action distributions forked)
- `wm_tuples/` — (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
- `holdout/` — disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
- `diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt` — DiLoCo outer-sync (already used by existing allreduce.py)
- `manifests/run_id=<id>.json` — run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)
9. **SFT-first stage:** Read `sft_corpus/` (clean `_grade()` gate-1 passing trajectories), run `compose_loss` with `alpha_sdpo=0, beta_replay=0` (reduces to `_lm_response_ce` — next-token CE masked to response tokens), write `ckpt_sft/`. (`pipeline/sft_floor.py`, ~60 LOC, NOT YET BUILT).
10. **Inner RL loop:** `ComposerReplicationTrainer` (trl.GRPOTrainer subclass) on `rl_task_pool/` with `FeatureDeletionEnv.reward_fn`; `total = grpo + α·sdpo + β·trace_replay_dpo`; DiLoCo outer-sync via S3; `HeldOutGuard` kill-switch now wired (Wave 3).
11. **Flywheel:** Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.
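
The divergence gate in step 4 can be illustrated with a short sketch. The design documents quoted here do not specify the divergence measure, so the Jensen-Shannon divergence, the `threshold` default, and the representation of a next-action distribution as a dict over action labels are all assumptions made for this example.

```python
import math
from itertools import combinations


def ungated_leaf_count(n_models: int, depth: int) -> int:
    """Leaves of a full tree with n_models branches per node: N^D."""
    return n_models ** depth


def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence in bits (range 0 to 1) of two discrete distributions."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict) -> float:
        # mix[k] > 0 whenever a[k] > 0, so the ratio is always defined.
        return sum(a[k] * math.log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def should_expand(sibling_dists: list, threshold: float = 0.2) -> bool:
    """Expand a node only if some pair of sibling distributions disagrees enough."""
    if len(sibling_dists) < 2:
        return False
    worst = max(js_divergence(p, q) for p, q in combinations(sibling_dists, 2))
    return worst >= threshold
```

With 4 models and depth 5, an ungated tree has 4^5 = 1024 leaves, each requiring real environment steps; gating prunes nodes where the teachers already agree.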
---
## (3) Unbuilt Components the Vision Depends On
Every item below is design-only or a skeleton; none has real production code.
| Component | File Estimate | Source | Status |
|---|---|---|---|
| `datagen/tree_controller.py` — the core delta: env-step between branches, `_grade()` at leaves, divergence-gated expansion, six typed S3 prefix writes | ~250–350 LOC | design-F1, final_report §1/§5/§6 | **0% built** — no file exists |
| `SiblingBootstrapGenerator` in `hint_generator.py` — select max-reward sibling → emit "a working approach looks like: …" → feed `ctx_teacher` splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | **0% built** — not a class in hint_generator.py at all |
| `pipeline/s3_layout.py` — typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | **0% built** — no `pipeline/` directory exists |
| `pipeline/sft_floor.py` — SFT-first driver: read `sft_corpus/`, run TRL SFTTrainer or `compose_loss` `alpha=beta=0`, write `ckpt_sft/` | ~60 LOC | design-F1 §2 / design-F5 d | **0% built** |
| `teacher_replay_bedrock.py` (`BedrockBatchTeacherPool`): submit one Bedrock `CreateModelInvocationJob` per teacher, poll, parse `.jsonl.out` back into `list[TeacherCallResult]` | ~180 LOC | design-F2 §b | **0% built** |
| `datagen/aws/batch_validate.py`: AWS Batch array-child entrypoint that reads `AWS_BATCH_JOB_ARRAY_INDEX` → manifest line → `DockerSandbox` + `validator` + `_grade()` → writes `task_grades/` | ~120 LOC | design-F2 §c2 | **0% built**; the `datagen/aws/` subdirectory does not exist |
| `datagen/aws/glue_ingest_job.py` — Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; write `traces/` Parquet | ~80 LOC | design-F2 §a | **0% built** |
| `replaysim/emr_normalize_job.py` — EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; write `corpus/dpo/` + `corpus/sft/` Parquet | ~100 LOC | design-F2 §d | **0% built** |
| `datagen/aws/s3_contract.py` — S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers, `recordId==state_id` join helpers, `schema_version`/`split` column injection | ~120 LOC | design-F2 §contract | **0% built** |
| `infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`: Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | **0% built**; the `infra/` directory does not exist |
| `trainer/composer_trainer.py` world-model head — parameter-isolated next-state adapter + `<deliberate>` token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | **0% built** — grep confirms no `world_model`/`WorldModel`/`next_state_head`/`<deliberate>` anywhere in `composer_replication/` |
| Broken-repo image builder — code to clone a repo at `base_commit`, apply `git apply -R <golden_diff>`, run `scrub_tree`, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | **0% built** — there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch |
| `EKSExecutor` (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | **skeleton built**: `eks.py` is a functional executor (IndexedJob dispatch), but the Argo outer-loop controller is 0% built |
| `verl AsyncServer` backend for tool-heavy tree | — | final_report §8 | **0% built** — design note only |
| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | — | design-F5 §Tier 4 | **0% built** |
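
Since `datagen/aws/s3_contract.py` does not exist, the sketch below shows one shape the layout constants could take. The class and method names are hypothetical; only the key patterns are taken from the S3 contract listed in section (2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunLayoutSketch:
    """Hypothetical key builder for one run's S3 prefixes."""
    bucket: str
    run_id: str
    version: str = "v1"

    def _uri(self, key: str) -> str:
        return f"s3://{self.bucket}/{key}"

    def traces_prefix(self) -> str:
        return self._uri(f"traces/{self.version}/run_id={self.run_id}/")

    def task_manifest(self) -> str:
        return self._uri(f"tasks/{self.version}/run_id={self.run_id}/manifest.jsonl")

    def golden_prefix(self) -> str:
        # Deny-by-default prefix holding golden_diff.
        return self._uri(f"tasks/golden/run_id={self.run_id}/")

    def replay_input(self) -> str:
        return self._uri(f"replay/{self.version}/run_id={self.run_id}/input/states.jsonl")

    def teacher_output_prefix(self, teacher_slug: str) -> str:
        return self._uri(
            f"replay/{self.version}/run_id={self.run_id}/teacher={teacher_slug}/"
        )

    def task_grade(self, task_id: str) -> str:
        return self._uri(f"task_grades/{self.version}/run_id={self.run_id}/{task_id}.json")

    def run_manifest(self) -> str:
        return self._uri(f"manifests/run_id={self.run_id}.json")
```

Centralizing the key patterns in one object would let every stage (Glue, Batch, EMR) agree on paths without string duplication, which is the role the design assigns to the contract module.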
---
## (4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code
The `SweBenchAdapter` is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:
### Break 1: `broken_image` assumes a pre-built SWE-bench image exists
`SweBenchAdapter.image_for()` returns either `instance["docker_image"]` (SWE-rebench) or the convention `swebench/sweb.eval.x86_64.{iid}:latest`. For an arbitrary OSS repo there is no such image. A fresh repo would need:
- Clone at `base_commit`
- Install the project's Python/Java/etc. toolchain
- Apply `git apply -R <golden_diff>` to manufacture the broken state
- Run `scrub_tree()` to strip caches
- Build a Docker image that encapsulates this broken state
- Push the image to a registry accessible by `DockerSandbox.boot()`
None of this code exists. `DockerSandbox.boot(image)` raises `RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.")` if the image is absent.
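
As a dry-run illustration of the missing builder, the sketch below only assembles the commands for the steps listed above and executes nothing. The function name and parameters are hypothetical. It also assumes that a Dockerfile installing the toolchain already sits in the working tree and that `patch_path` is absolute, since `git -C` changes directory before resolving it.

```python
def plan_broken_image_build(
    repo_url: str,
    base_commit: str,
    patch_path: str,
    image_tag: str,
    workdir: str = "work",
) -> list:
    """Return the argv lists for clone, revert, build, and push, in order."""
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", base_commit],
        # Reverse-apply the gold patch to manufacture the broken state.
        ["git", "-C", workdir, "apply", "-R", patch_path],
        # scrub_tree(workdir) would run here; it is a Python call, not a command.
        ["docker", "build", "-t", image_tag, workdir],
        ["docker", "push", image_tag],
    ]
```

Each list could be passed to `subprocess.run(argv, check=True)`; returning a plan first keeps the step order reviewable before anything touches a registry.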
### Break 2: `test_command` is hardcoded
`SweBenchAdapter.default_test_command = "python -m pytest -q"`. A fresh repo may use `make test`, `npm test`, `cargo test`, `mvn verify`, or any other test runner. There is no test-discovery logic anywhere in the repo.
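
One simple form of test discovery is a marker-file heuristic. The sketch below is an assumption about how such logic might look, not something the repo contains; the marker table, its ordering, and the function name are hypothetical.

```python
from pathlib import Path

# Hypothetical marker files, checked in order; the first match wins.
_MARKERS = (
    ("Cargo.toml", "cargo test"),
    ("pom.xml", "mvn verify"),
    ("package.json", "npm test"),
    ("pyproject.toml", "python -m pytest -q"),
    ("setup.py", "python -m pytest -q"),
    ("Makefile", "make test"),
)


def guess_test_command(repo_root, default: str = "python -m pytest -q") -> str:
    """Guess a test command from well-known build files in the repo root."""
    root = Path(repo_root)
    for marker, command in _MARKERS:
        if (root / marker).is_file():
            return command
    return default
```

A heuristic like this only proposes a command. Whether the command actually runs the relevant tests would still need to be confirmed inside the sandbox.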
### Break 3: `fail_to_pass` and `pass_to_pass` require pre-existing test labels
SWE-bench instances ship with `FAIL_TO_PASS` and `PASS_TO_PASS` as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived — e.g., via coverage analysis or AST-reachability. `FeatureDeletionTask.__post_init__` raises `ValueError` if `fail_to_pass` is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.
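
Besides coverage analysis or AST-reachability, a simpler way to derive the labels is to diff test outcomes between the pristine and broken trees. The sketch below shows that alternative, assuming both trees can be run and that outcomes arrive as a mapping from test node ID to pass/fail; the function name is hypothetical.

```python
def derive_labels(pristine: dict, broken: dict) -> tuple:
    """Split tests that pass on the pristine tree by their outcome on the broken tree."""
    fail_to_pass = tuple(sorted(
        test for test, ok in pristine.items()
        if ok and not broken.get(test, False)  # broken by the deletion
    ))
    pass_to_pass = tuple(sorted(
        test for test, ok in pristine.items()
        if ok and broken.get(test, False)  # unaffected; serves as the guard set
    ))
    return fail_to_pass, pass_to_pass
```

Tests that already fail on the pristine tree land in neither set, so flaky or pre-broken tests do not leak into the reward.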
### Break 4: `deleted_symbols` is never populated
`SweBenchAdapter` hardcodes `deleted_symbols=()`. The `HackMonitor._patch_provenance_hack()` check (`monitor.py:157-182`) skips the symbol-reappearance test if `deleted_symbols` is empty — so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.
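
For Python sources, the AST analysis mentioned above could be as small as a set difference over definition names. This is a minimal sketch with hypothetical function names; it compares one file before and after the deletion and ignores renames, moved definitions, and non-Python languages.

```python
import ast


def defined_names(source: str) -> set:
    """Collect names of every function and class definition in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
    return names


def deleted_symbols_sketch(pristine_source: str, broken_source: str) -> tuple:
    """Names defined in the pristine file but absent from the broken one."""
    return tuple(sorted(defined_names(pristine_source) - defined_names(broken_source)))
```

The resulting tuple is the kind of value `deleted_symbols` would need to hold for the provenance layer to have anything to check.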
### Break 5: No copyleft scrub for arbitrary repos
`is_redistributable()` reads `upstream_license` from `instance["license_name"]`. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.
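
The filter itself is easy to sketch; the hard part for a fresh repo is obtaining the license string in the first place. The function below is an illustration only. The substring matching and, in particular, the choice to reject a missing license are assumptions, not the behavior of the repo's `is_redistributable()`.

```python
# Markers for the license families the document says are stripped.
_COPYLEFT_MARKERS = ("AGPL", "LGPL", "GPL")


def is_redistributable_sketch(license_name) -> bool:
    """Reject GPL-family licenses; treat an unknown license as not redistributable."""
    if not license_name:
        return False  # assumption: no detected license means no redistribution
    upper = license_name.upper()
    return not any(marker in upper for marker in _COPYLEFT_MARKERS)
```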
### Break 6: No env setup for non-Python repos
`LocalSubprocessSandbox.run_tests` runs `subprocess.run(cmd, shell=True, ...)` against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. `DockerSandbox` depends on a pre-baked image that already has the environment. A fresh Python repo would need `pip install -e .` run inside the image, and a non-Python repo would need a completely different image and test runner.
---
## (5) What ingestion/claude_code.py Can Ingest Today
`ClaudeCodeIngester.ingest(path)` handles exactly one format: **Claude Code session JSONL** at `~/.claude/projects/<encoded>/<sessionId>.jsonl`.
Supported record types handled:
- `type="user"` — string content or list of text/tool_result blocks → OpenAI-style user message; `tool_error` structural flag set if any `tool_result` block has `is_error: true`
- `type="assistant"` — list of text/thinking/tool_use blocks → one `TraceState` with `student_action` (full blocks including thinking) and `messages` (history, optionally with thinking stripped)
Record types silently skipped:
- `type="summary"` — Claude Code conversation summary records
- `type="attachment"`, `"queue-operation"`, `"file-history-snapshot"`, `"last-prompt"`, `"system"` — auxiliary records
- `isSidechain: True` records — subagent traces (skipped in v0.1 per ADR-002)
- Files starting with `agent-` — subagent session files by naming convention
Structural features:
- `state_id = f"{path.stem}::{state_idx:04d}"` — stable within-session identifier
- `strip_thinking` flag (default True) — strips `[THINKING] ...` lines from the teacher-facing `messages` history but keeps them in `student_action`
- Injects synthetic system prompt at `messages[0]` (`"You are a senior software engineer..."`)
- Version check: warns on schema version outside `2.x.x`
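
The identifier format and the structural `tool_error` check can be illustrated as follows. The helper names are hypothetical, and the assumed record layout (content blocks under `record["message"]["content"]`) is an assumption about the session JSONL, not something this document specifies.

```python
from pathlib import Path


def make_state_id(path: Path, state_idx: int) -> str:
    """Session file stem plus a zero-padded turn index."""
    return f"{path.stem}::{state_idx:04d}"


def has_tool_error(record: dict) -> bool:
    """True if a user record carries any tool_result block with is_error set."""
    if record.get("type") != "user":
        return False
    content = record.get("message", {}).get("content")
    if not isinstance(content, list):
        return False  # plain string content cannot carry tool results
    return any(
        isinstance(block, dict)
        and block.get("type") == "tool_result"
        and block.get("is_error") is True
        for block in content
    )


def should_skip(path: Path, record: dict) -> bool:
    """Skip subagent files and sidechain records, as described above."""
    return path.name.startswith("agent-") or record.get("isSidechain") is True
```

Because the error flag comes from block structure and not from matching error text, it is unaffected by how a tool words its failure output.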
NOT handled by this ingester:
- OpenHands trajectory format (planned for v0.2 per ADR-002)
- SWE-smith trajectories (planned for v0.2)
- Cline VS Code export
- Aider chat history
- SWE-bench leaderboard trajectory submissions
- Any binary or non-JSONL format
---
## Critical Cross-Checks: What the Repo Claims vs What Exists
### Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")
**What the blog says (COMPOSER_RECIPE_MAPPING.md):** "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests."
**What the repo does:** Inverts *existing* SWE-bench-shaped instances — reverts their gold patch. There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones. The ADR correctly scopes this as "Option A — invert OSS substrates" vs "Option B — greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was *explicitly rejected*.
### Claim 2: "25× synthetic data"
**What the blog says:** Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2).
**What the repo has:** A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the *training run*; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.
### Claim 3: "Dynamic difficulty curriculum — select for AND create harder tasks"
**What Composer 2.5 says:** "We both select for and create harder tasks dynamically throughout the run."
**What the repo has:** The SELECT-FOR half: `DifficultyCurriculum` with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. `granularity` is set statically to `"feature"` for all SweBenchAdapter tasks; no escalation logic exists.
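
The p̂(1−p̂) frontier weighting can be sketched in a few lines. This is not the `DifficultyCurriculum` implementation: the function names and the two threshold values are hypothetical, and the effort tilt is omitted.

```python
def frontier_weight(p_hat: float) -> float:
    """Weight peaks at p_hat = 0.5 and vanishes for always-solved or never-solved tasks."""
    return p_hat * (1.0 - p_hat)


def sampling_probs(
    pass_rates: dict,
    retire_above: float = 0.95,      # hypothetical threshold: task is too easy
    quarantine_below: float = 0.02,  # hypothetical threshold: task looks unsolvable
) -> dict:
    """Normalize frontier weights over tasks that are neither retired nor quarantined."""
    weights = {
        task_id: frontier_weight(p)
        for task_id, p in pass_rates.items()
        if quarantine_below <= p <= retire_above
    }
    total = sum(weights.values())
    if total == 0:
        return {}
    return {task_id: w / total for task_id, w in weights.items()}
```

This only reweights an existing pool. Nothing in it mints a harder task, which is the distinction between the SELECT-FOR half and the missing CREATE half.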
### Claim 4: `deleted_symbols` enables AST-provenance monitoring
**What ADR-010 says:** "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads.
**Reality:** `deleted_symbols=()` on every `SweBenchAdapter`-derived task (line 81 in substrates.py: hardcoded empty tuple). `HackMonitor._patch_provenance_hack()` returns False immediately when `deleted_symbols` is empty (`reappeared = [s for s in deleted_symbols if s and s in patch]` → empty list). The provenance layer of the monitor is a dead code path for all currently-generable tasks.
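
The no-op is easy to reproduce in isolation. The sketch below reuses the `reappeared` expression quoted above; the function name, the normalization step, and the two artifact signatures are assumptions standing in for the real logic in `monitor.py`.

```python
def patch_provenance_hack_sketch(deleted_symbols, patch: str, actions) -> bool:
    """Flag only if a deleted symbol reappears AND a cache artifact was read."""
    reappeared = [s for s in deleted_symbols if s and s in patch]
    if not reappeared:
        return False  # always taken when deleted_symbols is empty
    # Assumed normalization: drop quotes, plus signs, and spaces so that
    # '"__py"+"cache__"' collapses to '__pycache__'.
    joined = " ".join(actions)
    normalized = "".join(ch for ch in joined if ch not in "\"'+ ")
    return any(sig in normalized for sig in ("__pycache__", ".pyc"))
```

With `deleted_symbols=()` the function returns `False` regardless of what the trajectory did, which is the dead code path described above.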
### Claim 5: The tree controller and world-model head are part of the system
**What design docs say:** "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table).
**Reality:** The tree controller is 0% built: no file, no function, no class. An exhaustive grep confirms that none of `SiblingBootstrap`, `world_model`, `WorldModel`, `next_state_head`, `tree_controller`, `MCTS`, or `deliberate_token` appears anywhere in `composer_replication/`. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) exists only as design.
### Claim 6: The broken-repo image is manufactured by the pipeline
**What design-F2 says:** Step c2 involves "pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()`, run the test command, confirm FAIL_TO_PASS actually fails."
**Reality:** This describes what SHOULD happen in the Batch array child. No such code is written. `SweBenchAdapter.image_for()` returns a string tag; that tag must be pre-pulled on the host before `DockerSandbox.boot()` can use it (`RuntimeError` on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.
---
## Summary of Unbuilt vs Built
### BUILT and tested (production-ready CPU, Docker-gated where noted):
- `FeatureDeletionTask` schema + `FeatureDeletionEnv` (reset/step/_grade/reward_fn)
- `SweBenchAdapter` schema inversion (pure dict transform)
- `FakeSandbox`, `LocalSubprocessSandbox`, `DockerSandbox` (hardware-gated e2e green in Wave 1/2)
- `scrub_tree()` primary reward-hack control
- `HackMonitor` (signature + patch-provenance, obfuscation-resistant)
- `DifficultyCurriculum` (SELECT-FOR half + effort tilt)
- `validate_task()` 4-gate solvability validator
- `ClaudeCodeIngester` (Claude Code JSONL only)
- `behavior_rewards.py`: `c_length`, `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (Wave 20)
- `kl_in_reward.py` — k1-in-reward path opt-in (Wave 20)
- `HeldOutGuard` + `HeldoutSplit` + wired into trainer (Wave 2/3)
- `EKSExecutor` skeleton + `SageMakerExecutor` skeleton (Wave 2)
### DESIGN-ONLY (no code):
- Tree controller (`datagen/tree_controller.py`)
- `SiblingBootstrapGenerator` in `hint_generator.py`
- `pipeline/s3_layout.py`, `pipeline/sft_floor.py`
- `teacher_replay_bedrock.py` (BedrockBatchTeacherPool)
- `datagen/aws/batch_validate.py`, `datagen/aws/glue_ingest_job.py`, `datagen/aws/s3_contract.py`
- `replaysim/emr_normalize_job.py`
- `infra/datagen_stepfunctions.json`, `infra/datagen_stack.py`
- World-model next-state head in trainer
- Argo Workflows outer-loop controller
- Broken-repo image builder (clone → git apply -R → build → push)
- CREATE half of difficulty curriculum (mint harder tasks during run)
- SFT-first training stage
- Offline LLM-judge hack monitor