# Grounding Map: Dataset-Generation Pipeline (What Exists vs What Is Claimed vs What Is Envisioned)

**Agent:** REPO-GROUNDING

**Date:** 2026-06-09

**Scope:** composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py, hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5, research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md, docs/BACKLOG_RESOLUTION_2026-06-09.md

---

## (1) Exact Current Dataset-Generation Capability

### FeatureDeletionTask schema (`datagen/schema.py`)

The load-bearing fields and what produces each today:

| Field | Type | Producer today | Notes |
|---|---|---|---|
| `task_id` | `str` | `SweBenchAdapter.to_task()`, copied from `instance["instance_id"]` or `instance["task_id"]` | `"unknown"` if missing |
| `repo` | `str` | `instance["repo"]` via `SweBenchAdapter.to_task()` | e.g. `"getmoto/moto"` |
| `base_commit` | `str` | `instance["base_commit"]` | no code to `git checkout` this commit exists today |
| `broken_image` | `str` | `SweBenchAdapter.image_for(instance)`: either `instance["docker_image"]` (SWE-rebench) or the conventional `swebench/sweb.eval.x86_64.{iid}:latest` | This tag is a **pre-built SWE-bench eval image**; no code in the repo pulls or builds these images |
| `fail_to_pass` | `tuple[str,...]` | `_as_tuple(instance["FAIL_TO_PASS"])`; handles JSON-encoded string OR list | validated non-empty in `__post_init__` |
| `pass_to_pass` | `tuple[str,...]` | `_as_tuple(instance["PASS_TO_PASS"])` | may be empty |
| `test_command` | `str` | `SweBenchAdapter.default_test_command` = `"python -m pytest -q"` | hardcoded; not read from instance |
| `deleted_symbols` | `tuple[str,...]` | **never populated by SweBenchAdapter**; hardcoded `()` in every substrate inversion | the monitor can't do symbol-provenance checks without this |
| `golden_diff` | `str` | `instance["patch"]` | held out of repr; used only by validator |
| `granularity` | `str` | hardcoded `"feature"` in `SweBenchAdapter.to_task()` | CREATE-half escalation (function→file→feature) not wired to anything |
| `difficulty_prior` | `float` | `instance["difficulty"]` if present (SWE-rebench) else `0.5` | |
| `upstream_license` | `str` | `instance["license_name"]` | copyleft filter in `is_redistributable()` strips GPL/AGPL/LGPL |

### What SweBenchAdapter actually does and does NOT do

`SweBenchAdapter.to_task(instance: dict)` is a **pure schema inversion**: it takes one SWE-bench-shaped dict and maps it to a `FeatureDeletionTask`. It does NOT:

- Pull or build a Docker image
- Apply the gold patch in reverse (`git apply -R`)
- Run any tests
- Discover test node IDs
- Populate `deleted_symbols` (always empty)
- Escalate `granularity` beyond the static `"feature"`

The broken-repo Docker image is **assumed to exist pre-built** (the SWE-bench project publishes these images; SWE-rebench carries its own `docker_image` field). The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented `[~]` gate in ADR-010. It is implemented in concept (the 4-gate validator interface exists, `scrub_tree` is built, `LocalSubprocessSandbox` and `DockerSandbox` are built), but no code in the repo actually clones a repo, runs `git apply -R`, builds a Docker image, and pushes it to a registry.

### What FeatureDeletionEnv does during training (`datagen/env.py`)

- `reset(task)`: boots the sandbox (by image tag) and returns a text prompt listing failing tests. The prompt exposes `task.repo`, `task.fail_to_pass`, `task.test_command` but NEVER `golden_diff` or `deleted_symbols`.
- `step(action)`: delegates to `sandbox.exec(action)`, returning observation text; grades on `submit` or turn limit.
- `_grade()`: runs `sandbox.run_tests(test_command, fail_to_pass + pass_to_pass)`, computes pass-fraction over `fail_to_pass`, and gates to 0 if the `pass_to_pass` guard is broken OR `HackMonitor.flag()` fires.
- `reward_fn(prompts, completions, *, task_id, **kwargs)`: TRL `RewardFunc` face; dispatches through `reset`/`step`; feeds fractional credit (not binary) to `DifficultyCurriculum.update`.

### Safeguards implemented

- `scrub_tree(workdir)`: physically removes `__pycache__`, `.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, `*.pyc/.pyo/.class` before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
- `SANDBOX_DENYLIST`: blocks `find`, `strings`, `unzip`, `jar`, `javap`, decompilers, `git`. First-token-only check; bypassable via `sh -c "..."`. Documented as defense-in-depth, NOT the wall.
- `HackMonitor.flag()`: layer 1 is a substring scan for cache/decompiler signatures in trajectory actions (not in `submit_patch`). Layer 2 is patch-provenance: if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat `"__py"+"cache__"` obfuscation), it flags the trajectory.
- `DockerSandbox`: `network_mode='none'`, `read_only=True`, `cap_drop=['ALL']`, `no-new-privileges`, `pids_limit=256`, `mem_limit=1g`, optional gVisor `runtime='runsc'`.

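
The first-token limitation of `SANDBOX_DENYLIST` noted above can be illustrated with a minimal sketch. The denylist contents and the `is_blocked` helper are hypothetical stand-ins, not the repo's implementation; the sketch only assumes that the check inspects the first token of the command, and shows why a command wrapped in `sh -c` would then slip through.

```python
import shlex

# Hypothetical stand-in for the repo's SANDBOX_DENYLIST (illustrative subset only).
DENYLIST = frozenset({"find", "strings", "unzip", "jar", "javap", "git"})


def is_blocked(action: str) -> bool:
    """Return True if the FIRST token of the command is on the denylist.

    Only the first token is inspected, which is an assumption modelled on
    the first-token-only check described above, not a copy of the real code.
    """
    try:
        tokens = shlex.split(action)
    except ValueError:
        # Unbalanced quotes: refuse the command rather than guess.
        return True
    return bool(tokens) and tokens[0] in DENYLIST


if __name__ == "__main__":
    # A direct invocation is caught by the first-token check.
    print(is_blocked("git log --oneline"))
    # The same command wrapped in a shell is not: the first token is "sh".
    print(is_blocked('sh -c "git log --oneline"'))
```

Under these assumptions the wrapped command is not blocked, which is consistent with treating the denylist as defense-in-depth and `scrub_tree` as the primary control.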
### What ingestion/claude_code.py can ingest today

`ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]`:

- Input: Claude Code session JSONL at `~/.claude/projects//.jsonl`
- Output: one `TraceState` per assistant TURN (`state_id`, `messages`, `student_action`)
- Skips: subagent files (`agent-` prefix), sidechain records (`isSidechain: True`), `summary` / `attachment` / `queue-operation` / `file-history-snapshot` records
- `student_action`: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if `strip_thinking=True`)
- `tool_error` flag: structurally set on `user` messages where any `tool_result` block has `is_error: true`; this is the SDPO error-site detection signal
- `state_id`: `f"{path.stem}::{state_idx:04d}"`
- Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL

---

## (2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)

From `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §5/§8/§10.

1. **Seed trace ingestion (Stage a):** `ClaudeCodeIngester.ingest()` over `s3://composer-datagen-386931836011-us-west-2/raw/claude_code/**/*.jsonl` → Parquet at `traces/v1/run_id=/part-*.parquet` via AWS Glue 5.0 Spark ETL job (`glue_ingest_job.py`, ~80 LOC, NOT YET BUILT).

2. **Schema inversion (Stage c1):** `SweBenchAdapter.to_task()` per SWE-bench row → `FeatureDeletionTask` JSONL at `tasks/v1/run_id=/manifest.jsonl` (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (`is_redistributable()`) applied here.

3. **N-teacher replay (Stage b):** `teacher_replay.replay_trace()` generalized from flat OpenRouter to `BedrockBatchTeacherPool`: write one shared `replay/v1/run_id=/input/states.jsonl`, submit one `CreateModelInvocationJob` per teacher, write `.jsonl.out` per teacher to `replay/v1/.../teacher=/`. An EMR Serverless aggregation step joins all N outputs by `state_id` → `list[TeacherCallResult]`. (`teacher_replay_bedrock.py`, ~180 LOC, NOT YET BUILT).

4. **Multi-model tree expansion (the core delta, NOT BUILT):** A `tree_controller.py` (~250–350 LOC, design-only) that, for each `TraceState` node, fires N models, applies each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branches again from the new state, and grades leaves with `_grade()`. Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see step 8).

5. **Sandbox materialization + 4-gate validation (Stage c2):** AWS Batch array jobs on EC2 Spot, one child per task. Each child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task in the S3 manifest, boots `DockerSandbox`/`LocalSubprocessSandbox`, runs `validator.validate_task()` (4 gates), and writes `task_grades/v1/run_id=/.json`. (`datagen/aws/batch_validate.py`, ~120 LOC, NOT YET BUILT).

6. **DPO pair extraction + normalization (Stage d):** `extract_dpo_pairs()` (already built in `teacher_replay.py`) on the fan-in of teacher outputs → `DPOPair` rows → `DJNormalizer` data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → `corpus/v1/run_id=/dpo/part-*.parquet` and `corpus/sft/part-*.parquet`. (`replaysim/emr_normalize_job.py`, ~100 LOC, NOT YET BUILT).

7. **Orchestration:** AWS Step Functions Standard Workflow: `Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda)`.
   (`infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`, ~250 LOC IaC, NOT YET BUILT).

8. **S3 typed dataset contract (full set):**
   - `raw/claude_code/**/*.jsonl`: input seed traces
   - `traces/v1/run_id=/part-*.parquet`: TraceState rows (Stage a output)
   - `tasks/v1/run_id=/manifest.jsonl`: FeatureDeletionTask rows (Stage c1 output)
   - `tasks/golden/run_id=/`: golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
   - `replay/v1/run_id=/input/states.jsonl`: shared Bedrock batch input
   - `replay/v1/run_id=/teacher=/*.jsonl.out`: per-teacher Bedrock batch output
   - `task_grades/v1/run_id=/.json`: validator + _grade() results
   - `corpus/v1/run_id=/sft/part-*.parquet`: clean winning trajectories (SFT-first floor)
   - `corpus/v1/run_id=/dpo/part-*.parquet`: DPO pairs (normalized DPOPair)
   - `dpo_pairs/`: divergence-derived DPO pairs from the tree (sibling winners vs losers)
   - `rl_task_pool/`: FeatureDeletionTask registry + DifficultyCurriculum priors
   - `divergence_pairs/`: divergence-annotated nodes (where sibling next-action distributions forked)
   - `wm_tuples/`: (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
   - `holdout/`: disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
   - `diloco/rendezvous/round_/rank_.pt`: DiLoCo outer-sync (already used by existing allreduce.py)
   - `manifests/run_id=.json`: run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)

9. **SFT-first stage:** Read `sft_corpus/` (clean `_grade()` gate-1 passing trajectories), run `compose_loss` with `alpha_sdpo=0, beta_replay=0` (reduces to `_lm_response_ce`, i.e. next-token CE masked to response tokens), write `ckpt_sft/`. (`pipeline/sft_floor.py`, ~60 LOC, NOT YET BUILT).

10. **Inner RL loop:** `ComposerReplicationTrainer` (trl.GRPOTrainer subclass) on `rl_task_pool/` with `FeatureDeletionEnv.reward_fn`; `total = grpo + α·sdpo + β·trace_replay_dpo`; DiLoCo outer-sync via S3; `HeldOutGuard` kill-switch now wired (Wave 3).

11. **Flywheel:** Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.

---

## (3) Unbuilt Components the Vision Depends On

Every item below is design-only or a skeleton; none has real production code.

| Component | File Estimate | Source | Status |
|---|---|---|---|
| `datagen/tree_controller.py`: the core delta (env-step between branches, `_grade()` at leaves, divergence-gated expansion, six typed S3 prefix writes) | ~250–350 LOC | design-F1, final_report §1/§5/§6 | **0% built**; no file exists |
| `SiblingBootstrapGenerator` in `hint_generator.py`: select max-reward sibling → emit "a working approach looks like: …" → feed `ctx_teacher` splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | **0% built**; not a class in hint_generator.py at all |
| `pipeline/s3_layout.py`: typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | **0% built**; no `pipeline/` directory exists |
| `pipeline/sft_floor.py`: SFT-first driver (read `sft_corpus/`, run TRL SFTTrainer or `compose_loss` `alpha=beta=0`, write `ckpt_sft/`) | ~60 LOC | design-F1 §2 / design-F5 d | **0% built** |
| `teacher_replay_bedrock.py`: `BedrockBatchTeacherPool` (submit one Bedrock `CreateModelInvocationJob` per teacher, poll, parse `.jsonl.out` back into `list[TeacherCallResult]`) | ~180 LOC | design-F2 §b | **0% built** |
| `datagen/aws/batch_validate.py`: AWS Batch array-child entrypoint (read `AWS_BATCH_JOB_ARRAY_INDEX` → manifest line → `DockerSandbox` + `validator` + `_grade()` → write `task_grades/`) | ~120 LOC | design-F2 §c2 | **0% built**; `datagen/aws/` subdirectory does not exist |
| `datagen/aws/glue_ingest_job.py`: Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; write `traces/` Parquet | ~80 LOC | design-F2 §a | **0% built** |
| `replaysim/emr_normalize_job.py`: EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; write `corpus/dpo/` + `corpus/sft/` Parquet | ~100 LOC | design-F2 §d | **0% built** |
| `datagen/aws/s3_contract.py`: S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers, `recordId==state_id` join helpers, `schema_version`/`split` column injection | ~120 LOC | design-F2 §contract | **0% built** |
| `infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`: Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | **0% built**; `infra/` directory does not exist |
| `trainer/composer_trainer.py` world-model head: parameter-isolated next-state adapter + `` token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | **0% built**; grep confirms no `world_model`/`WorldModel`/`next_state_head`/`` anywhere in `composer_replication/` |
| Broken-repo image builder: code to clone a repo at `base_commit`, apply `git apply -R`, run `scrub_tree`, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | **0% built**; there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch |
| `EKSExecutor` (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | **skeleton built**; `eks.py` is a functional executor (IndexedJob dispatch) but the Argo outer-loop controller is 0% |
| `verl AsyncServer` backend for tool-heavy tree | n/a | final_report §8 | **0% built**; design note only |
| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | n/a | design-F5 §Tier 4 | **0% built** |

---

## (4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code

The `SweBenchAdapter` is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:

### Break 1: `broken_image` assumes a pre-built SWE-bench image exists

`SweBenchAdapter.image_for()` returns either `instance["docker_image"]` (SWE-rebench) or the convention `swebench/sweb.eval.x86_64.{iid}:latest`. For an arbitrary OSS repo there is no such image. A fresh repo would need:

- Clone at `base_commit`
- Install the project's Python/Java/etc. toolchain
- Apply `git apply -R` to manufacture the broken state
- Run `scrub_tree()` to strip caches
- Build a Docker image that encapsulates this broken state
- Push the image to a registry accessible by `DockerSandbox.boot()`

None of this code exists. `DockerSandbox.boot(image)` raises `RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.")` if the image is absent.

### Break 2: `test_command` is hardcoded

`SweBenchAdapter.default_test_command = "python -m pytest -q"`. A fresh repo may use `make test`, `npm test`, `cargo test`, `mvn verify`, or any other test runner. There is no test-discovery logic anywhere in the repo.

### Break 3: `fail_to_pass` and `pass_to_pass` require pre-existing test labels

SWE-bench instances ship with `FAIL_TO_PASS` and `PASS_TO_PASS` as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived, e.g., via coverage analysis or AST-reachability. `FeatureDeletionTask.__post_init__` raises `ValueError` if `fail_to_pass` is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.

### Break 4: `deleted_symbols` is never populated

`SweBenchAdapter` hardcodes `deleted_symbols=()`.
The `HackMonitor._patch_provenance_hack()` check (`monitor.py:157-182`) skips the symbol-reappearance test if `deleted_symbols` is empty, so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.

### Break 5: No copyleft scrub for arbitrary repos

`is_redistributable()` reads `upstream_license` from `instance["license_name"]`. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.

### Break 6: No env setup for non-Python repos

`LocalSubprocessSandbox.run_tests` runs `subprocess.run(cmd, shell=True, ...)` against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. `DockerSandbox` depends on a pre-baked image that already has the environment. A fresh Python repo would need `pip install -e .` run inside the image, and a non-Python repo would need a completely different image and test runner.

---

## (5) What ingestion/claude_code.py Can Ingest Today

`ClaudeCodeIngester.ingest(path)` handles exactly one format: **Claude Code session JSONL** at `~/.claude/projects//.jsonl`.


Record types handled:

- `type="user"`: string content or list of text/tool_result blocks → OpenAI-style user message; `tool_error` structural flag set if any `tool_result` block has `is_error: true`
- `type="assistant"`: list of text/thinking/tool_use blocks → one `TraceState` with `student_action` (full blocks including thinking) and `messages` (history, optionally with thinking stripped)

Record types silently skipped:

- `type="summary"`: Claude Code conversation summary records
- `type="attachment"`, `"queue-operation"`, `"file-history-snapshot"`, `"last-prompt"`, `"system"`: auxiliary records
- `isSidechain: True` records: subagent traces (skipped in v0.1 per ADR-002)
- Files starting with `agent-`: subagent session files by naming convention

Structural features:

- `state_id = f"{path.stem}::{state_idx:04d}"`: stable within-session identifier
- `strip_thinking` flag (default True): strips `[THINKING] ...` lines from the teacher-facing `messages` history but keeps them in `student_action`
- Injects synthetic system prompt at `messages[0]` (`"You are a senior software engineer..."`)
- Version check: warns on schema version outside `2.x.x`

NOT handled by this ingester:

- OpenHands trajectory format (planned for v0.2 per ADR-002)
- SWE-smith trajectories (planned for v0.2)
- Cline VS Code export
- Aider chat history
- SWE-bench leaderboard trajectory submissions
- Any binary or non-JSONL format

---

## Critical Cross-Checks: What the Repo Claims vs What Exists

### Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")

**What the blog says (COMPOSER_RECIPE_MAPPING.md):** "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests."

**What the repo does:** Inverts *existing* SWE-bench-shaped instances (reverts their gold patch). There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones.
The ADR correctly scopes this as "Option A: invert OSS substrates" vs "Option B: greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was *explicitly rejected*.

### Claim 2: "25× synthetic data"

**What the blog says:** Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2).

**What the repo has:** A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the *training run*; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.

### Claim 3: "Dynamic difficulty curriculum: select for AND create harder tasks"

**What Composer 2.5 says:** "We both select for and create harder tasks dynamically throughout the run."

**What the repo has:** The SELECT-FOR half: `DifficultyCurriculum` with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. `granularity` is set statically to `"feature"` for all SweBenchAdapter tasks; no escalation logic exists.

### Claim 4: `deleted_symbols` enables AST-provenance monitoring

**What ADR-010 says:** "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads.

**Reality:** `deleted_symbols=()` on every `SweBenchAdapter`-derived task (line 81 in substrates.py: hardcoded empty tuple). `HackMonitor._patch_provenance_hack()` returns False immediately when `deleted_symbols` is empty (`reappeared = [s for s in deleted_symbols if s and s in patch]` → empty list).
The provenance layer of the monitor is a dead code path for all currently-generable tasks.

### Claim 5: The tree controller and world-model head are part of the system

**What design docs say:** "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table).

**Reality:** The tree controller is 0% built: no file, no function, no class. Confirmed by exhaustive grep: no `SiblingBootstrap`, `world_model`, `WorldModel`, `next_state_head`, `tree_controller`, `MCTS`, `deliberate_token` anywhere in `composer_replication/`. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) is entirely design.

### Claim 6: The broken-repo image is manufactured by the pipeline

**What design-F2 says:** Step c2 involves "pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()`, run the test command, confirm FAIL_TO_PASS actually fails."

**Reality:** This describes what SHOULD happen in the Batch array child. No such code is written. `SweBenchAdapter.image_for()` returns a string tag; that tag must be pre-pulled on the host before `DockerSandbox.boot()` can use it (`RuntimeError` on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.

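
The dead-path behaviour described under Claim 4 follows directly from the quoted list comprehension. A minimal sketch, in which `reappeared_symbols` is a hypothetical wrapper around that expression (not the repo's function) and the patch text is made up for illustration:

```python
def reappeared_symbols(deleted_symbols: tuple[str, ...], patch: str) -> list[str]:
    """Hypothetical wrapper around the comprehension quoted under Claim 4.

    Returns the deleted symbols that occur verbatim (as substrings) in the patch.
    """
    return [s for s in deleted_symbols if s and s in patch]


# Illustrative patch text; not taken from any real task.
patch = "def parse_header(raw):\n    return raw.split(':')\n"

# With no recorded symbols (the SweBenchAdapter case) nothing can ever match,
# so a provenance check built on this list has nothing to act on.
print(reappeared_symbols((), patch))

# With a populated tuple the same patch would surface the symbol.
print(reappeared_symbols(("parse_header",), patch))
```

The sketch only covers the symbol-reappearance half; per the description of layer 2, a flag additionally requires evidence in the trajectory that a cache or bytecode artifact was read.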

---

## Summary of Unbuilt vs Built

### BUILT and tested (production-ready CPU, Docker-gated where noted):

- `FeatureDeletionTask` schema + `FeatureDeletionEnv` (reset/step/_grade/reward_fn)
- `SweBenchAdapter` schema inversion (pure dict transform)
- `FakeSandbox`, `LocalSubprocessSandbox`, `DockerSandbox` (hardware-gated e2e green in Wave 1/2)
- `scrub_tree()` primary reward-hack control
- `HackMonitor` (signature + patch-provenance, obfuscation-resistant)
- `DifficultyCurriculum` (SELECT-FOR half + effort tilt)
- `validate_task()` 4-gate solvability validator
- `ClaudeCodeIngester` (Claude Code JSONL only)
- `behavior_rewards.py`: `c_length`, `EffortWeights`, `LengthEffortPenalty`, `UnfinishedTodoPenalty`, `LeftoverCoTPenalty`, `CommunicationReward` (Wave 20)
- `kl_in_reward.py`: k1-in-reward path opt-in (Wave 20)
- `HeldOutGuard` + `HeldoutSplit` + wired into trainer (Wave 2/3)
- `EKSExecutor` skeleton + `SageMakerExecutor` skeleton (Wave 2)

### DESIGN-ONLY (no code):

- Tree controller (`datagen/tree_controller.py`)
- `SiblingBootstrapGenerator` in `hint_generator.py`
- `pipeline/s3_layout.py`, `pipeline/sft_floor.py`
- `teacher_replay_bedrock.py` (BedrockBatchTeacherPool)
- `datagen/aws/batch_validate.py`, `datagen/aws/glue_ingest_job.py`, `datagen/aws/s3_contract.py`
- `replaysim/emr_normalize_job.py`
- `infra/datagen_stepfunctions.json`, `infra/datagen_stack.py`
- World-model next-state head in trainer
- Argo Workflows outer-loop controller
- Broken-repo image builder (clone → git apply -R → build → push)
- CREATE half of difficulty curriculum (mint harder tasks during run)
- SFT-first training stage
- Offline LLM-judge hack monitor