Baladithya Balamurugan

Wave 21: deep-read critical review — 8 source clusters re-read, findings verified

Grounding Map: Dataset-Generation Pipeline — What Exists vs What Is Claimed vs What Is Envisioned

Agent: REPO-GROUNDING
Date: 2026-06-09
Scope: composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py, hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5, research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md, docs/BACKLOG_RESOLUTION_2026-06-09.md

(1) Exact Current Dataset-Generation Capability

FeatureDeletionTask schema (`datagen/schema.py`)

The schema's twelve fields and what produces each today:

| Field | Type | Producer today | Notes |
| --- | --- | --- | --- |
| `task_id` | `str` | `SweBenchAdapter.to_task()`: copied from `instance["instance_id"]` or `instance["task_id"]` | `"unknown"` if missing |
| `repo` | `str` | `instance["repo"]` via `SweBenchAdapter.to_task()` | e.g. `"getmoto/moto"` |
| `base_commit` | `str` | `instance["base_commit"]` | no code to `git checkout` this commit exists today |
| `broken_image` | `str` | `SweBenchAdapter.image_for(instance)`: either `instance["docker_image"]` (SWE-rebench) or the conventional `swebench/sweb.eval.x86_64.{iid}:latest` | This tag is a pre-built SWE-bench eval image; no code in the repo pulls or builds these images |
| `fail_to_pass` | `tuple[str,...]` | `_as_tuple(instance["FAIL_TO_PASS"])`: handles JSON-encoded string OR list | validated non-empty in `__post_init__` |
| `pass_to_pass` | `tuple[str,...]` | `_as_tuple(instance["PASS_TO_PASS"])` | may be empty |
| `test_command` | `str` | `SweBenchAdapter.default_test_command` = `"python -m pytest -q"` | hardcoded; not read from instance |
| `deleted_symbols` | `tuple[str,...]` | never populated by SweBenchAdapter; hardcoded `()` in every substrate inversion | the monitor can't do symbol-provenance checks without this |
| `golden_diff` | `str` | `instance["patch"]` | held out of repr; used only by validator |
| `granularity` | `str` | hardcoded `"feature"` in `SweBenchAdapter.to_task()` | CREATE-half escalation (function→file→feature) not wired to anything |
| `difficulty_prior` | `float` | `instance["difficulty"]` if present (SWE-rebench) else `0.5` | |
| `upstream_license` | `str` | `instance["license_name"]` | copyleft filter in `is_redistributable()` strips GPL/AGPL/LGPL |
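
To make the table concrete, here is a minimal sketch of the two behaviors it describes: tuple coercion of `FAIL_TO_PASS` (a JSON-encoded string or a list) and the non-empty check in `__post_init__`. The names `as_tuple` and `TaskSketch` are hypothetical stand-ins, not the repo's `_as_tuple` or `FeatureDeletionTask`, and only a subset of fields is shown.

```python
import json
from dataclasses import dataclass, field


def as_tuple(value) -> tuple[str, ...]:
    """Coerce a JSON-encoded string, a list, or None into a tuple of strings."""
    if value is None:
        return ()
    if isinstance(value, str):
        value = json.loads(value) if value.strip() else []
        if isinstance(value, str):  # the JSON payload was a single string
            value = [value]
    return tuple(str(v) for v in value)


@dataclass(frozen=True)
class TaskSketch:
    """Illustrative subset of the schema; not the real FeatureDeletionTask."""

    task_id: str
    repo: str
    fail_to_pass: tuple[str, ...]
    pass_to_pass: tuple[str, ...] = ()
    test_command: str = "python -m pytest -q"
    deleted_symbols: tuple[str, ...] = ()
    # repr=False keeps the gold patch out of logs and prompts built from repr().
    golden_diff: str = field(default="", repr=False)

    def __post_init__(self) -> None:
        if not self.fail_to_pass:
            raise ValueError("fail_to_pass must be non-empty")


task = TaskSketch(
    task_id="demo-1",
    repo="getmoto/moto",
    fail_to_pass=as_tuple('["tests/test_a.py::test_x"]'),
    golden_diff="SECRET-PATCH",
)
```

Keeping `golden_diff` out of the repr is one cheap way to honor the "held out of repr" note; it does not by itself prevent the field from being read directly.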

What SweBenchAdapter actually does and does NOT do

SweBenchAdapter.to_task(instance: dict) is a pure schema inversion — it takes one SWE-bench-shaped dict and maps it to a FeatureDeletionTask. It does NOT:

Pull or build a Docker image
Apply the gold patch in reverse (git apply -R)
Run any tests
Discover test node IDs
Populate deleted_symbols (always empty)
Escalate granularity beyond the static "feature"

The broken-repo Docker image is assumed to exist pre-built: the SWE-bench project publishes these images, and SWE-rebench carries its own docker_image field. The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented [~] gate in ADR-010. Its supporting pieces exist (the 4-gate validator interface, scrub_tree, LocalSubprocessSandbox, and DockerSandbox are all built), but no code in the repo actually clones a repo, runs git apply -R <gold_patch>, builds a Docker image, and pushes it to a registry.

What FeatureDeletionEnv does during training (`datagen/env.py`)

reset(task) — boots the sandbox (by image tag), returns a text prompt listing failing tests. The prompt exposes task.repo, task.fail_to_pass, task.test_command but NEVER golden_diff or deleted_symbols.
step(action) — delegates to sandbox.exec(action), returning observation text; grades on submit or turn limit.
_grade() — runs sandbox.run_tests(test_command, fail_to_pass + pass_to_pass), computes pass-fraction over fail_to_pass, gates to 0 if pass_to_pass guard is broken OR HackMonitor.flag() fires.
reward_fn(prompts, completions, *, task_id, **kwargs) — TRL RewardFunc face; dispatches through reset/step; feeds fractional credit (not binary) to DifficultyCurriculum.update.
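
The grading rule described for _grade() can be sketched as a pure function. This is an illustration under stated assumptions, not the repo's implementation: `grade_sketch` is a hypothetical name, and test results are assumed to arrive as a mapping from test node ID to pass/fail.

```python
def grade_sketch(
    results: dict[str, bool],
    fail_to_pass: tuple[str, ...],
    pass_to_pass: tuple[str, ...],
    hack_flagged: bool,
) -> float:
    """Fractional credit over fail_to_pass, gated to 0 by either guard."""
    if hack_flagged:
        return 0.0
    # A missing result is treated as a failure (assumption).
    if any(not results.get(test, False) for test in pass_to_pass):
        return 0.0
    passed = sum(1 for test in fail_to_pass if results.get(test, False))
    return passed / len(fail_to_pass)
```

The fractional (not binary) value is what would be fed to the curriculum update; the two gates zero the reward regardless of how many target tests pass.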

Safeguards implemented

scrub_tree(workdir) — physically removes __pycache__, .mypy_cache, .pytest_cache, .git, .hg, *.pyc/.pyo/.class before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
SANDBOX_DENYLIST — blocks find, strings, unzip, jar, javap, decompilers, git. First-token-only check; bypassable via sh -c "...". Documented as defense-in-depth, NOT the wall.
HackMonitor.flag() — layer 1: substring scan for cache/decompiler signatures in trajectory actions (not in submit_patch). Layer 2: patch-provenance — if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat "__py"+"cache__" obfuscation), flags the trajectory.
DockerSandbox — network_mode='none', read_only=True, cap_drop=['ALL'], no-new-privileges, pids_limit=256, mem_limit=1g, optional gVisor runtime='runsc'.
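
A minimal sketch of the first two controls follows, mainly to show why the denylist is only defense-in-depth. The function names, the constant names, and the choice to treat unparsable commands as blocked are assumptions; the directory and suffix sets are taken from the description above, and the denylist omits the unnamed decompilers.

```python
import os
import shlex
import shutil

SCRUB_DIRS = frozenset({"__pycache__", ".mypy_cache", ".pytest_cache", ".git", ".hg"})
SCRUB_SUFFIXES = frozenset({".pyc", ".pyo", ".class"})
DENYLIST = frozenset({"find", "strings", "unzip", "jar", "javap", "git"})


def scrub_tree_sketch(workdir: str) -> list[str]:
    """Physically remove cache/VCS directories and bytecode files under workdir."""
    removed = []
    for dirpath, dirnames, filenames in os.walk(workdir):
        for name in list(dirnames):
            if name in SCRUB_DIRS:
                target = os.path.join(dirpath, name)
                if os.path.islink(target):
                    os.unlink(target)
                else:
                    shutil.rmtree(target)
                dirnames.remove(name)  # do not descend into what was just deleted
                removed.append(target)
        for name in filenames:
            if os.path.splitext(name)[1] in SCRUB_SUFFIXES:
                target = os.path.join(dirpath, name)
                os.remove(target)
                removed.append(target)
    return removed


def first_token_blocked(action: str) -> bool:
    """First-token-only denylist check, as described above."""
    try:
        tokens = shlex.split(action)
    except ValueError:
        return True  # unparsable command: block it (assumption)
    return bool(tokens) and tokens[0] in DENYLIST
```

With this check, `git log` is blocked but `sh -c "git log"` is not, because the first token is `sh`. That is why removing the artifacts from disk, rather than filtering commands, is the primary control.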

What ingestion/claude_code.py can ingest today

ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]:

Input: Claude Code session JSONL at ~/.claude/projects/<encoded>/<sessionId>.jsonl
Output: one TraceState per assistant TURN (state_id, messages, student_action)
Skips: subagent files (agent- prefix), sidechain records (isSidechain: True), summary / attachment / queue-operation / file-history-snapshot records
student_action: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if strip_thinking=True)
tool_error flag: structurally set on user messages where any tool_result block has is_error: true — this is the SDPO error-site detection signal
state_id: f"{path.stem}::{state_idx:04d}"
Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL
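
The skip rules, the state_id format, and the tool_error signal can be sketched as below. This is not the ClaudeCodeIngester code: the function names are hypothetical, and the record shape (a top-level `type`, an `isSidechain` flag, and content blocks under `message.content`) is an assumption about the session JSONL layout.

```python
import json
from typing import Iterable, Iterator

SKIPPED_TYPES = frozenset(
    {"summary", "attachment", "queue-operation", "file-history-snapshot"}
)


def has_tool_error(record: dict) -> bool:
    """True if any tool_result block in a user record carries is_error: true."""
    content = (record.get("message") or {}).get("content")
    if not isinstance(content, list):
        return False
    return any(
        isinstance(block, dict)
        and block.get("type") == "tool_result"
        and block.get("is_error") is True
        for block in content
    )


def iter_state_ids(lines: Iterable[str], stem: str) -> Iterator[str]:
    """Yield one state_id per assistant turn, skipping auxiliary records."""
    state_idx = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("isSidechain") or record.get("type") in SKIPPED_TYPES:
            continue
        if record.get("type") == "assistant":
            yield f"{stem}::{state_idx:04d}"
            state_idx += 1
```

Because the index advances only on assistant turns that survive the filters, the identifier is stable for a given session file as long as the skip rules do not change.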

(2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)

From research/design-F1-systems-framing.md, research/design-F2-aws-datagen.md, and research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md §5/§8/§10.

Seed trace ingestion (Stage a): ClaudeCodeIngester.ingest() over s3://composer-datagen-386931836011-us-west-2/raw/claude_code/**/*.jsonl → Parquet at traces/v1/run_id=<id>/part-*.parquet via AWS Glue 5.0 Spark ETL job (glue_ingest_job.py, ~80 LOC, NOT YET BUILT).
Schema inversion (Stage c1): SweBenchAdapter.to_task() per SWE-bench row → FeatureDeletionTask JSONL at tasks/v1/run_id=<id>/manifest.jsonl (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (is_redistributable()) applied here.
N-teacher replay (Stage b): teacher_replay.replay_trace() generalized from flat OpenRouter to BedrockBatchTeacherPool — write one shared replay/v1/run_id=<id>/input/states.jsonl, submit one CreateModelInvocationJob per teacher, write .jsonl.out per teacher to replay/v1/.../teacher=<slug>/. An EMR Serverless aggregation step joins all N outputs by state_id → list[TeacherCallResult]. (teacher_replay_bedrock.py, ~180 LOC, NOT YET BUILT).
Multi-model tree expansion (the core delta, NOT BUILT): A tree_controller.py (~250–350 LOC, design-only) that, for each TraceState node, fires N models, applies each candidate action through FeatureDeletionEnv.step() to get a real next observation, branches again from the new state, and grades leaves with _grade(). Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see the S3 typed dataset contract below).
Sandbox materialization + 4-gate validation (Stage c2): AWS Batch array jobs on EC2 Spot, one child per task. Each child reads AWS_BATCH_JOB_ARRAY_INDEX, looks up its task in the S3 manifest, boots DockerSandbox/LocalSubprocessSandbox, runs validator.validate_task() (4 gates), writes task_grades/v1/run_id=<id>/<task_id>.json. (datagen/aws/batch_validate.py, ~120 LOC, NOT YET BUILT).
DPO pair extraction + normalization (Stage d): extract_dpo_pairs() (already built in teacher_replay.py) on the fan-in of teacher outputs → DPOPair rows → DJNormalizer data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → corpus/v1/run_id=<id>/dpo/part-*.parquet and corpus/sft/part-*.parquet. (replaysim/emr_normalize_job.py, ~100 LOC, NOT YET BUILT).
Orchestration: AWS Step Functions Standard Workflow: Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda). (infra/datagen_stepfunctions.json + infra/datagen_stack.py, ~250 LOC IaC, NOT YET BUILT).
S3 typed dataset contract (full set):
- raw/claude_code/**/*.jsonl — input seed traces
- traces/v1/run_id=<id>/part-*.parquet — TraceState rows (Stage a output)
- tasks/v1/run_id=<id>/manifest.jsonl — FeatureDeletionTask rows (Stage c1 output)
- tasks/golden/run_id=<id>/ — golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
- replay/v1/run_id=<id>/input/states.jsonl — shared Bedrock batch input
- replay/v1/run_id=<id>/teacher=<slug>/*.jsonl.out — per-teacher Bedrock batch output
- task_grades/v1/run_id=<id>/<task_id>.json — validator + _grade() results
- corpus/v1/run_id=<id>/sft/part-*.parquet — clean winning trajectories (SFT-first floor)
- corpus/v1/run_id=<id>/dpo/part-*.parquet — DPO pairs (normalized DPOPair)
- dpo_pairs/ — divergence-derived DPO pairs from the tree (sibling winners vs losers)
- rl_task_pool/ — FeatureDeletionTask registry + DifficultyCurriculum priors
- divergence_pairs/ — divergence-annotated nodes (where sibling next-action distributions forked)
- wm_tuples/ — (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
- holdout/ — disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
- diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt — DiLoCo outer-sync (already used by existing allreduce.py)
- manifests/run_id=<id>.json — run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)
SFT-first stage: Read sft_corpus/ (clean _grade() gate-1 passing trajectories), run compose_loss with alpha_sdpo=0, beta_replay=0 (reduces to _lm_response_ce — next-token CE masked to response tokens), write ckpt_sft/. (pipeline/sft_floor.py, ~60 LOC, NOT YET BUILT).
Inner RL loop: ComposerReplicationTrainer (trl.GRPOTrainer subclass) on rl_task_pool/ with FeatureDeletionEnv.reward_fn; total = grpo + α·sdpo + β·trace_replay_dpo; DiLoCo outer-sync via S3; HeldOutGuard kill-switch now wired (Wave 3).
Flywheel: Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.
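
Since the S3 contract above is the interface between the outer and inner loops, a few key builders illustrate what typed writers would have to agree on. This is a hedged sketch: the function names are hypothetical, and the zero-padding widths for the DiLoCo rendezvous key are inferred from the `<NNNNNN>` and `<RRRR>` placeholders rather than confirmed against allreduce.py.

```python
def task_manifest_key(run_id: str) -> str:
    """Key for the FeatureDeletionTask manifest (Stage c1 output)."""
    return f"tasks/v1/run_id={run_id}/manifest.jsonl"


def teacher_output_prefix(run_id: str, teacher_slug: str) -> str:
    """Prefix holding one teacher's Bedrock batch output files."""
    return f"replay/v1/run_id={run_id}/teacher={teacher_slug}/"


def task_grade_key(run_id: str, task_id: str) -> str:
    """Key for one task's validator and grading result."""
    return f"task_grades/v1/run_id={run_id}/{task_id}.json"


def rendezvous_key(round_idx: int, rank: int) -> str:
    """DiLoCo outer-sync key; padding widths are an assumption."""
    return f"diloco/rendezvous/round_{round_idx:06d}/rank_{rank:04d}.pt"


def run_manifest_key(run_id: str) -> str:
    """Key for the run-level manifest."""
    return f"manifests/run_id={run_id}.json"
```

Centralizing key construction in one module keeps the golden-diff prefix and the policy-visible prefixes from drifting into each other by accident.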

(3) Unbuilt Components the Vision Depends On

Every item below is design-only or a skeleton; none has real production code.

| Component | File Estimate | Source | Status |
| --- | --- | --- | --- |
| `datagen/tree_controller.py` (the core delta): env-step between branches, `_grade()` at leaves, divergence-gated expansion, six typed S3 prefix writes | ~250–350 LOC | design-F1, final_report §1/§5/§6 | 0% built; no file exists |
| `SiblingBootstrapGenerator` in `hint_generator.py`: select max-reward sibling → emit "a working approach looks like: …" → feed `ctx_teacher` splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | 0% built; not a class in hint_generator.py at all |
| `pipeline/s3_layout.py`: typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | 0% built; no `pipeline/` directory exists |
| `pipeline/sft_floor.py`: SFT-first driver: read `sft_corpus/`, run TRL SFTTrainer or `compose_loss` `alpha=beta=0`, write `ckpt_sft/` | ~60 LOC | design-F1 §2 / design-F5 d | 0% built |
| `teacher_replay_bedrock.py`: `BedrockBatchTeacherPool`: submit one Bedrock `CreateModelInvocationJob` per teacher, poll, parse `.jsonl.out` back into `list[TeacherCallResult]` | ~180 LOC | design-F2 §b | 0% built |
| `datagen/aws/batch_validate.py`: AWS Batch array-child entrypoint: read `AWS_BATCH_JOB_ARRAY_INDEX` → manifest line → `DockerSandbox` + `validator` + `_grade()` → write `task_grades/` | ~120 LOC | design-F2 §c2 | 0% built; `datagen/aws/` subdirectory does not exist |
| `datagen/aws/glue_ingest_job.py`: Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; write `traces/` Parquet | ~80 LOC | design-F2 §a | 0% built |
| `replaysim/emr_normalize_job.py`: EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; write `corpus/dpo/` + `corpus/sft/` Parquet | ~100 LOC | design-F2 §d | 0% built |
| `datagen/aws/s3_contract.py`: S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers, `recordId==state_id` join helpers, `schema_version`/`split` column injection | ~120 LOC | design-F2 §contract | 0% built |
| `infra/datagen_stepfunctions.json` + `infra/datagen_stack.py`: Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | 0% built; `infra/` directory does not exist |
| `trainer/composer_trainer.py` world-model head: parameter-isolated next-state adapter + `<deliberate>` token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | 0% built; grep confirms no `world_model`/`WorldModel`/`next_state_head`/`<deliberate>` anywhere in `composer_replication/` |
| Broken-repo image builder: code to clone a repo at `base_commit`, apply `git apply -R <golden_diff>`, run `scrub_tree`, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | 0% built; there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch |
| `EKSExecutor` (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | skeleton built; `eks.py` is a functional executor (IndexedJob dispatch) but the Argo outer-loop controller is 0% |
| `verl AsyncServer` backend for tool-heavy tree | unspecified | final_report §8 | 0% built; design note only |
| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | unspecified | design-F5 §Tier 4 | 0% built |
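
As an illustration of the array-child lookup that `datagen/aws/batch_validate.py` would need, the sketch below maps the array index to a manifest line. `task_for_array_child` is a hypothetical name and the manifest is assumed to be already downloaded as a list of lines; `AWS_BATCH_JOB_ARRAY_INDEX` is the environment variable AWS Batch sets on each array child.

```python
import json
import os
from typing import Mapping, Optional, Sequence


def task_for_array_child(
    manifest_lines: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return the manifest row whose line number equals this child's array index."""
    env = os.environ if env is None else env
    index = int(env["AWS_BATCH_JOB_ARRAY_INDEX"])
    if not 0 <= index < len(manifest_lines):
        raise IndexError(f"array index {index} outside manifest of {len(manifest_lines)} lines")
    return json.loads(manifest_lines[index])
```

The lookup only works if the manifest has exactly one task per line with no blank lines, which is the "array index = line number" contract stated for the manifest.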

(4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code

The SweBenchAdapter is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:

Break 1: `broken_image` assumes a pre-built SWE-bench image exists

SweBenchAdapter.image_for() returns either instance["docker_image"] (SWE-rebench) or the convention swebench/sweb.eval.x86_64.{iid}:latest. For an arbitrary OSS repo there is no such image. A fresh repo would need:

Clone at base_commit
Install the project's Python/Java/etc. toolchain
Apply git apply -R <golden_diff> to manufacture the broken state
Run scrub_tree() to strip caches
Build a Docker image that encapsulates this broken state
Push the image to a registry accessible by DockerSandbox.boot()

None of this code exists. DockerSandbox.boot(image) raises RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.") if the image is absent.

Break 2: `test_command` is hardcoded

SweBenchAdapter.default_test_command = "python -m pytest -q". A fresh repo may use make test, npm test, cargo test, mvn verify, or any other test runner. There is no test-discovery logic anywhere in the repo.
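
For illustration only, one possible shape for the missing test-discovery logic is a marker-file heuristic. Everything here is hypothetical: the function name, the marker table, and the priority order. In particular, `make test` assumes the Makefile defines a `test` target, which this sketch does not verify.

```python
from typing import Iterable, Optional

# (marker file at repo root, test command); earlier entries win. Hypothetical table.
MARKER_COMMANDS = (
    ("Cargo.toml", "cargo test"),
    ("pom.xml", "mvn verify"),
    ("package.json", "npm test"),
    ("pyproject.toml", "python -m pytest -q"),
    ("setup.py", "python -m pytest -q"),
    ("Makefile", "make test"),
)


def guess_test_command(root_files: Iterable[str]) -> Optional[str]:
    """Guess a test command from root-level file names; None if nothing matches."""
    present = set(root_files)
    for marker, command in MARKER_COMMANDS:
        if marker in present:
            return command
    return None
```

Returning None rather than falling back to pytest makes the "unknown runner" case explicit, so a caller can reject the repo instead of silently running the wrong command.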

Break 3: `fail_to_pass` and `pass_to_pass` require pre-existing test labels

SWE-bench instances ship with FAIL_TO_PASS and PASS_TO_PASS as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived — e.g., via coverage analysis or AST-reachability. FeatureDeletionTask.__post_init__ raises ValueError if fail_to_pass is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.
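
One way the labels could be derived for an arbitrary repo, instead of the coverage or AST-reachability analysis mentioned above, is to run the suite on the intact tree and on the broken tree and diff the outcomes. This is a sketch of that alternative under stated assumptions: `derive_test_labels` is a hypothetical name, and results are assumed to be mappings from test node ID to pass/fail.

```python
def derive_test_labels(
    intact: dict[str, bool],
    broken: dict[str, bool],
) -> tuple[tuple[str, ...], tuple[str, ...]]:
    """Split tests that pass on the intact tree by their outcome on the broken tree."""
    # A test missing from the broken run (e.g. a collection error) counts as failing.
    fail_to_pass = tuple(
        sorted(t for t, ok in intact.items() if ok and not broken.get(t, False))
    )
    pass_to_pass = tuple(
        sorted(t for t, ok in intact.items() if ok and broken.get(t, False))
    )
    return fail_to_pass, pass_to_pass
```

Tests that already fail on the intact tree land in neither set, so they cannot contribute reward or trip the pass_to_pass guard. An empty `fail_to_pass` result means the deletion broke nothing observable and the task should be rejected.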

Break 4: `deleted_symbols` is never populated

SweBenchAdapter hardcodes deleted_symbols=(). The HackMonitor._patch_provenance_hack() check (monitor.py:157-182) skips the symbol-reappearance test if deleted_symbols is empty — so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.
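
A minimal sketch of the AST analysis, assuming Python sources and top-level definitions only: compare the names defined before and after the deletion. The function names are hypothetical, and methods, nested functions, and module-level assignments are deliberately out of scope.

```python
import ast


def top_level_symbols(source: str) -> set[str]:
    """Names of top-level functions and classes in a Python source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }


def deleted_symbols_sketch(before: str, after: str) -> tuple[str, ...]:
    """Symbols present before the deletion and absent after it."""
    return tuple(sorted(top_level_symbols(before) - top_level_symbols(after)))
```

Run per changed file over the intact and broken versions, the union of results would give a non-empty `deleted_symbols` tuple for the provenance check to work with.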

Break 5: No copyleft scrub for arbitrary repos

is_redistributable() reads upstream_license from instance["license_name"]. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.
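
The filter itself is simple once a license string exists; the hard part is detection. The sketch below assumes a license identifier has already been detected and shows only the filtering step. The function name is hypothetical, and treating an unknown license as non-redistributable is an assumption, not documented behavior of is_redistributable().

```python
from typing import Optional

# Substring "GPL" also covers AGPL and LGPL identifiers.
COPYLEFT_MARKER = "GPL"


def is_redistributable_sketch(license_name: Optional[str]) -> bool:
    """Reject GPL-family licenses and, conservatively, unknown ones."""
    if not license_name or not license_name.strip():
        return False  # unknown license: fail closed (assumption)
    return COPYLEFT_MARKER not in license_name.upper()
```

Failing closed on a missing license matters for fresh repos, where the field is absent by default and a permissive default would let copyleft code through unfiltered.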

Break 6: No env setup for non-Python repos

LocalSubprocessSandbox.run_tests runs subprocess.run(cmd, shell=True, ...) against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. DockerSandbox depends on a pre-baked image that already has the environment. A fresh Python repo would need pip install -e . run inside the image, and a non-Python repo would need a completely different image and test runner.

(5) What ingestion/claude_code.py Can Ingest Today

ClaudeCodeIngester.ingest(path) handles exactly one format: Claude Code session JSONL at ~/.claude/projects/<encoded>/<sessionId>.jsonl.

Supported record types handled:

type="user" — string content or list of text/tool_result blocks → OpenAI-style user message; tool_error structural flag set if any tool_result block has is_error: true
type="assistant" — list of text/thinking/tool_use blocks → one TraceState with student_action (full blocks including thinking) and messages (history, optionally with thinking stripped)

Record types silently skipped:

type="summary" — Claude Code conversation summary records
type="attachment", "queue-operation", "file-history-snapshot", "last-prompt", "system" — auxiliary records
isSidechain: True records — subagent traces (skipped in v0.1 per ADR-002)
Files starting with agent- — subagent session files by naming convention

Structural features:

state_id = f"{path.stem}::{state_idx:04d}" — stable within-session identifier
strip_thinking flag (default True) — strips [THINKING] ... lines from the teacher-facing messages history but keeps them in student_action
Injects synthetic system prompt at messages[0] ("You are a senior software engineer...")
Version check: warns on schema version outside 2.x.x

NOT handled by this ingester:

OpenHands trajectory format (planned for v0.2 per ADR-002)
SWE-smith trajectories (planned for v0.2)
Cline VS Code export
Aider chat history
SWE-bench leaderboard trajectory submissions
Any binary or non-JSONL format

Critical Cross-Checks: What the Repo Claims vs What Exists

Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")

What the blog says (COMPOSER_RECIPE_MAPPING.md): "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests." What the repo does: Inverts existing SWE-bench-shaped instances — reverts their gold patch. There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones. The ADR correctly scopes this as "Option A — invert OSS substrates" vs "Option B — greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was explicitly rejected.

Claim 2: "25× synthetic data"

What the blog says: Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2). What the repo has: A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the training run; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.

Claim 3: "Dynamic difficulty curriculum — select for AND create harder tasks"

What Composer 2.5 says: "We both select for and create harder tasks dynamically throughout the run." What the repo has: The SELECT-FOR half: DifficultyCurriculum with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. granularity is set statically to "feature" for all SweBenchAdapter tasks; no escalation logic exists.
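
The p̂(1−p̂) frontier weighting on the SELECT-FOR side can be illustrated in a few lines. This sketch covers only the weighting; the retire/quarantine thresholds and the effort tilt are omitted, and the function names and the uniform fallback are assumptions.

```python
def frontier_weight(p_hat: float) -> float:
    """Weight that peaks at p_hat = 0.5 and vanishes for solved or hopeless tasks."""
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be in [0, 1]")
    return p_hat * (1.0 - p_hat)


def sampling_probs(pass_rates: dict[str, float]) -> dict[str, float]:
    """Normalize frontier weights into sampling probabilities over tasks."""
    weights = {task: frontier_weight(p) for task, p in pass_rates.items()}
    total = sum(weights.values())
    if total == 0.0:
        # Every task is at 0 or 1: fall back to uniform sampling (assumption).
        return {task: 1.0 / len(weights) for task in weights}
    return {task: w / total for task, w in weights.items()}
```

Weighting alone can only reorder the existing pool. Once every task drifts toward p̂ = 1, all weights collapse to zero, which is exactly where the missing CREATE half would have to mint harder tasks.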

Claim 4: `deleted_symbols` enables AST-provenance monitoring

What ADR-010 says: "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads. Reality: deleted_symbols=() on every SweBenchAdapter-derived task (line 81 in substrates.py: hardcoded empty tuple). HackMonitor._patch_provenance_hack() returns False immediately when deleted_symbols is empty (reappeared = [s for s in deleted_symbols if s and s in patch] → empty list). The provenance layer of the monitor is a dead code path for all currently-generable tasks.
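
The dead-path behavior is easy to reproduce in a sketch. The `reappeared` expression is the one quoted above; the function name, the normalization rule, and the artifact signature list are illustrative assumptions rather than the contents of monitor.py.

```python
ARTIFACT_SIGNATURES = ("__pycache__", "pytest_cache")


def normalize(action: str) -> str:
    """Drop quotes, operators, and whitespace so split-up strings rejoin."""
    return "".join(ch for ch in action if ch.isalnum() or ch == "_").lower()


def patch_provenance_hack_sketch(
    deleted_symbols: tuple[str, ...],
    patch: str,
    actions: list[str],
) -> bool:
    """Flag only if a deleted symbol reappears AND a cache artifact was read."""
    reappeared = [s for s in deleted_symbols if s and s in patch]
    if not reappeared:
        return False  # always taken when deleted_symbols is empty
    return any(
        signature in normalize(action)
        for action in actions
        for signature in ARTIFACT_SIGNATURES
    )
```

With `deleted_symbols=()` the function returns False even for a trajectory that reads `"__py"+"cache__"` and pastes the original code back, which is the no-op described above.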

Claim 5: The tree controller and world-model head are part of the system

What design docs say: "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table). Reality: The tree controller is 0% built: no file, no function, no class. Confirmed by exhaustive grep: no SiblingBootstrap, world_model, WorldModel, next_state_head, tree_controller, MCTS, or deliberate_token anywhere in composer_replication/. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) is entirely design.

Claim 6: The broken-repo image is manufactured by the pipeline

What design-F2 says: Step c2 involves "pull the substrate's frozen image, git apply -R the gold patch, scrub_tree(), run the test command, confirm FAIL_TO_PASS actually fails." Reality: This describes what SHOULD happen in the Batch array child. No such code is written. SweBenchAdapter.image_for() returns a string tag; that tag must be pre-pulled on the host before DockerSandbox.boot() can use it (RuntimeError on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.

Summary of Unbuilt vs Built

BUILT and tested (production-ready CPU, Docker-gated where noted):

FeatureDeletionTask schema + FeatureDeletionEnv (reset/step/_grade/reward_fn)
SweBenchAdapter schema inversion (pure dict transform)
FakeSandbox, LocalSubprocessSandbox, DockerSandbox (hardware-gated e2e green in Wave 1/2)
scrub_tree() primary reward-hack control
HackMonitor (signature + patch-provenance, obfuscation-resistant)
DifficultyCurriculum (SELECT-FOR half + effort tilt)
validate_task() 4-gate solvability validator
ClaudeCodeIngester (Claude Code JSONL only)
behavior_rewards.py — c_length, EffortWeights, LengthEffortPenalty, UnfinishedTodoPenalty, LeftoverCoTPenalty, CommunicationReward (Wave 20)
kl_in_reward.py — k1-in-reward path opt-in (Wave 20)
HeldOutGuard + HeldoutSplit + wired into trainer (Wave 2/3)
EKSExecutor skeleton + SageMakerExecutor skeleton (Wave 2)

DESIGN-ONLY (no code):

Tree controller (datagen/tree_controller.py)
SiblingBootstrapGenerator in hint_generator.py
pipeline/s3_layout.py, pipeline/sft_floor.py
teacher_replay_bedrock.py (BedrockBatchTeacherPool)
datagen/aws/batch_validate.py, datagen/aws/glue_ingest_job.py, datagen/aws/s3_contract.py
replaysim/emr_normalize_job.py
infra/datagen_stepfunctions.json, infra/datagen_stack.py
World-model next-state head in trainer
Argo Workflows outer-loop controller
Broken-repo image builder (clone → git apply -R → build → push)
CREATE half of difficulty curriculum (mint harder tasks during run)
SFT-first training stage
Offline LLM-judge hack monitor