Grounding Map: Dataset-Generation Pipeline — What Exists vs What Is Claimed vs What Is Envisioned
Agent: REPO-GROUNDING
Date: 2026-06-09
Scope: composer_replication/datagen/*.py, teacher_replay.py, trainer/composer_trainer.py, loss.py,
hint_generator.py, docs/adrs/ADR-010 + ADR-002 + ADR-013, research/design-F1..F5,
research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md, docs/COMPOSER_RECIPE_MAPPING.md,
docs/BACKLOG_RESOLUTION_2026-06-09.md
(1) Exact Current Dataset-Generation Capability
FeatureDeletionTask schema (datagen/schema.py)
The twelve load-bearing fields and what produces each today:
| Field | Type | Producer today | Notes |
|---|---|---|---|
| task_id | str | SweBenchAdapter.to_task(): copied from instance["instance_id"] or instance["task_id"] | "unknown" if missing |
| repo | str | instance["repo"] via SweBenchAdapter.to_task() | e.g. "getmoto/moto" |
| base_commit | str | instance["base_commit"] | no code to git checkout this commit exists today |
| broken_image | str | SweBenchAdapter.image_for(instance): either instance["docker_image"] (SWE-rebench) or the conventional swebench/sweb.eval.x86_64.{iid}:latest | This tag is a pre-built SWE-bench eval image; no code in the repo pulls or builds these images |
| fail_to_pass | tuple[str,...] | _as_tuple(instance["FAIL_TO_PASS"]): handles JSON-encoded string OR list | validated non-empty in __post_init__ |
| pass_to_pass | tuple[str,...] | _as_tuple(instance["PASS_TO_PASS"]) | may be empty |
| test_command | str | SweBenchAdapter.default_test_command = "python -m pytest -q" | hardcoded; not read from instance |
| deleted_symbols | tuple[str,...] | never populated by SweBenchAdapter: hardcoded () in every substrate inversion | the monitor can't do symbol-provenance checks without this |
| golden_diff | str | instance["patch"] | held out of repr; used only by validator |
| granularity | str | hardcoded "feature" in SweBenchAdapter.to_task() | CREATE-half escalation (function→file→feature) not wired to anything |
| difficulty_prior | float | instance["difficulty"] if present (SWE-rebench) else 0.5 | |
| upstream_license | str | instance["license_name"] | copyleft filter in is_redistributable() strips GPL/AGPL/LGPL |
What SweBenchAdapter actually does and does NOT do
SweBenchAdapter.to_task(instance: dict) is a pure schema inversion — it takes one SWE-bench-shaped dict and maps it to a FeatureDeletionTask. It does NOT:
- Pull or build a Docker image
- Apply the gold patch in reverse (git apply -R)
- Run any tests
- Discover test node IDs
- Populate deleted_symbols (always empty)
- Escalate granularity beyond the static "feature"
The broken-repo Docker image is assumed to exist pre-built (the SWE-bench project publishes these images; SWE-rebench carries its own docker_image field). The full pipeline step "revert gold patch → scrub caches → freeze image" is the documented [~] gate in ADR-010. Its supporting pieces exist (the 4-gate validator interface, scrub_tree, LocalSubprocessSandbox, and DockerSandbox are all built), but no code in the repo actually clones a repo, runs git apply -R <gold_patch>, builds a Docker image, and pushes it to a registry.
What FeatureDeletionEnv does during training (datagen/env.py)
- reset(task): boots the sandbox (by image tag), returns a text prompt listing failing tests. The prompt exposes task.repo, task.fail_to_pass, task.test_command but NEVER golden_diff or deleted_symbols.
- step(action): delegates to sandbox.exec(action), returning observation text; grades on submit or turn limit.
- _grade(): runs sandbox.run_tests(test_command, fail_to_pass + pass_to_pass), computes pass-fraction over fail_to_pass, gates to 0 if the pass_to_pass guard is broken OR HackMonitor.flag() fires.
- reward_fn(prompts, completions, *, task_id, **kwargs): TRL RewardFunc face; dispatches through reset/step; feeds fractional credit (not binary) to DifficultyCurriculum.update.
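The grading rule described for _grade() can be sketched as below. This is a minimal illustration under stated assumptions, not the repo's code: the name grade_episode and the test-id-to-boolean result shape are hypothetical.

```python
def grade_episode(results, fail_to_pass, pass_to_pass, hack_flagged):
    """Fractional reward over fail_to_pass, gated to 0 by two guards.

    `results` maps test id -> True (passed) / False (failed). This shape is
    an assumption for illustration, not the sandbox's actual return type.
    """
    # Guard 1: any regression in pass_to_pass zeroes the reward.
    if any(not results.get(t, False) for t in pass_to_pass):
        return 0.0
    # Guard 2: a hack-monitor flag zeroes the reward.
    if hack_flagged:
        return 0.0
    # Fractional credit: share of target tests that now pass.
    # fail_to_pass is non-empty by the schema's own validation.
    passed = sum(1 for t in fail_to_pass if results.get(t, False))
    return passed / len(fail_to_pass)
```

Returning a fraction rather than a 0/1 value is what lets a curriculum distinguish a half-solved task from an untouched one.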
Safeguards implemented
- scrub_tree(workdir): physically removes __pycache__, .mypy_cache, .pytest_cache, .git, .hg, *.pyc/.pyo/.class before episode start. This is the PRIMARY control (added in Wave 2; was absent before).
- SANDBOX_DENYLIST: blocks find, strings, unzip, jar, javap, decompilers, git. First-token-only check; bypassable via sh -c "...". Documented as defense-in-depth, NOT the wall.
- HackMonitor.flag(): layer 1 is a substring scan for cache/decompiler signatures in trajectory actions (not in submit_patch). Layer 2 is patch-provenance: if a deleted symbol reappears verbatim in the patch AND the trajectory shows a cache/bytecode artifact being read (normalized to defeat "__py"+"cache__" obfuscation), it flags the trajectory.
- DockerSandbox: network_mode='none', read_only=True, cap_drop=['ALL'], no-new-privileges, pids_limit=256, mem_limit=1g, optional gVisor runtime='runsc'.
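To see why a first-token-only denylist is only defense-in-depth, consider the hedged sketch below. The DENYLIST set is an illustrative subset and first_token_blocked is a hypothetical helper; the repo's actual check may tokenize differently.

```python
import shlex

# Illustrative subset of the denylist described above (assumption).
DENYLIST = {"find", "strings", "unzip", "jar", "javap", "git"}

def first_token_blocked(command: str) -> bool:
    """Return True if the command's first token is on the denylist."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] in DENYLIST

direct = "find . -name '*.pyc'"
wrapped = "sh -c \"find . -name '*.pyc'\""

print(first_token_blocked(direct))   # first token is "find"
print(first_token_blocked(wrapped))  # first token is "sh", so the check passes it
```

The wrapped form runs the same find, which is why physically removing the artifacts with scrub_tree is the primary control.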
What ingestion/claude_code.py can ingest today
ClaudeCodeIngester.ingest(path: Path) -> Iterator[TraceState]:
- Input: Claude Code session JSONL at ~/.claude/projects/<encoded>/<sessionId>.jsonl
- Output: one TraceState per assistant TURN (state_id, messages, student_action)
- Skips: subagent files (agent- prefix), sidechain records (isSidechain: True), summary/attachment/queue-operation/file-history-snapshot records
- student_action: JSON-serialized list of text + tool_use + thinking blocks (thinking KEPT in student_action, STRIPPED from teacher-facing messages if strip_thinking=True)
- tool_error flag: structurally set on user messages where any tool_result block has is_error: true. This is the SDPO error-site detection signal
- state_id: f"{path.stem}::{state_idx:04d}"
- Does NOT handle: OpenHands traces, SWE-smith trajectories, any format other than Claude Code JSONL
(2) Envisioned Pipeline End-to-End (S3 Contract Prefixes, Tree Controller, Outer Loop)
From research/design-F1-systems-framing.md, research/design-F2-aws-datagen.md, and research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md §5/§8/§10.
1. Seed trace ingestion (Stage a): ClaudeCodeIngester.ingest() over s3://composer-datagen-386931836011-us-west-2/raw/claude_code/**/*.jsonl → Parquet at traces/v1/run_id=<id>/part-*.parquet via AWS Glue 5.0 Spark ETL job (glue_ingest_job.py, ~80 LOC, NOT YET BUILT).
2. Schema inversion (Stage c1): SweBenchAdapter.to_task() per SWE-bench row → FeatureDeletionTask JSONL at tasks/v1/run_id=<id>/manifest.jsonl (one task per line, array index = line number). Pure CPU; runs inside the Glue job or a Lambda. License gate (is_redistributable()) applied here.
3. N-teacher replay (Stage b): teacher_replay.replay_trace() generalized from flat OpenRouter to BedrockBatchTeacherPool: write one shared replay/v1/run_id=<id>/input/states.jsonl, submit one CreateModelInvocationJob per teacher, write .jsonl.out per teacher to replay/v1/.../teacher=<slug>/. An EMR Serverless aggregation step joins all N outputs by state_id → list[TeacherCallResult]. (teacher_replay_bedrock.py, ~180 LOC, NOT YET BUILT).
4. Multi-model tree expansion (the core delta, NOT BUILT): A tree_controller.py (~250–350 LOC, design-only) that, for each TraceState node, fires N models, applies each candidate action through FeatureDeletionEnv.step() to get a real next observation, branches again from the new state, grades leaves with _grade(). Expansion is gated on pre-expansion divergence between sibling next-action distributions (to avoid O(N^D) explosion). Emits six typed S3 prefixes (see step 8).
5. Sandbox materialization + 4-gate validation (Stage c2): AWS Batch array jobs on EC2 Spot, one child per task. Each child reads AWS_BATCH_JOB_ARRAY_INDEX, looks up its task in the S3 manifest, boots DockerSandbox/LocalSubprocessSandbox, runs validator.validate_task() (4 gates), writes task_grades/v1/run_id=<id>/<task_id>.json. (datagen/aws/batch_validate.py, ~120 LOC, NOT YET BUILT).
6. DPO pair extraction + normalization (Stage d): extract_dpo_pairs() (already built in teacher_replay.py) on the fan-in of teacher outputs → DPOPair rows → DJNormalizer data-juicer op-graph → EMR Serverless Spark for cross-partition dedup → corpus/v1/run_id=<id>/dpo/part-*.parquet and corpus/sft/part-*.parquet. (replaysim/emr_normalize_job.py, ~100 LOC, NOT YET BUILT).
7. Orchestration: AWS Step Functions Standard Workflow: Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless) → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless) → WriteManifest(Lambda). (infra/datagen_stepfunctions.json + infra/datagen_stack.py, ~250 LOC IaC, NOT YET BUILT).
8. S3 typed dataset contract (full set):
   - raw/claude_code/**/*.jsonl: input seed traces
   - traces/v1/run_id=<id>/part-*.parquet: TraceState rows (Stage a output)
   - tasks/v1/run_id=<id>/manifest.jsonl: FeatureDeletionTask rows (Stage c1 output)
   - tasks/golden/run_id=<id>/: golden_diff ACL-isolated prefix (deny-by-default; NEVER co-located with policy-visible tasks/)
   - replay/v1/run_id=<id>/input/states.jsonl: shared Bedrock batch input
   - replay/v1/run_id=<id>/teacher=<slug>/*.jsonl.out: per-teacher Bedrock batch output
   - task_grades/v1/run_id=<id>/<task_id>.json: validator + _grade() results
   - corpus/v1/run_id=<id>/sft/part-*.parquet: clean winning trajectories (SFT-first floor)
   - corpus/v1/run_id=<id>/dpo/part-*.parquet: DPO pairs (normalized DPOPair)
   - dpo_pairs/: divergence-derived DPO pairs from the tree (sibling winners vs losers)
   - rl_task_pool/: FeatureDeletionTask registry + DifficultyCurriculum priors
   - divergence_pairs/: divergence-annotated nodes (where sibling next-action distributions forked)
   - wm_tuples/: (state, action, next_state, outcome) for ALL branches incl. failures (world-model training target)
   - holdout/: disjoint held-out eval anchor (HeldoutSplit; NEVER fed back)
   - diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt: DiLoCo outer-sync (already used by existing allreduce.py)
   - manifests/run_id=<id>.json: run-level manifest (counts, cost, lineage, schema_version, parent_run_id for flywheel)
9. SFT-first stage: Read sft_corpus/ (clean _grade() gate-1 passing trajectories), run compose_loss with alpha_sdpo=0, beta_replay=0 (reduces to _lm_response_ce, i.e. next-token CE masked to response tokens), write ckpt_sft/. (pipeline/sft_floor.py, ~60 LOC, NOT YET BUILT).
10. Inner RL loop: ComposerReplicationTrainer (trl.GRPOTrainer subclass) on rl_task_pool/ with FeatureDeletionEnv.reward_fn; total = grpo + α·sdpo + β·trace_replay_dpo; DiLoCo outer-sync via S3; HeldOutGuard kill-switch now wired (Wave 3).
11. Flywheel: Improved student generates next outer loop's seed traces; learned deliberation-confidence becomes the next round's divergence gate.
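The Stage c2 contract (array index = manifest line number, one grade file per task) can be sketched as below. Since batch_validate.py is not yet built, this is only an illustration: load_task_for_array_child and grade_key are hypothetical names, and the manifest is passed in as already-downloaded lines so that no S3 call is assumed.

```python
import json
import os

def load_task_for_array_child(manifest_lines, env=os.environ):
    """Pick this child's task: array index == manifest line number.

    `manifest_lines` is the manifest JSONL split into lines; fetching it
    from S3 is out of scope for this sketch.
    """
    index = int(env["AWS_BATCH_JOB_ARRAY_INDEX"])
    return json.loads(manifest_lines[index])

def grade_key(run_id: str, task_id: str) -> str:
    """Object key for one task's grade, following the prefix layout above."""
    return f"task_grades/v1/run_id={run_id}/{task_id}.json"

# Usage with a fake environment and a two-line manifest.
manifest = ['{"task_id": "t0"}', '{"task_id": "t1"}']
task = load_task_for_array_child(manifest, env={"AWS_BATCH_JOB_ARRAY_INDEX": "1"})
key = grade_key("r42", task["task_id"])
```

Keying the lookup on the line number keeps each child stateless: it needs only the manifest location and its own index.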
(3) Unbuilt Components the Vision Depends On
Every item below is design-only or a skeleton; none has real production code.
| Component | File Estimate | Source | Status |
|---|---|---|---|
| datagen/tree_controller.py: the core delta (env-step between branches, _grade() at leaves, divergence-gated expansion, six typed S3 prefix writes) | ~250–350 LOC | design-F1, final_report §1/§5/§6 | 0% built (no file exists) |
| SiblingBootstrapGenerator in hint_generator.py: select max-reward sibling → emit "a working approach looks like: …" → feed ctx_teacher splice | ~60 LOC | design-F5 Tier 1 / final_report §1/§6 | 0% built (not a class in hint_generator.py at all) |
| pipeline/s3_layout.py: typed writers for all six S3 dataset prefixes; the OUTER→INNER contract | ~80 LOC | design-F1 §4 | 0% built (no pipeline/ directory exists) |
| pipeline/sft_floor.py: SFT-first driver (read sft_corpus/, run TRL SFTTrainer or compose_loss alpha=beta=0, write ckpt_sft/) | ~60 LOC | design-F1 §2 / design-F5 d | 0% built |
| teacher_replay_bedrock.py: BedrockBatchTeacherPool (submit one Bedrock CreateModelInvocationJob per teacher, poll, parse .jsonl.out back into list[TeacherCallResult]) | ~180 LOC | design-F2 §b | 0% built |
| datagen/aws/batch_validate.py: AWS Batch array-child entrypoint (read AWS_BATCH_JOB_ARRAY_INDEX → manifest line → DockerSandbox + validator + _grade() → write task_grades/) | ~120 LOC | design-F2 §c2 | 0% built (datagen/aws/ subdirectory does not exist) |
| datagen/aws/glue_ingest_job.py: Glue Spark entrypoint wrapping ClaudeCodeIngester.ingest in mapPartitions; write traces/ Parquet | ~80 LOC | design-F2 §a | 0% built |
| replaysim/emr_normalize_job.py: EMR Serverless Spark entrypoint wrapping DJNormalizer per partition + Spark cross-partition dedup; write corpus/dpo/ + corpus/sft/ Parquet | ~100 LOC | design-F2 §d | 0% built |
| datagen/aws/s3_contract.py: S3 layout constants, RunManifest dataclass, Parquet/JSONL serializers, recordId==state_id join helpers, schema_version/split column injection | ~120 LOC | design-F2 §contract | 0% built |
| infra/datagen_stepfunctions.json + infra/datagen_stack.py: Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless, Glue) | ~250 LOC IaC | design-F2 §orchestration | 0% built (infra/ directory does not exist) |
| trainer/composer_trainer.py world-model head: parameter-isolated next-state adapter + <deliberate> token as second SDPO mode | ~40 LOC delta | design-F1 §4 / final_report §2 | 0% built (grep confirms no world_model/WorldModel/next_state_head/<deliberate> anywhere in composer_replication/) |
| Broken-repo image builder: code to clone a repo at base_commit, apply git apply -R <golden_diff>, run scrub_tree, build and push a Docker image to ECR | unspecified | ADR-010 §decision / design-F2 §c2 | 0% built (there is NO code anywhere in the repo that manufactures a broken-repo Docker image from scratch) |
| EKSExecutor (now skeleton-built in Wave 2) + Argo Workflows controller for outer loop | Wave-2 executor skeleton built; Argo controller design-only | design-F1 §AWS / final_report §8 | skeleton built (eks.py is a functional executor with IndexedJob dispatch, but the Argo outer-loop controller is 0%) |
| verl AsyncServer backend for tool-heavy tree | n/a | final_report §8 | 0% built (design note only) |
| Offline LLM-judge hack monitor (EvilGenie-style, Bedrock) | n/a | design-F5 §Tier 4 | 0% built |
(4) Seams Where "Point at an Arbitrary OSS Repo" Breaks the Current Code
The SweBenchAdapter is designed to consume pre-packaged SWE-bench-shaped datasets, not arbitrary GitHub repos. The breaks are structural:
Break 1: broken_image assumes a pre-built SWE-bench image exists
SweBenchAdapter.image_for() returns either instance["docker_image"] (SWE-rebench) or the convention swebench/sweb.eval.x86_64.{iid}:latest. For an arbitrary OSS repo there is no such image. A fresh repo would need:
- Clone at base_commit
- Install the project's Python/Java/etc. toolchain
- Apply git apply -R <golden_diff> to manufacture the broken state
- Run scrub_tree() to strip caches
- Build a Docker image that encapsulates this broken state
- Push the image to a registry accessible by DockerSandbox.boot()
None of this code exists. DockerSandbox.boot(image) raises RuntimeError("DockerSandbox.boot: image {image!r} not found locally and could not be pulled (the container is --network none). Pull it on the host first.") if the image is absent.
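As a rough picture of what the missing builder would have to do, the sketch below lays out the manufacture steps as commands without running anything. Everything here is hypothetical: plan_broken_image_build is not a repo function, the Dockerfile is assumed to live in the cloned tree, and toolchain installation and the scrub_tree() call are assumed to happen in a separate step.

```python
def plan_broken_image_build(repo_url, base_commit, patch_path, image_tag,
                            workdir="repo"):
    """Return the commands (as argv lists) for the manufacture steps.

    Nothing is executed. `patch_path` should be absolute, because
    `git -C` changes directory before the patch is read.
    """
    return [
        ["git", "clone", repo_url, workdir],
        ["git", "-C", workdir, "checkout", base_commit],
        # Reverse-apply the gold patch to produce the broken state.
        ["git", "-C", workdir, "apply", "-R", patch_path],
        # Assumes a Dockerfile exists in the working tree.
        ["docker", "build", "-t", image_tag, workdir],
        ["docker", "push", image_tag],
    ]

plan = plan_broken_image_build(
    "https://example.invalid/org/project.git",  # placeholder URL
    "abc1234",
    "/tmp/gold.patch",
    "registry.example.invalid/broken/project:abc1234",
)
```

Separating the plan from its execution keeps the step order reviewable and testable before any image is built.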
Break 2: test_command is hardcoded
SweBenchAdapter.default_test_command = "python -m pytest -q". A fresh repo may use make test, npm test, cargo test, mvn verify, or any other test runner. There is no test-discovery logic anywhere in the repo.
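One possible shape for the missing test-discovery logic is a marker-file heuristic, sketched below. It is purely illustrative: the MARKERS table, its priority order, and guess_test_command are assumptions, and real repos often need per-project overrides.

```python
from pathlib import Path

# Marker file -> test command. A hypothetical heuristic; list order sets priority.
MARKERS = [
    ("Cargo.toml", "cargo test"),
    ("package.json", "npm test"),
    ("pom.xml", "mvn verify"),
    ("pyproject.toml", "python -m pytest -q"),
    ("setup.py", "python -m pytest -q"),
    ("Makefile", "make test"),
]

def guess_test_command(repo_root: Path):
    """Return the first matching test command, or None if nothing matches."""
    for marker, command in MARKERS:
        if (repo_root / marker).is_file():
            return command
    return None
```

Returning None rather than a default makes an unrecognized repo an explicit failure instead of a silent pytest run.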
Break 3: fail_to_pass and pass_to_pass require pre-existing test labels
SWE-bench instances ship with FAIL_TO_PASS and PASS_TO_PASS as pre-identified pytest node IDs. For an arbitrary repo the mapping from "the code change" to "which tests exercise the deleted symbols" must be derived — e.g., via coverage analysis or AST-reachability. FeatureDeletionTask.__post_init__ raises ValueError if fail_to_pass is empty. The 4-gate validator's Gate 2 (deletion breaks the feature) cannot be verified without pre-identified test node IDs.
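A simpler alternative to coverage or AST-reachability analysis is to run the suite twice and compare outcomes. The sketch below uses that differential approach, which is a swapped-in technique rather than anything the repo implements; derive_test_labels and the test-id-to-boolean input shape are hypothetical.

```python
def derive_test_labels(broken: dict, fixed: dict):
    """Split tests by outcome in the broken vs. restored tree.

    Both arguments map test id -> True (passed) / False (failed). A test
    missing from `broken` (e.g. a collection error) counts as failing there.
    """
    fail_to_pass = tuple(sorted(
        t for t, ok in fixed.items() if ok and not broken.get(t, False)
    ))
    pass_to_pass = tuple(sorted(
        t for t, ok in fixed.items() if ok and broken.get(t, False)
    ))
    return fail_to_pass, pass_to_pass

f2p, p2p = derive_test_labels(
    broken={"test_a": False, "test_b": True, "test_c": False},
    fixed={"test_a": True, "test_b": True, "test_c": False},
)
```

Tests that fail in both trees land in neither tuple, so pre-existing failures never become targets or guards.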
Break 4: deleted_symbols is never populated
SweBenchAdapter hardcodes deleted_symbols=(). The HackMonitor._patch_provenance_hack() check (monitor.py:157-182) skips the symbol-reappearance test if deleted_symbols is empty — so the provenance layer of the hack monitor is effectively a no-op on all SweBenchAdapter-derived tasks. For a fresh repo, AST analysis to identify the deleted symbols would be required.
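For Python sources, the AST analysis mentioned above could start from something like the following sketch. It is an assumption-laden minimum: it compares only top-level function and class names in a single file, and both function names are hypothetical.

```python
import ast

def top_level_symbols(source: str) -> set:
    """Names of top-level functions and classes in a module's source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

def find_deleted_symbols(intact_source: str, broken_source: str) -> tuple:
    """Symbols present in the intact file but absent from the broken one."""
    return tuple(sorted(
        top_level_symbols(intact_source) - top_level_symbols(broken_source)
    ))

intact = "def keep():\n    pass\n\nclass Gone:\n    pass\n\ndef gone_too():\n    pass\n"
broken = "def keep():\n    pass\n"
deleted = find_deleted_symbols(intact, broken)
```

A production version would also need to handle methods, nested definitions, and symbols whose bodies were emptied while the name survived.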
Break 5: No copyleft scrub for arbitrary repos
is_redistributable() reads upstream_license from instance["license_name"]. For a fresh GitHub repo there is no pre-populated license field; the repo license must be detected (e.g., via SPDX scanning) before the copyleft filter can be applied.
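Once a license string has been detected, the copyleft gate itself is small. The sketch below is an assumed stand-in, not the repo's is_redistributable(): the marker list, the substring matching, and the choice to reject a missing license are all assumptions made for illustration.

```python
# GPL-family markers. "GPL" also matches "AGPL" and "LGPL"; the longer
# names are listed only for readability.
COPYLEFT_MARKERS = ("AGPL", "LGPL", "GPL")

def looks_redistributable(license_name) -> bool:
    """Reject missing licenses and GPL-family names; accept everything else."""
    if not license_name:
        # Assumption: an undetected license is treated as not redistributable.
        return False
    upper = license_name.upper()
    return not any(marker in upper for marker in COPYLEFT_MARKERS)
```

Failing closed on a missing license matters for fresh repos, where the field starts out empty.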
Break 6: No env setup for non-Python repos
LocalSubprocessSandbox.run_tests runs subprocess.run(cmd, shell=True, ...) against the working tree with a hard-coded 600s timeout. It has no virtualenv creation, no dependency installation, no multi-language support. DockerSandbox depends on a pre-baked image that already has the environment. A fresh Python repo would need pip install -e . run inside the image, and a non-Python repo would need a completely different image and test runner.
(5) What ingestion/claude_code.py Can Ingest Today
ClaudeCodeIngester.ingest(path) handles exactly one format: Claude Code session JSONL at ~/.claude/projects/<encoded>/<sessionId>.jsonl.
Supported record types handled:
- type="user": string content or list of text/tool_result blocks → OpenAI-style user message; tool_error structural flag set if any tool_result block has is_error: true
- type="assistant": list of text/thinking/tool_use blocks → one TraceState with student_action (full blocks including thinking) and messages (history, optionally with thinking stripped)
Record types silently skipped:
- type="summary": Claude Code conversation summary records
- type="attachment", "queue-operation", "file-history-snapshot", "last-prompt", "system": auxiliary records
- isSidechain: True records, i.e. subagent traces (skipped in v0.1 per ADR-002)
- Files starting with agent-, i.e. subagent session files by naming convention
Structural features:
- state_id = f"{path.stem}::{state_idx:04d}": stable within-session identifier
- strip_thinking flag (default True): strips [THINKING] ... lines from the teacher-facing messages history but keeps them in student_action
- Injects synthetic system prompt at messages[0] ("You are a senior software engineer...")
- Version check: warns on schema version outside 2.x.x
NOT handled by this ingester:
- OpenHands trajectory format (planned for v0.2 per ADR-002)
- SWE-smith trajectories (planned for v0.2)
- Cline VS Code export
- Aider chat history
- SWE-bench leaderboard trajectory submissions
- Any binary or non-JSONL format
Critical Cross-Checks: What the Repo Claims vs What Exists
Claim 1: "Feature Deletion generator" (Composer 2.5 blog says "point at a repo")
What the blog says (COMPOSER_RECIPE_MAPPING.md): "take a repo with passing tests, delete some code, ask the agent to reimplement to pass tests."
What the repo does: Inverts existing SWE-bench-shaped instances by reverting their gold patch. There is NO code that: (a) points at an arbitrary OSS repo, (b) identifies deletable symbols, (c) synthesizes a broken state beyond SWE-bench's pre-packaged ones. The ADR correctly scopes this as "Option A — invert OSS substrates" vs "Option B — greenfield repo scraping." The blog's "point at a repo" vision is Option B, which was explicitly rejected.
Claim 2: "25× synthetic data"
What the blog says: Composer 2.5 uses 25× more synthetic tasks than Composer 2 (COMPOSER_RECIPE_MAPPING.md §2).
What the repo has: A schema adapter for 5 existing OSS datasets (SWE-bench-Lite ~300, SWE-Gym ~2.4k, R2E-Gym ~8.1k, SWE-rebench ~21.3k, OpenHands/Nemotron ~59k). ADR-010 notes ~15 node-days to invert all SWE-rebench tasks. No actual inverted task corpus has been generated. The 25× claim refers to the training run; the repo has the generation machinery for the inversion shape but not the greenfield synthesis needed for genuine novel task minting.
Claim 3: "Dynamic difficulty curriculum — select for AND create harder tasks"
What Composer 2.5 says: "We both select for and create harder tasks dynamically throughout the run."
What the repo has: The SELECT-FOR half: DifficultyCurriculum with p̂(1−p̂) frontier weighting, retire/quarantine thresholds, and effort tilt on turns/think-tokens (Wave 20). The CREATE half (escalating deletion span, coupling complexity, multi-feature targets during the run) is explicitly listed as MISSING in design-F5 row b2. granularity is set statically to "feature" for all SweBenchAdapter tasks; no escalation logic exists.
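The p̂(1−p̂) frontier weighting named above favors tasks the policy solves about half the time. The sketch below shows only that weighting; the function names are hypothetical, and the retire/quarantine thresholds and effort tilt of the real DifficultyCurriculum are omitted.

```python
def frontier_weight(p_hat: float) -> float:
    """Bernoulli-variance weight: peaks at p_hat = 0.5, zero at 0 and 1."""
    return p_hat * (1.0 - p_hat)

def sampling_probs(pass_rates: dict) -> dict:
    """Normalize frontier weights into per-task sampling probabilities."""
    weights = {t: frontier_weight(p) for t, p in pass_rates.items()}
    total = sum(weights.values())
    if total == 0:
        # Every task is fully solved or fully unsolved: fall back to uniform.
        return {t: 1.0 / len(weights) for t in weights}
    return {t: w / total for t, w in weights.items()}

probs = sampling_probs({"frontier": 0.5, "mastered": 1.0})
```

This only reweights an existing pool, which is why it covers the SELECT-FOR half and not the CREATE half.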
Claim 4: deleted_symbols enables AST-provenance monitoring
What ADR-010 says: "signature + patch-provenance monitor" that detects if deleted symbols reappear via cache reads.
Reality: deleted_symbols=() on every SweBenchAdapter-derived task (line 81 in substrates.py: hardcoded empty tuple). HackMonitor._patch_provenance_hack() returns False immediately when deleted_symbols is empty (reappeared = [s for s in deleted_symbols if s and s in patch] → empty list). The provenance layer of the monitor is a dead code path for all currently-generable tasks.
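The no-op behavior follows directly from the list comprehension quoted above. The sketch below wraps that expression in a hypothetical helper, symbols_reappeared, and leaves out the cache-read half of the real check.

```python
def symbols_reappeared(deleted_symbols, patch: str) -> bool:
    """True if any non-empty deleted symbol occurs verbatim in the patch."""
    # Expression quoted from the text above.
    reappeared = [s for s in deleted_symbols if s and s in patch]
    return bool(reappeared)

patch = "def parse_header(raw):\n    return raw.split(':')\n"

with_symbols = symbols_reappeared(("parse_header",), patch)
without_symbols = symbols_reappeared((), patch)
```

With an empty tuple the comprehension has nothing to iterate over, so the result is False whatever the patch contains.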
Claim 5: The tree controller and world-model head are part of the system
What design docs say: "roughly nine-tenths of it" is reuse (final_report §6 reuse-vs-build table).
Reality: The tree controller is 0% built: no file, no function, no class. Confirmed by exhaustive grep: no SiblingBootstrap, world_model, WorldModel, next_state_head, tree_controller, MCTS, deliberate_token anywhere in composer_replication/. The "nine-tenths reuse" claim is accurate for the Composer recipe replication; the tree itself (the framework's own addition) is entirely design-only.
Claim 6: The broken-repo image is manufactured by the pipeline
What design-F2 says: Step c2 involves "pull the substrate's frozen image, git apply -R the gold patch, scrub_tree(), run the test command, confirm FAIL_TO_PASS actually fails."
Reality: This describes what SHOULD happen in the Batch array child. No such code is written. SweBenchAdapter.image_for() returns a string tag; that tag must be pre-pulled on the host before DockerSandbox.boot() can use it (RuntimeError on image-not-found). The full broken-image manufacture pipeline (clone → revert → scrub → build → push) is a gap.
Summary of Unbuilt vs Built
BUILT and tested (production-ready CPU, Docker-gated where noted):
- FeatureDeletionTask schema + FeatureDeletionEnv (reset/step/_grade/reward_fn)
- SweBenchAdapter schema inversion (pure dict transform)
- FakeSandbox, LocalSubprocessSandbox, DockerSandbox (hardware-gated e2e green in Wave 1/2)
- scrub_tree() primary reward-hack control
- HackMonitor (signature + patch-provenance, obfuscation-resistant)
- DifficultyCurriculum (SELECT-FOR half + effort tilt)
- validate_task() 4-gate solvability validator
- ClaudeCodeIngester (Claude Code JSONL only)
- behavior_rewards.py: c_length, EffortWeights, LengthEffortPenalty, UnfinishedTodoPenalty, LeftoverCoTPenalty, CommunicationReward (Wave 20)
- kl_in_reward.py: k1-in-reward path opt-in (Wave 20)
- HeldOutGuard + HeldoutSplit + wired into trainer (Wave 2/3)
- EKSExecutor skeleton + SageMakerExecutor skeleton (Wave 2)
DESIGN-ONLY (no code):
- Tree controller (datagen/tree_controller.py)
- SiblingBootstrapGenerator in hint_generator.py
- pipeline/s3_layout.py, pipeline/sft_floor.py
- teacher_replay_bedrock.py (BedrockBatchTeacherPool)
- datagen/aws/batch_validate.py, datagen/aws/glue_ingest_job.py, datagen/aws/s3_contract.py
- replaysim/emr_normalize_job.py
- infra/datagen_stepfunctions.json, infra/datagen_stack.py
- World-model next-state head in trainer
- Argo Workflows outer-loop controller
- Broken-repo image builder (clone → git apply -R → build → push)
- CREATE half of difficulty curriculum (mint harder tasks during run)
- SFT-first training stage
- Offline LLM-judge hack monitor