Design Critic: Architecture of the Envisioned Dataset-Generation Pipeline
Reviewer: DESIGN CRITIC (critical-review pipeline, subagent)
Date: 2026-06-10
Inputs: research/deepread/00-grounding.md + 01–08, research/design-F1-systems-framing.md, research/design-F2-aws-datagen.md, docs/adrs/ADR-010, and direct reads of composer_replication/datagen/{env,substrates,validator,sandbox}.py, teacher_replay.py, safety/holdout.py, hint_generator.py.
Scope: Attack the ARCHITECTURE of the envisioned dataset pipeline — buy-vs-build, the S3 contract, tree-controller implementability, package placement, the minimal end-to-end pipeline, and components no design mentions.
P0 — Foundational architecture defects (the design as drawn cannot produce the promised artifact)
1. The tree-of-work's seed traces and its execution oracle operate on DISJOINT data — a chicken-and-egg no design resolves. [P0]
F1's diagram and the final report (§5: "ingest a seed trace, expand the divergence-gated tree across N models, execute every branch in a sandbox, grade leaves by _grade()") compose two components that cannot currently meet:
- Seed traces come from ClaudeCodeIngester over ~/.claude/projects/**.jsonl: sessions against the user's local working directories at arbitrary commits, with no Docker image, no pinned dependencies, no FAIL_TO_PASS labels, no FeatureDeletionTask.
- The execution oracle (FeatureDeletionEnv.step() / _grade()) requires task.broken_image (a frozen container) and pre-identified test node IDs (__post_init__ raises on empty fail_to_pass).
You cannot env.step() a Claude Code trace action: there is no environment that materializes the repo state the trace was recorded against. The grounding map's Breaks 1–6 document this for "arbitrary OSS repo" but nobody applies it to the trace-replay-tree, where it is fatal: every branch of the tree needs an executable sandbox, and Claude Code traces have none. The only traces that are tree-expandable are traces collected inside FeatureDeletionEnv episodes — which do not exist because nothing in the repo runs an agent inside the env (see Finding 2).
Recommendation: Invert the bootstrap order. Phase 1 of the tree must seed from env-grounded traces (agent rollouts on FeatureDeletionTask episodes, where reset state is reproducible by task_id), not Claude Code sessions. Demote Claude Code traces to (a) flat Channel-3 replay (DPO text pairs, no execution) and (b) SFT style data — uses that need no oracle. Document this split in a new ADR; the current F1 diagram is misleading and will misdirect the build.
2. No agent rollout harness exists — the SFT corpus has no producer. [P0]
The "SFT-first competence floor" (F1 §2, final report §5) reads sft_corpus/ = "clean winning trajectories." Trace the producers: teacher_replay emits single next-actions per frozen state, not episodes. env.reward_fn's fallback treats the whole completion as one submit (08 §5.1 calls this "a dead end for genuine multi-turn"). The tree controller is 0% built. Nothing in the repo, built or designed, drives a multi-turn agent loop (LLM → tool call → sandbox.exec → observation → LLM) to completion and serializes the trajectory. SWE-smith collected its 5k SFT trajectories with SWE-agent + Claude; SWE-Gym with OpenHands. Every design doc skips this component; F2's four stages (ingest/replay/validate/normalize) produce tasks and DPO pairs but no SFT trajectories at all — yet F2's stage (d) claims to write corpus/sft/.
Recommendation: Add an explicit rollout harness component to the pipeline: adopt mini-swe-agent or SWE-agent (battle-tested, supports any API model) as the expert-trajectory collector against FeatureDeletionEnv tasks, with _grade()==1.0 + HackMonitor-clean as the SFT admission filter. ~200–400 LOC of adapter, not a new agent. This is the single highest-priority build item: without it, the minimal pipeline (see the MINIMAL pipeline section at the end of this review) cannot terminate in an SFT corpus.
3. Divergence-gating is not computable from what the current components emit. [P0]
The tree's economics depend entirely on the divergence gate (final report §3: it turns O(N^D) into "O(N · decision-points)"). The gate needs "pre-expansion divergence between sibling next-action distributions." But:
- Teacher APIs (OpenRouter, Bedrock batch) return text, not distributions. Bedrock batch (CreateModelInvocationJob) does not return logprobs usable for a cross-model divergence measure (and cross-model token distributions live in different vocabularies anyway; KL between them is undefined without a common action space).
- The only equality/divergence measure in the codebase is _normalize_action(): whitespace-collapse + lowercase. Deepread 07 (FR-R8, HIGH) already established this produces "mostly noise on real traces": semantically identical tool calls in different JSON formatting count as disagreement, so the gate would fire on nearly every node, silently degrading the tree to the O(N^D) ungated cost the gate exists to prevent. The cost-control mechanism and the known-broken normalizer are the same component.
Recommendation: Define divergence over a canonical action algebra, not text: parse every candidate into (tool_name, normalized_args) (AST-normalize code args, path-normalize file args), and gate on (i) tool-name disagreement, (ii) arg-level edit distance over the canonical form, with an escalation tier that asks a cheap judge model only when (i)/(ii) is ambiguous. Build and unit-test the gate's firing rate on 5 real traces (expected: fires on <20% of steps) before writing tree_controller.py. The tool-call parser is the prerequisite, not a polish item.
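The first two tiers of the recommended gate can be sketched as follows. This is a minimal illustration, not the repo's parser: it assumes each candidate arrives as a JSON object of the hypothetical shape {"tool": ..., "args": {...}}, treats only an argument named "path" as a file path, uses difflib's similarity ratio as a stand-in for arg-level edit distance, and omits the judge-model escalation tier entirely. The 0.2 threshold is a placeholder to be tuned against the measured firing rate.

```python
import json
import posixpath
from difflib import SequenceMatcher


def _norm_value(key: str, value):
    """Strip strings; collapse ./ and // for values stored under the 'path' key."""
    if isinstance(value, str):
        value = value.strip()
        if key == "path":  # assumption: file arguments are named "path"
            value = posixpath.normpath(value)
    return value


def canonicalize(raw: str) -> tuple[str, str]:
    """Parse one candidate into (tool_name, canonical_args_json)."""
    call = json.loads(raw)
    args = {k: _norm_value(k, v) for k, v in call.get("args", {}).items()}
    # Sorted keys and fixed separators make JSON formatting irrelevant.
    return call["tool"], json.dumps(args, sort_keys=True, separators=(",", ":"))


def gate_fires(candidates: list[str], arg_threshold: float = 0.2) -> bool:
    """True when sibling candidates disagree enough to justify branching."""
    canon = [canonicalize(c) for c in candidates]
    if len({name for name, _ in canon}) > 1:  # tier (i): tool-name disagreement
        return True
    base = canon[0][1]
    # tier (ii): dissimilarity of canonical args against the first candidate
    return any(
        1.0 - SequenceMatcher(None, base, args).ratio() > arg_threshold
        for _, args in canon[1:]
    )
```

Under this sketch, two calls that differ only in key order, whitespace, or redundant path separators canonicalize to the same pair, so the gate stays closed where _normalize_action() would report disagreement.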
4. The tree requires a sandbox fork/snapshot primitive that neither the Sandbox API nor any design has. [P0]
The tree_controller.py design reads: "apply each candidate action through FeatureDeletionEnv.step() to get a real next observation, branch again from the new state". Branching N ways from a mid-episode state requires N independent copies of the mutated working tree. The Sandbox protocol has boot/exec/run_tests/trajectory but no fork() and no snapshot()/restore(). DockerSandbox boots one container per episode with read_only=True rootfs. The two realizable options are architecturally very different, and nobody has chosen between them:
- Replay-from-root: re-boot the image and re-execute the action prefix for every branch — cost O(depth) sandbox-execs per node, multiplying the already-dominant sandbox cost (final report §10 names per-branch sandbox isolation "the throughput ceiling of the whole idea").
- Filesystem snapshot fork: overlayfs / docker commit / CRIU per node. Fast, but it pins the design to one isolation backend and conflicts with gVisor/Kata choices in F2 stage (c).
Recommendation: Extend the Sandbox protocol with fork() -> Sandbox now, implement it as overlayfs-upper-dir copy for LocalSubprocessSandbox and docker commit+boot for DockerSandbox, and measure fork latency in a spike before committing to tree depth >1. If fork costs >5s, the honest fallback is depth-1 trees (N candidate actions, each executed one step + graded by a cheap proxy) — which is most of the DPO value at a fraction of the machinery.
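The contract that fork() has to satisfy can be pinned down with a toy before any overlayfs or docker commit work starts. The sketch below is an assumption-laden stand-in: the Protocol shows only exec and fork (the real Sandbox protocol also has boot, run_tests, and trajectory), InMemorySandbox models the working tree as a dict, and the "write <path> <text>" command is invented purely for the demo.

```python
import copy
from typing import Protocol


class Sandbox(Protocol):
    """Reduced, hypothetical view of the protocol: only what branching needs."""

    def exec(self, cmd: str) -> str: ...
    def fork(self) -> "Sandbox": ...


class InMemorySandbox:
    """Toy backend: the 'working tree' is a dict of path -> contents."""

    def __init__(self, files=None):
        self.files = dict(files or {})
        self.history = []

    def exec(self, cmd: str) -> str:
        # Invented mini-command for illustration: "write <path> <text>"
        op, path, text = cmd.split(" ", 2)
        if op != "write":
            raise ValueError(f"unsupported command: {op}")
        self.files[path] = text
        self.history.append(cmd)
        return f"wrote {path}"

    def fork(self) -> "InMemorySandbox":
        # The child must share no mutable state with its parent.
        child = InMemorySandbox(copy.deepcopy(self.files))
        child.history = list(self.history)
        return child


def expand(root: Sandbox, candidates: list[str]) -> list[Sandbox]:
    """Depth-1 expansion: one forked sandbox per candidate action."""
    branches = []
    for action in candidates:
        child = root.fork()
        child.exec(action)
        branches.append(child)
    return branches
```

The property worth turning into a conformance test for every real backend is that the parent's state is unchanged after expand() and that siblings never observe each other's writes.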
5. Zero benchmark decontamination anywhere in the pipeline. [P0]
The pipeline trains on SWE-bench-family substrates and the program will inevitably report SWE-bench Verified numbers (every comparison point in the research notes — DeepSWE 42.2%, SWE-smith 40.2%, Socratic-SWE 50.4% — is SWE-bench Verified). The substrates have different and partial decontamination policies: R2E-Gym decontaminates only its Subset against SWE-bench test repos (deepread 02 §6.2.2: "The full R2E-Gym (8,135 tasks) may overlap"); SWE-rebench spans 3,468 repos with no stated guarantee; the deepread 02 review explicitly flags research/06's "no contamination worry" as an OVERCLAIM. HeldoutSplit guards only the internal train/holdout partition — it has no concept of an external benchmark. No design doc (F1, F2, ADR-010) contains the word "decontamination."
Recommendation: Add a mandatory decontaminate(tasks, benchmark) gate at stage (c1) with two filter levels and an audit record: (i) repo-level: drop any task whose repo appears in SWE-bench Lite/Verified/full or the chosen eval suite; (ii) instance-level: drop tasks whose golden_diff or fail_to_pass node IDs match an eval instance; (iii) record the decontamination manifest (benchmark name + version hash + drop count) in manifests/run_id.json. ~50 LOC + a pinned eval-instance index. This must exist before the first corpus is generated, because retro-filtering a published corpus is a credibility event.
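A minimal sketch of such a gate, under stated assumptions: Task and Benchmark are reduced, hypothetical stand-ins (the real FeatureDeletionTask has more fields), the benchmark index is assumed to be pre-built with lowercased repo names, and instance matching is exact string equality, which a real implementation would likely loosen to normalized-diff matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Reduced stand-in for FeatureDeletionTask (hypothetical field subset)."""
    task_id: str
    repo: str
    golden_diff: str
    fail_to_pass: tuple[str, ...]


@dataclass(frozen=True)
class Benchmark:
    """Pinned eval-instance index; repo names are stored lowercased."""
    name: str
    version_hash: str
    repos: frozenset[str]
    golden_diffs: frozenset[str]
    test_ids: frozenset[str]


def decontaminate(tasks: list[Task], benchmark: Benchmark):
    """Return (kept_tasks, manifest); repo-level check runs before instance-level."""
    kept = []
    dropped = {"repo": 0, "instance": 0}
    for task in tasks:
        if task.repo.lower() in benchmark.repos:
            dropped["repo"] += 1
        elif (task.golden_diff in benchmark.golden_diffs
              or set(task.fail_to_pass) & benchmark.test_ids):
            dropped["instance"] += 1
        else:
            kept.append(task)
    manifest = {
        "benchmark": benchmark.name,
        "benchmark_version_hash": benchmark.version_hash,
        "dropped": dropped,
        "kept": len(kept),
    }
    return kept, manifest
```

Returning the manifest from the same call that filters makes it hard to produce a corpus without also producing its decontamination statement.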
6. Buy-vs-build: the designs build the one component that is buyable, and skip the integration deepread 02 already identified. [P0]
ADR-010 chose "Option A: invert OSS substrates" and rejected "Option B: greenfield repo scraping" — correct in 2026-05. But the user's new ask ("point at a repo") is Option B, and the design response (grounding map: "Broken-repo image builder — clone, git apply -R, scrub, build, push — unspecified, 0% built") is to hand-build exactly what pip install swesmith (MIT) ships: environment construction from arbitrary GitHub repos (~7 min human/repo, one image per repo = 500× storage win over per-task images), four bug-synthesis strategies, validation, issue-text generation, and rp.get_container(task) returning a booted container. Deepread 02 flagged this at HIGH ("the ADR does not evaluate it as a dependency") and noted SWE-smith's PR Mirror strategy — the exact gold-patch-reversion mechanic of SweBenchAdapter — produces the best training data in SWE-smith's own ablation (Table 5). The repo's core approach is validated by prior art it doesn't cite, and its missing half is shipped by a library it doesn't depend on.
Recommendation: Commit a revised ADR: [BUY] swesmith for env-construction + bug synthesis on new repos; [BUY] SWE-smith 59k / R2E-Gym-Subset 4.6k / SWE-Gym 2.4k datasets through the existing SweBenchAdapter (02 confirms it works on SWE-smith instances unchanged); [BUILD] only the genuinely novel pieces — DifficultyCurriculum, HackMonitor/scrub, decontamination, the rollout harness (Finding 2), and (later, ablation-gated) coverage-guided deletion targeting. The "point-at-a-repo" feature becomes a ~100 LOC SweSmithProfileAdapter instead of an unspecified image-builder subsystem. Budget reality check from 02: SWE-smith built 50k tasks for $1,360 + 20 human-hours; any in-house builder must beat that.
P1 — Significant design errors (buildable, but will mislead or bite)
7. There are TWO unreconciled S3 contracts; the "6-prefix contract" is neither complete nor singular. [P1]
F1 commits six prefixes (sft_corpus/ dpo_pairs/ rl_task_pool/ divergence_pairs/ wm_tuples/ holdout/ under runs/<run_id>/). F2 commits a different layout (traces/v1/run_id=<id>/, tasks/v1/..., replay/v1/..., task_grades/v1/..., corpus/v1/run_id=<id>/{sft,dpo}/, manifests/). The grounding map (§2 step 8) pastes both into one list — so corpus/v1/.../dpo/ and dpo_pairs/ both exist for the same artifact, with different partitioning conventions (Hive run_id= vs path runs/<id>/), different versioning (F2 prefixes carry v1, F1 prefixes carry none), and two different buckets named across the docs (amazon-sagemaker-... in F1, composer-datagen-... in F2). Additionally: divergence_pairs/ and dpo_pairs/ describe one lineage (divergence-annotated nodes → extracted pairs) split across two prefixes, inviting drift; there is no prefix for quarantined/retired tasks (the curriculum produces them) nor for eval results.
Recommendation: Write s3_contract.py FIRST and make both design docs subordinate to it. One layout: s3://<bucket>/<contract_version>/run_id=<id>/<artifact>/ where artifact ∈ {traces, tasks, tasks_golden, replay, grades, corpus_sft, corpus_dpo, wm_tuples, holdout, quarantine, manifest.json} — every artifact versioned by the single top-level contract_version, every prefix owned by exactly one writer stage, divergence_pairs folded into corpus_dpo as a provenance column rather than a sibling prefix. Until this file exists, no AWS stage should be built (they would each encode one of the two divergent layouts).
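The key-building half of such a file could be as small as the sketch below. The artifact names come from the layout proposed above; everything else is an assumption for illustration, in particular the stage names in OWNER (no design fixes which stage writes which prefix) and the choice of "v1" as the contract version.

```python
from enum import Enum

CONTRACT_VERSION = "v1"  # assumption: single top-level version for all artifacts


class Artifact(Enum):
    TRACES = "traces"
    TASKS = "tasks"
    TASKS_GOLDEN = "tasks_golden"
    REPLAY = "replay"
    GRADES = "grades"
    CORPUS_SFT = "corpus_sft"
    CORPUS_DPO = "corpus_dpo"
    WM_TUPLES = "wm_tuples"
    HOLDOUT = "holdout"
    QUARANTINE = "quarantine"
    MANIFEST = "manifest.json"


# Hypothetical single-writer assignment: exactly one owning stage per artifact.
OWNER = {
    Artifact.TRACES: "ingest",
    Artifact.TASKS: "validate",
    Artifact.TASKS_GOLDEN: "validate",
    Artifact.REPLAY: "replay",
    Artifact.GRADES: "validate",
    Artifact.CORPUS_SFT: "normalize",
    Artifact.CORPUS_DPO: "normalize",
    Artifact.WM_TUPLES: "tree",
    Artifact.HOLDOUT: "normalize",
    Artifact.QUARANTINE: "curriculum",
    Artifact.MANIFEST: "driver",
}


def artifact_key(bucket: str, run_id: str, artifact: Artifact, writer_stage: str) -> str:
    """Build the one canonical location for an artifact, refusing non-owners."""
    if OWNER[artifact] != writer_stage:
        raise PermissionError(
            f"{writer_stage!r} may not write {artifact.value!r}; owner is {OWNER[artifact]!r}"
        )
    base = f"s3://{bucket}/{CONTRACT_VERSION}/run_id={run_id}/{artifact.value}"
    # manifest.json is a single object; every other artifact is a prefix.
    return base if artifact is Artifact.MANIFEST else base + "/"
```

With a single function as the only way to obtain a location, the F1 and F2 layouts cannot both survive in code: a stage either goes through artifact_key() or fails review.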
8. golden_diff leaks into the policy-visible manifest: repr=False is not serialization-exclusion. [P1]
F2 (open questions) correctly demands golden_diff live in a deny-by-default tasks/golden/ prefix. But stage (c1) writes "FeatureDeletionTask rows" to tasks/v1/run_id=<id>/manifest.jsonl, and any naive serializer (dataclasses.asdict, json.dumps(vars(task))) includes golden_diff — field(repr=False) only affects __repr__. The Batch validator children legitimately need the gold diff (Gate 4), but rl_task_pool/ (read by the training env, whose prompt renderer carefully hides golden) would carry it in plaintext one json.loads away from any reward-hacking trajectory that reads its own task manifest. The safeguard exists in the prompt renderer and nowhere in the storage contract.
Recommendation: In s3_contract.py, define two explicit serializers — to_policy_row(task) (drops golden_diff AND deleted_symbols) and to_validator_row(task) (full) — and make the policy-row writer the only code path that can populate rl_task_pool/. Add a unit test asserting "golden_diff" not in json for policy rows. Enforce the prefix split with bucket policy, not convention.
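The leak and the proposed fix fit in a few lines. The dataclass below is a reduced, hypothetical stand-in for FeatureDeletionTask (only the fields named in this review); the point it demonstrates, that field(repr=False) hides a field from repr() but not from dataclasses.asdict(), is standard-library behavior.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FeatureDeletionTask:
    """Reduced stand-in; the real class has more fields and validation."""
    task_id: str
    broken_image: str
    fail_to_pass: list[str]
    golden_diff: str = field(repr=False, default="")
    deleted_symbols: list[str] = field(repr=False, default_factory=list)


# Fields that must never reach a policy-readable prefix.
POLICY_HIDDEN = ("golden_diff", "deleted_symbols")


def to_validator_row(task: FeatureDeletionTask) -> dict:
    """Full row for the validator children that legitimately need the gold diff."""
    return asdict(task)


def to_policy_row(task: FeatureDeletionTask) -> dict:
    """Row for rl_task_pool/: the full row minus every hidden field."""
    row = asdict(task)
    for name in POLICY_HIDDEN:
        row.pop(name, None)
    return row
```

A usage check along the lines of `assert "golden_diff" not in json.dumps(to_policy_row(task))` is the unit test the recommendation asks for; the same assertion against `asdict(task)` fails, which is the leak.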
9. The F2 architecture is a five-service orchestration for a pipeline that has never run once locally. [P1]
F2 commits Glue 5.0 + Bedrock batch + EMR Serverless + AWS Batch + Step Functions + Lambda + CDK (250 LOC IaC) before a single task has been validated end-to-end on a laptop (ADR-010's own post-review: the gates passed "against FakeSandbox materializers"; the Docker e2e is still []). Every stage's per-service rationale in F2 is individually sound, but the composition is premature: the corpus that matters first is O(10²–10³) tasks (SWE-smith's full 50k cost 20 human-hours and one machine), which is a single-node workload. The Step Functions DAG also has no idempotency/restart semantics, no run-level budget envelope (only teacher_replay's in-process max_total_usd), and a 24h Bedrock-batch SLA in the middle of what should be a same-day iteration loop during development.
Recommendation: Build the pipeline as stage functions with a local driver first (python -m composer_replication.pipeline.run --stage all --tasks 200), with S3 used only as a dumb artifact store via the s3_contract.py writers. Promote individual stages to managed services only when a measured bottleneck demands it, in this order: (1) AWS Batch for sandbox validation (the only genuinely parallel-heavy stage), (2) Bedrock batch when replay volume × cost crosses the 50%-discount break-even, (3) Step Functions only when >1 unattended run/week. Glue and EMR Serverless are likely never needed at this corpus scale — F2's own "live caveat" already concedes the ingester "is not intrinsically Spark-shaped."
10. No secrets/PII gate at trace ingest — raw Claude Code sessions go to S3 verbatim. [P1]
F2 stage (a) uploads ~/.claude/projects/**.jsonl to raw/claude_code/ and Parquet-izes them. Claude Code session files contain the user's local file contents, env-var echoes, API keys in tool outputs, internal hostnames, and proprietary code from whatever repos the user worked on — none of which passed any license, secrets, or PII filter. The copyleft filter (is_redistributable) applies only to SWE-substrate tasks, not to traces. The flywheel then trains on this and (per the publications/ directory) the corpus is intended to be shareable.
Recommendation: Insert a mandatory scrub stage between ingest and storage: secrets scanning (gitleaks/trufflehog rule pack over message contents), path anonymization, and a per-session allowlist (only sessions from designated repos enter the corpus). Record scrub stats in the run manifest. This is ~1 day of work and belongs in ClaudeCodeIngester itself so no unscrubbed TraceState can exist downstream.
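The shape of that stage (redact, anonymize paths, count what was removed, admit by allowlist) can be illustrated with plain regular expressions. This is a deliberately small sketch and not a substitute for a real rule pack: the three rules, the replacement tokens, and the function names are all assumptions made for the example, and a production scrub would delegate detection to gitleaks or trufflehog as recommended above.

```python
import re

# Illustrative rules only: (stat name, pattern, replacement).
RULES = [
    ("aws_access_key_id", re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY]"),
    (
        "env_secret",
        re.compile(r"\b([A-Za-z0-9_]*(?:SECRET|TOKEN|API_KEY|PASSWORD)[A-Za-z0-9_]*)=(\S+)"),
        r"\1=[REDACTED]",
    ),
    ("home_path", re.compile(r"/(?:home|Users)/[^/\s]+"), "/home/USER"),
]


def scrub(text: str) -> tuple[str, dict[str, int]]:
    """Apply every rule once, in order; return scrubbed text and hit counts."""
    stats = {}
    for name, pattern, replacement in RULES:
        text, hits = pattern.subn(replacement, text)
        stats[name] = hits
    return text, stats


def admit_session(session_repo: str, allowlist: set[str]) -> bool:
    """Per-session allowlist: only designated repos may enter the corpus."""
    return session_repo in allowlist
```

The per-rule hit counts are what would be recorded as scrub stats in the run manifest.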
11. No canonical trajectory IR — three trace shapes are about to become five. [P1]
Today: Claude Code JSONL → TraceState (messages + student_action as serialized block-list). Planned: Bedrock .jsonl.out rows, tree-controller branch trajectories, rollout-harness episodes (Finding 2), OpenHands/SWE-smith trajectories (ADR-002 v0.2). Each design names its own shape; _normalize_action's whitespace hack is the symptom of the missing abstraction. Without one normalized trajectory schema, every consumer (DPO extractor, SFT formatter, wm_tuple writer, replay submitter) needs format-specific code for every trace shape, so the "trace format normalization" cost grows as formats × consumers instead of formats + consumers.
Recommendation: Define a CanonicalTrajectory schema (list of Turn{role, content, tool_calls: [(name, canonical_args)], tool_results, error_kind}) in datagen/schema.py as the single internal currency; every ingester is a X -> CanonicalTrajectory adapter, and extract_dpo_pairs/SFT formatting/wm_tuples consume only it. This also gives the divergence gate (Finding 3) its action algebra for free.
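One possible rendering of that schema is sketched below. The Turn fields follow the list in the recommendation; the trajectory_id field, the action_sequence() helper, and the from_messages adapter (with its assumed input dict shape) are hypothetical additions for illustration, and canonical_args is represented simply as key-sorted JSON.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Turn:
    role: str
    content: str = ""
    tool_calls: tuple[tuple[str, str], ...] = ()  # (name, canonical_args_json)
    tool_results: tuple[str, ...] = ()
    error_kind: Optional[str] = None


@dataclass(frozen=True)
class CanonicalTrajectory:
    trajectory_id: str
    turns: tuple[Turn, ...]

    def action_sequence(self) -> tuple[tuple[str, str], ...]:
        """Flatten all tool calls in order; usable by dedup and the divergence gate."""
        return tuple(call for turn in self.turns for call in turn.tool_calls)


def from_messages(trajectory_id: str, messages: list[dict]) -> CanonicalTrajectory:
    """Example adapter from an assumed generic message-dict format."""
    turns = []
    for msg in messages:
        calls = tuple(
            (call["name"], json.dumps(call.get("args", {}), sort_keys=True))
            for call in msg.get("tool_calls", [])
        )
        turns.append(Turn(
            role=msg["role"],
            content=msg.get("content", ""),
            tool_calls=calls,
            tool_results=tuple(msg.get("tool_results", [])),
            error_kind=msg.get("error_kind"),
        ))
    return CanonicalTrajectory(trajectory_id, tuple(turns))
```

Frozen dataclasses with tuple fields keep trajectories hashable, which is convenient when they are used as dedup or join keys.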
12. The flywheel has no cross-generation dedup — it will feed itself duplicates. [P1]
The flywheel (F1: "improved student generates the next round's seed traces") loops the corpus into itself. Dedup today: data-juicer document_deduplicator is per-batch only; F2 adds Spark dropDuplicates within one run's normalize stage. Nothing dedups across run_ids, so generation N+1's SFT corpus will contain near-copies of generation N's winning trajectories (same tasks, similar solutions), compounding each cycle — a known self-training collapse accelerant. The held-out guard detects collapse after the fact; dedup prevents its cheapest cause.
Recommendation: Add MinHash/LSH near-dup dedup keyed on (task_id, canonical_action_sequence) across all prior run manifests at corpus-write time, plus a per-task cap on retained trajectories (e.g., ≤K winners per task across all generations). Store the dedup index alongside manifests/.
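A dependency-free sketch of the MinHash half of this recommendation follows. It makes several simplifying assumptions: actions are already canonical strings, signatures are compared pairwise within a task (no LSH banding, which only matters at scale), the index lives in memory and not alongside manifests/, and the threshold and cap values are placeholders. hashlib is used in place of the built-in hash() so signatures are stable across processes and runs.

```python
import hashlib


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash; the seed plays the role of a permutation."""
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def shingles(actions: list[str], k: int = 2) -> set[str]:
    """Overlapping k-grams of the action sequence."""
    if len(actions) < k:
        return {" | ".join(actions)}
    return {" | ".join(actions[i:i + k]) for i in range(len(actions) - k + 1)}


def minhash(actions: list[str], num_perm: int = 64) -> tuple[int, ...]:
    grams = shingles(actions)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_perm))


def est_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(rows, threshold: float = 0.8, cap_per_task: int = 2):
    """rows: iterable of (task_id, actions). Keeps at most cap_per_task per task."""
    kept_sigs: dict[str, list[tuple[int, ...]]] = {}
    out = []
    for task_id, actions in rows:
        sigs = kept_sigs.setdefault(task_id, [])
        if len(sigs) >= cap_per_task:
            continue  # per-task cap reached
        sig = minhash(actions)
        if any(est_jaccard(sig, s) >= threshold for s in sigs):
            continue  # near-duplicate of an already-kept trajectory
        sigs.append(sig)
        out.append((task_id, actions))
    return out
```

To dedup across generations, kept_sigs would be loaded from the prior runs' stored index before the loop and written back afterwards.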
13. License handling is field-deep, not repo-deep — and absent for the "point-at-a-repo" path. [P1]
is_redistributable() lowercases instance["license_name"] and substring-matches ("gpl","agpl","lgpl"). For arbitrary repos there is no license_name (grounding Break 5); for SWE-smith instances the license lives in the toolkit's repo profiles, not the instance dict (02 §1: 2 GPLv3 + 4 LGPL repos in SWE-smith would need mapping); the substring check also misclassifies (e.g., "GPL-2.0-with-classpath-exception" semantics, dual-licensed repos) and ignores the difference between using a repo for training and redistributing derivative diffs (02 notes SWE-smith only claims the former). Repo-ingest is exactly where this must run, and no design places it there.
Recommendation: At repo ingest, run SPDX detection (licensee/askalono) on the cloned tree, store the SPDX id + detection confidence on the task, and split policy into two explicit gates: trainable(license) (permissive + most copyleft OK for internal training) and redistributable(license) (permissive only) — applied at corpus-publish time, not generation time, so copyleft repos still contribute non-redistributed training signal. Keep the existing function as the redistribution gate, fix it to exact-SPDX matching.
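The two-gate split with exact SPDX matching might look like the sketch below. The specific license sets and the 0.9 confidence floor are illustrative policy choices, not decisions recorded in any design doc; anything outside both sets, including unrecognized or deprecated identifiers, fails both gates and would fall to manual review.

```python
# Illustrative policy sets keyed on exact SPDX identifiers (assumptions).
PERMISSIVE = frozenset({"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"})
COPYLEFT_TRAINABLE = frozenset({
    "GPL-2.0-only", "GPL-2.0-or-later", "GPL-3.0-only", "GPL-3.0-or-later",
    "LGPL-2.1-only", "LGPL-3.0-only", "MPL-2.0",
})


def _confident(confidence: float, min_confidence: float) -> bool:
    return confidence >= min_confidence


def trainable(spdx_id: str, confidence: float, min_confidence: float = 0.9) -> bool:
    """Gate applied at generation time: may this repo contribute training signal?"""
    return _confident(confidence, min_confidence) and (
        spdx_id in PERMISSIVE or spdx_id in COPYLEFT_TRAINABLE
    )


def redistributable(spdx_id: str, confidence: float, min_confidence: float = 0.9) -> bool:
    """Gate applied at corpus-publish time: may derived diffs be shipped?"""
    return _confident(confidence, min_confidence) and spdx_id in PERMISSIVE
```

Because membership is tested against whole identifiers, a string such as "GPL-2.0-with-classpath-exception" is neither silently blocked nor silently allowed by a substring hit; it simply matches nothing and has to be classified explicitly.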
14. wm_tuples/ ("ALL branches incl. failures") is an unbounded write path serving an ablation-gated consumer. [P1]
The world-model head — the sole consumer of wm_tuples/ — has, per deepread 06, zero direct evidence for its configuration ("no published paper has tested an auxiliary next-state-prediction objective during RL on a code policy") and is explicitly an ablation arm, not a premise. Yet the S3 contract gives it the highest-volume prefix in the system (every env.step() observation of every branch, including all failures — observations are full test logs/file contents), with no size estimate, no sampling policy, no retention/lifecycle rule in any design. The pipeline's storage architecture is load-bearing on its most speculative research bet.
Recommendation: Make wm_tuple emission opt-in per run (emit_wm_tuples: bool in the run config, default off until the P4 ablation is scheduled), store observations as content-addressed blobs with dedup (test logs repeat massively), and attach an S3 lifecycle rule (expire after N days unless pinned by an ablation manifest). Do not let the typed-routing elegance ("failed branch is gold for the world model") force eager collection before the consumer exists.
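The opt-in flag and content-addressed storage can be sketched together. BlobStore here is an in-memory stand-in for an object store, and emit_wm_tuple with its tuple layout is hypothetical; only the emit_wm_tuples flag name comes from the recommendation above.

```python
import hashlib
from typing import Optional


class BlobStore:
    """In-memory stand-in for a content-addressed object store."""

    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


def emit_wm_tuple(store: BlobStore, state_ref: str, action: str, observation: str,
                  *, emit_wm_tuples: bool = False) -> Optional[dict]:
    """Write nothing unless the run config opts in; otherwise store obs by hash."""
    if not emit_wm_tuples:
        return None
    return {
        "state": state_ref,
        "action": action,
        "obs_sha256": store.put(observation.encode("utf-8")),
    }
```

Since failing branches tend to produce the same test log many times over, the tuple rows stay small and the blob count grows with distinct observations, not with branches.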
15. The "create harder tasks dynamically" half of the curriculum is absent from every pipeline design — yet it is the blog's actual mechanism. [P1]
Deepread 01 confirms the verbatim Composer claim: "we both select for and create harder tasks dynamically throughout the run." The repo has SELECT-FOR (DifficultyCurriculum); the CREATE half (escalate deletion granularity function→file→feature, combine bugs, multi-feature targets minted during the run) appears in no design — granularity is hardcoded "feature", and F1/F2's outer loop has no stage that takes curriculum state as input to task synthesis. Meanwhile SWE-smith's Combine-Bugs strategy (96.9% yield, zero cost, 15 median F2P — 02 §1) is precisely a CREATE-half mechanism sitting in the buyable toolkit.
Recommendation: Add a task_escalation stage to the outer loop contract now (input: curriculum pass-rates per task; output: new combined/escalated task candidates into the validation queue), implemented first as SWE-smith Combine-Bugs over already-validated per-repo bugs. Even if deferred, the stage boundary must exist in the orchestration design, or the pipeline hard-codes a select-only curriculum and the eventual retrofit will break the Step Functions/driver topology.
P2 — Real gaps, lower blast radius
16. Where the pipeline should live: extend the monorepo with isolated extras; do NOT split a package. [P2]
F1/F2 propose composer_replication/pipeline/, composer_replication/datagen/aws/, and root infra/. This is broadly right; a separate datagen package/repo would be wrong now: the shared dataclasses (FeatureDeletionTask, TraceState, DPOPair, future CanonicalTrajectory) ARE the contract, and splitting them across repos forces schema version-pinning with zero external consumers. The risks to manage are dependency bleed (boto3/pyspark/docker must not enter the training image) and entrypoint importability.
Recommendation: (a) composer_replication/datagen/ stays the pure library (schema, env, validator, monitor, curriculum — no cloud deps); (b) new composer_replication/pipeline/ holds stage drivers + s3_contract.py, all cloud/Spark imports lazy and gated behind a [pipeline] extra; (c) infra/ at repo root for IaC only; (d) CI check that import composer_replication.datagen succeeds with no extras installed. Revisit a package split only when a second project actually consumes the datagen library.
17. No state/node ID contract for tree-generated states. [P2]
state_id = f"{path.stem}::{idx:04d}" is unique only within one session file; tree-controller branches, rollout-harness episodes, and Bedrock recordId joins all need globally unique, lineage-encoding IDs (parent pointer, branch index, run_id), and recordId==state_id is named "the universal join key" without ever specifying the tree extension.
Recommendation: Commit node_id = {run_id}/{trace_id}/{path-from-root as branch indices} in s3_contract.py; parent derivable by truncation, collision-free by construction.
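A minimal sketch of that ID scheme, with two assumptions not fixed by the recommendation: branch indices are joined with "." and the unbranched root is spelled "root". It also assumes trace_id contains no "/".

```python
from typing import Optional


def node_id(run_id: str, trace_id: str, path: tuple[int, ...] = ()) -> str:
    """Encode lineage as run/trace/branch-path; the empty path is the root."""
    branch = ".".join(str(i) for i in path) or "root"
    return f"{run_id}/{trace_id}/{branch}"


def parent_id(nid: str) -> Optional[str]:
    """Derive the parent by truncating the last branch index; root has none."""
    run_id, trace_id, branch = nid.rsplit("/", 2)
    if branch == "root":
        return None
    head, _, _ = branch.rpartition(".")
    return f"{run_id}/{trace_id}/{head or 'root'}"
```

For example, node_id("r1", "t7", (0, 2, 1)) yields "r1/t7/0.2.1", whose parent is "r1/t7/0.2"; no lookup table is needed to walk back to the seed trace.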
18. No dataset versioning, cards, or reproducibility pins. [P2]
manifests/run_id.json carries counts/cost/lineage but no: substrate dataset revision hashes (HF datasets mutate — SWE-smith grew from 50,137 to 59,136 rows post-publication, per 02), swesmith/toolchain versions, Docker image digests (tags like :latest in image_for() are mutable!), prompt-template hashes, or a generated dataset card (composition, license mix, decontamination statement, known limitations). Irreproducible corpora are unpublishable and undebuggable.
Recommendation: Pin image digests not tags in FeatureDeletionTask.broken_image; record substrate (dataset_id, revision) and generator git SHA in the manifest; auto-generate a dataset card per run from the manifest (the HF datasets card template is fine). ~100 LOC total.
19. DiLoCo rendezvous traffic co-located in the dataset bucket. [P2]
Both F1 and F2 put diloco/rendezvous/ (hot, high-frequency, delete-heavy training sync) in the same bucket as the immutable corpus. This entangles IAM (training nodes get write access to a bucket containing tasks/golden/), lifecycle policies, and cost attribution.
Recommendation: Separate bucket (or at minimum a separate top-level prefix with its own bucket policy statement and aggressive expiry), keeping the dataset bucket append-only for everything except quarantine/.
20. No corpus-quality acceptance metric — the pipeline has no definition of "done/good." [P2]
Every stage has pass/fail gates for tasks, but nothing measures whether the resulting corpus is any good before it consumes GPU budget. SWE-Gym/SWE-smith both validated with small SFT probes (491 and 5,016 trajectories respectively, measurable lift on held-out).
Recommendation: Define the corpus acceptance test as part of the pipeline: SFT a small model (e.g., Qwen3-Coder-7B, LoRA) on each new corpus generation and require a measurable delta on the internal holdout (and decontaminated SWE-bench subset) before the corpus is promoted to status=accepted in its manifest. This is the dataset analogue of the trainer's HeldOutGuard.
21. Orchestration restart semantics and budget envelopes are unspecified. [P2]
F2's Step Functions design has .sync integrations but no statement of stage idempotency (re-running stage (b) after partial Bedrock job failure double-writes replay/?), no run-level cost ceiling (the only budget control in the system is replay_trace's in-process max_total_usd=5.0), and no poison-task quarantine path at the orchestration level (a task that wedges Batch children retries forever).
Recommendation: Make every stage write-once per (run_id, stage, attempt) with a completion marker object; add a budget_usd field to the run manifest enforced by the driver before each paid stage; route repeatedly-failing array indices to quarantine/ after retryStrategy exhaustion.
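The write-once and budget rules can be prototyped against the local filesystem before any orchestration service is involved. In this sketch the directory layout, the _SUCCESS marker name, the budget dict shape, and the up-front cost estimate are all assumptions; a local directory stands in for the bucket.

```python
import json
from pathlib import Path
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised by the driver before a paid stage would overrun the run budget."""


def run_stage(root: Path, run_id: str, stage: str, attempt: int,
              est_cost_usd: float, budget: dict,
              fn: Callable[[Path], dict]) -> dict:
    """Run fn at most once per (run_id, stage, attempt), within the run budget."""
    out_dir = root / f"run_id={run_id}" / stage / f"attempt={attempt}"
    marker = out_dir / "_SUCCESS"
    if marker.exists():
        # Already completed: return the recorded result, write nothing.
        return json.loads(marker.read_text())
    if budget["spent_usd"] + est_cost_usd > budget["budget_usd"]:
        raise BudgetExceeded(f"{stage}: would exceed {budget['budget_usd']} USD")
    out_dir.mkdir(parents=True, exist_ok=True)
    result = fn(out_dir)
    budget["spent_usd"] += est_cost_usd
    # The marker is written last, so a crash mid-stage leaves no marker behind.
    marker.write_text(json.dumps(result))
    return result
```

A restart after partial failure then re-enters run_stage with a new attempt number, leaving the failed attempt's partial output in place for inspection instead of double-writing over it.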
The MINIMAL pipeline (one pointed-at repo → real SFT corpus), and what the full vision adds in what order
Minimal (Stage 0 — local, no new AWS services, ~1–2 weeks):
- pip install swesmith; build the repo profile (env image, test parser): buy, ~7 min human + automation [Finding 6].
- Synthesize candidate tasks: PR Mirror first (best data per SWE-smith Table 5 = the repo's own gold-patch-reversion mechanic), procedural/Combine as volume fallback: buy.
- SPDX license gate at clone + benchmark decontamination check (repo ∉ eval suite): build, ~80 LOC [Findings 5, 13].
- 4-gate validate_task() in DockerSandbox (exists) + scrub_tree: have; wire the swesmith container into Sandbox.boot: ~50 LOC.
- Expert trajectory collection: mini-swe-agent/SWE-agent + a frontier model over validated tasks, $cap per task: adopt + adapt, ~200–400 LOC [Finding 2]. This is the critical missing component.
- Admission filter: _grade()==1.0 + HackMonitor clean + pass_to_pass guard: have.
- Format to messages-schema SFT rows via CanonicalTrajectory; MinHash dedup; HeldoutSplit with check_content=True; write Parquet + manifest + dataset card via s3_contract.py: build, ~250 LOC [Findings 7, 11, 12, 18].
Total new code ≈ 600–900 LOC plus two adopted dependencies. Output: a real, decontaminated, license-clean, deduped, carded SFT corpus from one repo — and, as a free byproduct, the env-grounded traces the tree needs (Finding 1).
Then, strictly in this order:
- Stage 1: AWS Batch array validation + Bedrock batch replay when volume justifies (Finding 9); DPO channel on env-grounded traces after the tool-call parser fixes _normalize_action (Finding 3).
- Stage 2: Depth-1 tree (N candidates, one env-step each, oracle-graded), which requires the Sandbox.fork() spike (Finding 4); divergence-gated depth>1 only after the gate's firing rate is measured.
- Stage 3: Curriculum CREATE-half via Combine-Bugs (Finding 15); flywheel with cross-generation dedup (Finding 12); wm_tuples emission only when the P4 ablation is scheduled (Finding 14).
- Stage 4: Step Functions/Argo orchestration once runs are routine (Finding 21).
Severity tally
| Severity | Count | Findings |
|---|---|---|
| P0 | 6 | 1 (trace/oracle disjointness), 2 (no rollout harness), 3 (divergence gate uncomputable), 4 (no sandbox fork), 5 (no decontamination), 6 (buy-vs-build inversion) |
| P1 | 9 | 7 (two S3 contracts), 8 (golden_diff serialization leak), 9 (premature 5-service orchestration), 10 (no secrets/PII gate), 11 (no trajectory IR), 12 (no cross-generation dedup), 13 (license gate too shallow), 14 (wm_tuples unbounded/speculative), 15 (CREATE-half absent) |
| P2 | 6 | 16 (package placement), 17 (node ID contract), 18 (versioning/cards), 19 (bucket co-location), 20 (corpus acceptance metric), 21 (restart/budget semantics) |