Baladithya Balamurugan
Wave 21: deep-read critical review — 8 source clusters re-read, findings verified
2a16b30
|
Raw
History Blame Contribute Delete
30.1 kB
# Design Critic: Architecture of the Envisioned Dataset-Generation Pipeline
**Reviewer:** DESIGN CRITIC (critical-review pipeline, subagent)
**Date:** 2026-06-10
**Inputs:** `research/deepread/00-grounding.md` + `01``08`, `research/design-F1-systems-framing.md`, `research/design-F2-aws-datagen.md`, `docs/adrs/ADR-010`, and direct reads of `composer_replication/datagen/{env,substrates,validator,sandbox}.py`, `teacher_replay.py`, `safety/holdout.py`, `hint_generator.py`.
**Scope:** Attack the ARCHITECTURE of the envisioned dataset pipeline — buy-vs-build, the S3 contract, tree-controller implementability, package placement, the minimal end-to-end pipeline, and components no design mentions.
---
## P0 — Foundational architecture defects (the design as drawn cannot produce the promised artifact)
### 1. The tree-of-work's seed traces and its execution oracle operate on DISJOINT data — a chicken-and-egg no design resolves. **[P0]**
F1's diagram and the final report (§5: *"ingest a seed trace, expand the divergence-gated tree across N models, execute every branch in a sandbox, grade leaves by `_grade()`"*) compose two components that cannot currently meet:
- Seed traces come from `ClaudeCodeIngester` over `~/.claude/projects/**.jsonl` — sessions against the user's **local working directories** at arbitrary commits, with no Docker image, no pinned dependencies, no `FAIL_TO_PASS` labels, no `FeatureDeletionTask`.
- The execution oracle (`FeatureDeletionEnv.step()`/`_grade()`) requires `task.broken_image` (a frozen container) and pre-identified test node IDs (`__post_init__` raises on empty `fail_to_pass`).
You cannot `env.step()` a Claude Code trace action: there is no environment that materializes the repo state the trace was recorded against. The grounding map's Breaks 1–6 document this for "arbitrary OSS repo" but nobody applies it to the trace-replay-tree, where it is fatal: **every branch of the tree needs an executable sandbox, and Claude Code traces have none.** The only traces that are tree-expandable are traces collected *inside* FeatureDeletionEnv episodes — which do not exist because nothing in the repo runs an agent inside the env (see Finding 2).
**Recommendation:** Invert the bootstrap order. Phase 1 of the tree must seed from **env-grounded traces** (agent rollouts on FeatureDeletionTask episodes, where reset state is reproducible by `task_id`), not Claude Code sessions. Demote Claude Code traces to (a) flat Channel-3 replay (DPO text pairs, no execution) and (b) SFT style data — uses that need no oracle. Document this split in a new ADR; the current F1 diagram is misleading and will misdirect the build.
### 2. No agent rollout harness exists — the SFT corpus has no producer. **[P0]**
The "SFT-first competence floor" (F1 §2, final report §5) reads `sft_corpus/` = "clean winning trajectories." Trace the producers: `teacher_replay` emits **single next-actions** per frozen state, not episodes. `env.reward_fn`'s fallback treats the whole completion as one `submit` (08 §5.1 calls this "a dead end for genuine multi-turn"). The tree controller is 0% built. **Nothing in the repo, built or designed, drives a multi-turn agent loop (LLM → tool call → `sandbox.exec` → observation → LLM) to completion and serializes the trajectory.** SWE-smith collected its 5k SFT trajectories with SWE-agent + Claude; SWE-Gym with OpenHands. Every design doc skips this component; F2's four stages (ingest/replay/validate/normalize) produce tasks and DPO pairs but *no SFT trajectories at all* — yet F2's stage (d) claims to write `corpus/sft/`.
**Recommendation:** Add an explicit `rollout harness` component to the pipeline: adopt **mini-swe-agent or SWE-agent** (battle-tested, supports any API model) as the expert-trajectory collector against `FeatureDeletionEnv` tasks, with `_grade()==1.0` + `HackMonitor`-clean as the SFT admission filter. ~200–400 LOC of adapter, not a new agent. This is the single highest-priority build item — without it the minimal pipeline (Finding 16) cannot terminate in an SFT corpus.
### 3. Divergence-gating is not computable from what the current components emit. **[P0]**
The tree's economics depend entirely on the divergence gate (final report §3: it turns O(N^D) into "O(N · decision-points)"). The gate needs "pre-expansion divergence between sibling next-action distributions." But:
- Teacher APIs (OpenRouter, Bedrock batch) return **text**, not distributions. Bedrock batch (`CreateModelInvocationJob`) does not return logprobs usable for a cross-model divergence measure (and cross-model token distributions live in different vocabularies anyway — KL between them is undefined without a common action space).
- The only equality/divergence measure in the codebase is `_normalize_action()` — whitespace-collapse + lowercase. Deepread 07 (FR-R8, HIGH) already established this produces "mostly noise on real traces": semantically identical tool calls in different JSON formatting count as disagreement, so the gate would fire on nearly every node, **silently degrading the tree to the O(N^D) ungated cost the gate exists to prevent.** The cost-control mechanism and the known-broken normalizer are the same component.
**Recommendation:** Define divergence over a **canonical action algebra**, not text: parse every candidate into `(tool_name, normalized_args)` (AST-normalize code args, path-normalize file args), and gate on (i) tool-name disagreement, (ii) arg-level edit distance over the canonical form, with an escalation tier that asks a cheap judge model only when (i)/(ii) is ambiguous. Build and *unit-test the gate's firing rate on 5 real traces* (expected: fires on <20% of steps) **before** writing `tree_controller.py`. The tool-call parser is the prerequisite, not a polish item.
### 4. The tree requires a sandbox fork/snapshot primitive that neither the Sandbox API nor any design has. **[P0]**
`tree_controller.py` (design: "apply each candidate action through `FeatureDeletionEnv.step()` to get a real next observation, branch again from the new state"). Branching N ways from a mid-episode state requires N **independent copies of the mutated working tree**. The `Sandbox` protocol has `boot/exec/run_tests/trajectory` — no `fork()`, no `snapshot()/restore()`. `DockerSandbox` boots one container per episode with `read_only=True` rootfs. The two realizable options are architecturally very different and nobody has chosen:
1. **Replay-from-root:** re-boot the image and re-execute the action prefix for every branch — cost O(depth) sandbox-execs per node, multiplying the already-dominant sandbox cost (final report §10 names per-branch sandbox isolation "the throughput ceiling of the whole idea").
2. **Filesystem snapshot fork:** overlayfs/`docker commit`/CRIU per node — fast but pins the design to one isolation backend and conflicts with gVisor/Kata choices in F2 stage (c).
**Recommendation:** Extend the `Sandbox` protocol with `fork() -> Sandbox` *now*, implement it as overlayfs-upper-dir copy for `LocalSubprocessSandbox` and `docker commit`+boot for `DockerSandbox`, and measure fork latency in a spike before committing to tree depth >1. If fork costs >5s, the honest fallback is depth-1 trees (N candidate actions, each executed one step + graded by a cheap proxy) — which is most of the DPO value at a fraction of the machinery.
### 5. Zero benchmark decontamination anywhere in the pipeline. **[P0]**
The pipeline trains on SWE-bench-family substrates and the program will inevitably report SWE-bench Verified numbers (every comparison point in the research notes — DeepSWE 42.2%, SWE-smith 40.2%, Socratic-SWE 50.4% — is SWE-bench Verified). The substrates have **different and partial** decontamination policies: R2E-Gym decontaminates only its *Subset* against SWE-bench test repos (deepread 02 §6.2.2: "The full R2E-Gym (8,135 tasks) may overlap"); SWE-rebench spans 3,468 repos with no stated guarantee; the deepread 02 review explicitly flags research/06's "no contamination worry" as an OVERCLAIM. `HeldoutSplit` guards only the *internal* train/holdout partition — it has no concept of an external benchmark. No design doc (F1, F2, ADR-010) contains the word "decontamination."
**Recommendation:** Add a mandatory `decontaminate(tasks, benchmark)` gate at stage (c1), enforced at three levels: (i) repo-level — drop any task whose `repo` appears in SWE-bench Lite/Verified/full or the chosen eval suite; (ii) instance-level — drop tasks whose `golden_diff` or `fail_to_pass` node IDs match an eval instance; (iii) record the decontamination manifest (benchmark name + version hash + drop count) in `manifests/run_id.json`. ~50 LOC + a pinned eval-instance index. This must exist before the *first* corpus is generated, because retro-filtering a published corpus is a credibility event.
### 6. Buy-vs-build: the designs build the one component that is buyable, and skip the integration deepread 02 already identified. **[P0]**
ADR-010 chose "Option A: invert OSS substrates" and rejected "Option B: greenfield repo scraping" — correct in 2026-05. But the user's new ask ("point at a repo") **is Option B**, and the design response (grounding map: "Broken-repo image builder — clone, `git apply -R`, scrub, build, push — unspecified, 0% built") is to hand-build exactly what `pip install swesmith` (MIT) ships: environment construction from arbitrary GitHub repos (~7 min human/repo, one image per repo = 500× storage win over per-task images), four bug-synthesis strategies, validation, issue-text generation, and `rp.get_container(task)` returning a booted container. Deepread 02 flagged this at HIGH ("the ADR does not evaluate it as a dependency") and noted SWE-smith's **PR Mirror** strategy — the exact gold-patch-reversion mechanic of `SweBenchAdapter` — produces the *best* training data in SWE-smith's own ablation (Table 5). The repo's core approach is validated by prior art it doesn't cite, and its missing half is shipped by a library it doesn't depend on.
**Recommendation:** Commit a revised ADR: **[BUY]** swesmith for env-construction + bug synthesis on new repos; **[BUY]** SWE-smith 59k / R2E-Gym-Subset 4.6k / SWE-Gym 2.4k datasets through the existing `SweBenchAdapter` (02 confirms it works on SWE-smith instances unchanged); **[BUILD]** only the genuinely novel pieces — `DifficultyCurriculum`, `HackMonitor`/scrub, decontamination, the rollout harness (Finding 2), and (later, ablation-gated) coverage-guided deletion targeting. The "point-at-a-repo" feature becomes a ~100 LOC `SweSmithProfileAdapter` instead of an unspecified image-builder subsystem. Budget reality check from 02: SWE-smith built 50k tasks for **$1,360 + 20 human-hours**; any in-house builder must beat that.
---
## P1 — Significant design errors (buildable, but will mislead or bite)
### 7. There are TWO unreconciled S3 contracts; the "6-prefix contract" is neither complete nor singular. **[P1]**
F1 commits six prefixes (`sft_corpus/ dpo_pairs/ rl_task_pool/ divergence_pairs/ wm_tuples/ holdout/` under `runs/<run_id>/`). F2 commits a different layout (`traces/v1/run_id=<id>/`, `tasks/v1/...`, `replay/v1/...`, `task_grades/v1/...`, `corpus/v1/run_id=<id>/{sft,dpo}/`, `manifests/`). The grounding map (§2 step 8) pastes **both** into one list — so `corpus/v1/.../dpo/` and `dpo_pairs/` both exist for the same artifact, with different partitioning conventions (Hive `run_id=` vs path `runs/<id>/`), different versioning (F2 prefixes carry `v1`, F1 prefixes carry none), and two different buckets named across the docs (`amazon-sagemaker-...` in F1, `composer-datagen-...` in F2). Additionally: `divergence_pairs/` and `dpo_pairs/` describe one lineage (divergence-annotated nodes → extracted pairs) split across two prefixes, inviting drift; there is no prefix for quarantined/retired tasks (the curriculum produces them) nor for eval results.
**Recommendation:** Write `s3_contract.py` FIRST and make both design docs subordinate to it. One layout: `s3://<bucket>/<contract_version>/run_id=<id>/<artifact>/` where artifact ∈ {traces, tasks, tasks_golden, replay, grades, corpus_sft, corpus_dpo, wm_tuples, holdout, quarantine, manifest.json} — every artifact versioned by the single top-level `contract_version`, every prefix owned by exactly one writer stage, `divergence_pairs` folded into `corpus_dpo` as a `provenance` column rather than a sibling prefix. Until this file exists, no AWS stage should be built (they would each encode one of the two divergent layouts).
### 8. `golden_diff` leaks into the policy-visible manifest: `repr=False` is not serialization-exclusion. **[P1]**
F2 (open questions) correctly demands `golden_diff` live in a deny-by-default `tasks/golden/` prefix. But stage (c1) writes "FeatureDeletionTask rows" to `tasks/v1/run_id=<id>/manifest.jsonl`, and any naive serializer (`dataclasses.asdict`, `json.dumps(vars(task))`) **includes `golden_diff`** — `field(repr=False)` only affects `__repr__`. The Batch validator children legitimately need the gold diff (Gate 4), but `rl_task_pool/` (read by the training env, whose prompt renderer carefully hides golden) would carry it in plaintext one `json.loads` away from any reward-hacking trajectory that reads its own task manifest. The safeguard exists in the prompt renderer and nowhere in the storage contract.
**Recommendation:** In `s3_contract.py`, define two explicit serializers — `to_policy_row(task)` (drops `golden_diff` AND `deleted_symbols`) and `to_validator_row(task)` (full) — and make the policy-row writer the *only* code path that can populate `rl_task_pool/`. Add a unit test asserting `"golden_diff" not in json` for policy rows. Enforce the prefix split with bucket policy, not convention.
### 9. The F2 architecture is a five-service orchestration for a pipeline that has never run once locally. **[P1]**
F2 commits Glue 5.0 + Bedrock batch + EMR Serverless + AWS Batch + Step Functions + Lambda + CDK (~250 LOC IaC) before a single task has been validated end-to-end on a laptop (ADR-010's own post-review: the gates passed "against FakeSandbox materializers"; the Docker e2e is still `[~]`). Every stage's per-service rationale in F2 is individually sound, but the composition is premature: the corpus that matters first is O(10²–10³) tasks (SWE-smith's full 50k cost 20 human-hours and one machine), which is a **single-node workload**. The Step Functions DAG also has no idempotency/restart semantics, no run-level budget envelope (only `teacher_replay`'s in-process `max_total_usd`), and a 24h Bedrock-batch SLA in the middle of what should be a same-day iteration loop during development.
**Recommendation:** Build the pipeline as **stage functions with a local driver first** (`python -m composer_replication.pipeline.run --stage all --tasks 200`), with S3 used only as a dumb artifact store via the `s3_contract.py` writers. Promote individual stages to managed services only when a measured bottleneck demands it, in this order: (1) AWS Batch for sandbox validation (the only genuinely parallel-heavy stage), (2) Bedrock batch when replay volume × cost crosses the 50%-discount break-even, (3) Step Functions only when >1 unattended run/week. Glue and EMR Serverless are likely never needed at this corpus scale — F2's own "live caveat" already concedes the ingester "is not intrinsically Spark-shaped."
### 10. No secrets/PII gate at trace ingest — raw Claude Code sessions go to S3 verbatim. **[P1]**
F2 stage (a) uploads `~/.claude/projects/**.jsonl` to `raw/claude_code/` and Parquet-izes them. Claude Code session files contain the user's local file contents, env-var echoes, API keys in tool outputs, internal hostnames, and proprietary code from *whatever repos the user worked on* — none of which passed any license, secrets, or PII filter. The copyleft filter (`is_redistributable`) applies only to SWE-substrate tasks, not to traces. The flywheel then trains on this and (per the publications/ directory) the corpus is intended to be shareable.
**Recommendation:** Insert a mandatory scrub stage between ingest and storage: secrets scanning (gitleaks/trufflehog rule pack over message contents), path anonymization, and a per-session allowlist (only sessions from designated repos enter the corpus). Record scrub stats in the run manifest. This is ~1 day of work and belongs in `ClaudeCodeIngester` itself so no unscrubbbed `TraceState` can exist downstream.
### 11. No canonical trajectory IR — three trace shapes are about to become five. **[P1]**
Today: Claude Code JSONL → `TraceState` (messages + `student_action` as serialized block-list). Planned: Bedrock `.jsonl.out` rows, tree-controller branch trajectories, rollout-harness episodes (Finding 2), OpenHands/SWE-smith trajectories (ADR-002 v0.2). Each design names its own shape; `_normalize_action`'s whitespace hack is the symptom of the missing abstraction. Without one normalized trajectory schema, every pairwise consumer (DPO extractor, SFT formatter, wm_tuple writer, replay submitter) needs format-specific code, and the "trace format normalization" cost grows quadratically.
**Recommendation:** Define a `CanonicalTrajectory` schema (list of `Turn{role, content, tool_calls: [(name, canonical_args)], tool_results, error_kind}`) in `datagen/schema.py` as the single internal currency; every ingester is a `X -> CanonicalTrajectory` adapter, and `extract_dpo_pairs`/SFT formatting/wm_tuples consume only it. This also gives the divergence gate (Finding 3) its action algebra for free.
### 12. The flywheel has no cross-generation dedup — it will feed itself duplicates. **[P1]**
The flywheel (F1: "improved student generates the next round's seed traces") loops the corpus into itself. Dedup today: data-juicer `document_deduplicator` is per-batch only; F2 adds Spark `dropDuplicates` *within one run's* normalize stage. Nothing dedups **across `run_id`s**, so generation N+1's SFT corpus will contain near-copies of generation N's winning trajectories (same tasks, similar solutions), compounding each cycle — a known self-training collapse accelerant. The held-out guard detects collapse after the fact; dedup prevents its cheapest cause.
**Recommendation:** Add MinHash/LSH near-dup dedup keyed on `(task_id, canonical_action_sequence)` across all prior run manifests at corpus-write time, plus a per-task cap on retained trajectories (e.g., ≤K winners per task across all generations). Store the dedup index alongside `manifests/`.
### 13. License handling is field-deep, not repo-deep — and absent for the "point-at-a-repo" path. **[P1]**
`is_redistributable()` lowercases `instance["license_name"]` and substring-matches `("gpl","agpl","lgpl")`. For arbitrary repos there is no `license_name` (grounding Break 5); for SWE-smith instances the license lives in the toolkit's repo profiles, not the instance dict (02 §1: 2 GPLv3 + 4 LGPL repos in SWE-smith would need mapping); the substring check also misclassifies (e.g., "GPL-2.0-with-classpath-exception" semantics, dual-licensed repos) and ignores the difference between *using* a repo for training and *redistributing* derivative diffs (02 notes SWE-smith only claims the former). Repo-ingest is exactly where this must run, and no design places it there.
**Recommendation:** At repo ingest, run SPDX detection (`licensee`/`askalono`) on the cloned tree, store the SPDX id + detection confidence on the task, and split policy into two explicit gates: `trainable(license)` (permissive + most copyleft OK for internal training) and `redistributable(license)` (permissive only) — applied at corpus-*publish* time, not generation time, so copyleft repos still contribute non-redistributed training signal. Keep the existing function as the redistribution gate, fix it to exact-SPDX matching.
### 14. `wm_tuples/` ("ALL branches incl. failures") is an unbounded write path serving an ablation-gated consumer. **[P1]**
The world-model head — the sole consumer of `wm_tuples/` — has, per deepread 06, **zero direct evidence** for its configuration ("no published paper has tested an auxiliary next-state-prediction objective during RL on a code policy") and is explicitly an ablation arm, not a premise. Yet the S3 contract gives it the highest-volume prefix in the system (every `env.step()` observation of every branch, including all failures — observations are full test logs/file contents), with no size estimate, no sampling policy, no retention/lifecycle rule in any design. The pipeline's storage architecture is load-bearing on its most speculative research bet.
**Recommendation:** Make wm_tuple emission opt-in per run (`emit_wm_tuples: bool` in the run config, default off until the P4 ablation is scheduled), store observations as content-addressed blobs with dedup (test logs repeat massively), and attach an S3 lifecycle rule (expire after N days unless pinned by an ablation manifest). Do not let the typed-routing elegance ("failed branch is gold for the world model") force eager collection before the consumer exists.
### 15. The "create harder tasks dynamically" half of the curriculum is absent from every pipeline design — yet it is the blog's actual mechanism. **[P1]**
Deepread 01 confirms the verbatim Composer claim: "we both select for **and create** harder tasks dynamically throughout the run." The repo has SELECT-FOR (`DifficultyCurriculum`); the CREATE half (escalate deletion granularity function→file→feature, combine bugs, multi-feature targets minted *during* the run) appears in no design — `granularity` is hardcoded `"feature"`, and F1/F2's outer loop has no stage that takes curriculum state as *input* to task synthesis. Meanwhile SWE-smith's **Combine-Bugs** strategy (96.9% yield, zero cost, 15 median F2P — 02 §1) is precisely a CREATE-half mechanism sitting in the buyable toolkit.
**Recommendation:** Add a `task_escalation` stage to the outer loop contract now (input: curriculum pass-rates per task; output: new combined/escalated task candidates into the validation queue), implemented first as SWE-smith Combine-Bugs over already-validated per-repo bugs. Even if deferred, the *stage boundary* must exist in the orchestration design, or the pipeline hard-codes a select-only curriculum and the eventual retrofit will break the Step Functions/driver topology.
---
## P2 — Real gaps, lower blast radius
### 16. Where the pipeline should live: extend the monorepo with isolated extras; do NOT split a package. **[P2]**
F1/F2 propose `composer_replication/pipeline/`, `composer_replication/datagen/aws/`, and root `infra/`. This is broadly right; a separate datagen package/repo would be wrong now: the shared dataclasses (`FeatureDeletionTask`, `TraceState`, `DPOPair`, future `CanonicalTrajectory`) ARE the contract, and splitting them across repos forces schema version-pinning with zero external consumers. The risks to manage are dependency bleed (boto3/pyspark/docker must not enter the training image) and entrypoint importability.
**Recommendation:** (a) `composer_replication/datagen/` stays the pure library (schema, env, validator, monitor, curriculum — no cloud deps); (b) new `composer_replication/pipeline/` holds stage drivers + `s3_contract.py`, all cloud/Spark imports lazy and gated behind a `[pipeline]` extra; (c) `infra/` at repo root for IaC only; (d) CI check that `import composer_replication.datagen` succeeds with no extras installed. Revisit a package split only when a second project actually consumes the datagen library.
### 17. No state/node ID contract for tree-generated states. **[P2]**
`state_id = f"{path.stem}::{idx:04d}"` is unique only within one session file; tree-controller branches, rollout-harness episodes, and Bedrock `recordId` joins all need globally unique, lineage-encoding IDs (parent pointer, branch index, run_id), and `recordId==state_id` is named "the universal join key" without ever specifying the tree extension.
**Recommendation:** Commit `node_id = {run_id}/{trace_id}/{path-from-root as branch indices}` in `s3_contract.py`; parent derivable by truncation, collision-free by construction.
### 18. No dataset versioning, cards, or reproducibility pins. **[P2]**
`manifests/run_id.json` carries counts/cost/lineage but no: substrate dataset revision hashes (HF datasets mutate — SWE-smith grew from 50,137 to 59,136 rows post-publication, per 02), swesmith/toolchain versions, Docker image digests (tags like `:latest` in `image_for()` are mutable!), prompt-template hashes, or a generated dataset card (composition, license mix, decontamination statement, known limitations). Irreproducible corpora are unpublishable and undebuggable.
**Recommendation:** Pin image **digests** not tags in `FeatureDeletionTask.broken_image`; record substrate `(dataset_id, revision)` and generator git SHA in the manifest; auto-generate a dataset card per run from the manifest (the HF `datasets` card template is fine). ~100 LOC total.
### 19. DiLoCo rendezvous traffic co-located in the dataset bucket. **[P2]**
Both F1 and F2 put `diloco/rendezvous/` (hot, high-frequency, delete-heavy training sync) in the same bucket as the immutable corpus. This entangles IAM (training nodes get write access to a bucket containing `tasks/golden/`), lifecycle policies, and cost attribution.
**Recommendation:** Separate bucket (or at minimum a separate top-level prefix with its own bucket policy statement and aggressive expiry), keeping the dataset bucket append-only for everything except `quarantine/`.
### 20. No corpus-quality acceptance metric — the pipeline has no definition of "done/good." **[P2]**
Every stage has pass/fail gates for *tasks*, but nothing measures whether the resulting *corpus* is any good before it consumes GPU budget. SWE-Gym/SWE-smith both validated with small SFT probes (491 and 5,016 trajectories respectively, measurable lift on held-out).
**Recommendation:** Define the corpus acceptance test as part of the pipeline: SFT a small model (e.g., Qwen3-Coder-7B, LoRA) on each new corpus generation and require a measurable delta on the internal holdout (and decontaminated SWE-bench subset) before the corpus is promoted to `status=accepted` in its manifest. This is the dataset analogue of the trainer's `HeldOutGuard`.
### 21. Orchestration restart semantics and budget envelopes are unspecified. **[P2]**
F2's Step Functions design has `.sync` integrations but no statement of stage idempotency (re-running stage (b) after partial Bedrock job failure double-writes `replay/`?), no run-level cost ceiling (the only budget control in the system is `replay_trace`'s in-process `max_total_usd=5.0`), and no poison-task quarantine path at the orchestration level (a task that wedges Batch children retries forever).
**Recommendation:** Make every stage write-once per `(run_id, stage, attempt)` with a completion marker object; add a `budget_usd` field to the run manifest enforced by the driver before each paid stage; route repeatedly-failing array indices to `quarantine/` after `retryStrategy` exhaustion.
---
## The MINIMAL pipeline (one pointed-at repo → real SFT corpus), and what the full vision adds in what order
**Minimal (Stage 0 — local, no new AWS services, ~1–2 weeks):**
1. `pip install swesmith`; build the repo profile (env image, test parser) — *buy*, ~7 min human + automation [Finding 6].
2. Synthesize candidate tasks: PR Mirror first (best data per SWE-smith Table 5 = the repo's own gold-patch-reversion mechanic), procedural/Combine as volume fallback — *buy*.
3. SPDX license gate at clone + benchmark decontamination check (repo ∉ eval suite) — *build*, ~80 LOC [Findings 5, 13].
4. 4-gate `validate_task()` in `DockerSandbox` (exists) + `scrub_tree`*have*; wire the swesmith container into `Sandbox.boot` — ~50 LOC.
5. Expert trajectory collection: mini-swe-agent/SWE-agent + a frontier model over validated tasks, `$cap` per task — *adopt + adapt*, ~200–400 LOC [Finding 2]. **This is the critical missing component.**
6. Admission filter: `_grade()==1.0` + `HackMonitor` clean + `pass_to_pass` guard — *have*.
7. Format to messages-schema SFT rows via `CanonicalTrajectory`; MinHash dedup; `HeldoutSplit` with `check_content=True`; write Parquet + manifest + dataset card via `s3_contract.py`*build*, ~250 LOC [Findings 7, 11, 12, 18].
Total new code ≈ 600–900 LOC plus two adopted dependencies. Output: a real, decontaminated, license-clean, deduped, carded SFT corpus from one repo — and, as a free byproduct, the env-grounded traces the tree needs (Finding 1).
**Then, strictly in this order:**
- **Stage 1:** AWS Batch array validation + Bedrock batch replay when volume justifies (Finding 9); DPO channel on env-grounded traces after the tool-call parser fixes `_normalize_action` (Finding 3).
- **Stage 2:** Depth-1 tree (N candidates, one env-step each, oracle-graded) — requires `Sandbox.fork()` spike (Finding 4); divergence-gated depth>1 only after the gate's firing rate is measured.
- **Stage 3:** Curriculum CREATE-half via Combine-Bugs (Finding 15); flywheel with cross-generation dedup (Finding 12); `wm_tuples` emission only when the P4 ablation is scheduled (Finding 14).
- **Stage 4:** Step Functions/Argo orchestration once runs are routine (Finding 21).
---
## Severity tally
| Severity | Count | Findings |
|---|---|---|
| **P0** | 6 | 1 (trace/oracle disjointness), 2 (no rollout harness), 3 (divergence gate uncomputable), 4 (no sandbox fork), 5 (no decontamination), 6 (buy-vs-build inversion) |
| **P1** | 9 | 7 (two S3 contracts), 8 (golden_diff serialization leak), 9 (premature 5-service orchestration), 10 (no secrets/PII gate), 11 (no trajectory IR), 12 (no cross-generation dedup), 13 (license gate too shallow), 14 (wm_tuples unbounded/speculative), 15 (CREATE-half absent) |
| **P2** | 6 | 16 (package placement), 17 (node ID contract), 18 (versioning/cards), 19 (bucket co-location), 20 (corpus acceptance metric), 21 (restart/budget semantics) |