Wave 20: 5-facet AWS-native architecture design (F1-F5)

Answers the dataset-gen+SFT-vs-RL-vs-BOTH question with 5 grounded
facet designs from the aws-native-architecture-design workflow:

- F1 systems-framing: BOTH = two loops at two timescales, not two phases;
component->loop mapping table; SFT-first ordering; S3 dataset contract
(6 typed prefixes); repo delta (tree_controller, s3_layout, sft_floor).
- F2 aws-datagen: per-stage native verdict (Glue 5.0 Spark ingest ->
Bedrock batch replay -> AWS Batch array on Spot test-exec -> EMR
Serverless normalize), Step Functions orchestration.
- F3 rl-sagemaker: runnable-now g5.2xlarge GSM8K GRPO smoke (<$1),
colocated vLLM, BYO container from PyTorch DLC; live-verified quota.
- F4 decoupled-diloco-s3: N single-instance jobs, ObjectStoreAllReduce
PUT-poll-mean over S3, strong-consistency correctness, IAM gotcha,
cheapest validating-run ladder.
- F5 fidelity-audit: rubric vs Composer 2.5/Composer-2 + load-bearing
papers; 5 FULLY-REPLICATED, k3-vs-k1 KL documented infidelity,
missing behavior-shaping recipe, tree+world-model 100% design;
prioritized build order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (5)

research/design-F1-systems-framing.md +229 -0
research/design-F2-aws-datagen.md +191 -0
research/design-F3-rl-sagemaker.md +299 -0
research/design-F4-decoupled-diloco-s3.md +143 -0
research/design-F5-fidelity-audit.md +106 -0

research/design-F1-systems-framing.md ADDED

	@@ -0,0 +1,229 @@

+# F1 — Systems Framing: Dataset-Gen+SFT, RL, or BOTH?
+**Facet question:** Is this a dataset-generation+SFT system, an RL system, or BOTH?
+**Committed answer: BOTH — and decisively so — structured as TWO LOOPS at two
+timescales, not two phases.** The repo already physically contains both halves.
+The OUTER (slow) loop is dataset/curriculum *construction*; the INNER (fast)
+loop is RL. They feed each other continuously: the inner loop's improved student
+generates the outer loop's next seed traces, and the inner loop's learned
+deliberation-confidence becomes the outer loop's branch gate. This is the report
+§5 verdict ("two loops at different timescales, not two phases") made concrete
+against the code.
+This is not an architectural opinion — it is forced by what the modules *are*:
+| Repo component | What it is, mechanically | Loop |
+|---|---|---|
+| `ingestion/claude_code.py` | Claude Code JSONL → `TraceState` (one node per assistant turn; `tool_error` flag; `strip_thinking=False`) — produces seed traces | **OUTER** (dataset) |
+| `teacher_replay.py` | N-teacher OpenRouter replay + `extract_dpo_pairs` — OFFLINE dataset gen, flat depth-1 stars hung off a frozen trace | **OUTER** (dataset) |
+| `datagen/substrates.py` (`SweBenchAdapter`) | SWE-bench tuple → `FeatureDeletionTask` (revert gold patch = synthesize broken repo) — task SYNTHESIS | **OUTER** (curriculum) |
+| `datagen/env.py` `FeatureDeletionEnv.step()/_grade()` | execution oracle: runs action in sandbox, returns observation; `_grade()` = masked `FAIL_TO_PASS` pass-fraction — this is the **RL env reward kernel** | **OUTER** (fitness) AND consumed by **INNER** (`reward_fn`) |
+| `datagen/curriculum.py` `DifficultyCurriculum` | p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02 — selection/sampling for the next round | **OUTER** (selection) |
+| The multi-model MCTS tree controller (to build) | recursion: apply each model's action via `env.step()`, branch again — depth-1 stars → tree | **OUTER** (the core delta) |
+| `trainer/composer_trainer.py` `ComposerReplicationTrainer` | a real `trl.GRPOTrainer` subclass; `total = grpo + α·sdpo + β·trace_replay_dpo` — **this is RL** | **INNER** (fast RL) |
+| `datagen/env.py` `FeatureDeletionEnv.reward_fn` | TRL `RewardFunc` adapter (`prompts, completions → list[float]`) — the env wearing its RL face | **INNER** (RL reward) |
+| `loss.py` `compose_loss` | TRL-free 3-channel harness; Channel 1 = LM cross-entropy on response tokens. **This is the SFT-able face** (the behavior-cloning limit that GRPO converges to) | **INNER**, also the **SFT-first floor** |
+| `safety/holdout.py` + `safety/kill_switch.py` (`HeldOutGuard`/`HeldoutSplit`), wired into the trainer | run-level collapse/reward-hacking tripwire on proxy-minus-realeval gap — an **RL-run safeguard** | **INNER** (safety) |
+So: `FeatureDeletionEnv` (exposed through `reward_fn`) is an RL env.
+`teacher_replay` is offline dataset gen.
+`ComposerReplicationTrainer(trl.GRPOTrainer)` is RL. `compose_loss`'s `lm_ce`
+channel is an SFT harness. `HeldOutGuard` is an RL-run safeguard. The system
+*is* both, by construction.
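The `DifficultyCurriculum` row above compresses a selection rule into one line (p̂(1−p̂) frontier weighting, retire above 0.95, quarantine below 0.02). A minimal sketch of that rule follows. It is not the repo's `DifficultyCurriculum`: the `TaskStats` type, the function name, the 0.5 prior for unseen tasks, and the normalization step are assumptions made for illustration.

```python
from dataclasses import dataclass

RETIRE_ABOVE = 0.95      # thresholds quoted in the component table
QUARANTINE_BELOW = 0.02


@dataclass
class TaskStats:
    """Hypothetical per-task counter; the real curriculum state may differ."""
    task_id: str
    passes: int
    attempts: int

    @property
    def p_hat(self) -> float:
        # Assumed prior of 0.5 for tasks with no attempts yet.
        return self.passes / self.attempts if self.attempts else 0.5


def frontier_weights(stats: list[TaskStats]) -> dict[str, float]:
    """Weight each task by p_hat * (1 - p_hat); drop retired/quarantined tasks."""
    raw: dict[str, float] = {}
    for s in stats:
        p = s.p_hat
        if p > RETIRE_ABOVE or p < QUARANTINE_BELOW:
            raw[s.task_id] = 0.0          # too easy or (so far) unsolvable
        else:
            raw[s.task_id] = p * (1.0 - p)  # peaks at p = 0.5, the learning frontier
    total = sum(raw.values())
    # Normalize so the weights can be used directly as sampling probabilities.
    return {k: (v / total if total else 0.0) for k, v in raw.items()}
```

The weight peaks at a pass rate of 0.5, which is why tasks the student solves about half the time dominate the next round's sample.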
+---
+## The SFT-first competence floor, THEN RL (mirrors Cursor's CPT+SFT→RL)
+`docs/COMPOSER_RECIPE_MAPPING.md` confirms Cursor's ordering is **Continued
+Pretraining → SFT → RL**, and that the repo deliberately *skips* CPT (starts
+from an already-code-tuned base, e.g. `Qwen3-Coder-7B` / `Qwen3-Coder-30B-A3B`).
+That leaves a two-step ordering this facet commits to:
+1. **SFT-first (competence floor).** Take the OUTER loop's *clean winning
+   trajectories* (oracle-clean `_grade()` passes only — Gate 1) and run standard
+   SFT. The carrier already exists: `compose_loss` with `alpha_sdpo=0,
+   beta_replay=0` reduces to `_lm_response_ce` — next-token cross-entropy masked
+   to assistant-response tokens. This is the "GRPO converges to BC under
+   deterministic rewards" limit stated in `loss.py`. It establishes a floor so
+   GRPO has a non-degenerate starting policy. (For a clean separation, an SFT
+   `Trainer` over the same corpus is equivalent; the point is the *corpus* —
+   winning leaves — and the *masking* — response tokens only.)
+2. **THEN RL.** `ComposerReplicationTrainer` runs GRPO/Dr.GRPO (`make_po_config`)
+   on the `FeatureDeletionEnv.reward_fn` execution oracle, with the optional
+   SDPO (α) and trace-replay-DPO (β) channels, plus the contested world-model
+   next-state head as a **second SDPO mode** (parameter-isolated; report §2/§4).
+This is exactly the report §5 line: "SFT-first establishes a competence floor on
+clean winning trajectories before RL — mirroring Cursor's CPT+SFT→RL ordering and
+the repo's own outer (datagen/teacher_replay) / inner
+(`ComposerReplicationTrainer`) split."
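To make the "reduces to `_lm_response_ce`" claim concrete, here is a small pure-Python sketch of response-masked cross-entropy and of a three-channel sum that collapses to it when both auxiliary weights are zero. It is an illustration under stated assumptions, not the code in `loss.py`: the function names, the per-token log-probability input, and the scalar channel values are all hypothetical.

```python
import math


def lm_response_ce(token_logprobs: list[float], response_mask: list[int]) -> float:
    """Mean negative log-likelihood over tokens where response_mask == 1.

    token_logprobs[i] is the model's log-probability of the i-th target token;
    prompt/tool tokens carry mask 0 and contribute nothing to the loss.
    """
    picked = [-lp for lp, m in zip(token_logprobs, response_mask) if m]
    return sum(picked) / len(picked) if picked else 0.0


def compose_loss_sketch(lm_ce: float, sdpo: float, replay_dpo: float,
                        alpha_sdpo: float = 0.0, beta_replay: float = 0.0) -> float:
    """Three-channel sum; with both weights at 0 it is exactly the SFT loss."""
    return lm_ce + alpha_sdpo * sdpo + beta_replay * replay_dpo


# Example: one prompt token (masked out) followed by two response tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
sft_loss = compose_loss_sketch(lm_response_ce(logprobs, [0, 1, 1]),
                               sdpo=3.0, replay_dpo=7.0)
```

With `alpha_sdpo=0` and `beta_replay=0` the auxiliary values are ignored, so the SFT-first job and the RL job can share one loss entry point and differ only in configuration.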
+---
+## The single diagram-able data flow
+```
+                    ┌──────────────────────────────────────────────────────────────┐
+                    │  OUTER LOOP (slow: hours→days; bursty, Spot-friendly, EKS)     │
+                    │                                                                │
+ raw base model ───►│  ingestion/claude_code.py   (seed traces, TraceState)          │
+ + Claude traces    │            │                                                   │
+                    │            ▼                                                   │
+                    │  multi-model MCTS tree controller  ──► N models branch          │
+                    │  (divergence-gated; teacher_replay generalized flat→tree)       │
+                    │            │                                                   │
+                    │            ▼  apply each action                                 │
+                    │  FeatureDeletionEnv.step()  ─► sandbox exec ─► new state         │
+                    │            │                                                   │
+                    │            ▼  leaf grade                                        │
+                    │  FeatureDeletionEnv._grade()  (masked FAIL_TO_PASS pass-frac)    │
+                    │            │  + DifficultyCurriculum.update (p(1-p) selection)   │
+                    │            ▼  HARVEST + TYPE the divergence                      │
+                    └────────────┼───────────────────────────────────────────────────┘
+                                 ▼
+        ┌────────────────────  S3  ────────────────────────────────────────────────┐
+        │  s3://<bucket>/runs/<run_id>/                                              │
+        │    sft_corpus/          ← clean WINNING trajectories  (SFT-first)          │
+        │    dpo_pairs/           ← near-miss (chosen=sibling winner, rejected=loser)│
+        │    rl_task_pool/        ← FeatureDeletionTask registry + curriculum priors │
+        │    wm_tuples/           ← (state, action, next_state, outcome) — ALL branches│
+        │    divergence_pairs/    ← divergence-annotated nodes (where siblings forked)│
+        │    holdout/             ← DISJOINT held-out eval anchor (never fed back)    │
+        │    diloco_rendezvous/   ← round_{NNNNNN}/rank_{RRRR}.pt (ObjectStoreAllReduce)│
+        └────────────┬──────────────────────────────────────────────────────────────┘
+                     ▼
+        ┌─────────  SFT JOB  ───────────────┐   (compose_loss lm_ce / SFT Trainer)
+        │  sft_corpus → competence-floor ckpt│   ──► seeds the RL init policy
+        └────────────┬──────────────────────┘
+                     ▼
+                    ┌──────────────────────────────────────────────────────────────┐
+                    │  INNER LOOP (fast: minutes→steps; resilience-bound, GPU)        │
+                    │  ComposerReplicationTrainer (trl.GRPOTrainer subclass)          │
+                    │   total = grpo(reward_fn on _grade)                              │
+                    │         + α·sdpo (hint-distill, JSD on aligned post-hint tokens) │
+                    │         + β·trace_replay_dpo (dpo_pairs)                          │
+                    │         + [world-model next-state head — 2nd SDPO mode, gated]   │
+                    │  HeldOutGuard tripwire on proxy−realeval gap (kill-switch)        │
+                    │  DiLoCo outer-sync every ~500-1000 steps via S3 diloco_rendezvous │
+                    └────────────┬──────────────────────────────────────────────────┘
+                                 ▼
+                        improved student model
+                                 │
+        ┌────────────────────────┴────────────────────────────────────────────────┐
+        │  FEEDBACK (why loops, not phases):                                         │
+        │   1. improved student generates the next round's seed traces (back to OUTER)│
+        │   2. its learned deliberation-confidence becomes the next round's branch    │
+        │      gate (the §3 bootstrap: cross-model disagreement early → learned        │
+        │      deliberation-confidence later — same lever, two levels)                 │
+        └────────────────────────────────────────────────────────────────────────────┘
+```
+---
+## What goes in S3 between the loops (crisp)
+The boundary between OUTER and INNER is S3 — and on AWS S3 *is* the DiLoCo
+rendezvous backend with zero new code (the `ObjectStoreAllReduce` fsspec path,
+`round_{NNNNNN}/rank_{RRRR}.pt`). The same bucket carries the dataset hand-off.
+Live target bucket in this account/region:
+`s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/` (us-west-2;
+a `sagemaker-dynamo-on-eks-hyperpod-*` bucket also exists, which suggests
+HyperPod-on-EKS has already been provisioned here).
+| S3 prefix | Contents | Producer (OUTER) | Consumer (INNER) |
+|---|---|---|---|
+| `sft_corpus/` | **SFT corpus** — clean winning trajectories (oracle-clean `_grade()` passes), masked to assistant-response tokens; JSONL | tree controller + `_grade()` Gate 1 | SFT job → `compose_loss` `lm_ce` (or SFT `Trainer`) |
+| `dpo_pairs/` | **DPO pairs** — `{chosen=sibling-winner-or-teacher-consensus, rejected=student/losing-branch, state_messages}` (the `DPOPair` schema from `teacher_replay.py`) | `extract_dpo_pairs` generalized to sibling winners | Channel 3 (`β·trace_replay_dpo`) |
+| `rl_task_pool/` | **RL task pool** — `FeatureDeletionTask` registry (`repo, broken_image, fail_to_pass, pass_to_pass, deleted_symbols, test_command`) + `DifficultyCurriculum` priors | `SweBenchAdapter` + curriculum | `FeatureDeletionEnv` registry → `reward_fn` |
+| `divergence_pairs/` | **Divergence-annotated pairs** — nodes where sibling next-action distributions disagreed (the SDPO privileged-info conditioning variable + which sibling subtree separated first) | tree controller (records pre-expansion divergence) | Channel 2 SDPO `ctx_teacher` splice / `SiblingBootstrapGenerator` |
+| `wm_tuples/` | **World-model next-state tuples** — `(state, action, next_state, outcome)` from **ALL** branches incl. failures (CWM "train-on-all" target; the safe home for failed-branch signal) | every `env.step()` observation, every leaf grade | world-model next-state head (2nd SDPO mode), gated |
+| `holdout/` | **Disjoint held-out eval anchor** — `HeldoutSplit` partition; NEVER fed back to the generator | `HeldoutSplit.split(seed=…)` once | `HeldOutGuard.heldout_eval_fn()` |
+| `diloco_rendezvous/` | DiLoCo pseudo-gradient exchange, `round_{NNNNNN}/rank_{RRRR}.pt` | inner replicas | inner replicas (`ObjectStoreAllReduce`) |
+Note the keystone (report §4/§2): a *failed* branch is poison for the policy
+gradient but **gold for the world model** — so a loser goes to `wm_tuples/`
+(safe, no policy penalty) and optionally to `dpo_pairs/` as a contrastive
+`rejected` against a sibling winner (never as raw negative gradient), but never
+to `sft_corpus/`. That is "type the signal and route it" realized as S3 prefixes.
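The routing rule in the paragraph above can be sketched as a small function. This is a hedged illustration of "type the signal and route it", not the planned `s3_layout.py`: the function names and the two boolean inputs are assumptions, and only the prefix names and the `runs/<run_id>/` root come from the table.

```python
def route_branch(passed: bool, has_sibling_winner: bool) -> list[str]:
    """Return the S3 prefixes a finished branch is written to.

    passed: the leaf got an oracle-clean grade.
    has_sibling_winner: some sibling branch from the same node passed.
    """
    prefixes = ["wm_tuples/"]            # ALL branches feed the world model
    if passed:
        prefixes.append("sft_corpus/")   # winners only
    elif has_sibling_winner:
        # A loser is used only as the `rejected` side of a contrastive pair.
        prefixes.append("dpo_pairs/")
    return prefixes


def s3_key(bucket: str, run_id: str, prefix: str, name: str) -> str:
    """Build an object URI under the run root described in the diagram."""
    return f"s3://{bucket}/runs/{run_id}/{prefix}{name}"
```

A failed branch with no winning sibling therefore lands in `wm_tuples/` alone, which matches the rule that failures never reach `sft_corpus/` and never act as a raw negative gradient.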
+---
+## AWS-native realization (concrete, this account)
+- **OUTER loop on EKS** (single control plane): Argo Workflows DAG, one node =
+  one divergence-gated branch; vLLM RayService pods for open-weight model
+  families + API-egress pods for hosted teachers; gVisor (`runsc` RuntimeClass)
+  sandbox pods running `FeatureDeletionEnv._grade()`. Writes all six dataset
+  prefixes to S3 via IRSA.
+- **SFT job** between loops: a SageMaker Training Job (`File` mode while
+  `sft_corpus/` is small; `FastFile` once it grows past ~50–100 GB) OR an EKS GPU
+  pod; reads `sft_corpus/`, writes a competence-floor checkpoint to
+  `s3://…/runs/<id>/ckpt_sft/`.
+- **INNER loop**: `ComposerReplicationTrainer` on a Karpenter p5/g6e NodePool,
+  swappable to a HyperPod-attached node-group (1:1 EKS↔HyperPod mapping, already
+  provisioned per the `sagemaker-dynamo-on-eks-hyperpod-*` bucket) for
+  resilience-bound long runs. DiLoCo replicas rendezvous only through S3 — no
+  cross-job NCCL — so a straggler blocks at the poll loop (`timeout_s=1800`)
+  instead of deadlocking a gang. `SageMakerExecutor` is the bursty fallback
+  (N single-instance Training Jobs, `EnableNetworkIsolation=False` so the
+  container can S3 PUT/GET the rendezvous).
+The DiLoCo math, `MockManager`, `make_diloco_outer_loop`, the trainer, the loss,
+and the env are **untouched** across all of this — the `ServerlessExecutor`
+Protocol + `ObjectStoreAllReduce` are the entire portability contract.
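The PUT-poll-mean rendezvous is easier to reason about with a toy version. The sketch below is not `ObjectStoreAllReduce`: a local directory stands in for the S3 prefix, JSON lists stand in for `.pt` tensors, and the zero-padding widths (6 and 4) are inferred from the `round_{NNNNNN}/rank_{RRRR}` placeholder, so all of these are assumptions.

```python
import json
import tempfile
import time
from pathlib import Path


def rendezvous_key(round_idx: int, rank: int) -> str:
    """Object key for one replica's contribution (JSON instead of .pt here)."""
    return f"round_{round_idx:06d}/rank_{rank:04d}.json"


def put_shard(root: Path, round_idx: int, rank: int, vec: list[float]) -> None:
    """Publish this rank's pseudo-gradient; stands in for an S3 PUT."""
    path = root / rendezvous_key(round_idx, rank)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(vec))
    tmp.replace(path)  # atomic rename: readers never see a partial object


def poll_mean(root: Path, round_idx: int, world_size: int,
              timeout_s: float = 1800.0, poll_s: float = 0.05) -> list[float]:
    """Block until every rank has published, then return the element-wise mean."""
    deadline = time.monotonic() + timeout_s
    paths = [root / rendezvous_key(round_idx, r) for r in range(world_size)]
    while not all(p.exists() for p in paths):
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_idx}: not all ranks arrived")
        time.sleep(poll_s)
    vecs = [json.loads(p.read_text()) for p in paths]
    return [sum(col) / world_size for col in zip(*vecs)]


# Two replicas exchanging a 2-element pseudo-gradient for round 0.
_root = Path(tempfile.mkdtemp())
put_shard(_root, 0, 0, [1.0, 2.0])
put_shard(_root, 0, 1, [3.0, 4.0])
averaged = poll_mean(_root, 0, world_size=2, timeout_s=5.0)
```

A missing rank surfaces as a `TimeoutError` on the waiting replicas rather than a hung collective, which is the straggler behavior described for the poll loop.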
+---
+## Repo delta to realize the framing
+1. `composer_replication/datagen/tree_controller.py` (NEW, ~250-350 LOC) — the
+   recursion: apply each model's candidate action via `FeatureDeletionEnv.step()`,
+   branch again, grade leaves, emit the six S3 prefixes. The core OUTER delta.
+2. `composer_replication/pipeline/s3_layout.py` (NEW, ~80 LOC) — typed writers
+   for `sft_corpus/ | dpo_pairs/ | rl_task_pool/ | divergence_pairs/ | wm_tuples/
+   | holdout/`; the OUTER→INNER contract in one place.
+3. `composer_replication/pipeline/sft_floor.py` (NEW, ~60 LOC) — SFT-first
+   driver: read `sft_corpus/`, run `compose_loss` (`alpha_sdpo=0, beta_replay=0`)
+   or an SFT `Trainer`, write `ckpt_sft/`. Wraps existing `_lm_response_ce`.
+4. `composer_replication/trainer/composer_trainer.py` (EDIT, ~40 LOC) — add the
+   world-model next-state head as a 2nd SDPO mode (parameter-isolated adapter +
+   `<deliberate>` token), reading `wm_tuples/`; gated OFF by default.
+5. `composer_replication/diloco/serverless/eks.py` (EDIT/FINISH) — `EKSExecutor`
+   Indexed Job mapping for the inner loop (a few hundred LOC; sibling of
+   `ModalSpawnExecutor`).
+6. `[serverless]` extra: add `s3fs`/`boto3`/`kubernetes` (the documented dep gap).
+---
+## Open questions
+- Where exactly does SFT-first run — a dedicated SageMaker Training Job, or an
+  EKS pod sharing the inner NodePool? (Cost/latency tradeoff; corpus size gates
+  File vs FastFile.)
+- Should the SFT-floor checkpoint be re-derived each outer generation
+  (full SFT→RL re-warm) or only once at gen-0 (RL-only thereafter)? The flywheel
+  feedback suggests gen-0-only, with RL carrying subsequent gains.
+- Is the world-model head trained inside the inner GRPO trainer (shared step) or
+  as a separate offline pass over `wm_tuples/`? Report §2 favors
+  parameter-isolation; a separate pass is the strongest isolation.
+## Citations
+- research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md §5 (two
+  loops), §2 (world-model head as 2nd SDPO mode), §4 (type-the-signal routing),
+  §6 (reuse/build table)
+- composer_replication/datagen/env.py:63-94 (`step`/`_grade`/`reward_fn`)
+- composer_replication/teacher_replay.py:162-262 (`replay_trace`,
+  `extract_dpo_pairs`, `DPOPair`)
+- composer_replication/trainer/composer_trainer.py:54-178 (3-channel
+  `_compute_loss`), :184-251 (`HeldOutGuard` wiring)
+- composer_replication/loss.py:71-261, :277-304 (`compose_loss`, `_lm_response_ce`)
+- composer_replication/safety/holdout.py (`HeldoutSplit` disjointness)
+- composer_replication/diloco/serverless/{executor.py,allreduce.py,sagemaker.py}
+  (Protocol + `ObjectStoreAllReduce` + S3 rendezvous)
+- composer_replication/ingestion/claude_code.py:6-21 (one-node-per-turn,
+  `strip_thinking`)
+- docs/COMPOSER_RECIPE_MAPPING.md:48-137 (CPT+SFT→RL ordering; repo skips CPT)
+- Live: `aws s3 ls` → amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d,
+  sagemaker-dynamo-on-eks-hyperpod-* (us-west-2, acct 386931836011)

research/design-F2-aws-datagen.md ADDED

	@@ -0,0 +1,191 @@

+# Facet F2 — AWS-Native Dataset Generation (the outer loop)
+**Account 386931836011 · us-west-2 · Admin · 2026-06-09**
+The outer (slow) loop of §5 — ingest → multi-teacher replay → invert SWE substrates into FeatureDeletion tasks → normalize to SFT/DPO corpus on S3 — is **embarrassingly parallel, bursty, fault-tolerant** (report §5, §8). The wrong move is to pick one service for the whole loop. Each of the four stages has a different compute shape, so each gets the right AWS-native primitive, all stitched by **Step Functions** (the AWS-native analog of the report's Argo controller for the *non-cluster* path) and all reading/writing one S3 dataset contract.
+This note commits to a concrete service per stage and maps every choice to the repo file it backs.
+---
+## TL;DR — the per-stage verdict
+| Outer-loop stage | Repo code | Compute shape | **AWS service (committed)** | Why not the others |
+|---|---|---|---|---|
+| (a) Ingest + clean Claude Code traces | `ingestion/claude_code.py` | Light CPU, JSONL → TraceState, IO-bound | **Glue Spark ETL job** (Glue 5.0) | Trivial scale; one Spark job over a JSONL prefix; Glue's Data Catalog + bookmarks are free here. SM Processing also fine but Glue is the cheaper IO-bound default. |
+| (b) Replay each state across N teachers | `teacher_replay.py` | API-bound fan-out, N×T calls, no local GPU | **Bedrock batch inference** (`CreateModelInvocationJob`, one job per teacher) + a thin **EMR Serverless** aggregation step | NOT Glue-Ray (end-of-support). NOT SageMaker-Processing-with-Ray for the *calls* (you'd pay for idle instances waiting on API latency). Bedrock batch = 50% discount, S3-native, the fan-out IS the job. |
+| (c) Synthesize FeatureDeletionEnv tasks (run tests in sandboxes) | `datagen/substrates.py`, `env.py`, `docker_sandbox.py`, `validator.py` | CPU-heavy, untrusted code, 1 container per task, fault-tolerant | **AWS Batch array jobs on EC2 Spot** (container = the existing `DockerSandbox` image) | NOT SM Processing (no per-task isolation/retry granularity; 1 job = 1 fleet). NOT Fargate-only (no privileged containers or custom runtimes such as gVisor, a 16-vCPU per-task cap, and a weaker fit than EC2 Spot for cheap bulk fan-out). Batch array index = task index, matches SWE-bench `--max_workers` model exactly. |
+| (d) Normalize → SFT corpus + DPO pairs | `replaysim/normalize.py` (data-juicer), `teacher_replay.extract_dpo_pairs` | CPU Spark/dataframe dedup + the data-juicer op-graph | **EMR Serverless (Spark)** running data-juicer's Ray/standalone executor in `--mode local` per partition, OR SM Processing for the single-node data-juicer path | EMR Serverless is the Spark-style dataframe normalize/dedup home; data-juicer runs CPU-only here. Glue ETL would also work but EMR Serverless gives finer Spark control + Graviton price/perf. |
+The recurring principle: **API-bound and test-execution work must NOT live on a service that bills for idle wall-clock** (Glue/SM Processing keep instances hot while you wait on Bedrock or a 5-minute pytest). Bedrock batch is async-priced; Batch array jobs are per-task Spot. Spark-shaped dataframe work (ingest, dedup, normalize) goes to the Spark services (Glue ETL for the trivial ingest, EMR Serverless for the heavy normalize).
+---
+## Stage (a) — Ingest + clean traces · Glue 5.0 Spark ETL
+`ingestion/claude_code.py::ClaudeCodeIngester.ingest()` is a pure JSONL→TraceState transform (one TraceState per assistant turn; `strip_thinking=False` per report §6 — 67% of error-recovery turns are pure thinking). This is IO-bound dataframe work over a prefix of session files.
+**Service: AWS Glue 5.0 Spark ETL job.**
+- Input: `s3://<datagen-bucket>/raw/claude_code/**/*.jsonl` (Claude Code session JSONL uploaded from `~/.claude/projects`).
+- The Glue script runs `mapPartitions` over the JSONL files; each partition runs `ClaudeCodeIngester().ingest(path)` unchanged (it already yields `TraceState`), filters subagent/sidechain records, and writes Parquet.
+- Output: `s3://<datagen-bucket>/traces/v1/run_id=<id>/part-*.parquet`, partitioned by `run_id`.
+- Glue Data Catalog table `composer_traces` makes the trace store queryable by Athena (debug: "how many error-site turns this run?").
+Why Glue over SM Processing here: ingestion is small, periodic, IO-bound; Glue's per-DPU-second billing + job bookmarks (skip already-ingested sessions) fit better than spinning a Processing fleet. `ClaudeCodeIngester` runs verbatim inside the Spark UDF — zero repo change to the ingester itself.
+> **Live caveat (worth flagging):** `ClaudeCodeIngester.ingest()` is a per-file, single-pass, *record-parallel* pure-Python iterator with no shuffle/join — i.e. it is **not** intrinsically Spark-shaped. **SageMaker Processing** with `ProcessingInput.S3DataDistributionType = ShardedByS3Key` (custom container, the ingester runs unchanged, slice config from `processingjobconfig.json`/`resourceconfig.json`) is the *cleaner* primitive for this exact shape and avoids any dependence on the Glue stack — relevant because **AWS Glue-for-Ray is now closed to new customers** (confirmed in the Glue engine docs), so leaning on Glue here means committing to Glue-for-Spark specifically. Verdict stands as Glue ETL for the trivial periodic ingest (catalog + bookmarks are a real convenience), with SM Processing as the equally-valid, Glue-independent fallback. Either way the ingester code is untouched.
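The "record-parallel, no shuffle/join" shape described in the caveat can be shown with a stand-in iterator. This is not `ClaudeCodeIngester`: the record fields `type`, `isSidechain`, and `message` are assumptions about the session JSONL schema, and the output dict is a placeholder for `TraceState`.

```python
import json
from typing import Iterable, Iterator


def iter_assistant_turns(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per main-thread assistant turn, in file order.

    Each input line is handled independently, so a file (or any shard of
    files) can be processed without seeing any other file.
    """
    turn_index = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue                      # tolerate blank lines
        rec = json.loads(raw)
        if rec.get("isSidechain"):        # assumed marker for subagent records
            continue
        if rec.get("type") != "assistant":
            continue
        yield {"turn_index": turn_index, "message": rec.get("message")}
        turn_index += 1
```

Because nothing here looks across files, the same function body works inside a Spark `mapPartitions` call or inside a sharded Processing container, which is the point of the caveat.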
+---
+## Stage (b) — N-teacher replay · Bedrock batch, NOT OpenRouter
+Today `teacher_replay.py` calls **OpenRouter** over httpx with `DEFAULT_TEACHERS = [claude-opus, gpt-5, deepseek-v4]` and a `max_total_usd` cap. The facet question: stay API or move to Bedrock?
+### Verdict: move the AWS-native default to **Bedrock batch inference**, keep OpenRouter as a fallback adapter.
+Reasoning grounded in research + repo:
+1. **Bedrock batch = 50% of on-demand pricing**, S3-in/S3-out, ~24h turnaround SLA. The replay is the single most expensive piece of the whole system (report §10: ~$0.98/trace at N=3, ~$64/trace at 8-teacher×1000-step). A 50% cut on the dominant cost line is the highest-leverage AWS-native move in this facet, and replay is *exactly* the latency-insensitive offline workload Bedrock batch targets.
+2. **The fan-out IS one job per teacher.** Bedrock batch does not multiplex models in a single job — `CreateModelInvocationJob` takes one `modelId`. So the N-teacher fan-out = N batch jobs over the *same* JSONL, each writing to its own S3 prefix. The flat `for state: for teacher:` loop in `replay_trace` becomes "emit one shared JSONL of all states, submit N jobs." This is structurally *simpler* than the current per-call gather.
+3. **Heterogeneity survives on Bedrock — VERIFIED LIVE in this account.** `aws bedrock list-foundation-models --region us-west-2` returns a genuinely multi-lab pool: `anthropic.claude-opus-4-8`, `anthropic.claude-opus-4-7`, `anthropic.claude-opus-4-6-v1`, `anthropic.claude-sonnet-4-6`, `anthropic.claude-haiku-4-5-...`, `deepseek.v3.2`, `deepseek.r1-v1:0`, `meta.llama4-maverick-17b-instruct`, `meta.llama3-3-70b-instruct`, etc. That is Claude + DeepSeek + Llama = three different families, satisfying the report's N≥3 heterogeneous-population anti-collapse safeguard (§5 #4) and the "drop same-family teachers if Claude is the student" leakage rule (§6).
+   - **CRITICAL batch-eligibility split (read live from the Bedrock "Supported Regions and models for batch inference" table):** not every model has a *batch* ID. **Batch-eligible** in/through `us-west-2`: **DeepSeek V3.2** (`deepseek.v3.2`, single-region `us-west-2`), **Claude Sonnet 4.6 / Opus 4.6** (via the `us.` **cross-region inference profile** whose destinations include `us-west-2`), and the Nova/Titan families. **On-demand ONLY (no ARN-versioned batch ID, so NOT batchable):** **Claude Opus 4.7 / 4.8** — reachable via `bedrock-runtime` `Converse`/`InvokeModel` and fanned out with a bounded async/thread pool exactly like today's `replay_trace`, but you pay on-demand. So the cheap bulk batch pool is `{deepseek.v3.2, us.anthropic.claude-sonnet-4-6, us.anthropic.claude-opus-4-6}`; if the teacher set must include 4.7/4.8 they ride the on-demand path or are substituted by batchable Opus 4.6.
+4. **Bedrock data does not train Anthropic models** (governance) — a real reason to prefer it over OpenRouter for a self-improving loop.
+### Concrete wiring
+`teacher_replay.py` gains a `BedrockBatchTeacherPool` alongside the OpenRouter path (the `TeacherSpec` slug becomes a Bedrock `modelId` or inference-profile id like `us.anthropic.claude-haiku-4-5-...`):
+```
+# new: teacher_replay_bedrock.py  (~180 LOC)
+import boto3
+
+def submit_replay_batch(states, teachers, s3_in, s3_out, role_arn) -> list[str]:
+    # 1. (elided) write ONE shared abc.jsonl to s3_in, one record per state:
+    #    {recordId: state_id, modelInput: {anthropic_version, messages, max_tokens}}
+    #    with messages = state["messages"]. recordId == state_id is the join key.
+    bedrock = boto3.client("bedrock")
+    job_arns = []
+    for teacher in teachers:  # 2. one job per teacher: a batch job takes a single modelId
+        resp = bedrock.create_model_invocation_job(
+            modelId=teacher.model_id, roleArn=role_arn,
+            jobName=f"replay-{teacher.slug}",  # placeholder naming scheme
+            inputDataConfig={"s3InputDataConfig": {"s3Uri": s3_in}},
+            outputDataConfig={"s3OutputDataConfig": {"s3Uri": f"{s3_out}/{teacher.slug}/"}})
+        job_arns.append(resp["jobArn"])
+    # 3. poll bedrock.get_model_invocation_job(jobIdentifier=arn) until status == "Completed".
+    return job_arns
+```
+The output `.jsonl.out` rows carry `recordId` (=state_id) + `modelOutput`; an EMR Serverless step joins all N teacher outputs back by `state_id` into the exact `list[TeacherCallResult]` shape `extract_dpo_pairs` already consumes — so `extract_dpo_pairs` and the entire DPOPair contract are **byte-for-byte untouched**.
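The fan-in join is a plain group-by on `recordId`. The sketch below does it in memory as a stand-in for the EMR Serverless step. The output row shape (`teacher`, `output`) is a placeholder rather than the repo's `TeacherCallResult`, and both function names are hypothetical.

```python
import json
from collections import defaultdict


def join_teacher_outputs(outputs_by_teacher: dict[str, list[str]]) -> dict[str, list[dict]]:
    """Group per-teacher batch output lines by recordId (== state_id)."""
    joined: dict[str, list[dict]] = defaultdict(list)
    for teacher in sorted(outputs_by_teacher):       # stable teacher order
        for line in outputs_by_teacher[teacher]:
            row = json.loads(line)
            joined[row["recordId"]].append(
                {"teacher": teacher, "output": row.get("modelOutput")}
            )
    return dict(joined)


def incomplete_states(joined: dict[str, list[dict]], n_teachers: int) -> list[str]:
    """State ids that are missing at least one teacher's answer."""
    return sorted(sid for sid, rows in joined.items() if len(rows) < n_teachers)
```

Checking `incomplete_states` before calling `extract_dpo_pairs` is one way to catch a teacher job that failed or skipped records, since a pair built from a partial teacher set would silently weaken the consensus signal.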
+**Constraints to honor (from research):** Bedrock batch input file ≤ 1 GB, total job ≤ 5 GB, default 100k records/job (adjustable via Service Quotas), ≥ a minimum-records floor per model. The submitter must shard a >100k-state run into multiple JSONL files. A `BedrockBatchTeacherPool` is the right home for that sharding.
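The sharding requirement above amounts to a greedy split on two limits at once. A minimal sketch, assuming the record and byte ceilings quoted in this note are passed in as parameters (the function name and the default values are illustrative, and the per-model minimum-records floor is not handled here):

```python
from typing import Iterable, Iterator


def shard_records(lines: Iterable[str],
                  max_records: int = 100_000,
                  max_bytes: int = 1_000_000_000) -> Iterator[list[str]]:
    """Greedily pack JSONL lines into shards under both limits.

    A single line larger than max_bytes still gets its own shard; the caller
    should reject such records upstream.
    """
    shard: list[str] = []
    size = 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1          # +1 for the newline
        if shard and (len(shard) >= max_records or size + n > max_bytes):
            yield shard
            shard, size = [], 0
        shard.append(line)
        size += n
    if shard:
        yield shard
```

The last shard can come out smaller than a model's minimum-records floor, so a real submitter would need to rebalance the final two shards or route the remainder through the on-demand path.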
+**Where OpenRouter stays:** when N>3 teachers from labs Bedrock doesn't host (e.g. a brand-new frontier model), or when sub-24h latency is needed for a hot debugging loop. The `TeacherSpec` TypedDict already abstracts provider; we add a `provider: "bedrock"|"openrouter"` field and route. This keeps `replay_trace`'s async OpenRouter path as the low-latency escape hatch and Bedrock batch as the cheap bulk default.
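The provider routing, combined with the batch-eligibility split from the verdict, can be sketched as a three-way partition. `TeacherSpecSketch` is a hypothetical reduced stand-in for the repo's `TeacherSpec` (whose other fields are not shown in this note), and the batch-eligible set simply copies the three ids listed above.

```python
from typing import Literal, TypedDict

# Batch-eligible pool as listed in this note; other Bedrock ids ride on-demand.
BEDROCK_BATCH_IDS = {
    "deepseek.v3.2",
    "us.anthropic.claude-sonnet-4-6",
    "us.anthropic.claude-opus-4-6",
}


class TeacherSpecSketch(TypedDict):
    slug: str                                   # Bedrock modelId/profile id or OpenRouter slug
    provider: Literal["bedrock", "openrouter"]


def route_teachers(teachers: list[TeacherSpecSketch]) -> dict[str, list[str]]:
    """Partition teachers into the three execution paths."""
    routes: dict[str, list[str]] = {
        "bedrock_batch": [], "bedrock_on_demand": [], "openrouter": [],
    }
    for t in teachers:
        if t["provider"] == "openrouter":
            routes["openrouter"].append(t["slug"])
        elif t["slug"] in BEDROCK_BATCH_IDS:
            routes["bedrock_batch"].append(t["slug"])
        else:
            routes["bedrock_on_demand"].append(t["slug"])
    return routes
```

Keeping the eligibility set as data rather than code means a model gaining a batch id later is a one-line change.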
+### Why NOT Glue-Ray / SM-Processing-with-Ray for the fan-out
+The facet floats "Glue Ray jobs / SageMaker Processing with Ray for the N-model parallel fan-out." **AWS Glue for Ray is end-of-support** — AWS now explicitly recommends Ray-on-EKS (KubeRay) instead (docs: *AWS Glue for Ray end of support*). And running the *teacher API calls* on a Ray cluster (Glue or SM Processing) means paying for idle CPU/instances while threads block on model latency — the anti-pattern above. Ray's place in this facet is **stage (c)** orchestration of sandbox fan-out *if* we ever want Ray semantics, but Batch array jobs are the simpler AWS-native fit there too. So: **Bedrock batch for the calls, EMR Serverless to fan-in.**
+---
+## Stage (c) — FeatureDeletion synthesis + test execution · AWS Batch array jobs on Spot
+This is the compute-heavy, genuinely-new part (report §8: "per-branch sandbox isolation is the throughput ceiling of the whole idea"). Two sub-steps:
+**c1. Schema inversion** — `SweBenchAdapter.to_task(instance)` (`substrates.py`) maps a SWE-bench-shaped row → `FeatureDeletionTask` (revert gold patch = broken repo; FAIL_TO_PASS = reward target; PASS_TO_PASS = guard). Pure CPU, no Docker. This runs inside the **Glue ingest job (a)** or a tiny Lambda — it's a dict transform. `is_redistributable()` license-gate (GPL/AGPL/LGPL filter) runs here.
+**c2. Materialize + validate the broken repo** — this is where it gets heavy. For each task: pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()` (the PRIMARY reward-hack control — strip `__pycache__`/`.git`/`*.pyc`), run the test command, confirm FAIL_TO_PASS actually fails and PASS_TO_PASS actually passes (the 4-gate `validator.py` solvability check). Each task = one untrusted, isolated, ~1-5 minute container run.
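The two test-outcome checks in c2, and the pass-fraction reward they protect, reduce to a few lines once test results are a name-to-bool map. This is a sketch of the idea only: it is not `validator.py` (whose four gates are not spelled out here) or `FeatureDeletionEnv._grade()`, and both function names are hypothetical.

```python
def is_valid_broken_task(results: dict[str, bool],
                         fail_to_pass: list[str],
                         pass_to_pass: list[str]) -> bool:
    """Check the broken repo: every FAIL_TO_PASS test must fail and every
    PASS_TO_PASS test must pass before any fix is attempted."""
    if not fail_to_pass:
        return False                      # no reward target, nothing to learn
    f2p_all_fail = all(results.get(t) is False for t in fail_to_pass)
    p2p_all_pass = all(results.get(t) is True for t in pass_to_pass)
    return f2p_all_fail and p2p_all_pass


def grade(results: dict[str, bool], fail_to_pass: list[str]) -> float:
    """Reward = fraction of FAIL_TO_PASS tests passing after the attempt.
    Tests outside the FAIL_TO_PASS set are masked out of the score."""
    if not fail_to_pass:
        return 0.0
    return sum(1 for t in fail_to_pass if results.get(t) is True) / len(fail_to_pass)
```

A test missing from `results` counts as neither pass nor fail, so a task whose tests never ran is rejected rather than accepted by default.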
+**Service: AWS Batch array jobs, EC2 Spot compute environment, container = the existing `DockerSandbox` image.**
+```
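+import boto3
+
+batch = boto3.client("batch")
+# run_id, n_tasks, and s3_manifest come from the submitting driver.
+batch.submit_job(
+  jobName=f"fd-validate-{run_id}",
+  jobQueue="fd-sandbox-spot",
+  jobDefinition="fd-docker-sandbox:N",   # the DockerSandbox image in ECR; N = job-definition revision
+  arrayProperties={"size": n_tasks},      # 2 to 10,000 children per array job
+  containerOverrides={"environment": [{"name":"TASK_MANIFEST_S3","value": s3_manifest}]})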
+```
+- Each array child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task at that line in the S3 task manifest (`tasks/v1/run_id=<id>/manifest.jsonl`), boots the broken image, runs `validator` + `FeatureDeletionEnv._grade()` against `LocalSubprocessSandbox`/`DockerSandbox`, writes one result row to `s3://.../task_grades/<run_id>/<task_id>.json`.
+- This is **exactly the SWE-bench harness `--max_workers` model**: SWE-bench runs 1 container per instance with N parallel workers; DeepSWE hit Docker-daemon limits at 512 containers/iteration and preloaded images onto NVMe (report §8). Batch array jobs give that fan-out with managed retry (`retryStrategy`), Spot interruption handling, and per-child CloudWatch logs — without us building a container scheduler.
+- **Isolation tier (report §8 layered posture):** the `DockerSandbox` already bakes the lockdown recipe (`network_mode=none`, `read_only`, `cap_drop=ALL`, `no-new-privileges`, `pids_limit`, gVisor `runtime='runsc'` when available). On Batch EC2 (not Fargate), we can install gVisor on the AMI and run untrusted model-generated fix attempts under `runsc` by default; Kata+Firecracker for adversarial code requires **self-managed** node groups (the EKS-MNG CPU-Options gotcha from §8 applies to Batch managed EC2 too — use a self-managed launch template if nested virt is needed).
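On the child side, the array-index lookup described in the first bullet can be sketched as follows. `AWS_BATCH_JOB_ARRAY_INDEX` is the variable Batch sets on each array child; the function names, the `task_id` manifest field, and passing the manifest as already-downloaded lines are assumptions for illustration.

```python
import json
import os
from typing import Mapping, Sequence


def load_task_for_child(manifest_lines: Sequence[str],
                        env: Mapping[str, str] = os.environ) -> dict:
    """Pick this array child's task: manifest line N for array index N (0-based)."""
    idx = int(env["AWS_BATCH_JOB_ARRAY_INDEX"])
    if not 0 <= idx < len(manifest_lines):
        raise IndexError(f"array index {idx} outside manifest of {len(manifest_lines)} tasks")
    return json.loads(manifest_lines[idx])


def result_key(run_id: str, task_id: str) -> str:
    """Key for the child's single result row, following the layout in the bullet."""
    return f"task_grades/{run_id}/{task_id}.json"
```

Submitting with `arrayProperties.size` equal to the manifest line count keeps the index-to-task mapping total, and the explicit bounds check turns a stale manifest into a failed child instead of a silently wrong grade.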
+### Why NOT SageMaker Processing for the sandboxes
+SM Processing is one job = one fixed instance fleet with `ShardedByS3Key` splitting input across instances. It has **no per-task retry**, no Spot-native interruption recovery at task granularity, and no array-index primitive — a single poisoned task (infinite loop, fork bomb) can wedge a whole Processing instance's shard. Batch array jobs isolate each task to its own container with its own timeout (`DockerSandbox.exec_timeout_s` + Batch `timeout`), retry, and Spot replacement. For "10,000 untrusted code executions, each independent, some will hang," Batch is the textbook fit and SM Processing is not.
+### Why NOT plain Fargate
+Fargate does not support privileged containers, so it cannot run gVisor/Kata; its tasks are also size-capped (16 vCPU / 120 GB at most), and it is pricier than Spot for bursty bulk work. Batch-on-EC2-Spot is 60-70% cheaper for this fault-tolerant fan-out (report §10 Spot savings).
+---
+## Stage (d) — Normalize → SFT corpus + DPO pairs · EMR Serverless (Spark)
+After (b) fan-in produces `list[TeacherCallResult]` and `extract_dpo_pairs` produces `DPOPair` rows, `replaysim/normalize.py::DJNormalizer` runs the **data-juicer op-graph** (`recipes/replaysim/default.yaml`: length filter ×2, words-num ×2, special-char ×2, document_deduplicator). ADR-004 chose data-juicer precisely because the op set is **CPU-only** (no NeMo-Curator GPU dep).
+**Service: EMR Serverless (Spark) application, Graviton.**
+- The normalize is Spark-shaped dataframe work: read all DPOPair rows for a run, run the data-juicer op-graph, dedup across the corpus, write the partitioned corpus.
+- data-juicer's `DefaultExecutor` runs in-process per Spark partition (the repo's `DJNormalizer.normalize()` already does file-in/file-out via `init_configs` + `DefaultExecutor`). On EMR Serverless we `mapPartitions` → run `DJNormalizer(skip_dj=False).normalize(partition_rows)` per partition, then a Spark `dropDuplicates` for cross-partition full-corpus dedup (the repo notes `document_deduplicator` is per-batch only — Spark closes that gap natively).
+- EMR Serverless auto-scales workers, bills per vCPU/GB-second on Graviton, and is the AWS-native Spark home for "dataframe normalize + dedup at scale." The repo's `replaysim` already *is* the op-graph — EMR Serverless just runs it distributed.
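The per-partition versus full-corpus dedup gap described above can be shown without Spark. The sketch below is plain Python under stated assumptions: `dedup` stands in for the per-batch `document_deduplicator`, the second pass stands in for Spark `dropDuplicates`, and rows are hypothetical dicts with a `text` field.

```python
def dedup(rows):
    """Keep the first occurrence of each text, preserving input order."""
    seen, kept = set(), []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            kept.append(row)
    return kept


def normalize_partitions(partitions):
    """Two-stage dedup: partition-local first, then across the whole corpus."""
    # Stage 1: each partition only sees its own rows (the mapPartitions step).
    local = [dedup(partition) for partition in partitions]
    # Stage 2: a global pass catches duplicates that straddle partitions.
    merged = [row for partition in local for row in partition]
    return local, dedup(merged)


if __name__ == "__main__":
    parts = [
        [{"text": "a"}, {"text": "a"}, {"text": "b"}],
        [{"text": "b"}, {"text": "c"}],
    ]
    local, corpus = normalize_partitions(parts)
    print(sum(len(p) for p in local), "rows after per-partition dedup")
    print(len(corpus), "rows after cross-partition dedup")
```

The duplicate `b` survives stage 1 because its two copies sit in different partitions; only the global pass removes it.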
+Why EMR Serverless over Glue ETL here: the normalize is the heavier of the two Spark stages, data-juicer wants control over the Python runtime (custom `--py-files`/wheel), and EMR Serverless gives finer Spark tuning + Graviton price/perf than Glue's DPU model. Glue ETL stays the choice for the *light* ingest (a). Both are valid Spark; we split by weight.
+**Two output corpora** (the dataset contract below): the SFT corpus (clean winning trajectories — report §5 "SFT-first competence floor") and the DPO pairs (from `extract_dpo_pairs` + future execution-oracle near-miss rejects).
+---
+## The S3 dataset contract
+One bucket per environment (reuse the existing `amazon-sagemaker-386931836011-us-west-2-*` or a dedicated `composer-datagen-386931836011-us-west-2`). Layout is **Hive-partitioned by `run_id`** so each outer-loop generation is an immutable, addressable slice (the report's "generation" in the GA framing, §3/§5) and Athena/Glue Catalog can query across generations.
+```
+s3://composer-datagen-386931836011-us-west-2/
+  raw/claude_code/**/*.jsonl                         # (a) input: uploaded sessions
+  traces/v1/run_id=<id>/part-*.parquet               # (a) out: TraceState rows
+  tasks/v1/run_id=<id>/manifest.jsonl                # (c1) FeatureDeletionTask rows (1/line; array index = line)
+  replay/v1/run_id=<id>/
+      input/states.jsonl                             # (b) shared Bedrock batch input (recordId=state_id)
+      teacher=<slug>/*.jsonl.out                      # (b) per-teacher Bedrock batch output
+  task_grades/v1/run_id=<id>/<task_id>.json          # (c2) validator + _grade() results
+  corpus/v1/run_id=<id>/
+      sft/part-*.parquet                              # (d) SFT corpus (clean winners)
+      dpo/part-*.parquet                              # (d) DPO pairs (normalized DPOPair)
+  manifests/run_id=<id>.json                          # run-level manifest: counts, cost, lineage, schema_version
+  diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt     # (separate) inner-loop ObjectStoreAllReduce (already exists)
+```
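One way the layout above could be captured as code is a set of key builders, so no stage hand-formats a prefix. This is a sketch of what `s3_contract.py` might contain; every function name here is hypothetical and only the key shapes come from the layout.

```python
BUCKET = "composer-datagen-386931836011-us-west-2"  # dedicated-bucket option from the text


def traces_prefix(run_id):
    return f"traces/v1/run_id={run_id}/"


def task_manifest_key(run_id):
    return f"tasks/v1/run_id={run_id}/manifest.jsonl"


def replay_input_key(run_id):
    return f"replay/v1/run_id={run_id}/input/states.jsonl"


def replay_output_prefix(run_id, teacher_slug):
    return f"replay/v1/run_id={run_id}/teacher={teacher_slug}/"


def task_grade_key(run_id, task_id):
    return f"task_grades/v1/run_id={run_id}/{task_id}.json"


def corpus_prefix(run_id, kind):
    # Only the two corpora named in the contract are valid.
    if kind not in ("sft", "dpo"):
        raise ValueError(f"unknown corpus kind: {kind!r}")
    return f"corpus/v1/run_id={run_id}/{kind}/"


def run_manifest_key(run_id):
    return f"manifests/run_id={run_id}.json"


def s3_uri(key, bucket=BUCKET):
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    print(s3_uri(task_grade_key("2026-06-09a", "t-000")))
```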
+**Format decisions:**
+- **Parquet** for `traces/`, `corpus/sft/`, `corpus/dpo/` — columnar, compressed, Athena-queryable, the SFT/DPO training read path (TRL `load_dataset` reads Parquet directly). This is the durable corpus.
+- **JSONL** for `replay/input/` and `tasks/.../manifest.jsonl`, plus one single-row JSON object per task under `task_grades/`: JSONL is what Bedrock batch *requires* (`s3InputFormat=JSONL`), what AWS Batch array-index line-lookup wants, and what `DPOPair`/`TraceState` natively serialize to. JSON/JSONL is the wire format between stages; Parquet is the corpus-at-rest format.
+- **`recordId == state_id`** is the universal join key linking trace → replay → DPO pair. Already true: `TraceState.state_id` is `f"{path.stem}::{idx:04d}"` and `DPOPair.state_id` carries it through.
+- **`manifests/run_id=<id>.json`** is the run-level contract: `{run_id, schema_version, n_traces, n_tasks, n_dpo_pairs, teacher_pool, bedrock_job_arns, total_cost_usd, parent_run_id}`. `parent_run_id` threads the flywheel lineage (report §5: improved student regenerates next round's traces). The held-out-eval guard (`safety/HeldoutSplit`) reads `schema_version` + a `split` column to enforce the disjoint held-out set the report flags as the load-bearing gap (§7 Pushback 4).
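The run manifest and the `state_id` join key listed above might look like this in code. It is a sketch under assumptions: the field names come from the contract, but the field types, the `make_state_id` and `join_on_state_id` helpers, and the dict-shaped rows are illustrative and not taken from the repo.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunManifest:
    run_id: str
    schema_version: str                 # type is an assumption; could be an int
    n_traces: int
    n_tasks: int
    n_dpo_pairs: int
    teacher_pool: list = field(default_factory=list)
    bedrock_job_arns: list = field(default_factory=list)
    total_cost_usd: float = 0.0
    parent_run_id: Optional[str] = None  # None marks the first generation

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)


def make_state_id(path_stem, idx):
    """Mirror the f"{path.stem}::{idx:04d}" format quoted in the contract."""
    return f"{path_stem}::{idx:04d}"


def join_on_state_id(trace_rows, dpo_rows):
    """Pair each DPO row with its originating trace row via state_id."""
    by_id = {row["state_id"]: row for row in trace_rows}
    return [(by_id[p["state_id"]], p) for p in dpo_rows if p["state_id"] in by_id]


if __name__ == "__main__":
    sid = make_state_id("session_a", 7)
    traces = [{"state_id": sid, "prompt": "..."}]
    pairs = [{"state_id": sid, "chosen": "...", "rejected": "..."}]
    print(len(join_on_state_id(traces, pairs)), "joined row(s)")
    print(RunManifest("r2", "1", 1, 0, 1, parent_run_id="r1").to_json())
```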
+---
+## Orchestration: Step Functions (the non-cluster Argo analog)
+Report §8 uses Argo Workflows on EKS as the outer-loop controller. For the **AWS-native, no-persistent-cluster** path this facet targets, the equivalent is **AWS Step Functions** (Standard workflow) driving the DAG:
+```
+Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless)
+  → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless)
+  → WriteManifest(Lambda)
+```
+Step Functions has native `.sync` integrations for **Glue**, **EMR Serverless**, **Batch** (`arn:aws:states:::batch:submitJob.sync` — blocks until the whole array completes), and **SageMaker**, plus a `Map` state for the N-teacher Bedrock fan-out. This makes the whole outer loop one declarative state machine, retryable, with per-stage IAM. If/when the program moves to the EKS-primary path of §8, the same stage boundaries lift to Argo DAG nodes — the S3 contract is identical, so it's a controller swap, not a rewrite.
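As an illustration of the Batch `.sync` integration named above, the `SandboxValidate` state could be generated from Python and dumped to the state-machine JSON. This is a sketch: the resource ARN is the one quoted in the text, while the job name, retry settings, and function name are assumptions chosen for the example.

```python
import json


def sandbox_validate_state(job_queue, job_definition, n_tasks, next_state):
    """Build an Amazon States Language Task state that runs a Batch array job."""
    # AWS Batch array jobs accept sizes from 2 to 10,000; larger runs must be sharded.
    if not 2 <= n_tasks <= 10_000:
        raise ValueError(f"array size {n_tasks} is outside the 2..10000 Batch range")
    return {
        "Type": "Task",
        # .sync makes the state wait until the whole array job finishes.
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": "sandbox-validate",          # hypothetical name
            "JobQueue": job_queue,
            "JobDefinition": job_definition,
            "ArrayProperties": {"Size": n_tasks},
        },
        "Retry": [
            {
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 60,               # assumed retry policy
                "MaxAttempts": 2,
                "BackoffRate": 2.0,
            }
        ],
        "Next": next_state,
    }


if __name__ == "__main__":
    state = sandbox_validate_state("spot-queue", "docker-sandbox:1", 500, "Normalize")
    print(json.dumps({"SandboxValidate": state}, indent=2))
```

Generating the state from code keeps the array size tied to the manifest line count instead of a hand-edited constant.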
+---
+## Repo delta (what to build in `composer_replication/`)
+1. **`composer_replication/teacher_replay_bedrock.py`** (~180 LOC) — `BedrockBatchTeacherPool`: `submit_replay_batch()` writes the shared states JSONL, submits one `create_model_invocation_job` per teacher, polls, and parses `.jsonl.out` back into `list[TeacherCallResult]`. Add `provider`/`model_id` to `TeacherSpec`. `extract_dpo_pairs` untouched.
+2. **`composer_replication/datagen/aws/batch_validate.py`** (~120 LOC) — the Batch array-child entrypoint: read `AWS_BATCH_JOB_ARRAY_INDEX` → task manifest line → boot `DockerSandbox`/`LocalSubprocessSandbox` → run `validator` + `_grade()` → write `task_grades/.../{task_id}.json`. Plus a `submit_validate_array()` helper.
+3. **`composer_replication/datagen/aws/glue_ingest_job.py`** (~80 LOC) — Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; writes `traces/` Parquet + Glue Catalog table.
+4. **`composer_replication/replaysim/emr_normalize_job.py`** (~100 LOC) — EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; writes `corpus/dpo/` + `corpus/sft/` Parquet.
+5. **`composer_replication/datagen/aws/s3_contract.py`** (~120 LOC) — the S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers for `TraceState`/`FeatureDeletionTask`/`DPOPair`, the `recordId==state_id` join helpers, and `schema_version`/`split` column injection for the held-out guard.
+6. **`infra/datagen_stepfunctions.json`** (+ a thin CDK/`infra/datagen_stack.py`) — the Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless app, Glue role). ~250 LOC IaC.
+7. **`pyproject.toml`** — extend the `[aws]`/`[datagen]` extras with `boto3` for Bedrock/Batch/Glue/EMR clients (the `[serverless]` extra already needs `s3fs`/`boto3` per report §9).
+8. **Dockerfile** — the `DockerSandbox` ECR image (also the Batch job-definition image). gVisor `runsc` is not part of the image: it is installed on the AMI/launch template so that untrusted code is isolated by default.
+The trainer, loss, `FeatureDeletionEnv`, curriculum, monitor, `extract_dpo_pairs`, `DJNormalizer` op-graph, and the DiLoCo/`ObjectStoreAllReduce` inner loop are all **untouched** — every AWS-native piece wraps an existing repo entrypoint and reads/writes the S3 contract.
+---
+## Open questions
+- **Bedrock batch 24h SLA vs flywheel cadence.** Report §5 says outer-loop cadence is hours-to-days, so 24h batch turnaround is acceptable for bulk generations — but a fast bootstrap iteration may want the OpenRouter low-latency path. Need a `batch_vs_realtime` policy knob keyed on `max_total_usd` + urgency.
+- **Per-branch sandbox cold-start is the §8/§10 explicit falsifier.** If Batch array job startup + image pull dominates wall-clock at target fan-out even with gVisor, the report's demotion path applies: keep Batch for control/grading but move bulk sandbox execution to a container-free pool (SWE-MiniSandbox class) or a warm Spot fleet. Must instrument cold-start in `batch_validate.py`.
+- **Cross-region inference profiles for Bedrock batch** (`us.anthropic...`) route to multiple regions for throughput. Confirm that the batch service role and the S3 bucket policy allow the destination regions; otherwise jobs risk throttling.
+- **DeepSeek on Bedrock batch in us-west-2 — RESOLVED LIVE:** `deepseek.v3.2` is batch-eligible single-region in `us-west-2`; `deepseek.r1-v1:0` is in the on-demand catalog. Claude batch IDs are the `us.` cross-region inference profiles (`us.anthropic.claude-sonnet-4-6`, `us.anthropic.claude-opus-4-6`). Still confirm **minimum-records floors per model** + the **records/job & file-size quotas** in Service Quotas at build time (adjustable, but the `BedrockBatchTeacherPool` must shard input JSONL to fit).
+- **Opus 4.7/4.8 are on-demand-only (no batch ID).** If the heterogeneity ablation (report §3) wants the newest Claude as a teacher, it costs on-demand pricing — budget for it or substitute the batchable Opus 4.6. GPT-5 (in `DEFAULT_TEACHERS`) is not on Bedrock at all → that one family member stays on the OpenRouter escape hatch or is replaced by Bedrock Llama 4 / DeepSeek as the third family.
+- **`golden_diff` must be ACL-isolated.** It is `repr=False` in `FeatureDeletionTask` and held out of the policy observation; on S3 it must live in a *separate* deny-by-default prefix (`tasks/golden/`), never co-located with the policy-visible `tasks/v1/...` — oracle-cleanliness Gate 1 (report §4).

research/design-F3-rl-sagemaker.md ADDED

	@@ -0,0 +1,299 @@

+# F3 — The RL System on AWS, Runnable NOW (SageMaker us-west-2)
+**Status:** design, runnable-today. **Account:** 386931836011, **region:** us-west-2, **role:** Admin (Isengard).
+**Live facts verified 2026-06-09** (see "Live AWS findings" below). Grounded in the deep-research report
+(`research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §9 SageMaker path / hybrid, §10 phased plan)
+and the actual repo code (`SageMakerExecutor`, `ComposerReplicationTrainer`, `ObjectStoreAllReduce`,
+`replica_entrypoint`, `examples/gsm8k_grpo`).
+---
+## 0. TL;DR (committed)
+1. **For the first runnable GPU smoke RIGHT NOW: a single SageMaker Training Job, `ml.g5.2xlarge`, BYO-container
+   extended from the AWS PyTorch DLC.** This is NOT the `SageMakerExecutor` N-replica path — it is plain GRPO on one
+   GPU. The executor's multi-replica DiLoCo rendezvous is the *next* step, not the smoke. Reason: the smoke's job is
+   to prove the trainer + reward + vLLM rollout works on a real GPU at minimum cost and zero quota friction.
+2. **GRPO rollout = vLLM colocated in the training container** (`use_vllm=True, vllm_mode="colocate"`). TRL 1.5's
+   default is colocate; it runs vLLM in the *same process* sharing the training GPU at `vllm_gpu_memory_utilization=0.3`.
+   No separate inference endpoint for the smoke. The `server` mode (`trl vllm-serve`) and VeRL's `AsyncServer` are the
+   scale answer for tool-heavy agentic rollouts later (report §8) — not for a 0.5B GSM8K smoke.
+3. **Platform decision:** Training Jobs for the bursty smoke and periodic small-model runs (this facet); **HyperPod
+   (attached to EKS)** for the long, resilience-bound inner GRPO loop (report §9). Both share the identical S3
+   `ObjectStoreAllReduce` rendezvous, so a run moves between them with zero trainer/loss/DiLoCo change.
+4. **The `SageMakerExecutor` (already built, mock-tested) drives N independent single-instance Training Jobs**, each
+   tagged `REPLICA_RANK=i`/`WORLD_SIZE=N` via the `Environment` map, all pointed at one `s3://.../rendezvous/` prefix.
+   It is the bursty-fallback DiLoCo backend. To make it run live we need a built+pushed container, real
+   `role_arn`/`image_uri`/`output_s3_path`, and a non-zero quota for N concurrent training jobs.
+---
+## 1. Live AWS findings (verified 2026-06-09, this account/region)
+| Fact | Value | Consequence |
+|---|---|---|
+| Caller identity | `arn:aws:sts::386931836011:assumed-role/Admin/baladita-Isengard` | Admin — can create roles, push ECR, run training jobs. |
+| SageMaker default bucket (us-west-2) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` | Use as the rendezvous + output bucket — already covered by `AmazonSageMakerFullAccess`. |
+| Existing exec roles | `AmazonSageMaker-ExecutionRole-20250725T133247` (and ...20241223T...) | `role_arn = arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247`. |
+| Exec role policies | `AmazonSageMakerFullAccess` + custom `AmazonSageMaker-ExecutionPolicy-...` | FullAccess grants S3 to buckets named `*SageMaker*`/`*sagemaker*` — so the rendezvous bucket MUST be the SageMaker bucket above (or a `sagemaker-*` bucket) or you must attach an explicit S3 policy. **This is the IAM gotcha the executor docstring flags.** |
+| `ml.g5.2xlarge for training job usage` | **1.0** (non-zero!) | Single-replica g5 smoke runs IMMEDIATELY, no quota request. |
+| `ml.g5.2xlarge for spot training job usage` | **1.0** | Spot smoke also available (roughly 60-70% cheaper; see the §3.4 cost note). |
+| `ml.g5.12xlarge for training job usage` | **1.0** | One 4×A10G box available for a 7B run later. |
+| `ml.g6.2xlarge for training job usage` | **0.0** | g6 (L4) needs a Service Quotas increase first — prefer g5 for the smoke. |
+| g5.2xlarge EC2 offering | us-west-2a/b/c | Capacity exists across AZs. |
+| Already present | `sagemaker-dynamo-on-eks-hyperpod-*` bucket | Confirms HyperPod-on-EKS has been used here — the report's §9 hybrid is live-reachable. |
+| boto3 / sagemaker SDK locally | NOT installed | `pip install -e .[aws]` + `pip install sagemaker` on the launch host (laptop/Studio), not in the repo's hard deps. |
+**The single most important runnable-now fact:** g5.2xlarge training-job quota is already 1 — the smoke needs no
+quota ticket. (Default for these GPU types is 0; this account has been bumped to 1.)
+---
+## 2. Training Jobs vs HyperPod vs EKS — when each (report §9, §10)
+- **SageMaker Training Jobs (THIS facet, bursty inner loop / smoke).** Ephemeral, pay-per-second, `boto3.create_training_job`,
+  zero persistent cluster. Right for: the first GPU smoke, periodic/smaller-model runs, the `SageMakerExecutor`
+  DiLoCo fallback. re:Post guidance: Training Jobs fit *periodic / smaller-model / pay-per-use*. The 28-day max runtime
+  and per-job cold-start (instance provisioning ~3-6 min) are acceptable for bursty work. Warm pools
+  (`KeepAlivePeriodInSeconds`) cut cold-start on repeated launches — but note this account's `g5 training warm pool
+  usage` quota is 0, so warm pools need a quota bump.
+- **SageMaker HyperPod attached to EKS (long resilience-bound inner loop).** Report §9: HyperPod maps 1-to-1 to an EKS
+  control plane (one EKS cluster = one HyperPod node-group in a VPC), with auto-detect-and-replace of faulty
+  accelerators and PyTorch job auto-resume. Right for: continuous/large-model/persistent multi-day RL where a node
+  failure on a Training Job would lose the run. The `sagemaker-dynamo-on-eks-hyperpod-*` bucket shows this is already
+  exercised here. **"Use HyperPod for the inner loop" does NOT mean leaving EKS** — it is a node-group swap on the same
+  control plane. Build target: the future `EKSExecutor` targets both Karpenter GPU nodes and HyperPod nodes transparently.
+- **Plain EKS (primary for everything else — report §8).** Outer MCTS/sandbox/dataset loop, vLLM RayService rollout
+  groups, gVisor/Kata sandbox pods, Argo controller. The inner GRPO trainer is the one piece that swaps between a
+  Karpenter p5/g6e NodePool and a HyperPod node-group.
+**Decision for F3:** SageMaker Training Jobs now (smoke + `SageMakerExecutor` DiLoCo fallback); HyperPod-on-EKS later
+for the long inner run. Same S3 rendezvous throughout.
+---
+## 3. The first runnable smoke: Qwen2.5-0.5B GRPO on GSM8K, single g5.2xlarge Training Job
+### 3.1 Shape
+One Training Job, `InstanceCount=1`, `ml.g5.2xlarge` (1× A10G, 24 GB). GRPO with vLLM **colocated** in the
+training container. This is the `examples/gsm8k_grpo/run.py` recipe lifted from CPU to one real GPU, with vLLM turned on.
+It does **not** exercise the DiLoCo rendezvous (that's §4). It proves: container builds, trainer runs on GPU, vLLM
+rollout works, reward fires, checkpoint lands in S3.
+### 3.2 Container — BYO extended from the AWS PyTorch DLC (do NOT use the stock HF DLC)
+- **Base:** `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker`
+  (verified present in the us-west-2 DLC registry; 763104351884 is the AWS DLC account). The DLC already has the SageMaker
+  training toolkit, CUDA, and a working torch — so vLLM's CUDA wheels match.
+- **Why not the stock HF DLC (`huggingface-pytorch-training:4.49.0`)?** It pins transformers 4.49 and does NOT bundle
+  `trl` or `vllm`; you'd be pip-installing the whole RL stack anyway. Extending the PyTorch DLC gives a clean,
+  version-controlled layer.
+- **Why a prebuilt ECR image and not `source_dir`+`requirements.txt`?** Installing `vllm` + `trl` + `flash-attn` at job
+  start over `requirements.txt` adds 5-10 min of cold-start per job and is a flaky failure surface (wheel/CUDA mismatch).
+  Bake them into the image once, push to the account's private ECR. `source_dir` is fine for *just the training script*
+  layered on top, but the heavy deps must be baked.
+`docker/Dockerfile.sagemaker`:
+```dockerfile
+FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker
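+# RL stack (baked, not pip-at-startup).
+# NOTE: the version floors below are loose; GRPO with vllm_mode="colocate" needs a
+# recent TRL, so pin exact trl/vllm versions before promoting past the smoke.
+# hf_transfer is needed because the launch sets HF_HUB_ENABLE_HF_TRANSFER=1.
+RUN pip install --no-cache-dir \
+      "trl>=0.12" "peft>=0.13" "accelerate>=1.0" "datasets>=3.0" \
+      "vllm" "fsspec>=2024.6" "s3fs>=2024.6" "hf_transfer"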
+# The framework itself
+COPY . /opt/composer_replication
+RUN pip install --no-cache-dir -e "/opt/composer_replication[train,serverless]"
+# SageMaker invokes the image; for the smoke we use a plain GRPO entry script,
+# for the DiLoCo path the executor passes ContainerEntrypoint explicitly.
+ENV HF_HOME=/opt/ml/input/hf_cache
+```
+Build + push (Admin, one-time):
+```bash
+aws ecr create-repository --repository-name composer-rl --region us-west-2
+aws ecr get-login-password --region us-west-2 | docker login --username AWS \
+  --password-stdin 386931836011.dkr.ecr.us-west-2.amazonaws.com
+docker build -f docker/Dockerfile.sagemaker -t 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke .
+docker push 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke
+```
+### 3.3 The smoke training script — `examples/gsm8k_grpo/run_sagemaker.py`
+A thin GPU variant of `examples/gsm8k_grpo/run.py`. Same `gsm8k_reward` (RLVR `#### NUMBER` regex), same
+`ComposerReplicationTrainer(alpha_sdpo=0, beta_replay=0)` (plain GRPO — channels 2/3 off). Differences from the CPU example:
+```python
+from trl import GRPOConfig
+config = GRPOConfig(
+    output_dir="/opt/ml/model",          # SageMaker uploads this to OutputDataConfig.S3OutputPath
+    per_device_train_batch_size=8,
+    num_generations=8,
+    max_prompt_length=512,
+    max_completion_length=256,
+    learning_rate=1e-5,
+    max_steps=20,                          # smoke — minutes, not hours
+    logging_steps=1,
+    save_strategy="no",
+    bf16=True,                             # A10G supports bf16
+    # --- the rollout path: vLLM colocated in-process on the same GPU ---
+    use_vllm=True,
+    vllm_mode="colocate",                  # TRL 1.5 default; same process, no server
+    vllm_gpu_memory_utilization=0.3,       # vLLM (weights + KV cache) gets ~30%; ~70% stays for policy weights, grads, optimizer
+    vllm_tensor_parallel_size=1,
+    beta=0.04,                             # small KL-to-ref; or 0.0 for pure smoke
+    report_to=[],
+)
+```
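For readers who have not seen the reward, an RLVR-style `#### NUMBER` check can be sketched as below. This is not the repo's `gsm8k_reward`: the regex, the helper names, and the assumption that completions arrive as plain strings with the gold solution in an `answer` dataset column are all illustrative.

```python
import re

# Final answers in GSM8K solutions follow a "#### <number>" marker.
_ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def _final_number(text):
    """Return the last '#### <number>' in text as a float, or None if absent."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_reward_sketch(completions, answer, **kwargs):
    """Binary reward: 1.0 when the completion's final number equals the gold one.

    Follows the TRL reward-function convention of receiving `completions`
    plus dataset columns (here `answer`) as keyword arguments.
    """
    rewards = []
    for completion, gold in zip(completions, answer):
        predicted, expected = _final_number(completion), _final_number(gold)
        hit = predicted is not None and expected is not None and predicted == expected
        rewards.append(1.0 if hit else 0.0)
    return rewards


if __name__ == "__main__":
    outs = ["48 / 2 = 24, so 48 + 24 = 72\n#### 72", "I am not sure."]
    gold = ["... #### 72", "... #### 5"]
    print(gsm8k_reward_sketch(outs, gold))
```

A completion without the marker scores 0.0 rather than raising, so malformed rollouts cannot crash a training step.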
+Read hyperparameters from `/opt/ml/input/config/hyperparameters.json` (SageMaker writes the estimator's
+`hyperparameters=` there) so the same script is config-driven.
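A minimal reader for that file might look like the following. It is a sketch, not the planned `run_sagemaker.py`: `load_hyperparameters` and `_coerce` are hypothetical names, and the coercion step assumes SageMaker delivers values as strings, which is why `max_steps` would otherwise arrive as `"20"`.

```python
import json
from pathlib import Path

HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")


def _coerce(value):
    """Best-effort decode of a string value ("20" -> 20, "true" -> True)."""
    if not isinstance(value, str):
        return value
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return value  # plain strings such as model ids stay as they are


def load_hyperparameters(path=HYPERPARAMETERS_PATH, defaults=None):
    """Merge file values over defaults; a missing file yields the defaults."""
    params = dict(defaults or {})
    path = Path(path)
    if path.exists():
        raw = json.loads(path.read_text())
        params.update({key: _coerce(val) for key, val in raw.items()})
    return params


if __name__ == "__main__":
    hp = load_hyperparameters(
        defaults={"model": "Qwen/Qwen2.5-0.5B-Instruct", "max_steps": 20}
    )
    print(hp)
```

Falling back to defaults when the file is absent lets the same script run on a laptop and inside the training container.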
+### 3.4 The launch — SageMaker Python SDK `Estimator` (run from the laptop / Studio)
+```python
+import sagemaker
+from sagemaker.estimator import Estimator
+sess = sagemaker.Session()  # picks up region us-west-2
+ROLE = "arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247"
+IMAGE = "386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke"
+BUCKET = "amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d"
+est = Estimator(
+    image_uri=IMAGE,
+    role=ROLE,
+    instance_type="ml.g5.2xlarge",        # quota = 1 (verified live) — no ticket needed
+    instance_count=1,
+    volume_size=100,
+    max_run=3600,                          # 1h cap for the smoke
+    output_path=f"s3://{BUCKET}/composer-rl/smoke/output",
+    base_job_name="composer-grpo-smoke",
+    environment={"HF_HUB_ENABLE_HF_TRANSFER": "1"},
+    hyperparameters={"model": "Qwen/Qwen2.5-0.5B-Instruct", "max_steps": 20},
+    entry_point="run_sagemaker.py",
+    source_dir="examples/gsm8k_grpo",      # script layered on the baked image
+    keep_alive_period_in_seconds=0,        # warm-pool quota is 0 in this acct; leave off
+    # use_spot_instances=True, max_wait=7200  # optional: spot quota is also 1
+)
+est.fit(wait=True, logs=True)              # GSM8K loads from HF inside the container
+```
+**Cost:** `ml.g5.2xlarge` is ~$1.52/hr on-demand in us-west-2; a 20-step 0.5B smoke is ~15-25 min ⇒ **well under $1**.
+On spot (quota=1) ~$0.45-0.60/hr ⇒ pennies. The CPU example proves the loop in ~60s; this proves it on a real GPU with
+the real vLLM rollout path, which the CPU example explicitly does not exercise.
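The arithmetic behind the cost estimate is simple enough to check directly. The sketch below uses only the hourly rates and run times quoted above; it ignores storage and data transfer, and the function name is illustrative.

```python
def job_cost_usd(hourly_rate_usd, minutes):
    """Instance cost for a job billed at hourly_rate_usd that runs for `minutes`."""
    return hourly_rate_usd * minutes / 60.0


if __name__ == "__main__":
    on_demand = 1.52              # ~$/hr for ml.g5.2xlarge, from the text
    spot_low, spot_high = 0.45, 0.60
    for minutes in (15, 25):
        print(
            f"{minutes} min: on-demand ${job_cost_usd(on_demand, minutes):.2f}, "
            f"spot ${job_cost_usd(spot_low, minutes):.2f}"
            f"-${job_cost_usd(spot_high, minutes):.2f}"
        )
```

Even the 25-minute on-demand case stays near $0.63, which is where the "well under $1" figure comes from.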
+### 3.5 Gotchas baked into the recipe
+- **vLLM needs the HF model downloaded at job start.** `HF_HUB_ENABLE_HF_TRANSFER=1` (set in the estimator) speeds up the download but requires the `hf_transfer` package in the image; the alternative is to stage the model
+  into S3 and pass it as an input channel. For a 0.5B model the live download is fine. `EnableNetworkIsolation` MUST stay
+  False (the executor pins this) so the container can reach `huggingface.co` and S3.
+- **`vllm_gpu_memory_utilization=0.3` is the load-bearing knob on a 24 GB A10G.** Too high ⇒ OOM when the policy +
+  grads + optimizer also need the GPU; too low ⇒ tiny KV cache, slow rollout. 0.3 is the TRL/Ray reference default for
+  a small model on one GPU.
+- **GSM8K = `openai/gsm8k` config `main`.** Already what the example loads. No license blocker (MIT).
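To make the memory knob concrete, here is a rough budget for the 24 GB A10G. It is a back-of-envelope sketch: the 16 bytes per parameter figure is a common rule of thumb for fp32 weights, gradients, and two AdamW moments, the 0.5e9 parameter count is a rounded assumption for the 0.5B model, and activations are ignored entirely.

```python
def split_gpu_memory(total_gb, vllm_fraction):
    """Return (vLLM share, training share) of GPU memory in GB."""
    vllm_gb = total_gb * vllm_fraction
    return vllm_gb, total_gb - vllm_gb


def training_state_gb(n_params, bytes_per_param=16):
    """Rough size of weights + grads + optimizer state; excludes activations."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    vllm_gb, train_gb = split_gpu_memory(24.0, 0.3)
    state_gb = training_state_gb(0.5e9)   # assumed parameter count
    print(f"vLLM share:     {vllm_gb:.1f} GB")
    print(f"training share: {train_gb:.1f} GB")
    print(f"model state:    {state_gb:.1f} GB (headroom {train_gb - state_gb:.1f} GB)")
```

Raising `vllm_gpu_memory_utilization` shrinks the training share one-for-one, which is why the knob trades rollout speed against OOM risk.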
+---
+## 4. The DiLoCo N-replica path: how `SageMakerExecutor` drives the rendezvous
+This is the executor that already exists and is mock-tested (`tests/test_sagemaker_executor.py` — 20+ tests covering
+rank-ordered handles, env injection, status mapping, cancel idempotency, partial-launch rollback). It is the bursty
+DiLoCo backend, distinct from the §3 smoke.
+### 4.1 What it does (verified from source)
+- `launch_replicas(N, ...)` submits **N independent single-instance Training Jobs** (NOT one multi-instance job — that
+  would couple replicas through SageMaker's intra-job NCCL fabric and break the "each replica syncs only through S3"
+  design). Each job gets `Environment={"REPLICA_RANK": str(i), "WORLD_SIZE": str(N), "RENDEZVOUS_URI": s3uri}` and
+  `ContainerEntrypoint=["python","-m","composer_replication.diloco.serverless.replica_entrypoint"]` with
+  `ContainerArguments=["--rendezvous", s3uri, "--world-size", str(N), "--trainer-module", ..., "--trainer-fn", ...]` (every element must be a string).
+- `replica_entrypoint.main` reads `REPLICA_RANK`, builds `ObjectStoreAllReduce(uri=s3://..., rank, world_size)`, wraps
+  it in `MockManager`, and calls the user's `train(manager=, rank=, world_size=, **trainer_kwargs)`. The trainer wires
+  `manager` into `make_diloco_outer_loop`; pseudo-gradients sync via `round_{NNNNNN}/rank_{RRRR}.pt` PUT-then-poll-then-mean
+  on S3. **DiLoCo math, loss, trainer untouched.**
+- `poll`/`collect` map `describe_training_job.TrainingJobStatus`; `stream_logs` reads
+  `/aws/sagemaker/TrainingJobs/<job>/algo-*`; `cancel` calls `stop_training_job` idempotently.
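The PUT-then-poll-then-mean rendezvous can be simulated with a dict standing in for S3. This is a sketch of the pattern only, not the repo's `ObjectStoreAllReduce`: the function names are hypothetical, lists of floats replace tensors, and the six-digit round and four-digit rank padding is an assumed reading of the `round_{NNNNNN}/rank_{RRRR}.pt` key shape.

```python
import time


def rendezvous_key(round_idx, rank):
    return f"round_{round_idx:06d}/rank_{rank:04d}.pt"


def put_then_poll_then_mean(store, round_idx, rank, world_size, values,
                            timeout_s=1800.0, poll_s=0.01):
    """Publish this rank's values, wait for every peer, return the element-wise mean."""
    store[rendezvous_key(round_idx, rank)] = list(values)          # PUT
    keys = [rendezvous_key(round_idx, r) for r in range(world_size)]
    deadline = time.monotonic() + timeout_s
    while not all(key in store for key in keys):                   # poll
        if time.monotonic() > deadline:
            missing = [key for key in keys if key not in store]
            raise TimeoutError(f"round {round_idx}: still waiting for {missing}")
        time.sleep(poll_s)
    columns = zip(*(store[key] for key in keys))
    return [sum(column) / world_size for column in columns]        # mean


if __name__ == "__main__":
    store = {rendezvous_key(0, 1): [3.0, 5.0]}   # rank 1 has already published
    print(put_then_poll_then_mean(store, 0, 0, 2, [1.0, 3.0]))
```

With `world_size=1` the loop exits immediately and the mean is the rank's own values, matching the degenerate N=1 case.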
+### 4.2 The asymmetry that makes this clean (report §8)
+Gang scheduling is needed for *intra-replica* FSDP NCCL but NOT for *inter-replica* DiLoCo sync — replicas rendezvous
+through S3, so a straggler simply blocks at the poll loop (bounded by `timeout_s=1800`) instead of deadlocking. On
+SageMaker, N separate jobs have no mutual network path (`supports_inter_replica_network=False`), which is exactly right.
+### 4.3 What to wire to run it live (the deltas)
+1. **Same baked image** from §3.2 (its Dockerfile already runs `pip install -e .[train,serverless]`, so `replica_entrypoint`, `s3fs`, `fsspec`
+   are present). The executor passes `ContainerEntrypoint` explicitly, so a generic image works.
+2. **Rendezvous bucket = the SageMaker default bucket** (`amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d`) so the
+   exec role's `AmazonSageMakerFullAccess` already grants the live S3 PUT/GET the allreduce poll loop needs. Use
+   `rendezvous_uri = "s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/composer-rl/runs/<run_id>/rendezvous/"`.
+3. **Quota:** N concurrent jobs need `ml.g5.2xlarge for training job usage >= N`. Currently 1 ⇒ N=1 works today; for
+   N=2-4 DiLoCo, request a Service Quotas increase (Service Quotas console → SageMaker → "ml.g5.2xlarge for training job
+   usage"). The smoke proves the executor end-to-end at N=1 (one job, one rank — degenerate allreduce returns its own
+   tensor), then N=2 once quota lands.
+4. **Driver script** `examples/diloco_sagemaker/run.py` (~80 LOC): construct `SageMakerExecutor(role_arn=..., image_uri=...,
+   output_s3_path=..., region="us-west-2")`, call `launch_replicas(N, entrypoint="...replica_entrypoint",
+   entrypoint_args={"rendezvous_uri": s3uri, "trainer_module": "examples.gsm8k_grpo.diloco_train", "trainer_fn": "train",
+   "trainer_kwargs": {...}}, gpu="A10G", timeout=3600)`, then `collect(handles)`. `gpu="A10G"` maps to `ml.g5.2xlarge`
+   via the executor's `_GPU_INSTANCE_MAP`.
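For orientation, the per-replica request that such a driver ultimately needs can be sketched as plain dicts in the shape of the `create_training_job` API. This is not `SageMakerExecutor` code: the job-name scheme, the quota guard, and both function names are assumptions, and nothing here calls AWS.

```python
ENTRYPOINT = ["python", "-m", "composer_replication.diloco.serverless.replica_entrypoint"]


def replica_request(rank, world_size, run_id, role_arn, image_uri,
                    output_s3_path, rendezvous_uri,
                    instance_type="ml.g5.2xlarge", max_runtime_s=3600):
    """Build one single-instance Training Job request for replica `rank`."""
    return {
        "TrainingJobName": f"{run_id}-rank-{rank:04d}",   # hypothetical naming scheme
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
            "ContainerEntrypoint": ENTRYPOINT,
            # Container arguments must all be strings.
            "ContainerArguments": ["--rendezvous", rendezvous_uri,
                                   "--world-size", str(world_size)],
        },
        "OutputDataConfig": {"S3OutputPath": output_s3_path},
        "ResourceConfig": {"InstanceType": instance_type,
                           "InstanceCount": 1, "VolumeSizeInGB": 100},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        "Environment": {"REPLICA_RANK": str(rank),
                        "WORLD_SIZE": str(world_size),
                        "RENDEZVOUS_URI": rendezvous_uri},
        "EnableNetworkIsolation": False,   # replicas must reach S3
    }


def build_requests(n_replicas, quota, **common):
    """Refuse to build more concurrent jobs than the instance quota allows."""
    if n_replicas > quota:
        raise ValueError(f"need quota >= {n_replicas}, have {quota}")
    return [replica_request(rank, n_replicas, **common) for rank in range(n_replicas)]


if __name__ == "__main__":
    reqs = build_requests(
        1, quota=1, run_id="demo", role_arn="arn:aws:iam::123456789012:role/example",
        image_uri="example-image", output_s3_path="s3://example-bucket/out",
        rendezvous_uri="s3://example-bucket/rendezvous/",
    )
    print(reqs[0]["TrainingJobName"], reqs[0]["Environment"])
```

Checking the quota before building requests surfaces the N>1 limit locally instead of as a partial launch that has to be rolled back.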
+---
+## 5. The GRPO rollout problem — colocated vLLM now, server/AsyncServer later
+TRL's `GRPOTrainer` needs a generation path each step. Three options, committed mapping:
+| Option | When | On SageMaker |
+|---|---|---|
+| `model.generate()` (no vLLM) | never for real runs — too slow | the CPU example uses this implicitly; fine only for the 0.5B CPU toy. |
+| **vLLM colocate** (`use_vllm=True, vllm_mode="colocate"`) | **the smoke + most single-GPU runs** | vLLM in the same process, shares the training GPU at `vllm_gpu_memory_utilization=0.3`. One container, one job, no endpoint. TRL 1.5 default. **This is the F3 answer.** |
+| vLLM server (`trl vllm-serve`) | multi-GPU where you dedicate GPUs to generation | a *second* SageMaker job or a SageMaker endpoint runs `trl vllm-serve`; the trainer job points `vllm_server_host/port` at it. Introduces inter-process comm + a network hop — only worth it when generation dominates and you have spare GPUs. |
+| VeRL `AsyncServer` | tool-heavy agentic tree-of-work rollouts (report §8) | the scale answer for the SWE-agent tree: async GPU-decoupled agent loop TRL lacks. A later facet; the engine should be a configurable backend, not hardcoded. |
+For the F3 smoke and the DiLoCo fallback, **colocate is correct and simplest**: it keeps everything in one
+self-contained training container, which is exactly what a single-instance SageMaker job wants. No separate inference
+endpoint to provision, secure, or pay for.
+**One subtlety the report flags (§8):** the SDPO channel (Channel 2) needs full-vocabulary *logits*, which must come from the
+TRL-hosted policy (the trainer's `_compute_sdpo_loss` reads `model(...).logits`), while Channel 3 needs only log-probs. Colocated vLLM
+handles *rollout generation* only; the SDPO/replay logits and log-probs come from the policy forward pass in `_compute_loss`,
+not from vLLM. So turning on `alpha_sdpo>0` later does not change the rollout backend choice.
+---
+## 6. Concrete repo deltas (to make this runnable, not hand-wavy)
+| Path (~LOC) | What | Why |
+|---|---|---|
+| `docker/Dockerfile.sagemaker` (~15) | Extend PyTorch DLC 2.6.0-gpu-py312; bake trl+vllm+peft+accelerate+datasets+s3fs+fsspec + `pip install -e .[train,serverless]`. | The report (§9) names "a Dockerfile wrapping composer_replication" as a missing build artifact. This is it. |
+| `examples/gsm8k_grpo/run_sagemaker.py` (~120) | GPU+vLLM variant of `run.py`; reads `/opt/ml/input/config/hyperparameters.json`; writes to `/opt/ml/model`; `use_vllm=True, vllm_mode="colocate"`. | The runnable smoke entry. |
+| `examples/diloco_sagemaker/run.py` (~80) | Driver that builds `SageMakerExecutor` with the live role/image/bucket and calls `launch_replicas`/`collect` for N replicas over the S3 rendezvous. | Turns the mock-tested executor into a live driver — no executor code change needed. |
+| `examples/gsm8k_grpo/diloco_train.py` (~60) | A `train(manager, rank, world_size, **kw)` that wraps the GRPO trainer in `make_diloco_outer_loop(manager=...)`. | The `trainer_module:trainer_fn` the executor imports inside each replica. |
+| `scripts/build_and_push_ecr.sh` (~20) | ECR create-repo + login + build + push (the §3.2 commands). | One-command image publish. |
+| `docs/AWS_SAGEMAKER_QUICKSTART.md` (~120) | The §1 live facts + §3 estimator recipe + the quota/IAM gotchas + the spot variant. | So the next person runs it in one read. |
+| `pyproject.toml` `aws` extra (+1 line) | add `sagemaker>=2.200` alongside `boto3` (the SDK `Estimator` lives there; executor uses raw boto3 but the launch driver wants the SDK). | The launch host needs the SDK; currently only `boto3` is in the extra. |
+**Nothing in `SageMakerExecutor`, `ComposerReplicationTrainer`, `ObjectStoreAllReduce`, `replica_entrypoint`, or the
+loss changes.** The executor's design (N single-instance jobs, env-injected rank, S3-only rendezvous,
+`EnableNetworkIsolation=False`) is already correct for this environment — verified against the live IAM/quota facts.
+---
+## 7. Open questions / next gates
+- **N>1 DiLoCo quota:** `ml.g5.2xlarge for training job usage` = 1 today. N=2-4 needs a Service Quotas increase
+  (typically minutes-to-hours for g5; not guaranteed instant). Request before the N>1 run.
+- **Warm pools:** `g5 training warm pool usage` quota = 0 ⇒ each job pays ~3-6 min cold-start. For the bursty DiLoCo
+  fallback at small H (frequent re-launch) this matters; request warm-pool quota or accept the cold-start, or move the
+  long inner loop to HyperPod (which is persistent — no per-round cold-start).
+- **vLLM version pin:** the smoke leaves `vllm` unpinned in the Dockerfile; pin to a version whose CUDA matches the DLC
+  (cu124 / torch 2.6) before promoting past smoke, to avoid a silent wheel mismatch.
+- **HyperPod-on-EKS path:** the `sagemaker-dynamo-on-eks-hyperpod-*` bucket shows it's been used here; the future
+  `EKSExecutor` + HyperPod node-group attach is the report's §9 recommendation for the long inner run. Out of scope for
+  F3 (Training-Jobs facet) but the rendezvous makes the swap free.
+- **Spot interruption + DiLoCo:** spot g5 quota = 1; with `use_spot_instances=True` a replica can be reclaimed mid-round.
+  The bounded `timeout_s=1800` poll means a reclaimed replica stalls its peers up to 30 min then `TimeoutError`s. For
+  spot DiLoCo, add `save_freq` checkpointing + relaunch-on-interruption in the driver (report §10 failure modes).
+## 8. References
+- Repo: `composer_replication/diloco/serverless/sagemaker.py` (SageMakerExecutor), `replica_entrypoint.py`,
+  `allreduce.py` (ObjectStoreAllReduce + MockManager), `trainer/composer_trainer.py` (ComposerReplicationTrainer +
+  `make_po_config`), `examples/gsm8k_grpo/run.py`, `tests/test_sagemaker_executor.py`.
+- Report §8 (EKS primary), §9 (SageMaker path + HyperPod hybrid), §10 (cost / phased plan).
+- AWS DLC registry us-west-2: `pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker` @ 763104351884
+  (docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-us-west-2.html).
+- TRL vLLM colocate: GRPOConfig `use_vllm`/`vllm_mode`/`vllm_gpu_memory_utilization` (huggingface/trl
+  grpo_config.py; huggingface.co/blog/vllm-colocate; Ray TRL-GRPO example).
+- SageMaker quotas: g5/g6/p4d training-job usage default 0 (StackOverflow 71655321; re:Post) — verified live this
+  account: g5.2xlarge=1, g6.2xlarge=0.
+- HyperPod-EKS 1:1 mapping: docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html.
+- Live `aws sts get-caller-identity` / `aws service-quotas list-service-quotas` / `aws iam` (2026-06-09).

research/design-F4-decoupled-diloco-s3.md ADDED

	@@ -0,0 +1,143 @@

+# F4 — Decoupled DiLoCo on Serverless + S3 (AWS-native, concrete)
+**Facet:** the headline distributed-training question — how N independent SageMaker Training Jobs (or EKS Indexed-Job pods) each run inner DiLoCo (AdamW H steps) then sync pseudo-gradients ONCE per ~500 steps through S3 via `ObjectStoreAllReduce`, with no cross-job NCCL.
+**Environment (LIVE):** account `386931836011`, region `us-west-2`, `Admin`/Isengard. Verified live below:
+- Rendezvous-ready bucket exists: `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` (matches the `sagemaker`-name pattern that `AmazonSageMakerFullAccess` grants S3 on — load-bearing, see IAM).
+- Two ready execution roles: `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` (use this) and `...-20241223T082691`.
+- CPU training quota = **30** instances each for `ml.m5.xlarge` / `ml.m5.large` / `ml.c5.xlarge` in us-west-2 → the 2-replica smoke is trivially in budget.
+- Warm-pool quota = **0** for all (default). `KeepAlivePeriodInSeconds` set without a quota increase silently no-ops. This is THE cold-start gotcha.
+---
+## 1. The decision that already shipped (ADR-005) and why it's right for AWS
+ADR-005 chose **object-store rendezvous, not cross-job NCCL**, as the DiLoCo comm primitive. DiLoCo (Douillard et al. 2023, arXiv:2311.08105 §3.2) syncs once per H≈500–1000 inner steps, i.e. every ~10–30 min of wall-clock. The exchange per replica per outer round is one `PutObject` of the pseudo-gradient (~2 GB for a 1B bf16 model) + (N−1) `GetObject`s. At N=8 that's 8 × 7 × 2 GB ≈ 112 GB read (128 GB if each rank also re-reads its own file) spread over 30 min ≈ 62–71 MB/s aggregate, ~$0.05/round on S3. The repo realizes exactly this in `ObjectStoreAllReduce.allreduce()` (`composer_replication/diloco/serverless/allreduce.py:131`): PUT `round_{NNNNNN}/rank_{RRRR}.pt`, poll-until-all-peers-exist, mean, `tensor.copy_(avg)`.
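The per-round traffic estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: `round_traffic` and its `include_own` switch are illustrative names that do not exist in the repo, and 2 bytes per parameter assumes a bf16 pseudo-gradient.

```python
def round_traffic(n_replicas, params, bytes_per_param=2,
                  round_seconds=1800.0, include_own=False):
    """Return (total_read_gb, aggregate_mb_per_s) for one outer round."""
    # Size of one pseudo-gradient object, in decimal GB.
    object_gb = params * bytes_per_param / 1e9
    # Each rank reads its peers' files; optionally it also re-reads its own.
    reads_per_rank = n_replicas if include_own else n_replicas - 1
    total_read_gb = n_replicas * reads_per_rank * object_gb
    # Spread the reads evenly over the round's wall-clock time.
    mb_per_s = total_read_gb * 1000 / round_seconds
    return total_read_gb, mb_per_s


if __name__ == "__main__":
    print(round_traffic(8, 1e9))                    # N-1 reads per rank
    print(round_traffic(8, 1e9, include_own=True))  # every rank reads all N files
```

Either way the aggregate rate stays in the tens of MB/s, which is why the per-round S3 cost is negligible next to GPU time.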
+**Why this is correct on AWS specifically (not just plausible):** S3 has been **strongly read-after-write consistent for PUT/GET/LIST in all regions since December 2020, at no extra cost** (`aws.amazon.com/s3/consistency`; `docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html`). The poll loop's correctness depends on exactly this: rank R PUTs `rank_0003.pt`, and the instant any peer's `_exists()` returns True the subsequent `_get()` is guaranteed to return that byte-complete object. There is no "object not yet visible" race and no eventual-consistency retry shim. The repo's `_put` even writes a `.tmp` then `os.replace` on the local path (atomic on POSIX); on S3 a single `PutObject` is atomic per key by definition, so a peer never sees a half-written `.pt`. **The substrate is consistency-correct by construction on S3.**
+The one-line architectural payoff on K8s/serverless (report §8): a straggler replica simply blocks peers at the poll loop (bounded by `timeout_s=1800`) instead of deadlocking an NCCL gang. Inter-replica DiLoCo sync needs **no gang scheduling** — only intra-replica FSDP does.
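The PUT, poll, mean sequence described in this section can be illustrated with a stand-in that uses a local directory instead of S3 and JSON lists instead of `.pt` tensors. It is a minimal sketch under those assumptions, not the repo's `ObjectStoreAllReduce`; `put_atomic` and `allreduce_mean` are hypothetical names, and only the `round_{NNNNNN}/rank_{RRRR}` layout and the timeout behaviour are taken from the text.

```python
import json
import os
import time
from pathlib import Path


def put_atomic(path, payload):
    """Write to a temp file, then rename, so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, path)  # atomic on POSIX; stands in for a single PutObject


def allreduce_mean(root, round_idx, rank, world_size, values,
                   timeout_s=5.0, poll_interval_s=0.01):
    """Publish this rank's values, wait for all peers, return the element-wise mean."""
    round_dir = Path(root) / f"round_{round_idx:06d}"
    put_atomic(round_dir / f"rank_{rank:04d}.json", values)
    peers = [round_dir / f"rank_{r:04d}.json" for r in range(world_size)]
    deadline = time.monotonic() + timeout_s
    while True:
        missing = [p.name for p in peers if not p.exists()]
        if not missing:
            break
        if time.monotonic() > deadline:
            # Name the missing ranks so an orchestrator could relaunch them.
            raise TimeoutError(f"round {round_idx}: missing {missing}")
        time.sleep(poll_interval_s)
    rows = [json.loads(p.read_text()) for p in peers]
    return [sum(col) / world_size for col in zip(*rows)]
```

A straggler here only delays its peers up to `timeout_s`; nothing in the loop requires the ranks to start together.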
+---
+## 2. Exact lifecycle: `SageMakerExecutor.launch_replicas(N)` → collect
+`composer_replication/diloco/serverless/sagemaker.py` (`SageMakerExecutor`) already implements the full `ServerlessExecutor` Protocol. The end-to-end flow for one Decoupled DiLoCo run:
+```
+make_diloco_run(N=4)                          # orchestrator (NEW, ~120 LOC — see §6)
+  └─ exec = SageMakerExecutor(
+         role_arn="arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247",
+         image_uri="386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-diloco:latest",
+         output_s3_path="s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-out/run42/",
+         region="us-west-2")
+  └─ handles = exec.launch_replicas(
+         n_replicas=4, entrypoint=<ignored>,
+         entrypoint_args={
+           "rendezvous_uri": "s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/run42/",
+           "trainer_module": "composer_replication.trainer.composer_trainer",
+           "trainer_fn": "diloco_train",            # NEW thin wrapper — see §6
+           "trainer_kwargs": {"model_name":"Qwen/Qwen2.5-0.5B","sync_every":500,"total_steps":2000}},
+         gpu="A10G", timeout=86400)
+```
+**Per replica, `launch_replicas` submits ONE single-instance Training Job** (`sagemaker.py:309-369`):
+- `ResourceConfig.InstanceCount == 1` — deliberately NOT one multi-instance job (which would wire SageMaker's intra-job NCCL via `resourceconfig.json` and couple the replicas — the wrong model). N replicas = N separate jobs.
+- `AlgorithmSpecification.ContainerEntrypoint = ["python","-m","composer_replication.diloco.serverless.replica_entrypoint"]`, `ContainerArguments = ["--rendezvous", s3uri, "--world-size","4","--trainer-module",...,"--trainer-fn","diloco_train","--trainer-kwargs-json", "{...}"]` (each token a separate list element).
+- `Environment = {"REPLICA_RANK":"<rank>","WORLD_SIZE":"4","RENDEZVOUS_URI":s3uri}` — the rank channel.
+- `EnableNetworkIsolation=False` is **load-bearing**: pinned, never a knob. `True` severs the container's outbound S3 access and stalls the allreduce poll until `timeout_s` expires. Bucket access is granted on `RoleArn` instead (SageMaker's IRSA analog).
+- On any rank's `create_training_job` failure, already-launched siblings are best-effort `cancel`ed then it raises (clean abort, no orphan jobs).
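The per-replica request shape listed above can be sketched as a pure function that builds the `CreateTrainingJob` payload for one rank. `build_replica_request` is a hypothetical helper, not the code in `sagemaker.py`; the job-name scheme, volume size, and default instance type are assumptions, while the field names are the standard `CreateTrainingJob` request keys.

```python
import json

ENTRYPOINT = ["python", "-m",
              "composer_replication.diloco.serverless.replica_entrypoint"]


def build_replica_request(rank, world_size, rendezvous_uri, role_arn, image_uri,
                          output_s3_path, trainer_module, trainer_fn,
                          trainer_kwargs, instance_type="ml.m5.xlarge",
                          max_runtime_s=86400, job_prefix="diloco"):
    """Build one single-instance training-job request for one replica."""
    # Each token is its own list element, as ContainerArguments requires.
    args = ["--rendezvous", rendezvous_uri,
            "--world-size", str(world_size),
            "--trainer-module", trainer_module,
            "--trainer-fn", trainer_fn,
            "--trainer-kwargs-json", json.dumps(trainer_kwargs)]
    return {
        "TrainingJobName": f"{job_prefix}-rank{rank:04d}",  # assumed naming scheme
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
            "ContainerEntrypoint": ENTRYPOINT,
            "ContainerArguments": args,
        },
        # One instance per job: replicas stay decoupled, no intra-job NCCL.
        "ResourceConfig": {"InstanceType": instance_type,
                           "InstanceCount": 1,
                           "VolumeSizeInGB": 50},
        "OutputDataConfig": {"S3OutputPath": output_s3_path},
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_s},
        # The rank channel: environment variables read by the entrypoint.
        "Environment": {"REPLICA_RANK": str(rank),
                        "WORLD_SIZE": str(world_size),
                        "RENDEZVOUS_URI": rendezvous_uri},
        # Must stay False, otherwise the container cannot reach S3.
        "EnableNetworkIsolation": False,
    }
```

Launching N replicas is then N calls with `rank` in `range(N)`, each dict passed to a separate `create_training_job` call.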
+**Inside each job, `replica_entrypoint.main`** (`replica_entrypoint.py:38`):
+1. reads `REPLICA_RANK` from env (falls back to argv via the dual contract — argv for SageMaker/Local, env for EKS),
+2. builds `store = ObjectStoreAllReduce(uri=s3uri, rank, world_size)`. For an `s3://` URI this hits `_init_fsspec()` → `fsspec.filesystem("s3")` (s3fs; in the `[serverless]` extra),
+3. `manager = MockManager(store)`,
+4. imports `trainer_module.trainer_fn`, injects `manager=`, `rank=`, `world_size=`, calls it.
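The four entrypoint steps above amount to rank resolution plus a dynamic import with keyword injection. The sketch below shows that pattern with the standard library only; `resolve_rank` and `call_trainer` are hypothetical names, and the env-first, argv-fallback order is assumed from the description of step 1 rather than read from `replica_entrypoint.py`.

```python
import importlib
import os


def resolve_rank(argv_rank=None, env=None):
    """Prefer REPLICA_RANK from the environment, fall back to the argv value."""
    env = os.environ if env is None else env
    raw = env.get("REPLICA_RANK")
    if raw is None:
        raw = argv_rank
    if raw is None:
        raise RuntimeError("rank not provided via REPLICA_RANK or argv")
    return int(raw)


def call_trainer(trainer_module, trainer_fn, trainer_kwargs, *,
                 manager, rank, world_size):
    """Import `trainer_module`, look up `trainer_fn`, and inject the runtime kwargs."""
    module = importlib.import_module(trainer_module)
    fn = getattr(module, trainer_fn)
    # The trainer sees the manager and topology as ordinary keyword arguments.
    return fn(manager=manager, rank=rank, world_size=world_size, **trainer_kwargs)
```

Because the trainer only ever receives keyword arguments, the same function can run under any executor that can set an environment variable or pass an argument.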
+**Inside `diloco_train` (the trainer_fn):**
+```python
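+from torch.optim import AdamW
+
+# Build the inner optimizer once: DiLoCo hooks this instance, so the loop
+# below must step the same object that is passed to make_diloco_outer_loop.
+inner_optim = AdamW(model.parameters(), lr=1e-5)
+diloco = make_diloco_outer_loop(
+    manager=manager, model_fragments=[model],
+    inner_optimizer=inner_optim,
+    outer_lr=0.7, outer_momentum=0.9, nesterov=True, sync_every=500)
+with diloco:
+    for step in range(total_steps):
+        inner_optim.zero_grad()
+        loss = composer_loss(...)
+        loss.backward()
+        inner_optim.step()       # every 500th call fires DiLoCo's post-hook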
+```
+At every 500th `inner_optim.step()`, torchft's DiLoCo post-hook fires `prepare_sync` → `perform_sync`. `perform_sync` computes the pseudo-gradient `θ_initial − θ_local` (sign convention pinned in `diloco/__init__.py:13-38`), calls `manager.allreduce(pseudograd)` → `MockManager.allreduce` (`allreduce.py:268`) → `ObjectStoreAllReduce.allreduce`: PUT `round_000000/rank_0003.pt`, poll for all 4 ranks' files, mean, `copy_` back. `MockManager.should_commit()` always returns True (no FT failover; replica failure is the orchestrator's job), then the **outer Nesterov SGD step** applies the averaged pseudo-gradient and redistributes the new global weights. `start_quorum()` bumps `_step` so `current_step()` advances exactly once per round (fragment-rotation math). Repeat for the next 500 inner steps → `round_000001/`, etc.
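The outer-round arithmetic described above can be checked on plain Python lists. This is a sketch, not torchft's code: the helper names are hypothetical, and the Nesterov update follows the formulation used by PyTorch's `SGD(nesterov=True)` with zero dampening, which is assumed here to be what the outer optimizer does.

```python
def pseudograd(theta_initial, theta_local):
    """Pseudo-gradient with the sign convention theta_initial - theta_local."""
    return [a - b for a, b in zip(theta_initial, theta_local)]


def mean_over_ranks(per_rank):
    """Element-wise mean of one pseudo-gradient per rank."""
    n = len(per_rank)
    return [sum(col) / n for col in zip(*per_rank)]


def outer_nesterov_step(theta_global, avg_pg, momentum_buf,
                        lr=0.7, momentum=0.9):
    """One outer SGD step with Nesterov momentum; returns (new_theta, new_buf)."""
    new_buf = [momentum * b + g for b, g in zip(momentum_buf, avg_pg)]
    new_theta = [t - lr * (g + momentum * b)
                 for t, g, b in zip(theta_global, avg_pg, new_buf)]
    return new_theta, new_buf


if __name__ == "__main__":
    theta0 = [1.0, 2.0]
    locals_ = [[0.8, 2.2], [0.6, 2.0]]          # two replicas after H inner steps
    avg = mean_over_ranks([pseudograd(theta0, loc) for loc in locals_])
    theta1, buf = outer_nesterov_step(theta0, avg, [0.0, 0.0])
    print(avg, theta1)
```

Every replica computes the same `avg`, so applying the same outer step locally leaves all replicas with identical global weights at the start of the next round.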
+**Collect:** `exec.collect(handles, timeout=...)` polls `describe_training_job` per handle until terminal (`Completed`/`Failed`/`Stopped`), returns rank-ordered result dicts incl. `S3ModelArtifacts`. `poll` maps `TrainingJobStatus` via `_STATUS_MAP` (refining `InProgress`+queued → `pending`). `stream_logs` reads `/aws/sagemaker/TrainingJobs` CloudWatch stream `<job>/algo-1-<epoch>`.
+The DiLoCo math, `MockManager`, `ObjectStoreAllReduce`, `make_diloco_outer_loop`, the trainer, and the loss are **all byte-for-byte unchanged** across Local→EKS→SageMaker. That is the whole point of the Protocol.
+---
+## 3. S3 rendezvous specifics for AWS
+- **Bucket/prefix:** `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/<run_id>/` for the rendezvous; a sibling `…/diloco-out/<run_id>/` for `OutputDataConfig`. Use the sagemaker-named bucket so the stock execution-role policy already grants access (see IAM). Layout written by the substrate: `…/diloco-rdv/<run_id>/round_{NNNNNN}/rank_{RRRR}.pt`.
+- **Consistency:** strong RAW + strong LIST, all regions, since 2020 — the poll loop (`_exists` then `_get`) needs no consistency shim. A peer's file becomes visible atomically on PUT completion.
+- **IAM (load-bearing):** the jobs need `s3:GetObject` + `s3:PutObject` (and ideally `s3:ListBucket`) on `…/diloco-rdv/<run_id>/*` on the **execution `RoleArn`**, NOT the caller. The stock `AmazonSageMakerFullAccess` on the existing roles grants S3 only on buckets whose name contains `sagemaker`/`Sagemaker`/`SageMaker`/`aws-glue` — which is exactly why the `amazon-sagemaker-…` bucket works out of the box and a custom-named `s3://composer-diloco-rdv` would silently 403 the first PUT and hang every peer until `timeout_s`. Either keep the rendezvous in a sagemaker-named bucket (recommended, zero IAM work) or attach an inline policy scoping `s3:GetObject/PutObject/ListBucket` to the custom prefix.
+- **Poll timeout vs stragglers:** `ObjectStoreAllReduce(timeout_s=1800, poll_interval_s=1.0)`. 30 min comfortably covers a SageMaker cold-started replica that joins late (3–5 min provision + first-500-steps lag). For Spot churn at larger N, raise to `timeout_s=3600`. A `TimeoutError` names the exact missing `rank_R` + `round_N` — the orchestrator can then cancel + relaunch that rank (DiLoCo's `should_commit==True` means a stalled round does not silently skip; the run aborts cleanly rather than averaging a partial set).
+- **Cost:** ~$0.05/round (ADR-005), negligible vs GPU. For a 2000-step / sync_every=500 run that's 4 rounds ≈ $0.20 of S3 for the whole run.
+---
+## 4. What's missing to run it for real (the deferred smoke) + the cheapest validating run
+**The gap.** ADR-005 §"Open/deferred" flags a "real serverless smoke" as never run; report §10 Phase 0 lists "EKSExecutor + S3 rendezvous + dep bump" as substrate hardening. Concretely, what exists vs what's missing:
+| Layer | State |
+|---|---|
+| `ObjectStoreAllReduce` over `file://` | **Proven** — `test_serverless_diloco_integration.py` runs a 2-process `LocalProcessExecutor` run and asserts cross-rank weight convergence after one outer round. |
+| `ObjectStoreAllReduce` over `s3://` | **Code path exists, never exercised against real S3.** `_init_fsspec()` is untested with live s3fs; the `_exists`/`_get`/`_put` S3 branches have only mock coverage. |
+| `SageMakerExecutor` against real boto3 | **Never submitted a real job.** Tests inject a `_MockSMClient`. The ECR `image_uri` does not yet exist (no Dockerfile baking `composer_replication`). |
+| `diloco_train` trainer_fn | **Missing.** `composer_trainer.py` has the trainer but no thin DiLoCo-wrapped entry that accepts the injected `manager`/`rank`/`world_size` kwargs. |
+**The exact remaining gaps to close, in order (each cheap, each decisive):**
+1. **s3:// smoke (≈$0, ~10 min):** point the *existing* `test_serverless_diloco_integration` multi-process test at `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/smoke-$(uuid)/` instead of a tmp dir, with 2 local processes. This exercises the real s3fs PUT/poll/GET/mean path and the strong-consistency assumption with zero GPU spend. Only `s3fs` + `boto3` (already in `[serverless]`) and ambient `Admin` creds are needed. Gate: both ranks converge to identical weights (the existing assertion).
+2. **Dockerfile + ECR push (~$0):** ~30-line image `FROM pytorch/pytorch`, `pip install -e .[serverless,diloco,train]`, entrypoint `python -m composer_replication.diloco.serverless.replica_entrypoint`. Push to `386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-diloco:latest`.
+3. **`diloco_train` trainer_fn (~40 LOC):** wraps the existing trainer in `make_diloco_outer_loop`, accepts `manager/rank/world_size`.
+4. **2 tiny CPU SageMaker jobs (the validating run, ~$0.06–0.30):** `SageMakerExecutor(gpu=None → ml.m5.xlarge, $0.23/hr on-demand)`, `n_replicas=2`, `nn.Linear` or a 0.5B model with `total_steps=4, sync_every=2`, `rendezvous_uri` in the sagemaker bucket. Two jobs × (~5 min cold-start + ~2 min run) ≈ 0.24 instance-hours ≈ **$0.06–0.30**. Within the 30-instance CPU quota with massive headroom. Gate: `collect()` returns both `succeeded`, both ranks' `round_000000/rank_000{0,1}.pt` appear in S3, and the weights in the two model artifacts in `…/diloco-out/` are identical (proves the cross-job allreduce ran through real S3, not just locally). Compare the unpacked tensors rather than the archives, since two `model.tar.gz` files can differ in archive metadata even when the weights match.
+That ladder — `file://` (done) → `s3://` local (step 1) → 2 real CPU jobs (step 4) — is the report's prescribed cheapest path and closes the deferred-smoke gap for well under the ADR's $2–5 estimate (the original estimate assumed GPU + Modal).
+---
+## 5. Streaming DiLoCo (`fragment_sync_delay>0`) on this substrate
+Streaming DiLoCo (Douillard et al. 2025, "Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch", arXiv:2501.18512; the original DiLoCo paper is arXiv:2311.08105) splits the model into fragments synced on staggered schedules so the allreduce of fragment k overlaps inner computation of the next steps. `make_diloco_outer_loop` already exposes the knobs: `fragment_sync_delay>0`, `fragment_update_alpha`, and `model_fragments=[frag_0,...,frag_M-1]` (`diloco/__init__.py:72-93`).
+From torchft's `local_sgd.py` (`_StreamingDiLoCoFragment`, verified via DeepWiki against the upstream source):
+- **prepare_sync** fires at inner step `sync_every − fragment_sync_delay`: computes pseudo-gradients (`_save_grads`) and **launches the allreduce without waiting**, recording the `Work` in `self._allreduce_work`, on a separate CUDA stream.
+- **perform_sync** fires at `sync_every`: **waits** on that `Work`, restores global params, `should_commit()`, applies the outer step, then `_merge_parameters` blends `local*alpha + global*(1−alpha)` (`fragment_update_alpha`; 0.0 = standard full-replacement DiLoCo).
+- **`_current_fragment() = current_step() % len(fragments)`** — round-robin; each fragment syncs on its own offset. This is exactly why `MockManager.start_quorum()` bumps `_step` once per round and `current_step()` is faithful: get this wrong and replicas pick different fragments and diverge.
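The round-robin selection and the alpha blend in the bullets above are small enough to state directly. The functions below are an illustrative sketch on plain lists; the names mirror the torchft identifiers quoted above but are not the upstream implementation.

```python
def current_fragment(current_step, n_fragments):
    """Round-robin: which fragment syncs at this outer step."""
    return current_step % n_fragments


def merge_parameters(local, global_, alpha):
    """Blend local*alpha + global*(1 - alpha); alpha=0.0 is full replacement."""
    return [l * alpha + g * (1.0 - alpha) for l, g in zip(local, global_)]


if __name__ == "__main__":
    # With 3 fragments, six consecutive outer steps cycle 0, 1, 2, 0, 1, 2.
    print([current_fragment(s, 3) for s in range(6)])
    # alpha = 0.0 discards the local values in favour of the synced globals.
    print(merge_parameters([1.0, 2.0], [3.0, 4.0], 0.0))
```

If two replicas disagree on `current_step`, they select different fragments for the same round, which is the divergence the once-per-round `_step` bump guards against.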
+**The load-bearing gotcha for the object-store substrate.** Streaming's overlap depends on `manager.allreduce()` being **asynchronous** — prepare_sync launches it, `fragment_sync_delay` steps of inner compute run, then perform_sync waits. But the repo's `MockManager.allreduce` is **synchronous**: it calls `ObjectStoreAllReduce.allreduce`, which **blocks** on the poll-until-all-peers loop before returning `_ImmediateWork` (whose `.wait()` is a no-op). So on this substrate, **today, prepare_sync blocks for the full S3 rendezvous and `fragment_sync_delay` buys zero overlap** — Streaming degrades to vanilla, correctly but without the comm/compute overlap benefit. This is fine for correctness (and is why the same API "configures Streaming" per the docstring) but defeats the point on a 2 GB-per-fragment, 30-min-cadence S3 sync.
+**The fix to make Streaming real here (~60 LOC, deferred):** give `ObjectStoreAllReduce` a non-blocking mode: `allreduce_async(tensor)` PUTs `round_N/rank_R.pt` and returns immediately; the returned `Work.wait()` then runs the poll-GET-mean-copy. `MockManager.allreduce` returns that deferred `Work` instead of `_ImmediateWork`. Now prepare_sync's PUT returns instantly, the `fragment_sync_delay` inner steps run while peers are PUTting concurrently, and perform_sync's `.wait()` does the poll/mean. Because S3 is strongly consistent, by the time perform_sync waits `fragment_sync_delay` steps later, peer files are far more likely already present — the overlap is genuine. Per-fragment streaming further shrinks each PUT to (model_size / M), so the poll is over smaller objects. This is the natural Streaming-DiLoCo realization on object storage and is the right Phase-5 upgrade (report §10) for multi-fragment large-model runs; vanilla (`fragment_sync_delay=0`, single fragment) is correct and sufficient for Phases 0–4.
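One possible shape for the non-blocking mode is sketched below, again over a local directory with JSON payloads as a stand-in for S3 and tensors. `DeferredWork` and this `allreduce_async` signature are assumptions for illustration only; the repo's eventual API may differ, and the sketch returns the mean instead of copying it into a tensor in place.

```python
import json
import os
import time
from pathlib import Path


class DeferredWork:
    """Handle whose wait() runs the deferred poll-and-mean exactly once."""

    def __init__(self, wait_fn):
        self._wait_fn = wait_fn
        self._done = False
        self._result = None

    def wait(self):
        if not self._done:
            self._result = self._wait_fn()
            self._done = True
        return self._result


def allreduce_async(round_dir, rank, world_size, values,
                    timeout_s=5.0, poll_interval_s=0.01):
    """Publish this rank's values now; defer the peer poll until wait()."""
    round_dir = Path(round_dir)
    round_dir.mkdir(parents=True, exist_ok=True)
    final = round_dir / f"rank_{rank:04d}.json"
    tmp = round_dir / f"rank_{rank:04d}.json.tmp"
    tmp.write_text(json.dumps(values))
    os.replace(tmp, final)  # the PUT: visible to peers immediately

    def poll_and_mean():
        peers = [round_dir / f"rank_{r:04d}.json" for r in range(world_size)]
        deadline = time.monotonic() + timeout_s
        while not all(p.exists() for p in peers):
            if time.monotonic() > deadline:
                missing = [p.name for p in peers if not p.exists()]
                raise TimeoutError(f"missing peer files: {missing}")
            time.sleep(poll_interval_s)
        rows = [json.loads(p.read_text()) for p in peers]
        return [sum(col) / world_size for col in zip(*rows)]

    return DeferredWork(poll_and_mean)
```

The caller can run inner steps between `allreduce_async(...)` and `.wait()`; whether that yields a measurable overlap on real S3 is exactly what the wall-clock comparison should establish.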
+---
+## 6. Repo delta (what to build in `composer_replication/`)
+| File | Delta | ~LOC |
+|---|---|---|
+| `composer_replication/trainer/composer_trainer.py` | NEW `diloco_train(*, manager, rank, world_size, model_name, sync_every=500, total_steps, **kw)` — wraps the existing trainer body in `make_diloco_outer_loop(manager=manager,...)`; this is the `trainer_fn` `replica_entrypoint` calls. | ~40 |
+| `composer_replication/diloco/serverless/run.py` | NEW thin orchestrator `make_diloco_run(executor, n, rendezvous_uri, trainer_module, trainer_fn, trainer_kwargs, gpu, timeout)` → `launch_replicas` + `collect`; surfaces straggler `TimeoutError` → relaunch-rank. | ~120 |
+| `docker/Dockerfile.diloco` | NEW — `FROM pytorch/pytorch:*-cuda*`, `pip install -e .[serverless,diloco,train]`, ENTRYPOINT to `replica_entrypoint`. Push to ECR `composer-diloco:latest`. | ~30 |
+| `composer_replication/diloco/serverless/sagemaker.py` | EXTEND: optional `keep_alive_period_s` → `ResourceConfig.KeepAlivePeriodInSeconds` (warm pool; **document the 0-default quota gotcha**); optional `use_spot` → `EnableManagedSpotTraining=True` + `MaxWaitTimeInSeconds` (≥ `MaxRuntimeInSeconds`) + `CheckpointConfig`. Both default off. | ~40 |
+| `composer_replication/diloco/serverless/allreduce.py` | EXTEND (Phase 5, for real Streaming): `allreduce_async` + a deferred `Work` whose `.wait()` runs the poll/mean; `MockManager` returns it. Vanilla path untouched. | ~60 |
+| `composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py` | EXTEND: parametrize rendezvous over `file://` AND `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/smoke-<uuid>/` (gated on `AWS_SMOKE=1`) — closes the s3:// path gap with zero GPU. | ~30 |
+| `docs/adrs/ADR-005-serverless-diloco.md` | UPDATE the "Open/deferred — Real serverless smoke" clause: replace the Modal $2–5 estimate with the SageMaker 2×CPU-job ($0.06–0.30) plan above. | ~10 |
+Untouched: `loss.py`, `teacher_replay.py`, `safety/`, `make_diloco_outer_loop`, `MockManager` core, `ObjectStoreAllReduce` core, `replica_entrypoint` (its dual argv/env contract already supports both SageMaker and EKS).
+---
+## 7. Open questions / falsifiers
+- **Warm-pool quota:** default 0 in us-west-2 (verified). If iterative dev wants warm starts (skip the 3–5 min cold-start per round-of-jobs), request a `ml.<type> for training warm pool usage` quota increase first; otherwise `KeepAlivePeriodInSeconds` no-ops.
+- **Pseudo-gradient dtype/size at real model scale:** the smoke uses tiny tensors; a 0.5B–8B bf16 pseudo-gradient is 1–16 GB per PUT — confirm s3fs multipart upload throughput and that `torch.save({"rank","tensor"})` round-trips bf16 on CPU (the code casts peer tensors back to device+dtype on GET).
+- **Streaming overlap:** only real after the `allreduce_async` delta (§5); until then `fragment_sync_delay>0` is correct-but-no-overlap on S3. Measure round wall-clock with vs without before claiming the benefit.
+- **N>16 straggler fragility under Spot:** the bounded `timeout_s` poll is the mitigation; the HyperPod-attached-node-group path (report §9, same S3 rendezvous) is the resilience escalation for multi-day runs.

research/design-F5-fidelity-audit.md ADDED

	@@ -0,0 +1,106 @@

+# F5 — Fidelity Audit: Are we FULLY taking advantage of the papers + the Composer 2.5 / Composer-2 methodology?
+> **Date:** 2026-06-09. **Scope:** rubric-style audit of every documented Composer 2.5 / Composer-2 ingredient AND each load-bearing paper against the *actual* code in `composer_replication/`, plus a prioritized "to fully replicate + extend" gap list. AWS-native where a gap needs cloud infra (account 386931836011, us-west-2).
+> **Grounding:** `docs/COMPOSER_RECIPE_MAPPING.md`, `research/01/09/10` (Composer 2.5 blog + Composer-2 techreport mining), `research/05-12`, ADR-008/014/015, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md`.
+## Headline verdict
+The repo is a **high-fidelity replication of Composer 2.5's two *published* channels** (Dr.GRPO base + SDPO textual-feedback self-distillation) plus the framework's *own* third channel (multi-teacher trace-replay-DPO). The datagen / curriculum / anti-hack / safety substrate is essentially complete. But there are **three classes of gap that block "taking a model to the next level"**:
+1. **One byte-level loss-math infidelity that the evidence says actually matters**: Channel 1 rides TRL's **k3-in-loss** KL estimator; Composer 2 explicitly chose **k1-in-reward** (`-log r`), and the 2025/26 literature (arXiv:2512.21852 "A Comedy of Estimators"; verl adopting k1-in-reward as the *only* reverse-KL option; TRL issue #4967) shows k1-in-reward **improves OOD generalization** while k3-in-reward can *collapse*. The repo documents this as a "small delta, not patched" — but OOD generalization is exactly the "next-level" axis. This is the single most concrete fidelity fix.
+2. **Composer-2's *non-hint* behavior-shaping recipe is entirely MISSING**: the **auxiliary scalar reward array** (style / communication / unfinished-todo penalties), the **nonlinear length/effort penalty** `C_length = ((1+kx)^{1-q}-1)/(k(1-q))`, and **self-summarization with reward-to-all-chain-tokens**. These are *fully specified with equations* in research/10 and are reproducible without the hint mystery. None is in code.
+3. **The novel extension (multi-model Monte-Carlo tree-of-work + world-model deliberation head) is 100% design, 0% code.** No tree controller, no env-step-between-branches recursion, no `SiblingBootstrapGenerator`, no `<deliberate>` token, no next-state head. The report's "core delta" (teacher-plurality → execution-oracle fitness, depth-1 → recursion) is unbuilt.
+The good news the report stresses: the substrate for all of this already exists. ~9/10 of the system is reuse. The fidelity gaps are *additive*, not architectural rewrites.
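For reference, the two per-token KL estimators named in gap 1 can be written out as below. This is a sketch using the forms quoted in this audit (`k1 = -log r`, `k3 = exp(Δ) − Δ − 1` with `Δ = log p_ref − log p_policy` and `r = exp(Δ)`); the function names are illustrative, and where the estimate is applied (reward versus loss) is a separate choice not shown here.

```python
import math


def k1(logp_policy, logp_ref):
    """k1 = -log r, where r = p_ref / p_policy for a token sampled from the policy."""
    return logp_policy - logp_ref


def k3(logp_policy, logp_ref):
    """k3 = exp(delta) - delta - 1 with delta = log p_ref - log p_policy."""
    delta = logp_ref - logp_policy
    return math.exp(delta) - delta - 1.0


if __name__ == "__main__":
    # A token the policy finds less likely than the reference does.
    print(k1(-2.0, -1.0), k3(-2.0, -1.0))
```

Per sample, `k1` can be negative while `k3` never is; both are zero when the two log-probabilities agree.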
+---
+## RUBRIC A — Composer 2.5 / Composer-2 documented ingredients
+| # | Ingredient (source) | Status | Implementing file / evidence | Gap |
+|---|---|---|---|---|
+| (a) | **Targeted RL w/ textual feedback** = SDPO/OPSD self-distill (Ch2) — 2.5 blog | **FULLY-REPLICATED** | `opsd.py::generalized_jsd_loss` (byte-for-byte vs siyan-zhao/OPSD, re-aligned Wave 15); `trainer/composer_trainer.py::_compute_sdpo_loss` (full-logits, stop-grad teacher, post-hint masked, ADR-011 aligned indices); `trainer/data_collator.py` emits `ctx_teacher_input_ids` + `student/teacher_response_idx`; tests `test_sdpo_alignment_indices.py`, `test_opsd_parity.py`. | Loss + wiring are production-grade. The *hint source* is the open question — see (a′) below. SDPO not yet smoke-tested against a real `trl.GRPOTrainer` on GPU (ADR-008 note). |
+| (a′) | **Hint generation** (the #1 reproducibility gap — unstated in *every* Cursor artifact) | **PARTIAL (designed-around)** | `hint_generator.py` layered: `TemplateHintGenerator` → `RawErrorHintGenerator` (routed) → `LLMJudgeHintGenerator` (cached, clamped) → `default_composite()`. ADR-009/012. | Layers 1-3 built; **layer 4 (`SiblingBootstrapGenerator` = SDPO "successful-rollout-as-implicit-feedback")** is design-only. LLM-judge path needs a live model wired (Bedrock). |
+| (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
+| (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
+| (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
+| (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **PARTIAL — DOCUMENTED INFIDELITY** | `composer_trainer.py:496-509` documents that TRL's `_compute_loss` uses **k3-in-loss** (`exp(Δ)−Δ−1`), NOT k1. `test_dr_grpo_config_and_alignment.py::test_trl_kl_estimator_is_k3_not_k1` pins this. Honest delta, not patched. | **The evidence says this delta matters for the "next level":** arXiv:2512.21852 + TRL #4967 + verl (k1-in-reward only) show k1-in-reward ↑ OOD generalization; k3-in-reward can collapse. Composer chose k1 deliberately. Fix is implementable (see Gap #1). |
+| (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
+| (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
+| (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
+| (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
+| (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **MISSING** | Reward is pure test-pass-fraction (`env.py::_grade`). No auxiliary reward array. `integrations/altered_minds/reward.py` is an MMLU-format reward for ADR-013 ladder, not the Composer behavior-reward suite. | Fully specified in research/10; reproducible without the hint mystery. Build a `behavior_rewards.py` reward-fn bank. |
+| (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **MISSING** | — | Trivially implementable (≈30 LOC reward shaper over {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns}). Induces parallel tool calls per the report. |
+| (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
+| (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |
+**Rubric A score:** 5 FULLY-REPLICATED of the *core* published recipe (a, b1, b2, c1, g), 2 PARTIAL (a′ designed-around; f), 1 documented infidelity (c2), 3 intentional skips or deferrals (d, e, k), and **3 missing-but-specified Composer-2 behavior-shaping items (h, i, j)** that are the cheapest available "next-level" wins.
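As an indication of how small item (i) is, the length/effort penalty from the table can be sketched as a single function. `c_length` is a hypothetical name; the closed form is the one quoted in row (i), and the `q = 1` branch is an added assumption that uses the mathematical limit `log(1 + kx) / k` to avoid dividing by zero.

```python
import math


def c_length(x, k, q):
    """C_length{k,q}(x) = ((1 + k*x)**(1 - q) - 1) / (k * (1 - q))."""
    if k <= 0:
        raise ValueError("k must be positive")
    if x < 0:
        raise ValueError("x must be non-negative")
    if abs(1.0 - q) < 1e-12:
        # Limit of the closed form as q -> 1 (not stated in the source).
        return math.log1p(k * x) / k
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


if __name__ == "__main__":
    # q = 0 reduces to a linear penalty; larger q flattens the growth.
    print(c_length(5.0, 2.0, 0.0), c_length(3.0, 1.0, 0.5))
```

How `x` is assembled from thinking tokens, tool calls, and turns, and how the penalty is weighted against the task reward, is left to the reward bank design.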
+---
+## RUBRIC B — Load-bearing papers
+| Paper (cluster) | Status | Implementing file / evidence | Gap |
+|---|---|---|---|
+| **SDPO / OPSD** (2601.20802 / 2601.18734) | **FULLY-REPLICATED** | `opsd.py` (byte-for-byte OPSD JSD), `composer_trainer._compute_sdpo_loss`. | SDPO "successful-rollout-as-implicit-feedback" lever (=sibling bootstrap) not built. |
+| **Dr.GRPO** (2503.20783) | **FULLY-REPLICATED** | `make_dr_grpo_config`; tests pin length-std-off + no-std-norm. | KL is k3-not-k1 (see Rubric A c2). |
+| **DAPO / GSPO / CISPO / BNPO** (2503.14476 / 2507.18071 / 2506.13585) | **FULLY-REPLICATED (as menu)** | `PO_OBJECTIVES` presets + drift guards (ADR-014). Note Composer-2 *rejected* DAPO overlong-masking at small scale. | None — menu exceeds Composer's own single choice. |
+| **DiLoCo / Streaming-DiLoCo** (2311.08105 / 2501.18512) | **PARTIAL** | `diloco/__init__.py::make_diloco_outer_loop`, `diloco/serverless/allreduce.py::ObjectStoreAllReduce` (s3://), `MockManager` mirroring torchft.Manager. ADR-003/005. | Streaming-DiLoCo (overlapped/quantized comm, partial-param sync) not implemented — plain DiLoCo outer loop only. Real multi-replica AWS run unproven (executor skeletons). |
+| **World-model cluster** (MuZero 1911.08265, Dreamer 2301.04104, CWM 2510.02387, Chain-of-World 2603.03195) | **MISSING (design only)** | None. Report §2 designs a parameter-isolated next-state head + `<deliberate>` token as a *second SDPO mode*; calibration (ECE/Brier) primary, Foresight@k kill-ablation. | **Next-state head is NOT built — purely designed.** This is the project's stated end-goal. |
+| **SimPO / TAID / Entropy-OPD distillation** (ADR-007) | **FULLY-REPLICATED** | `distillation/{simpo.py,taid.py,entropy_aware_opd.py}`, wired into `loss.py::compose_loss(dpo_variant=, sdpo_wrapper=, taid_t=)`; tests `test_distillation_losses.py`, `test_taid_parity.py`. | Production trainer (`composer_trainer.py`) only exposes SDPO/DPO; SimPO/TAID/Entropy-OPD live in the verification-mirror `compose_loss`, not the live GRPO subclass. |
+| **PRIME-RL / verl / Monarch** (ADR-006) | **PARTIAL** | `recipes/prime_rl/composer_loss.py` (Ch1+Ch3, raises NotImplementedError for SDPO — needs full logits, ADR-008), `recipes/monarch/actors.py`, parity harness. | PRIME-RL can't host SDPO (logits gap upstream); Monarch is actor skeletons; verl `AsyncServer` (the report's scale answer for tool-heavy trees) not integrated. |
+| **Prune-vs-train-on-all evidence** (RAFT 2504.11343, neg-gradient 2505.18830, near-miss 2503.14391, expert-failure, CCA/PURE) | **PARTIAL (ladder scaffolded)** | ADR-013 A0–A4 isolated-channel ladder + `integrations/altered_minds/{ladder.py,kl_logging.py,reward.py}`. | The report's **P0–P6 "generate-once/route-many" branch-usage axis** (the experiment that *settles* prune-vs-all) is design-only. No typed-train-on-all routing. |
+**Rubric B score:** 4 FULLY-REPLICATED (SDPO/OPSD, Dr.GRPO, PO-menu, SimPO/TAID/Entropy-OPD), 3 PARTIAL (DiLoCo, PRIME-RL/verl/Monarch, prune-vs-all ladder), **1 MISSING (the entire world-model cluster — the stated goal).**
+---
+## THE NOVEL EXTENSION — multi-model Monte-Carlo tree-of-work + world-model deliberation head
+**Status: 100% design, 0% code.** Verified by grep: no `SiblingBootstrap*`, no `world_model`/`WorldModel`, no `<deliberate>`/`MCTS`/`next_state_head` in `composer_replication/` (only docstring uses of the word "deliberately"). Every reference lives in `research/` notes and the deep-research report.
+What exists today (the *ancestor*): `teacher_replay.py` is **flat depth-1** (N teachers query the same `state["messages"]`, nobody applies the action) with **teacher-plurality fitness** (`Counter` over normalized actions, `extract_dpo_pairs`). The report's two changes — (1) **recursion** (apply each candidate via `FeatureDeletionEnv.step()` → new state → branch again) and (2) **execution-oracle fitness** (`_grade()`'s masked pass-fraction replaces plurality) — are the entire idea and are unbuilt.
+**Minimal first build (matches report Phase 1, the "one cheap, clearly-worth-it change"):**
+- A tree controller that, between branches, calls `FeatureDeletionEnv.step(action)` and grades leaves with `_grade()` — turning Channel-3 depth-1 stars into a real tree **with NO new loss term**. The divergence signal enters as the SDPO teacher's privileged-info conditioning (the reserved `SiblingBootstrapGenerator` slot: select max-reward sibling, emit "a working approach looks like: …", feed the *same* `ctx_teacher` splice).
+- This is ~1 module (`datagen/tree_controller.py`, ~200-300 LOC) + `SiblingBootstrapGenerator` (~60 LOC in `hint_generator.py`). The collator and loss are untouched.
+- **Divergence-gating is mandatory** (report §3/§10): branch only where sibling next-action distributions already disagree, else collapse to one rollout — turns O(N^D) into O(N·decision-points). Ungated cost ≈$64/trace vs $0.98 flat.
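+The recursion and the divergence gate described in the bullets above can be sketched on a toy problem. Everything here is a stand-in: `toy_step` and `toy_grade` play the roles of `FeatureDeletionEnv.step()` and `_grade()`, and the disagreement measure (one minus the plurality share of sampled actions) is an assumed proxy for "sibling next-action distributions disagree", not the report's definition.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

State = Tuple[str, ...]  # toy state: the actions applied so far

def disagreement(actions: Sequence[str]) -> float:
    """1 - plurality share; 0.0 when every proposer picks the same action."""
    counts = Counter(actions)
    return 1.0 - counts.most_common(1)[0][1] / len(actions)

def expand(
    state: State,
    propose: Callable[[State], Sequence[str]],
    step: Callable[[State, str], State],
    grade: Callable[[State], float],
    depth: int,
    gated: bool = True,
    min_disagreement: float = 0.0,
) -> List[Tuple[State, float]]:
    """Apply each candidate, recurse on the new state, grade the leaves."""
    if depth == 0:
        return [(state, grade(state))]
    proposals = list(propose(state))
    if not gated:
        candidates = proposals                      # one branch per proposer
    elif disagreement(proposals) > min_disagreement:
        candidates = sorted(set(proposals))         # branch on distinct actions
    else:
        candidates = [Counter(proposals).most_common(1)[0][0]]  # collapse
    leaves: List[Tuple[State, float]] = []
    for action in candidates:
        leaves.extend(
            expand(step(state, action), propose, step, grade,
                   depth - 1, gated, min_disagreement)
        )
    return leaves

# Toy stand-ins: three proposers that disagree only at the second decision.
def toy_propose(state: State) -> Sequence[str]:
    return ["a", "b", "c"] if len(state) == 1 else ["a", "a", "a"]

def toy_step(state: State, action: str) -> State:
    return state + (action,)

def toy_grade(state: State) -> float:
    return sum(a == "a" for a in state) / len(state)

gated_leaves = expand((), toy_propose, toy_step, toy_grade, depth=3)
ungated_leaves = expand((), toy_propose, toy_step, toy_grade, depth=3, gated=False)
best_state, best_score = max(gated_leaves, key=lambda leaf: leaf[1])
```

+In this toy setup the gated tree grades 3 leaves while the ungated tree grades 27, and the max-score leaf is the one a sibling-bootstrap hint would be drawn from.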
+**The world-model head (report Phase 4, gated on the P0–P6 verdict):** a parameter-isolated next-state adapter + `<deliberate>` token as a *second SDPO mode* — splice the realized post-action observation (stdout, tool_error kind, signed FAIL_TO_PASS delta, one-line diff) into the teacher context as privileged info, distill the student toward the foreseen-outcome distribution. Carrier requires no new kernel (rides `generalized_jsd_loss`). Build only if Foresight@k ≠ 0.
+---
+## Prioritized "to fully replicate + extend" gap list (build order)
+Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build effort.
+**Tier 0 — cheap fidelity fixes the evidence says move OOD generalization (do first):**
+1. **k1-in-reward KL** (Rubric A c2). Add a `kl_estimator="k1"` + `use_kl_in_reward=True` path to the trainer: compute `−log r` per token, fold into the *advantage/reward* (not the loss), set TRL `beta=0.0` to disable its k3-in-loss term. Mirror TRL issue #4967 / verl's choice. `composer_trainer.py` ~60 LOC + test flipping the pinned k3 assertion. **This is the highest-fidelity-leverage single change.**
+2. **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — the aux scalar reward array (style/communication/unfinished-todo) + the nonlinear length/effort penalty `C_length` (exact eq. in research/10), as TRL `RewardFunc`s composable with `env.reward_fn`. ~120 LOC. Reproducible *without* the hint mystery; directly targets Composer's "communication style + effort calibration" goal.
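+For item 1, the difference between the two estimators can be sketched in plain Python. This assumes `r = π_ref / π_θ` evaluated on tokens sampled from the policy, and it assumes the per-token k1 terms are summed over the sequence before being subtracted from the scalar reward; the function names and `kl_coef` are hypothetical and are not TRL or verl APIs.

```python
import math
from typing import List, Sequence

def k1_per_token(logp_policy: Sequence[float], logp_ref: Sequence[float]) -> List[float]:
    """k1 = -log r = log pi_theta - log pi_ref. Unbiased, can be negative per token."""
    return [lp - lr for lp, lr in zip(logp_policy, logp_ref)]

def k3_per_token(logp_policy: Sequence[float], logp_ref: Sequence[float]) -> List[float]:
    """k3 = (r - 1) - log r. Always >= 0; the form used as an in-loss term."""
    out = []
    for lp, lr in zip(logp_policy, logp_ref):
        log_r = lr - lp
        out.append(math.expm1(log_r) - log_r)
    return out

def kl_shaped_reward(
    task_reward: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    kl_coef: float,
) -> float:
    """Fold the k1 penalty into the reward as a constant, not into the loss.

    ASSUMPTION: sequence-level sum of per-token k1; a per-token reward
    or a mean would be equally plausible readings of the item above.
    """
    return task_reward - kl_coef * sum(k1_per_token(logp_policy, logp_ref))
```

+Because the penalty enters as a plain number, no gradient flows through it; the in-loss KL term would then be switched off separately, as the item describes.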
+**Tier 1 — close the highest-value PARTIALs:**
+3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.
+4. **`SiblingBootstrapGenerator`** (Rubric A a′ layer 4 + the SDPO implicit-feedback lever): ~60 LOC in `hint_generator.py`, wired into `default_composite`. Unblocks the tree's "zero-new-loss-term" wiring.
+5. **SFT-first stage** (Rubric A d): a thin SFT recipe over clean winning trajectories before RL (report §5 "competence floor"). Reuse TRL `SFTTrainer`.
+**Tier 2 — the novel extension (the report's phased ladder):**
+6. **Tree controller + execution-oracle fitness** (recursion, the core delta): `datagen/tree_controller.py` — env-step between branches, `_grade()` leaves, divergence-gated expansion. Report Phase 1+2.
+7. **P0–P6 typed-train-on-all routing** (settles prune-vs-all): extend ADR-013's ladder with generate-once/route-many on a shared tree; primary metrics = near-miss calibration (ECE/Brier) + Foresight@k, pass@1 secondary. Report Phase 3.
+8. **World-model next-state head** (the stated goal): parameter-isolated adapter + `<deliberate>` token as a second SDPO mode. Build *only if* P4/P6 beat P0–P3 on foresight. Report Phase 4.
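+The calibration metrics named in item 7 have standard definitions, sketched here for binary pass/fail outcomes. The equal-width binning and `n_bins=10` are conventional choices assumed for illustration; Foresight@k is specific to the report and is not reproduced.

```python
from typing import List, Sequence, Tuple

def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted pass probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean of |mean predicted prob - observed pass rate| over equal-width bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - mean_y)
    return ece
```

+A predictor that says 0.9 for two near-miss branches of which only one passes is overconfident by 0.4 in that bin, which is the kind of gap the ladder is meant to surface.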
+**Tier 3 — production-fidelity infra (Anyrun analogue on AWS):**
+9. **Real AWS sandbox tiering** (Rubric A f): EKS gVisor `runsc` RuntimeClass default → Kata+Firecracker (self-managed node groups; EKS Managed Node Groups override the CPU-Options needed for nested virt) → container-free SWE-MiniSandbox for high fan-out. Egress-off. us-west-2, the live SageMaker S3 buckets as the trace/rendezvous store.
+10. **`EKSExecutor` + `SageMakerExecutor` flesh-out + `[serverless]` dep bump** (s3fs/boto3/kubernetes): make the DiLoCo-over-S3 substrate actually launch multi-replica on the live account. Report §9.
+11. **verl `AsyncServer` backend** for the tool-heavy tree (TRL has no async GPU-decoupled agent loop). Report §8/§10 Phase 5.
+**Tier 4 — defense-in-depth completion:**
+12. **Offline LLM-judge hack monitor** (EvilGenie-style, via Bedrock) as a flagging-only monitor (never the training reward); the report warns that a `HackMonitor` validated only on constructed examples likely misses in-the-wild hacks.
+---
+## What "FULLY taking advantage" would mean, concretely
+- **Published Composer 2.5 recipe (Ch1 Dr.GRPO + Ch2 SDPO):** essentially done; the *only* remaining gaps are the k3-vs-k1 KL infidelity (Gap #1) and the missing live-GPU SDPO smoke (Gap #3).
+- **Composer-2's reproducible behavior-shaping (aux rewards + length penalty + self-summarization):** the cheapest unrealized "next-level" wins — fully specified, zero hint-mystery, currently 0% built (Gaps #2 + self-summarization).
+- **The papers' frontier the repo *exceeds* Composer on:** the PO-objective menu (6 objectives vs Composer's 1), the distillation menu (SimPO/TAID/Entropy-OPD), and the run-level collapse kill-switch. These are genuine over-delivery.
+- **The novel bet (tree-of-work + world-model head):** the literature says build it *as a falsifiable ablation, not a premise* (report §3/§7). It is the project's differentiator and is entirely unbuilt — Tier 2 is where "next level" beyond Composer actually lives, conditioned on the P0–P6 verdict.