Baladithya Balamurugan Claude Opus 4.8 (1M context) commited on
Commit
50e3f0e
·
1 Parent(s): c647fe9

Wave 20: 5-facet AWS-native architecture design (F1-F5)

Browse files

Answers the dataset-gen+SFT-vs-RL-vs-BOTH question with 5 grounded
facet designs from the aws-native-architecture-design workflow:

- F1 systems-framing: BOTH = two loops at two timescales, not two phases;
component->loop mapping table; SFT-first ordering; S3 dataset contract
(6 typed prefixes); repo delta (tree_controller, s3_layout, sft_floor).
- F2 aws-datagen: per-stage native verdict (Glue 5.0 Spark ingest ->
Bedrock batch replay -> AWS Batch array on Spot test-exec -> EMR
Serverless normalize), Step Functions orchestration.
- F3 rl-sagemaker: runnable-now g5.2xlarge GSM8K GRPO smoke (<$1),
colocated vLLM, BYO container from PyTorch DLC; live-verified quota.
- F4 decoupled-diloco-s3: N single-instance jobs, ObjectStoreAllReduce
PUT-poll-mean over S3, strong-consistency correctness, IAM gotcha,
cheapest validating-run ladder.
- F5 fidelity-audit: rubric vs Composer 2.5/Composer-2 + load-bearing
papers; 5 FULLY-REPLICATED, k3-vs-k1 KL documented infidelity,
missing behavior-shaping recipe, tree+world-model 100% design;
prioritized build order.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

research/design-F1-systems-framing.md ADDED
@@ -0,0 +1,229 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # F1 — Systems Framing: Dataset-Gen+SFT, RL, or BOTH?
2
+
3
+ **Facet question:** Is this a dataset-generation+SFT system, an RL system, or BOTH?
4
+
5
+ **Committed answer: BOTH — and decisively so — structured as TWO LOOPS at two
6
+ timescales, not two phases.** The repo already physically contains both halves.
7
+ The OUTER (slow) loop is dataset/curriculum *construction*; the INNER (fast)
8
+ loop is RL. They feed each other continuously: the inner loop's improved student
9
+ generates the outer loop's next seed traces, and the inner loop's learned
10
+ deliberation-confidence becomes the outer loop's branch gate. This is the report
11
+ §5 verdict ("two loops at different timescales, not two phases") made concrete
12
+ against the code.
13
+
14
+ This is not an architectural opinion — it is forced by what the modules *are*:
15
+
16
+ | Repo component | What it is, mechanically | Loop |
17
+ |---|---|---|
18
+ | `ingestion/claude_code.py` | Claude Code JSONL → `TraceState` (one node per assistant turn; `tool_error` flag; `strip_thinking=False`) — produces seed traces | **OUTER** (dataset) |
19
+ | `teacher_replay.py` | N-teacher OpenRouter replay + `extract_dpo_pairs` — OFFLINE dataset gen, flat depth-1 stars hung off a frozen trace | **OUTER** (dataset) |
20
+ | `datagen/substrates.py` (`SweBenchAdapter`) | SWE-bench tuple → `FeatureDeletionTask` (revert gold patch = synthesize broken repo) — task SYNTHESIS | **OUTER** (curriculum) |
21
+ | `datagen/env.py` `FeatureDeletionEnv.step()/_grade()` | execution oracle: runs action in sandbox, returns observation; `_grade()` = masked `FAIL_TO_PASS` pass-fraction — this is the **RL env reward kernel** | **OUTER** (fitness) AND consumed by **INNER** (`reward_fn`) |
22
+ | `datagen/curriculum.py` `DifficultyCurriculum` | p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02 — selection/sampling for the next round | **OUTER** (selection) |
23
+ | The multi-model MCTS tree controller (to build) | recursion: apply each model's action via `env.step()`, branch again — depth-1 stars → tree | **OUTER** (the core delta) |
24
+ | `trainer/composer_trainer.py` `ComposerReplicationTrainer` | a real `trl.GRPOTrainer` subclass; `total = grpo + α·sdpo + β·trace_replay_dpo` — **this is RL** | **INNER** (fast RL) |
25
+ | `datagen/env.py` `FeatureDeletionEnv.reward_fn` | TRL `RewardFunc` adapter (`prompts, completions → list[float]`) — the env wearing its RL face | **INNER** (RL reward) |
26
+ | `loss.py` `compose_loss` | TRL-free 3-channel harness; Channel 1 = LM cross-entropy on response tokens — **this is the SFT-able face** (BC limit GRPO converges to) | **INNER**, also the **SFT-first floor** |
27
+ | `safety/holdout.py` + `safety/kill_switch.py` (`HeldOutGuard`/`HeldoutSplit`), wired into the trainer | run-level collapse/reward-hacking tripwire on proxy-minus-realeval gap — an **RL-run safeguard** | **INNER** (safety) |
28
+
29
+ So: `FeatureDeletionEnv.reward_fn` is an RL env. `teacher_replay` is offline
30
+ dataset gen. `ComposerReplicationTrainer(trl.GRPOTrainer)` is RL.
31
+ `compose_loss`'s `lm_ce` channel is an SFT harness. `HeldOutGuard` is an RL-run
32
+ safeguard. The system *is* both, by construction.
33
+
34
+ ---
35
+
36
+ ## The SFT-first competence floor, THEN RL (mirrors Cursor's CPT+SFT→RL)
37
+
38
+ `docs/COMPOSER_RECIPE_MAPPING.md` confirms Cursor's ordering is **Continued
39
+ Pretraining → SFT → RL**, and that the repo deliberately *skips* CPT (starts
40
+ from an already-code-tuned base, e.g. `Qwen3-Coder-7B` / `Qwen3-Coder-30B-A3B`).
41
+ That leaves a two-step ordering this facet commits to:
42
+
43
+ 1. **SFT-first (competence floor).** Take the OUTER loop's *clean winning
44
+ trajectories* (oracle-clean `_grade()` passes only — Gate 1) and run standard
45
+ SFT. The carrier already exists: `compose_loss` with `alpha_sdpo=0,
46
+ beta_replay=0` reduces to `_lm_response_ce` — next-token cross-entropy masked
47
+ to assistant-response tokens. This is the "GRPO converges to BC under
48
+ deterministic rewards" limit stated in `loss.py`. It establishes a floor so
49
+ GRPO has a non-degenerate starting policy. (For a clean separation, an SFT
50
+ `Trainer` over the same corpus is equivalent; the point is the *corpus* —
51
+ winning leaves — and the *masking* — response tokens only.)
52
+ 2. **THEN RL.** `ComposerReplicationTrainer` runs GRPO/Dr.GRPO (`make_po_config`)
53
+ on the `FeatureDeletionEnv.reward_fn` execution oracle, with the optional
54
+ SDPO (α) and trace-replay-DPO (β) channels, plus the contested world-model
55
+ next-state head as a **second SDPO mode** (parameter-isolated; report §2/§4).
56
+
57
+ This is exactly the report §5 line: "SFT-first establishes a competence floor on
58
+ clean winning trajectories before RL — mirroring Cursor's CPT+SFT→RL ordering and
59
+ the repo's own outer (datagen/teacher_replay) / inner
60
+ (`ComposerReplicationTrainer`) split."
61
+
62
+ ---
63
+
64
+ ## The single diagram-able data flow
65
+
66
+ ```
67
+ ┌──────────────────────────────────────────────────────────────┐
68
+ │ OUTER LOOP (slow: hours→days; bursty, Spot-friendly, EKS) │
69
+ │ │
70
+ raw base model ───►│ ingestion/claude_code.py (seed traces, TraceState) │
71
+ + Claude traces │ │ │
72
+ │ ▼ │
73
+ │ multi-model MCTS tree controller ──► N models branch │
74
+ │ (divergence-gated; teacher_replay generalized flat→tree) │
75
+ │ │ │
76
+ │ ▼ apply each action │
77
+ │ FeatureDeletionEnv.step() ─► sandbox exec ─► new state │
78
+ │ │ │
79
+ │ ▼ leaf grade │
80
+ │ FeatureDeletionEnv._grade() (masked FAIL_TO_PASS pass-frac) │
81
+ │ │ + DifficultyCurriculum.update (p(1-p) selection) │
82
+ │ ▼ HARVEST + TYPE the divergence │
83
+ └────────────┼───────────────────────────────────────────────────┘
84
+
85
+ ┌──────────────────── S3 ────────────────────────────────────────────────┐
86
+ │ s3://<bucket>/runs/<run_id>/ │
87
+ │ sft_corpus/ ← clean WINNING trajectories (SFT-first) │
88
+ │ dpo_pairs/ ← near-miss (chosen=sibling winner, rejected=loser)│
89
+ │ rl_task_pool/ ← FeatureDeletionTask registry + curriculum priors │
90
+ │ wm_tuples/ ← (state, action, next_state, outcome) — ALL branches│
91
+ │ divergence_pairs/ ← divergence-annotated nodes (where siblings forked)│
92
+ │ holdout/ ← DISJOINT held-out eval anchor (never fed back) │
93
+ │ diloco_rendezvous/ ← round_{NNNNNN}/rank_{RRRR}.pt (ObjectStoreAllReduce)│
94
+ └────────────┬──────────────────────────────────────────────────────────────┘
95
+
96
+ ┌───────── SFT JOB ───────────────┐ (compose_loss lm_ce / SFT Trainer)
97
+ │ sft_corpus → competence-floor ckpt│ ──► seeds the RL init policy
98
+ └────────────┬──────────────────────┘
99
+
100
+ ┌──────────────────────────────────────────────────────────────┐
101
+ │ INNER LOOP (fast: minutes→steps; resilience-bound, GPU) │
102
+ │ ComposerReplicationTrainer (trl.GRPOTrainer subclass) │
103
+ │ total = grpo(reward_fn on _grade) │
104
+ │ + α·sdpo (hint-distill, JSD on aligned post-hint tokens) │
105
+ │ + β·trace_replay_dpo (dpo_pairs) │
106
+ │ + [world-model next-state head — 2nd SDPO mode, gated] │
107
+ │ HeldOutGuard tripwire on proxy−realeval gap (kill-switch) │
108
+ │ DiLoCo outer-sync every ~500-1000 steps via S3 diloco_rendezvous │
109
+ └────────────┬──────────────────────────────────────────────────┘
110
+
111
+ improved student model
112
+
113
+ ┌────────────────────────┴────────────────────────────────────────────────┐
114
+ │ FEEDBACK (why loops, not phases): │
115
+ │ 1. improved student generates the next round's seed traces (back to OUTER)│
116
+ │ 2. its learned deliberation-confidence becomes the next round's branch │
117
+ │ gate (the §3 bootstrap: cross-model disagreement early → learned │
118
+ │ deliberation-confidence later — same lever, two levels) │
119
+ └────────────────────────────────────────────────────────────────────────────┘
120
+ ```
121
+
122
+ ---
123
+
124
+ ## What goes in S3 between the loops (crisp)
125
+
126
+ The boundary between OUTER and INNER is S3 — and on AWS S3 *is* the DiLoCo
127
+ rendezvous backend with zero new code (the `ObjectStoreAllReduce` fsspec path,
128
+ `round_{NNNNNN}/rank_{RRRR}.pt`). The same bucket carries the dataset hand-off.
129
+ Live target bucket in this account/region:
130
+ `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/` (us-west-2;
131
+ a `sagemaker-dynamo-on-eks-hyperpod-*` bucket also already exists, evidence
132
+ HyperPod-on-EKS is provisioned here).
133
+
134
+ | S3 prefix | Contents | Producer (OUTER) | Consumer (INNER) |
135
+ |---|---|---|---|
136
+ | `sft_corpus/` | **SFT corpus** — clean winning trajectories (oracle-clean `_grade()` passes), masked to assistant-response tokens; JSONL | tree controller + `_grade()` Gate 1 | SFT job → `compose_loss` `lm_ce` (or SFT `Trainer`) |
137
+ | `dpo_pairs/` | **DPO pairs** — `{chosen=sibling-winner-or-teacher-consensus, rejected=student/losing-branch, state_messages}` (the `DPOPair` schema from `teacher_replay.py`) | `extract_dpo_pairs` generalized to sibling winners | Channel 3 (`β·trace_replay_dpo`) |
138
+ | `rl_task_pool/` | **RL task pool** — `FeatureDeletionTask` registry (`repo, broken_image, fail_to_pass, pass_to_pass, deleted_symbols, test_command`) + `DifficultyCurriculum` priors | `SweBenchAdapter` + curriculum | `FeatureDeletionEnv` registry → `reward_fn` |
139
+ | `divergence_pairs/` | **Divergence-annotated pairs** — nodes where sibling next-action distributions disagreed (the SDPO privileged-info conditioning variable + which sibling subtree separated first) | tree controller (records pre-expansion divergence) | Channel 2 SDPO `ctx_teacher` splice / `SiblingBootstrapGenerator` |
140
+ | `wm_tuples/` | **World-model next-state tuples** — `(state, action, next_state, outcome)` from **ALL** branches incl. failures (CWM "train-on-all" target; the safe home for failed-branch signal) | every `env.step()` observation, every leaf grade | world-model next-state head (2nd SDPO mode), gated |
141
+ | `holdout/` | **Disjoint held-out eval anchor** — `HeldoutSplit` partition; NEVER fed back to the generator | `HeldoutSplit.split(seed=…)` once | `HeldOutGuard.heldout_eval_fn()` |
142
+ | `diloco_rendezvous/` | DiLoCo pseudo-gradient exchange, `round_{NNNNNN}/rank_{RRRR}.pt` | inner replicas | inner replicas (`ObjectStoreAllReduce`) |
143
+
144
+ Note the keystone (report §4/§2): a *failed* branch is poison for the policy
145
+ gradient but **gold for the world model** — so a loser goes to `wm_tuples/`
146
+ (safe, no policy penalty) and optionally to `dpo_pairs/` as a contrastive
147
+ `rejected` against a sibling winner (never as raw negative gradient), but never
148
+ to `sft_corpus/`. That is "type the signal and route it" realized as S3 prefixes.
149
+
150
+ ---
151
+
152
+ ## AWS-native realization (concrete, this account)
153
+
154
+ - **OUTER loop on EKS** (single control plane): Argo Workflows DAG, one node =
155
+ one divergence-gated branch; vLLM RayService pods for open-weight model
156
+ families + API-egress pods for hosted teachers; gVisor (`runsc` RuntimeClass)
157
+ sandbox pods running `FeatureDeletionEnv._grade()`. Writes all six dataset
158
+ prefixes to S3 via IRSA.
159
+ - **SFT job** between loops: a SageMaker Training Job (`File` mode for the
160
+ `sft_corpus/` < 100 GB; `FastFile` if it grows past ~50–100 GB) OR an EKS GPU
161
+ pod; reads `sft_corpus/`, writes a competence-floor checkpoint to
162
+ `s3://…/runs/<id>/ckpt_sft/`.
163
+ - **INNER loop**: `ComposerReplicationTrainer` on a Karpenter p5/g6e NodePool,
164
+ swappable to a HyperPod-attached node-group (1:1 EKS↔HyperPod mapping, already
165
+ provisioned per the `sagemaker-dynamo-on-eks-hyperpod-*` bucket) for
166
+ resilience-bound long runs. DiLoCo replicas rendezvous only through S3 — no
167
+ cross-job NCCL — so a straggler blocks at the poll loop (`timeout_s=1800`)
168
+ instead of deadlocking a gang. `SageMakerExecutor` is the bursty fallback
169
+ (N single-instance Training Jobs, `EnableNetworkIsolation=False` so the
170
+ container can S3 PUT/GET the rendezvous).
171
+
172
+ The DiLoCo math, `MockManager`, `make_diloco_outer_loop`, the trainer, the loss,
173
+ and the env are **untouched** across all of this — the `ServerlessExecutor`
174
+ Protocol + `ObjectStoreAllReduce` are the entire portability contract.
175
+
176
+ ---
177
+
178
+ ## Repo delta to realize the framing
179
+
180
+ 1. `composer_replication/datagen/tree_controller.py` (NEW, ~250-350 LOC) — the
181
+ recursion: apply each model's candidate action via `FeatureDeletionEnv.step()`,
182
+ branch again, grade leaves, emit the six S3 prefixes. The core OUTER delta.
183
+ 2. `composer_replication/pipeline/s3_layout.py` (NEW, ~80 LOC) — typed writers
184
+ for `sft_corpus/ | dpo_pairs/ | rl_task_pool/ | divergence_pairs/ | wm_tuples/
185
+ | holdout/`; the OUTER→INNER contract in one place.
186
+ 3. `composer_replication/pipeline/sft_floor.py` (NEW, ~60 LOC) — SFT-first
187
+ driver: read `sft_corpus/`, run `compose_loss` (`alpha_sdpo=0, beta_replay=0`)
188
+ or an SFT `Trainer`, write `ckpt_sft/`. Wraps existing `_lm_response_ce`.
189
+ 4. `composer_replication/trainer/composer_trainer.py` (EDIT, ~40 LOC) — add the
190
+ world-model next-state head as a 2nd SDPO mode (parameter-isolated adapter +
191
+ `<deliberate>` token), reading `wm_tuples/`; gated OFF by default.
192
+ 5. `composer_replication/diloco/serverless/eks.py` (EDIT/FINISH) — `EKSExecutor`
193
+ Indexed Job mapping for the inner loop (a few hundred LOC; sibling of
194
+ `ModalSpawnExecutor`).
195
+ 6. `[serverless]` extra: add `s3fs`/`boto3`/`kubernetes` (the documented dep gap).
196
+
197
+ ---
198
+
199
+ ## Open questions
200
+
201
+ - Where exactly does SFT-first run — a dedicated SageMaker Training Job, or an
202
+ EKS pod sharing the inner NodePool? (Cost/latency tradeoff; corpus size gates
203
+ File vs FastFile.)
204
+ - Should the SFT-floor checkpoint be re-derived each outer generation
205
+ (full SFT→RL re-warm) or only once at gen-0 (RL-only thereafter)? The flywheel
206
+ feedback suggests gen-0-only, with RL carrying subsequent gains.
207
+ - Is the world-model head trained inside the inner GRPO trainer (shared step) or
208
+ as a separate offline pass over `wm_tuples/`? Report §2 favors
209
+ parameter-isolation; a separate pass is the strongest isolation.
210
+
211
+ ## Citations
212
+
213
+ - research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md §5 (two
214
+ loops), §2 (world-model head as 2nd SDPO mode), §4 (type-the-signal routing),
215
+ §6 (reuse/build table)
216
+ - composer_replication/datagen/env.py:63-94 (`step`/`_grade`/`reward_fn`)
217
+ - composer_replication/teacher_replay.py:162-262 (`replay_trace`,
218
+ `extract_dpo_pairs`, `DPOPair`)
219
+ - composer_replication/trainer/composer_trainer.py:54-178 (3-channel
220
+ `_compute_loss`), :184-251 (`HeldOutGuard` wiring)
221
+ - composer_replication/loss.py:71-261, :277-304 (`compose_loss`, `_lm_response_ce`)
222
+ - composer_replication/safety/holdout.py (`HeldoutSplit` disjointness)
223
+ - composer_replication/diloco/serverless/{executor.py,allreduce.py,sagemaker.py}
224
+ (Protocol + `ObjectStoreAllReduce` + S3 rendezvous)
225
+ - composer_replication/ingestion/claude_code.py:6-21 (one-node-per-turn,
226
+ `strip_thinking`)
227
+ - docs/COMPOSER_RECIPE_MAPPING.md:48-137 (CPT+SFT→RL ordering; repo skips CPT)
228
+ - Live: `aws s3 ls` → amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d,
229
+ sagemaker-dynamo-on-eks-hyperpod-* (us-west-2, acct 386931836011)
research/design-F2-aws-datagen.md ADDED
@@ -0,0 +1,191 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Facet F2 — AWS-Native Dataset Generation (the outer loop)
2
+
3
+ **Account 386931836011 · us-west-2 · Admin · 2026-06-09**
4
+
5
+ The outer (slow) loop of §5 — ingest → multi-teacher replay → invert SWE substrates into FeatureDeletion tasks → normalize to SFT/DPO corpus on S3 — is **embarrassingly parallel, bursty, fault-tolerant** (report §5, §8). The wrong move is to pick one service for the whole loop. Each of the four stages has a different compute shape, so each gets the right AWS-native primitive, all stitched by **Step Functions** (the AWS-native analog of the report's Argo controller for the *non-cluster* path) and all reading/writing one S3 dataset contract.
6
+
7
+ This note commits to a concrete service per stage and maps every choice to the repo file it backs.
8
+
9
+ ---
10
+
11
+ ## TL;DR — the per-stage verdict
12
+
13
+ | Outer-loop stage | Repo code | Compute shape | **AWS service (committed)** | Why not the others |
14
+ |---|---|---|---|---|
15
+ | (a) Ingest + clean Claude Code traces | `ingestion/claude_code.py` | Light CPU, JSONL → TraceState, IO-bound | **Glue Spark ETL job** (Glue 5.0) | Trivial scale; one Spark job over a JSONL prefix; Glue's Data Catalog + bookmarks are free here. SM Processing also fine but Glue is the cheaper IO-bound default. |
16
+ | (b) Replay each state across N teachers | `teacher_replay.py` | API-bound fan-out, N×T calls, no local GPU | **Bedrock batch inference** (`CreateModelInvocationJob`, one job per teacher) + a thin **EMR Serverless** aggregation step | NOT Glue-Ray (end-of-support). NOT SageMaker-Processing-with-Ray for the *calls* (you'd pay for idle instances waiting on API latency). Bedrock batch = 50% discount, S3-native, the fan-out IS the job. |
17
+ | (c) Synthesize FeatureDeletionEnv tasks (run tests in sandboxes) | `datagen/substrates.py`, `env.py`, `docker_sandbox.py`, `validator.py` | CPU-heavy, untrusted code, 1 container per task, fault-tolerant | **AWS Batch array jobs on EC2 Spot** (container = the existing `DockerSandbox` image) | NOT SM Processing (no per-task isolation/retry granularity; 1 job = 1 fleet). NOT Fargate-only (no `--privileged`/gVisor, 4-vCPU cap, no Spot-as-cheap). Batch array index = task index, matches SWE-bench `--max_workers` model exactly. |
18
+ | (d) Normalize → SFT corpus + DPO pairs | `replaysim/normalize.py` (data-juicer), `teacher_replay.extract_dpo_pairs` | CPU Spark/dataframe dedup + the data-juicer op-graph | **EMR Serverless (Spark)** running data-juicer's Ray/standalone executor in `--mode local` per partition, OR SM Processing for the single-node data-juicer path | EMR Serverless is the Spark-style dataframe normalize/dedup home; data-juicer runs CPU-only here. Glue ETL would also work but EMR Serverless gives finer Spark control + Graviton price/perf. |
19
+
20
+ The recurring principle: **API-bound and test-execution work must NOT live on a service that bills for idle wall-clock** (Glue/SM Processing keep instances hot while you wait on Bedrock or a 5-minute pytest). Bedrock batch is async-priced; Batch array jobs are per-task Spot. Spark-shaped dataframe work (ingest, dedup, normalize) goes to the Spark services (Glue ETL for the trivial ingest, EMR Serverless for the heavy normalize).
21
+
22
+ ---
23
+
24
+ ## Stage (a) — Ingest + clean traces · Glue 5.0 Spark ETL
25
+
26
+ `ingestion/claude_code.py::ClaudeCodeIngester.ingest()` is a pure JSONL→TraceState transform (one TraceState per assistant turn; `strip_thinking=False` per report §6 — 67% of error-recovery turns are pure thinking). This is IO-bound dataframe work over a prefix of session files.
27
+
28
+ **Service: AWS Glue 5.0 Spark ETL job.**
29
+ - Input: `s3://<datagen-bucket>/raw/claude_code/**/*.jsonl` (Claude Code session JSONL uploaded from `~/.claude/projects`).
30
+ - The Glue script `mapPartitions` over the JSONL files; each partition runs `ClaudeCodeIngester().ingest(path)` unchanged (it already yields `TraceState`), filters subagent/sidechain records, and writes Parquet.
31
+ - Output: `s3://<datagen-bucket>/traces/v1/run_id=<id>/part-*.parquet` partitioned `by run_id`.
32
+ - Glue Data Catalog table `composer_traces` makes the trace store queryable by Athena (debug: "how many error-site turns this run?").
33
+
34
+ Why Glue over SM Processing here: ingestion is small, periodic, IO-bound; Glue's per-DPU-second billing + job bookmarks (skip already-ingested sessions) fit better than spinning a Processing fleet. `ClaudeCodeIngester` runs verbatim inside the Spark UDF — zero repo change to the ingester itself.
35
+
36
+ > **Live caveat (worth flagging):** `ClaudeCodeIngester.ingest()` is a per-file, single-pass, *record-parallel* pure-Python iterator with no shuffle/join — i.e. it is **not** intrinsically Spark-shaped. **SageMaker Processing** with `ProcessingInput.S3DataDistributionType = ShardedByS3Key` (custom container, the ingester runs unchanged, slice config from `processingjobconfig.json`/`resourceconfig.json`) is the *cleaner* primitive for this exact shape and avoids any dependence on the Glue stack — relevant because **AWS Glue-for-Ray is now closed to new customers** (confirmed in the Glue engine docs), so leaning on Glue here means committing to Glue-for-Spark specifically. Verdict stands as Glue ETL for the trivial periodic ingest (catalog + bookmarks are a real convenience), with SM Processing as the equally-valid, Glue-independent fallback. Either way the ingester code is untouched.
37
+
38
+ ---
39
+
40
+ ## Stage (b) — N-teacher replay · Bedrock batch, NOT OpenRouter
41
+
42
+ Today `teacher_replay.py` calls **OpenRouter** over httpx with `DEFAULT_TEACHERS = [claude-opus, gpt-5, deepseek-v4]` and a `max_total_usd` cap. The facet question: stay API or move to Bedrock?
43
+
44
+ ### Verdict: move the AWS-native default to **Bedrock batch inference**, keep OpenRouter as a fallback adapter.
45
+
46
+ Reasoning grounded in research + repo:
47
+
48
+ 1. **Bedrock batch = 50% of on-demand pricing**, S3-in/S3-out, ~24h turnaround SLA. The replay is the single most expensive piece of the whole system (report §10: ~$0.98/trace at N=3, ~$64/trace at 8-teacher×1000-step). A 50% cut on the dominant cost line is the highest-leverage AWS-native move in this facet, and replay is *exactly* the latency-insensitive offline workload Bedrock batch targets.
49
+ 2. **The fan-out IS one job per teacher.** Bedrock batch does not multiplex models in a single job — `CreateModelInvocationJob` takes one `modelId`. So the N-teacher fan-out = N batch jobs over the *same* JSONL, each writing to its own S3 prefix. The flat `for state: for teacher:` loop in `replay_trace` becomes "emit one shared JSONL of all states, submit N jobs." This is structurally *simpler* than the current per-call gather.
50
+ 3. **Heterogeneity survives on Bedrock — VERIFIED LIVE in this account.** `aws bedrock list-foundation-models --region us-west-2` returns a genuinely multi-lab pool: `anthropic.claude-opus-4-8`, `anthropic.claude-opus-4-7`, `anthropic.claude-opus-4-6-v1`, `anthropic.claude-sonnet-4-6`, `anthropic.claude-haiku-4-5-...`, `deepseek.v3.2`, `deepseek.r1-v1:0`, `meta.llama4-maverick-17b-instruct`, `meta.llama3-3-70b-instruct`, etc. That is Claude + DeepSeek + Llama = three different families, satisfying the report's N≥3 heterogeneous-population anti-collapse safeguard (§5 #4) and the "drop same-family teachers if Claude is the student" leakage rule (§6).
51
+ - **CRITICAL batch-eligibility split (read live from the Bedrock "Supported Regions and models for batch inference" table):** not every model has a *batch* ID. **Batch-eligible** in/through `us-west-2`: **DeepSeek V3.2** (`deepseek.v3.2`, single-region `us-west-2`), **Claude Sonnet 4.6 / Opus 4.6** (via the `us.` **cross-region inference profile** whose destinations include `us-west-2`), and the Nova/Titan families. **On-demand ONLY (no ARN-versioned batch ID, so NOT batchable):** **Claude Opus 4.7 / 4.8** — reachable via `bedrock-runtime` `Converse`/`InvokeModel` and fanned out with a bounded async/thread pool exactly like today's `replay_trace`, but you pay on-demand. So the cheap bulk batch pool is `{deepseek.v3.2, us.anthropic.claude-sonnet-4-6, us.anthropic.claude-opus-4-6}`; if the teacher set must include 4.7/4.8 they ride the on-demand path or are substituted by batchable Opus 4.6.
52
+ 4. **Bedrock data does not train Anthropic models** (governance) — a real reason to prefer it over OpenRouter for a self-improving loop.
53
+
54
+ ### Concrete wiring
55
+
56
+ `teacher_replay.py` gains a `BedrockBatchTeacherPool` alongside the OpenRouter path (the `TeacherSpec` slug becomes a Bedrock `modelId` or inference-profile id like `us.anthropic.claude-haiku-4-5-...`):
57
+
58
+ ```
59
+ # new: teacher_replay_bedrock.py (~180 LOC)
60
+ def submit_replay_batch(states, teachers, s3_in, s3_out, role_arn) -> list[jobArn]:
61
+ # 1. write ONE shared abc.jsonl: {recordId: state_id, modelInput: {anthropic_version, messages, max_tokens}}
62
+ # one record per state (messages = state["messages"]). recordId == state_id is the join key.
63
+ # 2. for each teacher: bedrock.create_model_invocation_job(
64
+ # modelId=teacher.model_id, jobName=..., roleArn=role_arn,
65
+ # inputDataConfig={"s3InputDataConfig":{"s3Uri": s3_in}},
66
+ # outputDataConfig={"s3OutputDataConfig":{"s3Uri": f"{s3_out}/{teacher.slug}/"}})
67
+ # 3. return jobArns; poll get_model_invocation_job until Completed.
68
+ ```
69
+
70
+ The output `.jsonl.out` rows carry `recordId` (=state_id) + `modelOutput`; an EMR Serverless step joins all N teacher outputs back by `state_id` into the exact `list[TeacherCallResult]` shape `extract_dpo_pairs` already consumes — so `extract_dpo_pairs` and the entire DPOPair contract are **byte-for-byte untouched**.
71
+
72
+ **Constraints to honor (from research):** Bedrock batch input file ≤ 1 GB, total job ≤ 5 GB, default 100k records/job (adjustable via Service Quotas), ≥ a minimum-records floor per model. The submitter must shard a >100k-state run into multiple JSONL files. A `BedrockBatchTeacherPool` is the right home for that sharding.
73
+
74
+ **Where OpenRouter stays:** when N>3 teachers from labs Bedrock doesn't host (e.g. a brand-new frontier model), or when sub-24h latency is needed for a hot debugging loop. The `TeacherSpec` TypedDict already abstracts provider; we add a `provider: "bedrock"|"openrouter"` field and route. This keeps `replay_trace`'s async OpenRouter path as the low-latency escape hatch and Bedrock batch as the cheap bulk default.
75
+
76
+ ### Why NOT Glue-Ray / SM-Processing-with-Ray for the fan-out
77
+ The facet floats "Glue Ray jobs / SageMaker Processing with Ray for the N-model parallel fan-out." **AWS Glue for Ray is end-of-support** — AWS now explicitly recommends Ray-on-EKS (KubeRay) instead (docs: *AWS Glue for Ray end of support*). And running the *teacher API calls* on a Ray cluster (Glue or SM Processing) means paying for idle CPU/instances while threads block on model latency — the anti-pattern above. Ray's place in this facet is **stage (c)** orchestration of sandbox fan-out *if* we ever want Ray semantics, but Batch array jobs are the simpler AWS-native fit there too. So: **Bedrock batch for the calls, EMR Serverless to fan-in.**
78
+
79
+ ---
80
+
81
+ ## Stage (c) — FeatureDeletion synthesis + test execution · AWS Batch array jobs on Spot
82
+
83
+ This is the compute-heavy, genuinely-new part (report §8: "per-branch sandbox isolation is the throughput ceiling of the whole idea"). Two sub-steps:
84
+
85
+ **c1. Schema inversion** — `SweBenchAdapter.to_task(instance)` (`substrates.py`) maps a SWE-bench-shaped row → `FeatureDeletionTask` (revert gold patch = broken repo; FAIL_TO_PASS = reward target; PASS_TO_PASS = guard). Pure CPU, no Docker. This runs inside the **Glue ingest job (a)** or a tiny Lambda — it's a dict transform. `is_redistributable()` license-gate (GPL/AGPL/LGPL filter) runs here.
86
+
87
+ **c2. Materialize + validate the broken repo** — this is where it gets heavy. For each task: pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()` (the PRIMARY reward-hack control — strip `__pycache__`/`.git`/`*.pyc`), run the test command, confirm FAIL_TO_PASS actually fails and PASS_TO_PASS actually passes (the 4-gate `validator.py` solvability check). Each task = one untrusted, isolated, ~1-5 minute container run.
88
+
89
+ **Service: AWS Batch array jobs, EC2 Spot compute environment, container = the existing `DockerSandbox` image.**
90
+
91
+ ```
92
+ batch.submit_job(
93
+ jobName=f"fd-validate-{run_id}",
94
+ jobQueue="fd-sandbox-spot",
95
+ jobDefinition="fd-docker-sandbox:N", # the DockerSandbox image in ECR
96
+ arrayProperties={"size": n_tasks}, # up to 10,000 children
97
+ containerOverrides={"environment": [{"name":"TASK_MANIFEST_S3","value": s3_manifest}]})
98
+ ```
99
+
100
+ - Each array child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task at that line in the S3 task manifest (`tasks/v1/run_id=<id>/manifest.jsonl`), boots the broken image, runs `validator` + `FeatureDeletionEnv._grade()` against `LocalSubprocessSandbox`/`DockerSandbox`, writes one result row to `s3://.../task_grades/<run_id>/<task_id>.json`.
101
+ - This is **exactly the SWE-bench harness `--max_workers` model**: SWE-bench runs 1 container per instance with N parallel workers; DeepSWE hit Docker-daemon limits at 512 containers/iteration and preloaded images onto NVMe (report §8). Batch array jobs give that fan-out with managed retry (`retryStrategy`), Spot interruption handling, and per-child CloudWatch logs — without us building a container scheduler.
102
+ - **Isolation tier (report §8 layered posture):** the `DockerSandbox` already bakes the lockdown recipe (`network_mode=none`, `read_only`, `cap_drop=ALL`, `no-new-privileges`, `pids_limit`, gVisor `runtime='runsc'` when available). On Batch EC2 (not Fargate), we can install gVisor on the AMI and run untrusted model-generated fix attempts under `runsc` by default; Kata+Firecracker for adversarial code requires **self-managed** node groups (the EKS-MNG CPU-Options gotcha from §8 applies to Batch managed EC2 too — use a self-managed launch template if nested virt is needed).
103
+
104
+ ### Why NOT SageMaker Processing for the sandboxes
105
+ SM Processing is one job = one fixed instance fleet with `ShardedByS3Key` splitting input across instances. It has **no per-task retry**, no Spot-native interruption recovery at task granularity, and no array-index primitive — a single poisoned task (infinite loop, fork bomb) can wedge a whole Processing instance's shard. Batch array jobs isolate each task to its own container with its own timeout (`DockerSandbox.exec_timeout_s` + Batch `timeout`), retry, and Spot replacement. For "10,000 untrusted code executions, each independent, some will hang," Batch is the textbook fit and SM Processing is not.
106
+
107
+ ### Why NOT plain Fargate
108
+ Fargate caps at 4 vCPU / 30 GB without `--privileged`, cannot run gVisor/Kata, and is pricier than Spot for bursty bulk. Batch-on-EC2-Spot is 60-70% cheaper for this fault-tolerant fan-out (report §10 Spot savings).
109
+
110
+ ---
111
+
112
+ ## Stage (d) — Normalize → SFT corpus + DPO pairs · EMR Serverless (Spark)
113
+
114
+ After (b) fan-in produces `list[TeacherCallResult]` and `extract_dpo_pairs` produces `DPOPair` rows, `replaysim/normalize.py::DJNormalizer` runs the **data-juicer op-graph** (`recipes/replaysim/default.yaml`: length filter ×2, words-num ×2, special-char ×2, document_deduplicator). ADR-004 chose data-juicer precisely because the op set is **CPU-only** (no NeMo-Curator GPU dep).
115
+
116
+ **Service: EMR Serverless (Spark) application, Graviton.**
117
+ - The normalize is Spark-shaped dataframe work: read all DPOPair rows for a run, run the data-juicer op-graph, dedup across the corpus, write the partitioned corpus.
118
+ - data-juicer's `DefaultExecutor` runs in-process per Spark partition (the repo's `DJNormalizer.normalize()` already does file-in/file-out via `init_configs` + `DefaultExecutor`). On EMR Serverless we `mapPartitions` → run `DJNormalizer(skip_dj=False).normalize(partition_rows)` per partition, then a Spark `dropDuplicates` for cross-partition full-corpus dedup (the repo notes `document_deduplicator` is per-batch only — Spark closes that gap natively).
119
+ - EMR Serverless auto-scales workers, bills per vCPU/GB-second on Graviton, and is the AWS-native Spark home for "dataframe normalize + dedup at scale." The repo's `replaysim` already *is* the op-graph — EMR Serverless just runs it distributed.
120
+
121
+ Why EMR Serverless over Glue ETL here: the normalize is the heavier of the two Spark stages, data-juicer wants control over the Python runtime (custom `--py-files`/wheel), and EMR Serverless gives finer Spark tuning + Graviton price/perf than Glue's DPU model. Glue ETL stays the choice for the *light* ingest (a). Both are valid Spark; we split by weight.
122
+
123
+ **Two output corpora** (the dataset contract below): the SFT corpus (clean winning trajectories — report §5 "SFT-first competence floor") and the DPO pairs (from `extract_dpo_pairs` + future execution-oracle near-miss rejects).
124
+
125
+ ---
126
+
127
+ ## The S3 dataset contract
128
+
129
+ One bucket per environment (reuse the existing `amazon-sagemaker-386931836011-us-west-2-*` or a dedicated `composer-datagen-386931836011-us-west-2`). Layout is **Hive-partitioned by `run_id`** so each outer-loop generation is an immutable, addressable slice (the report's "generation" in the GA framing, §3/§5) and Athena/Glue Catalog can query across generations.
130
+
131
+ ```
132
+ s3://composer-datagen-386931836011-us-west-2/
133
+ raw/claude_code/**/*.jsonl # (a) input: uploaded sessions
134
+ traces/v1/run_id=<id>/part-*.parquet # (a) out: TraceState rows
135
+ tasks/v1/run_id=<id>/manifest.jsonl # (c1) FeatureDeletionTask rows (1/line; array index = line)
136
+ replay/v1/run_id=<id>/
137
+ input/states.jsonl # (b) shared Bedrock batch input (recordId=state_id)
138
+ teacher=<slug>/*.jsonl.out # (b) per-teacher Bedrock batch output
139
+ task_grades/v1/run_id=<id>/<task_id>.json # (c2) validator + _grade() results
140
+ corpus/v1/run_id=<id>/
141
+ sft/part-*.parquet # (d) SFT corpus (clean winners)
142
+ dpo/part-*.parquet # (d) DPO pairs (normalized DPOPair)
143
+ manifests/run_id=<id>.json # run-level manifest: counts, cost, lineage, schema_version
144
+ diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt # (separate) inner-loop ObjectStoreAllReduce (already exists)
145
+ ```
146
+
147
+ **Format decisions:**
148
+ - **Parquet** for `traces/`, `corpus/sft/`, `corpus/dpo/` — columnar, compressed, Athena-queryable, the SFT/DPO training read path (TRL `load_dataset` reads Parquet directly). This is the durable corpus.
149
+ - **JSONL** for `replay/input/`, `tasks/manifest`, `task_grades/` — because that's what Bedrock batch *requires* (`s3InputFormat=JSONL`), what AWS Batch array-index line-lookup wants, and what `DPOPair`/`TraceState` natively serialize to. JSONL is the wire format between stages; Parquet is the corpus-at-rest format.
150
+ - **`recordId == state_id`** is the universal join key linking trace → replay → DPO pair. Already true: `TraceState.state_id` is `f"{path.stem}::{idx:04d}"` and `DPOPair.state_id` carries it through.
151
+ - **`manifests/run_id.json`** is the run-level contract: `{run_id, schema_version, n_traces, n_tasks, n_dpo_pairs, teacher_pool, bedrock_job_arns, total_cost_usd, parent_run_id}`. `parent_run_id` threads the flywheel lineage (report §5: improved student regenerates next round's traces). The held-out-eval guard (`safety/HeldoutSplit`) reads `schema_version` + a `split` column to enforce the disjoint held-out set the report flags as the load-bearing gap (§7 Pushback 4).
152
+
153
+ ---
154
+
155
+ ## Orchestration: Step Functions (the non-cluster Argo analog)
156
+
157
+ Report §8 uses Argo Workflows on EKS as the outer-loop controller. For the **AWS-native, no-persistent-cluster** path this facet targets, the equivalent is **AWS Step Functions** (Standard workflow) driving the DAG:
158
+
159
+ ```
160
+ Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless)
161
+ → ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless)
162
+ → WriteManifest(Lambda)
163
+ ```
164
+
165
+ Step Functions has native `.sync` integrations for **Glue**, **EMR Serverless**, **Batch** (`arn:aws:states:::batch:submitJob.sync` — blocks until the whole array completes), and **SageMaker**, plus a `Map` state for the N-teacher Bedrock fan-out. This makes the whole outer loop one declarative state machine, retryable, with per-stage IAM. If/when the program moves to the EKS-primary path of §8, the same stage boundaries lift to Argo DAG nodes — the S3 contract is identical, so it's a controller swap, not a rewrite.
166
+
167
+ ---
168
+
169
+ ## Repo delta (what to build in `composer_replication/`)
170
+
171
+ 1. **`composer_replication/teacher_replay_bedrock.py`** (~180 LOC) — `BedrockBatchTeacherPool`: `submit_replay_batch()` writes the shared states JSONL, submits one `create_model_invocation_job` per teacher, polls, and parses `.jsonl.out` back into `list[TeacherCallResult]`. Add `provider`/`model_id` to `TeacherSpec`. `extract_dpo_pairs` untouched.
172
+ 2. **`composer_replication/datagen/aws/batch_validate.py`** (~120 LOC) — the Batch array-child entrypoint: read `AWS_BATCH_JOB_ARRAY_INDEX` → task manifest line → boot `DockerSandbox`/`LocalSubprocessSandbox` → run `validator` + `_grade()` → write `task_grades/.../{task_id}.json`. Plus a `submit_validate_array()` helper.
173
+ 3. **`composer_replication/datagen/aws/glue_ingest_job.py`** (~80 LOC) — Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; writes `traces/` Parquet + Glue Catalog table.
174
+ 4. **`composer_replication/replaysim/emr_normalize_job.py`** (~100 LOC) — EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; writes `corpus/dpo/` + `corpus/sft/` Parquet.
175
+ 5. **`composer_replication/datagen/aws/s3_contract.py`** (~120 LOC) — the S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers for `TraceState`/`FeatureDeletionTask`/`DPOPair`, the `recordId==state_id` join helpers, and `schema_version`/`split` column injection for the held-out guard.
176
+ 6. **`infra/datagen_stepfunctions.json`** (+ a thin CDK/`infra/datagen_stack.py`) — the Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless app, Glue role). ~250 LOC IaC.
177
+ 7. **`pyproject.toml`** — extend the `[aws]`/`[datagen]` extras with `boto3` for Bedrock/Batch/Glue/EMR clients (the `[serverless]` extra already needs `s3fs`/`boto3` per report §9).
178
+ 8. **Dockerfile** — the `DockerSandbox` ECR image (also the Batch job-definition image), baking gVisor `runsc` on the AMI/launch-template for default untrusted-code isolation.
179
+
180
+ The trainer, loss, `FeatureDeletionEnv`, curriculum, monitor, `extract_dpo_pairs`, `DJNormalizer` op-graph, and the DiLoCo/`ObjectStoreAllReduce` inner loop are all **untouched** — every AWS-native piece wraps an existing repo entrypoint and reads/writes the S3 contract.
181
+
182
+ ---
183
+
184
+ ## Open questions
185
+
186
+ - **Bedrock batch 24h SLA vs flywheel cadence.** Report §5 says outer-loop cadence is hours-to-days, so 24h batch turnaround is acceptable for bulk generations — but a fast bootstrap iteration may want the OpenRouter low-latency path. Need a `batch_vs_realtime` policy knob keyed on `max_total_usd` + urgency.
187
+ - **Per-branch sandbox cold-start is the §8/§10 explicit falsifier.** If Batch array job startup + image pull dominates wall-clock at target fan-out even with gVisor, the report's demotion path applies: keep Batch for control/grading but move bulk sandbox execution to a container-free pool (SWE-MiniSandbox class) or a warm Spot fleet. Must instrument cold-start in `batch_validate.py`.
188
+ - **Cross-region inference profiles for Bedrock batch** (`us.anthropic...`) route to multiple regions for throughput — confirm the batch service role + S3 bucket policy allow the destination regions, else throttling.
189
+ - **DeepSeek on Bedrock batch in us-west-2 — RESOLVED LIVE:** `deepseek.v3.2` is batch-eligible single-region in `us-west-2`; `deepseek.r1-v1:0` is in the on-demand catalog. Claude batch IDs are the `us.` cross-region inference profiles (`us.anthropic.claude-sonnet-4-6`, `us.anthropic.claude-opus-4-6`). Still confirm **minimum-records floors per model** + the **records/job & file-size quotas** in Service Quotas at build time (adjustable, but the `BedrockBatchTeacherPool` must shard input JSONL to fit).
190
+ - **Opus 4.7/4.8 are on-demand-only (no batch ID).** If the heterogeneity ablation (report §3) wants the newest Claude as a teacher, it costs on-demand pricing — budget for it or substitute the batchable Opus 4.6. GPT-5 (in `DEFAULT_TEACHERS`) is not on Bedrock at all → that one family member stays on the OpenRouter escape hatch or is replaced by Bedrock Llama 4 / DeepSeek as the third family.
191
+ - **`golden_diff` must be ACL-isolated.** It is `repr=False` in `FeatureDeletionTask` and held out of the policy observation; on S3 it must live in a *separate* deny-by-default prefix (`tasks/golden/`), never co-located with the policy-visible `tasks/v1/...` — oracle-cleanliness Gate 1 (report §4).
research/design-F3-rl-sagemaker.md ADDED
@@ -0,0 +1,299 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # F3 — The RL System on AWS, Runnable NOW (SageMaker us-west-2)
2
+
3
+ **Status:** design, runnable-today. **Account:** 386931836011, **region:** us-west-2, **role:** Admin (Isengard).
4
+ **Live facts verified 2026-06-09** (see "Live AWS findings" below). Grounds in the deep-research report
5
+ (`research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §9 SageMaker path / hybrid, §10 phased plan)
6
+ and the actual repo code (`SageMakerExecutor`, `ComposerReplicationTrainer`, `ObjectStoreAllReduce`,
7
+ `replica_entrypoint`, `examples/gsm8k_grpo`).
8
+
9
+ ---
10
+
11
+ ## 0. TL;DR (committed)
12
+
13
+ 1. **For the first runnable GPU smoke RIGHT NOW: a single SageMaker Training Job, `ml.g5.2xlarge`, BYO-container
14
+ extended from the AWS PyTorch DLC.** This is NOT the `SageMakerExecutor` N-replica path — it is plain GRPO on one
15
+ GPU. The executor's multi-replica DiLoCo rendezvous is the *next* step, not the smoke. Reason: the smoke's job is
16
+ to prove the trainer + reward + vLLM rollout works on a real GPU at minimum cost and zero quota friction.
17
+ 2. **GRPO rollout = vLLM colocated in the training container** (`use_vllm=True, vllm_mode="colocate"`). TRL 1.5's
18
+ default is colocate; it runs vLLM in the *same process* sharing the training GPU at `vllm_gpu_memory_utilization=0.3`.
19
+ No separate inference endpoint for the smoke. The `server` mode (`trl vllm-serve`) and VeRL's `AsyncServer` are the
20
+ scale answer for tool-heavy agentic rollouts later (report §8) — not for a 0.5B GSM8K smoke.
21
+ 3. **Platform decision:** Training Jobs for the bursty smoke and periodic small-model runs (this facet); **HyperPod
22
+ (attached to EKS)** for the long, resilience-bound inner GRPO loop (report §9). Both share the identical S3
23
+ `ObjectStoreAllReduce` rendezvous, so a run moves between them with zero trainer/loss/DiLoCo change.
24
+ 4. **The `SageMakerExecutor` (already built, mock-tested) drives N independent single-instance Training Jobs**, each
25
+ tagged `REPLICA_RANK=i`/`WORLD_SIZE=N` via the `Environment` map, all pointed at one `s3://.../rendezvous/` prefix.
26
+ It is the bursty-fallback DiLoCo backend. To make it run live we need a built+pushed container, real
27
+ `role_arn`/`image_uri`/`output_s3_path`, and a non-zero quota for N concurrent training jobs.
28
+
29
+ ---
30
+
31
+ ## 1. Live AWS findings (verified 2026-06-09, this account/region)
32
+
33
+ | Fact | Value | Consequence |
34
+ |---|---|---|
35
+ | Caller identity | `arn:aws:sts::386931836011:assumed-role/Admin/baladita-Isengard` | Admin — can create roles, push ECR, run training jobs. |
36
+ | SageMaker default bucket (us-west-2) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` | Use as the rendezvous + output bucket — already covered by `AmazonSageMakerFullAccess`. |
37
+ | Existing exec roles | `AmazonSageMaker-ExecutionRole-20250725T133247` (and ...20241223T...) | `role_arn = arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247`. |
38
+ | Exec role policies | `AmazonSageMakerFullAccess` + custom `AmazonSageMaker-ExecutionPolicy-...` | FullAccess grants S3 to buckets named `*SageMaker*`/`*sagemaker*` — so the rendezvous bucket MUST be the SageMaker bucket above (or a `sagemaker-*` bucket) or you must attach an explicit S3 policy. **This is the IAM gotcha the executor docstring flags.** |
39
+ | `ml.g5.2xlarge for training job usage` | **1.0** (non-zero!) | Single-replica g5 smoke runs IMMEDIATELY, no quota request. |
40
+ | `ml.g5.2xlarge for spot training job usage` | **1.0** | Spot smoke also available (70% cheaper). |
41
+ | `ml.g5.12xlarge for training job usage` | **1.0** | One 4×A10G box available for a 7B run later. |
42
+ | `ml.g6.2xlarge for training job usage` | **0.0** | g6 (L4) needs a Service Quotas increase first — prefer g5 for the smoke. |
43
+ | g5.2xlarge EC2 offering | us-west-2a/b/c | Capacity exists across AZs. |
44
+ | Already present | `sagemaker-dynamo-on-eks-hyperpod-*` bucket | Confirms HyperPod-on-EKS has been used here — the report's §9 hybrid is live-reachable. |
45
+ | boto3 / sagemaker SDK locally | NOT installed | `pip install -e .[aws]` + `pip install sagemaker` on the launch host (laptop/Studio), not in the repo's hard deps. |
46
+
47
+ **The single most important runnable-now fact:** g5.2xlarge training-job quota is already 1 — the smoke needs no
48
+ quota ticket. (Default for these GPU types is 0; this account has been bumped to 1.)
49
+
50
+ ---
51
+
52
+ ## 2. Training Jobs vs HyperPod vs EKS — when each (report §9, §10)
53
+
54
+ - **SageMaker Training Jobs (THIS facet, bursty inner loop / smoke).** Ephemeral, pay-per-second, `boto3.create_training_job`,
55
+ zero persistent cluster. Right for: the first GPU smoke, periodic/smaller-model runs, the `SageMakerExecutor`
56
+ DiLoCo fallback. re:Post guidance: Training Jobs fit *periodic / smaller-model / pay-per-use*. The 28-day max runtime
57
+ and per-job cold-start (instance provisioning ~3-6 min) are acceptable for bursty work. Warm pools
58
+ (`KeepAlivePeriodInSeconds`) cut cold-start on repeated launches — but note this account's `g5 training warm pool
59
+ usage` quota is 0, so warm pools need a quota bump.
60
+ - **SageMaker HyperPod attached to EKS (long resilience-bound inner loop).** Report §9: HyperPod maps 1-to-1 to an EKS
61
+ control plane (one EKS cluster = one HyperPod node-group in a VPC), with auto-detect-and-replace of faulty
62
+ accelerators and PyTorch job auto-resume. Right for: continuous/large-model/persistent multi-day RL where a node
63
+ failure on a Training Job would lose the run. The `sagemaker-dynamo-on-eks-hyperpod-*` bucket shows this is already
64
+ exercised here. **"Use HyperPod for the inner loop" does NOT mean leaving EKS** — it is a node-group swap on the same
65
+ control plane. Build target: the future `EKSExecutor` targets both Karpenter GPU nodes and HyperPod nodes transparently.
66
+ - **Plain EKS (primary for everything else — report §8).** Outer MCTS/sandbox/dataset loop, vLLM RayService rollout
67
+ groups, gVisor/Kata sandbox pods, Argo controller. The inner GRPO trainer is the one piece that swaps between a
68
+ Karpenter p5/g6e NodePool and a HyperPod node-group.
69
+
70
+ **Decision for F3:** SageMaker Training Jobs now (smoke + `SageMakerExecutor` DiLoCo fallback); HyperPod-on-EKS later
71
+ for the long inner run. Same S3 rendezvous throughout.
72
+
73
+ ---
74
+
75
+ ## 3. The first runnable smoke: Qwen2.5-0.5B GRPO on GSM8K, single g5.2xlarge Training Job
76
+
77
+ ### 3.1 Shape
78
+ One Training Job, `InstanceCount=1`, `ml.g5.2xlarge` (1× A10G, 24 GB). GRPO with vLLM **colocated** in the
79
+ training container. This is the `examples/gsm8k_grpo/run.py` recipe lifted from CPU to one real GPU, with vLLM turned on.
80
+ It does **not** exercise the DiLoCo rendezvous (that's §4). It proves: container builds, trainer runs on GPU, vLLM
81
+ rollout works, reward fires, checkpoint lands in S3.
82
+
83
+ ### 3.2 Container — BYO extended from the AWS PyTorch DLC (do NOT use the stock HF DLC)
84
+ - **Base:** `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker`
85
+ (verified present in the us-west-2 DLC registry; 763104351884 is the AWS DLC account). The DLC already has the SageMaker
86
+ training toolkit, CUDA, and a working torch — so vLLM's CUDA wheels match.
87
+ - **Why not the stock HF DLC (`huggingface-pytorch-training:4.49.0`)?** It pins transformers 4.49 and does NOT bundle
88
+ `trl` or `vllm`; you'd be pip-installing the whole RL stack anyway. Extending the PyTorch DLC gives a clean,
89
+ version-controlled layer.
90
+ - **Why a prebuilt ECR image and not `source_dir`+`requirements.txt`?** Installing `vllm` + `trl` + `flash-attn` at job
91
+ start over `requirements.txt` adds 5-10 min of cold-start per job and is a flaky failure surface (wheel/CUDA mismatch).
92
+ Bake them into the image once, push to the account's private ECR. `source_dir` is fine for *just the training script*
93
+ layered on top, but the heavy deps must be baked.
94
+
95
+ `docker/Dockerfile.sagemaker`:
96
+ ```dockerfile
97
+ FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker
98
+ # RL stack (baked, not pip-at-startup)
99
+ RUN pip install --no-cache-dir \
100
+ "trl>=0.12" "peft>=0.13" "accelerate>=1.0" "datasets>=3.0" \
101
+ "vllm" "fsspec>=2024.6" "s3fs>=2024.6"
102
+ # The framework itself
103
+ COPY . /opt/composer_replication
104
+ RUN pip install --no-cache-dir -e "/opt/composer_replication[train,serverless]"
105
+ # SageMaker invokes the image; for the smoke we use a plain GRPO entry script,
106
+ # for the DiLoCo path the executor passes ContainerEntrypoint explicitly.
107
+ ENV HF_HOME=/opt/ml/input/hf_cache
108
+ ```
109
+ Build + push (Admin, one-time):
110
+ ```bash
111
+ aws ecr create-repository --repository-name composer-rl --region us-west-2
112
+ aws ecr get-login-password --region us-west-2 | docker login --username AWS \
113
+ --password-stdin 386931836011.dkr.ecr.us-west-2.amazonaws.com
114
+ docker build -f docker/Dockerfile.sagemaker -t 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke .
115
+ docker push 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke
116
+ ```
117
+
118
+ ### 3.3 The smoke training script — `examples/gsm8k_grpo/run_sagemaker.py`
119
+ A thin GPU variant of `examples/gsm8k_grpo/run.py`. Same `gsm8k_reward` (RLVR `#### NUMBER` regex), same
120
+ `ComposerReplicationTrainer(alpha_sdpo=0, beta_replay=0)` (plain GRPO — channels 2/3 off). Differences from the CPU example:
121
+ ```python
122
+ from trl import GRPOConfig
123
+ config = GRPOConfig(
124
+ output_dir="/opt/ml/model", # SageMaker uploads this to OutputDataConfig.S3OutputPath
125
+ per_device_train_batch_size=8,
126
+ num_generations=8,
127
+ max_prompt_length=512,
128
+ max_completion_length=256,
129
+ learning_rate=1e-5,
130
+ max_steps=20, # smoke — minutes, not hours
131
+ logging_steps=1,
132
+ save_strategy="no",
133
+ bf16=True, # A10G supports bf16
134
+ # --- the rollout path: vLLM colocated in-process on the same GPU ---
135
+ use_vllm=True,
136
+ vllm_mode="colocate", # TRL 1.5 default; same process, no server
137
+ vllm_gpu_memory_utilization=0.3, # leave 70% for the 0.5B policy + grads + KV
138
+ vllm_tensor_parallel_size=1,
139
+ beta=0.04, # small KL-to-ref; or 0.0 for pure smoke
140
+ report_to=[],
141
+ )
142
+ ```
143
+ Read hyperparameters from `/opt/ml/input/config/hyperparameters.json` (SageMaker writes the estimator's
144
+ `hyperparameters=` there) so the same script is config-driven.
145
+
146
+ ### 3.4 The launch — SageMaker Python SDK `Estimator` (run from the laptop / Studio)
147
+ ```python
148
+ import sagemaker
149
+ from sagemaker.estimator import Estimator
150
+
151
+ sess = sagemaker.Session() # picks up region us-west-2
152
+ ROLE = "arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247"
153
+ IMAGE = "386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke"
154
+ BUCKET = "amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d"
155
+
156
+ est = Estimator(
157
+ image_uri=IMAGE,
158
+ role=ROLE,
159
+ instance_type="ml.g5.2xlarge", # quota = 1 (verified live) — no ticket needed
160
+ instance_count=1,
161
+ volume_size=100,
162
+ max_run=3600, # 1h cap for the smoke
163
+ output_path=f"s3://{BUCKET}/composer-rl/smoke/output",
164
+ base_job_name="composer-grpo-smoke",
165
+ environment={"HF_HUB_ENABLE_HF_TRANSFER": "1"},
166
+ hyperparameters={"model": "Qwen/Qwen2.5-0.5B-Instruct", "max_steps": 20},
167
+ entry_point="run_sagemaker.py",
168
+ source_dir="examples/gsm8k_grpo", # script layered on the baked image
169
+ keep_alive_period_in_seconds=0, # warm-pool quota is 0 in this acct; leave off
170
+ # use_spot_instances=True, max_wait=7200 # optional: spot quota is also 1
171
+ )
172
+ est.fit(wait=True, logs=True) # GSM8K loads from HF inside the container
173
+ ```
174
+ **Cost:** `ml.g5.2xlarge` is ~$1.52/hr on-demand in us-west-2; a 20-step 0.5B smoke is ~15-25 min ⇒ **well under $1**.
175
+ On spot (quota=1) ~$0.45-0.60/hr ⇒ pennies. The CPU example proves the loop in ~60s; this proves it on a real GPU with
176
+ the real vLLM rollout path, which the CPU example explicitly does not exercise.
177
+
178
+ ### 3.5 Gotchas baked into the recipe
179
+ - **vLLM needs HF model download at job start.** Either set `HF_HUB_ENABLE_HF_TRANSFER=1` (done) or stage the model
180
+ into S3 and pass it as an input channel; for a 0.5B model the live download is fine. `EnableNetworkIsolation` MUST stay
181
+ False (the executor pins this) so the container can reach `huggingface.co` and S3.
182
+ - **`vllm_gpu_memory_utilization=0.3` is the load-bearing knob on a 24 GB A10G.** Too high ⇒ OOM when the policy +
183
+ grads + optimizer also need the GPU; too low ⇒ tiny KV cache, slow rollout. 0.3 is the TRL/Ray reference default for
184
+ a small model on one GPU.
185
+ - **GSM8K = `openai/gsm8k` config `main`.** Already what the example loads. No license blocker (MIT).
186
+
187
+ ---
188
+
189
+ ## 4. The DiLoCo N-replica path: how `SageMakerExecutor` drives the rendezvous
190
+
191
+ This is the executor that already exists and is mock-tested (`tests/test_sagemaker_executor.py` — 20+ tests covering
192
+ rank-ordered handles, env injection, status mapping, cancel idempotency, partial-launch rollback). It is the bursty
193
+ DiLoCo backend, distinct from the §3 smoke.
194
+
195
+ ### 4.1 What it does (verified from source)
196
+ - `launch_replicas(N, ...)` submits **N independent single-instance Training Jobs** (NOT one multi-instance job — that
197
+ would couple replicas through SageMaker's intra-job NCCL fabric and break the "each replica syncs only through S3"
198
+ design). Each job gets `Environment={"REPLICA_RANK": str(i), "WORLD_SIZE": str(N), "RENDEZVOUS_URI": s3uri}` and
199
+ `ContainerEntrypoint=["python","-m","composer_replication.diloco.serverless.replica_entrypoint"]` with
200
+ `ContainerArguments=["--rendezvous", s3uri, "--world-size", N, "--trainer-module", ..., "--trainer-fn", ...]`.
201
+ - `replica_entrypoint.main` reads `REPLICA_RANK`, builds `ObjectStoreAllReduce(uri=s3://..., rank, world_size)`, wraps
202
+ it in `MockManager`, and calls the user's `train(manager=, rank=, world_size=, **trainer_kwargs)`. The trainer wires
203
+ `manager` into `make_diloco_outer_loop`; pseudo-gradients sync via `round_{NNNNNN}/rank_{RRRR}.pt` PUT-then-poll-then-mean
204
+ on S3. **DiLoCo math, loss, trainer untouched.**
205
+ - `poll`/`collect` map `describe_training_job.TrainingJobStatus`; `stream_logs` reads
206
+ `/aws/sagemaker/TrainingJobs/<job>/algo-*`; `cancel` calls `stop_training_job` idempotently.
207
+
208
+ ### 4.2 The asymmetry that makes this clean (report §8)
209
+ Gang scheduling is needed for *intra-replica* FSDP NCCL but NOT for *inter-replica* DiLoCo sync — replicas rendezvous
210
+ through S3, so a straggler simply blocks at the poll loop (bounded by `timeout_s=1800`) instead of deadlocking. On
211
+ SageMaker, N separate jobs have no mutual network path (`supports_inter_replica_network=False`), which is exactly right.
212
+
213
+ ### 4.3 What to wire to run it live (the deltas)
214
+ 1. **Same baked image** from §3.2 (it already `pip install -e .[serverless]`, so `replica_entrypoint`, `s3fs`, `fsspec`
215
+ are present). The executor passes `ContainerEntrypoint` explicitly, so a generic image works.
216
+ 2. **Rendezvous bucket = the SageMaker default bucket** (`amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d`) so the
217
+ exec role's `AmazonSageMakerFullAccess` already grants the live S3 PUT/GET the allreduce poll loop needs. Use
218
+ `rendezvous_uri = "s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/composer-rl/runs/<run_id>/rendezvous/"`.
219
+ 3. **Quota:** N concurrent jobs need `ml.g5.2xlarge for training job usage >= N`. Currently 1 ⇒ N=1 works today; for
220
+ N=2-4 DiLoCo, request a Service Quotas increase (Service Quotas console → SageMaker → "ml.g5.2xlarge for training job
221
+ usage"). The smoke proves the executor end-to-end at N=1 (one job, one rank — degenerate allreduce returns its own
222
+ tensor), then N=2 once quota lands.
223
+ 4. **Driver script** `examples/diloco_sagemaker/run.py` (~80 LOC): construct `SageMakerExecutor(role_arn=..., image_uri=...,
224
+ output_s3_path=..., region="us-west-2")`, call `launch_replicas(N, entrypoint="...replica_entrypoint",
225
+ entrypoint_args={"rendezvous_uri": s3uri, "trainer_module": "examples.gsm8k_grpo.diloco_train", "trainer_fn": "train",
226
+ "trainer_kwargs": {...}}, gpu="A10G", timeout=3600)`, then `collect(handles)`. `gpu="A10G"` maps to `ml.g5.2xlarge`
227
+ via the executor's `_GPU_INSTANCE_MAP`.
228
+
229
+ ---
230
+
231
+ ## 5. The GRPO rollout problem — colocated vLLM now, server/AsyncServer later
232
+
233
+ TRL's `GRPOTrainer` needs a generation path each step. Three options, committed mapping:
234
+
235
+ | Option | When | On SageMaker |
236
+ |---|---|---|
237
+ | `model.generate()` (no vLLM) | never for real runs — too slow | the CPU example uses this implicitly; fine only for the 0.5B CPU toy. |
238
+ | **vLLM colocate** (`use_vllm=True, vllm_mode="colocate"`) | **the smoke + most single-GPU runs** | vLLM in the same process, shares the training GPU at `vllm_gpu_memory_utilization=0.3`. One container, one job, no endpoint. TRL 1.5 default. **This is the F3 answer.** |
239
+ | vLLM server (`trl vllm-serve`) | multi-GPU where you dedicate GPUs to generation | a *second* SageMaker job or a SageMaker endpoint runs `trl vllm-serve`; the trainer job points `vllm_server_host/port` at it. Introduces inter-process comm + a network hop — only worth it when generation dominates and you have spare GPUs. |
240
+ | VeRL `AsyncServer` | tool-heavy agentic tree-of-work rollouts (report §8) | the scale answer for the SWE-agent tree: async GPU-decoupled agent loop TRL lacks. A later facet; the engine should be a configurable backend, not hardcoded. |
241
+
242
+ For the F3 smoke and the DiLoCo fallback, **colocate is correct and simplest**: it keeps everything in one
243
+ self-contained training container, which is exactly what a single-instance SageMaker job wants. No separate inference
244
+ endpoint to provision, secure, or pay for.
245
+
246
+ **One subtlety the report flags (§8):** the SDPO channel (Channel 2) needs full-vocabulary *logits* (TRL-hosted, which
247
+ the trainer's `_compute_sdpo_loss` does via `model(...).logits`), while Channel 3 needs only log-probs. Colocated vLLM
248
+ handles *rollout generation*; the SDPO/replay logits/log-probs come from the policy forward pass in `_compute_loss`,
249
+ not from vLLM. So turning on `alpha_sdpo>0` later does not change the rollout backend choice.
250
+
251
+ ---
252
+
253
+ ## 6. Concrete repo deltas (to make this runnable, not hand-wavy)
254
+
255
+ | Path (~LOC) | What | Why |
256
+ |---|---|---|
257
+ | `docker/Dockerfile.sagemaker` (~15) | Extend PyTorch DLC 2.6.0-gpu-py312; bake trl+vllm+peft+accelerate+datasets+s3fs+fsspec + `pip install -e .[train,serverless]`. | The report (§9) names "a Dockerfile wrapping composer_replication" as a missing build artifact. This is it. |
258
+ | `examples/gsm8k_grpo/run_sagemaker.py` (~120) | GPU+vLLM variant of `run.py`; reads `/opt/ml/input/config/hyperparameters.json`; writes to `/opt/ml/model`; `use_vllm=True, vllm_mode="colocate"`. | The runnable smoke entry. |
259
+ | `examples/diloco_sagemaker/run.py` (~80) | Driver that builds `SageMakerExecutor` with the live role/image/bucket and calls `launch_replicas`/`collect` for N replicas over the S3 rendezvous. | Turns the mock-tested executor into a live driver — no executor code change needed. |
260
+ | `examples/gsm8k_grpo/diloco_train.py` (~60) | A `train(manager, rank, world_size, **kw)` that wraps the GRPO trainer in `make_diloco_outer_loop(manager=...)`. | The `trainer_module:trainer_fn` the executor imports inside each replica. |
261
+ | `scripts/build_and_push_ecr.sh` (~20) | ECR create-repo + login + build + push (the §3.2 commands). | One-command image publish. |
262
+ | `docs/AWS_SAGEMAKER_QUICKSTART.md` (~120) | The §1 live facts + §3 estimator recipe + the quota/IAM gotchas + the spot variant. | So the next person runs it in one read. |
263
+ | `pyproject.toml` `aws` extra (+1 line) | add `sagemaker>=2.200` alongside `boto3` (the SDK `Estimator` lives there; executor uses raw boto3 but the launch driver wants the SDK). | The launch host needs the SDK; currently only `boto3` is in the extra. |
264
+
265
+ **Nothing in `SageMakerExecutor`, `ComposerReplicationTrainer`, `ObjectStoreAllReduce`, `replica_entrypoint`, or the
266
+ loss changes.** The executor's design (N single-instance jobs, env-injected rank, S3-only rendezvous,
267
+ `EnableNetworkIsolation=False`) is already correct for this environment — verified against the live IAM/quota facts.
268
+
269
+ ---
270
+
271
+ ## 7. Open questions / next gates
272
+
273
+ - **N>1 DiLoCo quota:** `ml.g5.2xlarge for training job usage` = 1 today. N=2-4 needs a Service Quotas increase
274
+ (typically minutes-to-hours for g5; not guaranteed instant). Request before the N>1 run.
275
+ - **Warm pools:** `g5 training warm pool usage` quota = 0 ⇒ each job pays ~3-6 min cold-start. For the bursty DiLoCo
276
+ fallback at small H (frequent re-launch) this matters; request warm-pool quota or accept the cold-start, or move the
277
+ long inner loop to HyperPod (which is persistent — no per-round cold-start).
278
+ - **vLLM version pin:** the smoke leaves `vllm` unpinned in the Dockerfile; pin to a version whose CUDA matches the DLC
279
+ (cu124 / torch 2.6) before promoting past smoke, to avoid a silent wheel mismatch.
280
+ - **HyperPod-on-EKS path:** the `sagemaker-dynamo-on-eks-hyperpod-*` bucket shows it's been used here; the future
281
+ `EKSExecutor` + HyperPod node-group attach is the report's §9 recommendation for the long inner run. Out of scope for
282
+ F3 (Training-Jobs facet) but the rendezvous makes the swap free.
283
+ - **Spot interruption + DiLoCo:** spot g5 quota = 1; with `use_spot_instances=True` a replica can be reclaimed mid-round.
284
+ The bounded `timeout_s=1800` poll means a reclaimed replica stalls its peers up to 30 min then `TimeoutError`s. For
285
+ spot DiLoCo, add `save_freq` checkpointing + relaunch-on-interruption in the driver (report §10 failure modes).
286
+
287
+ ## 8. References
288
+ - Repo: `composer_replication/diloco/serverless/sagemaker.py` (SageMakerExecutor), `replica_entrypoint.py`,
289
+ `allreduce.py` (ObjectStoreAllReduce + MockManager), `trainer/composer_trainer.py` (ComposerReplicationTrainer +
290
+ `make_po_config`), `examples/gsm8k_grpo/run.py`, `tests/test_sagemaker_executor.py`.
291
+ - Report §8 (EKS primary), §9 (SageMaker path + HyperPod hybrid), §10 (cost / phased plan).
292
+ - AWS DLC registry us-west-2: `pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker` @ 763104351884
293
+ (docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-us-west-2.html).
294
+ - TRL vLLM colocate: GRPOConfig `use_vllm`/`vllm_mode`/`vllm_gpu_memory_utilization` (huggingface/trl
295
+ grpo_config.py; huggingface.co/blog/vllm-colocate; Ray TRL-GRPO example).
296
+ - SageMaker quotas: g5/g6/p4d training-job usage default 0 (StackOverflow 71655321; re:Post) — verified live this
297
+ account: g5.2xlarge=1, g6.2xlarge=0.
298
+ - HyperPod-EKS 1:1 mapping: docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html.
299
+ - Live `aws sts get-caller-identity` / `aws service-quotas list-service-quotas` / `aws iam` (2026-06-09).
research/design-F4-decoupled-diloco-s3.md ADDED
@@ -0,0 +1,143 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # F4 — Decoupled DiLoCo on Serverless + S3 (AWS-native, concrete)
2
+
3
+ **Facet:** the headline distributed-training question — how N independent SageMaker Training Jobs (or EKS Indexed-Job pods) each run inner DiLoCo (AdamW H steps) then sync pseudo-gradients ONCE per ~500 steps through S3 via `ObjectStoreAllReduce`, with no cross-job NCCL.
4
+
5
+ **Environment (LIVE):** account `386931836011`, region `us-west-2`, `Admin`/Isengard. Verified live below:
6
+ - Rendezvous-ready bucket exists: `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` (matches the `sagemaker`-name pattern that `AmazonSageMakerFullAccess` grants S3 on — load-bearing, see IAM).
7
+ - Two ready execution roles: `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` (use this) and `...-20241223T082691`.
8
+ - CPU training quota = **30** instances each for `ml.m5.xlarge` / `ml.m5.large` / `ml.c5.xlarge` in us-west-2 → the 2-replica smoke is trivially in budget.
9
+ - Warm-pool quota = **0** for all (default). `KeepAlivePeriodInSeconds` set without a quota increase silently no-ops. This is THE cold-start gotcha.
10
+
11
+ ---
12
+
13
+ ## 1. The decision that already shipped (ADR-005) and why it's right for AWS
14
+
15
+ ADR-005 chose **object-store rendezvous, not cross-job NCCL**, as the DiLoCo comm primitive. DiLoCo (Douillard et al. 2023, arXiv:2311.08105 §3.2) syncs once per H≈500–1000 inner steps — ~10–30 min wall-clock. The exchange per outer round is one `PutObject` of the pseudo-gradient (~2 GB for a 1B bf16 model) + (N−1) `GetObject`s. At N=8 that's ~128 GB read spread over 30 min ≈ 70 MB/s aggregate, ~$0.05/round on S3. The repo realizes exactly this in `ObjectStoreAllReduce.allreduce()` (`composer_replication/diloco/serverless/allreduce.py:131`): PUT `round_{NNNNNN}/rank_{RRRR}.pt`, poll-until-all-peers-exist, mean, `tensor.copy_(avg)`.
16
+
17
+ **Why this is correct on AWS specifically (not just plausible):** S3 has been **strongly read-after-write consistent for PUT/GET/LIST in all regions since Dec 2 2020, at no extra cost** (`aws.amazon.com/s3/consistency`; `docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html`). The poll loop's correctness depends on exactly this: rank R PUTs `rank_0003.pt`, and the instant any peer's `_exists()` returns True the subsequent `_get()` is guaranteed to return that byte-complete object — no "object not yet visible" race, no eventual-consistency retry shim. The repo's `_put` even writes a `.tmp` then `os.replace` on the local path (atomic on POSIX); on S3 a single `PutObject` is atomic per key by definition, so a peer never sees a half-written `.pt`. **The substrate is consistency-correct by construction on S3.**
18
+
19
+ The one-line architectural payoff on K8s/serverless (report §8): a straggler replica simply blocks peers at the poll loop (bounded by `timeout_s=1800`) instead of deadlocking an NCCL gang. Inter-replica DiLoCo sync needs **no gang scheduling** — only intra-replica FSDP does.
20
+
21
+ ---
22
+
23
+ ## 2. Exact lifecycle: `SageMakerExecutor.launch_replicas(N)` → collect
24
+
25
+ `composer_replication/diloco/serverless/sagemaker.py` (`SageMakerExecutor`) already implements the full `ServerlessExecutor` Protocol. The end-to-end flow for one Decoupled DiLoCo run:
26
+
27
+ ```
28
+ make_diloco_run(N=4) # orchestrator (NEW, ~120 LOC — see §6)
29
+ └─ exec = SageMakerExecutor(
30
+ role_arn="arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247",
31
+ image_uri="386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-diloco:latest",
32
+ output_s3_path="s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-out/run42/",
33
+ region="us-west-2")
34
+ └─ handles = exec.launch_replicas(
35
+ n_replicas=4, entrypoint=<ignored>,
36
+ entrypoint_args={
37
+ "rendezvous_uri": "s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/run42/",
38
+ "trainer_module": "composer_replication.trainer.composer_trainer",
39
+ "trainer_fn": "diloco_train", # NEW thin wrapper — see §6
40
+ "trainer_kwargs": {"model_name":"Qwen/Qwen2.5-0.5B","sync_every":500,"total_steps":2000}},
41
+ gpu="A10G", timeout=86400)
42
+ ```
43
+
44
+ **Per replica, `launch_replicas` submits ONE single-instance Training Job** (`sagemaker.py:309-369`):
45
+ - `ResourceConfig.InstanceCount == 1` — deliberately NOT one multi-instance job (which would wire SageMaker's intra-job NCCL via `resourceconfig.json` and couple the replicas — the wrong model). N replicas = N separate jobs.
46
+ - `AlgorithmSpecification.ContainerEntrypoint = ["python","-m","composer_replication.diloco.serverless.replica_entrypoint"]`, `ContainerArguments = ["--rendezvous", s3uri, "--world-size","4","--trainer-module",...,"--trainer-fn","diloco_train","--trainer-kwargs-json", "{...}"]` (each token a separate list element).
47
+ - `Environment = {"REPLICA_RANK":"<rank>","WORLD_SIZE":"4","RENDEZVOUS_URI":s3uri}` — the rank channel.
48
+ - `EnableNetworkIsolation=False` — **load-bearing**, pinned, never a knob: True severs the container's outbound S3 and dead-locks the allreduce poll until `timeout_s`. The bucket access is granted on `RoleArn` instead (SageMaker's IRSA analog).
49
+ - On any rank's `create_training_job` failure, already-launched siblings are best-effort `cancel`ed then it raises (clean abort, no orphan jobs).
50
+
51
+ **Inside each job, `replica_entrypoint.main`** (`replica_entrypoint.py:38`):
52
+ 1. reads `REPLICA_RANK` from env (falls back to argv via the dual contract — argv for SageMaker/Local, env for EKS),
53
+ 2. builds `store = ObjectStoreAllReduce(uri=s3uri, rank, world_size)`. For an `s3://` URI this hits `_init_fsspec()` → `fsspec.filesystem("s3")` (s3fs; in the `[serverless]` extra),
54
+ 3. `manager = MockManager(store)`,
55
+ 4. imports `trainer_module.trainer_fn`, injects `manager=`, `rank=`, `world_size=`, calls it.
56
+
57
+ **Inside `diloco_train` (the trainer_fn):**
58
+ ```python
59
+ diloco = make_diloco_outer_loop(
60
+ manager=manager, model_fragments=[model],
61
+ inner_optimizer=AdamW(model.parameters(), lr=1e-5),
62
+ outer_lr=0.7, outer_momentum=0.9, nesterov=True, sync_every=500)
63
+ with diloco:
64
+ for step in range(total_steps):
65
+ inner_optim.zero_grad(); loss = composer_loss(...); loss.backward()
66
+ inner_optim.step() # at step%500==0 DiLoCo's post-hook fires
67
+ ```
68
+ At every 500th `inner_optim.step()`, torchft's DiLoCo post-hook fires `prepare_sync` → `perform_sync`. `perform_sync` computes the pseudo-gradient `θ_initial − θ_local` (sign convention pinned in `diloco/__init__.py:13-38`), calls `manager.allreduce(pseudograd)` → `MockManager.allreduce` (`allreduce.py:268`) → `ObjectStoreAllReduce.allreduce`: PUT `round_000000/rank_0003.pt`, poll for all 4 ranks' files, mean, `copy_` back. `MockManager.should_commit()` always returns True (no FT failover; replica failure is the orchestrator's job), then the **outer Nesterov SGD step** applies the averaged pseudo-gradient and redistributes the new global weights. `start_quorum()` bumps `_step` so `current_step()` advances exactly once per round (fragment-rotation math). Repeat for the next 500 inner steps → `round_000001/`, etc.
69
+
70
+ **Collect:** `exec.collect(handles, timeout=...)` polls `describe_training_job` per handle until terminal (`Completed`/`Failed`/`Stopped`), returns rank-ordered result dicts incl. `S3ModelArtifacts`. `poll` maps `TrainingJobStatus` via `_STATUS_MAP` (refining `InProgress`+queued → `pending`). `stream_logs` reads `/aws/sagemaker/TrainingJobs` CloudWatch stream `<job>/algo-1-<epoch>`.
71
+
72
+ The DiLoCo math, `MockManager`, `ObjectStoreAllReduce`, `make_diloco_outer_loop`, the trainer, and the loss are **all byte-for-byte unchanged** across Local→EKS→SageMaker. That is the whole point of the Protocol.
73
+
74
+ ---
75
+
76
+ ## 3. S3 rendezvous specifics for AWS
77
+
78
+ - **Bucket/prefix:** `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/<run_id>/` for the rendezvous; a sibling `…/diloco-out/<run_id>/` for `OutputDataConfig`. Use the sagemaker-named bucket so the stock execution-role policy already grants access (see IAM). Layout written by the substrate: `…/diloco-rdv/<run_id>/round_{NNNNNN}/rank_{RRRR}.pt`.
79
+ - **Consistency:** strong RAW + strong LIST, all regions, since 2020 — the poll loop (`_exists` then `_get`) needs no consistency shim. A peer's file becomes visible atomically on PUT completion.
80
+ - **IAM (load-bearing):** the jobs need `s3:GetObject` + `s3:PutObject` (and ideally `s3:ListBucket`) on `…/diloco-rdv/<run_id>/*` on the **execution `RoleArn`**, NOT the caller. The stock `AmazonSageMakerFullAccess` on the existing roles grants S3 only on buckets whose name contains `sagemaker`/`Sagemaker`/`SageMaker`/`aws-glue` — which is exactly why the `amazon-sagemaker-…` bucket works out of the box and a custom-named `s3://composer-diloco-rdv` would silently 403 the first PUT and hang every peer until `timeout_s`. Either keep the rendezvous in a sagemaker-named bucket (recommended, zero IAM work) or attach an inline policy scoping `s3:GetObject/PutObject/ListBucket` to the custom prefix.
81
+ - **Poll timeout vs stragglers:** `ObjectStoreAllReduce(timeout_s=1800, poll_interval_s=1.0)`. 30 min comfortably covers a SageMaker cold-started replica that joins late (3–5 min provision + first-500-steps lag). For Spot churn at larger N, raise to `timeout_s=3600`. A `TimeoutError` names the exact missing `rank_R` + `round_N` — the orchestrator can then cancel + relaunch that rank (DiLoCo's `should_commit==True` means a stalled round does not silently skip; the run aborts cleanly rather than averaging a partial set).
82
+ - **Cost:** ~$0.05/round (ADR-005), negligible vs GPU. For a 2000-step / sync_every=500 run that's 4 rounds ≈ $0.20 of S3 for the whole run.
83
+
84
+ ---
85
+
86
+ ## 4. What's missing to run it for real (the deferred smoke) + the cheapest validating run
87
+
88
+ **The gap.** ADR-005 §"Open/deferred" flags a "real serverless smoke" as never run; report §10 Phase 0 lists "EKSExecutor + S3 rendezvous + dep bump" as substrate hardening. Concretely, what exists vs what's missing:
89
+
90
+ | Layer | State |
91
+ |---|---|
92
+ | `ObjectStoreAllReduce` over `file://` | **Proven** — `test_serverless_diloco_integration.py` runs a 2-process `LocalProcessExecutor` run and asserts cross-rank weight convergence after one outer round. |
93
+ | `ObjectStoreAllReduce` over `s3://` | **Code path exists, never exercised against real S3.** `_init_fsspec()` is untested with live s3fs; the `_exists`/`_get`/`_put` S3 branches have only mock coverage. |
94
+ | `SageMakerExecutor` against real boto3 | **Never submitted a real job.** Tests inject a `_MockSMClient`. The ECR `image_uri` does not yet exist (no Dockerfile baking `composer_replication`). |
95
+ | `diloco_train` trainer_fn | **Missing.** `composer_trainer.py` has the trainer but no thin DiLoCo-wrapped entry that accepts the injected `manager`/`rank`/`world_size` kwargs. |
96
+
97
+ **The exact remaining gaps to close, in order (each cheap, each decisive):**
98
+ 1. **s3:// smoke (≈$0, ~10 min):** point the *existing* `test_serverless_diloco_integration` multi-process test at `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/smoke-$(uuid)/` instead of a tmp dir, with 2 local processes. This exercises the real s3fs PUT/poll/GET/mean path and the strong-consistency assumption with zero GPU spend. Only `s3fs` + `boto3` (already in `[serverless]`) and ambient `Admin` creds are needed. Gate: both ranks converge to identical weights (the existing assertion).
99
+ 2. **Dockerfile + ECR push (~$0):** ~30-line image `FROM pytorch/pytorch`, `pip install -e .[serverless,diloco,train]`, entrypoint `python -m composer_replication.diloco.serverless.replica_entrypoint`. Push to `386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-diloco:latest`.
100
+ 3. **`diloco_train` trainer_fn (~40 LOC):** wraps the existing trainer in `make_diloco_outer_loop`, accepts `manager/rank/world_size`.
101
+ 4. **2 tiny CPU SageMaker jobs (the validating run, ~$0.10–0.30):** `SageMakerExecutor(gpu=None → ml.m5.xlarge, $0.23/hr on-demand)`, `n_replicas=2`, `nn.Linear` or a 0.5B model with `total_steps=4, sync_every=2`, `rendezvous_uri` in the sagemaker bucket. Two jobs × ~5 min cold-start + ~2 min run ≈ 0.24 instance-hours ≈ **$0.06–0.30**. Within the 30-instance CPU quota with massive headroom. Gate: `collect()` returns both `succeeded`, both ranks' `round_000000/rank_000{0,1}.pt` appear in S3, and the two model artifacts in `…/diloco-out/` are byte-identical (proves the cross-job allreduce ran through real S3, not just locally).
102
+
103
+ That ladder — `file://` (done) → `s3://` local (step 1) → 2 real CPU jobs (step 4) — is the report's prescribed cheapest path and closes the deferred-smoke gap for well under the ADR's $2–5 estimate (the original estimate assumed GPU + Modal).
104
+
105
+ ---
106
+
107
+ ## 5. Streaming DiLoCo (`fragment_sync_delay>0`) on this substrate
108
+
109
+ Streaming DiLoCo (Liu et al. 2025, "Eager Updates for Overlapped Communication in DiLoCo", arXiv:2501.18512; the DiLoCo line is arXiv:2311.08105) splits the model into fragments synced on staggered schedules so the allreduce of fragment k overlaps inner computation of the next steps. `make_diloco_outer_loop` already exposes the knobs: `fragment_sync_delay>0`, `fragment_update_alpha`, and `model_fragments=[frag_0,...,frag_M-1]` (`diloco/__init__.py:72-93`).
110
+
111
+ From torchft's `local_sgd.py` (`_StreamingDiLoCoFragment`, verified via DeepWiki against the upstream source):
112
+ - **prepare_sync** fires at inner step `sync_every − fragment_sync_delay`: computes pseudo-gradients (`_save_grads`) and **launches the allreduce without waiting**, recording the `Work` in `self._allreduce_work`, on a separate CUDA stream.
113
+ - **perform_sync** fires at `sync_every`: **waits** on that `Work`, restores global params, `should_commit()`, applies the outer step, then `_merge_parameters` blends `local*alpha + global*(1−alpha)` (`fragment_update_alpha`; 0.0 = standard full-replacement DiLoCo).
114
+ - **`_current_fragment() = current_step() % len(fragments)`** — round-robin; each fragment syncs on its own offset. This is exactly why `MockManager.start_quorum()` bumps `_step` once per round and `current_step()` is faithful: get this wrong and replicas pick different fragments and diverge.
115
+
116
+ **The load-bearing gotcha for the object-store substrate.** Streaming's overlap depends on `manager.allreduce()` being **asynchronous** — prepare_sync launches it, `fragment_sync_delay` steps of inner compute run, then perform_sync waits. But the repo's `MockManager.allreduce` is **synchronous**: it calls `ObjectStoreAllReduce.allreduce`, which **blocks** on the poll-until-all-peers loop before returning `_ImmediateWork` (whose `.wait()` is a no-op). So on this substrate, **today, prepare_sync blocks for the full S3 rendezvous and `fragment_sync_delay` buys zero overlap** — Streaming degrades to vanilla, correctly but without the comm/compute overlap benefit. This is fine for correctness (and is why the same API "configures Streaming" per the docstring) but defeats the point on a 2 GB-per-fragment, 30-min-cadence S3 sync.
117
+
118
+ **The fix to make Streaming real here (~60 LOC, deferred):** give `ObjectStoreAllReduce` a non-blocking mode: `allreduce_async(tensor)` PUTs `round_N/rank_R.pt` and returns immediately; the returned `Work.wait()` then runs the poll-GET-mean-copy. `MockManager.allreduce` returns that deferred `Work` instead of `_ImmediateWork`. Now prepare_sync's PUT returns instantly, the `fragment_sync_delay` inner steps run while peers are PUTting concurrently, and perform_sync's `.wait()` does the poll/mean. Because S3 is strongly consistent, by the time perform_sync waits `fragment_sync_delay` steps later, peer files are far more likely already present — the overlap is genuine. Per-fragment streaming further shrinks each PUT to (model_size / M), so the poll is over smaller objects. This is the natural Streaming-DiLoCo realization on object storage and is the right Phase-5 upgrade (report §10) for multi-fragment large-model runs; vanilla (`fragment_sync_delay=0`, single fragment) is correct and sufficient for Phases 0–4.
119
+
120
+ ---
121
+
122
+ ## 6. Repo delta (what to build in `composer_replication/`)
123
+
124
+ | File | Delta | ~LOC |
125
+ |---|---|---|
126
+ | `composer_replication/trainer/composer_trainer.py` | NEW `diloco_train(*, manager, rank, world_size, model_name, sync_every=500, total_steps, **kw)` — wraps the existing trainer body in `make_diloco_outer_loop(manager=manager,...)`; this is the `trainer_fn` `replica_entrypoint` calls. | ~40 |
127
+ | `composer_replication/diloco/serverless/run.py` | NEW thin orchestrator `make_diloco_run(executor, n, rendezvous_uri, trainer_module, trainer_fn, trainer_kwargs, gpu, timeout)` → `launch_replicas` + `collect`; surfaces straggler `TimeoutError` → relaunch-rank. | ~120 |
128
+ | `docker/Dockerfile.diloco` | NEW — `FROM pytorch/pytorch:*-cuda*`, `pip install -e .[serverless,diloco,train]`, ENTRYPOINT to `replica_entrypoint`. Push to ECR `composer-diloco:latest`. | ~30 |
129
+ | `composer_replication/diloco/serverless/sagemaker.py` | EXTEND: optional `keep_alive_period_s` → `ResourceConfig.KeepAlivePeriodInSeconds` (warm pool; **document the 0-default quota gotcha**); optional `use_spot` → `EnableManagedSpotTraining=True` + `MaxWaitTimeInSeconds` (> `MaxRuntimeInSeconds`) + `CheckpointConfig`. Both default off. | ~40 |
130
+ | `composer_replication/diloco/serverless/allreduce.py` | EXTEND (Phase 5, for real Streaming): `allreduce_async` + a deferred `Work` whose `.wait()` runs the poll/mean; `MockManager` returns it. Vanilla path untouched. | ~60 |
131
+ | `composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py` | EXTEND: parametrize rendezvous over `file://` AND `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/smoke-<uuid>/` (gated on `AWS_SMOKE=1`) — closes the s3:// path gap with zero GPU. | ~30 |
132
+ | `docs/adrs/ADR-005-serverless-diloco.md` | UPDATE the "Open/deferred — Real serverless smoke" clause: replace the Modal $2–5 estimate with the SageMaker 2×CPU-job ($0.06–0.30) plan above. | ~10 |
133
+
134
+ Untouched: `loss.py`, `teacher_replay.py`, `safety/`, `make_diloco_outer_loop`, `MockManager` core, `ObjectStoreAllReduce` core, `replica_entrypoint` (its dual argv/env contract already supports both SageMaker and EKS).
135
+
136
+ ---
137
+
138
+ ## 7. Open questions / falsifiers
139
+
140
+ - **Warm-pool quota:** default 0 in us-west-2 (verified). If iterative dev wants warm starts (skip the 3–5 min cold-start per round-of-jobs), request a `ml.<type> for training warm pool usage` quota increase first; otherwise `KeepAlivePeriodInSeconds` no-ops.
141
+ - **Pseudo-gradient dtype/size at real model scale:** the smoke uses tiny tensors; a 0.5B–8B bf16 pseudo-gradient is 1–16 GB per PUT — confirm s3fs multipart upload throughput and that `torch.save({"rank","tensor"})` round-trips bf16 on CPU (the code casts peer tensors back to device+dtype on GET).
142
+ - **Streaming overlap:** only real after the `allreduce_async` delta (§5); until then `fragment_sync_delay>0` is correct-but-no-overlap on S3. Measure round wall-clock with vs without before claiming the benefit.
143
+ - **N>16 straggler fragility under Spot:** the bounded `timeout_s` poll is the mitigation; the HyperPod-attached-node-group path (report §9, same S3 rendezvous) is the resilience escalation for multi-day runs.
research/design-F5-fidelity-audit.md ADDED
@@ -0,0 +1,106 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # F5 — Fidelity Audit: Are we FULLY taking advantage of the papers + the Composer 2.5 / Composer-2 methodology?
2
+
3
+ > **Date:** 2026-06-09. **Scope:** rubric-style audit of every documented Composer 2.5 / Composer-2 ingredient AND each load-bearing paper against the *actual* code in `composer_replication/`, plus a prioritized "to fully replicate + extend" gap list. AWS-native where a gap needs cloud infra (account 386931836011, us-west-2).
4
+ > **Grounding:** `docs/COMPOSER_RECIPE_MAPPING.md`, `research/01/09/10` (Composer 2.5 blog + Composer-2 techreport mining), `research/05-12`, ADR-008/014/015, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md`.
5
+
6
+ ## Headline verdict
7
+
8
+ The repo is a **high-fidelity replication of Composer 2.5's two *published* channels** (Dr.GRPO base + SDPO textual-feedback self-distillation) plus the framework's *own* third channel (multi-teacher trace-replay-DPO). The datagen / curriculum / anti-hack / safety substrate is essentially complete. But there are **three classes of gap that block "taking a model to the next level"**:
9
+
10
+ 1. **One byte-level loss-math infidelity that the evidence says actually matters**: Channel 1 rides TRL's **k3-in-loss** KL estimator; Composer 2 explicitly chose **k1-in-reward** (`-log r`), and the 2025/26 literature (arXiv:2512.21852 "A Comedy of Estimators"; verl adopting k1-in-reward as the *only* reverse-KL option; TRL issue #4967) shows k1-in-reward **improves OOD generalization** while k3-in-reward can *collapse*. The repo documents this as a "small delta, not patched" — but OOD generalization is exactly the "next-level" axis. This is the single most concrete fidelity fix.
11
+ 2. **Composer-2's *non-hint* behavior-shaping recipe is entirely MISSING**: the **auxiliary scalar reward array** (style / communication / unfinished-todo penalties), the **nonlinear length/effort penalty** `C_length = ((1+kx)^{1-q}-1)/(k(1-q))`, and **self-summarization with reward-to-all-chain-tokens**. These are *fully specified with equations* in research/10 and are reproducible without the hint mystery. None is in code.
12
+ 3. **The novel extension (multi-model Monte-Carlo tree-of-work + world-model deliberation head) is 100% design, 0% code.** No tree controller, no env-step-between-branches recursion, no `SiblingBootstrapGenerator`, no `<deliberate>` token, no next-state head. The report's "core delta" (teacher-plurality → execution-oracle fitness, depth-1 → recursion) is unbuilt.
13
+
14
+ The good news the report stresses: the substrate for all of this already exists. ~9/10 of the system is reuse. The fidelity gaps are *additive*, not architectural rewrites.
15
+
16
+ ---
17
+
18
+ ## RUBRIC A — Composer 2.5 / Composer-2 documented ingredients
19
+
20
+ | # | Ingredient (source) | Status | Implementing file / evidence | Gap |
21
+ |---|---|---|---|---|
22
+ | (a) | **Targeted RL w/ textual feedback** = SDPO/OPSD self-distill (Ch2) — 2.5 blog | **FULLY-REPLICATED** | `opsd.py::generalized_jsd_loss` (byte-for-byte vs siyan-zhao/OPSD, re-aligned Wave 15); `trainer/composer_trainer.py::_compute_sdpo_loss` (full-logits, stop-grad teacher, post-hint masked, ADR-011 aligned indices); `trainer/data_collator.py` emits `ctx_teacher_input_ids` + `student/teacher_response_idx`; tests `test_sdpo_alignment_indices.py`, `test_opsd_parity.py`. | Loss + wiring are production-grade. The *hint source* is the open question — see (a′) below. SDPO not yet smoke-tested against a real `trl.GRPOTrainer` on GPU (ADR-008 note). |
23
+ | (a′) | **Hint generation** (the #1 reproducibility gap — unstated in *every* Cursor artifact) | **PARTIAL (designed-around)** | `hint_generator.py` layered: `TemplateHintGenerator` → `RawErrorHintGenerator` (routed) → `LLMJudgeHintGenerator` (cached, clamped) → `default_composite()`. ADR-009/012. | Layers 1-3 built; **layer 4 (`SiblingBootstrapGenerator` = SDPO "successful-rollout-as-implicit-feedback")** is design-only. LLM-judge path needs a live model wired (Bedrock). |
24
+ | (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
25
+ | (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
26
+ | (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
27
+ | (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **PARTIAL — DOCUMENTED INFIDELITY** | `composer_trainer.py:496-509` documents that TRL's `_compute_loss` uses **k3-in-loss** (`exp(Δ)−Δ−1`), NOT k1. `test_dr_grpo_config_and_alignment.py::test_trl_kl_estimator_is_k3_not_k1` pins this. Honest delta, not patched. | **The evidence says this delta matters for the "next level":** arXiv:2512.21852 + TRL #4967 + verl (k1-in-reward only) show k1-in-reward ↑ OOD generalization; k3-in-reward can collapse. Composer chose k1 deliberately. Fix is implementable (see Gap #1). |
28
+ | (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
29
+ | (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
30
+ | (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
31
+ | (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
32
+ | (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **MISSING** | Reward is pure test-pass-fraction (`env.py::_grade`). No auxiliary reward array. `integrations/altered_minds/reward.py` is an MMLU-format reward for ADR-013 ladder, not the Composer behavior-reward suite. | Fully specified in research/10; reproducible without the hint mystery. Build a `behavior_rewards.py` reward-fn bank. |
33
+ | (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **MISSING** | — | Trivially implementable (≈30 LOC reward shaper over {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns}). Induces parallel tool calls per the report. |
34
+ | (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
35
+ | (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |
36
+
37
+ **Rubric A score:** 5 FULLY-REPLICATED of the *core* published recipe (a, b1, b2, c1, g), 1 partial-around (a′), 1 documented-infidelity (c2), 1 intentional-skip (d, e, k), and **3 missing-but-specified Composer-2 behavior-shaping items (h, i, j)** that are the cheapest available "next-level" wins.
38
+
39
+ ---
40
+
41
+ ## RUBRIC B — Load-bearing papers
42
+
43
+ | Paper (cluster) | Status | Implementing file / evidence | Gap |
44
+ |---|---|---|---|
45
+ | **SDPO / OPSD** (2601.20802 / 2601.18734) | **FULLY-REPLICATED** | `opsd.py` (byte-for-byte OPSD JSD), `composer_trainer._compute_sdpo_loss`. | SDPO "successful-rollout-as-implicit-feedback" lever (=sibling bootstrap) not built. |
46
+ | **Dr.GRPO** (2503.20783) | **FULLY-REPLICATED** | `make_dr_grpo_config`; tests pin length-std-off + no-std-norm. | KL is k3-not-k1 (see Rubric A c2). |
47
+ | **DAPO / GSPO / CISPO / BNPO** (2503.14476 / 2507.18071 / 2506.13585) | **FULLY-REPLICATED (as menu)** | `PO_OBJECTIVES` presets + drift guards (ADR-014). Note Composer-2 *rejected* DAPO overlong-masking at small scale. | None — menu exceeds Composer's own single choice. |
48
+ | **DiLoCo / Streaming-DiLoCo** (2311.08105 / 2501.18512) | **PARTIAL** | `diloco/__init__.py::make_diloco_outer_loop`, `diloco/serverless/allreduce.py::ObjectStoreAllReduce` (s3://), `MockManager` mirroring torchft.Manager. ADR-003/005. | Streaming-DiLoCo (overlapped/quantized comm, partial-param sync) not implemented — plain DiLoCo outer loop only. Real multi-replica AWS run unproven (executor skeletons). |
49
+ | **World-model cluster** (MuZero 1911.08265, Dreamer 2301.04104, CWM 2510.02387, Chain-of-World 2603.03195) | **MISSING (design only)** | None. Report §2 designs a parameter-isolated next-state head + `<deliberate>` token as a *second SDPO mode*; calibration (ECE/Brier) primary, Foresight@k kill-ablation. | **Next-state head is NOT built — purely designed.** This is the project's stated end-goal. |
50
+ | **SimPO / TAID / Entropy-OPD distillation** (ADR-007) | **FULLY-REPLICATED** | `distillation/{simpo.py,taid.py,entropy_aware_opd.py}`, wired into `loss.py::compose_loss(dpo_variant=, sdpo_wrapper=, taid_t=)`; tests `test_distillation_losses.py`, `test_taid_parity.py`. | Production trainer (`composer_trainer.py`) only exposes SDPO/DPO; SimPO/TAID/Entropy-OPD live in the verification-mirror `compose_loss`, not the live GRPO subclass. |
51
+ | **PRIME-RL / verl / Monarch** (ADR-006) | **PARTIAL** | `recipes/prime_rl/composer_loss.py` (Ch1+Ch3, raises NotImplementedError for SDPO — needs full logits, ADR-008), `recipes/monarch/actors.py`, parity harness. | PRIME-RL can't host SDPO (logits gap upstream); Monarch is actor skeletons; verl `AsyncServer` (the report's scale answer for tool-heavy trees) not integrated. |
52
+ | **Prune-vs-train-on-all evidence** (RAFT 2504.11343, neg-gradient 2505.18830, near-miss 2503.14391, expert-failure, CCA/PURE) | **PARTIAL (ladder scaffolded)** | ADR-013 A0–A4 isolated-channel ladder + `integrations/altered_minds/{ladder.py,kl_logging.py,reward.py}`. | The report's **P0–P6 "generate-once/route-many" branch-usage axis** (the experiment that *settles* prune-vs-all) is design-only. No typed-train-on-all routing. |
53
+
54
+ **Rubric B score:** 4 FULLY-REPLICATED (SDPO/OPSD, Dr.GRPO, PO-menu, SimPO/TAID/Entropy-OPD), 3 PARTIAL (DiLoCo, PRIME-RL/verl/Monarch, prune-vs-all ladder), **1 MISSING (the entire world-model cluster — the stated goal).**
55
+
56
+ ---
57
+
58
+ ## THE NOVEL EXTENSION — multi-model Monte-Carlo tree-of-work + world-model deliberation head
59
+
60
+ **Status: 100% design, 0% code.** Verified by grep: no `SiblingBootstrap*`, no `world_model`/`WorldModel`, no `<deliberate>`/`MCTS`/`next_state_head` in `composer_replication/` (only docstring uses of the word "deliberately"). Every reference lives in `research/` notes and the deep-research report.
61
+
62
+ What exists today (the *ancestor*): `teacher_replay.py` is **flat depth-1** (N teachers query the same `state["messages"]`, nobody applies the action) with **teacher-plurality fitness** (`Counter` over normalized actions, `extract_dpo_pairs`). The report's two changes — (1) **recursion** (apply each candidate via `FeatureDeletionEnv.step()` → new state → branch again) and (2) **execution-oracle fitness** (`_grade()`'s masked pass-fraction replaces plurality) — are the entire idea and are unbuilt.
63
+
64
+ **Minimal first build (matches report Phase 1, the "one cheap, clearly-worth-it change"):**
65
+ - A tree controller that, between branches, calls `FeatureDeletionEnv.step(action)` and grades leaves with `_grade()` — turning Channel-3 depth-1 stars into a real tree **with NO new loss term**. The divergence signal enters as the SDPO teacher's privileged-info conditioning (the reserved `SiblingBootstrapGenerator` slot: select max-reward sibling, emit "a working approach looks like: …", feed the *same* `ctx_teacher` splice).
66
+ - This is ~1 module (`datagen/tree_controller.py`, ~200-300 LOC) + `SiblingBootstrapGenerator` (~60 LOC in `hint_generator.py`). The collator and loss are untouched.
67
+ - **Divergence-gating is mandatory** (report §3/§10): branch only where sibling next-action distributions already disagree, else collapse to one rollout — turns O(N^D) into O(N·decision-points). Ungated cost ≈$64/trace vs $0.98 flat.
68
+
69
+ **The world-model head (report Phase 4, gated on the P0–P6 verdict):** a parameter-isolated next-state adapter + `<deliberate>` token as a *second SDPO mode* — splice the realized post-action observation (stdout, tool_error kind, signed FAIL_TO_PASS delta, one-line diff) into the teacher context as privileged info, distill the student toward the foreseen-outcome distribution. Carrier requires no new kernel (rides `generalized_jsd_loss`). Build only if Foresight@k ≠ 0.
70
+
71
+ ---
72
+
73
+ ## Prioritized "to fully replicate + extend" gap list (build order)
74
+
75
+ Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.
76
+
77
+ **Tier 0 — cheap fidelity fixes the evidence says move OOD generalization (do first):**
78
+ 1. **k1-in-reward KL** (Rubric A c2). Add a `kl_estimator="k1"` + `use_kl_in_reward=True` path to the trainer: compute `−log r` per token, fold into the *advantage/reward* (not the loss), set TRL `beta=0.0` to disable its k3-in-loss term. Mirror TRL issue #4967 / verl's choice. `composer_trainer.py` ~60 LOC + test flipping the pinned k3 assertion. **This is the highest-fidelity-leverage single change.**
79
+ 2. **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — the aux scalar reward array (style/communication/unfinished-todo) + the nonlinear length/effort penalty `C_length` (exact eq. in research/10), as TRL `RewardFunc`s composable with `env.reward_fn`. ~120 LOC. Reproducible *without* the hint mystery; directly targets Composer's "communication style + effort calibration" goal.
80
+
81
+ **Tier 1 — close the highest-value PARTIALs:**
82
+ 3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.
83
+ 4. **`SiblingBootstrapGenerator`** (Rubric A a′ layer 4 + the SDPO implicit-feedback lever): ~60 LOC in `hint_generator.py`, wired into `default_composite`. Unblocks the tree's "zero-new-loss-term" wiring.
84
+ 5. **SFT-first stage** (Rubric A d): a thin SFT recipe over clean winning trajectories before RL (report §5 "competence floor"). Reuse TRL `SFTTrainer`.
85
+
86
+ **Tier 2 — the novel extension (the report's phased ladder):**
87
+ 6. **Tree controller + execution-oracle fitness** (recursion, the core delta): `datagen/tree_controller.py` — env-step between branches, `_grade()` leaves, divergence-gated expansion. Report Phase 1+2.
88
+ 7. **P0–P6 typed-train-on-all routing** (settles prune-vs-all): extend ADR-013's ladder with generate-once/route-many on a shared tree; primary metrics = near-miss calibration (ECE/Brier) + Foresight@k, pass@1 secondary. Report Phase 3.
89
+ 8. **World-model next-state head** (the stated goal): parameter-isolated adapter + `<deliberate>` token as a second SDPO mode. Build *only if* P4/P6 beat P0–P3 on foresight. Report Phase 4.
90
+
91
+ **Tier 3 — production-fidelity infra (Anyrun analogue on AWS):**
92
+ 9. **Real AWS sandbox tiering** (Rubric A f): EKS gVisor `runsc` RuntimeClass default → Kata+Firecracker (self-managed node groups; EKS Managed Node Groups override the CPU-Options needed for nested virt) → container-free SWE-MiniSandbox for high fan-out. Egress-off. us-west-2, the live SageMaker S3 buckets as the trace/rendezvous store.
93
+ 10. **`EKSExecutor` + `SageMakerExecutor` flesh-out + `[serverless]` dep bump** (s3fs/boto3/kubernetes): make the DiLoCo-over-S3 substrate actually launch multi-replica on the live account. Report §9.
94
+ 11. **verl `AsyncServer` backend** for the tool-heavy tree (TRL has no async GPU-decoupled agent loop). Report §8/§10 Phase 5.
95
+
96
+ **Tier 4 — defense-in-depth completion:**
97
+ 12. **Offline LLM-judge hack monitor** (EvilGenie-style, via Bedrock) as a flagging-only monitor (never the training reward); the report warns `HackMonitor` validated on constructed examples likely misses in-the-wild hacks.
98
+
99
+ ---
100
+
101
+ ## What "FULLY taking advantage" would mean, concretely
102
+
103
+ - **Published Composer 2.5 recipe (Ch1 Dr.GRPO + Ch2 SDPO):** essentially done; the *only* fidelity infidelity is k3-vs-k1 KL (Gap #1) and the missing live-GPU SDPO smoke (Gap #3).
104
+ - **Composer-2's reproducible behavior-shaping (aux rewards + length penalty + self-summarization):** the cheapest unrealized "next-level" wins — fully specified, zero hint-mystery, currently 0% built (Gaps #2 + self-summarization).
105
+ - **The papers' frontier the repo *exceeds* Composer on:** the PO-objective menu (6 objectives vs Composer's 1), the distillation menu (SimPO/TAID/Entropy-OPD), and the run-level collapse kill-switch. These are genuine over-delivery.
106
+ - **The novel bet (tree-of-work + world-model head):** the literature says build it *as a falsifiable ablation, not a premise* (report §3/§7). It is the project's differentiator and is entirely unbuilt — Tier 2 is where "next level" beyond Composer actually lives, conditioned on the P0–P6 verdict.