Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Wave 20: 5-facet AWS-native architecture design (F1-F5)
Browse filesAnswers the dataset-gen+SFT-vs-RL-vs-BOTH question with 5 grounded
facet designs from the aws-native-architecture-design workflow:
- F1 systems-framing: BOTH = two loops at two timescales, not two phases;
component->loop mapping table; SFT-first ordering; S3 dataset contract
(6 typed prefixes); repo delta (tree_controller, s3_layout, sft_floor).
- F2 aws-datagen: per-stage native verdict (Glue 5.0 Spark ingest ->
Bedrock batch replay -> AWS Batch array on Spot test-exec -> EMR
Serverless normalize), Step Functions orchestration.
- F3 rl-sagemaker: runnable-now g5.2xlarge GSM8K GRPO smoke (<$1),
colocated vLLM, BYO container from PyTorch DLC; live-verified quota.
- F4 decoupled-diloco-s3: N single-instance jobs, ObjectStoreAllReduce
PUT-poll-mean over S3, strong-consistency correctness, IAM gotcha,
cheapest validating-run ladder.
- F5 fidelity-audit: rubric vs Composer 2.5/Composer-2 + load-bearing
papers; 5 FULLY-REPLICATED, k3-vs-k1 KL documented infidelity,
missing behavior-shaping recipe, tree+world-model 100% design;
prioritized build order.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@@ -0,0 +1,229 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# F1 — Systems Framing: Dataset-Gen+SFT, RL, or BOTH?
|
| 2 |
+
|
| 3 |
+
**Facet question:** Is this a dataset-generation+SFT system, an RL system, or BOTH?
|
| 4 |
+
|
| 5 |
+
**Committed answer: BOTH — and decisively so — structured as TWO LOOPS at two
|
| 6 |
+
timescales, not two phases.** The repo already physically contains both halves.
|
| 7 |
+
The OUTER (slow) loop is dataset/curriculum *construction*; the INNER (fast)
|
| 8 |
+
loop is RL. They feed each other continuously: the inner loop's improved student
|
| 9 |
+
generates the outer loop's next seed traces, and the inner loop's learned
|
| 10 |
+
deliberation-confidence becomes the outer loop's branch gate. This is the report
|
| 11 |
+
§5 verdict ("two loops at different timescales, not two phases") made concrete
|
| 12 |
+
against the code.
|
| 13 |
+
|
| 14 |
+
This is not an architectural opinion — it is forced by what the modules *are*:
|
| 15 |
+
|
| 16 |
+
| Repo component | What it is, mechanically | Loop |
|
| 17 |
+
|---|---|---|
|
| 18 |
+
| `ingestion/claude_code.py` | Claude Code JSONL → `TraceState` (one node per assistant turn; `tool_error` flag; `strip_thinking=False`) — produces seed traces | **OUTER** (dataset) |
|
| 19 |
+
| `teacher_replay.py` | N-teacher OpenRouter replay + `extract_dpo_pairs` — OFFLINE dataset gen, flat depth-1 stars hung off a frozen trace | **OUTER** (dataset) |
|
| 20 |
+
| `datagen/substrates.py` (`SweBenchAdapter`) | SWE-bench tuple → `FeatureDeletionTask` (revert gold patch = synthesize broken repo) — task SYNTHESIS | **OUTER** (curriculum) |
|
| 21 |
+
| `datagen/env.py` `FeatureDeletionEnv.step()/_grade()` | execution oracle: runs action in sandbox, returns observation; `_grade()` = masked `FAIL_TO_PASS` pass-fraction — this is the **RL env reward kernel** | **OUTER** (fitness) AND consumed by **INNER** (`reward_fn`) |
|
| 22 |
+
| `datagen/curriculum.py` `DifficultyCurriculum` | p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02 — selection/sampling for the next round | **OUTER** (selection) |
|
| 23 |
+
| The multi-model MCTS tree controller (to build) | recursion: apply each model's action via `env.step()`, branch again — depth-1 stars → tree | **OUTER** (the core delta) |
|
| 24 |
+
| `trainer/composer_trainer.py` `ComposerReplicationTrainer` | a real `trl.GRPOTrainer` subclass; `total = grpo + α·sdpo + β·trace_replay_dpo` — **this is RL** | **INNER** (fast RL) |
|
| 25 |
+
| `datagen/env.py` `FeatureDeletionEnv.reward_fn` | TRL `RewardFunc` adapter (`prompts, completions → list[float]`) — the env wearing its RL face | **INNER** (RL reward) |
|
| 26 |
+
| `loss.py` `compose_loss` | TRL-free 3-channel harness; Channel 1 = LM cross-entropy on response tokens — **this is the SFT-able face** (BC limit GRPO converges to) | **INNER**, also the **SFT-first floor** |
|
| 27 |
+
| `safety/holdout.py` + `safety/kill_switch.py` (`HeldOutGuard`/`HeldoutSplit`), wired into the trainer | run-level collapse/reward-hacking tripwire on proxy-minus-realeval gap — an **RL-run safeguard** | **INNER** (safety) |
|
| 28 |
+
|
| 29 |
+
So: `FeatureDeletionEnv.reward_fn` is an RL env. `teacher_replay` is offline
|
| 30 |
+
dataset gen. `ComposerReplicationTrainer(trl.GRPOTrainer)` is RL.
|
| 31 |
+
`compose_loss`'s `lm_ce` channel is an SFT harness. `HeldOutGuard` is an RL-run
|
| 32 |
+
safeguard. The system *is* both, by construction.
|
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## The SFT-first competence floor, THEN RL (mirrors Cursor's CPT+SFT→RL)
|
| 37 |
+
|
| 38 |
+
`docs/COMPOSER_RECIPE_MAPPING.md` confirms Cursor's ordering is **Continued
|
| 39 |
+
Pretraining → SFT → RL**, and that the repo deliberately *skips* CPT (starts
|
| 40 |
+
from an already-code-tuned base, e.g. `Qwen3-Coder-7B` / `Qwen3-Coder-30B-A3B`).
|
| 41 |
+
That leaves a two-step ordering this facet commits to:
|
| 42 |
+
|
| 43 |
+
1. **SFT-first (competence floor).** Take the OUTER loop's *clean winning
|
| 44 |
+
trajectories* (oracle-clean `_grade()` passes only — Gate 1) and run standard
|
| 45 |
+
SFT. The carrier already exists: `compose_loss` with `alpha_sdpo=0,
|
| 46 |
+
beta_replay=0` reduces to `_lm_response_ce` — next-token cross-entropy masked
|
| 47 |
+
to assistant-response tokens. This is the "GRPO converges to BC under
|
| 48 |
+
deterministic rewards" limit stated in `loss.py`. It establishes a floor so
|
| 49 |
+
GRPO has a non-degenerate starting policy. (For a clean separation, an SFT
|
| 50 |
+
`Trainer` over the same corpus is equivalent; the point is the *corpus* —
|
| 51 |
+
winning leaves — and the *masking* — response tokens only.)
|
| 52 |
+
2. **THEN RL.** `ComposerReplicationTrainer` runs GRPO/Dr.GRPO (`make_po_config`)
|
| 53 |
+
on the `FeatureDeletionEnv.reward_fn` execution oracle, with the optional
|
| 54 |
+
SDPO (α) and trace-replay-DPO (β) channels, plus the contested world-model
|
| 55 |
+
next-state head as a **second SDPO mode** (parameter-isolated; report §2/§4).
|
| 56 |
+
|
| 57 |
+
This is exactly the report §5 line: "SFT-first establishes a competence floor on
|
| 58 |
+
clean winning trajectories before RL — mirroring Cursor's CPT+SFT→RL ordering and
|
| 59 |
+
the repo's own outer (datagen/teacher_replay) / inner
|
| 60 |
+
(`ComposerReplicationTrainer`) split."
|
| 61 |
+
|
| 62 |
+
---
|
| 63 |
+
|
| 64 |
+
## The single diagram-able data flow
|
| 65 |
+
|
| 66 |
+
```
|
| 67 |
+
┌──────────────────────────────────────────────────────────────┐
|
| 68 |
+
│ OUTER LOOP (slow: hours→days; bursty, Spot-friendly, EKS) │
|
| 69 |
+
│ │
|
| 70 |
+
raw base model ───►│ ingestion/claude_code.py (seed traces, TraceState) │
|
| 71 |
+
+ Claude traces │ │ │
|
| 72 |
+
│ ▼ │
|
| 73 |
+
│ multi-model MCTS tree controller ──► N models branch │
|
| 74 |
+
│ (divergence-gated; teacher_replay generalized flat→tree) │
|
| 75 |
+
│ │ │
|
| 76 |
+
│ ▼ apply each action │
|
| 77 |
+
│ FeatureDeletionEnv.step() ─► sandbox exec ─► new state │
|
| 78 |
+
│ │ │
|
| 79 |
+
│ ▼ leaf grade │
|
| 80 |
+
│ FeatureDeletionEnv._grade() (masked FAIL_TO_PASS pass-frac) │
|
| 81 |
+
│ │ + DifficultyCurriculum.update (p(1-p) selection) │
|
| 82 |
+
│ ▼ HARVEST + TYPE the divergence │
|
| 83 |
+
└────────────┼───────────────────────────────────────────────────┘
|
| 84 |
+
▼
|
| 85 |
+
┌──────────────────── S3 ────────────────────────────────────────────────┐
|
| 86 |
+
│ s3://<bucket>/runs/<run_id>/ │
|
| 87 |
+
│ sft_corpus/ ← clean WINNING trajectories (SFT-first) │
|
| 88 |
+
│ dpo_pairs/ ← near-miss (chosen=sibling winner, rejected=loser)│
|
| 89 |
+
│ rl_task_pool/ ← FeatureDeletionTask registry + curriculum priors │
|
| 90 |
+
│ wm_tuples/ ← (state, action, next_state, outcome) — ALL branches│
|
| 91 |
+
│ divergence_pairs/ ← divergence-annotated nodes (where siblings forked)│
|
| 92 |
+
│ holdout/ ← DISJOINT held-out eval anchor (never fed back) │
|
| 93 |
+
│ diloco_rendezvous/ ← round_{NNNNNN}/rank_{RRRR}.pt (ObjectStoreAllReduce)│
|
| 94 |
+
└────────────┬──────────────────────────────────────────────────────────────┘
|
| 95 |
+
▼
|
| 96 |
+
┌───────── SFT JOB ───────────────┐ (compose_loss lm_ce / SFT Trainer)
|
| 97 |
+
│ sft_corpus → competence-floor ckpt│ ──► seeds the RL init policy
|
| 98 |
+
└────────────┬──────────────────────┘
|
| 99 |
+
▼
|
| 100 |
+
┌──────────────────────────────────────────────────────────────┐
|
| 101 |
+
│ INNER LOOP (fast: minutes→steps; resilience-bound, GPU) │
|
| 102 |
+
│ ComposerReplicationTrainer (trl.GRPOTrainer subclass) │
|
| 103 |
+
│ total = grpo(reward_fn on _grade) │
|
| 104 |
+
│ + α·sdpo (hint-distill, JSD on aligned post-hint tokens) │
|
| 105 |
+
│ + β·trace_replay_dpo (dpo_pairs) │
|
| 106 |
+
│ + [world-model next-state head — 2nd SDPO mode, gated] │
|
| 107 |
+
│ HeldOutGuard tripwire on proxy−realeval gap (kill-switch) │
|
| 108 |
+
│ DiLoCo outer-sync every ~500-1000 steps via S3 diloco_rendezvous │
|
| 109 |
+
└────────────┬──────────────────────────────────────────────────┘
|
| 110 |
+
▼
|
| 111 |
+
improved student model
|
| 112 |
+
│
|
| 113 |
+
┌────────────────────────┴────────────────────────────────────────────────┐
|
| 114 |
+
│ FEEDBACK (why loops, not phases): │
|
| 115 |
+
│ 1. improved student generates the next round's seed traces (back to OUTER)│
|
| 116 |
+
│ 2. its learned deliberation-confidence becomes the next round's branch │
|
| 117 |
+
│ gate (the §3 bootstrap: cross-model disagreement early → learned │
|
| 118 |
+
│ deliberation-confidence later — same lever, two levels) │
|
| 119 |
+
└────────────────────────────────────────────────────────────────────────────┘
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
---
|
| 123 |
+
|
| 124 |
+
## What goes in S3 between the loops (crisp)
|
| 125 |
+
|
| 126 |
+
The boundary between OUTER and INNER is S3 — and on AWS S3 *is* the DiLoCo
|
| 127 |
+
rendezvous backend with zero new code (the `ObjectStoreAllReduce` fsspec path,
|
| 128 |
+
`round_{NNNNNN}/rank_{RRRR}.pt`). The same bucket carries the dataset hand-off.
|
| 129 |
+
Live target bucket in this account/region:
|
| 130 |
+
`s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/` (us-west-2;
|
| 131 |
+
a `sagemaker-dynamo-on-eks-hyperpod-*` bucket also already exists, evidence
|
| 132 |
+
HyperPod-on-EKS is provisioned here).
|
| 133 |
+
|
| 134 |
+
| S3 prefix | Contents | Producer (OUTER) | Consumer (INNER) |
|
| 135 |
+
|---|---|---|---|
|
| 136 |
+
| `sft_corpus/` | **SFT corpus** — clean winning trajectories (oracle-clean `_grade()` passes), masked to assistant-response tokens; JSONL | tree controller + `_grade()` Gate 1 | SFT job → `compose_loss` `lm_ce` (or SFT `Trainer`) |
|
| 137 |
+
| `dpo_pairs/` | **DPO pairs** — `{chosen=sibling-winner-or-teacher-consensus, rejected=student/losing-branch, state_messages}` (the `DPOPair` schema from `teacher_replay.py`) | `extract_dpo_pairs` generalized to sibling winners | Channel 3 (`β·trace_replay_dpo`) |
|
| 138 |
+
| `rl_task_pool/` | **RL task pool** — `FeatureDeletionTask` registry (`repo, broken_image, fail_to_pass, pass_to_pass, deleted_symbols, test_command`) + `DifficultyCurriculum` priors | `SweBenchAdapter` + curriculum | `FeatureDeletionEnv` registry → `reward_fn` |
|
| 139 |
+
| `divergence_pairs/` | **Divergence-annotated pairs** — nodes where sibling next-action distributions disagreed (the SDPO privileged-info conditioning variable + which sibling subtree separated first) | tree controller (records pre-expansion divergence) | Channel 2 SDPO `ctx_teacher` splice / `SiblingBootstrapGenerator` |
|
| 140 |
+
| `wm_tuples/` | **World-model next-state tuples** — `(state, action, next_state, outcome)` from **ALL** branches incl. failures (CWM "train-on-all" target; the safe home for failed-branch signal) | every `env.step()` observation, every leaf grade | world-model next-state head (2nd SDPO mode), gated |
|
| 141 |
+
| `holdout/` | **Disjoint held-out eval anchor** — `HeldoutSplit` partition; NEVER fed back to the generator | `HeldoutSplit.split(seed=…)` once | `HeldOutGuard.heldout_eval_fn()` |
|
| 142 |
+
| `diloco_rendezvous/` | DiLoCo pseudo-gradient exchange, `round_{NNNNNN}/rank_{RRRR}.pt` | inner replicas | inner replicas (`ObjectStoreAllReduce`) |
|
| 143 |
+
|
| 144 |
+
Note the keystone (report §4/§2): a *failed* branch is poison for the policy
|
| 145 |
+
gradient but **gold for the world model** — so a loser goes to `wm_tuples/`
|
| 146 |
+
(safe, no policy penalty) and optionally to `dpo_pairs/` as a contrastive
|
| 147 |
+
`rejected` against a sibling winner (never as raw negative gradient), but never
|
| 148 |
+
to `sft_corpus/`. That is "type the signal and route it" realized as S3 prefixes.
|
| 149 |
+
|
| 150 |
+
---
|
| 151 |
+
|
| 152 |
+
## AWS-native realization (concrete, this account)
|
| 153 |
+
|
| 154 |
+
- **OUTER loop on EKS** (single control plane): Argo Workflows DAG, one node =
|
| 155 |
+
one divergence-gated branch; vLLM RayService pods for open-weight model
|
| 156 |
+
families + API-egress pods for hosted teachers; gVisor (`runsc` RuntimeClass)
|
| 157 |
+
sandbox pods running `FeatureDeletionEnv._grade()`. Writes all six dataset
|
| 158 |
+
prefixes to S3 via IRSA.
|
| 159 |
+
- **SFT job** between loops: a SageMaker Training Job (`File` mode for the
|
| 160 |
+
`sft_corpus/` < 100 GB; `FastFile` if it grows past ~50–100 GB) OR an EKS GPU
|
| 161 |
+
pod; reads `sft_corpus/`, writes a competence-floor checkpoint to
|
| 162 |
+
`s3://…/runs/<id>/ckpt_sft/`.
|
| 163 |
+
- **INNER loop**: `ComposerReplicationTrainer` on a Karpenter p5/g6e NodePool,
|
| 164 |
+
swappable to a HyperPod-attached node-group (1:1 EKS↔HyperPod mapping, already
|
| 165 |
+
provisioned per the `sagemaker-dynamo-on-eks-hyperpod-*` bucket) for
|
| 166 |
+
resilience-bound long runs. DiLoCo replicas rendezvous only through S3 — no
|
| 167 |
+
cross-job NCCL — so a straggler blocks at the poll loop (`timeout_s=1800`)
|
| 168 |
+
instead of deadlocking a gang. `SageMakerExecutor` is the bursty fallback
|
| 169 |
+
(N single-instance Training Jobs, `EnableNetworkIsolation=False` so the
|
| 170 |
+
container can S3 PUT/GET the rendezvous).
|
| 171 |
+
|
| 172 |
+
The DiLoCo math, `MockManager`, `make_diloco_outer_loop`, the trainer, the loss,
|
| 173 |
+
and the env are **untouched** across all of this — the `ServerlessExecutor`
|
| 174 |
+
Protocol + `ObjectStoreAllReduce` are the entire portability contract.
|
| 175 |
+
|
| 176 |
+
---
|
| 177 |
+
|
| 178 |
+
## Repo delta to realize the framing
|
| 179 |
+
|
| 180 |
+
1. `composer_replication/datagen/tree_controller.py` (NEW, ~250-350 LOC) — the
|
| 181 |
+
recursion: apply each model's candidate action via `FeatureDeletionEnv.step()`,
|
| 182 |
+
branch again, grade leaves, emit the six S3 prefixes. The core OUTER delta.
|
| 183 |
+
2. `composer_replication/pipeline/s3_layout.py` (NEW, ~80 LOC) — typed writers
|
| 184 |
+
for `sft_corpus/ | dpo_pairs/ | rl_task_pool/ | divergence_pairs/ | wm_tuples/
|
| 185 |
+
| holdout/`; the OUTER→INNER contract in one place.
|
| 186 |
+
3. `composer_replication/pipeline/sft_floor.py` (NEW, ~60 LOC) — SFT-first
|
| 187 |
+
driver: read `sft_corpus/`, run `compose_loss` (`alpha_sdpo=0, beta_replay=0`)
|
| 188 |
+
or an SFT `Trainer`, write `ckpt_sft/`. Wraps existing `_lm_response_ce`.
|
| 189 |
+
4. `composer_replication/trainer/composer_trainer.py` (EDIT, ~40 LOC) — add the
|
| 190 |
+
world-model next-state head as a 2nd SDPO mode (parameter-isolated adapter +
|
| 191 |
+
`<deliberate>` token), reading `wm_tuples/`; gated OFF by default.
|
| 192 |
+
5. `composer_replication/diloco/serverless/eks.py` (EDIT/FINISH) — `EKSExecutor`
|
| 193 |
+
Indexed Job mapping for the inner loop (a few hundred LOC; sibling of
|
| 194 |
+
`ModalSpawnExecutor`).
|
| 195 |
+
6. `[serverless]` extra: add `s3fs`/`boto3`/`kubernetes` (the documented dep gap).
|
| 196 |
+
|
| 197 |
+
---
|
| 198 |
+
|
| 199 |
+
## Open questions
|
| 200 |
+
|
| 201 |
+
- Where exactly does SFT-first run — a dedicated SageMaker Training Job, or an
|
| 202 |
+
EKS pod sharing the inner NodePool? (Cost/latency tradeoff; corpus size gates
|
| 203 |
+
File vs FastFile.)
|
| 204 |
+
- Should the SFT-floor checkpoint be re-derived each outer generation
|
| 205 |
+
(full SFT→RL re-warm) or only once at gen-0 (RL-only thereafter)? The flywheel
|
| 206 |
+
feedback suggests gen-0-only, with RL carrying subsequent gains.
|
| 207 |
+
- Is the world-model head trained inside the inner GRPO trainer (shared step) or
|
| 208 |
+
as a separate offline pass over `wm_tuples/`? Report §2 favors
|
| 209 |
+
parameter-isolation; a separate pass is the strongest isolation.
|
| 210 |
+
|
| 211 |
+
## Citations
|
| 212 |
+
|
| 213 |
+
- research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md §5 (two
|
| 214 |
+
loops), §2 (world-model head as 2nd SDPO mode), §4 (type-the-signal routing),
|
| 215 |
+
§6 (reuse/build table)
|
| 216 |
+
- composer_replication/datagen/env.py:63-94 (`step`/`_grade`/`reward_fn`)
|
| 217 |
+
- composer_replication/teacher_replay.py:162-262 (`replay_trace`,
|
| 218 |
+
`extract_dpo_pairs`, `DPOPair`)
|
| 219 |
+
- composer_replication/trainer/composer_trainer.py:54-178 (3-channel
|
| 220 |
+
`_compute_loss`), :184-251 (`HeldOutGuard` wiring)
|
| 221 |
+
- composer_replication/loss.py:71-261, :277-304 (`compose_loss`, `_lm_response_ce`)
|
| 222 |
+
- composer_replication/safety/holdout.py (`HeldoutSplit` disjointness)
|
| 223 |
+
- composer_replication/diloco/serverless/{executor.py,allreduce.py,sagemaker.py}
|
| 224 |
+
(Protocol + `ObjectStoreAllReduce` + S3 rendezvous)
|
| 225 |
+
- composer_replication/ingestion/claude_code.py:6-21 (one-node-per-turn,
|
| 226 |
+
`strip_thinking`)
|
| 227 |
+
- docs/COMPOSER_RECIPE_MAPPING.md:48-137 (CPT+SFT→RL ordering; repo skips CPT)
|
| 228 |
+
- Live: `aws s3 ls` → amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d,
|
| 229 |
+
sagemaker-dynamo-on-eks-hyperpod-* (us-west-2, acct 386931836011)
|
|
@@ -0,0 +1,191 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Facet F2 — AWS-Native Dataset Generation (the outer loop)
|
| 2 |
+
|
| 3 |
+
**Account 386931836011 · us-west-2 · Admin · 2026-06-09**
|
| 4 |
+
|
| 5 |
+
The outer (slow) loop of §5 — ingest → multi-teacher replay → invert SWE substrates into FeatureDeletion tasks → normalize to SFT/DPO corpus on S3 — is **embarrassingly parallel, bursty, fault-tolerant** (report §5, §8). The wrong move is to pick one service for the whole loop. Each of the four stages has a different compute shape, so each gets the right AWS-native primitive, all stitched by **Step Functions** (the AWS-native analog of the report's Argo controller for the *non-cluster* path) and all reading/writing one S3 dataset contract.
|
| 6 |
+
|
| 7 |
+
This note commits to a concrete service per stage and maps every choice to the repo file it backs.
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## TL;DR — the per-stage verdict
|
| 12 |
+
|
| 13 |
+
| Outer-loop stage | Repo code | Compute shape | **AWS service (committed)** | Why not the others |
|
| 14 |
+
|---|---|---|---|---|
|
| 15 |
+
| (a) Ingest + clean Claude Code traces | `ingestion/claude_code.py` | Light CPU, JSONL → TraceState, IO-bound | **Glue Spark ETL job** (Glue 5.0) | Trivial scale; one Spark job over a JSONL prefix; Glue's Data Catalog + bookmarks are free here. SM Processing also fine but Glue is the cheaper IO-bound default. |
|
| 16 |
+
| (b) Replay each state across N teachers | `teacher_replay.py` | API-bound fan-out, N×T calls, no local GPU | **Bedrock batch inference** (`CreateModelInvocationJob`, one job per teacher) + a thin **EMR Serverless** aggregation step | NOT Glue-Ray (end-of-support). NOT SageMaker-Processing-with-Ray for the *calls* (you'd pay for idle instances waiting on API latency). Bedrock batch = 50% discount, S3-native, the fan-out IS the job. |
|
| 17 |
+
| (c) Synthesize FeatureDeletionEnv tasks (run tests in sandboxes) | `datagen/substrates.py`, `env.py`, `docker_sandbox.py`, `validator.py` | CPU-heavy, untrusted code, 1 container per task, fault-tolerant | **AWS Batch array jobs on EC2 Spot** (container = the existing `DockerSandbox` image) | NOT SM Processing (no per-task isolation/retry granularity; 1 job = 1 fleet). NOT Fargate-only (no `--privileged`/gVisor, 4-vCPU cap, no Spot-as-cheap). Batch array index = task index, matches SWE-bench `--max_workers` model exactly. |
|
| 18 |
+
| (d) Normalize → SFT corpus + DPO pairs | `replaysim/normalize.py` (data-juicer), `teacher_replay.extract_dpo_pairs` | CPU Spark/dataframe dedup + the data-juicer op-graph | **EMR Serverless (Spark)** running data-juicer's Ray/standalone executor in `--mode local` per partition, OR SM Processing for the single-node data-juicer path | EMR Serverless is the Spark-style dataframe normalize/dedup home; data-juicer runs CPU-only here. Glue ETL would also work but EMR Serverless gives finer Spark control + Graviton price/perf. |
|
| 19 |
+
|
| 20 |
+
The recurring principle: **API-bound and test-execution work must NOT live on a service that bills for idle wall-clock** (Glue/SM Processing keep instances hot while you wait on Bedrock or a 5-minute pytest). Bedrock batch is async-priced; Batch array jobs are per-task Spot. Spark-shaped dataframe work (ingest, dedup, normalize) goes to the Spark services (Glue ETL for the trivial ingest, EMR Serverless for the heavy normalize).
|
| 21 |
+
|
| 22 |
+
---
|
| 23 |
+
|
| 24 |
+
## Stage (a) — Ingest + clean traces · Glue 5.0 Spark ETL
|
| 25 |
+
|
| 26 |
+
`ingestion/claude_code.py::ClaudeCodeIngester.ingest()` is a pure JSONL→TraceState transform (one TraceState per assistant turn; `strip_thinking=False` per report §6 — 67% of error-recovery turns are pure thinking). This is IO-bound dataframe work over a prefix of session files.
|
| 27 |
+
|
| 28 |
+
**Service: AWS Glue 5.0 Spark ETL job.**
|
| 29 |
+
- Input: `s3://<datagen-bucket>/raw/claude_code/**/*.jsonl` (Claude Code session JSONL uploaded from `~/.claude/projects`).
|
| 30 |
+
- The Glue script `mapPartitions` over the JSONL files; each partition runs `ClaudeCodeIngester().ingest(path)` unchanged (it already yields `TraceState`), filters subagent/sidechain records, and writes Parquet.
|
| 31 |
+
- Output: `s3://<datagen-bucket>/traces/v1/run_id=<id>/part-*.parquet` partitioned `by run_id`.
|
| 32 |
+
- Glue Data Catalog table `composer_traces` makes the trace store queryable by Athena (debug: "how many error-site turns this run?").
|
| 33 |
+
|
| 34 |
+
Why Glue over SM Processing here: ingestion is small, periodic, IO-bound; Glue's per-DPU-second billing + job bookmarks (skip already-ingested sessions) fit better than spinning a Processing fleet. `ClaudeCodeIngester` runs verbatim inside the Spark UDF — zero repo change to the ingester itself.
|
| 35 |
+
|
| 36 |
+
> **Live caveat (worth flagging):** `ClaudeCodeIngester.ingest()` is a per-file, single-pass, *record-parallel* pure-Python iterator with no shuffle/join — i.e. it is **not** intrinsically Spark-shaped. **SageMaker Processing** with `ProcessingInput.S3DataDistributionType = ShardedByS3Key` (custom container, the ingester runs unchanged, slice config from `processingjobconfig.json`/`resourceconfig.json`) is the *cleaner* primitive for this exact shape and avoids any dependence on the Glue stack — relevant because **AWS Glue-for-Ray is now closed to new customers** (confirmed in the Glue engine docs), so leaning on Glue here means committing to Glue-for-Spark specifically. Verdict stands as Glue ETL for the trivial periodic ingest (catalog + bookmarks are a real convenience), with SM Processing as the equally-valid, Glue-independent fallback. Either way the ingester code is untouched.
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## Stage (b) — N-teacher replay · Bedrock batch, NOT OpenRouter
|
| 41 |
+
|
| 42 |
+
Today `teacher_replay.py` calls **OpenRouter** over httpx with `DEFAULT_TEACHERS = [claude-opus, gpt-5, deepseek-v4]` and a `max_total_usd` cap. The facet question: stay API or move to Bedrock?
|
| 43 |
+
|
| 44 |
+
### Verdict: move the AWS-native default to **Bedrock batch inference**, keep OpenRouter as a fallback adapter.
|
| 45 |
+
|
| 46 |
+
Reasoning grounded in research + repo:
|
| 47 |
+
|
| 48 |
+
1. **Bedrock batch = 50% of on-demand pricing**, S3-in/S3-out, ~24h turnaround SLA. The replay is the single most expensive piece of the whole system (report §10: ~$0.98/trace at N=3, ~$64/trace at 8-teacher×1000-step). A 50% cut on the dominant cost line is the highest-leverage AWS-native move in this facet, and replay is *exactly* the latency-insensitive offline workload Bedrock batch targets.
|
| 49 |
+
2. **The fan-out IS one job per teacher.** Bedrock batch does not multiplex models in a single job — `CreateModelInvocationJob` takes one `modelId`. So the N-teacher fan-out = N batch jobs over the *same* JSONL, each writing to its own S3 prefix. The flat `for state: for teacher:` loop in `replay_trace` becomes "emit one shared JSONL of all states, submit N jobs." This is structurally *simpler* than the current per-call gather.
|
| 50 |
+
3. **Heterogeneity survives on Bedrock — VERIFIED LIVE in this account.** `aws bedrock list-foundation-models --region us-west-2` returns a genuinely multi-lab pool: `anthropic.claude-opus-4-8`, `anthropic.claude-opus-4-7`, `anthropic.claude-opus-4-6-v1`, `anthropic.claude-sonnet-4-6`, `anthropic.claude-haiku-4-5-...`, `deepseek.v3.2`, `deepseek.r1-v1:0`, `meta.llama4-maverick-17b-instruct`, `meta.llama3-3-70b-instruct`, etc. That is Claude + DeepSeek + Llama = three different families, satisfying the report's N≥3 heterogeneous-population anti-collapse safeguard (§5 #4) and the "drop same-family teachers if Claude is the student" leakage rule (§6).
|
| 51 |
+
- **CRITICAL batch-eligibility split (read live from the Bedrock "Supported Regions and models for batch inference" table):** not every model has a *batch* ID. **Batch-eligible** in/through `us-west-2`: **DeepSeek V3.2** (`deepseek.v3.2`, single-region `us-west-2`), **Claude Sonnet 4.6 / Opus 4.6** (via the `us.` **cross-region inference profile** whose destinations include `us-west-2`), and the Nova/Titan families. **On-demand ONLY (no ARN-versioned batch ID, so NOT batchable):** **Claude Opus 4.7 / 4.8** — reachable via `bedrock-runtime` `Converse`/`InvokeModel` and fanned out with a bounded async/thread pool exactly like today's `replay_trace`, but you pay on-demand. So the cheap bulk batch pool is `{deepseek.v3.2, us.anthropic.claude-sonnet-4-6, us.anthropic.claude-opus-4-6}`; if the teacher set must include 4.7/4.8 they ride the on-demand path or are substituted by batchable Opus 4.6.
|
| 52 |
+
4. **Bedrock data does not train Anthropic models** (governance) — a real reason to prefer it over OpenRouter for a self-improving loop.
|
| 53 |
+
|
| 54 |
+
### Concrete wiring
|
| 55 |
+
|
| 56 |
+
`teacher_replay.py` gains a `BedrockBatchTeacherPool` alongside the OpenRouter path (the `TeacherSpec` slug becomes a Bedrock `modelId` or inference-profile id like `us.anthropic.claude-haiku-4-5-...`):
|
| 57 |
+
|
| 58 |
+
```
|
| 59 |
+
# new: teacher_replay_bedrock.py (~180 LOC)
|
| 60 |
+
def submit_replay_batch(states, teachers, s3_in, s3_out, role_arn) -> list[jobArn]:
|
| 61 |
+
# 1. write ONE shared abc.jsonl: {recordId: state_id, modelInput: {anthropic_version, messages, max_tokens}}
|
| 62 |
+
# one record per state (messages = state["messages"]). recordId == state_id is the join key.
|
| 63 |
+
# 2. for each teacher: bedrock.create_model_invocation_job(
|
| 64 |
+
# modelId=teacher.model_id, jobName=..., roleArn=role_arn,
|
| 65 |
+
# inputDataConfig={"s3InputDataConfig":{"s3Uri": s3_in}},
|
| 66 |
+
# outputDataConfig={"s3OutputDataConfig":{"s3Uri": f"{s3_out}/{teacher.slug}/"}})
|
| 67 |
+
# 3. return jobArns; poll get_model_invocation_job until Completed.
|
| 68 |
+
```
|
| 69 |
+
|
| 70 |
+
The output `.jsonl.out` rows carry `recordId` (=state_id) + `modelOutput`; an EMR Serverless step joins all N teacher outputs back by `state_id` into the exact `list[TeacherCallResult]` shape `extract_dpo_pairs` already consumes — so `extract_dpo_pairs` and the entire DPOPair contract are **byte-for-byte untouched**.
|
| 71 |
+
|
| 72 |
+
**Constraints to honor (from research):** Bedrock batch input file ≤ 1 GB, total job ≤ 5 GB, default 100k records/job (adjustable via Service Quotas), ≥ a minimum-records floor per model. The submitter must shard a >100k-state run into multiple JSONL files. A `BedrockBatchTeacherPool` is the right home for that sharding.
|
| 73 |
+
|
| 74 |
+
**Where OpenRouter stays:** when N>3 teachers from labs Bedrock doesn't host (e.g. a brand-new frontier model), or when sub-24h latency is needed for a hot debugging loop. The `TeacherSpec` TypedDict already abstracts provider; we add a `provider: "bedrock"|"openrouter"` field and route. This keeps `replay_trace`'s async OpenRouter path as the low-latency escape hatch and Bedrock batch as the cheap bulk default.
|
| 75 |
+
|
| 76 |
+
### Why NOT Glue-Ray / SM-Processing-with-Ray for the fan-out
|
| 77 |
+
The facet floats "Glue Ray jobs / SageMaker Processing with Ray for the N-model parallel fan-out." **AWS Glue for Ray is end-of-support** — AWS now explicitly recommends Ray-on-EKS (KubeRay) instead (docs: *AWS Glue for Ray end of support*). And running the *teacher API calls* on a Ray cluster (Glue or SM Processing) means paying for idle CPU/instances while threads block on model latency — the anti-pattern above. Ray's place in this facet is **stage (c)** orchestration of sandbox fan-out *if* we ever want Ray semantics, but Batch array jobs are the simpler AWS-native fit there too. So: **Bedrock batch for the calls, EMR Serverless to fan-in.**
|
| 78 |
+
|
| 79 |
+
---
|
| 80 |
+
|
| 81 |
+
## Stage (c) — FeatureDeletion synthesis + test execution · AWS Batch array jobs on Spot
|
| 82 |
+
|
| 83 |
+
This is the compute-heavy, genuinely-new part (report §8: "per-branch sandbox isolation is the throughput ceiling of the whole idea"). Two sub-steps:
|
| 84 |
+
|
| 85 |
+
**c1. Schema inversion** — `SweBenchAdapter.to_task(instance)` (`substrates.py`) maps a SWE-bench-shaped row → `FeatureDeletionTask` (revert gold patch = broken repo; FAIL_TO_PASS = reward target; PASS_TO_PASS = guard). Pure CPU, no Docker. This runs inside the **Glue ingest job (a)** or a tiny Lambda — it's a dict transform. `is_redistributable()` license-gate (GPL/AGPL/LGPL filter) runs here.
|
| 86 |
+
|
| 87 |
+
**c2. Materialize + validate the broken repo** — this is where it gets heavy. For each task: pull the substrate's frozen image, `git apply -R` the gold patch, `scrub_tree()` (the PRIMARY reward-hack control — strip `__pycache__`/`.git`/`*.pyc`), run the test command, confirm FAIL_TO_PASS actually fails and PASS_TO_PASS actually passes (the 4-gate `validator.py` solvability check). Each task = one untrusted, isolated, ~1-5 minute container run.
|
| 88 |
+
|
| 89 |
+
**Service: AWS Batch array jobs, EC2 Spot compute environment, container = the existing `DockerSandbox` image.**
|
| 90 |
+
|
| 91 |
+
```
|
| 92 |
+
batch.submit_job(
|
| 93 |
+
jobName=f"fd-validate-{run_id}",
|
| 94 |
+
jobQueue="fd-sandbox-spot",
|
| 95 |
+
jobDefinition="fd-docker-sandbox:N", # the DockerSandbox image in ECR
|
| 96 |
+
arrayProperties={"size": n_tasks}, # up to 10,000 children
|
| 97 |
+
containerOverrides={"environment": [{"name":"TASK_MANIFEST_S3","value": s3_manifest}]})
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
+
- Each array child reads `AWS_BATCH_JOB_ARRAY_INDEX`, looks up its task at that line in the S3 task manifest (`tasks/v1/run_id=<id>/manifest.jsonl`), boots the broken image, runs `validator` + `FeatureDeletionEnv._grade()` against `LocalSubprocessSandbox`/`DockerSandbox`, writes one result row to `s3://.../task_grades/<run_id>/<task_id>.json`.
|
| 101 |
+
- This is **exactly the SWE-bench harness `--max_workers` model**: SWE-bench runs 1 container per instance with N parallel workers; DeepSWE hit Docker-daemon limits at 512 containers/iteration and preloaded images onto NVMe (report §8). Batch array jobs give that fan-out with managed retry (`retryStrategy`), Spot interruption handling, and per-child CloudWatch logs — without us building a container scheduler.
|
| 102 |
+
- **Isolation tier (report §8 layered posture):** the `DockerSandbox` already bakes the lockdown recipe (`network_mode=none`, `read_only`, `cap_drop=ALL`, `no-new-privileges`, `pids_limit`, gVisor `runtime='runsc'` when available). On Batch EC2 (not Fargate), we can install gVisor on the AMI and run untrusted model-generated fix attempts under `runsc` by default; Kata+Firecracker for adversarial code requires **self-managed** node groups (the EKS-MNG CPU-Options gotcha from §8 applies to Batch managed EC2 too — use a self-managed launch template if nested virt is needed).
|
| 103 |
+
|
| 104 |
+
### Why NOT SageMaker Processing for the sandboxes
|
| 105 |
+
SM Processing is one job = one fixed instance fleet with `ShardedByS3Key` splitting input across instances. It has **no per-task retry**, no Spot-native interruption recovery at task granularity, and no array-index primitive — a single poisoned task (infinite loop, fork bomb) can wedge a whole Processing instance's shard. Batch array jobs isolate each task to its own container with its own timeout (`DockerSandbox.exec_timeout_s` + Batch `timeout`), retry, and Spot replacement. For "10,000 untrusted code executions, each independent, some will hang," Batch is the textbook fit and SM Processing is not.
|
| 106 |
+
|
| 107 |
+
### Why NOT plain Fargate
|
| 108 |
+
Fargate caps at 4 vCPU / 30 GB without `--privileged`, cannot run gVisor/Kata, and is pricier than Spot for bursty bulk. Batch-on-EC2-Spot is 60-70% cheaper for this fault-tolerant fan-out (report §10 Spot savings).
|
| 109 |
+
|
| 110 |
+
---
|
| 111 |
+
|
| 112 |
+
## Stage (d) — Normalize → SFT corpus + DPO pairs · EMR Serverless (Spark)
|
| 113 |
+
|
| 114 |
+
After (b) fan-in produces `list[TeacherCallResult]` and `extract_dpo_pairs` produces `DPOPair` rows, `replaysim/normalize.py::DJNormalizer` runs the **data-juicer op-graph** (`recipes/replaysim/default.yaml`: length filter ×2, words-num ×2, special-char ×2, document_deduplicator). ADR-004 chose data-juicer precisely because the op set is **CPU-only** (no NeMo-Curator GPU dep).
|
| 115 |
+
|
| 116 |
+
**Service: EMR Serverless (Spark) application, Graviton.**
|
| 117 |
+
- The normalize is Spark-shaped dataframe work: read all DPOPair rows for a run, run the data-juicer op-graph, dedup across the corpus, write the partitioned corpus.
|
| 118 |
+
- data-juicer's `DefaultExecutor` runs in-process per Spark partition (the repo's `DJNormalizer.normalize()` already does file-in/file-out via `init_configs` + `DefaultExecutor`). On EMR Serverless we `mapPartitions` → run `DJNormalizer(skip_dj=False).normalize(partition_rows)` per partition, then a Spark `dropDuplicates` for cross-partition full-corpus dedup (the repo notes `document_deduplicator` is per-batch only — Spark closes that gap natively).
|
| 119 |
+
- EMR Serverless auto-scales workers, bills per vCPU/GB-second on Graviton, and is the AWS-native Spark home for "dataframe normalize + dedup at scale." The repo's `replaysim` already *is* the op-graph — EMR Serverless just runs it distributed.
|
| 120 |
+
|
| 121 |
+
Why EMR Serverless over Glue ETL here: the normalize is the heavier of the two Spark stages, data-juicer wants control over the Python runtime (custom `--py-files`/wheel), and EMR Serverless gives finer Spark tuning + Graviton price/perf than Glue's DPU model. Glue ETL stays the choice for the *light* ingest (a). Both are valid Spark; we split by weight.
|
| 122 |
+
|
| 123 |
+
**Two output corpora** (the dataset contract below): the SFT corpus (clean winning trajectories — report §5 "SFT-first competence floor") and the DPO pairs (from `extract_dpo_pairs` + future execution-oracle near-miss rejects).
|
| 124 |
+
|
| 125 |
+
---
|
| 126 |
+
|
| 127 |
+
## The S3 dataset contract
|
| 128 |
+
|
| 129 |
+
One bucket per environment (reuse the existing `amazon-sagemaker-386931836011-us-west-2-*` or a dedicated `composer-datagen-386931836011-us-west-2`). Layout is **Hive-partitioned by `run_id`** so each outer-loop generation is an immutable, addressable slice (the report's "generation" in the GA framing, §3/§5) and Athena/Glue Catalog can query across generations.
|
| 130 |
+
|
| 131 |
+
```
|
| 132 |
+
s3://composer-datagen-386931836011-us-west-2/
|
| 133 |
+
raw/claude_code/**/*.jsonl # (a) input: uploaded sessions
|
| 134 |
+
traces/v1/run_id=<id>/part-*.parquet # (a) out: TraceState rows
|
| 135 |
+
tasks/v1/run_id=<id>/manifest.jsonl # (c1) FeatureDeletionTask rows (1/line; array index = line)
|
| 136 |
+
replay/v1/run_id=<id>/
|
| 137 |
+
input/states.jsonl # (b) shared Bedrock batch input (recordId=state_id)
|
| 138 |
+
teacher=<slug>/*.jsonl.out # (b) per-teacher Bedrock batch output
|
| 139 |
+
task_grades/v1/run_id=<id>/<task_id>.json # (c2) validator + _grade() results
|
| 140 |
+
corpus/v1/run_id=<id>/
|
| 141 |
+
sft/part-*.parquet # (d) SFT corpus (clean winners)
|
| 142 |
+
dpo/part-*.parquet # (d) DPO pairs (normalized DPOPair)
|
| 143 |
+
manifests/run_id=<id>.json # run-level manifest: counts, cost, lineage, schema_version
|
| 144 |
+
diloco/rendezvous/round_<NNNNNN>/rank_<RRRR>.pt # (separate) inner-loop ObjectStoreAllReduce (already exists)
|
| 145 |
+
```
|
| 146 |
+
|
| 147 |
+
**Format decisions:**
|
| 148 |
+
- **Parquet** for `traces/`, `corpus/sft/`, `corpus/dpo/` — columnar, compressed, Athena-queryable, the SFT/DPO training read path (TRL `load_dataset` reads Parquet directly). This is the durable corpus.
|
| 149 |
+
- **JSONL** for `replay/input/`, `tasks/manifest`, `task_grades/` — because that's what Bedrock batch *requires* (`s3InputFormat=JSONL`), what AWS Batch array-index line-lookup wants, and what `DPOPair`/`TraceState` natively serialize to. JSONL is the wire format between stages; Parquet is the corpus-at-rest format.
|
| 150 |
+
- **`recordId == state_id`** is the universal join key linking trace → replay → DPO pair. Already true: `TraceState.state_id` is `f"{path.stem}::{idx:04d}"` and `DPOPair.state_id` carries it through.
|
| 151 |
+
- **`manifests/run_id.json`** is the run-level contract: `{run_id, schema_version, n_traces, n_tasks, n_dpo_pairs, teacher_pool, bedrock_job_arns, total_cost_usd, parent_run_id}`. `parent_run_id` threads the flywheel lineage (report §5: improved student regenerates next round's traces). The held-out-eval guard (`safety/HeldoutSplit`) reads `schema_version` + a `split` column to enforce the disjoint held-out set the report flags as the load-bearing gap (§7 Pushback 4).
|
| 152 |
+
|
| 153 |
+
---
|
| 154 |
+
|
| 155 |
+
## Orchestration: Step Functions (the non-cluster Argo analog)
|
| 156 |
+
|
| 157 |
+
Report §8 uses Argo Workflows on EKS as the outer-loop controller. For the **AWS-native, no-persistent-cluster** path this facet targets, the equivalent is **AWS Step Functions** (Standard workflow) driving the DAG:
|
| 158 |
+
|
| 159 |
+
```
|
| 160 |
+
Ingest(Glue) → InvertSchema(Lambda) → [Bedrock batch ×N (Map)] → FanIn(EMR-Serverless)
|
| 161 |
+
→ ExtractDPO+SynthTasks → SandboxValidate(Batch array, .sync) → Normalize(EMR-Serverless)
|
| 162 |
+
→ WriteManifest(Lambda)
|
| 163 |
+
```
|
| 164 |
+
|
| 165 |
+
Step Functions has native `.sync` integrations for **Glue**, **EMR Serverless**, **Batch** (`arn:aws:states:::batch:submitJob.sync` — blocks until the whole array completes), and **SageMaker**, plus a `Map` state for the N-teacher Bedrock fan-out. This makes the whole outer loop one declarative state machine, retryable, with per-stage IAM. If/when the program moves to the EKS-primary path of §8, the same stage boundaries lift to Argo DAG nodes — the S3 contract is identical, so it's a controller swap, not a rewrite.
|
| 166 |
+
|
| 167 |
+
---
|
| 168 |
+
|
| 169 |
+
## Repo delta (what to build in `composer_replication/`)
|
| 170 |
+
|
| 171 |
+
1. **`composer_replication/teacher_replay_bedrock.py`** (~180 LOC) — `BedrockBatchTeacherPool`: `submit_replay_batch()` writes the shared states JSONL, submits one `create_model_invocation_job` per teacher, polls, and parses `.jsonl.out` back into `list[TeacherCallResult]`. Add `provider`/`model_id` to `TeacherSpec`. `extract_dpo_pairs` untouched.
|
| 172 |
+
2. **`composer_replication/datagen/aws/batch_validate.py`** (~120 LOC) — the Batch array-child entrypoint: read `AWS_BATCH_JOB_ARRAY_INDEX` → task manifest line → boot `DockerSandbox`/`LocalSubprocessSandbox` → run `validator` + `_grade()` → write `task_grades/.../{task_id}.json`. Plus a `submit_validate_array()` helper.
|
| 173 |
+
3. **`composer_replication/datagen/aws/glue_ingest_job.py`** (~80 LOC) — Glue Spark entrypoint wrapping `ClaudeCodeIngester.ingest` in `mapPartitions`; writes `traces/` Parquet + Glue Catalog table.
|
| 174 |
+
4. **`composer_replication/replaysim/emr_normalize_job.py`** (~100 LOC) — EMR Serverless Spark entrypoint wrapping `DJNormalizer` per partition + Spark cross-partition dedup; writes `corpus/dpo/` + `corpus/sft/` Parquet.
|
| 175 |
+
5. **`composer_replication/datagen/aws/s3_contract.py`** (~120 LOC) — the S3 layout constants, `RunManifest` dataclass, Parquet/JSONL serializers for `TraceState`/`FeatureDeletionTask`/`DPOPair`, the `recordId==state_id` join helpers, and `schema_version`/`split` column injection for the held-out guard.
|
| 176 |
+
6. **`infra/datagen_stepfunctions.json`** (+ a thin CDK/`infra/datagen_stack.py`) — the Step Functions state machine + IAM roles (Bedrock batch service role, Batch Spot compute env, EMR Serverless app, Glue role). ~250 LOC IaC.
|
| 177 |
+
7. **`pyproject.toml`** — extend the `[aws]`/`[datagen]` extras with `boto3` for Bedrock/Batch/Glue/EMR clients (the `[serverless]` extra already needs `s3fs`/`boto3` per report §9).
|
| 178 |
+
8. **Dockerfile** — the `DockerSandbox` ECR image (also the Batch job-definition image), baking gVisor `runsc` on the AMI/launch-template for default untrusted-code isolation.
|
| 179 |
+
|
| 180 |
+
The trainer, loss, `FeatureDeletionEnv`, curriculum, monitor, `extract_dpo_pairs`, `DJNormalizer` op-graph, and the DiLoCo/`ObjectStoreAllReduce` inner loop are all **untouched** — every AWS-native piece wraps an existing repo entrypoint and reads/writes the S3 contract.
|
| 181 |
+
|
| 182 |
+
---
|
| 183 |
+
|
| 184 |
+
## Open questions
|
| 185 |
+
|
| 186 |
+
- **Bedrock batch 24h SLA vs flywheel cadence.** Report §5 says outer-loop cadence is hours-to-days, so 24h batch turnaround is acceptable for bulk generations — but a fast bootstrap iteration may want the OpenRouter low-latency path. Need a `batch_vs_realtime` policy knob keyed on `max_total_usd` + urgency.
|
| 187 |
+
- **Per-branch sandbox cold-start is the §8/§10 explicit falsifier.** If Batch array job startup + image pull dominates wall-clock at target fan-out even with gVisor, the report's demotion path applies: keep Batch for control/grading but move bulk sandbox execution to a container-free pool (SWE-MiniSandbox class) or a warm Spot fleet. Must instrument cold-start in `batch_validate.py`.
|
| 188 |
+
- **Cross-region inference profiles for Bedrock batch** (`us.anthropic...`) route to multiple regions for throughput — confirm the batch service role + S3 bucket policy allow the destination regions, else throttling.
|
| 189 |
+
- **DeepSeek on Bedrock batch in us-west-2 — RESOLVED LIVE:** `deepseek.v3.2` is batch-eligible single-region in `us-west-2`; `deepseek.r1-v1:0` is in the on-demand catalog. Claude batch IDs are the `us.` cross-region inference profiles (`us.anthropic.claude-sonnet-4-6`, `us.anthropic.claude-opus-4-6`). Still confirm **minimum-records floors per model** + the **records/job & file-size quotas** in Service Quotas at build time (adjustable, but the `BedrockBatchTeacherPool` must shard input JSONL to fit).
|
| 190 |
+
- **Opus 4.7/4.8 are on-demand-only (no batch ID).** If the heterogeneity ablation (report §3) wants the newest Claude as a teacher, it costs on-demand pricing — budget for it or substitute the batchable Opus 4.6. GPT-5 (in `DEFAULT_TEACHERS`) is not on Bedrock at all → that one family member stays on the OpenRouter escape hatch or is replaced by Bedrock Llama 4 / DeepSeek as the third family.
|
| 191 |
+
- **`golden_diff` must be ACL-isolated.** It is `repr=False` in `FeatureDeletionTask` and held out of the policy observation; on S3 it must live in a *separate* deny-by-default prefix (`tasks/golden/`), never co-located with the policy-visible `tasks/v1/...` — oracle-cleanliness Gate 1 (report §4).
|
|
@@ -0,0 +1,299 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# F3 — The RL System on AWS, Runnable NOW (SageMaker us-west-2)
|
| 2 |
+
|
| 3 |
+
**Status:** design, runnable-today. **Account:** 386931836011, **region:** us-west-2, **role:** Admin (Isengard).
|
| 4 |
+
**Live facts verified 2026-06-09** (see "Live AWS findings" below). Grounds in the deep-research report
|
| 5 |
+
(`research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` §9 SageMaker path / hybrid, §10 phased plan)
|
| 6 |
+
and the actual repo code (`SageMakerExecutor`, `ComposerReplicationTrainer`, `ObjectStoreAllReduce`,
|
| 7 |
+
`replica_entrypoint`, `examples/gsm8k_grpo`).
|
| 8 |
+
|
| 9 |
+
---
|
| 10 |
+
|
| 11 |
+
## 0. TL;DR (committed)
|
| 12 |
+
|
| 13 |
+
1. **For the first runnable GPU smoke RIGHT NOW: a single SageMaker Training Job, `ml.g5.2xlarge`, BYO-container
|
| 14 |
+
extended from the AWS PyTorch DLC.** This is NOT the `SageMakerExecutor` N-replica path — it is plain GRPO on one
|
| 15 |
+
GPU. The executor's multi-replica DiLoCo rendezvous is the *next* step, not the smoke. Reason: the smoke's job is
|
| 16 |
+
to prove the trainer + reward + vLLM rollout works on a real GPU at minimum cost and zero quota friction.
|
| 17 |
+
2. **GRPO rollout = vLLM colocated in the training container** (`use_vllm=True, vllm_mode="colocate"`). TRL 1.5's
|
| 18 |
+
default is colocate; it runs vLLM in the *same process* sharing the training GPU at `vllm_gpu_memory_utilization=0.3`.
|
| 19 |
+
No separate inference endpoint for the smoke. The `server` mode (`trl vllm-serve`) and VeRL's `AsyncServer` are the
|
| 20 |
+
scale answer for tool-heavy agentic rollouts later (report §8) — not for a 0.5B GSM8K smoke.
|
| 21 |
+
3. **Platform decision:** Training Jobs for the bursty smoke and periodic small-model runs (this facet); **HyperPod
|
| 22 |
+
(attached to EKS)** for the long, resilience-bound inner GRPO loop (report §9). Both share the identical S3
|
| 23 |
+
`ObjectStoreAllReduce` rendezvous, so a run moves between them with zero trainer/loss/DiLoCo change.
|
| 24 |
+
4. **The `SageMakerExecutor` (already built, mock-tested) drives N independent single-instance Training Jobs**, each
|
| 25 |
+
tagged `REPLICA_RANK=i`/`WORLD_SIZE=N` via the `Environment` map, all pointed at one `s3://.../rendezvous/` prefix.
|
| 26 |
+
It is the bursty-fallback DiLoCo backend. To make it run live we need a built+pushed container, real
|
| 27 |
+
`role_arn`/`image_uri`/`output_s3_path`, and a non-zero quota for N concurrent training jobs.
|
| 28 |
+
|
| 29 |
+
---
|
| 30 |
+
|
| 31 |
+
## 1. Live AWS findings (verified 2026-06-09, this account/region)
|
| 32 |
+
|
| 33 |
+
| Fact | Value | Consequence |
|
| 34 |
+
|---|---|---|
|
| 35 |
+
| Caller identity | `arn:aws:sts::386931836011:assumed-role/Admin/baladita-Isengard` | Admin — can create roles, push ECR, run training jobs. |
|
| 36 |
+
| SageMaker default bucket (us-west-2) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` | Use as the rendezvous + output bucket — already covered by `AmazonSageMakerFullAccess`. |
|
| 37 |
+
| Existing exec roles | `AmazonSageMaker-ExecutionRole-20250725T133247` (and ...20241223T...) | `role_arn = arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247`. |
|
| 38 |
+
| Exec role policies | `AmazonSageMakerFullAccess` + custom `AmazonSageMaker-ExecutionPolicy-...` | FullAccess grants S3 to buckets named `*SageMaker*`/`*sagemaker*` — so the rendezvous bucket MUST be the SageMaker bucket above (or a `sagemaker-*` bucket) or you must attach an explicit S3 policy. **This is the IAM gotcha the executor docstring flags.** |
|
| 39 |
+
| `ml.g5.2xlarge for training job usage` | **1.0** (non-zero!) | Single-replica g5 smoke runs IMMEDIATELY, no quota request. |
|
| 40 |
+
| `ml.g5.2xlarge for spot training job usage` | **1.0** | Spot smoke also available (70% cheaper). |
|
| 41 |
+
| `ml.g5.12xlarge for training job usage` | **1.0** | One 4×A10G box available for a 7B run later. |
|
| 42 |
+
| `ml.g6.2xlarge for training job usage` | **0.0** | g6 (L4) needs a Service Quotas increase first — prefer g5 for the smoke. |
|
| 43 |
+
| g5.2xlarge EC2 offering | us-west-2a/b/c | Capacity exists across AZs. |
|
| 44 |
+
| Already present | `sagemaker-dynamo-on-eks-hyperpod-*` bucket | Confirms HyperPod-on-EKS has been used here — the report's §9 hybrid is live-reachable. |
|
| 45 |
+
| boto3 / sagemaker SDK locally | NOT installed | `pip install -e .[aws]` + `pip install sagemaker` on the launch host (laptop/Studio), not in the repo's hard deps. |
|
| 46 |
+
|
| 47 |
+
**The single most important runnable-now fact:** g5.2xlarge training-job quota is already 1 — the smoke needs no
|
| 48 |
+
quota ticket. (Default for these GPU types is 0; this account has been bumped to 1.)
|
| 49 |
+
|
| 50 |
+
---
|
| 51 |
+
|
| 52 |
+
## 2. Training Jobs vs HyperPod vs EKS — when each (report §9, §10)
|
| 53 |
+
|
| 54 |
+
- **SageMaker Training Jobs (THIS facet, bursty inner loop / smoke).** Ephemeral, pay-per-second, `boto3.create_training_job`,
|
| 55 |
+
zero persistent cluster. Right for: the first GPU smoke, periodic/smaller-model runs, the `SageMakerExecutor`
|
| 56 |
+
DiLoCo fallback. re:Post guidance: Training Jobs fit *periodic / smaller-model / pay-per-use*. The 28-day max runtime
|
| 57 |
+
and per-job cold-start (instance provisioning ~3-6 min) are acceptable for bursty work. Warm pools
|
| 58 |
+
(`KeepAlivePeriodInSeconds`) cut cold-start on repeated launches — but note this account's `g5 training warm pool
|
| 59 |
+
usage` quota is 0, so warm pools need a quota bump.
|
| 60 |
+
- **SageMaker HyperPod attached to EKS (long resilience-bound inner loop).** Report §9: HyperPod maps 1-to-1 to an EKS
|
| 61 |
+
control plane (one EKS cluster = one HyperPod node-group in a VPC), with auto-detect-and-replace of faulty
|
| 62 |
+
accelerators and PyTorch job auto-resume. Right for: continuous/large-model/persistent multi-day RL where a node
|
| 63 |
+
failure on a Training Job would lose the run. The `sagemaker-dynamo-on-eks-hyperpod-*` bucket shows this is already
|
| 64 |
+
exercised here. **"Use HyperPod for the inner loop" does NOT mean leaving EKS** — it is a node-group swap on the same
|
| 65 |
+
control plane. Build target: the future `EKSExecutor` targets both Karpenter GPU nodes and HyperPod nodes transparently.
|
| 66 |
+
- **Plain EKS (primary for everything else — report §8).** Outer MCTS/sandbox/dataset loop, vLLM RayService rollout
|
| 67 |
+
groups, gVisor/Kata sandbox pods, Argo controller. The inner GRPO trainer is the one piece that swaps between a
|
| 68 |
+
Karpenter p5/g6e NodePool and a HyperPod node-group.
|
| 69 |
+
|
| 70 |
+
**Decision for F3:** SageMaker Training Jobs now (smoke + `SageMakerExecutor` DiLoCo fallback); HyperPod-on-EKS later
|
| 71 |
+
for the long inner run. Same S3 rendezvous throughout.
|
| 72 |
+
|
| 73 |
+
---
|
| 74 |
+
|
| 75 |
+
## 3. The first runnable smoke: Qwen2.5-0.5B GRPO on GSM8K, single g5.2xlarge Training Job
|
| 76 |
+
|
| 77 |
+
### 3.1 Shape
|
| 78 |
+
One Training Job, `InstanceCount=1`, `ml.g5.2xlarge` (1× A10G, 24 GB). GRPO with vLLM **colocated** in the
|
| 79 |
+
training container. This is the `examples/gsm8k_grpo/run.py` recipe lifted from CPU to one real GPU, with vLLM turned on.
|
| 80 |
+
It does **not** exercise the DiLoCo rendezvous (that's §4). It proves: container builds, trainer runs on GPU, vLLM
|
| 81 |
+
rollout works, reward fires, checkpoint lands in S3.
|
| 82 |
+
|
| 83 |
+
### 3.2 Container — BYO extended from the AWS PyTorch DLC (do NOT use the stock HF DLC)
|
| 84 |
+
- **Base:** `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker`
|
| 85 |
+
(verified present in the us-west-2 DLC registry; 763104351884 is the AWS DLC account). The DLC already has the SageMaker
|
| 86 |
+
training toolkit, CUDA, and a working torch — so vLLM's CUDA wheels match.
|
| 87 |
+
- **Why not the stock HF DLC (`huggingface-pytorch-training:4.49.0`)?** It pins transformers 4.49 and does NOT bundle
|
| 88 |
+
`trl` or `vllm`; you'd be pip-installing the whole RL stack anyway. Extending the PyTorch DLC gives a clean,
|
| 89 |
+
version-controlled layer.
|
| 90 |
+
- **Why a prebuilt ECR image and not `source_dir`+`requirements.txt`?** Installing `vllm` + `trl` + `flash-attn` at job
|
| 91 |
+
start over `requirements.txt` adds 5-10 min of cold-start per job and is a flaky failure surface (wheel/CUDA mismatch).
|
| 92 |
+
Bake them into the image once, push to the account's private ECR. `source_dir` is fine for *just the training script*
|
| 93 |
+
layered on top, but the heavy deps must be baked.
|
| 94 |
+
|
| 95 |
+
`docker/Dockerfile.sagemaker`:
|
| 96 |
+
```dockerfile
|
| 97 |
+
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker
|
| 98 |
+
# RL stack (baked, not pip-at-startup)
|
| 99 |
+
RUN pip install --no-cache-dir \
|
| 100 |
+
"trl>=0.12" "peft>=0.13" "accelerate>=1.0" "datasets>=3.0" \
|
| 101 |
+
"vllm" "fsspec>=2024.6" "s3fs>=2024.6"
|
| 102 |
+
# The framework itself
|
| 103 |
+
COPY . /opt/composer_replication
|
| 104 |
+
RUN pip install --no-cache-dir -e "/opt/composer_replication[train,serverless]"
|
| 105 |
+
# SageMaker invokes the image; for the smoke we use a plain GRPO entry script,
|
| 106 |
+
# for the DiLoCo path the executor passes ContainerEntrypoint explicitly.
|
| 107 |
+
ENV HF_HOME=/opt/ml/input/hf_cache
|
| 108 |
+
```
|
| 109 |
+
Build + push (Admin, one-time):
|
| 110 |
+
```bash
|
| 111 |
+
aws ecr create-repository --repository-name composer-rl --region us-west-2
|
| 112 |
+
aws ecr get-login-password --region us-west-2 | docker login --username AWS \
|
| 113 |
+
--password-stdin 386931836011.dkr.ecr.us-west-2.amazonaws.com
|
| 114 |
+
docker build -f docker/Dockerfile.sagemaker -t 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke .
|
| 115 |
+
docker push 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
### 3.3 The smoke training script — `examples/gsm8k_grpo/run_sagemaker.py`
|
| 119 |
+
A thin GPU variant of `examples/gsm8k_grpo/run.py`. Same `gsm8k_reward` (RLVR `#### NUMBER` regex), same
|
| 120 |
+
`ComposerReplicationTrainer(alpha_sdpo=0, beta_replay=0)` (plain GRPO — channels 2/3 off). Differences from the CPU example:
|
| 121 |
+
```python
|
| 122 |
+
from trl import GRPOConfig
|
| 123 |
+
config = GRPOConfig(
|
| 124 |
+
output_dir="/opt/ml/model", # SageMaker uploads this to OutputDataConfig.S3OutputPath
|
| 125 |
+
per_device_train_batch_size=8,
|
| 126 |
+
num_generations=8,
|
| 127 |
+
max_prompt_length=512,
|
| 128 |
+
max_completion_length=256,
|
| 129 |
+
learning_rate=1e-5,
|
| 130 |
+
max_steps=20, # smoke — minutes, not hours
|
| 131 |
+
logging_steps=1,
|
| 132 |
+
save_strategy="no",
|
| 133 |
+
bf16=True, # A10G supports bf16
|
| 134 |
+
# --- the rollout path: vLLM colocated in-process on the same GPU ---
|
| 135 |
+
use_vllm=True,
|
| 136 |
+
vllm_mode="colocate", # TRL 1.5 default; same process, no server
|
| 137 |
+
vllm_gpu_memory_utilization=0.3, # leave 70% for the 0.5B policy + grads + KV
|
| 138 |
+
vllm_tensor_parallel_size=1,
|
| 139 |
+
beta=0.04, # small KL-to-ref; or 0.0 for pure smoke
|
| 140 |
+
report_to=[],
|
| 141 |
+
)
|
| 142 |
+
```
|
| 143 |
+
Read hyperparameters from `/opt/ml/input/config/hyperparameters.json` (SageMaker writes the estimator's
|
| 144 |
+
`hyperparameters=` there) so the same script is config-driven.
|
| 145 |
+
|
| 146 |
+
### 3.4 The launch — SageMaker Python SDK `Estimator` (run from the laptop / Studio)
|
| 147 |
+
```python
|
| 148 |
+
import sagemaker
|
| 149 |
+
from sagemaker.estimator import Estimator
|
| 150 |
+
|
| 151 |
+
sess = sagemaker.Session() # picks up region us-west-2
|
| 152 |
+
ROLE = "arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247"
|
| 153 |
+
IMAGE = "386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke"
|
| 154 |
+
BUCKET = "amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d"
|
| 155 |
+
|
| 156 |
+
est = Estimator(
|
| 157 |
+
image_uri=IMAGE,
|
| 158 |
+
role=ROLE,
|
| 159 |
+
instance_type="ml.g5.2xlarge", # quota = 1 (verified live) — no ticket needed
|
| 160 |
+
instance_count=1,
|
| 161 |
+
volume_size=100,
|
| 162 |
+
max_run=3600, # 1h cap for the smoke
|
| 163 |
+
output_path=f"s3://{BUCKET}/composer-rl/smoke/output",
|
| 164 |
+
base_job_name="composer-grpo-smoke",
|
| 165 |
+
environment={"HF_HUB_ENABLE_HF_TRANSFER": "1"},
|
| 166 |
+
hyperparameters={"model": "Qwen/Qwen2.5-0.5B-Instruct", "max_steps": 20},
|
| 167 |
+
entry_point="run_sagemaker.py",
|
| 168 |
+
source_dir="examples/gsm8k_grpo", # script layered on the baked image
|
| 169 |
+
keep_alive_period_in_seconds=0, # warm-pool quota is 0 in this acct; leave off
|
| 170 |
+
# use_spot_instances=True, max_wait=7200 # optional: spot quota is also 1
|
| 171 |
+
)
|
| 172 |
+
est.fit(wait=True, logs=True) # GSM8K loads from HF inside the container
|
| 173 |
+
```
|
| 174 |
+
**Cost:** `ml.g5.2xlarge` is ~$1.52/hr on-demand in us-west-2; a 20-step 0.5B smoke is ~15-25 min ⇒ **well under $1**.
|
| 175 |
+
On spot (quota=1) ~$0.45-0.60/hr ⇒ pennies. The CPU example proves the loop in ~60s; this proves it on a real GPU with
|
| 176 |
+
the real vLLM rollout path, which the CPU example explicitly does not exercise.
|
| 177 |
+
|
| 178 |
+
### 3.5 Gotchas baked into the recipe
|
| 179 |
+
- **vLLM needs HF model download at job start.** Either set `HF_HUB_ENABLE_HF_TRANSFER=1` (done) or stage the model
|
| 180 |
+
into S3 and pass it as an input channel; for a 0.5B model the live download is fine. `EnableNetworkIsolation` MUST stay
|
| 181 |
+
False (the executor pins this) so the container can reach `huggingface.co` and S3.
|
| 182 |
+
- **`vllm_gpu_memory_utilization=0.3` is the load-bearing knob on a 24 GB A10G.** Too high ⇒ OOM when the policy +
|
| 183 |
+
grads + optimizer also need the GPU; too low ⇒ tiny KV cache, slow rollout. 0.3 is the TRL/Ray reference default for
|
| 184 |
+
a small model on one GPU.
|
| 185 |
+
- **GSM8K = `openai/gsm8k` config `main`.** Already what the example loads. No license blocker (MIT).
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
## 4. The DiLoCo N-replica path: how `SageMakerExecutor` drives the rendezvous
|
| 190 |
+
|
| 191 |
+
This is the executor that already exists and is mock-tested (`tests/test_sagemaker_executor.py` — 20+ tests covering
|
| 192 |
+
rank-ordered handles, env injection, status mapping, cancel idempotency, partial-launch rollback). It is the bursty
|
| 193 |
+
DiLoCo backend, distinct from the §3 smoke.
|
| 194 |
+
|
| 195 |
+
### 4.1 What it does (verified from source)
|
| 196 |
+
- `launch_replicas(N, ...)` submits **N independent single-instance Training Jobs** (NOT one multi-instance job — that
|
| 197 |
+
would couple replicas through SageMaker's intra-job NCCL fabric and break the "each replica syncs only through S3"
|
| 198 |
+
design). Each job gets `Environment={"REPLICA_RANK": str(i), "WORLD_SIZE": str(N), "RENDEZVOUS_URI": s3uri}` and
|
| 199 |
+
`ContainerEntrypoint=["python","-m","composer_replication.diloco.serverless.replica_entrypoint"]` with
|
| 200 |
+
`ContainerArguments=["--rendezvous", s3uri, "--world-size", N, "--trainer-module", ..., "--trainer-fn", ...]`.
|
| 201 |
+
- `replica_entrypoint.main` reads `REPLICA_RANK`, builds `ObjectStoreAllReduce(uri=s3://..., rank, world_size)`, wraps
|
| 202 |
+
it in `MockManager`, and calls the user's `train(manager=, rank=, world_size=, **trainer_kwargs)`. The trainer wires
|
| 203 |
+
`manager` into `make_diloco_outer_loop`; pseudo-gradients sync via `round_{NNNNNN}/rank_{RRRR}.pt` PUT-then-poll-then-mean
|
| 204 |
+
on S3. **DiLoCo math, loss, trainer untouched.**
|
| 205 |
+
- `poll`/`collect` map `describe_training_job.TrainingJobStatus`; `stream_logs` reads
|
| 206 |
+
`/aws/sagemaker/TrainingJobs/<job>/algo-*`; `cancel` calls `stop_training_job` idempotently.
|
| 207 |
+
|
| 208 |
+
### 4.2 The asymmetry that makes this clean (report §8)
|
| 209 |
+
Gang scheduling is needed for *intra-replica* FSDP NCCL but NOT for *inter-replica* DiLoCo sync — replicas rendezvous
|
| 210 |
+
through S3, so a straggler simply blocks at the poll loop (bounded by `timeout_s=1800`) instead of deadlocking. On
|
| 211 |
+
SageMaker, N separate jobs have no mutual network path (`supports_inter_replica_network=False`), which is exactly right.
|
| 212 |
+
|
| 213 |
+
### 4.3 What to wire to run it live (the deltas)
|
| 214 |
+
1. **Same baked image** from §3.2 (it already `pip install -e .[serverless]`, so `replica_entrypoint`, `s3fs`, `fsspec`
|
| 215 |
+
are present). The executor passes `ContainerEntrypoint` explicitly, so a generic image works.
|
| 216 |
+
2. **Rendezvous bucket = the SageMaker default bucket** (`amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d`) so the
|
| 217 |
+
exec role's `AmazonSageMakerFullAccess` already grants the live S3 PUT/GET the allreduce poll loop needs. Use
|
| 218 |
+
`rendezvous_uri = "s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/composer-rl/runs/<run_id>/rendezvous/"`.
|
| 219 |
+
3. **Quota:** N concurrent jobs need `ml.g5.2xlarge for training job usage >= N`. Currently 1 ⇒ N=1 works today; for
|
| 220 |
+
N=2-4 DiLoCo, request a Service Quotas increase (Service Quotas console → SageMaker → "ml.g5.2xlarge for training job
|
| 221 |
+
usage"). The smoke proves the executor end-to-end at N=1 (one job, one rank — degenerate allreduce returns its own
|
| 222 |
+
tensor), then N=2 once quota lands.
|
| 223 |
+
4. **Driver script** `examples/diloco_sagemaker/run.py` (~80 LOC): construct `SageMakerExecutor(role_arn=..., image_uri=...,
|
| 224 |
+
output_s3_path=..., region="us-west-2")`, call `launch_replicas(N, entrypoint="...replica_entrypoint",
|
| 225 |
+
entrypoint_args={"rendezvous_uri": s3uri, "trainer_module": "examples.gsm8k_grpo.diloco_train", "trainer_fn": "train",
|
| 226 |
+
"trainer_kwargs": {...}}, gpu="A10G", timeout=3600)`, then `collect(handles)`. `gpu="A10G"` maps to `ml.g5.2xlarge`
|
| 227 |
+
via the executor's `_GPU_INSTANCE_MAP`.
|
| 228 |
+
|
| 229 |
+
---
|
| 230 |
+
|
| 231 |
+
## 5. The GRPO rollout problem — colocated vLLM now, server/AsyncServer later
|
| 232 |
+
|
| 233 |
+
TRL's `GRPOTrainer` needs a generation path each step. Three options, committed mapping:
|
| 234 |
+
|
| 235 |
+
| Option | When | On SageMaker |
|
| 236 |
+
|---|---|---|
|
| 237 |
+
| `model.generate()` (no vLLM) | never for real runs — too slow | the CPU example uses this implicitly; fine only for the 0.5B CPU toy. |
|
| 238 |
+
| **vLLM colocate** (`use_vllm=True, vllm_mode="colocate"`) | **the smoke + most single-GPU runs** | vLLM in the same process, shares the training GPU at `vllm_gpu_memory_utilization=0.3`. One container, one job, no endpoint. TRL 1.5 default. **This is the F3 answer.** |
|
| 239 |
+
| vLLM server (`trl vllm-serve`) | multi-GPU where you dedicate GPUs to generation | a *second* SageMaker job or a SageMaker endpoint runs `trl vllm-serve`; the trainer job points `vllm_server_host/port` at it. Introduces inter-process comm + a network hop — only worth it when generation dominates and you have spare GPUs. |
|
| 240 |
+
| VeRL `AsyncServer` | tool-heavy agentic tree-of-work rollouts (report §8) | the scale answer for the SWE-agent tree: async GPU-decoupled agent loop TRL lacks. A later facet; the engine should be a configurable backend, not hardcoded. |
|
| 241 |
+
|
| 242 |
+
For the F3 smoke and the DiLoCo fallback, **colocate is correct and simplest**: it keeps everything in one
|
| 243 |
+
self-contained training container, which is exactly what a single-instance SageMaker job wants. No separate inference
|
| 244 |
+
endpoint to provision, secure, or pay for.
|
| 245 |
+
|
| 246 |
+
**One subtlety the report flags (§8):** the SDPO channel (Channel 2) needs full-vocabulary *logits* (TRL-hosted, which
|
| 247 |
+
the trainer's `_compute_sdpo_loss` does via `model(...).logits`), while Channel 3 needs only log-probs. Colocated vLLM
|
| 248 |
+
handles *rollout generation*; the SDPO/replay logits/log-probs come from the policy forward pass in `_compute_loss`,
|
| 249 |
+
not from vLLM. So turning on `alpha_sdpo>0` later does not change the rollout backend choice.
|
| 250 |
+
|
| 251 |
+
---
|
| 252 |
+
|
| 253 |
+
## 6. Concrete repo deltas (to make this runnable, not hand-wavy)
|
| 254 |
+
|
| 255 |
+
| Path (~LOC) | What | Why |
|
| 256 |
+
|---|---|---|
|
| 257 |
+
| `docker/Dockerfile.sagemaker` (~15) | Extend PyTorch DLC 2.6.0-gpu-py312; bake trl+vllm+peft+accelerate+datasets+s3fs+fsspec + `pip install -e .[train,serverless]`. | The report (§9) names "a Dockerfile wrapping composer_replication" as a missing build artifact. This is it. |
|
| 258 |
+
| `examples/gsm8k_grpo/run_sagemaker.py` (~120) | GPU+vLLM variant of `run.py`; reads `/opt/ml/input/config/hyperparameters.json`; writes to `/opt/ml/model`; `use_vllm=True, vllm_mode="colocate"`. | The runnable smoke entry. |
|
| 259 |
+
| `examples/diloco_sagemaker/run.py` (~80) | Driver that builds `SageMakerExecutor` with the live role/image/bucket and calls `launch_replicas`/`collect` for N replicas over the S3 rendezvous. | Turns the mock-tested executor into a live driver — no executor code change needed. |
|
| 260 |
+
| `examples/gsm8k_grpo/diloco_train.py` (~60) | A `train(manager, rank, world_size, **kw)` that wraps the GRPO trainer in `make_diloco_outer_loop(manager=...)`. | The `trainer_module:trainer_fn` the executor imports inside each replica. |
|
| 261 |
+
| `scripts/build_and_push_ecr.sh` (~20) | ECR create-repo + login + build + push (the §3.2 commands). | One-command image publish. |
|
| 262 |
+
| `docs/AWS_SAGEMAKER_QUICKSTART.md` (~120) | The §1 live facts + §3 estimator recipe + the quota/IAM gotchas + the spot variant. | So the next person runs it in one read. |
|
| 263 |
+
| `pyproject.toml` `aws` extra (+1 line) | add `sagemaker>=2.200` alongside `boto3` (the SDK `Estimator` lives there; executor uses raw boto3 but the launch driver wants the SDK). | The launch host needs the SDK; currently only `boto3` is in the extra. |
|
| 264 |
+
|
| 265 |
+
**Nothing in `SageMakerExecutor`, `ComposerReplicationTrainer`, `ObjectStoreAllReduce`, `replica_entrypoint`, or the
|
| 266 |
+
loss changes.** The executor's design (N single-instance jobs, env-injected rank, S3-only rendezvous,
|
| 267 |
+
`EnableNetworkIsolation=False`) is already correct for this environment — verified against the live IAM/quota facts.
|
| 268 |
+
|
| 269 |
+
---
|
| 270 |
+
|
| 271 |
+
## 7. Open questions / next gates
|
| 272 |
+
|
| 273 |
+
- **N>1 DiLoCo quota:** `ml.g5.2xlarge for training job usage` = 1 today. N=2-4 needs a Service Quotas increase
|
| 274 |
+
(typically minutes-to-hours for g5; not guaranteed instant). Request before the N>1 run.
|
| 275 |
+
- **Warm pools:** `g5 training warm pool usage` quota = 0 ⇒ each job pays ~3-6 min cold-start. For the bursty DiLoCo
|
| 276 |
+
fallback at small H (frequent re-launch) this matters; request warm-pool quota or accept the cold-start, or move the
|
| 277 |
+
long inner loop to HyperPod (which is persistent — no per-round cold-start).
|
| 278 |
+
- **vLLM version pin:** the smoke leaves `vllm` unpinned in the Dockerfile; pin to a version whose CUDA matches the DLC
|
| 279 |
+
(cu124 / torch 2.6) before promoting past smoke, to avoid a silent wheel mismatch.
|
| 280 |
+
- **HyperPod-on-EKS path:** the `sagemaker-dynamo-on-eks-hyperpod-*` bucket shows it's been used here; the future
|
| 281 |
+
`EKSExecutor` + HyperPod node-group attach is the report's §9 recommendation for the long inner run. Out of scope for
|
| 282 |
+
F3 (Training-Jobs facet) but the rendezvous makes the swap free.
|
| 283 |
+
- **Spot interruption + DiLoCo:** spot g5 quota = 1; with `use_spot_instances=True` a replica can be reclaimed mid-round.
|
| 284 |
+
The bounded `timeout_s=1800` poll means a reclaimed replica stalls its peers up to 30 min then `TimeoutError`s. For
|
| 285 |
+
spot DiLoCo, add `save_freq` checkpointing + relaunch-on-interruption in the driver (report §10 failure modes).
|
| 286 |
+
|
| 287 |
+
## 8. References
|
| 288 |
+
- Repo: `composer_replication/diloco/serverless/sagemaker.py` (SageMakerExecutor), `replica_entrypoint.py`,
|
| 289 |
+
`allreduce.py` (ObjectStoreAllReduce + MockManager), `trainer/composer_trainer.py` (ComposerReplicationTrainer +
|
| 290 |
+
`make_po_config`), `examples/gsm8k_grpo/run.py`, `tests/test_sagemaker_executor.py`.
|
| 291 |
+
- Report §8 (EKS primary), §9 (SageMaker path + HyperPod hybrid), §10 (cost / phased plan).
|
| 292 |
+
- AWS DLC registry us-west-2: `pytorch-training:2.6.0-gpu-py312-cu124-ubuntu22.04-sagemaker` @ 763104351884
|
| 293 |
+
(docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-us-west-2.html).
|
| 294 |
+
- TRL vLLM colocate: GRPOConfig `use_vllm`/`vllm_mode`/`vllm_gpu_memory_utilization` (huggingface/trl
|
| 295 |
+
grpo_config.py; huggingface.co/blog/vllm-colocate; Ray TRL-GRPO example).
|
| 296 |
+
- SageMaker quotas: g5/g6/p4d training-job usage default 0 (StackOverflow 71655321; re:Post) — verified live this
|
| 297 |
+
account: g5.2xlarge=1, g6.2xlarge=0.
|
| 298 |
+
- HyperPod-EKS 1:1 mapping: docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html.
|
| 299 |
+
- Live `aws sts get-caller-identity` / `aws service-quotas list-service-quotas` / `aws iam` (2026-06-09).
|
|
@@ -0,0 +1,143 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# F4 — Decoupled DiLoCo on Serverless + S3 (AWS-native, concrete)
|
| 2 |
+
|
| 3 |
+
**Facet:** the headline distributed-training question — how N independent SageMaker Training Jobs (or EKS Indexed-Job pods) each run inner DiLoCo (AdamW H steps) then sync pseudo-gradients ONCE per ~500 steps through S3 via `ObjectStoreAllReduce`, with no cross-job NCCL.
|
| 4 |
+
|
| 5 |
+
**Environment (LIVE):** account `386931836011`, region `us-west-2`, `Admin`/Isengard. Verified live below:
|
| 6 |
+
- Rendezvous-ready bucket exists: `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` (matches the `sagemaker`-name pattern that `AmazonSageMakerFullAccess` grants S3 on — load-bearing, see IAM).
|
| 7 |
+
- Two ready execution roles: `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` (use this) and `...-20241223T082691`.
|
| 8 |
+
- CPU training quota = **30** instances each for `ml.m5.xlarge` / `ml.m5.large` / `ml.c5.xlarge` in us-west-2 → the 2-replica smoke is trivially in budget.
|
| 9 |
+
- Warm-pool quota = **0** for all (default). `KeepAlivePeriodInSeconds` set without a quota increase silently no-ops. This is THE cold-start gotcha.
|
| 10 |
+
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
## 1. The decision that already shipped (ADR-005) and why it's right for AWS
|
| 14 |
+
|
| 15 |
+
ADR-005 chose **object-store rendezvous, not cross-job NCCL**, as the DiLoCo comm primitive. DiLoCo (Douillard et al. 2023, arXiv:2311.08105 §3.2) syncs once per H≈500–1000 inner steps — ~10–30 min wall-clock. The exchange per outer round is one `PutObject` of the pseudo-gradient (~2 GB for a 1B bf16 model) + (N−1) `GetObject`s. At N=8 that's ~128 GB read spread over 30 min ≈ 70 MB/s aggregate, ~$0.05/round on S3. The repo realizes exactly this in `ObjectStoreAllReduce.allreduce()` (`composer_replication/diloco/serverless/allreduce.py:131`): PUT `round_{NNNNNN}/rank_{RRRR}.pt`, poll-until-all-peers-exist, mean, `tensor.copy_(avg)`.
|
| 16 |
+
|
| 17 |
+
**Why this is correct on AWS specifically (not just plausible):** S3 has been **strongly read-after-write consistent for PUT/GET/LIST in all regions since Dec 2 2020, at no extra cost** (`aws.amazon.com/s3/consistency`; `docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html`). The poll loop's correctness depends on exactly this: rank R PUTs `rank_0003.pt`, and the instant any peer's `_exists()` returns True the subsequent `_get()` is guaranteed to return that byte-complete object — no "object not yet visible" race, no eventual-consistency retry shim. The repo's `_put` even writes a `.tmp` then `os.replace` on the local path (atomic on POSIX); on S3 a single `PutObject` is atomic per key by definition, so a peer never sees a half-written `.pt`. **The substrate is consistency-correct by construction on S3.**
|
| 18 |
+
|
| 19 |
+
The one-line architectural payoff on K8s/serverless (report §8): a straggler replica simply blocks peers at the poll loop (bounded by `timeout_s=1800`) instead of deadlocking an NCCL gang. Inter-replica DiLoCo sync needs **no gang scheduling** — only intra-replica FSDP does.
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## 2. Exact lifecycle: `SageMakerExecutor.launch_replicas(N)` → collect
|
| 24 |
+
|
| 25 |
+
`composer_replication/diloco/serverless/sagemaker.py` (`SageMakerExecutor`) already implements the full `ServerlessExecutor` Protocol. The end-to-end flow for one Decoupled DiLoCo run:
|
| 26 |
+
|
| 27 |
+
```
|
| 28 |
+
make_diloco_run(N=4) # orchestrator (NEW, ~120 LOC — see §6)
|
| 29 |
+
└─ exec = SageMakerExecutor(
|
| 30 |
+
role_arn="arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247",
|
| 31 |
+
image_uri="386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-diloco:latest",
|
| 32 |
+
output_s3_path="s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-out/run42/",
|
| 33 |
+
region="us-west-2")
|
| 34 |
+
└─ handles = exec.launch_replicas(
|
| 35 |
+
n_replicas=4, entrypoint=<ignored>,
|
| 36 |
+
entrypoint_args={
|
| 37 |
+
"rendezvous_uri": "s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/run42/",
|
| 38 |
+
"trainer_module": "composer_replication.trainer.composer_trainer",
|
| 39 |
+
"trainer_fn": "diloco_train", # NEW thin wrapper — see §6
|
| 40 |
+
"trainer_kwargs": {"model_name":"Qwen/Qwen2.5-0.5B","sync_every":500,"total_steps":2000}},
|
| 41 |
+
gpu="A10G", timeout=86400)
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
**Per replica, `launch_replicas` submits ONE single-instance Training Job** (`sagemaker.py:309-369`):
|
| 45 |
+
- `ResourceConfig.InstanceCount == 1` — deliberately NOT one multi-instance job (which would wire SageMaker's intra-job NCCL via `resourceconfig.json` and couple the replicas — the wrong model). N replicas = N separate jobs.
|
| 46 |
+
- `AlgorithmSpecification.ContainerEntrypoint = ["python","-m","composer_replication.diloco.serverless.replica_entrypoint"]`, `ContainerArguments = ["--rendezvous", s3uri, "--world-size","4","--trainer-module",...,"--trainer-fn","diloco_train","--trainer-kwargs-json", "{...}"]` (each token a separate list element).
|
| 47 |
+
- `Environment = {"REPLICA_RANK":"<rank>","WORLD_SIZE":"4","RENDEZVOUS_URI":s3uri}` — the rank channel.
|
| 48 |
+
- `EnableNetworkIsolation=False` — **load-bearing**, pinned, never a knob: True severs the container's outbound S3 and dead-locks the allreduce poll until `timeout_s`. The bucket access is granted on `RoleArn` instead (SageMaker's IRSA analog).
|
| 49 |
+
- On any rank's `create_training_job` failure, already-launched siblings are best-effort `cancel`ed then it raises (clean abort, no orphan jobs).
|
| 50 |
+
|
| 51 |
+
**Inside each job, `replica_entrypoint.main`** (`replica_entrypoint.py:38`):
|
| 52 |
+
1. reads `REPLICA_RANK` from env (falls back to argv via the dual contract — argv for SageMaker/Local, env for EKS),
|
| 53 |
+
2. builds `store = ObjectStoreAllReduce(uri=s3uri, rank, world_size)`. For an `s3://` URI this hits `_init_fsspec()` → `fsspec.filesystem("s3")` (s3fs; in the `[serverless]` extra),
|
| 54 |
+
3. `manager = MockManager(store)`,
|
| 55 |
+
4. imports `trainer_module.trainer_fn`, injects `manager=`, `rank=`, `world_size=`, calls it.
|
| 56 |
+
|
| 57 |
+
**Inside `diloco_train` (the trainer_fn):**
|
| 58 |
+
```python
|
| 59 |
+
diloco = make_diloco_outer_loop(
|
| 60 |
+
manager=manager, model_fragments=[model],
|
| 61 |
+
inner_optimizer=AdamW(model.parameters(), lr=1e-5),
|
| 62 |
+
outer_lr=0.7, outer_momentum=0.9, nesterov=True, sync_every=500)
|
| 63 |
+
with diloco:
|
| 64 |
+
for step in range(total_steps):
|
| 65 |
+
inner_optim.zero_grad(); loss = composer_loss(...); loss.backward()
|
| 66 |
+
inner_optim.step() # at step%500==0 DiLoCo's post-hook fires
|
| 67 |
+
```
|
| 68 |
+
At every 500th `inner_optim.step()`, torchft's DiLoCo post-hook fires `prepare_sync` → `perform_sync`. `perform_sync` computes the pseudo-gradient `θ_initial − θ_local` (sign convention pinned in `diloco/__init__.py:13-38`), calls `manager.allreduce(pseudograd)` → `MockManager.allreduce` (`allreduce.py:268`) → `ObjectStoreAllReduce.allreduce`: PUT `round_000000/rank_0003.pt`, poll for all 4 ranks' files, mean, `copy_` back. `MockManager.should_commit()` always returns True (no FT failover; replica failure is the orchestrator's job), then the **outer Nesterov SGD step** applies the averaged pseudo-gradient and redistributes the new global weights. `start_quorum()` bumps `_step` so `current_step()` advances exactly once per round (fragment-rotation math). Repeat for the next 500 inner steps → `round_000001/`, etc.
|
| 69 |
+
|
| 70 |
+
**Collect:** `exec.collect(handles, timeout=...)` polls `describe_training_job` per handle until terminal (`Completed`/`Failed`/`Stopped`), returns rank-ordered result dicts incl. `S3ModelArtifacts`. `poll` maps `TrainingJobStatus` via `_STATUS_MAP` (refining `InProgress`+queued → `pending`). `stream_logs` reads `/aws/sagemaker/TrainingJobs` CloudWatch stream `<job>/algo-1-<epoch>`.
|
| 71 |
+
|
| 72 |
+
The DiLoCo math, `MockManager`, `ObjectStoreAllReduce`, `make_diloco_outer_loop`, the trainer, and the loss are **all byte-for-byte unchanged** across Local→EKS→SageMaker. That is the whole point of the Protocol.
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
## 3. S3 rendezvous specifics for AWS
|
| 77 |
+
|
| 78 |
+
- **Bucket/prefix:** `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/<run_id>/` for the rendezvous; a sibling `…/diloco-out/<run_id>/` for `OutputDataConfig`. Use the sagemaker-named bucket so the stock execution-role policy already grants access (see IAM). Layout written by the substrate: `…/diloco-rdv/<run_id>/round_{NNNNNN}/rank_{RRRR}.pt`.
|
| 79 |
+
- **Consistency:** strong RAW + strong LIST, all regions, since 2020 — the poll loop (`_exists` then `_get`) needs no consistency shim. A peer's file becomes visible atomically on PUT completion.
|
| 80 |
+
- **IAM (load-bearing):** the jobs need `s3:GetObject` + `s3:PutObject` (and ideally `s3:ListBucket`) on `…/diloco-rdv/<run_id>/*` on the **execution `RoleArn`**, NOT the caller. The stock `AmazonSageMakerFullAccess` on the existing roles grants S3 only on buckets whose name contains `sagemaker`/`Sagemaker`/`SageMaker`/`aws-glue` — which is exactly why the `amazon-sagemaker-…` bucket works out of the box and a custom-named `s3://composer-diloco-rdv` would silently 403 the first PUT and hang every peer until `timeout_s`. Either keep the rendezvous in a sagemaker-named bucket (recommended, zero IAM work) or attach an inline policy scoping `s3:GetObject/PutObject/ListBucket` to the custom prefix.
|
| 81 |
+
- **Poll timeout vs stragglers:** `ObjectStoreAllReduce(timeout_s=1800, poll_interval_s=1.0)`. 30 min comfortably covers a SageMaker cold-started replica that joins late (3–5 min provision + first-500-steps lag). For Spot churn at larger N, raise to `timeout_s=3600`. A `TimeoutError` names the exact missing `rank_R` + `round_N` — the orchestrator can then cancel + relaunch that rank (DiLoCo's `should_commit==True` means a stalled round does not silently skip; the run aborts cleanly rather than averaging a partial set).
|
| 82 |
+
- **Cost:** ~$0.05/round (ADR-005), negligible vs GPU. For a 2000-step / sync_every=500 run that's 4 rounds ≈ $0.20 of S3 for the whole run.
|
| 83 |
+
|
| 84 |
+
---
|
| 85 |
+
|
| 86 |
+
## 4. What's missing to run it for real (the deferred smoke) + the cheapest validating run
|
| 87 |
+
|
| 88 |
+
**The gap.** ADR-005 §"Open/deferred" flags a "real serverless smoke" as never run; report §10 Phase 0 lists "EKSExecutor + S3 rendezvous + dep bump" as substrate hardening. Concretely, what exists vs what's missing:
|
| 89 |
+
|
| 90 |
+
| Layer | State |
|
| 91 |
+
|---|---|
|
| 92 |
+
| `ObjectStoreAllReduce` over `file://` | **Proven** — `test_serverless_diloco_integration.py` runs a 2-process `LocalProcessExecutor` run and asserts cross-rank weight convergence after one outer round. |
|
| 93 |
+
| `ObjectStoreAllReduce` over `s3://` | **Code path exists, never exercised against real S3.** `_init_fsspec()` is untested with live s3fs; the `_exists`/`_get`/`_put` S3 branches have only mock coverage. |
|
| 94 |
+
| `SageMakerExecutor` against real boto3 | **Never submitted a real job.** Tests inject a `_MockSMClient`. The ECR `image_uri` does not yet exist (no Dockerfile baking `composer_replication`). |
|
| 95 |
+
| `diloco_train` trainer_fn | **Missing.** `composer_trainer.py` has the trainer but no thin DiLoCo-wrapped entry that accepts the injected `manager`/`rank`/`world_size` kwargs. |
|
| 96 |
+
|
| 97 |
+
**The exact remaining gaps to close, in order (each cheap, each decisive):**
|
| 98 |
+
1. **s3:// smoke (≈$0, ~10 min):** point the *existing* `test_serverless_diloco_integration` multi-process test at `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/smoke-$(uuid)/` instead of a tmp dir, with 2 local processes. This exercises the real s3fs PUT/poll/GET/mean path and the strong-consistency assumption with zero GPU spend. Only `s3fs` + `boto3` (already in `[serverless]`) and ambient `Admin` creds are needed. Gate: both ranks converge to identical weights (the existing assertion).
|
| 99 |
+
2. **Dockerfile + ECR push (~$0):** ~30-line image `FROM pytorch/pytorch`, `pip install -e .[serverless,diloco,train]`, entrypoint `python -m composer_replication.diloco.serverless.replica_entrypoint`. Push to `386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-diloco:latest`.
|
| 100 |
+
3. **`diloco_train` trainer_fn (~40 LOC):** wraps the existing trainer in `make_diloco_outer_loop`, accepts `manager/rank/world_size`.
|
| 101 |
+
4. **2 tiny CPU SageMaker jobs (the validating run, ~$0.10–0.30):** `SageMakerExecutor(gpu=None → ml.m5.xlarge, $0.23/hr on-demand)`, `n_replicas=2`, `nn.Linear` or a 0.5B model with `total_steps=4, sync_every=2`, `rendezvous_uri` in the sagemaker bucket. Two jobs × ~5 min cold-start + ~2 min run ≈ 0.24 instance-hours ≈ **$0.06–0.30**. Within the 30-instance CPU quota with massive headroom. Gate: `collect()` returns both `succeeded`, both ranks' `round_000000/rank_000{0,1}.pt` appear in S3, and the two model artifacts in `…/diloco-out/` are byte-identical (proves the cross-job allreduce ran through real S3, not just locally).
|
| 102 |
+
|
| 103 |
+
That ladder — `file://` (done) → `s3://` local (step 1) → 2 real CPU jobs (step 4) — is the report's prescribed cheapest path and closes the deferred-smoke gap for well under the ADR's $2–5 estimate (the original estimate assumed GPU + Modal).
|
| 104 |
+
|
| 105 |
+
---
|
| 106 |
+
|
| 107 |
+
## 5. Streaming DiLoCo (`fragment_sync_delay>0`) on this substrate
|
| 108 |
+
|
| 109 |
+
Streaming DiLoCo (Liu et al. 2025, "Eager Updates for Overlapped Communication in DiLoCo", arXiv:2501.18512; the DiLoCo line is arXiv:2311.08105) splits the model into fragments synced on staggered schedules so the allreduce of fragment k overlaps inner computation of the next steps. `make_diloco_outer_loop` already exposes the knobs: `fragment_sync_delay>0`, `fragment_update_alpha`, and `model_fragments=[frag_0,...,frag_M-1]` (`diloco/__init__.py:72-93`).
|
| 110 |
+
|
| 111 |
+
From torchft's `local_sgd.py` (`_StreamingDiLoCoFragment`, verified via DeepWiki against the upstream source):
|
| 112 |
+
- **prepare_sync** fires at inner step `sync_every − fragment_sync_delay`: computes pseudo-gradients (`_save_grads`) and **launches the allreduce without waiting**, recording the `Work` in `self._allreduce_work`, on a separate CUDA stream.
|
| 113 |
+
- **perform_sync** fires at `sync_every`: **waits** on that `Work`, restores global params, `should_commit()`, applies the outer step, then `_merge_parameters` blends `local*alpha + global*(1−alpha)` (`fragment_update_alpha`; 0.0 = standard full-replacement DiLoCo).
|
| 114 |
+
- **`_current_fragment() = current_step() % len(fragments)`** — round-robin; each fragment syncs on its own offset. This is exactly why `MockManager.start_quorum()` bumps `_step` once per round and `current_step()` is faithful: get this wrong and replicas pick different fragments and diverge.
|
| 115 |
+
|
| 116 |
+
**The load-bearing gotcha for the object-store substrate.** Streaming's overlap depends on `manager.allreduce()` being **asynchronous** — prepare_sync launches it, `fragment_sync_delay` steps of inner compute run, then perform_sync waits. But the repo's `MockManager.allreduce` is **synchronous**: it calls `ObjectStoreAllReduce.allreduce`, which **blocks** on the poll-until-all-peers loop before returning `_ImmediateWork` (whose `.wait()` is a no-op). So on this substrate, **today, prepare_sync blocks for the full S3 rendezvous and `fragment_sync_delay` buys zero overlap** — Streaming degrades to vanilla, correctly but without the comm/compute overlap benefit. This is fine for correctness (and is why the same API "configures Streaming" per the docstring) but defeats the point on a 2 GB-per-fragment, 30-min-cadence S3 sync.
|
| 117 |
+
|
| 118 |
+
**The fix to make Streaming real here (~60 LOC, deferred):** give `ObjectStoreAllReduce` a non-blocking mode: `allreduce_async(tensor)` PUTs `round_N/rank_R.pt` and returns immediately; the returned `Work.wait()` then runs the poll-GET-mean-copy. `MockManager.allreduce` returns that deferred `Work` instead of `_ImmediateWork`. Now prepare_sync's PUT returns instantly, the `fragment_sync_delay` inner steps run while peers are PUTting concurrently, and perform_sync's `.wait()` does the poll/mean. Because S3 is strongly consistent, by the time perform_sync waits `fragment_sync_delay` steps later, peer files are far more likely already present — the overlap is genuine. Per-fragment streaming further shrinks each PUT to (model_size / M), so the poll is over smaller objects. This is the natural Streaming-DiLoCo realization on object storage and is the right Phase-5 upgrade (report §10) for multi-fragment large-model runs; vanilla (`fragment_sync_delay=0`, single fragment) is correct and sufficient for Phases 0–4.
|
| 119 |
+
|
| 120 |
+
---
|
| 121 |
+
|
| 122 |
+
## 6. Repo delta (what to build in `composer_replication/`)
|
| 123 |
+
|
| 124 |
+
| File | Delta | ~LOC |
|
| 125 |
+
|---|---|---|
|
| 126 |
+
| `composer_replication/trainer/composer_trainer.py` | NEW `diloco_train(*, manager, rank, world_size, model_name, sync_every=500, total_steps, **kw)` — wraps the existing trainer body in `make_diloco_outer_loop(manager=manager,...)`; this is the `trainer_fn` `replica_entrypoint` calls. | ~40 |
|
| 127 |
+
| `composer_replication/diloco/serverless/run.py` | NEW thin orchestrator `make_diloco_run(executor, n, rendezvous_uri, trainer_module, trainer_fn, trainer_kwargs, gpu, timeout)` → `launch_replicas` + `collect`; surfaces straggler `TimeoutError` → relaunch-rank. | ~120 |
|
| 128 |
+
| `docker/Dockerfile.diloco` | NEW — `FROM pytorch/pytorch:*-cuda*`, `pip install -e .[serverless,diloco,train]`, ENTRYPOINT to `replica_entrypoint`. Push to ECR `composer-diloco:latest`. | ~30 |
|
| 129 |
+
| `composer_replication/diloco/serverless/sagemaker.py` | EXTEND: optional `keep_alive_period_s` → `ResourceConfig.KeepAlivePeriodInSeconds` (warm pool; **document the 0-default quota gotcha**); optional `use_spot` → `EnableManagedSpotTraining=True` + `MaxWaitTimeInSeconds` (> `MaxRuntimeInSeconds`) + `CheckpointConfig`. Both default off. | ~40 |
|
| 130 |
+
| `composer_replication/diloco/serverless/allreduce.py` | EXTEND (Phase 5, for real Streaming): `allreduce_async` + a deferred `Work` whose `.wait()` runs the poll/mean; `MockManager` returns it. Vanilla path untouched. | ~60 |
|
| 131 |
+
| `composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py` | EXTEND: parametrize rendezvous over `file://` AND `s3://amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d/diloco-rdv/smoke-<uuid>/` (gated on `AWS_SMOKE=1`) — closes the s3:// path gap with zero GPU. | ~30 |
|
| 132 |
+
| `docs/adrs/ADR-005-serverless-diloco.md` | UPDATE the "Open/deferred — Real serverless smoke" clause: replace the Modal $2–5 estimate with the SageMaker 2×CPU-job ($0.06–0.30) plan above. | ~10 |
|
| 133 |
+
|
| 134 |
+
Untouched: `loss.py`, `teacher_replay.py`, `safety/`, `make_diloco_outer_loop`, `MockManager` core, `ObjectStoreAllReduce` core, `replica_entrypoint` (its dual argv/env contract already supports both SageMaker and EKS).
|
| 135 |
+
|
| 136 |
+
---
|
| 137 |
+
|
| 138 |
+
## 7. Open questions / falsifiers
|
| 139 |
+
|
| 140 |
+
- **Warm-pool quota:** default 0 in us-west-2 (verified). If iterative dev wants warm starts (skip the 3–5 min cold-start per round-of-jobs), request a `ml.<type> for training warm pool usage` quota increase first; otherwise `KeepAlivePeriodInSeconds` no-ops.
|
| 141 |
+
- **Pseudo-gradient dtype/size at real model scale:** the smoke uses tiny tensors; a 0.5B–8B bf16 pseudo-gradient is 1–16 GB per PUT — confirm s3fs multipart upload throughput and that `torch.save({"rank","tensor"})` round-trips bf16 on CPU (the code casts peer tensors back to device+dtype on GET).
|
| 142 |
+
- **Streaming overlap:** only real after the `allreduce_async` delta (§5); until then `fragment_sync_delay>0` is correct-but-no-overlap on S3. Measure round wall-clock with vs without before claiming the benefit.
|
| 143 |
+
- **N>16 straggler fragility under Spot:** the bounded `timeout_s` poll is the mitigation; the HyperPod-attached-node-group path (report §9, same S3 rendezvous) is the resilience escalation for multi-day runs.
|
|
@@ -0,0 +1,106 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# F5 — Fidelity Audit: Are we FULLY taking advantage of the papers + the Composer 2.5 / Composer-2 methodology?
|
| 2 |
+
|
| 3 |
+
> **Date:** 2026-06-09. **Scope:** rubric-style audit of every documented Composer 2.5 / Composer-2 ingredient AND each load-bearing paper against the *actual* code in `composer_replication/`, plus a prioritized "to fully replicate + extend" gap list. AWS-native where a gap needs cloud infra (account 386931836011, us-west-2).
|
| 4 |
+
> **Grounding:** `docs/COMPOSER_RECIPE_MAPPING.md`, `research/01/09/10` (Composer 2.5 blog + Composer-2 techreport mining), `research/05-12`, ADR-008/014/015, and `research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md`.
|
| 5 |
+
|
| 6 |
+
## Headline verdict
|
| 7 |
+
|
| 8 |
+
The repo is a **high-fidelity replication of Composer 2.5's two *published* channels** (Dr.GRPO base + SDPO textual-feedback self-distillation) plus the framework's *own* third channel (multi-teacher trace-replay-DPO). The datagen / curriculum / anti-hack / safety substrate is essentially complete. But there are **three classes of gap that block "taking a model to the next level"**:
|
| 9 |
+
|
| 10 |
+
1. **One byte-level loss-math infidelity that the evidence says actually matters**: Channel 1 rides TRL's **k3-in-loss** KL estimator; Composer 2 explicitly chose **k1-in-reward** (`-log r`), and the 2025/26 literature (arXiv:2512.21852 "A Comedy of Estimators"; verl adopting k1-in-reward as the *only* reverse-KL option; TRL issue #4967) shows k1-in-reward **improves OOD generalization** while k3-in-reward can *collapse*. The repo documents this as a "small delta, not patched" — but OOD generalization is exactly the "next-level" axis. This is the single most concrete fidelity fix.
|
| 11 |
+
2. **Composer-2's *non-hint* behavior-shaping recipe is entirely MISSING**: the **auxiliary scalar reward array** (style / communication / unfinished-todo penalties), the **nonlinear length/effort penalty** `C_length = ((1+kx)^{1-q}-1)/(k(1-q))`, and **self-summarization with reward-to-all-chain-tokens**. These are *fully specified with equations* in research/10 and are reproducible without the hint mystery. None is in code.
|
| 12 |
+
3. **The novel extension (multi-model Monte-Carlo tree-of-work + world-model deliberation head) is 100% design, 0% code.** No tree controller, no env-step-between-branches recursion, no `SiblingBootstrapGenerator`, no `<deliberate>` token, no next-state head. The report's "core delta" (teacher-plurality → execution-oracle fitness, depth-1 → recursion) is unbuilt.
|
| 13 |
+
|
| 14 |
+
The good news the report stresses: the substrate for all of this already exists. ~9/10 of the system is reuse. The fidelity gaps are *additive*, not architectural rewrites.
|
| 15 |
+
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
## RUBRIC A — Composer 2.5 / Composer-2 documented ingredients
|
| 19 |
+
|
| 20 |
+
| # | Ingredient (source) | Status | Implementing file / evidence | Gap |
|
| 21 |
+
|---|---|---|---|---|
|
| 22 |
+
| (a) | **Targeted RL w/ textual feedback** = SDPO/OPSD self-distill (Ch2) — 2.5 blog | **FULLY-REPLICATED** | `opsd.py::generalized_jsd_loss` (byte-for-byte vs siyan-zhao/OPSD, re-aligned Wave 15); `trainer/composer_trainer.py::_compute_sdpo_loss` (full-logits, stop-grad teacher, post-hint masked, ADR-011 aligned indices); `trainer/data_collator.py` emits `ctx_teacher_input_ids` + `student/teacher_response_idx`; tests `test_sdpo_alignment_indices.py`, `test_opsd_parity.py`. | Loss + wiring are production-grade. The *hint source* is the open question — see (a′) below. SDPO not yet smoke-tested against a real `trl.GRPOTrainer` on GPU (ADR-008 note). |
|
| 23 |
+
| (a′) | **Hint generation** (the #1 reproducibility gap — unstated in *every* Cursor artifact) | **PARTIAL (designed-around)** | `hint_generator.py` layered: `TemplateHintGenerator` → `RawErrorHintGenerator` (routed) → `LLMJudgeHintGenerator` (cached, clamped) → `default_composite()`. ADR-009/012. | Layers 1-3 built; **layer 4 (`SiblingBootstrapGenerator` = SDPO "successful-rollout-as-implicit-feedback")** is design-only. LLM-judge path needs a live model wired (Bedrock). |
|
| 24 |
+
| (b1) | **25× synthetic data — Feature Deletion generator** — 2.5 blog | **FULLY-REPLICATED (substrate-inversion form)** | `datagen/substrates.py::SweBenchAdapter` (revert gold patch → broken repo, FAIL_TO_PASS=reward target, license filter), `datagen/env.py::FeatureDeletionEnv`, `datagen/validator.py` (4-gate solvability), `datagen/schema.py`. | Real *generative* synthesis (manufacture novel broken states beyond SWE-bench inversion) absent; only adapts existing SWE-* instances. No "25×" scale-out generator suite. |
|
| 25 |
+
| (b2) | **Dynamic-difficulty curriculum ("select for AND create harder tasks dynamically")** — 2.5 blog + Composer-2 §3 (keyed on #turns + thinking-tokens) | **FULLY-REPLICATED (select-for half)** | `datagen/curriculum.py::DifficultyCurriculum` — p̂(1−p̂) frontier weighting, retire >0.95, quarantine <0.02, **effort tilt on turns/think-tokens** (ADR-012 #4, matching Composer-2's exact heuristic). | **CREATE half missing**: no live escalation of deletion-span / coupling / multi-feature difficulty during the run. Curriculum scores an *existing* pool; it doesn't mint harder tasks. |
|
| 26 |
+
| (c1) | **Dr.GRPO base objective** — Composer-2 §4.1 | **FULLY-REPLICATED** | `composer_trainer.py::make_dr_grpo_config` + `make_po_config` (PO menu: grpo/dr_grpo/bnpo/dapo/gspo/cispo, pure TRL 1.5.0 config). `loss_type="dr_grpo"`, `scale_rewards="none"`, `num_iterations=1`, drift-guard asserts. ADR-014. | None on the objective itself. |
|
| 27 |
+
| (c2) | **k1-vs-k3 KL** — Composer-2 §4.1 explicitly chooses **k1 = −log r in *reward*** (variance argument, citing Amini et al.) | **PARTIAL — DOCUMENTED INFIDELITY** | `composer_trainer.py:496-509` documents that TRL's `_compute_loss` uses **k3-in-loss** (`exp(Δ)−Δ−1`), NOT k1. `test_dr_grpo_config_and_alignment.py::test_trl_kl_estimator_is_k3_not_k1` pins this. Honest delta, not patched. | **The evidence says this delta matters for the "next level":** arXiv:2512.21852 + TRL #4967 + verl (k1-in-reward only) show k1-in-reward ↑ OOD generalization; k3-in-reward can collapse. Composer chose k1 deliberately. Fix is implementable (see Gap #1). |
|
| 28 |
+
| (d) | **CPT → SFT → RL phase structure** — Composer-2 §3-4 (CPT loss ↓ ⇒ RL ceiling ↑, replicated on Qwen3-Coder-30B) | **PARTIAL (intentional skip + plumbing)** | Documented decision to skip CPT and start from a code-tuned base (COMPOSER_RECIPE_MAPPING.md row a; corroborated by Composer-2's own CPT→RL causal claim). Inner/outer loop split exists (datagen=outer, `ComposerReplicationTrainer`=inner). | **No SFT-first stage in code.** Report §5 prescribes "SFT-first on clean winning trajectories before RL" — there is no SFT trainer/recipe; only the RL trainer exists. CPT correctly skipped. |
|
| 29 |
+
| (e) | **Sharded Muon + dual-mesh HSDP** (2.5 blog) / FSDP+CP+decoupled-EP, Adam (Composer-2 §6) | **MISSING (intentional, irrelevant at our scale)** | — | Correctly out of scope for dense Qwen3-{7,32}B (the mapping doc + report both say skip until MoE base). Distributed substrate is DiLoCo-over-S3, not HSDP. Note research/10 *corrects* the blog: Composer-2 uses **Adam**, not Muon, and FSDP+CP+decoupled-EP, not HSDP. |
|
| 30 |
+
| (f) | **Anyrun production-fidelity sandboxed RL harness** (>500 pods/s, per-pod Firecracker microVM, fork/snapshot, Anygress egress proxy) — Composer-2 §6.2 | **PARTIAL** | `datagen/sandbox.py` (`Sandbox` Protocol, `LocalSubprocessSandbox`, `scrub_tree` primary control, denylist defense-in-depth), `datagen/docker_sandbox.py`, `diloco/serverless/{executor.py,eks.py,sagemaker.py,modal_spawn.py}`. | No microVM isolation (gVisor/Kata-Firecracker), no fork/snapshot, no egress proxy, no >100k-pod orchestration. The report's EKS plan (§8: gVisor default → Kata+Firecracker → container-free SWE-MiniSandbox) is design-only. `eks.py`/`sagemaker.py` are executor skeletons, not the full Anyrun analogue. |
|
| 31 |
+
| (g) | **Reward-hacking monitoring** (2.5 blog: bytecode decompile / type-cache hacks; "agentic monitoring tools") | **FULLY-REPLICATED (defense-in-depth) + run-level guard now wired** | `datagen/monitor.py::HackMonitor` (signature + patch-provenance, obfuscation-resistant), `sandbox.py::scrub_tree` (physical cache/.git removal = "the wall"), `datagen/validator.py` (4-gate), `safety/holdout.py::HeldoutSplit` (id + content-hash disjointness), `safety/kill_switch.py::HeldOutGuard` (proxy-real Hacking-Gap + KL hard-stop), **now wired into the trainer** (`composer_trainer.py::_maybe_update_killswitch`, ADR-015, 2026-06-08). | The held-out kill-switch — the report's "most load-bearing safeguard, documented gap" — is **now CLOSED** (ADR-015). Remaining: `HackMonitor` validated only on constructed examples (report warns synthetic-hack monitors fail to generalize); offline LLM-judge monitor (EvilGenie-style) not built. |
|
| 32 |
+
| (h) | **Aux scalar rewards (style/communication/unfinished-todo penalties)** — Composer-2 §4.2 | **MISSING** | Reward is pure test-pass-fraction (`env.py::_grade`). No auxiliary reward array. `integrations/altered_minds/reward.py` is an MMLU-format reward for ADR-013 ladder, not the Composer behavior-reward suite. | Fully specified in research/10; reproducible without the hint mystery. Build a `behavior_rewards.py` reward-fn bank. |
|
| 33 |
+
| (i) | **Nonlinear length/effort penalty** `C_length{k,q}(x)=((1+kx)^{1−q}−1)/(k(1−q))` — Composer-2 §4.2 (exact equation) | **MISSING** | — | Trivially implementable (≈30 LOC reward shaper over {thinking, tool-call, tool-output, final-msg tokens, #calls, #turns}). Induces parallel tool calls per the report. |
|
| 34 |
+
| (j) | **Self-summarization (reward-to-all-chain-tokens)** — Composer-2 §4.1 | **MISSING** | — | The mechanism that handles 100k-token long-horizon rollouts (the regime the report says the *tree* is for). Not built. |
|
| 35 |
+
| (k) | **MoE router replay** — Composer-2 §6.2 | **MISSING (out of scope, dense bases)** | — | Only relevant for MoE-base RL; correct to defer. |
|
| 36 |
+
|
| 37 |
+
**Rubric A score:** 5 FULLY-REPLICATED of the *core* published recipe (a, b1, b2, c1, g), 1 partial-around (a′), 1 documented-infidelity (c2), 1 intentional-skip (d, e, k), and **3 missing-but-specified Composer-2 behavior-shaping items (h, i, j)** that are the cheapest available "next-level" wins.
|
| 38 |
+
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
+
## RUBRIC B — Load-bearing papers
|
| 42 |
+
|
| 43 |
+
| Paper (cluster) | Status | Implementing file / evidence | Gap |
|
| 44 |
+
|---|---|---|---|
|
| 45 |
+
| **SDPO / OPSD** (2601.20802 / 2601.18734) | **FULLY-REPLICATED** | `opsd.py` (byte-for-byte OPSD JSD), `composer_trainer._compute_sdpo_loss`. | SDPO "successful-rollout-as-implicit-feedback" lever (=sibling bootstrap) not built. |
|
| 46 |
+
| **Dr.GRPO** (2503.20783) | **FULLY-REPLICATED** | `make_dr_grpo_config`; tests pin length-std-off + no-std-norm. | KL is k3-not-k1 (see Rubric A c2). |
|
| 47 |
+
| **DAPO / GSPO / CISPO / BNPO** (2503.14476 / 2507.18071 / 2506.13585) | **FULLY-REPLICATED (as menu)** | `PO_OBJECTIVES` presets + drift guards (ADR-014). Note Composer-2 *rejected* DAPO overlong-masking at small scale. | None — menu exceeds Composer's own single choice. |
|
| 48 |
+
| **DiLoCo / Streaming-DiLoCo** (2311.08105 / 2501.18512) | **PARTIAL** | `diloco/__init__.py::make_diloco_outer_loop`, `diloco/serverless/allreduce.py::ObjectStoreAllReduce` (s3://), `MockManager` mirroring torchft.Manager. ADR-003/005. | Streaming-DiLoCo (overlapped/quantized comm, partial-param sync) not implemented — plain DiLoCo outer loop only. Real multi-replica AWS run unproven (executor skeletons). |
|
| 49 |
+
| **World-model cluster** (MuZero 1911.08265, Dreamer 2301.04104, CWM 2510.02387, Chain-of-World 2603.03195) | **MISSING (design only)** | None. Report §2 designs a parameter-isolated next-state head + `<deliberate>` token as a *second SDPO mode*; calibration (ECE/Brier) primary, Foresight@k kill-ablation. | **Next-state head is NOT built — purely designed.** This is the project's stated end-goal. |
|
| 50 |
+
| **SimPO / TAID / Entropy-OPD distillation** (ADR-007) | **FULLY-REPLICATED** | `distillation/{simpo.py,taid.py,entropy_aware_opd.py}`, wired into `loss.py::compose_loss(dpo_variant=, sdpo_wrapper=, taid_t=)`; tests `test_distillation_losses.py`, `test_taid_parity.py`. | Production trainer (`composer_trainer.py`) only exposes SDPO/DPO; SimPO/TAID/Entropy-OPD live in the verification-mirror `compose_loss`, not the live GRPO subclass. |
|
| 51 |
+
| **PRIME-RL / verl / Monarch** (ADR-006) | **PARTIAL** | `recipes/prime_rl/composer_loss.py` (Ch1+Ch3, raises NotImplementedError for SDPO — needs full logits, ADR-008), `recipes/monarch/actors.py`, parity harness. | PRIME-RL can't host SDPO (logits gap upstream); Monarch is actor skeletons; verl `AsyncServer` (the report's scale answer for tool-heavy trees) not integrated. |
|
| 52 |
+
| **Prune-vs-train-on-all evidence** (RAFT 2504.11343, neg-gradient 2505.18830, near-miss 2503.14391, expert-failure, CCA/PURE) | **PARTIAL (ladder scaffolded)** | ADR-013 A0–A4 isolated-channel ladder + `integrations/altered_minds/{ladder.py,kl_logging.py,reward.py}`. | The report's **P0–P6 "generate-once/route-many" branch-usage axis** (the experiment that *settles* prune-vs-all) is design-only. No typed-train-on-all routing. |
|
| 53 |
+
|
| 54 |
+
**Rubric B score:** 4 FULLY-REPLICATED (SDPO/OPSD, Dr.GRPO, PO-menu, SimPO/TAID/Entropy-OPD), 3 PARTIAL (DiLoCo, PRIME-RL/verl/Monarch, prune-vs-all ladder), **1 MISSING (the entire world-model cluster — the stated goal).**
|
| 55 |
+
|
| 56 |
+
---
|
| 57 |
+
|
| 58 |
+
## THE NOVEL EXTENSION — multi-model Monte-Carlo tree-of-work + world-model deliberation head
|
| 59 |
+
|
| 60 |
+
**Status: 100% design, 0% code.** Verified by grep: no `SiblingBootstrap*`, no `world_model`/`WorldModel`, no `<deliberate>`/`MCTS`/`next_state_head` in `composer_replication/` (only docstring uses of the word "deliberately"). Every reference lives in `research/` notes and the deep-research report.
|
| 61 |
+
|
| 62 |
+
What exists today (the *ancestor*): `teacher_replay.py` is **flat depth-1** (N teachers query the same `state["messages"]`, nobody applies the action) with **teacher-plurality fitness** (`Counter` over normalized actions, `extract_dpo_pairs`). The report's two changes — (1) **recursion** (apply each candidate via `FeatureDeletionEnv.step()` → new state → branch again) and (2) **execution-oracle fitness** (`_grade()`'s masked pass-fraction replaces plurality) — are the entire idea and are unbuilt.
|
| 63 |
+
|
| 64 |
+
**Minimal first build (matches report Phase 1, the "one cheap, clearly-worth-it change"):**
|
| 65 |
+
- A tree controller that, between branches, calls `FeatureDeletionEnv.step(action)` and grades leaves with `_grade()` — turning Channel-3 depth-1 stars into a real tree **with NO new loss term**. The divergence signal enters as the SDPO teacher's privileged-info conditioning (the reserved `SiblingBootstrapGenerator` slot: select max-reward sibling, emit "a working approach looks like: …", feed the *same* `ctx_teacher` splice).
|
| 66 |
+
- This is ~1 module (`datagen/tree_controller.py`, ~200-300 LOC) + `SiblingBootstrapGenerator` (~60 LOC in `hint_generator.py`). The collator and loss are untouched.
|
| 67 |
+
- **Divergence-gating is mandatory** (report §3/§10): branch only where sibling next-action distributions already disagree, else collapse to one rollout — turns O(N^D) into O(N·decision-points). Ungated cost ≈$64/trace vs $0.98 flat.
|
| 68 |
+
|
| 69 |
+
**The world-model head (report Phase 4, gated on the P0–P6 verdict):** a parameter-isolated next-state adapter + `<deliberate>` token as a *second SDPO mode* — splice the realized post-action observation (stdout, tool_error kind, signed FAIL_TO_PASS delta, one-line diff) into the teacher context as privileged info, distill the student toward the foreseen-outcome distribution. Carrier requires no new kernel (rides `generalized_jsd_loss`). Build only if Foresight@k ≠ 0.
|
| 70 |
+
|
| 71 |
+
---
|
| 72 |
+
|
| 73 |
+
## Prioritized "to fully replicate + extend" gap list (build order)
|
| 74 |
+
|
| 75 |
+
Ordered by (fidelity-leverage × cheapness), front-loading the items that move the "next-level" needle for the least build.
|
| 76 |
+
|
| 77 |
+
**Tier 0 — cheap fidelity fixes the evidence says move OOD generalization (do first):**
|
| 78 |
+
1. **k1-in-reward KL** (Rubric A c2). Add a `kl_estimator="k1"` + `use_kl_in_reward=True` path to the trainer: compute `−log r` per token, fold into the *advantage/reward* (not the loss), set TRL `beta=0.0` to disable its k3-in-loss term. Mirror TRL issue #4967 / verl's choice. `composer_trainer.py` ~60 LOC + test flipping the pinned k3 assertion. **This is the highest-fidelity-leverage single change.**
|
| 79 |
+
2. **Composer-2 behavior rewards** (Rubric A h+i): `datagen/behavior_rewards.py` — the aux scalar reward array (style/communication/unfinished-todo) + the nonlinear length/effort penalty `C_length` (exact eq. in research/10), as TRL `RewardFunc`s composable with `env.reward_fn`. ~120 LOC. Reproducible *without* the hint mystery; directly targets Composer's "communication style + effort calibration" goal.
|
| 80 |
+
|
| 81 |
+
**Tier 1 — close the highest-value PARTIALs:**
|
| 82 |
+
3. **SDPO live-GPU smoke** (Rubric A a): instantiate `ComposerReplicationTrainer` against a real `trl.GRPOTrainer` on a small model (Qwen2.5-0.5B) on a SageMaker Training Job (g5/g6e) or HyperPod node-group — discharges the ADR-008 "never smoke-tested against real GRPOTrainer" caveat.
|
| 83 |
+
4. **`SiblingBootstrapGenerator`** (Rubric A a′ layer 4 + the SDPO implicit-feedback lever): ~60 LOC in `hint_generator.py`, wired into `default_composite`. Unblocks the tree's "zero-new-loss-term" wiring.
|
| 84 |
+
5. **SFT-first stage** (Rubric A d): a thin SFT recipe over clean winning trajectories before RL (report §5 "competence floor"). Reuse TRL `SFTTrainer`.
|
| 85 |
+
|
| 86 |
+
**Tier 2 — the novel extension (the report's phased ladder):**
|
| 87 |
+
6. **Tree controller + execution-oracle fitness** (recursion, the core delta): `datagen/tree_controller.py` — env-step between branches, `_grade()` leaves, divergence-gated expansion. Report Phase 1+2.
|
| 88 |
+
7. **P0–P6 typed-train-on-all routing** (settles prune-vs-all): extend ADR-013's ladder with generate-once/route-many on a shared tree; primary metrics = near-miss calibration (ECE/Brier) + Foresight@k, pass@1 secondary. Report Phase 3.
|
| 89 |
+
8. **World-model next-state head** (the stated goal): parameter-isolated adapter + `<deliberate>` token as a second SDPO mode. Build *only if* P4/P6 beat P0–P3 on foresight. Report Phase 4.
|
| 90 |
+
|
| 91 |
+
**Tier 3 — production-fidelity infra (Anyrun analogue on AWS):**
|
| 92 |
+
9. **Real AWS sandbox tiering** (Rubric A f): EKS gVisor `runsc` RuntimeClass default → Kata+Firecracker (self-managed node groups; EKS Managed Node Groups override the CPU-Options needed for nested virt) → container-free SWE-MiniSandbox for high fan-out. Egress-off. us-west-2, the live SageMaker S3 buckets as the trace/rendezvous store.
|
| 93 |
+
10. **`EKSExecutor` + `SageMakerExecutor` flesh-out + `[serverless]` dep bump** (s3fs/boto3/kubernetes): make the DiLoCo-over-S3 substrate actually launch multi-replica on the live account. Report §9.
|
| 94 |
+
11. **verl `AsyncServer` backend** for the tool-heavy tree (TRL has no async GPU-decoupled agent loop). Report §8/§10 Phase 5.
|
| 95 |
+
|
| 96 |
+
**Tier 4 — defense-in-depth completion:**
|
| 97 |
+
12. **Offline LLM-judge hack monitor** (EvilGenie-style, via Bedrock) as a flagging-only monitor (never the training reward); the report warns `HackMonitor` validated on constructed examples likely misses in-the-wild hacks.
|
| 98 |
+
|
| 99 |
+
---
|
| 100 |
+
|
| 101 |
+
## What "FULLY taking advantage" would mean, concretely
|
| 102 |
+
|
| 103 |
+
- **Published Composer 2.5 recipe (Ch1 Dr.GRPO + Ch2 SDPO):** essentially done; the *only* fidelity infidelity is k3-vs-k1 KL (Gap #1) and the missing live-GPU SDPO smoke (Gap #3).
|
| 104 |
+
- **Composer-2's reproducible behavior-shaping (aux rewards + length penalty + self-summarization):** the cheapest unrealized "next-level" wins — fully specified, zero hint-mystery, currently 0% built (Gaps #2 + self-summarization).
|
| 105 |
+
- **The papers' frontier the repo *exceeds* Composer on:** the PO-objective menu (6 objectives vs Composer's 1), the distillation menu (SimPO/TAID/Entropy-OPD), and the run-level collapse kill-switch. These are genuine over-delivery.
|
| 106 |
+
- **The novel bet (tree-of-work + world-model head):** the literature says build it *as a falsifiable ablation, not a premise* (report §3/§7). It is the project's differentiator and is entirely unbuilt — Tier 2 is where "next level" beyond Composer actually lives, conditioned on the P0–P6 verdict.
|