Wave 20: fix SageMaker smoke — torch-2.7 DLC + drop vllm pin (the real conflict)
Second live run peeled the fundamental conflict: trl 1.5.1 requires
transformers>=4.56.2, which references torch.float8_e8m0fnu (torch>=2.7).
So pinning transformers DOWN (last commit) was impossible, and the torch-2.6
DLC was the wrong base. Fixes:
- launcher + Dockerfile: use the torch-2.7 DLC
2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26 (pullable, verified).
- requirements.txt: drop vllm — vllm==0.8.5 hard-pins torch==2.6.0 and fights
the 2.7 base. vLLM-colocate is a rollout speed optimization, not what the
smoke proves; the smoke uses model.generate rollout. Resolver dry-run now
clean (trl pulls transformers>=4.56.2, satisfiable on torch 2.7).
- launcher: --no-vllm → --vllm (opt-in, OFF by default); vLLM needs a baked
image with a torch-2.7-matched vllm>=0.9.
- quickstart: document the full trl→transformers→torch chain + the vLLM note.
Two failed runs cost ~$0.30 total (365 + 316 billable sec).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
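
The chain described in the message (trl 1.5.1 → transformers>=4.56.2 → `torch.float8_e8m0fnu` → torch>=2.7) can be checked before a job is submitted. The following is a minimal sketch, not part of the commit: `preflight` and `parse_version` are hypothetical helpers, and the two `SimpleNamespace` objects are stand-ins for the `torch` module so the example runs without torch installed.

```python
import re
from types import SimpleNamespace

REQUIRED_DTYPE = "float8_e8m0fnu"  # dtype named in the commit message
MIN_TORCH = (2, 7)                 # first torch line said to ship that dtype


def parse_version(version: str) -> tuple:
    """Turn '2.7.1+cu128' into (2, 7, 1); the local '+...' suffix is ignored."""
    public = version.split("+", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", public)[:3])


def preflight(torch_module) -> str:
    """Return 'ok' if the given torch-like module can satisfy the chain."""
    has_dtype = hasattr(torch_module, REQUIRED_DTYPE)
    new_enough = parse_version(torch_module.__version__) >= MIN_TORCH
    if has_dtype and new_enough:
        return "ok"
    return (f"torch {torch_module.__version__} lacks torch.{REQUIRED_DTYPE}; "
            "use the torch-2.7 DLC")


# Stand-ins for the real module (assumed version strings, for illustration).
torch_26 = SimpleNamespace(__version__="2.6.0+cu126")
torch_27 = SimpleNamespace(__version__="2.7.1+cu128", float8_e8m0fnu=object())

print(preflight(torch_26))
print(preflight(torch_27))
```

In a real environment the same function could be called as `preflight(torch)` after `import torch`.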
@@ -5,16 +5,18 @@
 # surface). The one-shot smoke can instead use the stock DLC + source_dir
 # (run_sagemaker_launch.py --image dlc), which needs no local build.
 #
-# Base: AWS PyTorch DLC, tag RESOLVED LIVE against the us-west-2 registry
-#
-#
-
+# Base: AWS PyTorch DLC, tag RESOLVED LIVE against the us-west-2 registry.
+# MUST be torch-2.7: trl 1.5 → transformers>=4.56.2 → torch.float8_e8m0fnu
+# (torch>=2.7). The torch-2.6 DLC fails AutoModel.from_pretrained on that dtype.
+# cu128, -v1.26 build suffix required (no bare floating tag exists).
+FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26
 
-# RL stack baked in. torch 2.
-# torch. vllm
+# RL stack baked in. torch 2.7 + CUDA 12.8 already in the DLC — do NOT reinstall
+# torch. vllm>=0.9 is the torch-2.7 line (0.8.x hard-pins torch 2.6 and would
+# fight this base); pin to a 2.7-matched vllm to avoid a wheel/CUDA mismatch.
 RUN pip install --no-cache-dir \
     "trl>=1.5,<2" "peft>=0.13" "accelerate>=1.0" "datasets>=3.0" \
-    "vllm=
+    "vllm>=0.9" "fsspec>=2024.6" "s3fs>=2024.6" "hf_transfer>=0.1.6"
 
 # The framework itself (train + serverless extras → trainer, loss, executors,
 # replica_entrypoint, s3fs all present).
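
The Dockerfile comment requires a cu128 torch-2.7 tag with an explicit `-vX.Y` build suffix. A rough Python sketch of that filter is below, assuming the tag list has already been fetched from the registry. `first_matching_tag` is a hypothetical helper, the pattern escapes the dots that a grep-style expression might leave bare, and only the `-v1.26` tag in the sample list comes from this commit; the other two are invented to show rejection.

```python
import re

# torch 2.7.x, GPU, py312, cu128, SageMaker build with a mandatory -vX.Y suffix.
TAG_PATTERN = re.compile(
    r"^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$"
)


def first_matching_tag(tags):
    """Return the first tag matching TAG_PATTERN, or None if nothing matches."""
    for tag in tags:
        if TAG_PATTERN.match(tag):
            return tag
    return None


# Sample input, assumed to be ordered newest-first.
sample_tags = [
    "2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker",        # invented: no build suffix
    "2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.0",   # invented: torch-2.6 line
    "2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26",  # tag used in this commit
]

print(first_matching_tag(sample_tags))
```

Returning `None` instead of raising leaves it to the caller to decide whether a missing tag should abort the launch.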
@@ -10,18 +10,29 @@ GPU, end-to-end, for **under $1**. Implements F3 (`research/design-F3-rl-sagemak
 | `ml.g5.2xlarge` training-job quota | **1** (live, code `L-2D6DEB3C`) → no quota ticket needed |
 | Execution role | `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` |
 | Bucket (rendezvous + output) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` |
-| PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.
-
-> **
->
->
->
+| PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26` |
+
+> **The base image MUST be the torch-2.7 DLC** — learned from two live runs
+> (2026-06-09). The dependency chain forces it:
+> `ComposerReplicationTrainer → trl 1.5.x → transformers>=4.56.2 →
+> torch.float8_e8m0fnu (MXFP4, torch>=2.7)`. The torch-**2.6** DLC fails
+> `AutoModel.from_pretrained` with `AttributeError: module 'torch' has no
+> attribute 'float8_e8m0fnu'`, and pinning transformers *down* is impossible
+> (trl 1.5's floor is 4.56.2). Resolve the tag against the live registry — the
+> AWS docs page lists wrong/stale tags (it showed a cu124 2.6 tag that doesn't
+> exist; real tags are cu126 for 2.6, cu128 for 2.7, each with a mandatory
+> `-vX.Y` build suffix — no bare floating tag):
 > ```bash
 > aws ecr describe-images --registry-id 763104351884 \
 > --repository-name pytorch-training --region us-west-2 \
 > --query "reverse(sort_by(imageDetails,&imagePushedAt))[].imageTags" --output text \
-> | tr '\t' '\n' | grep -E '^2.
+> | tr '\t' '\n' | grep -E '^2.7.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$' | head -1
 > ```
+>
+> **vLLM is OFF by default** in the smoke: `vllm==0.8.5` hard-pins `torch==2.6.0`,
+> which fights the torch-2.7 base. The smoke uses `model.generate` rollout (what
+> it proves is trainer-on-GPU + reward, not rollout speed). For colocated vLLM,
+> bake a torch-2.7-matched `vllm>=0.9` into an image and pass `--image <ecr> --vllm`.
 
 > **SDK pin:** the smoke launcher uses the **sagemaker SDK v2** Estimator API.
 > SDK **v3 is an API rewrite** that dropped `sagemaker.estimator.Estimator` and
@@ -37,14 +48,14 @@ python examples/gsm8k_grpo/run_sagemaker_launch.py --max-steps 20
 
 This uses the **stock PyTorch DLC directly** as the training image and ships the
 framework + entry script via `source_dir`; `examples/gsm8k_grpo/requirements.txt`
-(trl +
-
-
-
+(trl + the RL stack) installs at job start. No 15 GB local image build, no ECR
+push. The script trains `Qwen/Qwen2.5-0.5B-Instruct` with GRPO + the GSM8K
+`#### NUMBER` RLVR reward, using `model.generate` rollout (vLLM off by default —
+see the torch-pin note above).
 
 Flags: `--no-wait` (submit + poll later), `--spot` (managed spot, quota=1 too),
-`--
-
+`--vllm` (enable colocated vLLM — only with a baked `--image` carrying a
+torch-2.7 vllm), `--image <ecr-uri>` (use a prebuilt baked image instead of the DLC).
 
 **Cost:** `ml.g5.2xlarge` ≈ $1.52/hr on-demand; a 20-step 0.5B smoke is
 ~15–25 min ⇒ **well under $1**. Spot ≈ $0.45–0.60/hr ⇒ pennies.
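
The cost figures can be reproduced with a few lines of arithmetic. This sketch assumes simple per-second billing at the quoted on-demand rate (no rounding, no storage or data-transfer charges). `job_cost_usd` is a hypothetical helper; the 365 and 316 billable seconds are the two failed runs from the commit message.

```python
ON_DEMAND_USD_PER_HOUR = 1.52  # ml.g5.2xlarge rate quoted in the quickstart


def job_cost_usd(billable_seconds: float,
                 usd_per_hour: float = ON_DEMAND_USD_PER_HOUR) -> float:
    """Cost of a training job, assuming linear per-second billing."""
    return billable_seconds * usd_per_hour / 3600


failed_runs = job_cost_usd(365 + 316)   # the two failed live runs
smoke_low = job_cost_usd(15 * 60)       # 15-minute smoke
smoke_high = job_cost_usd(25 * 60)      # 25-minute smoke

print(f"two failed runs: ${failed_runs:.2f}")
print(f"20-step smoke:   ${smoke_low:.2f} to ${smoke_high:.2f}")
```

Under that assumption the two failed runs come to about $0.29, in line with the ~$0.30 reported, and the smoke lands between roughly $0.38 and $0.63.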
@@ -1,21 +1,22 @@
-# Installed by SageMaker at job start, layered on the PyTorch DLC
-#
-#
-#
+# Installed by SageMaker at job start, layered on the PyTorch DLC.
+#
+# DEPENDENCY CHAIN (learned from two live runs 2026-06-09):
+# ComposerReplicationTrainer needs trl 1.5.x
+# → trl 1.5.1 requires transformers>=4.56.2
+# → transformers 4.56.2 references torch.float8_e8m0fnu at import
+# → that dtype only exists in torch>=2.7
+# So the base image MUST be the torch-2.7 DLC (2.7.1-gpu-py312-cu128-...-v1.26),
+# NOT the torch-2.6 one. Pinning transformers DOWN is impossible (trl floor).
+#
+# vLLM is intentionally NOT here: vllm==0.8.5 hard-pins torch==2.6.0, which
+# fights the torch-2.7 base. vLLM-colocate is a rollout SPEED optimization, not
+# what this smoke proves (trainer runs on GPU + reward fires). The smoke uses
+# model.generate rollout (--no-vllm). For colocated vLLM later, use a
+# torch-2.7-matched vllm (>=0.9) on the same DLC — a clean follow-up.
 trl>=1.5,<2
-# transformers MUST stay below 4.53: 4.53+ references torch.float8_e8m0fnu at
-# import (MXFP4 quant), a dtype only added in torch 2.7 — but the DLC ships
-# torch 2.6 (and vllm 0.8.5 pins torch 2.6), so a newer transformers crashes
-# AutoModel.from_pretrained with AttributeError. Floor 4.51.1 = vllm 0.8.5's
-# requirement. (Caught by a live run 2026-06-09; the bare ">=4.46" silently
-# pulled 4.53+.)
-transformers>=4.51.1,<4.53
 peft>=0.13
 accelerate>=1.0
 datasets>=3.0
-# vLLM must match the DLC's torch 2.6 / cu126. 0.8.x is the torch-2.6 line;
-# pinning avoids a silent wheel/CUDA mismatch at colocate time (F3 §7).
-vllm==0.8.5
-fsspec>=2024.6
 s3fs>=2024.6
+fsspec>=2024.6
 hf_transfer>=0.1.6
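
Why the removed `transformers>=4.51.1,<4.53` pin could never resolve alongside trl 1.5.1 comes down to a floor sitting above a ceiling. The sketch below illustrates that with plain tuple comparison. It is a simplification, not how pip's resolver works, and `parse` and `range_is_satisfiable` are hypothetical helpers that handle only dotted numeric versions.

```python
def parse(version: str) -> tuple:
    """'4.56.2' -> (4, 56, 2). Handles dotted numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def range_is_satisfiable(floors, ceiling_exclusive) -> bool:
    """True if some version can be >= every floor and < the exclusive ceiling."""
    effective_floor = max(parse(f) for f in floors)
    return effective_floor < parse(ceiling_exclusive)


OLD_PIN_FLOOR = "4.51.1"     # from the removed transformers pin
OLD_PIN_CEILING = "4.53"     # from the removed transformers pin
TRL_FLOOR = "4.56.2"         # transformers floor that trl 1.5.1 imposes

# The old pin on its own leaves room for a version.
print(range_is_satisfiable([OLD_PIN_FLOOR], OLD_PIN_CEILING))

# Adding trl's floor pushes the minimum above the ceiling, so nothing fits.
print(range_is_satisfiable([OLD_PIN_FLOOR, TRL_FLOOR], OLD_PIN_CEILING))
```

The first call prints `True` and the second `False`, which matches the commit's conclusion that pinning transformers down was impossible.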
@@ -36,9 +36,13 @@ ACCOUNT = "386931836011"
 ROLE = f"arn:aws:iam::{ACCOUNT}:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247"
 BUCKET = f"amazon-sagemaker-{ACCOUNT}-{REGION}-7597bf4d9a3d"
 # DLC tag resolved live against the 763104351884 us-west-2 registry.
+# MUST be the torch-2.7 DLC: ComposerReplicationTrainer → trl 1.5.x →
+# transformers>=4.56.2 → torch.float8_e8m0fnu (torch>=2.7). The torch-2.6 DLC
+# (cu126) fails AutoModel.from_pretrained with AttributeError on that dtype.
+# (Learned from two live runs 2026-06-09; see requirements.txt + the quickstart.)
 DLC_IMAGE = (
     "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
-    "pytorch-training:2.
+    "pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26"
 )
 
 _HERE = os.path.dirname(os.path.abspath(__file__))
@@ -69,8 +73,11 @@ def main() -> int:
     ap.add_argument("--max-steps", type=int, default=20)
     ap.add_argument("--n-train-rows", type=int, default=100)
     ap.add_argument("--model", default="Qwen/Qwen2.5-0.5B-Instruct")
-    ap.add_argument("--
-                    help="
+    ap.add_argument("--vllm", action="store_true",
+                    help="enable colocated vLLM rollout. OFF by default: the default "
+                         "requirements.txt omits vllm (it hard-pins torch==2.6, fighting "
+                         "the torch-2.7 DLC trl 1.5 needs). Only pass this with a baked "
+                         "image (--image) carrying a torch-2.7-matched vllm>=0.9.")
     ap.add_argument("--spot", action="store_true", help="use managed spot (quota=1 too)")
     ap.add_argument("--no-wait", action="store_true", help="submit and return; poll later")
     ap.add_argument("--max-run", type=int, default=3600)
@@ -84,7 +91,7 @@ def main() -> int:
     print(f"[launch] region={REGION} image={image}")
     print(f"[launch] role={ROLE}")
     print(f"[launch] instance={args.instance_type} max_steps={args.max_steps} "
-          f"vllm={
+          f"vllm={args.vllm} spot={args.spot}")
 
     staging = _stage_source()
     print(f"[launch] staged source_dir at {staging}")
@@ -93,7 +100,7 @@ def main() -> int:
         "model": args.model,
         "n_train_rows": args.n_train_rows,
         "max_steps": args.max_steps,
-        "use_vllm": "
+        "use_vllm": "true" if args.vllm else "false",
     }
 
     spot_kwargs = {}
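
The flag change in this file (`--no-vllm` replaced by an opt-in `--vllm`) can be exercised in isolation. This is a trimmed-down sketch rather than the launcher itself: `build_parser` and `to_hyperparameters` are hypothetical names, and only the `--vllm` argument and the `use_vllm` string mapping are taken from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the opt-in vLLM flag; store_true makes it default to False."""
    ap = argparse.ArgumentParser(description="sketch of the launcher's vLLM flag")
    ap.add_argument("--vllm", action="store_true",
                    help="enable colocated vLLM rollout (OFF by default)")
    return ap


def to_hyperparameters(args: argparse.Namespace) -> dict:
    """Encode the boolean as the lowercase string the diff passes as use_vllm."""
    return {"use_vllm": "true" if args.vllm else "false"}


default_hp = to_hyperparameters(build_parser().parse_args([]))
opt_in_hp = to_hyperparameters(build_parser().parse_args(["--vllm"]))

print(default_hp)
print(opt_in_hp)
```

With an opt-in flag, a bare invocation stays on the `model.generate` path that the stock DLC supports, and vLLM is enabled only when the caller asks for it explicitly.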