Wave 20: fix SageMaker smoke — torch-2.7 DLC + drop vllm pin (the real conflict)

The second live run exposed the underlying conflict: trl 1.5.1 requires
transformers>=4.56.2, which references torch.float8_e8m0fnu (torch>=2.7).
Pinning transformers down (the previous commit) was therefore unsatisfiable,
and the torch-2.6 DLC was the wrong base. Fixes:
- launcher + Dockerfile: use the torch-2.7 DLC
2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26 (pullable, verified).
- requirements.txt: drop vllm — vllm==0.8.5 hard-pins torch==2.6.0 and fights
the 2.7 base. vLLM-colocate is a rollout speed optimization, not what the
smoke proves; the smoke uses model.generate rollout. Resolver dry-run now
clean (trl pulls transformers>=4.56.2, satisfiable on torch 2.7).
- launcher: --no-vllm → --vllm (opt-in, OFF by default); vLLM needs a baked
image with a torch-2.7-matched vllm>=0.9.
- quickstart: document the full trl→transformers→torch chain + the vLLM note.

Two failed runs cost ~$0.30 total (365 + 316 billable sec).
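
As a rough sanity check on that figure, the arithmetic can be sketched as below. It assumes per-second billing and the on-demand ml.g5.2xlarge rate of about $1.52/hr quoted in the quickstart; the constant and function names are illustrative only and do not exist in the repo.

```python
# Rough cost check for the two failed runs.
HOURLY_RATE_USD = 1.52          # assumption: on-demand ml.g5.2xlarge rate from the quickstart
BILLABLE_SECONDS = [365, 316]   # billable seconds of the two failed jobs


def job_cost_usd(billable_seconds: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Assuming per-second billing: cost = seconds / 3600 * hourly rate."""
    return billable_seconds / 3600 * hourly_rate


total = sum(job_cost_usd(s) for s in BILLABLE_SECONDS)
print(f"total billable: {sum(BILLABLE_SECONDS)} s, cost ~ ${total:.2f}")
```

Under those assumptions, 681 billable seconds come to roughly $0.29, in line with the ~$0.30 stated above.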

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (4)

docker/Dockerfile.sagemaker +9 -7
docs/AWS_SAGEMAKER_QUICKSTART.md +24 -13
examples/gsm8k_grpo/requirements.txt +16 -15
examples/gsm8k_grpo/run_sagemaker_launch.py +12 -5

docker/Dockerfile.sagemaker CHANGED

@@ -5,16 +5,18 @@
 # surface). The one-shot smoke can instead use the stock DLC + source_dir
 # (run_sagemaker_launch.py --image dlc), which needs no local build.
 #
-# Base: AWS PyTorch DLC, tag RESOLVED LIVE against the us-west-2 registry
-# 2026-06-09 — it is cu126 (NOT the cu124 some docs list) and the -v1.25 build
-# suffix is required (no bare floating 2.6.0-gpu tag exists).
-FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25
-# RL stack baked in. torch 2.6 + CUDA 12.6 already in the DLC — do NOT reinstall
-# torch. vllm 0.8.5 is the torch-2.6 line (pin to avoid a wheel/CUDA mismatch).
 RUN pip install --no-cache-dir \
       "trl>=1.5,<2" "peft>=0.13" "accelerate>=1.0" "datasets>=3.0" \
-      "vllm==0.8.5" "fsspec>=2024.6" "s3fs>=2024.6" "hf_transfer>=0.1.6"
 # The framework itself (train + serverless extras → trainer, loss, executors,
 # replica_entrypoint, s3fs all present).

 # surface). The one-shot smoke can instead use the stock DLC + source_dir
 # (run_sagemaker_launch.py --image dlc), which needs no local build.
 #
+# Base: AWS PyTorch DLC, tag RESOLVED LIVE against the us-west-2 registry.
+# MUST be torch-2.7: trl 1.5 → transformers>=4.56.2 → torch.float8_e8m0fnu
+# (torch>=2.7). The torch-2.6 DLC fails AutoModel.from_pretrained on that dtype.
+# cu128, -v1.26 build suffix required (no bare floating tag exists).
+FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26
+# RL stack baked in. torch 2.7 + CUDA 12.8 already in the DLC — do NOT reinstall
+# torch. vllm>=0.9 is the torch-2.7 line (0.8.x hard-pins torch 2.6 and would
+# fight this base); pin to a 2.7-matched vllm to avoid a wheel/CUDA mismatch.
 RUN pip install --no-cache-dir \
       "trl>=1.5,<2" "peft>=0.13" "accelerate>=1.0" "datasets>=3.0" \
+      "vllm>=0.9" "fsspec>=2024.6" "s3fs>=2024.6" "hf_transfer>=0.1.6"
 # The framework itself (train + serverless extras → trainer, loss, executors,
 # replica_entrypoint, s3fs all present).

docs/AWS_SAGEMAKER_QUICKSTART.md CHANGED

@@ -10,18 +10,29 @@ GPU, end-to-end, for **under $1**. Implements F3 (`research/design-F3-rl-sagemak
 | `ml.g5.2xlarge` training-job quota | **1** (live, code `L-2D6DEB3C`) → no quota ticket needed |
 | Execution role | `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` |
 | Bucket (rendezvous + output) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` |
-| PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25` |
-> **Two corrections vs the AWS DLC docs page** (found by querying the live ECR
-> registry, not the docs): the tag is **cu126**, not cu124, and the **`-v1.25`
-> build suffix is required** — there is no bare floating `2.6.0-gpu...` tag.
-> Always resolve the tag against the registry:
 > ```bash
 > aws ecr describe-images --registry-id 763104351884 \
 >   --repository-name pytorch-training --region us-west-2 \
 >   --query "reverse(sort_by(imageDetails,&imagePushedAt))[].imageTags" --output text \
->   | tr '\t' '\n' | grep -E '^2.6.0-gpu-py312-cu126-.*-sagemaker-v[0-9.]+$' | head -1
 > ```
 > **SDK pin:** the smoke launcher uses the **sagemaker SDK v2** Estimator API.
 > SDK **v3 is an API rewrite** that dropped `sagemaker.estimator.Estimator` and
@@ -37,14 +48,14 @@ python examples/gsm8k_grpo/run_sagemaker_launch.py --max-steps 20
 This uses the **stock PyTorch DLC directly** as the training image and ships the
 framework + entry script via `source_dir`; `examples/gsm8k_grpo/requirements.txt`
-(trl + vllm==0.8.5 + the RL stack) installs at job start. No 15 GB local image
-build, no ECR push. The script trains `Qwen/Qwen2.5-0.5B-Instruct` with GRPO +
-the GSM8K `#### NUMBER` RLVR reward, vLLM **colocated** in-process
-(`vllm_gpu_memory_utilization=0.3` on the 24 GB A10G).
 Flags: `--no-wait` (submit + poll later), `--spot` (managed spot, quota=1 too),
-`--no-vllm` (fall back to `model.generate` rollout if a vLLM/CUDA wheel mismatch
-appears), `--image <ecr-uri>` (use a prebuilt baked image instead of the DLC).
 **Cost:** `ml.g5.2xlarge` ≈ $1.52/hr on-demand; a 20-step 0.5B smoke is
 ~15–25 min ⇒ **well under $1**. Spot ≈ $0.45–0.60/hr ⇒ pennies.

 | `ml.g5.2xlarge` training-job quota | **1** (live, code `L-2D6DEB3C`) → no quota ticket needed |
 | Execution role | `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` |
 | Bucket (rendezvous + output) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` |
+| PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26` |
+> **The base image MUST be the torch-2.7 DLC** — learned from two live runs
+> (2026-06-09). The dependency chain forces it:
+> `ComposerReplicationTrainer → trl 1.5.x → transformers>=4.56.2 →
+> torch.float8_e8m0fnu (MXFP4, torch>=2.7)`. The torch-**2.6** DLC fails
+> `AutoModel.from_pretrained` with `AttributeError: module 'torch' has no
+> attribute 'float8_e8m0fnu'`, and pinning transformers *down* is impossible
+> (trl 1.5's floor is 4.56.2). Resolve the tag against the live registry — the
+> AWS docs page lists wrong/stale tags (it showed a cu124 2.6 tag that doesn't
+> exist; real tags are cu126 for 2.6, cu128 for 2.7, each with a mandatory
+> `-vX.Y` build suffix — no bare floating tag):
 > ```bash
 > aws ecr describe-images --registry-id 763104351884 \
 >   --repository-name pytorch-training --region us-west-2 \
 >   --query "reverse(sort_by(imageDetails,&imagePushedAt))[].imageTags" --output text \
+>   | tr '\t' '\n' | grep -E '^2.7.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$' | head -1
 > ```
+>
+> **vLLM is OFF by default** in the smoke: `vllm==0.8.5` hard-pins `torch==2.6.0`,
+> which fights the torch-2.7 base. The smoke uses `model.generate` rollout (what
+> it proves is trainer-on-GPU + reward, not rollout speed). For colocated vLLM,
+> bake a torch-2.7-matched `vllm>=0.9` into an image and pass `--image <ecr> --vllm`.
 > **SDK pin:** the smoke launcher uses the **sagemaker SDK v2** Estimator API.
 > SDK **v3 is an API rewrite** that dropped `sagemaker.estimator.Estimator` and
 This uses the **stock PyTorch DLC directly** as the training image and ships the
 framework + entry script via `source_dir`; `examples/gsm8k_grpo/requirements.txt`
+(trl + the RL stack) installs at job start. No 15 GB local image build, no ECR
+push. The script trains `Qwen/Qwen2.5-0.5B-Instruct` with GRPO + the GSM8K
+`#### NUMBER` RLVR reward, using `model.generate` rollout (vLLM off by default —
+see the torch-pin note above).
 Flags: `--no-wait` (submit + poll later), `--spot` (managed spot, quota=1 too),
+`--vllm` (enable colocated vLLM — only with a baked `--image` carrying a
+torch-2.7 vllm), `--image <ecr-uri>` (use a prebuilt baked image instead of the DLC).
 **Cost:** `ml.g5.2xlarge` ≈ $1.52/hr on-demand; a 20-step 0.5B smoke is
 ~15–25 min ⇒ **well under $1**. Spot ≈ $0.45–0.60/hr ⇒ pennies.
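
The registry filter in the quickstart's `aws ecr describe-images` command can also be applied in Python once the tag list has been fetched. The following is a minimal sketch, not part of the repo: `pick_dlc_tag` and `sample_tags` are hypothetical names, the CPU tag in the sample is invented for illustration, and the pattern escapes the dots that the `grep -E` version leaves unescaped.

```python
import re

# Same shape as the grep -E filter in the quickstart, with the dots escaped
# so that "2.7." cannot match an arbitrary character.
TAG_PATTERN = re.compile(r"^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$")


def pick_dlc_tag(tags_newest_first):
    """Return the first tag matching the torch-2.7 / cu128 SageMaker pattern, or None."""
    for tag in tags_newest_first:
        if TAG_PATTERN.match(tag):
            return tag
    return None


# Hypothetical input, ordered newest first like the sorted ECR query output.
sample_tags = [
    "2.7.1-cpu-py312-ubuntu22.04-sagemaker-v1.26",        # invented CPU tag
    "2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26",  # tag used in this commit
    "2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25",  # previous base
]
print(pick_dlc_tag(sample_tags))
```

Returning `None` when nothing matches lets a caller fail loudly instead of silently falling back to a stale tag.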

examples/gsm8k_grpo/requirements.txt CHANGED

@@ -1,21 +1,22 @@
-# Installed by SageMaker at job start, layered on the PyTorch DLC 2.6.0
-# (torch 2.6 + CUDA 12.6 already present — do NOT reinstall torch).
-# F3 §3.2: baking these into a Dockerfile.sagemaker is the repeatable path;
-# this requirements.txt is the no-local-build path for the one-shot smoke.
 trl>=1.5,<2
-# transformers MUST stay below 4.53: 4.53+ references torch.float8_e8m0fnu at
-# import (MXFP4 quant), a dtype only added in torch 2.7 — but the DLC ships
-# torch 2.6 (and vllm 0.8.5 pins torch 2.6), so a newer transformers crashes
-# AutoModel.from_pretrained with AttributeError. Floor 4.51.1 = vllm 0.8.5's
-# requirement. (Caught by a live run 2026-06-09; the bare ">=4.46" silently
-# pulled 4.53+.)
-transformers>=4.51.1,<4.53
 peft>=0.13
 accelerate>=1.0
 datasets>=3.0
-# vLLM must match the DLC's torch 2.6 / cu126. 0.8.x is the torch-2.6 line;
-# pinning avoids a silent wheel/CUDA mismatch at colocate time (F3 §7).
-vllm==0.8.5
-fsspec>=2024.6
 s3fs>=2024.6
 hf_transfer>=0.1.6

+# Installed by SageMaker at job start, layered on the PyTorch DLC.
+#
+# DEPENDENCY CHAIN (learned from two live runs 2026-06-09):
+#   ComposerReplicationTrainer needs trl 1.5.x
+#     → trl 1.5.1 requires transformers>=4.56.2
+#       → transformers 4.56.2 references torch.float8_e8m0fnu at import
+#         → that dtype only exists in torch>=2.7
+# So the base image MUST be the torch-2.7 DLC (2.7.1-gpu-py312-cu128-...-v1.26),
+# NOT the torch-2.6 one. Pinning transformers DOWN is impossible (trl floor).
+#
+# vLLM is intentionally NOT here: vllm==0.8.5 hard-pins torch==2.6.0, which
+# fights the torch-2.7 base. vLLM-colocate is a rollout SPEED optimization, not
+# what this smoke proves (trainer runs on GPU + reward fires). The smoke uses
+# model.generate rollout (the default; --vllm is opt-in). For colocated vLLM later, use a
+# torch-2.7-matched vllm (>=0.9) on the same DLC — a clean follow-up.
 trl>=1.5,<2
 peft>=0.13
 accelerate>=1.0
 datasets>=3.0
 s3fs>=2024.6
+fsspec>=2024.6
 hf_transfer>=0.1.6
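
The dependency chain described in the requirements.txt comments could be turned into a fail-fast preflight so a wrong base image surfaces before model loading. This is a hedged sketch rather than code from the repo: `torch_has_e8m0` and `check_base_image` are hypothetical helpers, and the only fact they encode is the one stated in the comments, that `torch.float8_e8m0fnu` exists only in torch>=2.7.

```python
import re


def torch_has_e8m0(version: str) -> bool:
    """True if a torch version string is >= 2.7, where float8_e8m0fnu exists."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized torch version: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 7)


def check_base_image(torch_version: str) -> None:
    """Raise a readable error early instead of a late AttributeError."""
    if not torch_has_e8m0(torch_version):
        raise RuntimeError(
            f"torch {torch_version} lacks torch.float8_e8m0fnu; "
            "transformers>=4.56.2 (required by trl 1.5.1) needs torch>=2.7"
        )


check_base_image("2.7.1+cu128")  # passes silently on the torch-2.7 DLC
```

Inside a training job one might call `check_base_image(torch.__version__)` at the top of the entry script, or equivalently test `hasattr(torch, "float8_e8m0fnu")`.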

examples/gsm8k_grpo/run_sagemaker_launch.py CHANGED

@@ -36,9 +36,13 @@ ACCOUNT = "386931836011"
 ROLE = f"arn:aws:iam::{ACCOUNT}:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247"
 BUCKET = f"amazon-sagemaker-{ACCOUNT}-{REGION}-7597bf4d9a3d"
 # DLC tag resolved live against the 763104351884 us-west-2 registry.
 DLC_IMAGE = (
     "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
-    "pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25"
 )
 _HERE = os.path.dirname(os.path.abspath(__file__))
@@ -69,8 +73,11 @@ def main() -> int:
     ap.add_argument("--max-steps", type=int, default=20)
     ap.add_argument("--n-train-rows", type=int, default=100)
     ap.add_argument("--model", default="Qwen/Qwen2.5-0.5B-Instruct")
-    ap.add_argument("--no-vllm", action="store_true",
-                    help="disable colocated vLLM (use model.generate rollout) — safer fallback")
     ap.add_argument("--spot", action="store_true", help="use managed spot (quota=1 too)")
     ap.add_argument("--no-wait", action="store_true", help="submit and return; poll later")
     ap.add_argument("--max-run", type=int, default=3600)
@@ -84,7 +91,7 @@ def main() -> int:
     print(f"[launch] region={REGION} image={image}")
     print(f"[launch] role={ROLE}")
     print(f"[launch] instance={args.instance_type} max_steps={args.max_steps} "
-          f"vllm={not args.no_vllm} spot={args.spot}")
     staging = _stage_source()
     print(f"[launch] staged source_dir at {staging}")
@@ -93,7 +100,7 @@ def main() -> int:
         "model": args.model,
         "n_train_rows": args.n_train_rows,
         "max_steps": args.max_steps,
-        "use_vllm": "false" if args.no_vllm else "true",
     }
     spot_kwargs = {}

 ROLE = f"arn:aws:iam::{ACCOUNT}:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247"
 BUCKET = f"amazon-sagemaker-{ACCOUNT}-{REGION}-7597bf4d9a3d"
 # DLC tag resolved live against the 763104351884 us-west-2 registry.
+# MUST be the torch-2.7 DLC: ComposerReplicationTrainer → trl 1.5.x →
+# transformers>=4.56.2 → torch.float8_e8m0fnu (torch>=2.7). The torch-2.6 DLC
+# (cu126) fails AutoModel.from_pretrained with AttributeError on that dtype.
+# (Learned from two live runs 2026-06-09; see requirements.txt + the quickstart.)
 DLC_IMAGE = (
     "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
+    "pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26"
 )
 _HERE = os.path.dirname(os.path.abspath(__file__))
     ap.add_argument("--max-steps", type=int, default=20)
     ap.add_argument("--n-train-rows", type=int, default=100)
     ap.add_argument("--model", default="Qwen/Qwen2.5-0.5B-Instruct")
+    ap.add_argument("--vllm", action="store_true",
+                    help="enable colocated vLLM rollout. OFF by default: the default "
+                         "requirements.txt omits vllm (it hard-pins torch==2.6, fighting "
+                         "the torch-2.7 DLC trl 1.5 needs). Only pass this with a baked "
+                         "image (--image) carrying a torch-2.7-matched vllm>=0.9.")
     ap.add_argument("--spot", action="store_true", help="use managed spot (quota=1 too)")
     ap.add_argument("--no-wait", action="store_true", help="submit and return; poll later")
     ap.add_argument("--max-run", type=int, default=3600)
     print(f"[launch] region={REGION} image={image}")
     print(f"[launch] role={ROLE}")
     print(f"[launch] instance={args.instance_type} max_steps={args.max_steps} "
+          f"vllm={args.vllm} spot={args.spot}")
     staging = _stage_source()
     print(f"[launch] staged source_dir at {staging}")
         "model": args.model,
         "n_train_rows": args.n_train_rows,
         "max_steps": args.max_steps,
+        "use_vllm": "true" if args.vllm else "false",
     }
     spot_kwargs = {}
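
The flag inversion in this file (opt-out `--no-vllm` replaced by opt-in `--vllm`) can be exercised in isolation. The sketch below reproduces only the argparse flag and the string mapping shown in the diff; `build_parser` and `vllm_hyperparameter` are hypothetical wrappers added for illustration, not functions in the launcher.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser()
    # store_true defaults to False, so vLLM stays off unless explicitly requested.
    ap.add_argument("--vllm", action="store_true", help="enable colocated vLLM rollout")
    return ap


def vllm_hyperparameter(argv):
    """Map the CLI flag to the string value passed as the use_vllm hyperparameter."""
    args = build_parser().parse_args(argv)
    return "true" if args.vllm else "false"


print(vllm_hyperparameter([]))          # flag absent
print(vllm_hyperparameter(["--vllm"]))  # flag present
```

With an opt-in flag, the default invocation matches the default requirements.txt, which no longer installs vllm.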