Baladithya Balamurugan Claude Opus 4.8 (1M context) committed on
Commit a578ad9 · 1 Parent(s): 498dbd7

Wave 20: fix SageMaker smoke — torch-2.7 DLC + drop vllm pin (the real conflict)

Second live run peeled the fundamental conflict: trl 1.5.1 requires
transformers>=4.56.2, which references torch.float8_e8m0fnu (torch>=2.7).
So pinning transformers DOWN (last commit) was impossible, and the torch-2.6
DLC was the wrong base. Fixes:
- launcher + Dockerfile: use the torch-2.7 DLC
2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26 (pullable, verified).
- requirements.txt: drop vllm — vllm==0.8.5 hard-pins torch==2.6.0 and fights
the 2.7 base. vLLM-colocate is a rollout speed optimization, not what the
smoke proves; the smoke uses model.generate rollout. Resolver dry-run now
clean (trl pulls transformers>=4.56.2, satisfiable on torch 2.7).
- launcher: --no-vllm → --vllm (opt-in, OFF by default); vLLM needs a baked
image with a torch-2.7-matched vllm>=0.9.
- quickstart: document the full trl→transformers→torch chain + the vLLM note.

Two failed runs cost ~$0.30 total (365 + 316 billable sec).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
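
Reviewer note: the quoted cost can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes the ≈ $1.52/hr on-demand `ml.g5.2xlarge` rate quoted in the quickstart and linear per-second billing; the function name is illustrative and not part of the repo.

```python
# Rough cost check for the two failed runs.
# Assumption: billing is linear in billable seconds at the on-demand rate.
ON_DEMAND_USD_PER_HOUR = 1.52  # ml.g5.2xlarge rate quoted in the quickstart


def training_cost_usd(billable_seconds: int,
                      usd_per_hour: float = ON_DEMAND_USD_PER_HOUR) -> float:
    """Convert billable seconds into an approximate dollar cost."""
    return billable_seconds / 3600 * usd_per_hour


# 365 + 316 billable seconds, as reported in the commit message.
failed_runs_cost = training_cost_usd(365 + 316)
print(f"two failed runs: ${failed_runs_cost:.2f}")
```

Under those assumptions the result lands at about $0.29, which is consistent with the "~$0.30 total" in the message.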

docker/Dockerfile.sagemaker CHANGED
@@ -5,16 +5,18 @@
  # surface). The one-shot smoke can instead use the stock DLC + source_dir
  # (run_sagemaker_launch.py --image dlc), which needs no local build.
  #
- # Base: AWS PyTorch DLC, tag RESOLVED LIVE against the us-west-2 registry
- # 2026-06-09: it is cu126 (NOT the cu124 some docs list) and the -v1.25 build
- # suffix is required (no bare floating 2.6.0-gpu tag exists).
- FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25
+ # Base: AWS PyTorch DLC, tag RESOLVED LIVE against the us-west-2 registry.
+ # MUST be torch-2.7: trl 1.5 → transformers>=4.56.2 → torch.float8_e8m0fnu
+ # (torch>=2.7). The torch-2.6 DLC fails AutoModel.from_pretrained on that dtype.
+ # cu128, -v1.26 build suffix required (no bare floating tag exists).
+ FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26

- # RL stack baked in. torch 2.6 + CUDA 12.6 already in the DLC — do NOT reinstall
- # torch. vllm 0.8.5 is the torch-2.6 line (pin to avoid a wheel/CUDA mismatch).
+ # RL stack baked in. torch 2.7 + CUDA 12.8 already in the DLC — do NOT reinstall
+ # torch. vllm>=0.9 is the torch-2.7 line (0.8.x hard-pins torch 2.6 and would
+ # fight this base); pin to a 2.7-matched vllm to avoid a wheel/CUDA mismatch.
  RUN pip install --no-cache-dir \
      "trl>=1.5,<2" "peft>=0.13" "accelerate>=1.0" "datasets>=3.0" \
-     "vllm==0.8.5" "fsspec>=2024.6" "s3fs>=2024.6" "hf_transfer>=0.1.6"
+     "vllm>=0.9" "fsspec>=2024.6" "s3fs>=2024.6" "hf_transfer>=0.1.6"

  # The framework itself (train + serverless extras → trainer, loss, executors,
  # replica_entrypoint, s3fs all present).
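
Reviewer note: the failure described in the Dockerfile comment (a torch build without `float8_e8m0fnu`) could be caught before any model load with a small probe. This is a minimal sketch, not code from the repo; `has_e8m0_dtype` and `check_base_image` are hypothetical names, and the two `SimpleNamespace` objects are stand-ins so the logic can be exercised without torch installed.

```python
import types


def has_e8m0_dtype(torch_module) -> bool:
    """True if the given torch module exposes float8_e8m0fnu.

    Per the commit, that dtype only exists in torch>=2.7.
    """
    return hasattr(torch_module, "float8_e8m0fnu")


def check_base_image() -> str:
    """Probe the installed torch, if any, and describe the result."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if has_e8m0_dtype(torch):
        return f"ok: torch {torch.__version__} has float8_e8m0fnu"
    return f"too old: torch {torch.__version__} lacks float8_e8m0fnu"


# Stand-ins for a torch-2.6 and a torch-2.7 module (illustrative only).
fake_torch_26 = types.SimpleNamespace(__version__="2.6.0")
fake_torch_27 = types.SimpleNamespace(__version__="2.7.1", float8_e8m0fnu=object())
```

Calling `check_base_image()` at the top of an entry script would turn the late `AttributeError` into an early, readable message; whether that is worth adding is a judgment call for the author.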
docs/AWS_SAGEMAKER_QUICKSTART.md CHANGED
@@ -10,18 +10,29 @@ GPU, end-to-end, for **under $1**. Implements F3 (`research/design-F3-rl-sagemak
  | `ml.g5.2xlarge` training-job quota | **1** (live, code `L-2D6DEB3C`) → no quota ticket needed |
  | Execution role | `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` |
  | Bucket (rendezvous + output) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` |
- | PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25` |
-
- > **Two corrections vs the AWS DLC docs page** (found by querying the live ECR
- > registry, not the docs): the tag is **cu126**, not cu124, and the **`-v1.25`
- > build suffix is required**; there is no bare floating `2.6.0-gpu...` tag.
- > Always resolve the tag against the registry:
+ | PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26` |
+
+ > **The base image MUST be the torch-2.7 DLC**, learned from two live runs
+ > (2026-06-09). The dependency chain forces it:
+ > `ComposerReplicationTrainer → trl 1.5.x → transformers>=4.56.2 →
+ > torch.float8_e8m0fnu (MXFP4, torch>=2.7)`. The torch-**2.6** DLC fails
+ > `AutoModel.from_pretrained` with `AttributeError: module 'torch' has no
+ > attribute 'float8_e8m0fnu'`, and pinning transformers *down* is impossible
+ > (trl 1.5's floor is 4.56.2). Resolve the tag against the live registry — the
+ > AWS docs page lists wrong/stale tags (it showed a cu124 2.6 tag that doesn't
+ > exist; real tags are cu126 for 2.6, cu128 for 2.7, each with a mandatory
+ > `-vX.Y` build suffix — no bare floating tag):
  > ```bash
  > aws ecr describe-images --registry-id 763104351884 \
  > --repository-name pytorch-training --region us-west-2 \
  > --query "reverse(sort_by(imageDetails,&imagePushedAt))[].imageTags" --output text \
- > | tr '\t' '\n' | grep -E '^2.6.0-gpu-py312-cu126-.*-sagemaker-v[0-9.]+$' | head -1
+ > | tr '\t' '\n' | grep -E '^2.7.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$' | head -1
  > ```
+ >
+ > **vLLM is OFF by default** in the smoke: `vllm==0.8.5` hard-pins `torch==2.6.0`,
+ > which fights the torch-2.7 base. The smoke uses `model.generate` rollout (what
+ > it proves is trainer-on-GPU + reward, not rollout speed). For colocated vLLM,
+ > bake a torch-2.7-matched `vllm>=0.9` into an image and pass `--image <ecr> --vllm`.

  > **SDK pin:** the smoke launcher uses the **sagemaker SDK v2** Estimator API.
  > SDK **v3 is an API rewrite** that dropped `sagemaker.estimator.Estimator` and
@@ -37,14 +48,14 @@ python examples/gsm8k_grpo/run_sagemaker_launch.py --max-steps 20

  This uses the **stock PyTorch DLC directly** as the training image and ships the
  framework + entry script via `source_dir`; `examples/gsm8k_grpo/requirements.txt`
- (trl + vllm==0.8.5 + the RL stack) installs at job start. No 15 GB local image
- build, no ECR push. The script trains `Qwen/Qwen2.5-0.5B-Instruct` with GRPO +
- the GSM8K `#### NUMBER` RLVR reward, vLLM **colocated** in-process
- (`vllm_gpu_memory_utilization=0.3` on the 24 GB A10G).
+ (trl + the RL stack) installs at job start. No 15 GB local image build, no ECR
+ push. The script trains `Qwen/Qwen2.5-0.5B-Instruct` with GRPO + the GSM8K
+ `#### NUMBER` RLVR reward, using `model.generate` rollout (vLLM off by default —
+ see the torch-pin note above).

  Flags: `--no-wait` (submit + poll later), `--spot` (managed spot, quota=1 too),
- `--no-vllm` (fall back to `model.generate` rollout if a vLLM/CUDA wheel mismatch
- appears), `--image <ecr-uri>` (use a prebuilt baked image instead of the DLC).
+ `--vllm` (enable colocated vLLM, only with a baked `--image` carrying a
+ torch-2.7 vllm), `--image <ecr-uri>` (use a prebuilt baked image instead of the DLC).

  **Cost:** `ml.g5.2xlarge` ≈ $1.52/hr on-demand; a 20-step 0.5B smoke is
  ~15–25 min ⇒ **well under $1**. Spot ≈ $0.45–0.60/hr ⇒ pennies.
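
Reviewer note: the `grep -E ... | head -1` step in the quickstart can be mirrored in Python when the tag list is already in hand. This is a sketch only: `pick_dlc_tag` is a hypothetical helper, how the tags are fetched is left out, and the regex is the quickstart's pattern with the version dots escaped. The third example tag is made up to show the filter rejecting a non-GPU image.

```python
import re

# Same shape as the quickstart's grep pattern, with literal dots escaped.
TAG_PATTERN = re.compile(r"^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$")


def pick_dlc_tag(tags_newest_first):
    """Return the first tag matching the torch-2.7 cu128 SageMaker pattern, or None."""
    for tag in tags_newest_first:
        if TAG_PATTERN.match(tag):
            return tag
    return None


# Illustrative input, assumed to be sorted newest first.
example_tags = [
    "2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25",  # torch-2.6 tag from this commit
    "2.7.1-cpu-py312-ubuntu22.04-sagemaker-v1.26",        # hypothetical non-GPU tag
    "2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26",  # torch-2.7 tag from this commit
]
chosen = pick_dlc_tag(example_tags)
print(chosen)
```

Returning `None` instead of raising keeps the "no matching tag" case explicit for the caller, which matters here because there is no bare floating tag to fall back on.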
examples/gsm8k_grpo/requirements.txt CHANGED
@@ -1,21 +1,22 @@
- # Installed by SageMaker at job start, layered on the PyTorch DLC 2.6.0
- # (torch 2.6 + CUDA 12.6 already present — do NOT reinstall torch).
- # F3 §3.2: baking these into a Dockerfile.sagemaker is the repeatable path;
- # this requirements.txt is the no-local-build path for the one-shot smoke.
+ # Installed by SageMaker at job start, layered on the PyTorch DLC.
+ #
+ # DEPENDENCY CHAIN (learned from two live runs 2026-06-09):
+ # ComposerReplicationTrainer needs trl 1.5.x
+ # → trl 1.5.1 requires transformers>=4.56.2
+ # → transformers 4.56.2 references torch.float8_e8m0fnu at import
+ # → that dtype only exists in torch>=2.7
+ # So the base image MUST be the torch-2.7 DLC (2.7.1-gpu-py312-cu128-...-v1.26),
+ # NOT the torch-2.6 one. Pinning transformers DOWN is impossible (trl floor).
+ #
+ # vLLM is intentionally NOT here: vllm==0.8.5 hard-pins torch==2.6.0, which
+ # fights the torch-2.7 base. vLLM-colocate is a rollout SPEED optimization, not
+ # what this smoke proves (trainer runs on GPU + reward fires). The smoke uses
+ # model.generate rollout (--no-vllm). For colocated vLLM later, use a
+ # torch-2.7-matched vllm (>=0.9) on the same DLC — a clean follow-up.
  trl>=1.5,<2
- # transformers MUST stay below 4.53: 4.53+ references torch.float8_e8m0fnu at
- # import (MXFP4 quant), a dtype only added in torch 2.7 — but the DLC ships
- # torch 2.6 (and vllm 0.8.5 pins torch 2.6), so a newer transformers crashes
- # AutoModel.from_pretrained with AttributeError. Floor 4.51.1 = vllm 0.8.5's
- # requirement. (Caught by a live run 2026-06-09; the bare ">=4.46" silently
- # pulled 4.53+.)
- transformers>=4.51.1,<4.53
  peft>=0.13
  accelerate>=1.0
  datasets>=3.0
- # vLLM must match the DLC's torch 2.6 / cu126. 0.8.x is the torch-2.6 line;
- # pinning avoids a silent wheel/CUDA mismatch at colocate time (F3 §7).
- vllm==0.8.5
- fsspec>=2024.6
  s3fs>=2024.6
+ fsspec>=2024.6
  hf_transfer>=0.1.6
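
Reviewer note: the dependency chain in the new header comment reduces to two checks on the base image's torch version. The sketch below encodes them with plain tuple comparison; it is an illustration under stated assumptions (plain dotted release strings, floors copied from the comments above), not a replacement for pip's resolver, and `base_image_problems` is a hypothetical name.

```python
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as '2.7.1' (no pre/post-release tags)."""
    return tuple(int(part) for part in text.split("."))


# Floor taken from the comments in this requirements.txt:
# transformers>=4.56.2 references torch.float8_e8m0fnu, which needs torch>=2.7.
TORCH_FLOOR_FOR_E8M0 = parse_version("2.7")


def base_image_problems(torch_version: str,
                        vllm_torch_pin: Optional[str] = None) -> List[str]:
    """List the conflicts a given base torch version would hit."""
    problems = []
    torch_v = parse_version(torch_version)
    if torch_v < TORCH_FLOOR_FOR_E8M0:
        problems.append(
            f"torch {torch_version} lacks float8_e8m0fnu (needs >=2.7)")
    if vllm_torch_pin is not None and parse_version(vllm_torch_pin) != torch_v:
        problems.append(
            f"vllm pins torch=={vllm_torch_pin}, base has {torch_version}")
    return problems


print(base_image_problems("2.6.0"))                          # torch-2.6 DLC
print(base_image_problems("2.7.1"))                          # torch-2.7 DLC, no vllm
print(base_image_problems("2.7.1", vllm_torch_pin="2.6.0"))  # torch-2.7 DLC + vllm==0.8.5
```

The three calls correspond to the first failed run's base, the configuration this commit lands on, and the reason `vllm==0.8.5` was dropped.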
examples/gsm8k_grpo/run_sagemaker_launch.py CHANGED
@@ -36,9 +36,13 @@ ACCOUNT = "386931836011"
  ROLE = f"arn:aws:iam::{ACCOUNT}:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247"
  BUCKET = f"amazon-sagemaker-{ACCOUNT}-{REGION}-7597bf4d9a3d"
  # DLC tag resolved live against the 763104351884 us-west-2 registry.
+ # MUST be the torch-2.7 DLC: ComposerReplicationTrainer → trl 1.5.x →
+ # transformers>=4.56.2 → torch.float8_e8m0fnu (torch>=2.7). The torch-2.6 DLC
+ # (cu126) fails AutoModel.from_pretrained with AttributeError on that dtype.
+ # (Learned from two live runs 2026-06-09; see requirements.txt + the quickstart.)
  DLC_IMAGE = (
      "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
-     "pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.25"
+     "pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26"
  )

  _HERE = os.path.dirname(os.path.abspath(__file__))
@@ -69,8 +73,11 @@ def main() -> int:
      ap.add_argument("--max-steps", type=int, default=20)
      ap.add_argument("--n-train-rows", type=int, default=100)
      ap.add_argument("--model", default="Qwen/Qwen2.5-0.5B-Instruct")
-     ap.add_argument("--no-vllm", action="store_true",
-                     help="disable colocated vLLM (use model.generate rollout); safer fallback")
+     ap.add_argument("--vllm", action="store_true",
+                     help="enable colocated vLLM rollout. OFF by default: the default "
+                          "requirements.txt omits vllm (it hard-pins torch==2.6, fighting "
+                          "the torch-2.7 DLC trl 1.5 needs). Only pass this with a baked "
+                          "image (--image) carrying a torch-2.7-matched vllm>=0.9.")
      ap.add_argument("--spot", action="store_true", help="use managed spot (quota=1 too)")
      ap.add_argument("--no-wait", action="store_true", help="submit and return; poll later")
      ap.add_argument("--max-run", type=int, default=3600)
@@ -84,7 +91,7 @@ def main() -> int:
      print(f"[launch] region={REGION} image={image}")
      print(f"[launch] role={ROLE}")
      print(f"[launch] instance={args.instance_type} max_steps={args.max_steps} "
-           f"vllm={not args.no_vllm} spot={args.spot}")
+           f"vllm={args.vllm} spot={args.spot}")

      staging = _stage_source()
      print(f"[launch] staged source_dir at {staging}")
@@ -93,7 +100,7 @@ def main() -> int:
          "model": args.model,
          "n_train_rows": args.n_train_rows,
          "max_steps": args.max_steps,
-         "use_vllm": "false" if args.no_vllm else "true",
+         "use_vllm": "true" if args.vllm else "false",
      }

      spot_kwargs = {}
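
Reviewer note: the flag flip from opt-out to opt-in can be exercised in isolation. The sketch below reproduces the `--vllm` to `use_vllm` mapping from the diff and adds one thing the commit does not contain: a hypothetical guard that rejects `--vllm` with the stock DLC. The `--image` default of `"dlc"` is an assumption based on the `--image dlc` usage mentioned in the Dockerfile comment.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser()
    # Opt-in flag: store_true means the default is False (vLLM off).
    ap.add_argument("--vllm", action="store_true",
                    help="enable colocated vLLM rollout (off by default)")
    # Assumed default; 'dlc' stands for the stock DLC image.
    ap.add_argument("--image", default="dlc",
                    help="'dlc' for the stock DLC, or a baked image URI")
    return ap


def hyperparameters(argv):
    """Map parsed flags to the string-valued hyperparameter used in the diff."""
    ap = build_parser()
    args = ap.parse_args(argv)
    # Hypothetical guard, NOT in the commit: the stock DLC's requirements.txt
    # no longer installs vllm, so --vllm only makes sense with a baked image.
    if args.vllm and args.image == "dlc":
        ap.error("--vllm needs a baked --image carrying a torch-2.7-matched vllm")
    return {"use_vllm": "true" if args.vllm else "false"}


print(hyperparameters([]))
print(hyperparameters(["--vllm", "--image", "example-baked-image-uri"]))
```

With the guard, `--vllm` on its own exits with a usage error instead of submitting a job that would fail at import time; the author may prefer a warning, since the help text already documents the constraint.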