AWS SageMaker Quickstart — the runnable-now GRPO smoke

The minimum path to running the Composer-replication RL inner loop on a real GPU, end-to-end, for under $1. Implements F3 (research/design-F3-rl-sagemaker.md).

Live account facts (verified 2026-06-09, acct 386931836011, us-west-2)

| Fact | Value |
|---|---|
| `ml.g5.2xlarge` training-job quota | 1 (live, code `L-2D6DEB3C`) → no quota ticket needed |
| Execution role | `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` |
| Bucket (rendezvous + output) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` |
| PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26` |

The base image MUST be the torch-2.7 DLC, a constraint learned from two live runs (2026-06-09). The dependency chain forces it: ComposerReplicationTrainer → trl 1.5.x → transformers>=4.56.2 → torch.float8_e8m0fnu (MXFP4, torch>=2.7). The torch-2.6 DLC fails AutoModel.from_pretrained with AttributeError: module 'torch' has no attribute 'float8_e8m0fnu', and pinning transformers lower is not possible because trl 1.5's floor is 4.56.2. Resolve the tag against the live registry, because the AWS docs page lists wrong or stale tags: it showed a cu124 2.6 tag that doesn't exist. Real tags are cu126 for 2.6 and cu128 for 2.7, each with a mandatory -vX.Y build suffix; there is no bare floating tag. The following command prints the newest matching tag:
aws ecr describe-images --registry-id 763104351884 \
  --repository-name pytorch-training --region us-west-2 \
  --query "reverse(sort_by(imageDetails,&imagePushedAt))[].imageTags" --output text \
  | tr '\t' '\n' | grep -E '^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$' | head -1
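The same lookup can be sketched in Python. The selection logic is a pure function over the `imageDetails` records that ECR's DescribeImages call returns (each carrying `imageTags` and `imagePushedAt`). The names `pick_latest_tag` and `fetch_image_details` are illustrative and not part of the repo, and the sample records other than the 2.7.1 `v1.26` tag quoted above are made up for the offline demonstration.

```python
import re
from datetime import datetime, timezone

# Dots are escaped so the prefix only matches a literal 2.7.x version.
TAG_PATTERN = re.compile(r"^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$")


def pick_latest_tag(image_details, pattern=TAG_PATTERN):
    """Return the first matching tag on the most recently pushed image, or None."""
    newest_first = sorted(image_details, key=lambda d: d["imagePushedAt"], reverse=True)
    for detail in newest_first:
        # Untagged images have no "imageTags" key, hence .get().
        for tag in detail.get("imageTags", []):
            if pattern.match(tag):
                return tag
    return None


def fetch_image_details(region="us-west-2"):
    """Fetch all image records from the DLC registry (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so pick_latest_tag works without boto3 installed

    paginator = boto3.client("ecr", region_name=region).get_paginator("describe_images")
    pages = paginator.paginate(registryId="763104351884", repositoryName="pytorch-training")
    return [detail for page in pages for detail in page["imageDetails"]]


# Offline sample: only the v1.26 tag comes from this document; the others are invented.
sample = [
    {"imagePushedAt": datetime(2025, 3, 1, tzinfo=timezone.utc),
     "imageTags": ["2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker-v1.10"]},
    {"imagePushedAt": datetime(2025, 5, 1, tzinfo=timezone.utc),
     "imageTags": ["2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.20"]},
    {"imagePushedAt": datetime(2025, 6, 1, tzinfo=timezone.utc),
     "imageTags": ["2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26"]},
    {"imagePushedAt": datetime(2025, 6, 2, tzinfo=timezone.utc)},  # untagged image
]
latest = pick_latest_tag(sample)
```

Against the live registry, `pick_latest_tag(fetch_image_details())` would replace the sample list.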
vLLM is OFF by default in the smoke: vllm==0.8.5 hard-pins torch==2.6.0, which fights the torch-2.7 base. The smoke uses model.generate rollout (what it proves is trainer-on-GPU + reward, not rollout speed). For colocated vLLM, bake a torch-2.7-matched vllm>=0.9 into an image and pass --image <ecr> --vllm.
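The torch constraint described above can be checked at job start, before any model download, so a wrong base image fails in seconds. This is a minimal sketch: `check_torch_has_mxfp4_dtype` is a hypothetical helper, not something the repo is known to ship, and the stand-in objects below exist only so the check can be exercised without a GPU image.

```python
import types


def check_torch_has_mxfp4_dtype(torch_module):
    """Raise early if this torch build lacks the float8_e8m0fnu dtype."""
    if not hasattr(torch_module, "float8_e8m0fnu"):
        version = getattr(torch_module, "__version__", "unknown")
        raise RuntimeError(
            f"torch {version} has no float8_e8m0fnu; use the torch-2.7 DLC base image"
        )
    return True


# Stand-ins for the real module (assumption: only the attribute's presence matters).
fake_torch_26 = types.SimpleNamespace(__version__="2.6.0")
fake_torch_27 = types.SimpleNamespace(__version__="2.7.1", float8_e8m0fnu=object())

ok_on_27 = check_torch_has_mxfp4_dtype(fake_torch_27)
try:
    check_torch_has_mxfp4_dtype(fake_torch_26)
    failed_on_26 = False
except RuntimeError:
    failed_on_26 = True
```

In an entry script the call would be `import torch; check_torch_has_mxfp4_dtype(torch)`.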

SDK pin: the smoke launcher uses the sagemaker SDK v2 Estimator API. SDK v3 is an API rewrite that dropped sagemaker.estimator.Estimator and sagemaker.pytorch, so install with pip install 'sagemaker>=2.200,<3'.
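A launcher could guard against the wrong SDK major version before importing anything from it. The sketch below is an assumption about how such a guard might look, not the launcher's actual code. It compares only the major and minor components and assumes the version string has at least two numeric parts.

```python
from importlib import metadata


def sdk_version_supported(version):
    """True when `version` satisfies sagemaker>=2.200,<3 (major.minor compared numerically)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return major == 2 and minor >= 200


def installed_sdk_supported():
    """Check the installed sagemaker distribution; False when it is not installed."""
    try:
        return sdk_version_supported(metadata.version("sagemaker"))
    except metadata.PackageNotFoundError:
        return False
```

Calling `installed_sdk_supported()` at the top of a launcher gives a clear error path instead of an ImportError deep inside the SDK.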

Run it (no local Docker build)

pip install 'sagemaker>=2.200,<3'
export AWS_REGION=us-west-2
python examples/gsm8k_grpo/run_sagemaker_launch.py --max-steps 20

This uses the stock PyTorch DLC directly as the training image and ships the framework + entry script via source_dir; examples/gsm8k_grpo/requirements.txt (trl + the RL stack) installs at job start. No 15 GB local image build, no ECR push. The script trains Qwen/Qwen2.5-0.5B-Instruct with GRPO + the GSM8K #### NUMBER RLVR reward, using model.generate rollout (vLLM off by default — see the torch-pin note above).
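To make the `#### NUMBER` reward concrete, here is a minimal sketch of a verifiable reward in that style. It is an illustration under assumptions, not the repo's reward function: the names `extract_answer` and `gsm8k_reward`, the binary 0/1 scoring, and the choice to read the last `####` marker are all hypothetical.

```python
import re

# Matches "#### 72", "#### 1,234" or "#### -5.5"; commas are stripped before comparison.
ANSWER_RE = re.compile(r"####\s*(-?[0-9][0-9,]*(?:\.[0-9]+)?)")


def extract_answer(text):
    """Return the last '#### NUMBER' in text as a float, or None if absent."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_reward(completion, reference):
    """1.0 when the completion's final answer equals the reference's, else 0.0."""
    predicted = extract_answer(completion)
    target = extract_answer(reference)
    if predicted is None or target is None:
        return 0.0
    return 1.0 if abs(predicted - target) < 1e-6 else 0.0
```

A completion with no `####` marker scores 0.0, which is what makes the reward checkable without a learned judge.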

Flags: --no-wait (submit now, poll later), --spot (managed spot training; that quota is also 1), --vllm (enable colocated vLLM; only with a baked --image carrying a torch-2.7 vllm), --image <ecr-uri> (use a prebuilt baked image instead of the DLC).
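As a rough picture of how these flags could map onto the SDK v2 Estimator, consider the sketch below. The keyword names (`image_uri`, `source_dir`, `entry_point`, `use_spot_instances`, `max_wait`, `max_run`, `enable_network_isolation`) are real Estimator parameters. Everything else is an assumption made for illustration: the helper names, the `train.py` entry point, the `source_dir` value, the `output` prefix, the `max-steps` hyperparameter key, and the time limits.

```python
def build_estimator_kwargs(image_uri, role, bucket, max_steps=20, spot=False,
                           max_run_seconds=3600):
    """Assemble Estimator keyword arguments for a single-instance smoke job."""
    kwargs = {
        "image_uri": image_uri,                 # stock DLC, or a baked image via --image
        "role": role,
        "instance_count": 1,
        "instance_type": "ml.g5.2xlarge",
        "source_dir": "examples/gsm8k_grpo",    # assumption: directory holding requirements.txt
        "entry_point": "train.py",              # hypothetical entry script name
        "output_path": f"s3://{bucket}/output", # hypothetical output prefix
        "hyperparameters": {"max-steps": max_steps},  # hypothetical key name
        "max_run": max_run_seconds,
        "enable_network_isolation": False,      # container must reach huggingface.co and S3
    }
    if spot:
        kwargs["use_spot_instances"] = True
        # SageMaker requires max_wait >= max_run for managed spot training.
        kwargs["max_wait"] = max_run_seconds * 2
    return kwargs


def launch(kwargs, wait=True):
    """Submit the job; wait=False corresponds to --no-wait."""
    from sagemaker.estimator import Estimator  # lazy import: needs sagemaker>=2.200,<3

    estimator = Estimator(**kwargs)
    estimator.fit(wait=wait)
    return estimator
```

Keeping the argument assembly separate from the submit call makes the mapping easy to inspect or unit-test without AWS credentials.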

Cost: ml.g5.2xlarge ≈ $1.52/hr on-demand; a 20-step 0.5B smoke is ~15–25 min ⇒ well under $1. Spot ≈ $0.45–0.60/hr ⇒ pennies.
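The arithmetic behind these estimates is worked out below. The rates and durations are the approximate figures quoted above, not live pricing.

```python
def job_cost_usd(minutes, hourly_rate_usd):
    """Cost of a job billed linearly by wall-clock time."""
    return minutes / 60.0 * hourly_rate_usd


on_demand_low = job_cost_usd(15, 1.52)   # 15 min at $1.52/hr
on_demand_high = job_cost_usd(25, 1.52)  # 25 min at $1.52/hr
spot_low = job_cost_usd(15, 0.45)        # 15 min at $0.45/hr
spot_high = job_cost_usd(25, 0.60)       # 25 min at $0.60/hr

print(f"on-demand: ${on_demand_low:.2f} to ${on_demand_high:.2f}")  # $0.38 to $0.63
print(f"spot:      ${spot_low:.2f} to ${spot_high:.2f}")            # $0.11 to $0.25
```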

The repeatable path (baked image)

For runs where the ~5–10 min per-job pip-install is unwanted (and for the DiLoCo N-replica SageMakerExecutor, which passes ContainerEntrypoint and needs the framework baked in), build the image once:

scripts/build_and_push_ecr.sh            # creates ECR repo, builds, pushes composer-rl:smoke
python examples/gsm8k_grpo/run_sagemaker_launch.py \
  --image 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke

On an Apple-Silicon host the build targets linux/amd64 (--platform linux/amd64), so the ~15 GB GPU DLC is built under emulation, which is slow; prefer a linux/amd64 host or CodeBuild.

Gotchas (load-bearing)

EnableNetworkIsolation stays False (the default) so the container can reach huggingface.co (model + GSM8K download) and S3.
When --vllm is enabled, vllm_gpu_memory_utilization=0.3 is the load-bearing knob on a 24 GB A10G: too high causes OOM when the policy and gradients also need the GPU; too low leaves a tiny KV cache. If a vLLM wheel/CUDA mismatch surfaces, drop --vllm and fall back to the default model.generate rollout.
Warm pools are off — g5 training warm pool usage quota is 0 in this account, so each job pays ~3–6 min cold-start. Request a warm-pool quota bump for iterative dev, or move the long inner loop to HyperPod (persistent).
"Waiting for capacity" in SecondaryStatus is transient g5 capacity contention in the region, not an error — the job proceeds when capacity frees.

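For jobs submitted with --no-wait, a polling loop only needs to stop on a terminal `TrainingJobStatus` (`Completed`, `Failed`, or `Stopped`); every other state, including capacity waits, is simply retried. The sketch below assumes a `describe` callable shaped like boto3's `describe_training_job`; the helper name `wait_for_job` and the 30-second interval are illustrative choices.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll until TrainingJobStatus is terminal, then return the final description.

    `describe` is any callable shaped like boto3's
    sagemaker_client.describe_training_job(TrainingJobName=...).
    """
    while True:
        desc = describe(TrainingJobName=job_name)
        status = desc["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return desc
        # Non-terminal states are logged and retried, never treated as errors.
        print(f"{job_name}: {status} / {desc.get('SecondaryStatus', '?')}")
        sleep(poll_seconds)


# Offline demonstration with canned responses instead of a live client.
_responses = iter([
    {"TrainingJobStatus": "InProgress", "SecondaryStatus": "Starting"},
    {"TrainingJobStatus": "InProgress", "SecondaryStatus": "Training"},
    {"TrainingJobStatus": "Completed", "SecondaryStatus": "Completed"},
])
final = wait_for_job(lambda TrainingJobName: next(_responses), "demo-job",
                     sleep=lambda seconds: None)
```

With a live client, `describe` would be `boto3.client("sagemaker", region_name="us-west-2").describe_training_job`.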
Next: DiLoCo N-replica (the `SageMakerExecutor` path)

examples/diloco_sagemaker/run.py (driver, F3 §4.3) drives N independent single-instance Training Jobs that share one s3://.../rendezvous/ prefix via ObjectStoreAllReduce, with no cross-job NCCL. N=1 runs today; N=2–4 needs an increase to the "ml.g5.2xlarge for training job usage" quota. The DiLoCo math, loss, trainer, and ObjectStoreAllReduce are unchanged from the smoke: the S3 rendezvous is the entire portability contract (validated against file:// and live s3://; see test_serverless_local.py::test_s3_rendezvous_allreduce_across_replicas).
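The idea of an all-reduce through a shared object prefix can be illustrated with a local directory standing in for the rendezvous location. This is a toy sketch of the general technique, not the repo's ObjectStoreAllReduce: the function names, the JSON encoding, and the `round{R}-rank{K}.json` key layout are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


def publish(rendezvous_dir, round_id, rank, values):
    """Write this replica's vector as one object, atomically (temp file, then rename)."""
    os.makedirs(rendezvous_dir, exist_ok=True)
    final_path = os.path.join(rendezvous_dir, f"round{round_id}-rank{rank}.json")
    fd, tmp_path = tempfile.mkstemp(dir=rendezvous_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(values, handle)
    os.replace(tmp_path, final_path)  # readers never observe a half-written object


def all_reduce_mean(rendezvous_dir, round_id, world_size, timeout_s=60.0, poll_s=0.1):
    """Block until every rank has published for this round, then return the element-wise mean."""
    paths = [os.path.join(rendezvous_dir, f"round{round_id}-rank{r}.json")
             for r in range(world_size)]
    deadline = time.monotonic() + timeout_s
    while not all(os.path.exists(p) for p in paths):
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_id}: not all {world_size} replicas published")
        time.sleep(poll_s)
    vectors = []
    for path in paths:
        with open(path) as handle:
            vectors.append(json.load(handle))
    return [sum(column) / world_size for column in zip(*vectors)]


# Two "replicas" publish into a temporary directory, then either one can reduce.
with tempfile.TemporaryDirectory() as rendezvous:
    publish(rendezvous, round_id=0, rank=0, values=[1.0, 2.0])
    publish(rendezvous, round_id=0, rank=1, values=[3.0, 6.0])
    averaged = all_reduce_mean(rendezvous, round_id=0, world_size=2)
```

Because replicas only read and write objects under a shared prefix, the same pattern works whether the prefix is a local path or an S3 location, which is why no cross-job network collective is needed.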