Baladithya Balamurugan

	# AWS SageMaker Quickstart — the runnable-now GRPO smoke

	The minimum path to running the Composer-replication RL inner loop on a real
	GPU, end-to-end, for under $1. Implements F3 (`research/design-F3-rl-sagemaker.md`).

	## Live account facts (verified 2026-06-09, acct 386931836011, us-west-2)

	| Fact | Value |
	|---|---|
	| `ml.g5.2xlarge` training-job quota | 1 (live, code `L-2D6DEB3C`) → no quota ticket needed |
	| Execution role | `arn:aws:iam::386931836011:role/service-role/AmazonSageMaker-ExecutionRole-20250725T133247` |
	| Bucket (rendezvous + output) | `amazon-sagemaker-386931836011-us-west-2-7597bf4d9a3d` |
	| PyTorch DLC base image | `763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26` |

	> The base image MUST be the torch-2.7 DLC — learned from two live runs
	> (2026-06-09). The dependency chain forces it:
	> `ComposerReplicationTrainer → trl 1.5.x → transformers>=4.56.2 →
	> torch.float8_e8m0fnu (MXFP4, torch>=2.7)`. The torch-2.6 DLC fails
	> `AutoModel.from_pretrained` with `AttributeError: module 'torch' has no
	> attribute 'float8_e8m0fnu'`, and pinning transformers down is impossible
	> (trl 1.5's floor is 4.56.2). Resolve the tag against the live registry — the
	> AWS docs page lists wrong/stale tags (it showed a cu124 2.6 tag that doesn't
	> exist; real tags are cu126 for 2.6, cu128 for 2.7, each with a mandatory
	> `-vX.Y` build suffix — no bare floating tag):
	> ```bash
	> aws ecr describe-images --registry-id 763104351884 \
	> --repository-name pytorch-training --region us-west-2 \
	> --query "reverse(sort_by(imageDetails,&imagePushedAt))[].imageTags" --output text \
	> | tr '\t' '\n' | grep -E '^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$' | head -1
	> ```
	>
	> vLLM is OFF by default in the smoke: `vllm==0.8.5` hard-pins `torch==2.6.0`,
	> which fights the torch-2.7 base. The smoke uses `model.generate` rollout (what
	> it proves is trainer-on-GPU + reward, not rollout speed). For colocated vLLM,
	> bake a torch-2.7-matched `vllm>=0.9` into an image and pass `--image <ecr> --vllm`.
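
The tag lookup above can also be done from Python. This is a minimal sketch, not part of the repo: `newest_matching_tag` and `fetch_image_details` are hypothetical helper names, and the pattern is the same one the `grep` uses. The pure function takes data shaped like the `imageDetails` list that ECR's describe-images call returns, so it can be exercised without AWS credentials.

```python
import re

# Same pattern as the grep above (torch 2.7.x, py312, cu128, versioned SageMaker build).
TAG_RE = re.compile(r"^2\.7\.[0-9]+-gpu-py312-cu128-.*-sagemaker-v[0-9.]+$")


def newest_matching_tag(image_details, pattern=TAG_RE):
    """Return the matching tag on the most recently pushed image, or None.

    `image_details` mirrors ECR's `imageDetails`: dicts with a sortable
    `imagePushedAt` and an optional `imageTags` list (untagged images omit it).
    """
    newest_first = sorted(image_details, key=lambda d: d["imagePushedAt"], reverse=True)
    for detail in newest_first:
        for tag in detail.get("imageTags", []):
            if pattern.match(tag):
                return tag
    return None


def fetch_image_details(region="us-west-2", registry_id="763104351884",
                        repository="pytorch-training"):
    """Page through ECR describe-images. Needs boto3 and AWS credentials."""
    import boto3  # lazy import so newest_matching_tag works without boto3

    paginator = boto3.client("ecr", region_name=region).get_paginator("describe_images")
    details = []
    for page in paginator.paginate(registryId=registry_id, repositoryName=repository):
        details.extend(page["imageDetails"])
    return details
```

With credentials configured, `newest_matching_tag(fetch_image_details())` should return the same tag as the shell pipeline.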

	> SDK pin: the smoke launcher uses the sagemaker SDK v2 Estimator API.
	> SDK v3 is an API rewrite that dropped `sagemaker.estimator.Estimator` and
	> `sagemaker.pytorch`, so install with `pip install 'sagemaker>=2.200,<3'`.
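
A preflight check can catch a wrong SDK major version before a job is submitted. The sketch below is an assumption about how one might do it, not something the launcher is known to contain; `is_supported_sdk` and `check_installed_sdk` are hypothetical names, and only the standard library is used.

```python
from importlib import metadata


def is_supported_sdk(version: str) -> bool:
    """True when `version` satisfies the pin >=2.200,<3."""
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return False  # unparseable version strings are treated as unsupported
    return major == 2 and minor >= 200


def check_installed_sdk() -> str:
    """Return the installed sagemaker version, or exit with an install hint."""
    try:
        installed = metadata.version("sagemaker")
    except metadata.PackageNotFoundError:
        raise SystemExit("sagemaker is not installed; run: pip install 'sagemaker>=2.200,<3'")
    if not is_supported_sdk(installed):
        raise SystemExit(f"sagemaker {installed} is outside the supported range >=2.200,<3")
    return installed
```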

	## Run it (no local Docker build)

	```bash
	pip install 'sagemaker>=2.200,<3'
	export AWS_REGION=us-west-2
	python examples/gsm8k_grpo/run_sagemaker_launch.py --max-steps 20
	```

	This uses the stock PyTorch DLC directly as the training image and ships the
	framework + entry script via `source_dir`; `examples/gsm8k_grpo/requirements.txt`
	(trl + the RL stack) installs at job start. No 15 GB local image build, no ECR
	push. The script trains `Qwen/Qwen2.5-0.5B-Instruct` with GRPO + the GSM8K
	`#### NUMBER` RLVR reward, using `model.generate` rollout (vLLM off by default —
	see the torch-pin note above).
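
GSM8K reference solutions end with a `#### <number>` line, which is what makes the reward verifiable. The following is a minimal sketch of such a reward, written for illustration only: the function names are hypothetical, and the repo's actual reward may parse or score completions differently (for example, it may not fall back to the last number in the text).

```python
import re

# An optionally signed number, allowing thousands separators and a decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def _to_float(text: str) -> float:
    return float(text.replace(",", ""))


def gold_answer(reference: str):
    """Parse the number after the final '####' marker, or return None."""
    _, marker, tail = reference.rpartition("####")
    if not marker:
        return None
    match = _NUMBER.search(tail)
    return _to_float(match.group()) if match else None


def predicted_answer(completion: str):
    """Prefer a '#### N' answer; otherwise fall back to the last number seen."""
    marked = gold_answer(completion)
    if marked is not None:
        return marked
    numbers = _NUMBER.findall(completion)
    return _to_float(numbers[-1]) if numbers else None


def rlvr_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 on an exact numeric match, else 0.0."""
    gold = gold_answer(reference)
    pred = predicted_answer(completion)
    if gold is None or pred is None:
        return 0.0
    return 1.0 if abs(pred - gold) < 1e-6 else 0.0
```

A binary reward like this gives GRPO a clean signal on a 20-step smoke; partial credit for formatting is a common extension but is not assumed here.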

	Flags: `--no-wait` (submit now, poll later), `--spot` (managed spot; that quota is also 1),
	`--vllm` (enable colocated vLLM — only with a baked `--image` carrying a
	torch-2.7 vllm), `--image <ecr-uri>` (use a prebuilt baked image instead of the DLC).
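
These flags map fairly directly onto SDK v2 `Estimator` arguments. The sketch below shows one plausible mapping and is not the launcher's actual code: the `source_dir` path, the `train_entry.py` entry-point name, the hyperparameter keys, and the one-hour `max_run` are all assumptions for illustration.

```python
DLC_IMAGE = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:"
    "2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker-v1.26"
)


def estimator_kwargs(role, max_steps=20, image=None, spot=False, vllm=False,
                     max_run_seconds=3600):
    """Build keyword arguments for sagemaker.estimator.Estimator (SDK v2)."""
    if vllm and image is None:
        raise ValueError("--vllm needs a baked --image carrying a torch-2.7 vllm")
    kwargs = {
        "image_uri": image or DLC_IMAGE,      # stock DLC unless --image is given
        "role": role,
        "instance_count": 1,
        "instance_type": "ml.g5.2xlarge",
        "source_dir": "examples/gsm8k_grpo",  # assumed location of requirements.txt
        "entry_point": "train_entry.py",      # hypothetical entry script name
        "hyperparameters": {"max-steps": max_steps, "use-vllm": int(vllm)},  # hypothetical keys
        "max_run": max_run_seconds,
        "enable_network_isolation": False,    # container must reach huggingface.co
    }
    if spot:
        kwargs["use_spot_instances"] = True
        kwargs["max_wait"] = max_run_seconds  # SageMaker requires max_wait >= max_run
    return kwargs


def launch(role, wait=True, **options):
    """Submit the job; wait=False corresponds to --no-wait."""
    from sagemaker.estimator import Estimator  # needs sagemaker>=2.200,<3

    estimator = Estimator(**estimator_kwargs(role, **options))
    estimator.fit(wait=wait)
    return estimator
```

Keeping the argument construction in a pure function makes the flag handling checkable without AWS credentials or the SDK installed.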

	Cost: `ml.g5.2xlarge` ≈ $1.52/hr on-demand; a 20-step 0.5B smoke is
	~15–25 min ⇒ well under $1. Spot ≈ $0.45–0.60/hr ⇒ pennies.
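
The arithmetic behind that estimate, as a small sketch using only the rates and durations quoted above (`job_cost` is a hypothetical helper):

```python
def job_cost(hourly_rate_usd: float, minutes: float) -> float:
    """Cost of a job billed at `hourly_rate_usd` for `minutes` of runtime."""
    return hourly_rate_usd * minutes / 60.0


ON_DEMAND = 1.52            # $/hr, ml.g5.2xlarge on-demand (figure quoted above)
SPOT_RANGE = (0.45, 0.60)   # $/hr, spot range quoted above

for minutes in (15, 25):
    print(f"{minutes} min on-demand: ${job_cost(ON_DEMAND, minutes):.2f}")
print(f"spot, 15 to 25 min: ${job_cost(SPOT_RANGE[0], 15):.2f} to ${job_cost(SPOT_RANGE[1], 25):.2f}")
```

That works out to roughly $0.38 to $0.63 on-demand and $0.11 to $0.25 on spot for the quoted 15 to 25 minute window.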

	## The repeatable path (baked image)

	For runs where the ~5–10 min per-job pip-install is unwanted (and for the
	DiLoCo N-replica `SageMakerExecutor`, which passes `ContainerEntrypoint` and
	needs the framework baked in), build the image once:

	```bash
	scripts/build_and_push_ecr.sh # creates ECR repo, builds, pushes composer-rl:smoke
	python examples/gsm8k_grpo/run_sagemaker_launch.py \
	--image 386931836011.dkr.ecr.us-west-2.amazonaws.com/composer-rl:smoke
	```

	On an Apple-Silicon host the build targets `linux/amd64` (`--platform linux/amd64`),
	so the ~15 GB GPU DLC is built under emulation, which is slow; prefer a native
	linux/amd64 host or CodeBuild.

	## Gotchas (load-bearing)

	- `EnableNetworkIsolation` stays False (the default) so the container can
	reach `huggingface.co` (model + GSM8K download) and S3.
	- `vllm_gpu_memory_utilization=0.3` is the load-bearing knob on a 24 GB
	A10G when `--vllm` is on: too high → OOM when the policy + grads also need the
	GPU; too low → tiny KV cache. If a vLLM wheel/CUDA mismatch surfaces, drop
	`--vllm` and fall back to the default `model.generate` rollout.
	- Warm pools are off — `g5 training warm pool usage` quota is 0 in this
	account, so each job pays ~3–6 min cold-start. Request a warm-pool quota bump
	for iterative dev, or move the long inner loop to HyperPod (persistent).
	- "Waiting for capacity" in `SecondaryStatus` is transient g5 capacity
	contention in the region, not an error — the job proceeds when capacity frees.
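
For jobs submitted with `--no-wait`, a polling loop that treats every non-terminal state (including capacity waits) as "keep waiting" is enough. This is a sketch under that assumption; `wait_for_job` is a hypothetical helper, and `describe` is expected to behave like boto3's SageMaker `describe_training_job`, which returns `TrainingJobStatus` and `SecondaryStatus`.

```python
import time

# Terminal values of TrainingJobStatus; "InProgress" and "Stopping" keep polling.
TERMINAL = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, job_name, poll_seconds=30.0, sleep=time.sleep, log=print):
    """Poll until the job reaches a terminal status; return the final description."""
    last_state = None
    while True:
        desc = describe(TrainingJobName=job_name)
        state = (desc["TrainingJobStatus"], desc.get("SecondaryStatus"))
        if state != last_state:  # log only on transitions, not on every poll
            log(f"{job_name}: {state[0]} / {state[1]}")
            last_state = state
        if desc["TrainingJobStatus"] in TERMINAL:
            return desc
        sleep(poll_seconds)
```

With credentials configured, it can be called as `wait_for_job(boto3.client("sagemaker", region_name="us-west-2").describe_training_job, "<job-name>")`; on a `Failed` result the returned description carries a `FailureReason` field.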

	## Next: DiLoCo N-replica (the `SageMakerExecutor` path)

	`examples/diloco_sagemaker/run.py` (driver, F3 §4.3) drives N independent
	single-instance Training Jobs sharing one `s3://.../rendezvous/` prefix via
	`ObjectStoreAllReduce` — no cross-job NCCL. N=1 runs today; N=2–4 needs a
	`ml.g5.2xlarge for training job usage` quota increase. The DiLoCo math, loss,
	trainer, and `ObjectStoreAllReduce` are unchanged from the smoke: the S3
	rendezvous is the entire portability contract (validated against `file://` and
	live `s3://`; see `test_serverless_local.py::test_s3_rendezvous_allreduce_across_replicas`).
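
To make the rendezvous idea concrete, here is a minimal file-based all-reduce over a shared prefix, the local-directory analogue of the `file://` case. It is an illustrative sketch and not the repo's `ObjectStoreAllReduce`: the function names, the `round-<n>/replica-<rank>.json` layout, and the JSON encoding are all assumptions.

```python
import json
import time
from pathlib import Path


def publish(prefix, round_id, rank, values):
    """Write this replica's contribution atomically (temp file, then rename)."""
    round_dir = Path(prefix) / f"round-{round_id}"
    round_dir.mkdir(parents=True, exist_ok=True)
    tmp = round_dir / f".replica-{rank}.tmp"
    tmp.write_text(json.dumps(list(values)))
    tmp.replace(round_dir / f"replica-{rank}.json")  # readers never see partial files


def all_reduce_mean(prefix, round_id, world_size, timeout=60.0, poll=0.1):
    """Block until all `world_size` contributions exist, then average elementwise."""
    round_dir = Path(prefix) / f"round-{round_id}"
    deadline = time.monotonic() + timeout
    while True:
        files = sorted(round_dir.glob("replica-*.json"))
        if len(files) >= world_size:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"only {len(files)}/{world_size} replicas reported")
        time.sleep(poll)
    vectors = [json.loads(f.read_text()) for f in files[:world_size]]
    return [sum(column) / world_size for column in zip(*vectors)]
```

Each replica calls `publish` and then `all_reduce_mean` for the same round, so the only shared state is the prefix itself and no replica needs a network path to any other.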