Wave 20: fix SageMaker smoke — pin transformers<4.53 (torch 2.6 float8 skew)

The first live run (composer-grpo-smoke-...18-35-31) reached the Training
state, then failed at AutoModel.from_pretrained with "AttributeError: module
'torch' has no attribute 'float8_e8m0fnu'". transformers 4.53+ references
that dtype at import (MXFP4 quant), but it exists only in torch 2.7+; the DLC
ships torch 2.6 (vllm 0.8.5 pins 2.6). The bare ">=4.46" constraint silently
pulled 4.53+. Pin transformers>=4.51.1,<4.53 (4.51.1 is vllm 0.8.5's floor,
verified against PyPI metadata). Cost of the failed run: 365 billable seconds
≈ $0.15.
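
A preflight check along the following lines might catch this skew before a billable job starts. It is a minimal sketch, not part of this change: the helper names (parse_version, pin_is_safe, check_installed) are hypothetical, the 4.53 and 2.7 thresholds are taken from the description above, and the version parsing is deliberately crude (it ignores pre-release ordering).

```python
from importlib import metadata


def parse_version(text):
    """Return the leading numeric release parts, e.g. '2.6.0+cu124' -> (2, 6, 0)."""
    parts = []
    # Drop any local version suffix such as '+cu124' before splitting.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def pin_is_safe(transformers_version, torch_version):
    """True unless transformers is 4.53+ while torch is older than 2.7.

    Assumption (from the commit description): 4.53+ needs a dtype that
    torch only provides from 2.7 onward.
    """
    needs_new_dtype = parse_version(transformers_version) >= (4, 53)
    has_new_dtype = parse_version(torch_version) >= (2, 7)
    return has_new_dtype or not needs_new_dtype


def check_installed():
    """Check the versions installed in the current environment.

    Returns None when either package is missing, so the caller can decide
    whether that is an error.
    """
    try:
        tf_version = metadata.version("transformers")
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None
    return pin_is_safe(tf_version, torch_version)
```

Calling check_installed() at the top of the entry point and exiting on False would turn the late AttributeError into an immediate, readable failure. The pure pin_is_safe function is kept separate so the rule can be tested without either package installed.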

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (1)

examples/gsm8k_grpo/requirements.txt +7 -0

@@ -3,6 +3,13 @@
 # F3 §3.2: baking these into a Dockerfile.sagemaker is the repeatable path;
 # this requirements.txt is the no-local-build path for the one-shot smoke.
 trl>=1.5,<2
+# transformers MUST stay below 4.53: 4.53+ references torch.float8_e8m0fnu at
+# import (MXFP4 quant), a dtype only added in torch 2.7 — but the DLC ships
+# torch 2.6 (and vllm 0.8.5 pins torch 2.6), so a newer transformers crashes
+# AutoModel.from_pretrained with AttributeError. Floor 4.51.1 = vllm 0.8.5's
+# requirement. (Caught by a live run 2026-06-09; the bare ">=4.46" silently
+# pulled 4.53+.)
+transformers>=4.51.1,<4.53
 peft>=0.13
 accelerate>=1.0
 datasets>=3.0