Baladithya Balamurugan Claude Opus 4.8 (1M context) committed on
Commit 498dbd7 · 1 Parent(s): 7453f13

Wave 20: fix SageMaker smoke — pin transformers<4.53 (torch 2.6 float8 skew)

First live run (composer-grpo-smoke-...18-35-31) reached Training, then failed
at AutoModel.from_pretrained with AttributeError: module 'torch' has no
attribute 'float8_e8m0fnu'. transformers 4.53+ references that dtype at import
(MXFP4 quant), but it exists only in torch 2.7+; the DLC ships torch 2.6 (vllm
0.8.5 pins 2.6). The bare ">=4.46" silently pulled 4.53+. Pin >=4.51.1,<4.53
(4.51.1 = vllm 0.8.5's floor, verified against PyPI metadata). Cost of the
failed run: 365 billable sec ≈ $0.15.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
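
The failure mode described in the message above could be caught before any model load by a small preflight check. The following is only a sketch, not part of this commit: the names `REQUIRED_DTYPES`, `missing_dtypes`, and `preflight` are hypothetical, and the stand-in objects replace a real `import torch` so the example runs without torch installed.

```python
from types import SimpleNamespace

# Dtype named in the AttributeError from the failed run.
REQUIRED_DTYPES = ("float8_e8m0fnu",)


def missing_dtypes(torch_module, required=REQUIRED_DTYPES):
    """Return the names in `required` that `torch_module` does not define."""
    return [name for name in required if not hasattr(torch_module, name)]


def preflight(torch_module, required=REQUIRED_DTYPES):
    """Raise a readable error if an expected dtype attribute is absent."""
    missing = missing_dtypes(torch_module, required)
    if missing:
        version = getattr(torch_module, "__version__", "unknown")
        raise RuntimeError(
            f"torch {version} lacks {', '.join(missing)}; "
            "pin transformers<4.53 or upgrade torch"
        )


# Stand-in objects (assumptions) so the sketch runs without torch installed.
old_torch = SimpleNamespace(__version__="2.6.0")
new_torch = SimpleNamespace(__version__="2.7.0", float8_e8m0fnu=object())
```

In a real entry point one would presumably pass the imported `torch` module to `preflight` before importing transformers, so the job fails with an actionable message rather than an AttributeError deep in the import chain.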

examples/gsm8k_grpo/requirements.txt CHANGED
@@ -3,6 +3,13 @@
 # F3 §3.2: baking these into a Dockerfile.sagemaker is the repeatable path;
 # this requirements.txt is the no-local-build path for the one-shot smoke.
 trl>=1.5,<2
+# transformers MUST stay below 4.53: 4.53+ references torch.float8_e8m0fnu at
+# import (MXFP4 quant), a dtype only added in torch 2.7 — but the DLC ships
+# torch 2.6 (and vllm 0.8.5 pins torch 2.6), so a newer transformers crashes
+# AutoModel.from_pretrained with AttributeError. Floor 4.51.1 = vllm 0.8.5's
+# requirement. (Caught by a live run 2026-06-09; the bare ">=4.46" silently
+# pulled 4.53+.)
+transformers>=4.51.1,<4.53
 peft>=0.13
 accelerate>=1.0
 datasets>=3.0
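
To illustrate which versions the new `>=4.51.1,<4.53` pin admits, here is a minimal sketch that compares numeric release tuples. It is an approximation rather than a full PEP 440 implementation: the helpers `release_tuple` and `satisfies_pin` are hypothetical, and pre-release suffixes such as `rc1` are simply ignored.

```python
# Bounds taken from the pin in requirements.txt: >=4.51.1,<4.53
FLOOR = (4, 51, 1)
CEILING = (4, 53, 0)


def release_tuple(version: str, width: int = 3) -> tuple:
    """Parse the leading numeric release segments, padded with zeros to `width`."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # segment has no leading digits (e.g. "dev0"): stop
        parts.append(int(digits))
        if digits != piece:
            break  # segment carried a suffix (e.g. "53rc1"): stop after it
    parts = parts[:width]
    return tuple(parts + [0] * (width - len(parts)))


def satisfies_pin(version: str) -> bool:
    """True if the release tuple falls inside [FLOOR, CEILING)."""
    return FLOOR <= release_tuple(version) < CEILING


if __name__ == "__main__":
    for candidate in ("4.46.0", "4.51.1", "4.52.4", "4.53.0"):
        print(candidate, satisfies_pin(candidate))
```

For anything beyond a quick sanity check, the `packaging` library's specifier handling is likely the safer choice, since it follows the full version-ordering rules that pip applies.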