Why are the released weights in fp32?
#7 by BootsofLagrangian
The paper mentions using PyTorch AMP (bfloat16) for training, but the model is released in float32.
Is there a specific reason for this? I assume you released the master weights directly (maybe for exact reproducibility, or following OLMo's checkpointing style)?
Just curious, since most recent models are usually released as 2-byte tensors (bf16/fp16).
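In case it helps anyone else in the meantime, here's a minimal sketch of how I'm downcasting the fp32 release at load time (assuming the standard Transformers loading path; the repo id below is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the fp32 checkpoint but cast the weights to bf16 on the fly,
# roughly halving memory for inference.
model = AutoModelForCausalLM.from_pretrained(
    "allenai/your-model-here",  # placeholder repo id
    torch_dtype=torch.bfloat16,
)

# Optionally save a 2-byte copy locally to avoid re-downloading the fp32 shards.
model.save_pretrained("./model-bf16")
```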
P.S. Thanks for extending the max context length to 36k! I previously asked about the 4k limit in Molmo (https://huggingface.co/allenai/Molmo-72B-0924/discussions/15), so I'm really happy to see this massive upgrade.