FAQ
Q: I get non-finite losses within the first few (0-10) steps.
- A: Lower the learning rate, use AMP warmup, keep the sequence length at 512, and make sure the attention mask is causal with padding applied on the key side only (sketch below).
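  A minimal sketch of such a mask for `F.scaled_dot_product_attention`, assuming the model uses SDPA; the helper name `build_attn_mask` and the `pad_mask` convention are illustrative, not from this repo:

  ```python
  import torch
  import torch.nn.functional as F

  def build_attn_mask(pad_mask: torch.Tensor) -> torch.Tensor:
      # pad_mask: (batch, seq) bool, True where the token is real (not padding).
      seq = pad_mask.size(1)
      # Causal part: query i may attend only to keys j <= i.
      causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool, device=pad_mask.device))
      # Padding on the key side only: shape (batch, 1, 1, seq) broadcasts over
      # queries, so padded *keys* are masked but no query row is zeroed out.
      key_side = pad_mask[:, None, None, :]
      return causal[None, None] & key_side  # (batch, 1, seq, seq)

  # For SDPA, a boolean mask marks *allowed* positions with True.
  q = k = v = torch.randn(2, 4, 512, 64)
  pad_mask = torch.ones(2, 512, dtype=torch.bool)
  out = F.scaled_dot_product_attention(q, k, v, attn_mask=build_attn_mask(pad_mask))
  ```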
Q: How do I resume training?
- A: Pass `--resume_checkpoint <path_to_checkpoint.pt>`.
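  Under the hood, resuming amounts to reloading the saved state dicts. A minimal sketch, assuming a checkpoint layout with `model`, `optimizer`, and `step` keys (check the repo's actual save code for the real keys):

  ```python
  import torch

  def resume_from(path: str, model, optimizer) -> int:
      # Assumed layout: {"model": ..., "optimizer": ..., "step": ...};
      # verify against how checkpoints are actually written in this repo.
      ckpt = torch.load(path, map_location="cpu")
      model.load_state_dict(ckpt["model"])
      optimizer.load_state_dict(ckpt["optimizer"])
      return ckpt.get("step", 0)  # step counter to continue the training loop from
  ```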
Q: Do I need FlashAttention?
- A: No, it's optional. The code falls back to PyTorch SDPA, which is stable and memory-efficient.
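  As an illustration, the fallback needs nothing beyond stock PyTorch (2.0+); SDPA picks the best available kernel on its own:

  ```python
  import torch
  import torch.nn.functional as F

  # SDPA dispatches to the best kernel for the device/dtype
  # (FlashAttention, memory-efficient, or the math fallback) automatically.
  q = k = v = torch.randn(2, 8, 512, 64)
  out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
  print(out.shape)  # torch.Size([2, 8, 512, 64])
  ```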