# FAQ
- Q: I get non-finite losses on steps 0-10.
- A: Lower the LR, use AMP warmup, keep the sequence length at 512, and ensure the attention mask is causal and key-side only (see the stability sketch after this list).
- Q: How do I resume training?
- A: Use `--resume_checkpoint <path_to_checkpoint.pt>` (a resume sketch follows this list).
- Q: Do I need FlashAttention?
- A: Optional. The code falls back to PyTorch SDPA, which is stable and memory-efficient (see the SDPA sketch below).
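
For the non-finite-loss question, here is a minimal stability sketch combining linear LR warmup, AMP with gradient scaling, and gradient clipping. The model, batch shape, `warmup_steps`, and `base_lr` are illustrative assumptions, not values from this repo:

```python
import torch

# Minimal AMP + warmup sketch; the model, shapes, warmup_steps, and
# base_lr below are illustrative assumptions, not this repo's config.
model = torch.nn.Linear(512, 512).cuda()  # assumes a CUDA device
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
warmup_steps, base_lr = 500, 1e-4

for step in range(1_000):
    # Linear LR warmup: tiny steps early tame fp16 loss spikes.
    for group in optimizer.param_groups:
        group["lr"] = base_lr * min(1.0, (step + 1) / warmup_steps)

    x = torch.randn(8, 512, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()  # stand-in for the real loss

    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale so clipping sees true grads
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)  # skipped automatically if grads are non-finite
    scaler.update()
```

The key property is that `GradScaler.step` skips the optimizer update whenever the scaled gradients contain inf/NaN, so an early overflow shows up as a skipped step rather than a poisoned model.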
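
For resuming, a typical invocation would look like `python train.py --resume_checkpoint checkpoints/last.pt` (script name and path are illustrative). Below is a sketch of the state a resume handler usually restores; the checkpoint keys are assumptions, not this repo's confirmed schema:

```python
import torch

def resume(path: str, model, optimizer):
    """Sketch of a resume handler; the keys "model", "optimizer",
    and "step" are assumed, not taken from this repo's code."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt.get("step", 0)  # resume the step counter if saved
```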
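
On the FlashAttention question, the SDPA fallback is `torch.nn.functional.scaled_dot_product_attention`; the shapes below are illustrative. `is_causal=True` applies exactly the causal, key-side mask the first answer asks for:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim).
q = torch.randn(2, 8, 512, 64)
k = torch.randn(2, 8, 512, 64)
v = torch.randn(2, 8, 512, 64)

# is_causal=True masks future keys internally; PyTorch picks a
# FlashAttention kernel when available and otherwise falls back to
# its memory-efficient implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 512, 64])
```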