Text Generation
Safetensors
English
Chinese
Committed by xTimeCrystal
Commit d925863 (verified) · 1 parent: 7320e51

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -36,7 +36,7 @@ The main techniques used were:

  - QK Norm without scalars: this enhanced stability as the additional scalars caused loss spikes and massive attention activations.

- Overall, these techniques allowed the model to be losslessly trained with a massive batch size of 64 x 2048 tokens and completely spike-free for 110k steps (14B tokens):
+ Overall, these techniques allowed the model to be losslessly trained for 110k steps with a massive batch size of 64 x 2048 tokens without gradient accumulation while still fitting in under 30GB VRAM and being completely spike-free:

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66a767dcbe4c3c2683495a8b/L7AuCdoEGrEVprBKIbks2.png)
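
The removed line quoted a total of 14B tokens, a figure the added line no longer states. As a quick sanity check, the sketch below recomputes it from the numbers that appear in both versions. It assumes "64 x 2048" means 64 sequences of 2048 tokens per optimizer step, which the README does not spell out; the constant and function names are illustrative only and are not taken from the repository.

```python
# Figures quoted in the README line touched by this commit.
# Assumption: "64 x 2048" is 64 sequences of 2048 tokens per optimizer step.
SEQUENCES_PER_BATCH = 64
TOKENS_PER_SEQUENCE = 2048
TRAINING_STEPS = 110_000


def tokens_per_step(sequences: int, seq_len: int) -> int:
    """Tokens consumed by a single optimizer step."""
    return sequences * seq_len


def total_tokens(sequences: int, seq_len: int, steps: int) -> int:
    """Tokens consumed over the whole run."""
    return tokens_per_step(sequences, seq_len) * steps


per_step = tokens_per_step(SEQUENCES_PER_BATCH, TOKENS_PER_SEQUENCE)
total = total_tokens(SEQUENCES_PER_BATCH, TOKENS_PER_SEQUENCE, TRAINING_STEPS)

print(f"{per_step:,} tokens per step")
print(f"{total / 1e9:.1f}B tokens over {TRAINING_STEPS:,} steps")
```

Under that assumption the run processes 131,072 tokens per step and roughly 14.4B tokens in total, which agrees with the rounded "14B tokens" in the removed line.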