xTimeCrystal/MiniModel-200M-Base

Text Generation

xTimeCrystal committed on Sep 24, 2025

Commit d925863 · verified · 1 Parent(s): 7320e51

Update README.md

Files changed (1)

README.md +1 -1

@@ -36,7 +36,7 @@ The main techniques used were:
 - QK Norm without scalars: this enhanced stability as the additional scalars caused loss spikes and massive attention activations.
-Overall, these techniques allowed the model to be losslessly trained with a massive batch size of 64 x 2048 tokens and completely spike-free for 110k steps (14B tokens):
+Overall, these techniques allowed the model to be losslessly trained for 110k steps with a massive batch size of 64 x 2048 tokens without gradient accumulation while still fitting in under 30GB VRAM and being completely spike-free:
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66a767dcbe4c3c2683495a8b/L7AuCdoEGrEVprBKIbks2.png)
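
The new wording drops the "(14B tokens)" figure that the removed line carried, so it may help to see where that number comes from. The sketch below is only a back-of-the-envelope check, assuming "64 x 2048 tokens" means 64 sequences of 2048 tokens per optimizer step and "110k" means exactly 110,000 steps; the constant and function names are illustrative and do not come from the repository.

```python
# Back-of-the-envelope token count for the training run described in the README.
# Assumption: "64 x 2048 tokens" = 64 sequences of 2048 tokens per optimizer step.
# Assumption: "110k steps" = exactly 110,000 steps.
SEQUENCES_PER_STEP = 64
TOKENS_PER_SEQUENCE = 2048
TRAINING_STEPS = 110_000


def tokens_per_step(sequences: int, seq_len: int) -> int:
    """Tokens consumed by one optimizer step (no gradient accumulation)."""
    return sequences * seq_len


def total_tokens(sequences: int, seq_len: int, steps: int) -> int:
    """Tokens consumed over the whole run."""
    return tokens_per_step(sequences, seq_len) * steps


if __name__ == "__main__":
    per_step = tokens_per_step(SEQUENCES_PER_STEP, TOKENS_PER_SEQUENCE)
    total = total_tokens(SEQUENCES_PER_STEP, TOKENS_PER_SEQUENCE, TRAINING_STEPS)
    print(f"tokens per step: {per_step:,}")   # 131,072
    print(f"total tokens:    {total:,}")      # 14,417,920,000
    print(f"approx:          {total / 1e9:.1f}B")  # 14.4B
```

Under these assumptions the run covers roughly 14.4B tokens, which matches the "14B tokens" stated in the removed line.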