xTimeCrystal committed (verified)
Commit 4bf7cc4 · 1 parent: 06973a7

Update README.md

Files changed (1): README.md (+2 −2)
README.md CHANGED
@@ -21,9 +21,9 @@ The main techniques used were:
 
 - Aggressive data filtering: by selecting a carefully curated set of educational content and building upon the data efficiency of Muon, we were able to significantly improve capabilities in resource-constrained environments.
 
-- Float8 pretraining: This model was pretrained using bf16 master weights, fp8 (e4m3) casting with bf16 accumulation, and full bf16 backward. However, quantizing the attention mechanism significantly hurt the loss, so it was kept in bf16. This was found to match the performance of full bf16 training with significantly reduced (~30% decrease) VRAM usage and much higher (~20% increase) throughput.
+- Float8 pretraining: This model was pretrained using bf16 master weights, fp8 (e4m3) casting with bf16 accumulation, and full bf16 backward. However, quantizing the attention mechanism significantly hurt the loss, so it was kept in bf16. This was found to match the performance of full bf16 training with significantly reduced (\~30% decrease) VRAM usage and much higher (\~20% increase) throughput.
 
-- ReLU^2 activation: This ultra-sparse activation function outperformed SwiGLU ([1](https://arxiv.org/abs/2109.08668v2), [2](https://arxiv.org/abs/2402.03804)) while only requiring 2 matmuls, slightly improving throughput.
+- ReLU^2 activation: This ultra-sparse activation function outperformed SwiGLU ([\[1\]](https://arxiv.org/abs/2109.08668v2), [\[2\]](https://arxiv.org/abs/2402.03804)) while only requiring 2 matmuls, slightly improving throughput.
 
 - Full attention: when pretraining small models, every layer is precious. Thus, we use full attention (no SWA, no GQA) for all layers in the model.
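The ReLU^2 activation mentioned in the diff can be sketched in a few lines. This is a minimal NumPy illustration, not the model's actual implementation; the function and weight names are made up for the example. It shows why the block needs only 2 matmuls, whereas SwiGLU requires a third for its gate projection.

```python
import numpy as np

def relu2(x):
    # ReLU^2(x) = max(x, 0)^2: zero for every negative input
    # (hence ultra-sparse activations), quadratic growth for positive ones.
    return np.square(np.maximum(x, 0.0))

def relu2_mlp(x, w_up, w_down):
    # An MLP block with ReLU^2 uses only 2 matmuls (up-projection and
    # down-projection); SwiGLU needs an extra matmul for its gate.
    return relu2(x @ w_up) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_up = rng.standard_normal((8, 32))
w_down = rng.standard_normal((32, 8))
out = relu2_mlp(x, w_up, w_down)  # shape (4, 8)
```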