Text Generation
Safetensors
English
Chinese
xTimeCrystal committed on
Commit
7320e51
·
verified ·
1 Parent(s): 4bf7cc4

Update README.md

  1. README.md +11 -19
README.md CHANGED
@@ -14,6 +14,11 @@ This modelcard aims to be a base template for new models. It has been generated
14
 
15
  <!-- Provide a longer summary of what this model is. -->
16
 
17
  This model uses state-of-the-art techniques to train a 200M-parameter decoder-only transformer language model on only 10B tokens, on a single RTX 5090 GPU, in just 1 day. As shown in the examples below, the model is highly competent in diverse styles and is capable of generating text on various topics, ranging from pure recall to article writing.
18
 
19
  The main techniques used were:
@@ -21,18 +26,19 @@ The main techniques used were:
21
 
22
  - Aggressive data filtering: by selecting a carefully curated set of educational content and building upon the data efficiency of Muon, we were able to significantly improve capabilities in resource-constrained environments.
23
 
 
 
24
  - Float8 pretraining: This model was pretrained using bf16 master weights, fp8 (e4m3) casting with bf16 accumulation, and full bf16 backward. However, quantizing the attention mechanism significantly hurt the loss, so it was kept in bf16. This was found to match the performance of full bf16 training with significantly reduced (\~30% decrease) VRAM usage and much higher (\~20% increase) throughput.
25
 
26
  - ReLU^2 activation: This ultra-sparse activation function outperformed SwiGLU ([\[1\]](https://arxiv.org/abs/2109.08668v2), [\[2\]](https://arxiv.org/abs/2402.03804)) while only requiring 2 matmuls, slightly improving throughput.
27
 
28
  - Full attention: when pretraining small models, every layer is precious. Thus, we use full attention (no SWA, no GQA) for all layers in the model.
29
 
30
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66a767dcbe4c3c2683495a8b/L7AuCdoEGrEVprBKIbks2.png)
31
 
32
- - **Developed by:** xTimeCrystal
33
- - **Model type:** Softmax self-attention decoder-only transformer
34
- - **Language(s) (NLP):** English, Chinese, Python
35
- - **License:** Apache 2.0
36
 
37
  ### Model Sources [optional]
38
 
@@ -305,20 +311,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
305
 
306
  [More Information Needed]
307
 
308
- ## Glossary [optional]
309
-
310
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
311
-
312
- [More Information Needed]
313
-
314
- ## More Information [optional]
315
-
316
- [More Information Needed]
317
-
318
- ## Model Card Authors [optional]
319
-
320
- [More Information Needed]
321
-
322
  ## Model Card Contact
323
 
324
  [More Information Needed]
 
14
 
15
  <!-- Provide a longer summary of what this model is. -->
16
 
17
+ - **Developed by:** xTimeCrystal
18
+ - **Model type:** Softmax self-attention decoder-only transformer
19
+ - **Language(s) (NLP):** English, Chinese, Python
20
+ - **License:** Apache 2.0
21
+
22
  This model uses state-of-the-art techniques to train a 200M-parameter decoder-only transformer language model on only 10B tokens, on a single RTX 5090 GPU, in just 1 day. As shown in the examples below, the model is highly competent in diverse styles and is capable of generating text on various topics, ranging from pure recall to article writing.
23
 
24
  The main techniques used were:
 
26
 
27
  - Aggressive data filtering: by selecting a carefully curated set of educational content and building upon the data efficiency of Muon, we were able to significantly improve capabilities in resource-constrained environments.
28
 
29
+ - Efficient data bin-packing: because self-attention relies heavily on attention sinks, every sequence started with the start token ('\<s\>') and was truncated at 2048 tokens. However, this was inefficient, as over 70% of the processed data was padding. To alleviate this, we used a simple bin-packing algorithm that concatenates sequences so that each packed sequence has a length close to 2048 tokens. After this operation, every packed sequence had less than 5% padding.
30
+
31
  - Float8 pretraining: This model was pretrained using bf16 master weights, fp8 (e4m3) casting with bf16 accumulation, and full bf16 backward. However, quantizing the attention mechanism significantly hurt the loss, so it was kept in bf16. This was found to match the performance of full bf16 training with significantly reduced (\~30% decrease) VRAM usage and much higher (\~20% increase) throughput.
32
 
33
  - ReLU^2 activation: This ultra-sparse activation function outperformed SwiGLU ([\[1\]](https://arxiv.org/abs/2109.08668v2), [\[2\]](https://arxiv.org/abs/2402.03804)) while only requiring 2 matmuls, slightly improving throughput.
34
 
35
  - Full attention: when pretraining small models, every layer is precious. Thus, we use full attention (no SWA, no GQA) for all layers in the model.
36
 
37
+ - QK Norm without scalars: this enhanced stability as the additional scalars caused loss spikes and massive attention activations.
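
The bin-packing item in the list above does not say which algorithm was used, so the following is only a minimal sketch under stated assumptions: it uses first-fit decreasing, a common simple heuristic, and assumes each length already counts the leading start token. The names `pack_sequences` and `padding_fraction` are hypothetical and not taken from the model's training code.

```python
from typing import List

MAX_LEN = 2048  # context length stated in the model card


def pack_sequences(lengths: List[int], max_len: int = MAX_LEN) -> List[List[int]]:
    """Group sequence indices into bins whose total length is <= max_len.

    Uses first-fit decreasing (an assumption; the card only says
    "a simple bin packing algorithm").
    """
    # Over-long sequences are truncated to the context length.
    clipped = [min(n, max_len) for n in lengths]
    # Place the longest sequences first so shorter ones fill leftover gaps.
    order = sorted(range(len(clipped)), key=lambda i: clipped[i], reverse=True)
    bins: List[List[int]] = []
    free: List[int] = []  # remaining room in each bin
    for i in order:
        for b, room in enumerate(free):
            if clipped[i] <= room:
                bins[b].append(i)
                free[b] -= clipped[i]
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([i])
            free.append(max_len - clipped[i])
    return bins


def padding_fraction(lengths: List[int], bins: List[List[int]],
                     max_len: int = MAX_LEN) -> float:
    """Fraction of token slots that would be padding after packing."""
    used = sum(min(lengths[i], max_len) for b in bins for i in b)
    return 1.0 - used / (len(bins) * max_len)


# Illustrative lengths only, not data from the actual training set.
example_lengths = [2048, 1000, 1000, 500, 500, 48, 3000]
example_bins = pack_sequences(example_lengths)
unpacked = [[i] for i in range(len(example_lengths))]
```

Comparing `padding_fraction(example_lengths, example_bins)` with `padding_fraction(example_lengths, unpacked)` shows the kind of reduction in padding the card describes, although the exact figures depend on the real length distribution.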
38
 
39
+ Overall, these techniques allowed the model to be losslessly trained with a massive batch size of 64 x 2048 tokens and completely spike-free for 110k steps (14B tokens):
40
+
41
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66a767dcbe4c3c2683495a8b/L7AuCdoEGrEVprBKIbks2.png)
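
As an illustration of the ReLU^2 and QK-norm items in the technique list, here is a dependency-free sketch. It rests on several assumptions: the card does not state which normalization is used, so RMS normalization without a learnable gain stands in for "QK Norm without scalars", and the `1/sqrt(d)` logit scaling is the standard attention convention rather than a confirmed detail. All function names are hypothetical. For context, a SwiGLU feed-forward block uses three projections (gate, up, down), whereas the ReLU^2 block below uses two.

```python
import math
from typing import List


def relu_squared(x: List[float]) -> List[float]:
    """ReLU^2: max(x, 0)^2 elementwise; negative inputs map to zero."""
    return [max(v, 0.0) ** 2 for v in x]


def matvec(w: List[List[float]], x: List[float]) -> List[float]:
    """Plain matrix-vector product, one output per row of w."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu2_mlp(x: List[float], w_up: List[List[float]],
              w_down: List[List[float]]) -> List[float]:
    """Feed-forward block with two matmuls: up-projection, ReLU^2, down-projection."""
    return matvec(w_down, relu_squared(matvec(w_up, x)))


def rms_norm_no_scale(x: List[float], eps: float = 1e-6) -> List[float]:
    """RMS-normalize a vector with no learnable gain or bias (assumed form)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def qk_logit(q: List[float], k: List[float]) -> float:
    """Attention logit for one query/key pair after scalar-free QK norm."""
    qn = rms_norm_no_scale(q)
    kn = rms_norm_no_scale(k)
    # Assumed standard 1/sqrt(d) scaling of the dot product.
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))
```

Because both vectors are normalized before the dot product, the logit's magnitude is bounded by `sqrt(d)` however large the raw query or key activations become. Under these assumptions, that bound is one way to see why such a norm can help keep attention activations from blowing up.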
 
42
 
43
  ### Model Sources [optional]
44
 
 
311
 
312
  [More Information Needed]
313
 
314
  ## Model Card Contact
315
 
316
  [More Information Needed]