<!-- Provide a longer summary of what this model is. -->

This model is a 200M-parameter decoder-only transformer language model trained with SOTA techniques on only 10B tokens, on a single RTX 5090 GPU, in just 1 day. As shown in the examples below, the model is highly competent across diverse styles and can generate text on a wide range of topics, from pure recall to article writing.
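For a sense of scale, fitting 10B tokens into roughly 24 hours of training implies an average throughput of about 10B / 86,400 s ≈ 116k tokens per second.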
The main techniques used were:
- Adaptive Muon optimizer: Based on the Muon optimizer, this allows the model to be trained with exceptional data efficiency (2.1x that of AdamW). Furthermore, the momentum buffer can be stored in bf16, further lowering VRAM usage; a sketch of the update appears after this list.
- Aggressive data filtering: By selecting a carefully curated set of educational content and building on the data efficiency of Muon, we were able to significantly improve capabilities in resource-constrained environments.
- Float8 pretraining: This model was pretrained using bf16 master weights, fp8 (e4m3) casting with bf16 accumulation, and a full bf16 backward pass. However, quantizing the attention mechanism significantly hurt the loss, so attention was kept in bf16. This setup matched the performance of full bf16 training while significantly reducing VRAM usage (~30% decrease) and raising throughput (~20% increase); a sketch of the casting scheme appears after this list.
- ReLU^2 activation: This ultra-sparse activation function outperformed SwiGLU ([1](https://arxiv.org/abs/2109.08668v2), [2](https://arxiv.org/abs/2402.03804)) while requiring only 2 matmuls per MLP block, slightly improving throughput (sketched after this list).
- Full attention: When pretraining small models, every layer is precious. Thus, we use full attention (no SWA, no GQA) for all layers of the model (see the sketch after this list).
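
The sketches below are minimal illustrations of the pieces above, not the model's actual training code. First, the optimizer: base Muon orthogonalizes the momentum-accumulated gradient of each 2-D weight with a Newton-Schulz iteration. The exact "Adaptive" modification used for this model is not detailed in the card, so this sketch shows only the base update with a bf16 momentum buffer; function names like `muon_step` are illustrative.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G (drive its singular values toward 1)
    using the quintic Newton-Schulz iteration employed by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + eps)
    if G.size(0) > G.size(1):
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

@torch.no_grad()
def muon_step(param: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, beta: float = 0.95) -> None:
    """One Muon-style update for a 2-D weight matrix.

    The momentum buffer is stored in bf16, halving optimizer-state
    memory relative to an fp32 buffer."""
    momentum_buf.mul_(beta).add_(param.grad.to(torch.bfloat16))
    update = newton_schulz(momentum_buf)
    # scale so the update magnitude is roughly independent of the matrix shape
    update = update * max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update.to(param.dtype), alpha=-lr)
```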
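Next, the float8 recipe. The sketch below only emulates the numerics of the scheme described above (per-tensor scaled e4m3 casts on linear-layer inputs and weights, everything else left in bf16); it does not use fused fp8 GEMM kernels, and the straight-through trick keeps the backward pass entirely in bf16. It assumes a PyTorch build with float8 dtypes; `Fp8Linear` and `fp8_qdq` are illustrative names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

FP8_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def fp8_qdq(x: torch.Tensor) -> torch.Tensor:
    """Cast to fp8 (e4m3) with a per-tensor scale, then back to the input dtype.

    The straight-through trick below leaves the backward pass in bf16,
    matching the "full bf16 backward" described in the card."""
    scale = FP8_MAX / x.detach().abs().amax().clamp(min=1e-12)
    q = (x.detach().float() * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    dq = (q.to(torch.float32) / scale).to(x.dtype)
    return x + (dq - x).detach()

class Fp8Linear(nn.Linear):
    """Linear layer with bf16 master weights whose matmul inputs are fp8-quantized.

    Attention projections would skip the fp8 cast and stay in bf16, per the card."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(fp8_qdq(x), fp8_qdq(self.weight), self.bias)

# usage: keep master weights in bf16
layer = Fp8Linear(1024, 4096, bias=False).to(torch.bfloat16)
y = layer(torch.randn(8, 1024, dtype=torch.bfloat16))
```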
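The ReLU^2 feed-forward block is straightforward: square a ReLU instead of gating, so the MLP needs only an up- and a down-projection (2 matmuls) where SwiGLU needs 3. A minimal sketch, with an arbitrary hidden width:

```python
import torch
import torch.nn as nn

class ReluSquaredMLP(nn.Module):
    """Feed-forward block with the ReLU^2 activation: 2 matmuls,
    versus 3 for a SwiGLU block of comparable width."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x)) ** 2)  # relu(x)^2 is exactly 0 for x <= 0
```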
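Finally, attention: every layer uses plain causal softmax attention, with one K/V head per query head (no GQA) and no sliding window. A minimal sketch using PyTorch's built-in scaled dot-product attention; positional encoding (e.g., RoPE) is omitted and the class name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullCausalAttention(nn.Module):
    """Multi-head causal self-attention: one K/V head per query head (no GQA),
    and every token attends to its entire prefix (no sliding window)."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # full causal mask
        return self.proj(out.transpose(1, 2).reshape(B, T, C))
```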
- **Developed by:** xTimeCrystal
- **Model type:** Softmax self-attention decoder-only transformer
- **Language(s) (NLP):** English, Chinese, Python
- **License:** Apache 2.0
### Model Sources [optional]