Update README.md
README.md CHANGED

@@ -1,4 +1,4 @@
- From scratch pretraining on english only no synthetic data, no code, 3 epochs of 1 gig of data for the ~
+ From scratch pretraining on English only: no synthetic data, no code, 3 epochs of 1 GB of data for the ~135M param model.

  Test network using [Differential Transformer (Attention)](https://arxiv.org/abs/2410.05258). Other than some alterations to the attention, such as 16 heads instead of 9 and using differential attention, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
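For readers unfamiliar with the linked paper, the core idea of differential attention is to split the query/key projections into two groups, compute two softmax attention maps, and subtract a scaled copy of the second from the first, which cancels common-mode attention noise. Below is a minimal single-head NumPy sketch of just that map; the function name `diff_attention` and the scalar `lam` are illustrative assumptions (the paper learns λ via a reparameterization and adds per-head normalization and multi-head structure, none of which is shown here):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, Wq, Wk, Wv, lam):
    """Single-head differential attention sketch.

    x:      (seq, d_model) input activations
    Wq, Wk: (d_model, 2 * d_head) -- projections split into two Q/K groups
    Wv:     (d_model, d_head) value projection
    lam:    scalar weight on the second attention map (learned in the paper)
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_head = q.shape[-1] // 2
    q1, q2 = q[:, :d_head], q[:, d_head:]
    k1, k2 = k[:, :d_head], k[:, d_head:]
    scale = 1.0 / np.sqrt(d_head)
    a1 = softmax(q1 @ k1.T * scale)          # first attention map
    a2 = softmax(q2 @ k2.T * scale)          # second attention map
    return (a1 - lam * a2) @ v               # differential map applied to values
```

With `lam = 0` this reduces to ordinary softmax attention on the first Q/K group, which is a quick way to sanity-check the plumbing.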