Update README.md
README.md CHANGED

@@ -1,4 +1,4 @@
- From scratch pretraining on english only no synthetic data, no code, 3 epochs of 1 gig of data for the ~
+ From scratch pretraining on English only: no synthetic data, no code, 3 epochs of 1 GB of data for the ~135M param model.

  Test network using [Differential Transformer (Attention)](https://arxiv.org/abs/2410.05258). Other than some alterations to the attention, such as 16 heads instead of 9 and using differential attention, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
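For readers unfamiliar with the linked paper, the core idea of differential attention is to split the query/key projections into two groups, compute two softmax attention maps, and subtract a scaled copy of the second from the first, which cancels common-mode attention noise. Below is a minimal single-head NumPy sketch of just that map; the function name `diff_attention` and the scalar `lam` are illustrative assumptions (the paper learns λ via a reparameterization and adds per-head normalization and multi-head structure, none of which is shown here):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, Wq, Wk, Wv, lam):
    """Single-head differential attention sketch.

    x:      (seq, d_model) input activations
    Wq, Wk: (d_model, 2 * d_head) -- projections split into two Q/K groups
    Wv:     (d_model, d_head) value projection
    lam:    scalar weight on the second attention map (learned in the paper)
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_head = q.shape[-1] // 2
    q1, q2 = q[:, :d_head], q[:, d_head:]
    k1, k2 = k[:, :d_head], k[:, d_head:]
    scale = 1.0 / np.sqrt(d_head)
    a1 = softmax(q1 @ k1.T * scale)          # first attention map
    a2 = softmax(q2 @ k2.T * scale)          # second attention map
    return (a1 - lam * a2) @ v               # differential map applied to values
```

With `lam = 0` this reduces to ordinary softmax attention on the first Q/K group, which is a quick way to sanity-check the plumbing.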