Add model card metadata: pipeline tag, library name, link to paper, and link to code repository.
#1 by nielsr (HF Staff) · opened

README.md CHANGED
@@ -1,3 +1,9 @@
+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: text-generation
+---
+
 From scratch pretraining on English only: no synthetic data, no code, 3 epochs of 1 GB of data for the ~125M param model.
 
 Test network using [Tensor Product Attention](https://arxiv.org/abs/2501.06425). Other than some alterations to the attention, such as 16 heads instead of 9 and using TPA, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
@@ -21,3 +27,7 @@ One of the primary reported benefits for TPA are for inference which are not rea
 - Final Train Perplexity: 20.95
 
 
+
+# Code
+
+The code is available at: https://github.com/tensorgi/T6.
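For context on the Tensor Product Attention mentioned in the card: TPA builds each token's query/key/value heads as a sum of rank-R outer products between a per-head factor and a per-dimension factor, so only the small factors need to be cached at inference. The toy NumPy sketch below illustrates that factorization idea only; the sizes, weight matrices, and function name are made up for illustration and this is not the T6 repository's implementation (see the linked paper and code for the real one).

```python
import numpy as np

def tpa_project(x, W_a, W_b, rank):
    """Toy rank-R tensor-product projection (illustrative only).

    x:   (T, d_model) token activations
    W_a: (d_model, rank * n_heads) produces head factors a_r
    W_b: (d_model, rank * head_dim) produces dim factors b_r
    Returns (T, n_heads, head_dim), where each token's head matrix is
    (1/R) * sum_r outer(a_r, b_r) and therefore has rank at most R.
    """
    T = x.shape[0]
    n_heads = W_a.shape[1] // rank
    head_dim = W_b.shape[1] // rank
    A = (x @ W_a).reshape(T, rank, n_heads)   # per-token head factors
    B = (x @ W_b).reshape(T, rank, head_dim)  # per-token dim factors
    # Sum the R outer products per token: (T, n_heads, head_dim)
    return np.einsum('trh,trd->thd', A, B) / rank

# Toy sizes (16 heads as in the card; everything else arbitrary)
rng = np.random.default_rng(0)
T, d_model, n_heads, head_dim, rank = 4, 32, 16, 8, 2
Q = tpa_project(
    rng.standard_normal((T, d_model)),
    rng.standard_normal((d_model, rank * n_heads)),
    rng.standard_normal((d_model, rank * head_dim)),
    rank,
)
print(Q.shape)  # (4, 16, 8)
```

The cache-size benefit follows from storing only the factors: per token, `rank * (n_heads + head_dim)` numbers for K (and V) instead of `n_heads * head_dim`.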