# Zubenelgenubi
A 124M-parameter GPT-2 trained from scratch on FineWeb-Edu with knowledge distillation from SmolLM-135M-Instruct.
| Component | Value |
|---|---|
| Parameters | ~124M |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 512 tokens |
| Loss | 0.5 × CE + 0.5 × KL (distillation) |
| Hardware | 1× A100 |
| Training time | ~75 min |
| Training tokens | 327,680,000 (~328M) |
| Best loss | 326.0111 |
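The training code itself isn't shown here, so as a rough illustration of the objective in the table, here is a minimal NumPy sketch of a 0.5 CE + 0.5 KL distillation loss. The function name, the smoothing epsilon, and the optional temperature `T` are assumptions for illustration, not details from this model's training run.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, targets, alpha=0.5, T=1.0):
    """Weighted sum: alpha * CE(student, hard targets)
    + (1 - alpha) * KL(teacher || student) at temperature T."""
    eps = 1e-12  # guard against log(0)
    # Cross-entropy against the hard next-token targets.
    p = softmax(student_logits)
    n = len(targets)
    ce = -np.mean(np.log(p[np.arange(n), targets] + eps))
    # KL divergence from the teacher's softened distribution.
    pt = softmax(teacher_logits / T)
    ps = softmax(student_logits / T)
    kl = np.mean((pt * (np.log(pt + eps) - np.log(ps + eps))).sum(-1)) * T * T
    return alpha * ce + (1.0 - alpha) * kl
```

When the student's logits match the teacher's exactly, the KL term vanishes and the loss reduces to half the cross-entropy, which is a quick sanity check for an implementation like this.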