---
license: apache-2.0
language:
- en
tags:
- text-generation
- gpt2
- knowledge-distillation
- symbolic-reasoning
- from-scratch
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
---
# 124M GPT with Symbolic Reasoning Distillation
A **124M-parameter** GPT-2 trained **from scratch** on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) with **knowledge distillation** from [SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct).
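The checkpoint can be loaded like any GPT-2-style causal LM. The snippet below is a minimal usage sketch assuming the weights are stored in the standard `transformers` format; the repo id shown is a placeholder, not the actual Hub path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: replace with this model's actual Hub path.
repo_id = "your-username/gpt2-124m-symbolic-distill"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The water cycle begins when", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```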
Key configuration and training details:

| Component | Value |
|-----------|-------|
| Parameters | ~124M |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 512 tokens |
| Loss | 0.5 CE + 0.5 KL (cross-entropy + distillation KL) |
| Hardware | 1× A100 |
| Training time | ~75 min |
| Training tokens | 327,680,000 |
| Best loss | 326.0111 |
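The training objective weights hard-label cross-entropy and distillation KL equally. Below is a minimal PyTorch sketch of that objective under two assumptions not stated in this card: the student and teacher logits are over a shared vocabulary and aligned token positions, and the distillation temperature is 1.0.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=1.0):
    """0.5 * CE + 0.5 * KL, per the Loss row above.

    Assumes student and teacher logits share a vocabulary and token
    alignment; T=1.0 is an assumed temperature (none is stated).
    """
    vocab = student_logits.size(-1)
    student_flat = student_logits.view(-1, vocab)
    teacher_flat = teacher_logits.view(-1, vocab)

    # Hard-label cross-entropy against the FineWeb-Edu next-token targets.
    ce = F.cross_entropy(student_flat, labels.view(-1), ignore_index=-100)

    # KL divergence from teacher to student token distributions,
    # scaled by T^2 as is conventional for distillation.
    kl = F.kl_div(
        F.log_softmax(student_flat / T, dim=-1),
        F.softmax(teacher_flat / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Equal weighting, as listed under "Loss" in the table.
    return 0.5 * ce + 0.5 * kl
```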