Paper: [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
This model checkpoint is part of my project "Reproducing GPT-2 for Indic Languages". Check out the main repository here: https://github.com/Shaligram-Dewangan/GPT-2-for-Indic-Languages
This is a GPT-2-style model with ~124M parameters, pre-trained on ~20B tokens of English and Hindi text.
| Attribute | Details |
|---|---|
| Model Type | Decoder-only Transformer |
| Architecture | GPT (dense) |
| Number of Layers | 12 |
| Hidden Size | 768 |
| MLP Hidden Dim. | 3072 |
| Attention Heads | 12 |
| Context Length | 1024 |
| Vocab Size | 50,304 |
| Total Parameters | ~124M |
| Training Type | Pre-Trained |
| Dataset | FineWeb-Edu and FineWeb-2 |
| Languages | English and Hindi |
| Training Data Size | ~20 Billion tokens |
| Batch Size | 524,288 tokens |
| Activation | GELU |
| Training Time | ~14 hours on 1x H100 |
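
For concreteness, the table maps onto a configuration along the following lines. This is a minimal sketch using nanoGPT-style field names (`block_size`, `n_embd`, etc.), which may differ from the names used in the actual repository; the vocabulary-padding rationale in the comment is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    """Hyperparameters from the table above (field names are illustrative)."""
    block_size: int = 1024   # context length
    vocab_size: int = 50304  # likely GPT-2's 50,257 padded up to a multiple of 64
    n_layer: int = 12        # transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # hidden size; MLP hidden dim = 4 * n_embd = 3072


def approx_params(cfg: GPTConfig) -> int:
    """Back-of-the-envelope parameter count (weight matrices only,
    assuming tied input/output embeddings; biases and LayerNorms
    add only a small remainder)."""
    embeddings = (cfg.vocab_size + cfg.block_size) * cfg.n_embd
    # Per block: attention QKV + output projection ~ 4 * d^2,
    # MLP up and down projections ~ 2 * (4d * d) = 8 * d^2.
    blocks = cfg.n_layer * 12 * cfg.n_embd ** 2
    return embeddings + blocks


print(f"{approx_params(GPTConfig()) / 1e6:.1f}M")  # ~124.4M, matching the table
```

The arithmetic in `approx_params` also shows where the ~124M figure comes from: the token and position embeddings contribute roughly 39.4M parameters, and the 12 transformer blocks roughly 84.9M.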