---
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
tags:
- text-generation
- gpt
- tinystories
- from-scratch
- pytorch
- rope
- qk-norm
- muon
- multi-token-prediction
pipeline_tag: text-generation
---

# TinyStories GPT (19M)

A small (~19.2M parameter) decoder-only GPT trained **from scratch** on
[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). It writes
simple, coherent children's stories and serves as a compact, hackable reference for modern
LLM architecture and optimization techniques. It trains end-to-end in about 8 minutes on a
single consumer GPU (RTX 2060 Super, 8 GB).

This checkpoint uses the full **modded-nanoGPT-style recipe**: the **Muon** optimizer
plus **QK-Norm, a squared-ReLU MLP, logit soft-capping, and zero-init projections**. Each
technique was A/B-tested on the RTX 2060 Super; together they lower validation loss from **2.65**
(plain AdamW/SwiGLU baseline) to **2.40** at the same 3,000 steps.

## Sample output

> **Once upon a time,** there was a little girl named Lily. She loved to play with her
> toys and her favorite toy, a toy truck. One day, Lily's mommy made her a yummy chocolate
> cake to make her happy. Lily's friend, Timmy, came over to play...

> **Lily and Tom went to the park and** saw a big dog... "Mom, mom, the dog is coming!"
> Lily cried. "The dog is not mean. It was friendly and friendly. It wants to play with us."

## Architecture

A LLaMA-/modded-nanoGPT-style decoder-only transformer:

| Component | Choice |
|---|---|
| Layers / heads / dim | 8 layers, 6 heads, `n_embd` 384 |
| Context length | 256 tokens |
| Vocabulary | 16,384 (ByteLevel BPE) |
| Position encoding | **RoPE** |
| Attention | **Grouped-Query Attention** (2 KV heads) + **QK-Norm** |
| MLP | **squared-ReLU** (ungated) |
| Normalization | **RMSNorm** |
| Init | **zero-init** block output projections (muP-like) |
| Logits | **soft-capped** at 15 (`cap·tanh(logits/cap)`) |
| Extra heads | **Multi-Token Prediction** (2 auxiliary heads) |
| Weight tying | token embedding ↔ output head (and MTP heads) |

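A few of the choices in the table reduce to one-line formulas. The sketch below is illustrative only and is not taken from `model.py`: it writes the soft-cap and squared-ReLU as scalar functions and shows one common way to share 2 KV heads among 6 query heads. The function names and the contiguous head grouping are assumptions; the repo's implementation works on tensors and may group heads differently.

```python
import math

CAP = 15.0                 # soft-cap value from the table
N_HEAD, N_KV_HEAD = 6, 2   # query heads and KV heads from the table


def soft_cap(logit: float, cap: float = CAP) -> float:
    """Squash a logit smoothly into [-cap, cap] via cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)


def squared_relu(x: float) -> float:
    """Squared-ReLU activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2


def kv_head_for(query_head: int) -> int:
    """Map a query head to its shared KV head (assumes contiguous groups)."""
    return query_head // (N_HEAD // N_KV_HEAD)


if __name__ == "__main__":
    for raw in (-100.0, -1.0, 0.0, 1.0, 100.0):
        print(f"logit {raw:8.1f} -> capped {soft_cap(raw):8.4f}")
    print([squared_relu(v) for v in (-2.0, 0.5, 3.0)])
    print([kv_head_for(h) for h in range(N_HEAD)])
```

Small logits pass through the soft-cap almost unchanged, while very large ones saturate near ±15, which keeps the output distribution from becoming arbitrarily peaked.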

## Training

| Setting | Value |
|---|---|
| Dataset | TinyStories (~2.1M stories) |
| Steps | 3,000 |
| Batch | 40 × 256 tokens |
| Optimizer | **Muon** (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule |
| Precision | fp16 mixed precision, `torch.compile` |
| Hardware | 1× RTX 2060 Super (8 GB), ~8 minutes |
| Train loss | 2.47 (combined next-token + MTP auxiliary) |
| **Validation loss** | **2.40** (perplexity ~11.0) |

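The numbers in the table can be cross-checked with a little arithmetic. This sketch assumes the batch is 40 sequences of 256 tokens with no gradient accumulation, and that the validation loss is a mean cross-entropy in nats, so perplexity is `exp(loss)`. The variable names are illustrative.

```python
import math

steps = 3_000
sequences_per_batch = 40   # assumed: "40 x 256" means 40 sequences per step
tokens_per_sequence = 256
val_loss = 2.40            # assumed to be mean cross-entropy in nats

tokens_per_step = sequences_per_batch * tokens_per_sequence
total_tokens = steps * tokens_per_step
perplexity = math.exp(val_loss)

print(f"tokens per step: {tokens_per_step:,}")   # 10,240
print(f"tokens seen:     {total_tokens:,}")      # 30,720,000
print(f"perplexity:      {perplexity:.2f}")      # 11.02
```

Under these assumptions the run sees roughly 30.7M tokens in total, and the reported perplexity of ~11.0 matches `exp(2.40)`.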

## Usage

This is a **custom architecture**, so you need `model.py` from this repo (small,
dependency-light). Install the dependencies with
`pip install torch tokenizers huggingface_hub`, download `model.py` next to your
script, then:

```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from model import GPT  # model.py downloaded from this repo

# Repo and file names say "25m"; the model card reports ~19.2M parameters.
repo = "epoyraz/tinystories-25m"

# The checkpoint holds the model config and the state dict.
ckpt = torch.load(
    hf_hub_download(repo, "tinystories-25m.pt"),
    map_location="cpu", weights_only=True,
)
model = GPT(ckpt["config"]).eval()
model.load_state_dict(ckpt["model"])

# Encode a prompt, sample a continuation, and decode it back to text.
tok = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
ids = tok.encode("Once upon a time,").ids
out = model.generate(
    torch.tensor([ids]), max_new_tokens=120, temperature=0.7, top_k=40,
)
print(tok.decode(out[0].tolist()))
```

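The `temperature` and `top_k` arguments in the usage snippet control how each next token is sampled. The sketch below shows the usual technique in plain Python for a single step. It is an assumption about what such a `generate` method typically does, not the code in `model.py`, and `sample_top_k` is a hypothetical helper name.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample one token id from raw logits with temperature and top-k filtering."""
    # Keep only the ids of the top_k highest logits.
    k = min(top_k, len(logits))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1 * i for i in range(100)]  # toy scores, not model output
    rng = random.Random(0)
    print([sample_top_k(toy_logits, top_k=3, rng=rng) for _ in range(10)])
```

With `top_k=1` this reduces to greedy decoding, since only the highest-scoring token survives the filter.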

## Files

- `tinystories-25m.pt`: checkpoint (`config` + `model` state dict)
- `model.py`: model definition (`GPT`, all techniques)
- `config.json`: the model config, for reference
- `tokenizer.json`: ByteLevel BPE tokenizer (16K vocab)

## Limitations

- Trained only on TinyStories, so it writes simple children's-story English and is not a general assistant.
- Small and lightly trained: expect occasional repetition, name swaps, or drift.
- Limited to a 256-token context.

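Given the 256-token context, a long prompt plus the requested new tokens may not fit. This card does not say whether `model.generate` crops its input, so one cautious option is to trim the prompt first. `fit_prompt` below is a hypothetical helper, not part of this repo; it keeps the most recent prompt tokens so that prompt and continuation together stay within the context.

```python
CONTEXT_LEN = 256  # context length stated in this model card


def fit_prompt(ids, max_new_tokens, context_len=CONTEXT_LEN):
    """Keep the last prompt tokens so prompt + new tokens fit in the context."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for a prompt")
    # Slicing from the end returns short prompts unchanged.
    return ids[-budget:]


if __name__ == "__main__":
    long_prompt = list(range(300))  # stand-in token ids
    trimmed = fit_prompt(long_prompt, max_new_tokens=120)
    print(len(trimmed), trimmed[0], trimmed[-1])
```

Trimming from the front keeps the text closest to the generation point, which usually matters most for continuing a story.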

## References

- [TinyStories](https://arxiv.org/abs/2305.07759)
- [RoFormer / RoPE](https://arxiv.org/abs/2104.09864)
- [GQA](https://arxiv.org/abs/2305.13245)
- [DeepSeek-V3 (MTP)](https://arxiv.org/abs/2412.19437)
- [Muon optimizer](https://kellerjordan.github.io/posts/muon/) · [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt)