---
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
tags:
- text-generation
- gpt
- tinystories
- from-scratch
- pytorch
- rope
- qk-norm
- muon
- multi-token-prediction
pipeline_tag: text-generation
---

# TinyStories GPT (19M)

A small (~19.2M parameter) decoder-only GPT trained **from scratch** on [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). It writes simple, coherent children's stories and is a compact, hackable reference for modern LLM architecture + optimization techniques, trained end-to-end in a few minutes on a single consumer GPU (RTX 2060 Super, 8 GB).

This checkpoint uses the full **modded-nanoGPT-style recipe**: the **Muon** optimizer plus **QK-Norm + squared-ReLU MLP + logit soft-capping + zero-init projections**. Each technique was A/B-measured on the 2060; together they lower validation loss from **2.65** (plain AdamW/SwiGLU baseline) to **2.40** at the same 3,000 steps.

## Sample output

> **Once upon a time,** there was a little girl named Lily. She loved to play with her
> toys and her favorite toy, a toy truck. One day, Lily's mommy made her a yummy chocolate
> cake to make her happy. Lily's friend, Timmy, came over to play...

> **Lily and Tom went to the park and** saw a big dog... "Mom, mom, the dog is coming!"
> Lily cried. "The dog is not mean. It was friendly and friendly. It wants to play with us."

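
For a sense of scale, the two validation losses quoted in the introduction can be converted to perplexity. This is a minimal sketch that assumes both figures are mean next-token cross-entropy in nats; the `perplexity` helper is illustrative and not part of this repo.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(loss_nats)


# Validation losses quoted in the introduction (assumed to be in nats per token).
baseline_ppl = perplexity(2.65)  # plain AdamW/SwiGLU baseline
full_ppl = perplexity(2.40)      # full modded-nanoGPT-style recipe

print(f"baseline perplexity:    {baseline_ppl:.2f}")
print(f"full-recipe perplexity: {full_ppl:.2f}")
print(f"relative reduction:     {1 - full_ppl / baseline_ppl:.1%}")
```

Under that assumption, the recipe moves perplexity from roughly 14.2 to 11.0.
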
## Architecture

A LLaMA-/modded-nanoGPT-style decoder-only transformer:

| Component | Choice |
|---|---|
| Layers / heads / dim | 8 layers, 6 heads, `n_embd` 384 |
| Context length | 256 tokens |
| Vocabulary | 16,384 (ByteLevel BPE) |
| Position encoding | **RoPE** |
| Attention | **Grouped-Query Attention** (2 KV heads) + **QK-Norm** |
| MLP | **squared-ReLU** (ungated) |
| Normalization | **RMSNorm** |
| Init | **zero-init** block output projections (muP-like) |
| Logits | **soft-capped** at 15 (`cap·tanh(logits/cap)`) |
| Extra heads | **Multi-Token Prediction** (2 auxiliary heads) |
| Weight tying | token embedding ↔ output head (and MTP heads) |

## Training

| | |
|---|---|
| Dataset | TinyStories (~2.1M stories) |
| Steps | 3,000 |
| Batch | 40 × 256 tokens |
| Optimizer | **Muon** (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule |
| Precision | fp16 mixed precision, `torch.compile` |
| Hardware | 1× RTX 2060 Super (8 GB), ~8 minutes |
| Train loss | 2.47 (combined next-token + MTP auxiliary) |
| **Validation loss** | **2.40** (perplexity ~11.0) |

## Usage

This is a **custom architecture**, so you need `model.py` from this repo (small, dependency-light).

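
One way to fetch `model.py` is through `huggingface_hub`. This is a minimal sketch that assumes the file sits at the repo root under that name, as the Files list suggests; `fetch_model_py` is a hypothetical helper, not something the repo provides.

```python
from pathlib import Path


def fetch_model_py(repo_id: str = "epoyraz/tinystories-25m", dest_dir: str = ".") -> Path:
    """Download model.py from the Hub into dest_dir and return its local path."""
    # Imported lazily so defining this helper does not require huggingface_hub.
    from huggingface_hub import hf_hub_download

    # Assumption: model.py is stored at the repository root.
    local_path = hf_hub_download(repo_id=repo_id, filename="model.py", local_dir=dest_dir)
    return Path(local_path)
```

Calling `fetch_model_py()` from the directory that holds your script should leave `model.py` beside it, so that `from model import GPT` resolves.
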
Download it next to your script, then:

```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from model import GPT  # model.py downloaded from this repo

repo = "epoyraz/tinystories-25m"
ckpt = torch.load(
    hf_hub_download(repo, "tinystories-25m.pt"),
    map_location="cpu",
    weights_only=True,
)
model = GPT(ckpt["config"]).eval()
model.load_state_dict(ckpt["model"])

tok = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
ids = tok.encode("Once upon a time,").ids
out = model.generate(
    torch.tensor([ids]),
    max_new_tokens=120,
    temperature=0.7,
    top_k=40,
)
print(tok.decode(out[0].tolist()))
```

`pip install torch tokenizers huggingface_hub`

## Files

- `tinystories-25m.pt`: checkpoint (`config` + `model` state dict)
- `model.py`: model definition (`GPT`, all techniques)
- `config.json`: the model config, for reference
- `tokenizer.json`: ByteLevel BPE tokenizer (16K vocab)

## Limitations

- Trained only on TinyStories: simple children's-story English, not a general assistant.
- Small and lightly trained: occasional repetition, name swaps, or drift.
- 256-token context.

## References

- [TinyStories](https://arxiv.org/abs/2305.07759)
- [RoFormer / RoPE](https://arxiv.org/abs/2104.09864)
- [GQA](https://arxiv.org/abs/2305.13245)
- [DeepSeek-V3 (MTP)](https://arxiv.org/abs/2412.19437)
- [Muon optimizer](https://kellerjordan.github.io/posts/muon/) · [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt)