---
license: mit
datasets:
  - roneneldan/TinyStories
language:
  - en
tags:
  - text-generation
  - gpt
  - tinystories
  - from-scratch
  - pytorch
  - rope
  - qk-norm
  - muon
  - multi-token-prediction
pipeline_tag: text-generation
---

# TinyStories GPT (19M)

A small (~19.2M parameter) decoder-only GPT trained **from scratch** on
[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). It writes
simple, coherent children's stories and serves as a compact, hackable reference for
modern LLM architecture and optimization techniques. Training runs end-to-end in about
8 minutes on a single consumer GPU (RTX 2060 Super, 8 GB).

This checkpoint uses the full **modded-nanoGPT-style recipe**: the **Muon** optimizer
plus **QK-Norm**, a **squared-ReLU MLP**, **logit soft-capping**, and **zero-init
projections**. Each technique was A/B-tested on the RTX 2060 Super; together they lower
validation loss from **2.65** (plain AdamW/SwiGLU baseline) to **2.40** at the same
3,000 steps.
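
Two of the pieces named above are small enough to sketch in plain Python. This is an illustration of the formulas only, not the code in `model.py`: the function names `soft_cap` and `squared_relu` are hypothetical, and the cap of 15 is the value listed in the Architecture table.

```python
import math

def soft_cap(logits, cap=15.0):
    """Squash each logit smoothly into (-cap, cap) via cap * tanh(logit / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]

def squared_relu(x):
    """Squared-ReLU activation: max(x, 0) squared."""
    return max(x, 0.0) ** 2

# Small logits pass through almost unchanged; large ones saturate near +/- cap.
capped = soft_cap([-100.0, -1.0, 0.0, 1.0, 100.0])

# Negative inputs are zeroed, positive inputs are squared.
activations = [squared_relu(v) for v in (-2.0, 0.0, 0.5, 3.0)]
```

On tensors, the same operations would typically be written as `cap * torch.tanh(logits / cap)` and `torch.relu(x).square()`.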

## Sample output

> **Once upon a time,** there was a little girl named Lily. She loved to play with her
> toys and her favorite toy, a toy truck. One day, Lily's mommy made her a yummy chocolate
> cake to make her happy. Lily's friend, Timmy, came over to play...

> **Lily and Tom went to the park and** saw a big dog... "Mom, mom, the dog is coming!"
> Lily cried. "The dog is not mean. It was friendly and friendly. It wants to play with us."

## Architecture

A LLaMA-/modded-nanoGPT-style decoder-only transformer:

| Component | Choice |
|---|---|
| Layers / heads / dim | 8 layers, 6 heads, `n_embd` 384 |
| Context length | 256 tokens |
| Vocabulary | 16,384 (ByteLevel BPE) |
| Position encoding | **RoPE** |
| Attention | **Grouped-Query Attention** (2 KV heads) + **QK-Norm** |
| MLP | **squared-ReLU** (ungated) |
| Normalization | **RMSNorm** |
| Init | **zero-init** block output projections (muP-like) |
| Logits | **soft-capped** at 15 (`cap·tanh(logits/cap)`) |
| Extra heads | **Multi-Token Prediction** (2 auxiliary heads) |
| Weight tying | token embedding ↔ output head (and MTP heads) |
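
As a rough sanity check on the parameter count, the dimensions in the table can be plugged into a back-of-the-envelope estimate. The sketch below rests on two assumptions that this card does not state: the MLP hidden width is `4 * n_embd`, and the linear layers have no bias terms. Norm gains and any parameters specific to the MTP heads are left out.

```python
# Dimensions taken from the table above.
n_layer, n_head, n_kv_head, n_embd, vocab = 8, 6, 2, 384, 16_384
head_dim = n_embd // n_head            # 64
queries_per_kv = n_head // n_kv_head   # 3 query heads share each KV head (GQA)

# ASSUMPTIONS (not stated in this card): 4x MLP expansion, no biases.
mlp_hidden = 4 * n_embd

embedding = vocab * n_embd             # counted once: tied with the output head
attn = (
    n_embd * n_embd                        # Q projection
    + 2 * n_embd * (n_kv_head * head_dim)  # K and V projections (2 KV heads only)
    + n_embd * n_embd                      # attention output projection
)
mlp = 2 * n_embd * mlp_hidden          # ungated MLP: up and down projections
total = embedding + n_layer * (attn + mlp)

print(f"{total / 1e6:.2f}M parameters (estimate)")
```

Under these assumptions the estimate comes to roughly 18.9M, which is in the same range as the stated ~19.2M; the uncounted norm and MTP parameters would account for some of the gap.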

## Training

| | |
|---|---|
| Dataset | TinyStories (~2.1M stories) |
| Steps | 3,000 |
| Batch | 40 × 256 tokens |
| Optimizer | **Muon** (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule |
| Precision | fp16 mixed precision, `torch.compile` |
| Hardware | 1× RTX 2060 Super (8 GB), ~8 minutes |
| Train loss | 2.47 (combined next-token + MTP auxiliary) |
| **Validation loss** | **2.40** (perplexity ~11.0) |
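
The loss and perplexity figures are related by `perplexity = exp(loss)`, assuming the loss is the usual mean cross-entropy in nats per token. The short calculation below reproduces that conversion and estimates the token budget; the token count assumes one 40 × 256 batch per optimizer step with no gradient accumulation, which this card does not state explicitly.

```python
import math

val_loss = 2.40       # this checkpoint
baseline_loss = 2.65  # plain AdamW/SwiGLU baseline

perplexity = math.exp(val_loss)
baseline_perplexity = math.exp(baseline_loss)

# ASSUMPTION: one batch of 40 sequences x 256 tokens per optimizer step.
tokens_per_step = 40 * 256
total_tokens = tokens_per_step * 3_000

print(f"perplexity: {perplexity:.1f} (baseline {baseline_perplexity:.1f})")
print(f"tokens seen: {total_tokens:,}")
```

Under that assumption the run sees about 30.7M tokens in total.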

## Usage

This is a **custom architecture**, so you need `model.py` from this repo (small,
dependency-light). Download it next to your script, then:

```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
from model import GPT  # model.py downloaded from this repo

repo = "epoyraz/tinystories-25m"
# weights_only=True restricts unpickling to tensors and plain Python containers.
ckpt = torch.load(
    hf_hub_download(repo, "tinystories-25m.pt"),
    map_location="cpu", weights_only=True,
)
model = GPT(ckpt["config"]).eval()  # eval mode: inference only
model.load_state_dict(ckpt["model"])

tok = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
ids = tok.encode("Once upon a time,").ids
with torch.no_grad():  # no gradients needed for sampling
    out = model.generate(
        torch.tensor([ids]), max_new_tokens=120, temperature=0.7, top_k=40,
    )
print(tok.decode(out[0].tolist()))
```

Install the dependencies with `pip install torch tokenizers huggingface_hub`.
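
The `temperature` and `top_k` arguments in the snippet above control how the next token is drawn. The pure-Python sketch below shows how this kind of sampling commonly works; whether `model.generate` implements it exactly this way is an assumption, and `sample_top_k` and `toy_logits` are hypothetical names used only for illustration.

```python
import math
import random

def sample_top_k(logits, temperature=0.7, top_k=40, rng=random):
    """Sample one token id from raw logits with temperature and top-k filtering."""
    # Keep only the top_k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Unnormalized softmax weights (subtract the max for numerical stability).
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # hypothetical 5-token vocabulary
token_id = sample_top_k(toy_logits, top_k=2, rng=random.Random(0))
```

With `top_k=2`, only token ids 3 and 0 (the two largest logits) can ever be chosen; lowering `temperature` shifts more probability toward id 3.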

## Files

- `tinystories-25m.pt` — checkpoint (`config` + `model` state dict)
- `model.py` — model definition (`GPT`, all techniques)
- `config.json` — the model config, for reference
- `tokenizer.json` — ByteLevel BPE tokenizer (16K vocab)

## Limitations

- Trained only on TinyStories — simple children's-story English, not a general assistant.
- Small and lightly trained: occasional repetition, name swaps, or drift.
- 256-token context.
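
For long prompts, one simple way to stay inside the 256-token window is to keep only the most recent prompt tokens before generating. The helper below is a hypothetical sketch (`crop_prompt` is not part of this repo), and it is not stated here whether `model.generate` already crops its input internally.

```python
def crop_prompt(ids, max_new_tokens, context_len=256):
    """Keep the most recent prompt tokens so prompt + continuation fits the window."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than the context length")
    return ids[-budget:]

long_prompt = list(range(300))              # stand-in for 300 token ids
cropped = crop_prompt(long_prompt, max_new_tokens=120)
```

Here the 300-token stand-in prompt is cut to its last 136 tokens, leaving room for 120 generated tokens.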

## References

- [TinyStories](https://arxiv.org/abs/2305.07759)
- [RoFormer / RoPE](https://arxiv.org/abs/2104.09864)
- [GQA](https://arxiv.org/abs/2305.13245)
- [DeepSeek-V3 (MTP)](https://arxiv.org/abs/2412.19437)
- [Muon optimizer](https://kellerjordan.github.io/posts/muon/) · [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt)