	---
	license: mit
	datasets:
	- roneneldan/TinyStories
	language:
	- en
	tags:
	- text-generation
	- gpt
	- tinystories
	- from-scratch
	- pytorch
	- rope
	- qk-norm
	- muon
	- multi-token-prediction
	pipeline_tag: text-generation
	---

	# TinyStories GPT (19M)

	A small (~19.2M-parameter) decoder-only GPT trained from scratch on
	[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). It writes
	simple, coherent children's stories and serves as a compact, hackable reference for
	modern LLM architecture and optimization techniques. Training takes about 8 minutes
	end to end on a single consumer GPU (RTX 2060 Super, 8 GB).

	This checkpoint uses the full modded-nanoGPT-style recipe: the Muon optimizer plus
	QK-Norm, a squared-ReLU MLP, logit soft-capping, and zero-init projections. Each
	technique was A/B-measured on the same RTX 2060 Super; together they lower validation
	loss from 2.65 (plain AdamW/SwiGLU baseline) to 2.40 at the same 3,000 steps.
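
Of the techniques listed above, logit soft-capping is the simplest to show in isolation. The sketch below applies `cap * tanh(logit / cap)` with the cap of 15 used by this model. It is a plain-Python illustration on a list of floats, not the code from `model.py`, and the function name `soft_cap` is made up for this example.

```python
import math

def soft_cap(logits, cap=15.0):
    """Squash each logit smoothly into the open interval (-cap, cap)."""
    return [cap * math.tanh(x / cap) for x in logits]

# Small logits pass through almost unchanged; large ones saturate near +/- cap.
capped = soft_cap([-100.0, -1.0, 0.0, 1.0, 100.0])
print(capped)
```

Because `tanh` is close to the identity near zero, typical logits are barely affected, while outliers can no longer grow without bound.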

	## Sample output

	> Once upon a time, there was a little girl named Lily. She loved to play with her
	> toys and her favorite toy, a toy truck. One day, Lily's mommy made her a yummy chocolate
	> cake to make her happy. Lily's friend, Timmy, came over to play...

	> Lily and Tom went to the park and saw a big dog... "Mom, mom, the dog is coming!"
	> Lily cried. "The dog is not mean. It was friendly and friendly. It wants to play with us."

	## Architecture

	A LLaMA-/modded-nanoGPT-style decoder-only transformer:

	| Component | Choice |
	|---|---|
	| Layers / heads / dim | 8 layers, 6 heads, `n_embd` 384 |
	| Context length | 256 tokens |
	| Vocabulary | 16,384 (ByteLevel BPE) |
	| Position encoding | RoPE |
	| Attention | Grouped-Query Attention (2 KV heads) + QK-Norm |
	| MLP | squared-ReLU (ungated) |
	| Normalization | RMSNorm |
	| Init | zero-init block output projections (muP-like) |
	| Logits | soft-capped at 15 (`cap·tanh(logits/cap)`) |
	| Extra heads | Multi-Token Prediction (2 auxiliary heads) |
	| Weight tying | token embedding ↔ output head (and MTP heads) |
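
As a rough sanity check on the ~19.2M figure, the table's numbers can be turned into a back-of-the-envelope parameter count. The sketch below rests on assumptions that the card does not state: a head dimension of `n_embd / n_head`, an MLP hidden width of 4 × `n_embd`, no bias terms, and tied output heads that add no weights of their own. Norm weights and any MTP-specific parameters are ignored, so the estimate should come out slightly below the reported size.

```python
# Values from the architecture table.
n_layer, n_head, n_kv_head, n_embd, vocab = 8, 6, 2, 384, 16_384

# Assumptions (not stated in the card): head_dim and a 4x MLP expansion.
head_dim = n_embd // n_head          # 64
kv_dim = n_kv_head * head_dim        # 128, shared across query heads (GQA)
mlp_hidden = 4 * n_embd              # 1536

attn = n_embd * n_embd               # query projection
attn += 2 * n_embd * kv_dim          # key and value projections
attn += n_embd * n_embd              # output projection
mlp = 2 * n_embd * mlp_hidden        # up and down projections (ungated)

embedding = vocab * n_embd           # counted once because of weight tying
total = n_layer * (attn + mlp) + embedding
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate is about 18.9M, in the same range as the stated ~19.2M.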

	## Training

	| Setting | Value |
	|---|---|
	| Dataset | TinyStories (~2.1M stories) |
	| Steps | 3,000 |
	| Batch | 40 × 256 tokens |
	| Optimizer | Muon (2D weights) + AdamW (embeddings/norms), peak LR 3e-3, cosine schedule |
	| Precision | fp16 mixed precision, `torch.compile` |
	| Hardware | 1× RTX 2060 Super (8 GB), ~8 minutes |
	| Train loss | 2.47 (combined next-token + MTP auxiliary) |
	| Validation loss | 2.40 (perplexity ~11.0) |
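
The perplexity and the training-token budget follow directly from the numbers in the table. The short calculation below assumes the validation loss is a mean cross-entropy in nats per token (the usual PyTorch convention) and that "40 × 256" means 40 sequences of 256 tokens per optimizer step with no gradient accumulation; neither is spelled out in the card.

```python
import math

val_loss = 2.40        # validation loss from the table
baseline_loss = 2.65   # plain AdamW/SwiGLU baseline

# Perplexity is the exponential of the mean cross-entropy (in nats).
perplexity = math.exp(val_loss)
baseline_perplexity = math.exp(baseline_loss)

# Tokens processed during training, under the batch assumption above.
tokens_per_step = 40 * 256
total_tokens = tokens_per_step * 3_000

print(f"perplexity: {perplexity:.2f} (baseline {baseline_perplexity:.2f})")
print(f"tokens per step: {tokens_per_step:,}, total: {total_tokens:,}")
```

Read this way, the run sees roughly 30.7M tokens, and the recipe cuts perplexity from about 14.2 to about 11.0.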

	## Usage

	This is a custom architecture, so you need `model.py` from this repo (small,
	dependency-light). Download it next to your script, then:

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from tokenizers import Tokenizer
	from model import GPT  # model.py downloaded from this repo

	repo = "epoyraz/tinystories-25m"

	# Fetch the checkpoint and load it on CPU.
	ckpt = torch.load(
	    hf_hub_download(repo, "tinystories-25m.pt"),
	    map_location="cpu",
	    weights_only=True,
	)
	model = GPT(ckpt["config"]).eval()
	model.load_state_dict(ckpt["model"])

	# Tokenize a prompt, sample a continuation, and decode it back to text.
	tok = Tokenizer.from_file(hf_hub_download(repo, "tokenizer.json"))
	ids = tok.encode("Once upon a time,").ids
	out = model.generate(
	    torch.tensor([ids]), max_new_tokens=120, temperature=0.7, top_k=40,
	)
	print(tok.decode(out[0].tolist()))
	```

	Install the dependencies with `pip install torch tokenizers huggingface_hub`.

	## Files

	- `tinystories-25m.pt` — checkpoint (`config` + `model` state dict)
	- `model.py` — model definition (`GPT`, all techniques)
	- `config.json` — the model config, for reference
	- `tokenizer.json` — ByteLevel BPE tokenizer (16K vocab)

	## Limitations

	- Trained only on TinyStories — simple children's-story English, not a general assistant.
	- Small and lightly trained: occasional repetition, name swaps, or drift.
	- 256-token context.
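
Because of the 256-token context, a long prompt plus the requested continuation may not fit in the window. The card does not say whether `model.generate` crops its input internally, so the helper below is a hypothetical pre-processing step (`crop_prompt` is not part of `model.py`) that keeps only the most recent prompt tokens and leaves room for the new ones.

```python
def crop_prompt(ids, block_size=256, max_new_tokens=120):
    """Keep the most recent token ids so prompt + generation fits in block_size."""
    budget = block_size - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than block_size")
    return ids[-budget:]

# A 500-token prompt is trimmed to its last 136 tokens (256 - 120).
long_prompt = list(range(500))
cropped = crop_prompt(long_prompt)
print(len(cropped))
```

Short prompts pass through unchanged, so the helper can be applied unconditionally before building the input tensor.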

	## References

	- [TinyStories](https://arxiv.org/abs/2305.07759)
	- [RoFormer / RoPE](https://arxiv.org/abs/2104.09864)
	- [GQA](https://arxiv.org/abs/2305.13245)
	- [DeepSeek-V3 (MTP)](https://arxiv.org/abs/2412.19437)
	- [Muon optimizer](https://kellerjordan.github.io/posts/muon/) · [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt)