gary-4-petite
gary-4 fit in 69 KB but spoke word salad. gary-4-petite fits in ~0.6 MB and speaks actual English.
Same numpy-only soul as gary-4: a real transformer, int8 weights, a ~110-line pure-numpy inference engine, zero ML frameworks. One idea changed, and the word salad turned into stories.
The one idea
gary-4 wasn't incoherent because it was small. It was incoherent because 67K parameters were asked to model The Pile: code, math, academic papers, scraped web junk, dozens of languages. That is one of the hardest corpora in existence, and no tiny model can compress it into coherent English.
The fix comes straight from the research line that asked exactly this question, Microsoft's TinyStories (Eldan & Li, 2023) and SimpleStories (Finke et al., NeurIPS 2025): the dominant lever for coherence at tiny scale is corpus simplicity, not model size. Train a small model on simple, consistent English and coherence appears well under 10M parameters, even around 1M.
gary-4-petite applies three changes, in order of impact:
| # | Change | gary-4 | gary-4-petite | Why it matters |
|---|---|---|---|---|
| 1 | Corpus | The Pile | TinyStories | The big one. Simple-English stories are learnable by a tiny model. |
| 2 | Tokenizer | char-level | byte-level BPE (2048) | Each parameter does semantic work instead of memorizing spelling. No more misspelled words. |
| 3 | Depth | 2 layers | 4 layers | Depth buys coherence at fixed width. |
Everything else is deliberately identical to gary-4: pre-LayerNorm GPT, learned positions, weight-tied embeddings, per-tensor int8 weights, and a pure-numpy forward pass.
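Of the shared pieces, "per-tensor int8 weights" is the easiest to show in a few lines. What follows is a minimal sketch, assuming symmetric quantization with one float scale per tensor (max-abs divided by 127). The real scheme lives in export_int8.py and may differ. The function names and the example tensor shape are hypothetical.

```python
import numpy as np

def quantize_per_tensor(w):
    """Symmetric int8 quantization with a single scale for the whole tensor."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(96, 384)).astype(np.float32)  # illustrative shape only
q, scale = quantize_per_tensor(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # bounded by half a quantization step
```

Under this scheme each weight costs one byte plus one shared float per tensor, and the worst-case rounding error is half of `scale`.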
Stats
| | |
|---|---|
| Parameters | 656,448 |
| Weights (int8) | 571 KB |
| Full release (model + tokenizer + engine) | ~0.6 MB |
| Architecture | 4-layer, 4-head, 96-dim BPE GPT, 128 ctx |
| Tokenizer | byte-level BPE, vocab 2048 |
| Training data | TinyStories (roneneldan/TinyStoriesV2-GPT4) |
| Training | pure-numpy, CPU only, ~3.1M tokens, val loss 7.62 → 3.41 |
| Dependencies | numpy. that's it. |
| Hardware needed | literally anything that runs python |
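The parameter count can be rebuilt from the architecture row. The sketch below assumes a standard GPT-2-style layout: biases on every linear layer, a 4x MLP expansion, scale and bias on each LayerNorm, and an output head tied to the token embedding. Those layout details are assumptions and are not stated in this card, though they do reproduce the listed figure exactly.

```python
vocab, ctx, d, layers = 2048, 128, 96, 4
mlp = 4 * d  # assumed 4x MLP expansion

tok_emb = vocab * d                        # shared with the output head (weight tying)
pos_emb = ctx * d                          # learned positions
attn = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV + output projection, with biases
ffn = (d * mlp + mlp) + (mlp * d + d)      # two linear layers, with biases
norms = 2 * (2 * d)                        # two LayerNorms per block, scale + bias each
per_layer = attn + ffn + norms
final_norm = 2 * d

total = tok_emb + pos_emb + layers * per_layer + final_norm
print(total)  # 656448
```

Under these assumptions the token embedding table alone accounts for 196,608 parameters, roughly 30% of the model.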
Training loss (val), trained entirely on a 2-core CPU in numpy:
step 31 val 7.25 (~random init, ln 2048 = 7.62)
step 235 val 4.19
step 379 val 3.80
step 517 val 3.58
step 649 val 3.48
step 746 val 3.41
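To read these numbers: a model that gives equal probability to all 2048 tokens has a cross-entropy of ln 2048, and exp(loss) converts a loss in nats into perplexity. A small stdlib-only sketch of that arithmetic:

```python
import math

vocab_size = 2048
uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess, in nats
final_loss = 3.41                     # last val loss in the log above

print(f"uniform baseline: {uniform_loss:.2f} nats")
print(f"perplexity at val {final_loss}: {math.exp(final_loss):.1f}")
print(f"bits per token: {final_loss / math.log(2):.2f}")
```

At val 3.41 the model is choosing among about 30 effective options per token, down from 2048 at initialization.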
Run it
pip install numpy
python chat.py "Once upon a time" # one-shot continuation
python chat.py # interactive
What it sounds like
Real outputs (seed 0, temp 0.7), unedited:
Tom and Lily went to the park. They had a home with the pretty house and laughed. They had fun. They wanted to play and play with it. Finally, they all played with the toys. They were best friends. They were very happy. They had fun. They played with their mom and said, "I want to play in the park." So, the other bunny and Sue were very happy together. They became best friends and being careful. And all played together.
Once upon a time, there was a little boy named Sue. Spot liked the ground. One day, Tim saw a big, but it was very naughty. He wanted to play with his mom. Tim thought it was to help. Then, Tim was so excited. He said, "I want to play it." ... Tim was so happy that he found his name, but Tom ran to each other and went to play together.
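"seed 0, temp 0.7" refers to the sampling settings. Here is a minimal sketch of temperature sampling from next-token logits, assuming plain softmax sampling with no top-k or nucleus filtering. chat.py may do this differently. `sample_token` is a hypothetical name, and the logits are toy values that did not come from the model.

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Draw one token id from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                 # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)             # a fixed seed makes the draw repeatable
logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy logits for a 4-token vocabulary
token = sample_token(logits, temperature=0.7, rng=rng)
```

Temperatures below 1 sharpen the distribution toward the most likely tokens. As the temperature approaches 0, sampling approaches greedy argmax decoding.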
Head-to-head with gary-4
Same prompts, both models, temp 0.7:
| Prompt | gary-4 (67K, char, Pile) | gary-4-petite (656K, BPE, TinyStories) |
|---|---|---|
| Once upon a time | . expectations: managed. | Once upon a time, there was a little boy named Sue... (coherent paragraph) |
| One day, a dog | le / i generate text one character at a time. | One day, a dog named Tim. The dog felt sad... (coherent story) |
gary-4 can't continue a story: off its 14 trained chat prompts, it emits character-level fragments. petite writes paragraphs.
What it actually is (honest section)
656K parameters, 3.1M tokens of CPU training (15 minutes of compute, the same "trained in an afternoon" spirit as gary-4). It is a real, coherent small language model in the TinyStories register: grammatical sentences, named characters, dialogue, simple story arcs, even the occasional "the moral of the story."
It still slips: it will invent a word, swap a character's name mid-story, or lose the thread after a few sentences. It is not an assistant; it's a tiny story engine. It was deliberately stopped early (val 3.41) to fit a short CPU budget; the training pipeline is included, and TinyStories-scale models trained to convergence (val ~2.7-3.0) read noticeably smoother. The point stands either way: changing the corpus, not just the parameter count, is what turns a tiny model coherent.
Files
- gary4petite.int8.npz: the model, int8. 571 KB. the whole point.
- model.safetensors: fp32 weights for the curious (and the HF param badge).
- vocab.json + merges.txt: the byte-level BPE tokenizer.
- chat.py: full inference engine + a from-scratch pure-python BPE codec. numpy + stdlib only.
- config.json: architecture + training metadata.
- training/: the complete, reproducible pipeline (see below).
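For readers new to byte-level BPE, here is a toy sketch of the idea behind the tokenizer files: start from raw bytes and repeatedly fuse the most frequent adjacent pair. This is not the code in chat.py or prepare.py, and every function name is hypothetical. GPT-2-style tokenizers also pre-split text with a regex and remap bytes to printable characters. Both steps are left out here.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn merges greedily: repeatedly fuse the most frequent adjacent pair."""
    ids = list(text.encode("utf-8"))       # start from raw bytes, ids 0..255
    merges = {}                            # (left, right) -> new token id
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges[pair] = 256 + step
        ids = merge_pair(ids, pair, 256 + step)
    return merges

def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():    # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Expand merged tokens back into bytes, then into text."""
    table = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        table[new_id] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8")

corpus = "once upon a time there was a little dog. the dog was happy."
merges = train_bpe(corpus, num_merges=20)
ids = encode("the little dog", merges)
```

With a vocabulary of 2048, a real tokenizer of this kind would hold the 256 base bytes plus up to 1792 learned merges, less any reserved special tokens.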
Reproduce / train it further
The whole thing was trained in pure numpy, no torch. training/ contains:
- prepare.py: trains the BPE tokenizer and tokenizes TinyStories to .bin.
- gpt_numpy.py: the model: forward + hand-written, gradient-checked backprop + Adam.
- train_burst.py: resumable, checkpointed training loop.
- export_int8.py: quantize to int8 and package this folder.
cd training
mkdir -p data
# 1. get the corpus
curl -L -o data/valid.txt \
https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
# 2. tokenize -> trains BPE + writes train.bin/val.bin
python prepare.py
# 3. train (repeat to taste; each run resumes from ckpt.npz)
SECONDS=60 LR=1e-3 TMAX=4000 python train_burst.py
# 4. package -> writes a fresh gary4-petite/ release folder
python export_int8.py
Train past val ~3.0 for smoother prose.
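On the "gradient-checked" backprop in gpt_numpy.py: the usual technique is to compare a hand-written gradient against central finite differences. The sketch below does that on a toy layer, not on the actual model. The function names and the layer itself are made up for illustration.

```python
import numpy as np

def loss_and_grad(W, x):
    """Toy layer: loss = 0.5 * sum(tanh(x @ W) ** 2), with a hand-written gradient."""
    y = np.tanh(x @ W)
    loss = 0.5 * np.sum(y ** 2)
    dz = y * (1.0 - y ** 2)        # chain rule through the square and the tanh
    return loss, x.T @ dz

def numerical_grad(W, x, h=1e-5):
    """Central differences, perturbing one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        plus, _ = loss_and_grad(W, x)
        W[idx] = old - h
        minus, _ = loss_and_grad(W, x)
        W[idx] = old               # restore the original weight
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 3))
_, analytic = loss_and_grad(W, x)
numeric = numerical_grad(W, x)
rel_err = np.abs(analytic - numeric).max() / (np.abs(analytic).max() + 1e-12)
```

In float64 a correct backward pass typically agrees with the numerical estimate to many decimal places. A relative error near 1 usually points to a sign or transpose bug.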
Method credit: TinyStories (Eldan & Li, 2023, arXiv:2305.07759) and SimpleStories (Finke et al., NeurIPS 2025, arXiv:2504.09184). gary-4-petite asked for coherence, and got it for about 0.6 MB.