gary-4-petite 🐀➕

gary-4 fit in 69 KB but spoke word salad. gary-4-petite fits in ~0.6 MB and speaks actual English.

Same numpy-only soul as gary-4: a real transformer, int8 weights, a ~110-line pure-numpy inference engine, zero ML frameworks. One idea changed, and the word salad turned into stories.

The one idea

gary-4 wasn't incoherent because it was small. It was incoherent because 67K parameters were asked to model The Pile: code, math, academic papers, scraped web junk, dozens of languages. That is one of the hardest corpora in existence, and no tiny model can compress it into coherent English.

The fix comes straight from the research line that asked exactly this question, Microsoft's TinyStories (Eldan & Li, 2023) and SimpleStories (Finke et al., NeurIPS 2025): the dominant lever for coherence at tiny scale is corpus simplicity, not model size. Train a small model on simple, consistent English and coherence appears well under 10M parameters, even around 1M.

gary-4-petite applies three changes, in order of impact:

| # | Change | gary-4 | gary-4-petite | Why it matters |
|---|--------|--------|---------------|----------------|
| 1 | Corpus | The Pile | TinyStories | The big one. Simple-English stories are learnable by a tiny model. |
| 2 | Tokenizer | char-level | byte-level BPE (2048) | Each parameter does semantic work instead of memorizing spelling. No more misspelled words. |
| 3 | Depth | 2 layers | 4 layers | Depth buys coherence at fixed width. |

Everything else is deliberately identical to gary-4: pre-LayerNorm GPT, learned positions, weight-tied embeddings, per-tensor int8 weights, and a pure-numpy forward pass.
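To make "per-tensor int8 weights" concrete, here is a minimal sketch of symmetric per-tensor quantization in numpy. It assumes one float scale per tensor and no zero point; the function names are hypothetical, and the scheme actually used by export_int8.py is not shown in this card and may differ.

```python
import numpy as np

def quantize_per_tensor(w):
    """Symmetric per-tensor int8: a single float scale for the whole tensor (assumed scheme)."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize(q, scale):
    """Recover an fp32 approximation of the original weights."""
    return q.astype(np.float32) * scale

# Toy weight matrix at the model's width (96); values are random, not real weights.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(96, 96)).astype(np.float32)

q, scale = quantize_per_tensor(w)
w_hat = dequantize(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))  # bounded by roughly scale / 2
```

Each weight then costs one byte plus a single shared scale per tensor, which is why the int8 file lands near the parameter count in size.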

Stats

| Stat | Value |
|------|-------|
| Parameters | 656,448 |
| Weights (int8) | 571 KB |
| Full release (model + tokenizer + engine) | ~0.6 MB |
| Architecture | 4-layer, 4-head, 96-dim BPE GPT, 128 ctx |
| Tokenizer | byte-level BPE, vocab 2048 |
| Training data | TinyStories (TinyStoriesV2-GPT4 text from roneneldan/TinyStories) |
| Training | pure-numpy, CPU only, ~3.1M tokens, val loss 7.62 → 3.41 |
| Dependencies | numpy. that's it. |
| Hardware needed | literally anything that runs python |
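The parameter count can be reproduced from the architecture line. The sketch below assumes a standard GPT-2-style layout (biases on every linear layer, a 4x MLP, two LayerNorms per block plus a final one, and an output head tied to the token embedding). The layout is an assumption, but under it the arithmetic lands exactly on the stated 656,448.

```python
def gpt_param_count(vocab, ctx, dim, layers, mlp_mult=4):
    """Count parameters for an assumed GPT-2-style layout with tied embeddings."""
    tok_emb = vocab * dim            # token embedding, shared with the output head
    pos_emb = ctx * dim              # learned positions
    attn = dim * 3 * dim + 3 * dim   # fused q, k, v projection + bias
    attn += dim * dim + dim          # attention output projection + bias
    hidden = mlp_mult * dim
    mlp = dim * hidden + hidden      # MLP up-projection + bias
    mlp += hidden * dim + dim        # MLP down-projection + bias
    norms = 2 * (2 * dim)            # two LayerNorms per block, gain + bias each
    final_norm = 2 * dim
    return tok_emb + pos_emb + layers * (attn + mlp + norms) + final_norm

total = gpt_param_count(vocab=2048, ctx=128, dim=96, layers=4)
print(total)  # 656448
```

The head count does not appear because splitting 96 dimensions into 4 heads changes the shapes inside attention, not the number of weights.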

Validation loss during training, run entirely on a 2-core CPU in numpy:

step   31   val 7.25   (~random init, ln 2048 = 7.62)
step  235   val 4.19
step  379   val 3.80
step  517   val 3.58
step  649   val 3.48
step  746   val 3.41
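For reading these numbers: a model that guesses uniformly over the 2048-token vocabulary has cross-entropy ln 2048, and exponentiating a loss gives perplexity, the effective number of equally likely next tokens. A quick check of both, using only the figures above:

```python
import math

vocab_size = 2048
uniform_loss = math.log(vocab_size)    # loss of a uniform guess over the vocabulary
final_val_loss = 3.41                  # last validation loss in the log above
perplexity = math.exp(final_val_loss)  # effective branching factor at that loss

print(f"{uniform_loss:.2f}")  # 7.62
print(f"{perplexity:.1f}")    # 30.3
```

So training took the model from choosing among 2048 tokens at random to an uncertainty equivalent to about 30 candidates per step.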

Run it

pip install numpy
python chat.py "Once upon a time"        # one-shot continuation
python chat.py                            # interactive

What it sounds like

Real outputs (seed 0, temp 0.7), unedited:

Tom and Lily went to the park. They had a home with the pretty house and laughed. They had fun. They wanted to play and play with it. Finally, they all played with the toys. They were best friends. They were very happy. They had fun. They played with their mom and said, "I want to play in the park." So, the other bunny and Sue were very happy together. They became best friends and being careful. And all played together.

Once upon a time, there was a little boy named Sue. Spot liked the ground. One day, Tim saw a big, but it was very naughty. He wanted to play with his mom. Tim thought it was to help. Then, Tim was so excited. He said, "I want to play it." ... Tim was so happy that he found his name, but Tom ran to each other and went to play together.
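The "seed 0, temp 0.7" setting refers to temperature sampling. The sketch below shows the usual technique in numpy. It is not taken from chat.py: the function names and the toy logits are hypothetical, and the way chat.py consumes its seed may differ.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def sample_token(logits, temperature, rng):
    """Draw one token id from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return int(rng.choice(len(probs), p=probs))

# Toy logits over a 4-token vocabulary (illustrative values only).
logits = np.array([2.0, 1.0, 0.5, -1.0])
rng = np.random.default_rng(0)  # a fixed seed makes the draw repeatable
token = sample_token(logits, 0.7, rng)
```

At 0.7 the most likely token gets more probability mass than at 1.0, which trades some variety for fewer derailments.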

Head-to-head with gary-4

Same prompts, both models, temp 0.7:

| Prompt | gary-4 (67K, char, Pile) | gary-4-petite (656K, BPE, TinyStories) |
|--------|--------------------------|-----------------------------------------|
| Once upon a time | . expectations: managed. | Once upon a time, there was a little boy named Sue... (coherent paragraph) |
| One day, a dog | le / i generate text one character at a time. | One day, a dog named Tim. The dog felt sad... (coherent story) |

gary-4 can't continue a story: off its 14 trained chat prompts it emits character-level fragments. petite writes paragraphs.

What it actually is (honest section)

656K parameters, 3.1M tokens of CPU training (15 minutes of compute, the same "trained in an afternoon" spirit as gary-4). It is a real, coherent small language model in the TinyStories register: grammatical sentences, named characters, dialogue, simple story arcs, even the occasional "the moral of the story."

It still slips: it will invent a word, swap a character's name mid-story, or lose the thread after a few sentences. It is not an assistant; it's a tiny story engine. It was deliberately stopped early (val 3.41) to fit a short CPU budget; the training pipeline is included, and TinyStories-scale models trained to convergence (val ~2.7–3.0) read noticeably smoother. The point stands either way: changing the corpus, not just the parameter count, is what turns a tiny model coherent.

Files

  • gary4petite.int8.npz: the model, int8. 571 KB. the whole point.
  • model.safetensors: fp32 weights for the curious (and the HF param badge).
  • vocab.json + merges.txt: the byte-level BPE tokenizer.
  • chat.py: full inference engine + a from-scratch pure-python BPE codec. numpy + stdlib only.
  • config.json: architecture + training metadata.
  • training/: the complete, reproducible pipeline (see below).
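For readers new to byte-level BPE, the encoding step can be sketched in a few lines of pure Python: start from raw UTF-8 bytes and repeatedly apply the highest-priority merge that is present. This is an illustration, not the codec in chat.py. The merge list is a made-up toy, the on-disk format of merges.txt is not assumed, and real tokenizers usually pre-split text into words before merging.

```python
def bpe_encode(text, merges):
    """Greedy byte-level BPE. merges is a list of (left, right) byte-string
    pairs, earliest entry = highest priority (hypothetical in-memory format)."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Toy merge table, not the model's real merges.
toy_merges = [(b"t", b"h"), (b"th", b"e"), (b" ", b"the")]
print(bpe_encode("the then", toy_merges))  # [b'the', b' the', b'n']
```

Because the base alphabet is all 256 byte values, any input can be encoded, and joining the tokens always reproduces the original bytes.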

Reproduce / train it further

The whole thing was trained in pure numpy, no torch. training/ contains:

  • prepare.py: trains the BPE tokenizer and tokenizes TinyStories to .bin.
  • gpt_numpy.py: the model: forward + hand-written, gradient-checked backprop + Adam.
  • train_burst.py: resumable, checkpointed training loop.
  • export_int8.py: quantize to int8 and package this folder.

cd training
mkdir -p data
# 1. get the corpus
curl -L -o data/valid.txt \
  https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt
# 2. tokenize  ->  trains BPE + writes train.bin/val.bin
python prepare.py
# 3. train (repeat to taste; each run resumes from ckpt.npz)
SECONDS=60 LR=1e-3 TMAX=4000 python train_burst.py
# 4. package  ->  writes a fresh gary4-petite/ release folder
python export_int8.py

Train past val ~3.0 for smoother prose.
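If you extend the hand-written backprop in gpt_numpy.py, it is worth re-running a gradient check. The sketch below shows the general technique on a small stand-in (softmax cross-entropy over 8 logits): compare the analytic gradient to a central finite difference. The helper names are hypothetical and this is not the checker shipped in training/.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def cross_entropy(logits, target):
    """Negative log-probability of the target under softmax(logits)."""
    z = logits - logits.max()
    return -(z[target] - np.log(np.exp(z).sum()))

def cross_entropy_grad(logits, target):
    """Hand-derived gradient: softmax(logits) minus the one-hot target."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    probs[target] -= 1.0
    return probs

rng = np.random.default_rng(0)
logits = rng.normal(size=8)  # float64 keeps finite-difference noise small
target = 3

analytic = cross_entropy_grad(logits, target)
numeric = numerical_grad(lambda x: cross_entropy(x, target), logits.copy())
rel_err = np.max(np.abs(analytic - numeric)) / np.max(np.abs(analytic) + np.abs(numeric))
```

A relative error many orders of magnitude below 1 suggests the hand-written gradient matches; anything near 1e-2 or larger usually means a bug in the backward pass.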


Method credit: TinyStories (Eldan & Li, 2023, arXiv:2305.07759) and SimpleStories (Finke et al., NeurIPS 2025, arXiv:2504.09184). gary-4-petite asked for coherence, and got it for about 0.6 MB.
