microGPT
A 1.35M-parameter decoder-only transformer trained from scratch on the TinyStories dataset. The entire training run took about an hour and a half on an Apple Silicon laptop. At roughly 130,000× smaller than GPT-3 (175B), it can still produce coherent, simple children's stories.
This is an educational artifact, not a production model. Its purpose is to make every component of a modern LLM legible, debuggable, and rebuildable on consumer hardware.
Quick facts
| Fact | Value |
|---|---|
| Architecture | Decoder-only transformer (GPT-style) |
| Parameters | 1,345,792 trainable (1.35M) |
| File size on disk | ~5.1 MB (float32) |
| Training data | ~470M tokens of TinyStories |
| Training compute | ~1.5 hours on Apple Silicon (MPS) |
| Final val loss | 2.25 (perplexity 9.49) |
| Context window | 256 tokens |
| Tokenizer | Byte-level BPE, vocab = 4,096 |
| License | MIT |
Architecture in detail
```
Input tokens (B, T)
        │
        ▼
Token Embedding (4096 → 128)  +  Position Embedding (256 → 128)   ← element-wise sum
        │
        ▼  (B, T, 128)
┌──── Block × 4 ──────────────────────────────┐
│                                             │
│  x = x + CausalSelfAttention(LayerNorm(x))  │  ← 4 heads
│  x = x + MLP(LayerNorm(x))                  │  ← 128 → 512 → 128, GELU
│                                             │
└─────────────────────────────────────────────┘
        │
        ▼  (B, T, 128)
    LayerNorm
        │
        ▼
Linear (128 → 4096)   ← weight-tied with token embedding
        │
        ▼  (B, T, 4096)
    Logits
```
| Hyperparameter | Value | Notes |
|---|---|---|
| `n_layers` | 4 | Stacked transformer blocks |
| `d_model` | 128 | Hidden dimension |
| `n_heads` | 4 | Each head is 128/4 = 32 dims |
| `head_dim` | 32 | Per-head dimensionality |
| `ffn_dim` | 512 | MLP intermediate width (4 × d_model) |
| `ctx_len` | 256 | Maximum input length in tokens |
| `vocab_size` | 4,096 | BPE-derived vocabulary |
| Normalization | LayerNorm | Pre-LN (applied before sublayers) |
| Position encoding | Learned | Absolute, additive |
| Activation | GELU | In the MLP |
| Attention | Multi-head, causal | Implemented via `F.scaled_dot_product_attention` |
| Embedding tying | Yes | Output projection shares weights with `tok_emb` |
| Bias on linear layers | No | Following common modern practice |
| Dropout | 0.1 (training) | 0.0 at inference |
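For concreteness, here is a minimal PyTorch sketch of one such pre-LN block. The module and attribute names (`Block`, `ln1`, `qkv`, `proj`, `mlp`) mirror the parameter breakdown below, but this is an illustrative reconstruction, not the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One pre-LN transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, d_model=128, n_heads=4, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q, K, V
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),  # 128 -> 512
            nn.GELU(),
            nn.Linear(4 * d_model, d_model, bias=False),  # 512 -> 128
        )
        self.dropout = dropout

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).split(C, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        att = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout if self.training else 0.0)
        att = att.transpose(1, 2).reshape(B, T, C)
        x = x + self.proj(att)           # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around MLP
        return x
```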
Parameter breakdown: where the 1.35M live
| Component | Shape | Params | % |
|---|---|---|---|
| Token embeddings (`tok_emb.weight`) | (4096, 128) | 524,288 | 38.9% |
| Position embeddings (`pos_emb.weight`) | (256, 128) | 32,768 | 2.4% |
| 4 × transformer block | | 788,480 | 58.6% |
| └ Per block: `ln1` (γ, β) | (128,) × 2 | 256 | |
| └ Per block: `attn.qkv` | (384, 128) | 49,152 | |
| └ Per block: `attn.proj` | (128, 128) | 16,384 | |
| └ Per block: `ln2` (γ, β) | (128,) × 2 | 256 | |
| └ Per block: `mlp.fc1` | (512, 128) | 65,536 | |
| └ Per block: `mlp.fc2` | (128, 512) | 65,536 | |
| Final LayerNorm (`ln_f`) | (128,) × 2 | 256 | 0.02% |
| Output projection (`head.weight`) | (4096, 128) | 0 | tied |
| Total | | 1,345,792 | 100% |
Two observations worth absorbing:
- Embeddings are 41% of total parameters at this scale. This is typical of small models: the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
- MLPs (`fc1` + `fc2`) account for two-thirds of every block's params: 131,072 of 197,120 ≈ 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored, and at frontier scale this ratio stays roughly true.
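The table's arithmetic can be verified in a few lines of plain Python (hyperparameter names from the table above; no model needed):

```python
d_model, n_layers, vocab, ctx = 128, 4, 4096, 256

per_block = (
    2 * (2 * d_model)              # ln1 + ln2, each with gamma and beta
    + 3 * d_model * d_model        # attn.qkv
    + d_model * d_model            # attn.proj
    + 2 * (4 * d_model * d_model)  # mlp.fc1 + mlp.fc2
)
total = (vocab * d_model           # token embeddings
         + ctx * d_model           # position embeddings
         + n_layers * per_block    # transformer body
         + 2 * d_model)            # final LayerNorm; tied head adds 0
print(per_block, total)            # 197120 1345792
```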
Training
Data
- Dataset: `roneneldan/TinyStories` (Eldan & Li, 2023)
- Stories: ~2.1M (train) + ~22K (validation)
- Tokens (after BPE): ~470M (train) + ~5M (validation)
- Why TinyStories specifically: a synthetic dataset designed so vocabulary and grammar stay within what a 3–4-year-old understands, which makes coherent generation possible at very small model scales. Without this curation, a 1.35M-param model trained on general web text produces gibberish.
Tokenizer
- Type: byte-level Byte-Pair Encoding (BPE)
- Vocabulary: 4,096 tokens (including special tokens `<unk>`, `<eos>`)
- Trained on: 50,000 stories from the train split (the vocab converges quickly; the full corpus produces a near-identical tokenizer)
- Avg compression: ~4 characters per token on TinyStories text
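As a rough illustration, a byte-level BPE tokenizer with this configuration can be trained with the Hugging Face `tokenizers` library. This is a sketch under assumptions; the repo's actual training script and the real `stories` iterable are not shown here:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

stories = ["Once upon a time, there was a little girl named Lily."]  # stand-in corpus

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["<unk>", "<eos>"])
tok.train_from_iterator(stories, trainer=trainer)
tok.save("tokenizer.json")

ids = tok.encode("Once upon a time").ids  # string -> token IDs
print(ids, tok.decode(ids))               # ...and back again
```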
Optimization
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| β₁, β₂ | 0.9, 0.95 |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Linear warmup (200 steps) → cosine decay |
| Batch size (sequences) | 64 |
| Sequence length | 256 |
| Tokens per step | 16,384 |
| Total steps | 20,000 |
| Total tokens seen | ~327M |
| Gradient clipping | 1.0 (global L2 norm) |
| Random seed | 1337 |
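Putting the table together, here is a sketch of the schedule and update step. The `model` is a stand-in and the loop is illustrative, not the repo's training code:

```python
import math
import torch
import torch.nn as nn

def lr_at(step, peak=3e-4, floor=3e-5, warmup=200, total=20000):
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup:
        return peak * (step + 1) / warmup
    t = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

model = nn.Linear(8, 8)  # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

for step in range(1000):  # 20,000 in the real run
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    loss = model(torch.randn(64, 8)).pow(2).mean()  # dummy loss
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global L2 clip at 1.0
    opt.step()
```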
Hardware & wall-clock
| Resource | Value |
|---|---|
| Hardware | Apple M-series laptop (MPS backend) |
| Precision | float32 |
| Wall-clock | ~1.5 hours |
| Peak memory | ~1.5 GB |
| Disk footprint | ~1 GB tokenized corpus + 5.1 MB checkpoint |
Evaluation
Held-out validation loss
| Step | Val loss | Perplexity |
|---|---|---|
| 0 (init) | 8.32 | 4096 |
| ~17,500 | 2.26 | 9.59 |
| ~20,000 | 2.25 | 9.49 |
For context: a uniform random predictor over the 4,096-token vocabulary would have loss ln(4096) ≈ 8.32 and perplexity 4,096. A perplexity of 9.49 means the model has effectively narrowed each prediction down to ~9.5 plausible next-token choices, on average.
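Both numbers follow directly from the definitions:

```python
import math

print(math.log(4096))  # 8.3178: loss of a uniform predictor over 4,096 tokens
print(math.exp(2.25))  # 9.4877: perplexity at the final validation loss
```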
Note on missing checkpoint progression. This release contains only the final checkpoint. The original plan was to capture step-tagged snapshots (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability emergence across training: gibberish → words → grammar → narrative. A monitoring script was started after training had already passed the early steps, so only the final checkpoint was preserved. A v0.2 release with multi-checkpoint snapshots and an emergence comparison is planned.
Capability probes
Three diagnostic prompts designed to probe specific capabilities, evaluated on the final checkpoint (step 20,000, T=0.8, top-k=40):
Short-range coherence: "Lily picked up the red ball and"
...put it on the floor. She threw the red ball with her hands and pulled it onto the floor. She tried to make it shine, but it made her feel good. "Oh no, no!" she shouted. "Sharing is very dangerous!"
Pronoun continuity holds; the ball persists as the object. ✓
Causal reasoning: "Tom was very sad because"
...he had no one to play. He thought that it would be fun and share his toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"
"because" produces a real cause of sadness; emotional consistency holds for 2β3 sentences before drifting. β (with caveats)
Narrative composition: "The little dragon wanted to fly but"
...her mommy said no. The bear was very sad that he was gone. He wanted to fly anymore and get lost.
The initial obstacle is set up correctly, but the model loses track of which character is which (dragon → bear → "he"). ✗
This pattern (local coherence ✓, multi-sentence composition only partially ✓) is expected at this scale. A narrative arc requires planning across many tokens, which is one of the last capabilities to emerge in language models even at frontier scale.
Intended use
In scope:
- Educational reference for the GPT-style transformer architecture
- Demonstration of end-to-end LLM training on consumer hardware
- Generating short, simple, TinyStories-style English children's narratives
- Exploring how sampling parameters (temperature, top-k, top-p) affect output
- Comparison baseline for tiny-model research
Out of scope:
- General-purpose text generation (vocabulary is restricted to TinyStories)
- Question answering, instruction following, or chat (no SFT or RLHF stage)
- Anything requiring factual accuracy (no factual grounding)
- Non-English text (English-only training data)
- Long-form generation (256-token context window)
Limitations and biases
- Distribution lock-in: Trained exclusively on synthetic children's stories. Generation outside this distribution (e.g., technical text, adult themes, dialogue formats) will be incoherent.
- No instruction following: This is a base model β pre-training only. It completes text; it does not answer questions or follow instructions.
- Hallucination: No factual grounding. The model has no concept of "I don't know" β it produces the most statistically plausible continuation, which is often false outside the training distribution.
- Context window: 256 tokens is too short to model long dependencies.
- Synthetic data biases: TinyStories was generated by GPT-3.5/4 with prompted constraints, so it inherits some of that generator's stylistic patterns and any biases encoded therein.
- No safety training: No RLHF, no Constitutional AI, no content filtering. While the training data is innocuous, prompts that push toward harmful outputs receive no safeguards.
- Memorization vs generalization: Some completions ("She was very happy and they played all day") are likely memorized stylistic patterns rather than novel generation.
How to use
Inference
```python
from inference import NanoSLMInference

slm = NanoSLMInference("ckpt.pt", "tokenizer.json")
text = slm.generate(
    "Once upon a time, there was a little",
    max_new_tokens=200,
    temperature=0.8,
    top_k=40,
)
print(text)
```
Sampling parameters
| Parameter | Effect |
|---|---|
| `temperature` | Scales logits before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0. |
| `top_k` | Keep only the k highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. |
| `top_p` (nucleus) | Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. |
| `seed` | Sets the PyTorch RNG for reproducibility. |
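For reference, here is how these three filters typically compose in a single sampling step. This is an illustrative sketch, not the repo's `generate` implementation:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=40, top_p=None):
    """Pick one token id from a 1-D logits tensor of shape (vocab_size,)."""
    if temperature == 0:
        return int(logits.argmax())              # greedy decoding
    logits = logits / temperature
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:                        # nucleus filtering
        sorted_logits, order = torch.sort(logits, descending=True)
        cum = F.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop = cum > top_p
        drop[1:] = drop[:-1].clone()             # keep the token that crosses p
        drop[0] = False                          # never drop the top token
        logits[order[drop]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# usage: next_id = sample_next(last_position_logits, temperature=0.8, top_k=40)
```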
How this model is served
A live demo is hosted on Hugging Face Spaces. The serving stack is intentionally minimal:
```
User browser
    │  HTTPS
    ▼
HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
    │
    ▼
Gradio + FastAPI/uvicorn
    │
    ▼
PyTorch eager-mode forward pass on CPU
    │
    ▼
Autoregressive token generation, one token per pass
```
Approximate latency for 100 generated tokens: ~3 seconds on Spaces' free CPU, ~0.5 seconds on Apple M-series with MPS.
What this serving setup deliberately does not implement (each is a separate upgrade and a useful learning exercise):
- KV-caching: every generation step re-processes all prior tokens. A real implementation caches K/V tensors and pays only for the new token (a minimal sketch follows this list).
- Continuous batching: multiple users would queue serially. Production servers (vLLM, TGI) batch concurrent requests dynamically.
- Quantization: weights are float32. int8 would shrink memory ~4× (int4, ~8×).
- Compiled graphs: eager-mode PyTorch leaves performance on the table vs `torch.compile()`, ONNX Runtime, or a dedicated engine.
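As a taste of the first item, a minimal KV-cache sketch. The function name, shapes, and calling convention are assumptions for illustration, not the repo's code:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache=None):
    """One decode step. q_new/k_new/v_new: (B, n_heads, 1, head_dim)."""
    if cache is not None:
        k_new = torch.cat([cache[0], k_new], dim=2)  # append along time axis
        v_new = torch.cat([cache[1], v_new], dim=2)
    # the single new query may attend to every cached position: no mask needed
    out = F.scaled_dot_product_attention(q_new, k_new, v_new)
    return out, (k_new, v_new)

# per generated token: project only the newest embedding to q/k/v, call
# attend_with_cache, and thread the returned (k, v) cache into the next step.
```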
For a model this small the overheads don't matter. At any production scale, every one of the above becomes critical to unit economics.
Comparison with frontier models
The architecture is structurally identical to GPT-2/3, Llama, Mistral, and Claude. The differences below are evolutionary refinements, not categorical changes β the core "decoder-only transformer trained with next-token prediction" recipe is the same.
| | microGPT (this) | Llama 3.1 70B |
|---|---|---|
| Parameters | 1.35M | 70B (~52,000× larger) |
| Layers | 4 | 80 |
| `d_model` | 128 | 8,192 |
| Heads | 4 (multi-head) | 64 (grouped-query attention) |
| Context | 256 | 128,000 |
| Vocab | 4,096 | 128,256 |
| Position | Learned absolute | Rotary (RoPE) |
| Activation | GELU | SwiGLU |
| Normalization | LayerNorm | RMSNorm |
| Training tokens | ~327M | ~15T |
| Training compute | ~0.1 kWh on a laptop | many MW-months on H100 clusters |
Glossary
A short reference for the terminology used above. Worth absorbing: these terms come up constantly in AI literature and interviews.
Parameter / weight. A single learnable number stored in the model. Updated during training, read during inference. A "1.35M parameter model" literally has 1.35M of these numbers.
Embedding. A learned vector representation of a discrete object (token, position). Implemented as a lookup table.
Token. The atomic unit of text the model operates on. Produced by the tokenizer; typically ~4 characters of English per token for byte-level BPE.
Tokenizer. The deterministic, reversible function that converts strings to integer ID sequences and back. Decisions made here (vocab size, BPE merges) propagate through the entire model.
BPE (Byte-Pair Encoding). A subword tokenization algorithm that iteratively merges the most frequent adjacent pairs of symbols into new vocabulary entries.
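A toy illustration of the merge-counting step at the heart of BPE (not a full implementation):

```python
from collections import Counter

# count adjacent symbol pairs across a tiny toy corpus
words = ["l o w", "l o w e r", "n e w e s t"]
pairs = Counter()
for w in words:
    syms = w.split()
    pairs.update(zip(syms, syms[1:]))

# the most frequent pair becomes a new vocabulary entry, e.g. 'l'+'o' -> 'lo';
# real BPE repeats this until the target vocab size (here, 4,096) is reached
print(pairs.most_common(3))
```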
Logits. The raw, unnormalized scores the model outputs β one per vocabulary token at each position. Becomes a probability distribution after softmax.
Softmax. Function that converts logits to probabilities by exponentiating and normalizing.
Cross-entropy loss. The training objective: how surprised the model is by the correct next token. Equals 0 if the model assigned probability 1 to the right answer; equals ln(vocab_size) if the model is uniformly uninformed.
Perplexity. exp(loss). The "effective number of choices" the model is deciding between. Useful because it has a more intuitive scale than loss.
Decoder-only / autoregressive. The model only attends to past tokens (causal mask), and generates one token at a time conditioned on what it has already produced.
Self-attention. The mechanism by which each position computes a weighted combination of all (allowed) other positions, where the weights depend on the content at each position.
Multi-head attention. Self-attention computed in parallel across n subspaces ("heads"), each with d_model / n dimensions. Different heads empirically learn to specialize.
KV cache. At inference time, the Key and Value tensors from previous tokens can be cached and reused, avoiding redundant computation. Critical for production serving; not implemented in this model.
Pre-LayerNorm. Applying LayerNorm before the attention/MLP sublayers, not after. Stabilizes training of deep transformers.
Weight tying. Sharing parameters between the input embedding matrix and the output projection matrix. Saves memory; usually improves quality.
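In PyTorch this is a one-line parameter assignment; a sketch with this model's shapes (variable names assumed):

```python
import torch.nn as nn

tok_emb = nn.Embedding(4096, 128)        # (vocab, d_model)
head = nn.Linear(128, 4096, bias=False)  # weight shape is (4096, 128)
head.weight = tok_emb.weight             # one shared matrix: saves 524,288 params
```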
Cosine learning-rate schedule. Learning rate ramps up linearly during warmup, then decays following a cosine curve. Standard for transformer training.
Gradient clipping. Capping the global L2 norm of gradients during backpropagation to prevent destabilizing weight updates.
MPS (Metal Performance Shaders). Apple's GPU acceleration backend for PyTorch on M-series chips. The Apple Silicon equivalent of CUDA.
Pre-training. The stage of training described here: minimize next-token prediction loss on a large corpus. Produces a base model.
SFT (Supervised Fine-Tuning). A subsequent training stage on (instruction, ideal response) pairs. Teaches the model to follow instructions. Not done for this model.
RLHF (Reinforcement Learning from Human Feedback). A further training stage using preference data. Aligns model behavior with human preferences. Not done for this model.
Citation
If this model or its companion code helped you, please cite or link to:
```bibtex
@misc{microgpt,
  author       = {Brett Lee Hary},
  title        = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
}
```
Acknowledgements
- Andrej Karpathy's nanoGPT: the reference implementation that made this approachable.
- Eldan & Li (2023), "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?", for the dataset and the insight that data quality can substitute for model scale.
- Vaswani et al. (2017), "Attention Is All You Need", for the original transformer.
- The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for the infrastructure that makes projects like this trivial to share.