microGPT
A 1.35M-parameter decoder-only transformer trained from scratch on the TinyStories dataset. The entire training run took about an hour and a half on an Apple Silicon laptop. At roughly 130,000× smaller than GPT-3 (175B), it can still produce coherent, simple children's stories.
This is an educational artifact, not a production model. Its purpose is to make every component of a modern LLM legible, debuggable, and rebuildable on consumer hardware.
Quick facts
| Fact | Value |
|---|---|
| Architecture | Decoder-only transformer (GPT-style) |
| Parameters | 1,345,792 trainable (1.35M) |
| File size on disk | ~5.1 MB (float32) |
| Training data | ~470M tokens of TinyStories |
| Training compute | ~1.5 hours on Apple Silicon (MPS) |
| Final val loss | 2.25 (perplexity 9.49) |
| Context window | 256 tokens |
| Tokenizer | Byte-level BPE, vocab = 4,096 |
| License | MIT |
Architecture in detail
```
Input tokens (B, T)
        │
        ▼
Token Embedding (4096 → 128)  +  Position Embedding (256 → 128)   ← element-wise sum
        │
        ▼  (B, T, 128)
┌──── Block × 4 ──────────────────────────────┐
│                                             │
│  x = x + CausalSelfAttention(LayerNorm(x))  │  ← 4 heads
│  x = x + MLP(LayerNorm(x))                  │  ← 128 → 512 → 128, GELU
│                                             │
└─────────────────────────────────────────────┘
        │
        ▼  (B, T, 128)
    LayerNorm
        │
        ▼
Linear (128 → 4096)   ← weight-tied with token embedding
        │
        ▼  (B, T, 4096)
    Logits
```
| Hyperparameter | Value | Notes |
|---|---|---|
| `n_layers` | 4 | Stacked transformer blocks |
| `d_model` | 128 | Hidden dimension |
| `n_heads` | 4 | Each head is 128/4 = 32 dims |
| `head_dim` | 32 | Per-head dimensionality |
| `ffn_dim` | 512 | MLP intermediate width (4 × d_model) |
| `ctx_len` | 256 | Maximum input length in tokens |
| `vocab_size` | 4,096 | BPE-derived vocabulary |
| Normalization | LayerNorm | Pre-LN (applied before sublayers) |
| Position encoding | Learned | Absolute, additive |
| Activation | GELU | In the MLP |
| Attention | Multi-head, causal | Implemented via `F.scaled_dot_product_attention` |
| Embedding tying | Yes | Output projection shares weights with `tok_emb` |
| Bias on linear layers | No | Following common modern practice |
| Dropout | 0.1 (training) | 0.0 at inference |
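For concreteness, here is a minimal PyTorch sketch of one such pre-LN block. The module and attribute names (`Block`, `ln1`, `qkv`, `proj`, `mlp`) mirror the parameter breakdown below, but this is an illustrative reconstruction, not the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One pre-LN transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, d_model=128, n_heads=4, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q, K, V
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model, bias=False),  # 128 -> 512
            nn.GELU(),
            nn.Linear(4 * d_model, d_model, bias=False),  # 512 -> 128
        )
        self.dropout = dropout

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(self.ln1(x)).split(C, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        att = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.dropout if self.training else 0.0)
        att = att.transpose(1, 2).reshape(B, T, C)
        x = x + self.proj(att)           # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around MLP
        return x
```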
Parameter breakdown: where the 1.35M live
| Component | Shape | Params | % |
|---|---|---|---|
| Token embeddings (`tok_emb.weight`) | (4096, 128) | 524,288 | 38.9% |
| Position embeddings (`pos_emb.weight`) | (256, 128) | 32,768 | 2.4% |
| 4 × transformer block | | 788,480 | 58.6% |
| └ Per block: `ln1` (γ, β) | (128,) × 2 | 256 | |
| └ Per block: `attn.qkv` | (384, 128) | 49,152 | |
| └ Per block: `attn.proj` | (128, 128) | 16,384 | |
| └ Per block: `ln2` (γ, β) | (128,) × 2 | 256 | |
| └ Per block: `mlp.fc1` | (512, 128) | 65,536 | |
| └ Per block: `mlp.fc2` | (128, 512) | 65,536 | |
| Final LayerNorm (`ln_f`) | (128,) × 2 | 256 | 0.02% |
| Output projection (`head.weight`) | (4096, 128) | 0 | tied |
| Total | | 1,345,792 | 100% |
Two observations worth absorbing:
- Embeddings are 41% of total parameters at this scale. This is typical of small models: the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
- MLPs (`fc1` + `fc2`) account for two-thirds of every block's params: 131,072 of 197,120 ≈ 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored, and at frontier scale this ratio stays roughly true.
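The table's arithmetic can be verified in a few lines of plain Python (hyperparameter names from the table above; no model needed):

```python
d_model, n_layers, vocab, ctx = 128, 4, 4096, 256

per_block = (
    2 * (2 * d_model)              # ln1 + ln2, each with gamma and beta
    + 3 * d_model * d_model        # attn.qkv
    + d_model * d_model            # attn.proj
    + 2 * (4 * d_model * d_model)  # mlp.fc1 + mlp.fc2
)
total = (vocab * d_model           # token embeddings
         + ctx * d_model           # position embeddings
         + n_layers * per_block    # transformer body
         + 2 * d_model)            # final LayerNorm; tied head adds 0
print(per_block, total)            # 197120 1345792
```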
Training
Data
- Dataset: `roneneldan/TinyStories` (Eldan & Li, 2023)
- Stories: ~2.1M (train) + ~22K (validation)
- Tokens (after BPE): ~470M (train) + ~5M (validation)
- Why TinyStories specifically: a synthetic dataset designed so vocabulary and grammar stay within what a 3–4-year-old understands, which makes coherent generation possible at very small model scales. Without this curation, a 1.35M-param model trained on general web text produces gibberish.
Tokenizer
- Type: byte-level Byte-Pair Encoding (BPE)
- Vocabulary: 4,096 tokens (including special tokens `<unk>`, `<eos>`)
- Trained on: 50,000 stories from the train split (the vocab converges quickly; the full corpus produces a near-identical tokenizer)
- Avg compression: ~4 characters per token on TinyStories text
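As a rough illustration, a byte-level BPE tokenizer with this configuration can be trained with the Hugging Face `tokenizers` library. This is a sketch under assumptions; the repo's actual training script and the real `stories` iterable are not shown here:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

stories = ["Once upon a time, there was a little girl named Lily."]  # stand-in corpus

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["<unk>", "<eos>"])
tok.train_from_iterator(stories, trainer=trainer)
tok.save("tokenizer.json")

ids = tok.encode("Once upon a time").ids  # string -> token IDs
print(ids, tok.decode(ids))               # ...and back again
```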
Optimization
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| β₁, β₂ | 0.9, 0.95 |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Linear warmup (200 steps) → cosine decay |
| Batch size (sequences) | 64 |
| Sequence length | 256 |
| Tokens per step | 16,384 |
| Total steps | 20,000 |
| Total tokens seen | ~327M |
| Gradient clipping | 1.0 (global L2 norm) |
| Random seed | 1337 |
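Putting the table together, here is a sketch of the schedule and update step. The `model` is a stand-in and the loop is illustrative, not the repo's training code:

```python
import math
import torch
import torch.nn as nn

def lr_at(step, peak=3e-4, floor=3e-5, warmup=200, total=20000):
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup:
        return peak * (step + 1) / warmup
    t = (step - warmup) / (total - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

model = nn.Linear(8, 8)  # stand-in for the transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)

for step in range(1000):  # 20,000 in the real run
    for group in opt.param_groups:
        group["lr"] = lr_at(step)
    loss = model(torch.randn(64, 8)).pow(2).mean()  # dummy loss
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # global L2 clip at 1.0
    opt.step()
```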
Hardware & wall-clock
| Resource | Value |
|---|---|
| Hardware | Apple M-series laptop (MPS backend) |
| Precision | float32 |
| Wall-clock | ~1.5 hours |
| Peak memory | ~1.5 GB |
| Disk footprint | ~1 GB tokenized corpus + 5.1 MB checkpoint |
Evaluation
Held-out validation loss
| Step | Val loss | Perplexity |
|---|---|---|
| 0 (init) | 8.32 | 4096 |
| ~17,500 | 2.26 | 9.59 |
| ~20,000 | 2.25 | 9.49 |
For context: a uniform random predictor over the 4,096-token vocabulary would have loss ln(4096) ≈ 8.32 and perplexity 4,096. A perplexity of 9.49 means the model has effectively narrowed each prediction down to ~9.5 plausible next-token choices, on average.
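Both numbers follow directly from the definitions:

```python
import math

print(math.log(4096))  # 8.3178: loss of a uniform predictor over 4,096 tokens
print(math.exp(2.25))  # 9.4877: perplexity at the final validation loss
```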
Note on missing checkpoint progression. This release contains only the final checkpoint. The original plan was to capture step-tagged snapshots (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability emergence across training: gibberish → words → grammar → narrative. A monitoring script was started after training had already passed the early steps, so only the final checkpoint was preserved. A v0.2 release with multi-checkpoint snapshots and an emergence comparison is planned.
Capability probes
Three diagnostic prompts designed to probe specific capabilities, evaluated on the final checkpoint (step 20,000, T=0.8, top-k=40):
Short-range coherence: "Lily picked up the red ball and"
...put it on the floor. She threw the red ball with her hands and pulled it onto the floor. She tried to make it shine, but it made her feel good. "Oh no, no!" she shouted. "Sharing is very dangerous!"
Pronoun continuity holds; the ball persists as the object. ✓
Causal reasoning: "Tom was very sad because"
...he had no one to play. He thought that it would be fun and share his toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"
"because" produces a real cause of sadness; emotional consistency holds for 2β3 sentences before drifting. β (with caveats)
Narrative composition: "The little dragon wanted to fly but"
...her mommy said no. The bear was very sad that he was gone. He wanted to fly anymore and get lost.
The initial obstacle is set up correctly, but the model loses track of which character is which (dragon → bear → "he"). ✗
This pattern (local coherence ✓, multi-sentence composition only partially ✓) is expected at this scale. A narrative arc requires planning across many tokens, which is one of the last capabilities to emerge in language models even at frontier scale.
Intended use
In scope:
- Educational reference for the GPT-style transformer architecture
- Demonstration of end-to-end LLM training on consumer hardware
- Generating short, simple, TinyStories-style English children's narratives
- Exploring how sampling parameters (temperature, top-k, top-p) affect output
- Comparison baseline for tiny-model research
Out of scope:
- General-purpose text generation (vocabulary is restricted to TinyStories)
- Question answering, instruction following, or chat (no SFT or RLHF stage)
- Anything requiring factual accuracy (no factual grounding)
- Non-English text (English-only training data)
- Long-form generation (256-token context window)
Limitations and biases
- Distribution lock-in: Trained exclusively on synthetic children's stories. Generation outside this distribution (e.g., technical text, adult themes, dialogue formats) will be incoherent.
- No instruction following: This is a base model β pre-training only. It completes text; it does not answer questions or follow instructions.
- Hallucination: No factual grounding. The model has no concept of "I don't know" β it produces the most statistically plausible continuation, which is often false outside the training distribution.
- Context window: 256 tokens is too short to model long dependencies.
- Synthetic data biases: TinyStories was generated by GPT-3.5/4 with prompted constraints, so it inherits some of that generator's stylistic patterns and any biases encoded therein.
- No safety training: No RLHF, no Constitutional AI, no content filtering. While the training data is innocuous, prompts that push toward harmful outputs receive no safeguards.
- Memorization vs generalization: Some completions ("She was very happy and they played all day") are likely memorized stylistic patterns rather than novel generation.
How to use
Inference
```python
from inference import NanoSLMInference

slm = NanoSLMInference("ckpt.pt", "tokenizer.json")
text = slm.generate(
    "Once upon a time, there was a little",
    max_new_tokens=200,
    temperature=0.8,
    top_k=40,
)
print(text)
```
Sampling parameters
| Parameter | Effect |
|---|---|
| `temperature` | Scales logits before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0. |
| `top_k` | Keep only the k highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. |
| `top_p` (nucleus) | Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. |
| `seed` | Sets the PyTorch RNG for reproducibility. |
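For reference, here is how these three filters typically compose in a single sampling step. This is an illustrative sketch, not the repo's `generate` implementation:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8, top_k=40, top_p=None):
    """Pick one token id from a 1-D logits tensor of shape (vocab_size,)."""
    if temperature == 0:
        return int(logits.argmax())              # greedy decoding
    logits = logits / temperature
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:                        # nucleus filtering
        sorted_logits, order = torch.sort(logits, descending=True)
        cum = F.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        drop = cum > top_p
        drop[1:] = drop[:-1].clone()             # keep the token that crosses p
        drop[0] = False                          # never drop the top token
        logits[order[drop]] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# usage: next_id = sample_next(last_position_logits, temperature=0.8, top_k=40)
```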
How this model is served
A live demo is hosted on Hugging Face Spaces. The serving stack is intentionally minimal:
```
User browser
    │  HTTPS
    ▼
HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
    │
    ▼
Gradio + FastAPI/uvicorn
    │
    ▼
PyTorch eager-mode forward pass on CPU
    │
    ▼
Autoregressive token generation, one token per pass
```
Approximate latency for 100 generated tokens: ~3 seconds on Spaces' free CPU, ~0.5 seconds on Apple M-series with MPS.
What this serving setup deliberately does not implement (each is a separate upgrade and a useful learning exercise):
- KV-caching: every generation step re-processes all prior tokens. A real implementation caches K/V tensors and pays only for the new token (a minimal sketch follows this list).
- Continuous batching: multiple users would queue serially. Production servers (vLLM, TGI) batch concurrent requests dynamically.
- Quantization: weights are float32. int8 would shrink memory ~4× (int4, ~8×).
- Compiled graphs: eager-mode PyTorch leaves performance on the table vs `torch.compile()`, ONNX Runtime, or a dedicated engine.
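As a taste of the first item, a minimal KV-cache sketch. The function name, shapes, and calling convention are assumptions for illustration, not the repo's code:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache=None):
    """One decode step. q_new/k_new/v_new: (B, n_heads, 1, head_dim)."""
    if cache is not None:
        k_new = torch.cat([cache[0], k_new], dim=2)  # append along time axis
        v_new = torch.cat([cache[1], v_new], dim=2)
    # the single new query may attend to every cached position: no mask needed
    out = F.scaled_dot_product_attention(q_new, k_new, v_new)
    return out, (k_new, v_new)

# per generated token: project only the newest embedding to q/k/v, call
# attend_with_cache, and thread the returned (k, v) cache into the next step.
```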
For a model this small the overheads don't matter. At any production scale, every one of the above becomes critical to unit economics.
Comparison with frontier models
The architecture is structurally identical to GPT-2/3, Llama, Mistral, and Claude. The differences below are evolutionary refinements, not categorical changes β the core "decoder-only transformer trained with next-token prediction" recipe is the same.
| | microGPT (this) | Llama 3.1 70B |
|---|---|---|
| Parameters | 1.35M | 70B (~52,000× larger) |
| Layers | 4 | 80 |
| `d_model` | 128 | 8,192 |
| Heads | 4 (multi-head) | 64 (grouped-query attention) |
| Context | 256 | 128,000 |
| Vocab | 4,096 | 128,256 |
| Position | Learned absolute | Rotary (RoPE) |
| Activation | GELU | SwiGLU |
| Normalization | LayerNorm | RMSNorm |
| Training tokens | ~327M | ~15T |
| Training compute | ~0.1 kWh on a laptop | many MW-months on H100 clusters |
Glossary
A short reference for the terminology used above. Worth absorbing: these terms come up constantly in AI literature and interviews.
Parameter / weight. A single learnable number stored in the model. Updated during training, read during inference. A "1.35M parameter model" literally has 1.35M of these numbers.
Embedding. A learned vector representation of a discrete object (token, position). Implemented as a lookup table.
Token. The atomic unit of text the model operates on. Produced by the tokenizer; typically ~4 characters of English per token for byte-level BPE.
Tokenizer. The deterministic, reversible function that converts strings to integer ID sequences and back. Decisions made here (vocab size, BPE merges) propagate through the entire model.
BPE (Byte-Pair Encoding). A subword tokenization algorithm that iteratively merges the most frequent adjacent pairs of symbols into new vocabulary entries.
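A toy illustration of the merge-counting step at the heart of BPE (not a full implementation):

```python
from collections import Counter

# count adjacent symbol pairs across a tiny toy corpus
words = ["l o w", "l o w e r", "n e w e s t"]
pairs = Counter()
for w in words:
    syms = w.split()
    pairs.update(zip(syms, syms[1:]))

# the most frequent pair becomes a new vocabulary entry, e.g. 'l'+'o' -> 'lo';
# real BPE repeats this until the target vocab size (here, 4,096) is reached
print(pairs.most_common(3))
```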
Logits. The raw, unnormalized scores the model outputs β one per vocabulary token at each position. Becomes a probability distribution after softmax.
Softmax. Function that converts logits to probabilities by exponentiating and normalizing.
Cross-entropy loss. The training objective: how surprised the model is by the correct next token. Equals 0 if the model assigned probability 1 to the right answer; equals ln(vocab_size) if the model is uniformly uninformed.
Perplexity. exp(loss). The "effective number of choices" the model is deciding between. Useful because it has a more intuitive scale than loss.
Decoder-only / autoregressive. The model only attends to past tokens (causal mask), and generates one token at a time conditioned on what it has already produced.
Self-attention. The mechanism by which each position computes a weighted combination of all (allowed) other positions, where the weights depend on the content at each position.
Multi-head attention. Self-attention computed in parallel across n subspaces ("heads"), each with d_model / n dimensions. Different heads empirically learn to specialize.
KV cache. At inference time, the Key and Value tensors from previous tokens can be cached and reused, avoiding redundant computation. Critical for production serving; not implemented in this model.
Pre-LayerNorm. Applying LayerNorm before the attention/MLP sublayers, not after. Stabilizes training of deep transformers.
Weight tying. Sharing parameters between the input embedding matrix and the output projection matrix. Saves memory; usually improves quality.
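In PyTorch this is a one-line parameter assignment; a sketch with this model's shapes (variable names assumed):

```python
import torch.nn as nn

tok_emb = nn.Embedding(4096, 128)        # (vocab, d_model)
head = nn.Linear(128, 4096, bias=False)  # weight shape is (4096, 128)
head.weight = tok_emb.weight             # one shared matrix: saves 524,288 params
```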
Cosine learning-rate schedule. Learning rate ramps up linearly during warmup, then decays following a cosine curve. Standard for transformer training.
Gradient clipping. Capping the global L2 norm of gradients during backpropagation to prevent destabilizing weight updates.
MPS (Metal Performance Shaders). Apple's GPU acceleration backend for PyTorch on M-series chips. The Apple Silicon equivalent of CUDA.
Pre-training. The stage of training described here: minimize next-token prediction loss on a large corpus. Produces a base model.
SFT (Supervised Fine-Tuning). A subsequent training stage on (instruction, ideal response) pairs. Teaches the model to follow instructions. Not done for this model.
RLHF (Reinforcement Learning from Human Feedback). A further training stage using preference data. Aligns model behavior with human preferences. Not done for this model.
Citation
If this model or its companion code helped you, please cite or link to:
```bibtex
@misc{microgpt,
  author       = {Brett Lee Hary},
  title        = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
}
```
Acknowledgements
- Andrej Karpathy's nanoGPT: the reference implementation that made this approachable.
- Eldan & Li (2023), "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?", for the dataset and the insight that data quality can substitute for model scale.
- Vaswani et al. (2017), "Attention Is All You Need", for the original transformer.
- The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for the infrastructure that makes projects like this trivial to share.