---
license: mit
language:
- en
tags:
- text-generation
- transformer
- educational
- tiny-llm
- from-scratch
- decoder-only
- gpt
datasets:
- roneneldan/TinyStories
pipeline_tag: text-generation
library_name: pytorch
model-index:
- name: microgpt
  results:
  - task:
      type: text-generation
      name: Story completion
    dataset:
      name: TinyStories (validation split)
      type: roneneldan/TinyStories
    metrics:
    - type: cross-entropy
      value: 2.25
      name: Validation cross-entropy loss
    - type: perplexity
      value: 9.49
      name: Validation perplexity
---
# microGPT
A **1.35M-parameter decoder-only transformer** trained from scratch on the
[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.
The entire training run took roughly 1.5 hours on an Apple Silicon laptop.
At ~130,000× smaller than GPT-3 (175B parameters), it can still produce
coherent simple children's stories.
This is an **educational artifact**, not a production model. Its purpose is
to make every component of a modern LLM legible, debuggable, and rebuildable
on consumer hardware.
---
## Quick facts
| | |
|---|---|
| **Architecture** | Decoder-only transformer (GPT-style) |
| **Parameters** | 1,345,792 trainable (1.35M) |
| **File size on disk** | ~5.1 MB (float32) |
| **Training data** | TinyStories train split (~470M tokens available, ~327M seen during training) |
| **Training compute** | ~1.5 hours on Apple Silicon (MPS) |
| **Final val loss** | 2.25 (perplexity 9.49) |
| **Context window** | 256 tokens |
| **Tokenizer** | Byte-level BPE, vocab=4096 |
| **License** | MIT |
---
## Architecture in detail
```
Input tokens (B, T)
   │
   ├─► Token Embedding (4096 → 128) ──┐
   │                                  │
   └─► Position Embedding ────────────┘  ← element-wise sum
                                      │
                                      ▼  (B, T, 128)
  ┌──── Block × 4 ───────────────────────────────────────────┐
  │                                                          │
  │  h = LayerNorm(x)                                        │
  │  x = x + CausalSelfAttention(h)   ← 4 heads              │
  │  h = LayerNorm(x)                                        │
  │  x = x + MLP(h)                   ← 128→512→128, GELU    │
  │                                                          │
  └──────────────────────────────────────────────────────────┘
                                      │
                                      ▼  (B, T, 128)
                                  LayerNorm
                                      │
                                      ▼
                             Linear (128 → 4096)  ← weight-tied with token embedding
                                      │
                                      ▼  (B, T, 4096)
                                   Logits
```
| Hyperparameter | Value | Notes |
|---|---|---|
| `n_layers` | 4 | Stacked transformer blocks |
| `d_model` | 128 | Hidden dimension |
| `n_heads` | 4 | Each head is 128/4 = 32 dim |
| `head_dim` | 32 | Per-head dimensionality |
| `ffn_dim` | 512 | MLP intermediate width (4 × d_model) |
| `ctx_len` | 256 | Maximum input length in tokens |
| `vocab_size` | 4,096 | BPE-derived vocabulary |
| Normalization | LayerNorm | Pre-LN (applied before sublayers) |
| Position encoding | Learned | Absolute, additive |
| Activation | GELU | In the MLP |
| Attention | Multi-head, causal | Implemented via `F.scaled_dot_product_attention` |
| Embedding tying | Yes | Output projection shares weight with `tok_emb` |
| Bias on linear layers | No | Following common modern practice |
| Dropout | 0.1 (training) | 0.0 at inference |
### Parameter breakdown: where the 1.35M live
| Component | Shape | Params | % |
|---|---|---|---|
| Token embeddings (`tok_emb.weight`) | (4096, 128) | 524,288 | 39.0% |
| Position embeddings (`pos_emb.weight`) | (256, 128) | 32,768 | 2.4% |
| 4 × transformer block | n/a | 788,480 | 58.6% |
| └─ Per block: `ln1` (γ, β) | (128,) × 2 | 256 | |
| └─ Per block: `attn.qkv` | (384, 128) | 49,152 | |
| └─ Per block: `attn.proj` | (128, 128) | 16,384 | |
| └─ Per block: `ln2` (γ, β) | (128,) × 2 | 256 | |
| └─ Per block: `mlp.fc1` | (512, 128) | 65,536 | |
| └─ Per block: `mlp.fc2` | (128, 512) | 65,536 | |
| Final LayerNorm (`ln_f`) | (128,) × 2 | 256 | 0.02% |
| Output projection (`head.weight`) | (4096, 128) | 0 | tied |
| **Total** | | **1,345,792** | |
Two observations worth absorbing:
- **Embeddings are 41% of total parameters** at this scale. This is typical of small models, where the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
- **MLPs (`fc1` + `fc2`) account for about two-thirds of every block's params**: 131,072 of 197,120 = 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored. The share stays roughly the same at frontier scale.
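
The totals in the table can be re-derived from the hyperparameters alone. The following is a minimal sketch in plain Python; the function name `count_parameters` and the constant names are illustrative and are not taken from the training code.

```python
# Hyperparameters copied from the tables above.
VOCAB_SIZE, CTX_LEN, D_MODEL, FFN_DIM, N_LAYERS = 4096, 256, 128, 512, 4


def count_parameters():
    """Return (per-component counts, params per block, MLP params per block)."""
    layer_norm = 2 * D_MODEL                          # gamma and beta
    attn = 3 * D_MODEL * D_MODEL + D_MODEL * D_MODEL  # fused qkv + output proj, no bias
    mlp = 2 * D_MODEL * FFN_DIM                       # fc1 + fc2, no bias
    per_block = 2 * layer_norm + attn + mlp
    components = {
        "tok_emb": VOCAB_SIZE * D_MODEL,
        "pos_emb": CTX_LEN * D_MODEL,
        "blocks": N_LAYERS * per_block,
        "ln_f": layer_norm,
        "head": 0,  # tied to tok_emb, so it adds no new parameters
    }
    return components, per_block, mlp


components, per_block, mlp = count_parameters()
total = sum(components.values())
embedding_share = (components["tok_emb"] + components["pos_emb"]) / total
mlp_share = mlp / per_block

print(total)                      # 1345792
print(f"{embedding_share:.1%}")   # 41.4%
print(f"{mlp_share:.1%}")         # 66.5%
```

Changing `VOCAB_SIZE` or `D_MODEL` in this sketch is a quick way to see how the embedding share shrinks as the model body grows.
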
---
## Training
### Data
- **Dataset:** [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) (Eldan & Li, 2023)
- **Stories:** ~2.1M (train) + ~22K (validation)
- **Tokens (after BPE):** ~470M (train) + ~5M (validation)
- **Why TinyStories specifically:** synthetic dataset designed so vocabulary
and grammar stay within what a 3–4 year-old understands, making coherent
generation possible at very small model scales. Without this curation, a
1.35M-param model on general web text produces gibberish.
### Tokenizer
- **Type:** byte-level Byte-Pair Encoding (BPE)
- **Vocabulary:** 4,096 tokens (including special tokens `<unk>`, `<eos>`)
- **Trained on:** 50,000 stories from the train split (vocab converges
quickly; full corpus produces a near-identical tokenizer)
- **Avg compression:** ~4 characters per token on TinyStories text
### Optimization
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| β₁, β₂ | 0.9, 0.95 |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Linear warmup (200 steps) → cosine decay |
| Batch size (sequences) | 64 |
| Sequence length | 256 |
| Tokens per step | 16,384 |
| Total steps | 20,000 |
| Total tokens seen | ~327M |
| Gradient clipping | 1.0 (global L2 norm) |
| Random seed | 1337 |
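
The schedule row above can be written as a small function of the step number. This is a sketch under assumptions: the exact warmup formula and the choice to let the cosine reach the minimum learning rate exactly at step 20,000 are not stated in this card, and `lr_at` is a hypothetical name.

```python
import math

# Values from the optimization table.
PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 200, 20_000
BATCH_SIZE, SEQ_LEN = 64, 256


def lr_at(step):
    """Learning rate at a 0-indexed optimizer step (assumed formula)."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


tokens_per_step = BATCH_SIZE * SEQ_LEN
print(tokens_per_step)                # 16384
print(tokens_per_step * TOTAL_STEPS)  # 327680000
```

Halfway through the decay phase (step 10,100) this sketch gives the midpoint between the peak and minimum rates, 1.65e-4.
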
### Hardware & wall-clock
| | |
|---|---|
| Hardware | Apple M-series laptop (MPS backend) |
| Precision | float32 |
| Wall-clock | ~1.5 hours |
| Peak memory | ~1.5 GB |
| Disk footprint | ~1 GB tokenized corpus + 5.1 MB checkpoint |
---
## Evaluation
### Held-out validation loss
| Step | Val loss | Perplexity |
|---|---|---|
| 0 (init) | 8.32 | 4096 |
| ~17,500 | 2.26 | 9.59 |
| ~20,000 | **2.25** | **9.49** |
For context: a uniform random predictor over the 4,096-token vocabulary
would have loss `ln(4096) = 8.32` and perplexity 4,096. A perplexity of
9.49 means the model has effectively narrowed each prediction down to
~9.5 plausible next-token choices, on average.
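
The conversion between loss and perplexity is a one-liner, shown here as a quick check of the numbers in the table. The helper name `perplexity` is illustrative; losses are assumed to be in nats (natural log), which matches the `ln(4096)` baseline above.

```python
import math


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss_nats)


uniform_loss = math.log(4096)           # loss of a uniform guess over the vocab
print(f"{uniform_loss:.2f}")            # 8.32
print(f"{perplexity(2.25):.2f}")        # 9.49
```
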
> **Note on missing checkpoint progression.** This release contains only the
> final checkpoint. The original plan was to capture step-tagged snapshots
> (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability *emergence*
> across training: gibberish → words → grammar → narrative. A monitoring
> script was started after training had already passed the early steps, so
> only the final checkpoint was preserved. A v0.2 release with multi-checkpoint
> snapshots and an emergence comparison is planned.
### Capability probes
Three diagnostic prompts designed to probe specific capabilities, evaluated
on the final checkpoint (step 20,000, temperature 0.8, top-k 40):
**Short-range coherence:** *"Lily picked up the red ball and"*
> ...put it on the floor. She threw the red ball with her hands and pulled
> it onto the floor. She tried to make it shine, but it made her feel good.
> "Oh no, no!" she shouted. "Sharing is very dangerous!"
Pronoun continuity holds; ball persists as object. ✓
**Causal reasoning:** *"Tom was very sad because"*
> ...he had no one to play. He thought that it would be fun and share his
> toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"
"because" produces a real cause of sadness; emotional consistency holds for
2–3 sentences before drifting. ✓ (with caveats)
**Narrative composition:** *"The little dragon wanted to fly but"*
> ...her mommy said no. The bear was very sad that he was gone. He wanted
> to fly anymore and get lost.
Initial obstacle is set up correctly, but the model loses track of which
character is which (dragon → bear → "he"). ✗
This pattern (local coherence ✓, multi-sentence composition only partial) is
expected at this scale. A narrative arc requires planning across many tokens,
which is among the last capabilities to emerge as language models scale up.
---
## Intended use
**In scope:**
- Educational reference for the GPT-style transformer architecture
- Demonstration of end-to-end LLM training on consumer hardware
- Generating short, simple, TinyStories-style English children's narratives
- Exploring how sampling parameters (temperature, top-k, top-p) affect output
- Comparison baseline for tiny-model research
**Out of scope:**
- General-purpose text generation (vocabulary is restricted to TinyStories)
- Question answering, instruction following, or chat (no SFT or RLHF stage)
- Anything requiring factual accuracy (no factual grounding)
- Non-English text (English-only training data)
- Long-form generation (256-token context window)
---
## Limitations and biases
- **Distribution lock-in:** Trained exclusively on synthetic children's
stories. Generation outside this distribution (e.g., technical text,
adult themes, dialogue formats) will be incoherent.
- **No instruction following:** This is a base model with pre-training only.
It completes text; it does not answer questions or follow instructions.
- **Hallucination:** No factual grounding. The model has no concept of
"I don't know"; it produces the most statistically plausible
continuation, which is often false outside the training distribution.
- **Context window:** 256 tokens is too short to model long dependencies.
- **Synthetic data biases:** TinyStories was generated by GPT-3.5/4 with
prompted constraints, so it inherits some of that generator's stylistic
patterns and any biases encoded therein.
- **No safety training:** No RLHF, no Constitutional AI, no content
filtering. While the training data is innocuous, prompts that
push toward harmful outputs receive no safeguards.
- **Memorization vs generalization:** Some completions ("She was very
happy and they played all day") are likely memorized stylistic
patterns rather than novel generation.
---
## How to use
### Inference
```python
from inference import NanoSLMInference
slm = NanoSLMInference("ckpt.pt", "tokenizer.json")
text = slm.generate(
"Once upon a time, there was a little",
max_new_tokens=200,
temperature=0.8,
top_k=40,
)
print(text)
```
### Sampling parameters
| Parameter | Effect |
|---|---|
| `temperature` | Scales logits before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0. |
| `top_k` | Keep only the *k* highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. |
| `top_p` (nucleus) | Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. |
| `seed` | Sets PyTorch RNG for reproducibility. |
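
To make the table concrete, here is a minimal plain-Python sketch of how the three filters could be combined for a single decoding step. It is not the code in `inference.py`: the function name `sample_next`, the order of operations (top-k before top-p), and the use of Python's `random` module in place of the PyTorch RNG are all assumptions made for illustration.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=None, top_p=None, rng=random):
    """Pick one token id from a list of raw logits."""
    if temperature == 0:
        # Greedy decoding: always take the single highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature: divide logits before the softmax.
    scaled = [x / temperature for x in logits]

    # Rank token ids from most to least likely, then apply top-k.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]

    # Softmax over the surviving tokens (subtract the max for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    if top_p is not None:
        # Nucleus: keep the shortest prefix whose cumulative probability >= top_p.
        cumulative, cutoff = 0.0, len(probs)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        order, probs = order[:cutoff], probs[:cutoff]

    # random.choices treats the weights as relative, so no renormalization is needed.
    return rng.choices(order, weights=probs, k=1)[0]


seeded = random.Random(1337)  # a fixed seed makes the draw reproducible
print(sample_next([1.0, 3.0, 2.0], temperature=0.8, top_k=2, rng=seeded))
```

With `top_k=2` the lowest-scoring token (id 0) can never be drawn, whatever the seed.
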
---
## How this model is served
A live demo is hosted on [Hugging Face Spaces](https://huggingface.co/spaces/brettleehari/microgpt-demo).
The serving stack is intentionally minimal:
```
User browser
↓ HTTPS
HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
↓
Gradio + FastAPI/uvicorn
↓
PyTorch eager-mode forward pass on CPU
↓
Autoregressive token generation, one token per pass
```
Approximate latency for 100 generated tokens: **~3 seconds on Spaces' free
CPU**, **~0.5 seconds on Apple M-series with MPS**.
What this serving setup deliberately does *not* implement (each is a separate
upgrade and a useful learning exercise):
- **KV-caching:** every generation step re-processes all prior tokens.
A real implementation caches K/V tensors and pays only for the new token.
- **Continuous batching:** multiple users would queue serially. Production
servers (vLLM, TGI) batch concurrent requests dynamically.
- **Quantization:** weights are float32. int8 would shrink memory ~4×, int4 ~8×.
- **Compiled graphs:** eager-mode PyTorch leaves performance on the table
vs `torch.compile()`, ONNX Runtime, or a dedicated engine.
For a model this small the overheads don't matter. At any production scale,
*every one of the above becomes critical to unit economics*.
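
A rough way to see what the missing KV cache costs is to count how many token positions pass through the model during one generation. The sketch below is simple arithmetic, not a benchmark; `positions_processed` is a hypothetical helper, and it ignores the 256-token context cap and the (cheaper) attention reads over cached keys and values.

```python
def positions_processed(prompt_len, new_tokens, kv_cache):
    """Token positions fed through the model to generate `new_tokens` tokens."""
    if new_tokens == 0:
        return 0
    if kv_cache:
        # One prefill pass over the prompt, then one position per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-runs the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


print(positions_processed(10, 100, kv_cache=False))  # 5950
print(positions_processed(10, 100, kv_cache=True))   # 109
```

The uncached count grows quadratically with the number of generated tokens, which is why the gap is negligible at 256 tokens and decisive at long contexts.
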
---
## Comparison with frontier models
The architecture belongs to the same decoder-only family as GPT-2/3, Llama,
and Mistral, and, as far as public descriptions go, Claude. The differences
below are evolutionary refinements, not categorical changes: the core
"decoder-only transformer trained with next-token prediction" recipe is the same.
| | microGPT (this) | Llama 3 70B |
|---|---|---|
| Parameters | 1.35M | 70B (~52,000× larger) |
| Layers | 4 | 80 |
| `d_model` | 128 | 8,192 |
| Heads | 4 (multi-head) | 64 (grouped-query attention) |
| Context | 256 | 8,192 (128K in Llama 3.1) |
| Vocab | 4,096 | 128,256 |
| Position | Learned absolute | Rotary (RoPE) |
| Activation | GELU | SwiGLU |
| Normalization | LayerNorm | RMSNorm |
| Training tokens | ~327M | ~15T (~46,000× more) |
| Training compute | ~1.5 hours on one laptop (well under 1 kWh) | many MW-months on H100 clusters |
---
## Glossary
A short reference for the terminology used above. Worth absorbing: these
terms come up constantly in AI literature and interviews.
**Parameter / weight.** A single learnable number stored in the model.
Updated during training, read during inference. A "1.35M parameter model"
literally has 1.35M of these numbers.
**Embedding.** A learned vector representation of a discrete object (token,
position). Implemented as a lookup table.
**Token.** The atomic unit of text the model operates on. Produced by the
tokenizer; typically ~4 characters of English per token for byte-level BPE.
**Tokenizer.** The deterministic, reversible function that converts strings
to integer ID sequences and back. Decisions made here (vocab size, BPE
merges) propagate through the entire model.
**BPE (Byte-Pair Encoding).** A subword tokenization algorithm that
iteratively merges the most frequent adjacent pairs of symbols into new
vocabulary entries.
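
The merge loop can be sketched in a few lines of plain Python. This toy version works on characters of whitespace-split words rather than raw bytes, so it illustrates the algorithm only and is not the tokenizer shipped with this model; all function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; each merge would become one vocab entry."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


print(train_bpe("aa aa aa ab", num_merges=5))  # [('a', 'a'), ('a', 'b')]
```
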
**Logits.** The raw, unnormalized scores the model outputs, one per
vocabulary token at each position. They become a probability distribution
after softmax.
**Softmax.** Function that converts logits to probabilities by exponentiating
and normalizing.
**Cross-entropy loss.** The training objective: how surprised the model is
by the correct next token. Equals 0 if the model assigned probability 1 to
the right answer; equals `ln(vocab_size)` if the model is uniformly
uninformed.
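
The softmax and cross-entropy definitions above fit in a few lines of plain Python. This is an illustrative sketch for a single position with hypothetical function names; real training code computes the same quantity over batched tensors.

```python
import math


def softmax(logits):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct token id."""
    return -math.log(softmax(logits)[target])


# Uniform logits over 8 tokens: the loss equals ln(8).
print(abs(cross_entropy([0.0] * 8, target=3) - math.log(8)) < 1e-12)  # True
# A confident, correct prediction drives the loss toward 0.
print(cross_entropy([10.0, 0.0, 0.0], target=0) < 0.001)              # True
```
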
**Perplexity.** `exp(loss)`. The "effective number of choices" the model is
deciding between. Useful because it has a more intuitive scale than loss.
**Decoder-only / autoregressive.** The model only attends to past tokens
(causal mask), and generates one token at a time conditioned on what it has
already produced.
**Self-attention.** The mechanism by which each position computes a
weighted combination of all (allowed) other positions, where the weights
depend on the content at each position.
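
As a concrete illustration, here is single-head causal attention over plain Python lists. It is a sketch only: it takes already-projected query, key, and value vectors as input, skips the learned projections and multiple heads, and the function name is hypothetical. The model itself uses `F.scaled_dot_product_attention`.

```python
import math


def causal_self_attention(q, k, v):
    """Single-head attention over lists of vectors; position t sees positions 0..t."""
    head_dim = len(q[0])
    out = []
    for t in range(len(q)):
        # Scaled dot-product scores against the current and earlier positions only.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(head_dim)
            for s in range(t + 1)
        ]
        # Softmax over the allowed positions.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # The output is the attention-weighted average of the value vectors.
        out.append([
            sum(w * v[s][d] for s, w in enumerate(weights))
            for d in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(causal_self_attention(x, x, x)[0])  # [1.0, 0.0]
```

The first position can only attend to itself, so its output is its own value vector; changing a later token never changes an earlier output.
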
**Multi-head attention.** Self-attention computed in parallel across `n`
subspaces ("heads"), each with `d_model / n` dimensions. Different heads
empirically learn to specialize.
**KV cache.** At inference time, the Key and Value tensors from previous
tokens can be cached and reused, avoiding redundant computation. Critical
for production serving; not implemented in this model.
**Pre-LayerNorm.** Applying LayerNorm *before* the attention/MLP sublayers,
not after. Stabilizes training of deep transformers.
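
The ordering is easiest to see in code. The sketch below operates on a single vector with placeholder sublayers passed in as functions; `layer_norm` omits the learned scale and shift for brevity, and both names are illustrative rather than the model's actual modules.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gamma/beta)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def pre_ln_block(x, attn, mlp):
    """Pre-LN: each sublayer reads a normalized copy and adds its output back to x."""
    x = [xi + ai for xi, ai in zip(x, attn(layer_norm(x)))]
    x = [xi + mi for xi, mi in zip(x, mlp(layer_norm(x)))]
    return x


# With sublayers that output zeros, the residual stream passes through untouched.
zero = lambda h: [0.0] * len(h)
print(pre_ln_block([1.0, 2.0, 3.0, 4.0], zero, zero))  # [1.0, 2.0, 3.0, 4.0]
```

The key point is that `x` itself is never normalized in place; only the sublayer inputs are, which keeps an unmodified residual path from input to output.
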
**Weight tying.** Sharing parameters between the input embedding matrix and
the output projection matrix. Saves memory; usually improves quality.
**Cosine learning-rate schedule.** Learning rate ramps up linearly during
warmup, then decays following a cosine curve. Standard for transformer
training.
**Gradient clipping.** Capping the global L2 norm of gradients during
backpropagation to prevent destabilizing weight updates.
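
A minimal sketch of the idea on plain Python lists follows. The function name and the small `eps` term are assumptions for illustration; PyTorch's `torch.nn.utils.clip_grad_norm_` performs a comparable operation on tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """grads: one list of floats per parameter. Returns (clipped grads, original norm)."""
    # The norm is taken over all gradients together, not per parameter.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale down only when the norm exceeds max_norm; never scale up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in grad] for grad in grads], total_norm


clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm)  # 5.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.
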
**MPS (Metal Performance Shaders).** Apple's GPU acceleration backend for
PyTorch on M-series chips. The Apple Silicon equivalent of CUDA.
**Pre-training.** The stage of training described here: minimize next-token
prediction loss on a large corpus. Produces a *base model*.
**SFT (Supervised Fine-Tuning).** A subsequent training stage on
`(instruction, ideal response)` pairs. Teaches the model to follow
instructions. Not done for this model.
**RLHF (Reinforcement Learning from Human Feedback).** A further training
stage using preference data. Aligns model behavior with human preferences.
Not done for this model.
---
## Citation
If this model or its companion code helped you, please cite or link to:
```
@misc{microgpt,
author = {Brett Lee Hary},
title = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
year = {2026},
howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
}
```
### Acknowledgements
- Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT): the
reference implementation that made this approachable.
- Eldan & Li (2023), [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759): the dataset and the insight that data quality can substitute for model scale.
- Vaswani et al. (2017), [Attention Is All You Need](https://arxiv.org/abs/1706.03762): the original transformer.
- The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for
the infrastructure that makes projects like this trivial to share.