---
license: mit
language:
- en
tags:
- text-generation
- transformer
- educational
- tiny-llm
- from-scratch
- decoder-only
- gpt
datasets:
- roneneldan/TinyStories
pipeline_tag: text-generation
library_name: pytorch
model-index:
- name: microgpt
  results:
  - task:
      type: text-generation
      name: Story completion
    dataset:
      name: TinyStories (validation split)
      type: roneneldan/TinyStories
    metrics:
    - type: cross-entropy
      value: 2.25
      name: Validation cross-entropy loss
    - type: perplexity
      value: 9.49
      name: Validation perplexity
---

# microGPT

A **1.35M-parameter decoder-only transformer** trained from scratch on the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset. The entire training run took roughly 1.5 hours on an Apple Silicon laptop. At roughly 130,000× smaller than GPT-3 (175B parameters), it can still produce coherent, simple children's stories.

This is an **educational artifact**, not a production model. Its purpose is to make every component of a modern LLM legible, debuggable, and rebuildable on consumer hardware.

---

## Quick facts

| | |
|---|---|
| **Architecture** | Decoder-only transformer (GPT-style) |
| **Parameters** | 1,345,792 trainable (1.35M) |
| **File size on disk** | ~5.1 MB (float32) |
| **Training data** | ~470M tokens of TinyStories |
| **Training compute** | ~1.5 hours on Apple Silicon (MPS) |
| **Final val loss** | 2.25 (perplexity 9.49) |
| **Context window** | 256 tokens |
| **Tokenizer** | Byte-level BPE, vocab=4096 |
| **License** | MIT |

---

## Architecture in detail

```
Input tokens (B, T)
      │
      ├─► Token Embedding (4096 → 128)
      │                         │
      └─► Position Embedding ───┘   ← element-wise sum
      │
      ▼  (B, T, 128)
┌──── Block × 4 ───────────────────────────────┐
│                                              │
│   h = LayerNorm(x)                           │
│   x = x + CausalSelfAttention(h)   ← 4 heads │
│   h = LayerNorm(x)                           │
│   x = x + MLP(h)   ← 128→512→128, GELU       │
│                                              │
└──────────────────────────────────────────────┘
      │
      ▼  (B, T, 128)
  LayerNorm
      │
      ▼
  Linear (128 → 4096)   ← weight-tied with token embedding
      │
      ▼  (B, T, 4096)
   Logits
```

| Hyperparameter | Value | Notes |
|---|---|---|
| `n_layers` | 4 | Stacked transformer blocks |
| `d_model` | 128 | Hidden dimension |
| `n_heads` | 4 | Each head is 128/4 = 32 dim |
| `head_dim` | 32 | Per-head dimensionality |
| `ffn_dim` | 512 | MLP intermediate width (4×d_model) |
| `ctx_len` | 256 | Maximum input length in tokens |
| `vocab_size` | 4,096 | BPE-derived vocabulary |
| Normalization | LayerNorm | Pre-LN (applied to the sublayer input; the residual stream itself is not normalized) |
| Position encoding | Learned | Absolute, additive |
| Activation | GELU | In the MLP |
| Attention | Multi-head, causal | Implemented via `F.scaled_dot_product_attention` |
| Embedding tying | Yes | Output projection shares weight with `tok_emb` |
| Bias on linear layers | No | Following common modern practice |
| Dropout | 0.1 (training) | 0.0 at inference |

### Parameter breakdown: where the 1.35M live

| Component | Shape | Params | % |
|---|---|---|---|
| Token embeddings (`tok_emb.weight`) | (4096, 128) | 524,288 | 38.9% |
| Position embeddings (`pos_emb.weight`) | (256, 128) | 32,768 | 2.4% |
| 4 × transformer block | n/a | 788,480 | 58.6% |
| └─ Per block: `ln1` (γ, β) | (128,) × 2 | 256 | |
| └─ Per block: `attn.qkv` | (384, 128) | 49,152 | |
| └─ Per block: `attn.proj` | (128, 128) | 16,384 | |
| └─ Per block: `ln2` (γ, β) | (128,) × 2 | 256 | |
| └─ Per block: `mlp.fc1` | (512, 128) | 65,536 | |
| └─ Per block: `mlp.fc2` | (128, 512) | 65,536 | |
| Final LayerNorm (`ln_f`) | (128,) × 2 | 256 | 0.02% |
| Output projection (`head.weight`) | (4096, 128) | 0 | tied |
| **Total** | | **1,345,792** | |

Two observations worth absorbing:

- **Embeddings are 41% of total parameters** at this scale (token plus position embeddings: 557,056 of 1,345,792). This is typical of small models, where the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
- **MLPs (`fc1` + `fc2`) account for about two-thirds of every block's params**: 131,072 of 197,120 ≈ 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored. At frontier scale this stays roughly true.

---

## Training

### Data

- **Dataset:** [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) (Eldan & Li, 2023)
- **Stories:** ~2.1M (train) + ~22K (validation)
- **Tokens (after BPE):** ~470M (train) + ~5M (validation)
- **Why TinyStories specifically:** it is a synthetic dataset designed so that vocabulary and grammar stay within what a 3–4 year-old understands, making coherent generation possible at very small model scales. Without this curation, a 1.35M-param model on general web text produces gibberish.
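
The totals in the parameter breakdown table above can be re-derived from the hyperparameters alone. The sketch below is plain arithmetic, not the model's code; the variable names are illustrative, and it assumes what the tables state: no bias terms on linear layers, a fused QKV projection, and a tied output head that adds no parameters.

```python
# Hyperparameters copied from the architecture table.
n_layers, d_model, ffn_dim, ctx_len, vocab_size = 4, 128, 512, 256, 4096

tok_emb = vocab_size * d_model               # 524,288
pos_emb = ctx_len * d_model                  # 32,768
ln = 2 * d_model                             # gamma + beta = 256
qkv = 3 * d_model * d_model                  # fused Q, K, V projection, no bias
proj = d_model * d_model                     # attention output projection
mlp = d_model * ffn_dim + ffn_dim * d_model  # fc1 + fc2

block = ln + qkv + proj + ln + mlp           # one transformer block
total = tok_emb + pos_emb + n_layers * block + ln  # final LayerNorm; head is tied

print(f"per block:       {block:,}")
print(f"total:           {total:,}")
print(f"embedding share: {(tok_emb + pos_emb) / total:.1%}")
print(f"MLP share/block: {mlp / block:.1%}")
```

Changing `vocab_size` or `d_model` here is a quick way to see how fast the embedding share shrinks as the transformer body grows.
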

### Tokenizer

- **Type:** byte-level Byte-Pair Encoding (BPE)
- **Vocabulary:** 4,096 tokens (including two special tokens)
- **Trained on:** 50,000 stories from the train split (vocab converges quickly; full corpus produces a near-identical tokenizer)
- **Avg compression:** ~4 characters per token on TinyStories text

### Optimization

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| β₁, β₂ | 0.9, 0.95 |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Linear warmup (200 steps) → cosine decay |
| Batch size (sequences) | 64 |
| Sequence length | 256 |
| Tokens per step | 16,384 |
| Total steps | 20,000 |
| Total tokens seen | ~327M |
| Gradient clipping | 1.0 (global L2 norm) |
| Random seed | 1337 |

### Hardware & wall-clock

| | |
|---|---|
| Hardware | Apple M-series laptop (MPS backend) |
| Precision | float32 |
| Wall-clock | ~1.5 hours |
| Peak memory | ~1.5 GB |
| Disk footprint | ~1 GB tokenized corpus + 5.1 MB checkpoint |

---

## Evaluation

### Held-out validation loss

| Step | Val loss | Perplexity |
|---|---|---|
| 0 (init) | 8.32 | 4096 |
| ~17,500 | 2.26 | 9.59 |
| ~20,000 | **2.25** | **9.49** |

For context: a uniform random predictor over the 4,096-token vocabulary would have loss `ln(4096) = 8.32` and perplexity 4,096. A perplexity of 9.49 means the model has effectively narrowed each prediction down to ~9.5 plausible next-token choices, on average.

> **Note on missing checkpoint progression.** This release contains only the
> final checkpoint. The original plan was to capture step-tagged snapshots
> (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability *emergence*
> across training (gibberish → words → grammar → narrative). A monitoring
> script was started after training had already passed the early steps, so
> only the final checkpoint was preserved. A v0.2 release with multi-checkpoint
> snapshots and an emergence comparison is planned.
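
The numbers in the optimization and evaluation tables above can be sanity-checked with a few lines of arithmetic. The sketch below is not the training script: `lr_at` is a hypothetical helper, and its exact warmup and cosine formula is an assumption (one common form of "linear warmup, then cosine decay to a minimum"), since the tables give only the endpoints.

```python
import math

def lr_at(step, peak=3e-4, floor=3e-5, warmup=200, total=20_000):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumed formula; only peak, floor, warmup and total come from the table.
    """
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min((step - warmup) / (total - warmup), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

tokens_per_step = 64 * 256                # batch size × sequence length
total_tokens = tokens_per_step * 20_000   # tokens seen over the whole run

uniform_loss = math.log(4096)             # loss of a uniform guess over the vocab
perplexity = math.exp(2.25)               # perplexity = exp(cross-entropy loss)

print(f"tokens/step:  {tokens_per_step:,}")
print(f"total tokens: {total_tokens:,}")
print(f"uniform loss: {uniform_loss:.2f}")
print(f"perplexity:   {perplexity:.2f}")
print(f"lr at steps 0 / 200 / 20000: {lr_at(0):.2e} / {lr_at(200):.2e} / {lr_at(20_000):.2e}")
```

Note that ~327M tokens seen is less than the ~470M tokens in the training split, so under these settings the run covers less than one full pass over the data.
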

### Capability probes

Three diagnostic prompts designed to probe specific capabilities, evaluated on the final checkpoint (step 20,000, T=0.8, top-k=40):

**Short-range coherence:** *"Lily picked up the red ball and"*

> ...put it on the floor. She threw the red ball with her hands and pulled
> it onto the floor. She tried to make it shine, but it made her feel good.
> "Oh no, no!" she shouted. "Sharing is very dangerous!"

Pronoun continuity holds; ball persists as object. ✓

**Causal reasoning:** *"Tom was very sad because"*

> ...he had no one to play. He thought that it would be fun and share his
> toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"

"because" produces a real cause of sadness; emotional consistency holds for 2–3 sentences before drifting. ✓ (with caveats)

**Narrative composition:** *"The little dragon wanted to fly but"*

> ...her mommy said no. The bear was very sad that he was gone. He wanted
> to fly anymore and get lost.

Initial obstacle is set up correctly, but the model loses track of which character is which (dragon → bear → "he"). ✗

This pattern (local coherence ✓, multi-sentence composition partial) is expected at this scale. Narrative arc requires planning across many tokens, which is one of the last capabilities to emerge in language models even at frontier scale.
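
The probes above were sampled with temperature 0.8 and top-k 40. As a rough illustration of what those two settings do at each generation step, here is a minimal sketch in plain Python. `sample_next` is a hypothetical helper written for this example, not the function used by the model's inference code, and it operates on a plain list of logits rather than a tensor.

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=40, rng=random):
    """Pick one token ID from raw logits using temperature and top-k sampling."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Keep only the indices of the k largest logits.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature-scaled softmax over the kept logits.
    # Subtracting the max before exponentiating avoids overflow.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    # random.choices normalizes the weights internally.
    return rng.choices(kept, weights=weights, k=1)[0]

# Toy vocabulary of four tokens; token 3 has the largest logit.
toy_logits = [0.1, 2.0, -1.0, 5.0]
print(sample_next(toy_logits, temperature=0))   # greedy pick
print(sample_next(toy_logits, top_k=2))         # samples from tokens 3 and 1 only
```

Lower temperatures sharpen the distribution toward the top token, while a smaller `top_k` removes low-probability tokens from consideration entirely.
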

---

## Intended use

**In scope:**

- Educational reference for the GPT-style transformer architecture
- Demonstration of end-to-end LLM training on consumer hardware
- Generating short, simple, TinyStories-style English children's narratives
- Exploring how sampling parameters (temperature, top-k, top-p) affect output
- Comparison baseline for tiny-model research

**Out of scope:**

- General-purpose text generation (vocabulary is restricted to TinyStories)
- Question answering, instruction following, or chat (no SFT or RLHF stage)
- Anything requiring factual accuracy (no factual grounding)
- Non-English text (English-only training data)
- Long-form generation (256-token context window)

---

## Limitations and biases

- **Distribution lock-in:** Trained exclusively on synthetic children's stories. Generation outside this distribution (e.g., technical text, adult themes, dialogue formats) will be incoherent.
- **No instruction following:** This is a base model (pre-training only). It completes text; it does not answer questions or follow instructions.
- **Hallucination:** No factual grounding. The model has no concept of "I don't know"; it produces the most statistically plausible continuation, which is often false outside the training distribution.
- **Context window:** 256 tokens is too short to model long dependencies.
- **Synthetic data biases:** TinyStories was generated by GPT-3.5/4 with prompted constraints, so it inherits some of that generator's stylistic patterns and any biases encoded therein.
- **No safety training:** No RLHF, no Constitutional AI, no content filtering. While the training data is innocuous, prompts that push toward harmful outputs receive no safeguards.
- **Memorization vs generalization:** Some completions ("She was very happy and they played all day") are likely memorized stylistic patterns rather than novel generation.

---

## How to use

### Inference

```python
from inference import NanoSLMInference

# Load the checkpoint and tokenizer shipped with this model.
slm = NanoSLMInference("ckpt.pt", "tokenizer.json")

text = slm.generate(
    "Once upon a time, there was a little",
    max_new_tokens=200,
    temperature=0.8,
    top_k=40,
)
print(text)
```

### Sampling parameters

| Parameter | Effect |
|---|---|
| `temperature` | Scales logits before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0. |
| `top_k` | Keep only the *k* highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. |
| `top_p` (nucleus) | Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. |
| `seed` | Sets PyTorch RNG for reproducibility. |

---

## How this model is served

A live demo is hosted on [Hugging Face Spaces](https://huggingface.co/spaces/brettleehari/microgpt-demo). The serving stack is intentionally minimal:

```
User browser
    ↓  HTTPS
HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
    ↓
Gradio + FastAPI/uvicorn
    ↓
PyTorch eager-mode forward pass on CPU
    ↓
Autoregressive token generation, one token per pass
```

Approximate latency for 100 generated tokens: **~3 seconds on Spaces' free CPU**, **~0.5 seconds on Apple M-series with MPS**.

What this serving setup deliberately does *not* implement (each is a separate upgrade and a useful learning exercise):

- **KV-caching:** every generation step re-processes all prior tokens. A real implementation caches K/V tensors and pays only for the new token.
- **Continuous batching:** multiple users would queue serially. Production servers (vLLM, TGI) batch concurrent requests dynamically.
- **Quantization:** weights are float32. int8 would shrink weight memory ~4×, and int4 ~8×.
- **Compiled graphs:** eager-mode PyTorch leaves performance on the table vs `torch.compile()`, ONNX Runtime, or a dedicated engine.

For a model this small the overheads don't matter.
At any production scale, *every one of the above becomes critical to unit economics*.

---

## Comparison with frontier models

The architecture belongs to the same decoder-only family as GPT-2/3, Llama, Mistral, and Claude. The differences below are evolutionary refinements, not categorical changes: the core "decoder-only transformer trained with next-token prediction" recipe is the same.

| | microGPT (this) | Llama 3 70B |
|---|---|---|
| Parameters | 1.35M | 70B (~52,000× larger) |
| Layers | 4 | 80 |
| `d_model` | 128 | 8,192 |
| Heads | 4 (multi-head) | 64 (grouped-query attention) |
| Context | 256 | 8,192 (128K in Llama 3.1) |
| Vocab | 4,096 | 128,256 |
| Position | Learned absolute | Rotary (RoPE) |
| Activation | GELU | SwiGLU |
| Normalization | LayerNorm | RMSNorm |
| Training tokens | ~327M | ~15T (~46,000× more) |
| Training compute | ~1.5 hours on one laptop | many MW-months on H100 clusters |

---

## Glossary

A short reference for the terminology used above. Worth absorbing, since these terms come up constantly in AI literature and interviews.

**Parameter / weight.** A single learnable number stored in the model. Updated during training, read during inference. A "1.35M parameter model" literally has 1.35M of these numbers.

**Embedding.** A learned vector representation of a discrete object (token, position). Implemented as a lookup table.

**Token.** The atomic unit of text the model operates on. Produced by the tokenizer; typically ~4 characters of English per token for byte-level BPE.

**Tokenizer.** The deterministic, reversible function that converts strings to integer ID sequences and back. Decisions made here (vocab size, BPE merges) propagate through the entire model.

**BPE (Byte-Pair Encoding).** A subword tokenization algorithm that iteratively merges the most frequent adjacent pairs of symbols into new vocabulary entries.

**Logits.** The raw, unnormalized scores the model outputs, one per vocabulary token at each position. They become a probability distribution after softmax.

**Softmax.** Function that converts logits to probabilities by exponentiating and normalizing.

**Cross-entropy loss.** The training objective: how surprised the model is by the correct next token. Equals 0 if the model assigned probability 1 to the right answer; equals `ln(vocab_size)` if the model is uniformly uninformed.

**Perplexity.** `exp(loss)`. The "effective number of choices" the model is deciding between. Useful because it has a more intuitive scale than loss.

**Decoder-only / autoregressive.** The model only attends to past tokens (causal mask), and generates one token at a time conditioned on what it has already produced.

**Self-attention.** The mechanism by which each position computes a weighted combination of all (allowed) other positions, where the weights depend on the content at each position.

**Multi-head attention.** Self-attention computed in parallel across `n` subspaces ("heads"), each with `d_model / n` dimensions. Different heads empirically learn to specialize.

**KV cache.** At inference time, the Key and Value tensors from previous tokens can be cached and reused, avoiding redundant computation. Critical for production serving; not implemented in this model.

**Pre-LayerNorm.** Applying LayerNorm to the input of each attention/MLP sublayer rather than to its output, so the residual path itself stays unnormalized. Stabilizes training of deep transformers.

**Weight tying.** Sharing parameters between the input embedding matrix and the output projection matrix. Saves memory; usually improves quality.

**Cosine learning-rate schedule.** Learning rate ramps up linearly during warmup, then decays following a cosine curve. Standard for transformer training.

**Gradient clipping.** Rescaling gradients after backpropagation, before the optimizer step, so their global L2 norm does not exceed a threshold. Prevents destabilizing weight updates.

**MPS (Metal Performance Shaders).** Apple's GPU acceleration backend for PyTorch on M-series chips. The Apple Silicon equivalent of CUDA.

**Pre-training.** The stage of training described here: minimize next-token prediction loss on a large corpus. Produces a *base model*.

**SFT (Supervised Fine-Tuning).** A subsequent training stage on `(instruction, ideal response)` pairs. Teaches the model to follow instructions. Not done for this model.

**RLHF (Reinforcement Learning from Human Feedback).** A further training stage using preference data. Aligns model behavior with human preferences. Not done for this model.

---

## Citation

If this model or its companion code helped you, please cite or link to:

```
@misc{microgpt,
  author = {Brett Lee Hary},
  title = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
  year = {2026},
  howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
}
```

### Acknowledgements

- Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT): the reference implementation that made this approachable.
- Eldan & Li (2023), [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759): the dataset and the insight that data quality can substitute for model scale.
- Vaswani et al. (2017), [Attention Is All You Need](https://arxiv.org/abs/1706.03762): the original transformer.
- The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for the infrastructure that makes projects like this trivial to share.