| --- |
| license: mit |
| language: |
| - en |
| tags: |
| - text-generation |
| - transformer |
| - educational |
| - tiny-llm |
| - from-scratch |
| - decoder-only |
| - gpt |
| datasets: |
| - roneneldan/TinyStories |
| pipeline_tag: text-generation |
| library_name: pytorch |
| model-index: |
| - name: microgpt |
|   results: |
|   - task: |
|       type: text-generation |
|       name: Story completion |
|     dataset: |
|       name: TinyStories (validation split) |
|       type: roneneldan/TinyStories |
|     metrics: |
|     - type: cross-entropy |
|       value: 2.25 |
|       name: Validation cross-entropy loss |
|     - type: perplexity |
|       value: 9.49 |
|       name: Validation perplexity |
| --- |
| |
| # microGPT |
|
|
| A **1.35M-parameter decoder-only transformer** trained from scratch on the |
| [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset. |
| The entire training run took roughly 1.5 hours on an Apple Silicon laptop. |
| At ~130,000× fewer parameters than GPT-3 (175B), it can still produce coherent simple |
| children's stories. |
|
|
| This is an **educational artifact**, not a production model. Its purpose is |
| to make every component of a modern LLM legible, debuggable, and rebuildable |
| on consumer hardware. |
|
|
| --- |
|
|
| ## Quick facts |
|
|
| | | | |
| |---|---| |
| | **Architecture** | Decoder-only transformer (GPT-style) | |
| | **Parameters** | 1,345,792 trainable (1.35M) | |
| | **File size on disk** | ~5.1 MB (float32) | |
| **Training data** | ~470M tokens of TinyStories (train split; ~327M tokens seen during training) | |
| | **Training compute** | ~1.5 hours on Apple Silicon (MPS) | |
| | **Final val loss** | 2.25 (perplexity 9.49) | |
| | **Context window** | 256 tokens | |
| | **Tokenizer** | Byte-level BPE, vocab=4096 | |
| | **License** | MIT | |
|
|
| --- |
|
|
| ## Architecture in detail |
|
|
| ``` |
| Input tokens (B, T) |
|   │ |
|   ├──► Token Embedding (4096 → 128) ──┐ |
|   │                                   │ |
|   └──► Position Embedding ────────────┤  element-wise sum |
|                                       │ |
|                                       ▼ (B, T, 128) |
|   ┌──── Block × 4 ───────────────────────────────────────────────────┐ |
|   │                                                                  │ |
|   │  x = x + CausalSelfAttention(LayerNorm(x))  ← 4 heads            │ |
|   │  x = x + MLP(LayerNorm(x))                  ← 128→512→128, GELU  │ |
|   │                                                                  │ |
|   └──────────────────────────────────────────────────────────────────┘ |
|                                       │ |
|                                       ▼ (B, T, 128) |
|                                   LayerNorm |
|                                       │ |
|                                       ▼ |
|                              Linear (128 → 4096)  ← weight-tied with token embedding |
|                                       │ |
|                                       ▼ (B, T, 4096) |
|                                    Logits |
| ``` |
|
|
| | Hyperparameter | Value | Notes | |
| |---|---|---| |
| | `n_layers` | 4 | Stacked transformer blocks | |
| | `d_model` | 128 | Hidden dimension | |
| | `n_heads` | 4 | Each head is 128/4 = 32 dim | |
| | `head_dim` | 32 | Per-head dimensionality | |
| `ffn_dim` | 512 | MLP intermediate width (4×d_model) | |
| | `ctx_len` | 256 | Maximum input length in tokens | |
| | `vocab_size` | 4,096 | BPE-derived vocabulary | |
| | Normalization | LayerNorm | Pre-LN (applied before sublayers) | |
| | Position encoding | Learned | Absolute, additive | |
| | Activation | GELU | In the MLP | |
| | Attention | Multi-head, causal | Implemented via `F.scaled_dot_product_attention` | |
| | Embedding tying | Yes | Output projection shares weight with `tok_emb` | |
| | Bias on linear layers | No | Following common modern practice | |
| | Dropout | 0.1 (training) | 0.0 at inference | |
|
|
| ### Parameter breakdown: where the 1.35M live |
|
|
| | Component | Shape | Params | % | |
| |---|---|---|---| |
| Token embeddings (`tok_emb.weight`) | (4096, 128) | 524,288 | 39.0% | |
| Position embeddings (`pos_emb.weight`) | (256, 128) | 32,768 | 2.4% | |
| 4 × transformer block | - | 788,480 | 58.6% | |
| ├─ Per block: `ln1` (γ, β) | (128,) × 2 | 256 | | |
| ├─ Per block: `attn.qkv` | (384, 128) | 49,152 | | |
| ├─ Per block: `attn.proj` | (128, 128) | 16,384 | | |
| ├─ Per block: `ln2` (γ, β) | (128,) × 2 | 256 | | |
| ├─ Per block: `mlp.fc1` | (512, 128) | 65,536 | | |
| └─ Per block: `mlp.fc2` | (128, 512) | 65,536 | | |
| Final LayerNorm (`ln_f`) | (128,) × 2 | 256 | 0.02% | |
| | Output projection (`head.weight`) | (4096, 128) | 0 | tied | |
| | **Total** | | **1,345,792** | | |
|
|
| Two observations worth absorbing: |
|
|
| - **Embeddings are 41% of total parameters** at this scale. This is typical of small models, where the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error). |
| - **MLPs (`fc1` + `fc2`) account for about two-thirds of every block's params**: 131,072 of 197,120 ≈ 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored. At frontier scale this stays roughly true. |
| |
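The breakdown above can be reproduced from the hyperparameter table with a few lines of arithmetic. The sketch below is only a bookkeeping aid, not the model code: the `Config` class and `count_params` helper are hypothetical names introduced here, and the formulas assume bias-free linear layers, LayerNorms with both γ and β, and a tied output head, as stated in the tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Values copied from the hyperparameter table above.
    n_layers: int = 4
    d_model: int = 128
    ffn_dim: int = 512
    ctx_len: int = 256
    vocab_size: int = 4096


def count_params(cfg: Config) -> dict:
    """Per-component parameter counts (bias-free linears, tied head)."""
    d = cfg.d_model
    layer_norm = 2 * d                  # gamma and beta
    block = (
        layer_norm                      # ln1
        + 3 * d * d                     # attn.qkv
        + d * d                         # attn.proj
        + layer_norm                    # ln2
        + d * cfg.ffn_dim               # mlp.fc1
        + cfg.ffn_dim * d               # mlp.fc2
    )
    counts = {
        "tok_emb": cfg.vocab_size * d,
        "pos_emb": cfg.ctx_len * d,
        "blocks": cfg.n_layers * block,
        "ln_f": layer_norm,
        "head": 0,                      # tied to tok_emb, adds no new weights
    }
    counts["total"] = sum(counts.values())
    return counts


counts = count_params(Config())
for name, n in counts.items():
    print(f"{name:8s} {n:>9,d}  {100 * n / counts['total']:5.1f}%")
```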
| --- |
| |
| ## Training |
| |
| ### Data |
| |
| - **Dataset:** [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) (Eldan & Li, 2023) |
| - **Stories:** ~2.1M (train) + ~22K (validation) |
| - **Tokens (after BPE):** ~470M (train) + ~5M (validation) |
| - **Why TinyStories specifically:** synthetic dataset designed so vocabulary |
| and grammar stay within what a 3- to 4-year-old understands, making coherent |
| generation possible at very small model scales. Without this curation, a |
| 1.35M-param model on general web text produces gibberish. |
| |
| ### Tokenizer |
| |
| - **Type:** byte-level Byte-Pair Encoding (BPE) |
| - **Vocabulary:** 4,096 tokens (including special tokens `<unk>`, `<eos>`) |
| - **Trained on:** 50,000 stories from the train split (vocab converges |
| quickly; full corpus produces a near-identical tokenizer) |
| - **Avg compression:** ~4 characters per token on TinyStories text |
| |
| ### Optimization |
| |
| | Hyperparameter | Value | |
| |---|---| |
| | Optimizer | AdamW | |
| | β₁, β₂ | 0.9, 0.95 | |
| | Weight decay | 0.1 | |
| | Peak learning rate | 3e-4 | |
| | Min learning rate | 3e-5 | |
| | Schedule | Linear warmup (200 steps) → cosine decay | |
| | Batch size (sequences) | 64 | |
| | Sequence length | 256 | |
| | Tokens per step | 16,384 | |
| | Total steps | 20,000 | |
| | Total tokens seen | ~327M | |
| | Gradient clipping | 1.0 (global L2 norm) | |
| | Random seed | 1337 | |
| |
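The schedule and token counts in the table can be sketched as follows. This is an illustration under assumptions the table does not state: warmup is taken to start from zero, and the cosine decay is taken to reach the minimum learning rate exactly at the final step. The function name `lr_at` is made up for this example.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 200
TOTAL_STEPS = 20_000


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed: warmup ramps linearly from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


tokens_per_step = 64 * 256              # batch size x sequence length
total_tokens = tokens_per_step * TOTAL_STEPS

for step in (0, 100, 200, 10_100, 20_000):
    print(f"step {step:>6,d}  lr = {lr_at(step):.2e}")
print(f"tokens per step: {tokens_per_step:,d}, total: {total_tokens:,d}")
```

Since the train split holds ~470M tokens and the run sees ~327M, training stops before completing one full pass over the data.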
| ### Hardware & wall-clock |
| |
| | | | |
| |---|---| |
| | Hardware | Apple M-series laptop (MPS backend) | |
| | Precision | float32 | |
| | Wall-clock | ~1.5 hours | |
| | Peak memory | ~1.5 GB | |
| | Disk footprint | ~1 GB tokenized corpus + 5.1 MB checkpoint | |
| |
| --- |
| |
| ## Evaluation |
| |
| ### Held-out validation loss |
| |
| | Step | Val loss | Perplexity | |
| |---|---|---| |
| | 0 (init) | 8.32 | 4096 | |
| | ~17,500 | 2.26 | 9.59 | |
| | ~20,000 | **2.25** | **9.49** | |
| |
| For context: a uniform random predictor over the 4,096-token vocabulary |
| would have loss `ln(4096) = 8.32` and perplexity 4,096. A perplexity of |
| 9.49 means the model has effectively narrowed each prediction down to |
| ~9.5 plausible next-token choices, on average. |
| |
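The relationship between the loss and perplexity figures above is just `exp` and `ln`, which a short check makes concrete. The variable names are illustrative only.

```python
import math

vocab_size = 4096
uniform_loss = math.log(vocab_size)     # loss of a uniform predictor, in nats
final_loss = 2.25                       # final validation loss reported above

print(f"uniform loss       = {uniform_loss:.2f}")            # 8.32
print(f"uniform perplexity = {math.exp(uniform_loss):.0f}")  # 4096
print(f"final perplexity   = {math.exp(final_loss):.2f}")    # 9.49
```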
| > **Note on missing checkpoint progression.** This release contains only the |
| > final checkpoint. The original plan was to capture step-tagged snapshots |
| > (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability *emergence* |
| > across training: gibberish → words → grammar → narrative. A monitoring |
| > script was started after training had already passed the early steps, so |
| > only the final checkpoint was preserved. A v0.2 release with multi-checkpoint |
| > snapshots and an emergence comparison is planned. |
| |
| ### Capability probes |
| |
| Three diagnostic prompts designed to probe specific capabilities, evaluated |
| on the final checkpoint (step 20,000, T=0.8, top-k=40): |
| |
| **Short-range coherence:** *"Lily picked up the red ball and"* |
| > ...put it on the floor. She threw the red ball with her hands and pulled |
| > it onto the floor. She tried to make it shine, but it made her feel good. |
| > "Oh no, no!" she shouted. "Sharing is very dangerous!" |
| |
| Pronoun continuity holds; ball persists as object. Result: pass. |
| |
| **Causal reasoning:** *"Tom was very sad because"* |
| > ...he had no one to play. He thought that it would be fun and share his |
| > toys with Timmy. His mom took him inside and said, "I love you too, Timmy!" |
| |
| "because" produces a real cause of sadness; emotional consistency holds for |
| 2–3 sentences before drifting. Result: pass, with caveats. |
| |
| **Narrative composition:** *"The little dragon wanted to fly but"* |
| > ...her mommy said no. The bear was very sad that he was gone. He wanted |
| > to fly anymore and get lost. |
| |
| Initial obstacle is set up correctly, but the model loses track of which |
| character is which (dragon → bear → "he"). Result: partial. |
| |
| This pattern (local coherence holds, multi-sentence composition is partial) is |
| expected at this scale. Narrative arc requires planning across many tokens, |
| which is one of the last capabilities to emerge in language models even at |
| frontier scale. |
| |
| --- |
| |
| ## Intended use |
| |
| **In scope:** |
| - Educational reference for the GPT-style transformer architecture |
| - Demonstration of end-to-end LLM training on consumer hardware |
| - Generating short, simple, TinyStories-style English children's narratives |
| - Exploring how sampling parameters (temperature, top-k, top-p) affect output |
| - Comparison baseline for tiny-model research |
| |
| **Out of scope:** |
| - General-purpose text generation (vocabulary is restricted to TinyStories) |
| - Question answering, instruction following, or chat (no SFT or RLHF stage) |
| - Anything requiring factual accuracy (no factual grounding) |
| - Non-English text (English-only training data) |
| - Long-form generation (256-token context window) |
| |
| --- |
| |
| ## Limitations and biases |
| |
| - **Distribution lock-in:** Trained exclusively on synthetic children's |
| stories. Generation outside this distribution (e.g., technical text, |
| adult themes, dialogue formats) will be incoherent. |
| - **No instruction following:** This is a base model (pre-training only). |
| It completes text; it does not answer questions or follow instructions. |
| - **Hallucination:** No factual grounding. The model has no concept of |
| "I don't know"; it produces the most statistically plausible |
| continuation, which is often false outside the training distribution. |
| - **Context window:** 256 tokens is too short to model long dependencies. |
| - **Synthetic data biases:** TinyStories was generated by GPT-3.5/4 with |
| prompted constraints, so it inherits some of that generator's stylistic |
| patterns and any biases encoded therein. |
| - **No safety training:** No RLHF, no Constitutional AI, no content |
| filtering. While the training data is innocuous, prompts that |
| push toward harmful outputs receive no safeguards. |
| - **Memorization vs generalization:** Some completions ("She was very |
| happy and they played all day") are likely memorized stylistic |
| patterns rather than novel generation. |
| |
| --- |
| |
| ## How to use |
| |
| ### Inference |
| |
| ```python |
| from inference import NanoSLMInference |
| |
| slm = NanoSLMInference("ckpt.pt", "tokenizer.json") |
| |
| text = slm.generate( |
| "Once upon a time, there was a little", |
| max_new_tokens=200, |
| temperature=0.8, |
| top_k=40, |
| ) |
| print(text) |
| ``` |
| |
| ### Sampling parameters |
| |
| | Parameter | Effect | |
| |---|---| |
| | `temperature` | Scales logits before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0. | |
| | `top_k` | Keep only the *k* highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. | |
| | `top_p` (nucleus) | Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. | |
| | `seed` | Sets PyTorch RNG for reproducibility. | |
| |
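To make the table concrete, here is a minimal pure-Python sketch of how these three filters can be combined for a single decoding step. It is not the code in `inference.py`; `softmax` and `sample_next` are hypothetical helpers written for this example. One simplification to note: top-p is applied to the probabilities computed before the top-k cut, whereas some implementations renormalize after top-k first.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token id from raw logits using the filters described above."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    probs = softmax([x / temperature for x in logits])
    # Token ids sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:     # smallest prefix reaching mass p
                break
        order = kept

    # random.choices renormalizes the surviving weights internally.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(1337)
print(sample_next(logits, temperature=0))                       # greedy: 0
print(sample_next(logits, temperature=0.8, top_k=2, rng=rng))   # 0 or 1
```

Passing a seeded `random.Random` instance plays the role of the `seed` parameter: the same seed and logits give the same sample sequence.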
| --- |
| |
| ## How this model is served |
| |
| A live demo is hosted on [Hugging Face Spaces](https://huggingface.co/spaces/brettleehari/microgpt-demo). |
| The serving stack is intentionally minimal: |
| |
| ``` |
| User browser |
| │ HTTPS |
| HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM) |
| │ |
| Gradio + FastAPI/uvicorn |
| │ |
| PyTorch eager-mode forward pass on CPU |
| │ |
| Autoregressive token generation, one token per pass |
| ``` |
| |
| Approximate latency for 100 generated tokens: **~3 seconds on Spaces' free |
| CPU**, **~0.5 seconds on Apple M-series with MPS**. |
|
|
| What this serving setup deliberately does *not* implement (each is a separate |
| upgrade and a useful learning exercise): |
|
|
| - **KV-caching:** every generation step re-processes all prior tokens. |
| A real implementation caches K/V tensors and pays only for the new token. |
| - **Continuous batching:** multiple users would queue serially. Production |
| servers (vLLM, TGI) batch concurrent requests dynamically. |
| - **Quantization:** weights are float32. int8 would shrink weight memory ~4×, int4 ~8×. |
| - **Compiled graphs:** eager-mode PyTorch leaves performance on the table |
| vs `torch.compile()`, ONNX Runtime, or a dedicated engine. |
|
|
| For a model this small the overheads don't matter. At any production scale, |
| *every one of the above becomes critical to unit economics*. |
|
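A rough back-of-the-envelope sketch shows why the first and third items matter as models grow. The helper below is hypothetical and counts only how many token positions pass through the network; it ignores that attention over a longer cache still costs more per step, and the quantized sizes ignore the per-group scale factors real quantization schemes store.

```python
def positions_processed(prompt_len: int, new_tokens: int, kv_cache: bool) -> int:
    """Total token positions fed through the model to generate `new_tokens`."""
    if new_tokens <= 0:
        return 0
    if kv_cache:
        # One prefill pass over the prompt, then one position per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-runs the prompt plus the i tokens made so far.
    return sum(prompt_len + i for i in range(new_tokens))


N_PARAMS = 1_345_792
BYTES_PER_WEIGHT = {"float32": 4, "int8": 1, "int4": 0.5}

prompt_len, new_tokens = 10, 100        # fits inside the 256-token context
print(positions_processed(prompt_len, new_tokens, kv_cache=False))  # 5950
print(positions_processed(prompt_len, new_tokens, kv_cache=True))   # 109
for dtype, nbytes in BYTES_PER_WEIGHT.items():
    print(f"{dtype:8s} {N_PARAMS * nbytes / 1e6:.2f} MB")
```

The uncached count grows quadratically with the number of generated tokens, while the cached count grows linearly.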
|
| --- |
|
|
| ## Comparison with frontier models |
|
|
| The architecture follows the same decoder-only design as GPT-2/3, Llama, and Mistral, and |
| presumably Claude, whose architecture is not publicly documented. The differences below are evolutionary refinements, not categorical |
| changes: the core "decoder-only transformer trained with next-token |
| prediction" recipe is the same. |
|
|
| | | microGPT (this) | Llama 3 70B | |
| |---|---|---| |
| Parameters | 1.35M | 70B (~52,000× larger) | |
| | Layers | 4 | 80 | |
| | `d_model` | 128 | 8,192 | |
| | Heads | 4 (multi-head) | 64 (grouped-query attention) | |
| Context | 256 | 8,192 (128K in Llama 3.1) | |
| | Vocab | 4,096 | 128,256 | |
| | Position | Learned absolute | Rotary (RoPE) | |
| | Activation | GELU | SwiGLU | |
| | Normalization | LayerNorm | RMSNorm | |
| Training tokens | ~327M | ~15T (~46,000× more) | |
| Training energy | well under 1 kWh (~1.5 h on a laptop) | many MW-months on H100 clusters | |
|
|
| --- |
|
|
| ## Glossary |
|
|
| A short reference for the terminology used above. Worth absorbing: these |
| terms come up constantly in AI literature and interviews. |
|
|
| **Parameter / weight.** A single learnable number stored in the model. |
| Updated during training, read during inference. A "1.35M parameter model" |
| literally has 1.35M of these numbers. |
|
|
| **Embedding.** A learned vector representation of a discrete object (token, |
| position). Implemented as a lookup table. |
|
|
| **Token.** The atomic unit of text the model operates on. Produced by the |
| tokenizer; typically ~4 characters of English per token for byte-level BPE. |
|
|
| **Tokenizer.** The deterministic, reversible function that converts strings |
| to integer ID sequences and back. Decisions made here (vocab size, BPE |
| merges) propagate through the entire model. |
|
|
| **BPE (Byte-Pair Encoding).** A subword tokenization algorithm that |
| iteratively merges the most frequent adjacent pairs of symbols into new |
| vocabulary entries. |
|
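The merge loop described above can be sketched in a few lines. This is a toy trainer, not the tokenizer shipped with the model: `train_bpe` is a made-up name, it works on raw UTF-8 bytes with no whitespace pre-tokenization or special tokens, and it recounts all pairs on every iteration, which real implementations avoid.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int):
    """Learn up to `num_merges` merge rules from `text`, starting from bytes."""
    # Each symbol is a bytes object; initially one symbol per byte.
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        merges.append((left, right))
        # Rewrite the sequence, replacing each (left, right) with one symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols


merges, symbols = train_bpe("abababab", num_merges=5)
print(merges)   # [(b'a', b'b'), (b'ab', b'ab')]
print(symbols)  # [b'abab', b'abab']
```

Joining the resulting symbols always gives back the original bytes, which is what makes the tokenizer reversible.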
|
| **Logits.** The raw, unnormalized scores the model outputs, one per |
| vocabulary token at each position. They become a probability distribution after |
| softmax. |
|
|
| **Softmax.** Function that converts logits to probabilities by exponentiating |
| and normalizing. |
|
|
| **Cross-entropy loss.** The training objective: how surprised the model is |
| by the correct next token. Equals 0 if the model assigned probability 1 to |
| the right answer; equals `ln(vocab_size)` if the model is uniformly |
| uninformed. |
|
|
| **Perplexity.** `exp(loss)`. The "effective number of choices" the model is |
| deciding between. Useful because it has a more intuitive scale than loss. |
|
|
| **Decoder-only / autoregressive.** The model only attends to past tokens |
| (causal mask), and generates one token at a time conditioned on what it has |
| already produced. |
|
|
| **Self-attention.** The mechanism by which each position computes a |
| weighted combination of all (allowed) other positions, where the weights |
| depend on the content at each position. |
|
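As a sketch of that definition, the function below computes single-head causal attention on plain Python lists. It is for intuition only: the model itself uses `F.scaled_dot_product_attention` on tensors, and `causal_attention` is a hypothetical name. In the real model `q`, `k`, and `v` come from learned linear projections of the input; here the raw input is reused for all three to keep the example short.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention over lists of equal-length vectors."""
    dim = len(q[0])
    out = []
    for t in range(len(q)):
        # Position t may only look at positions 0..t (the causal mask).
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Softmax over the allowed positions.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the weighted average of the visible value vectors.
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out


# Toy input: 3 positions, 2 dimensions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(x, x, x)
print(y[0])  # [1.0, 0.0]: position 0 can only see itself
```

Changing a later position never alters earlier outputs, which is the property that lets the model be trained on all positions of a sequence in parallel.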
|
| **Multi-head attention.** Self-attention computed in parallel across `n` |
| subspaces ("heads"), each with `d_model / n` dimensions. Different heads |
| empirically learn to specialize. |
|
|
| **KV cache.** At inference time, the Key and Value tensors from previous |
| tokens can be cached and reused, avoiding redundant computation. Critical |
| for production serving; not implemented in this model. |
|
|
| **Pre-LayerNorm.** Applying LayerNorm *before* the attention/MLP sublayers, |
| not after. Stabilizes training of deep transformers. |
|
|
| **Weight tying.** Sharing parameters between the input embedding matrix and |
| the output projection matrix. Saves memory; usually improves quality. |
|
|
| **Cosine learning-rate schedule.** Learning rate ramps up linearly during |
| warmup, then decays following a cosine curve. Standard for transformer |
| training. |
|
|
| **Gradient clipping.** Capping the global L2 norm of gradients during |
| backpropagation to prevent destabilizing weight updates. |
|
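A minimal sketch of global-norm clipping on plain lists, mirroring the formula PyTorch's `torch.nn.utils.clip_grad_norm_` uses (a scale factor of `max_norm / (total_norm + 1e-6)`, capped at 1). The function name `clip_global_norm` and the toy gradients are illustrative only.

```python
import math


def clip_global_norm(grads, max_norm=1.0):
    """Scale all gradients so their combined L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


# Two toy "parameter tensors" whose combined gradient norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_global_norm(grads, max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks.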
|
| **MPS (Metal Performance Shaders).** Apple's GPU acceleration backend for |
| PyTorch on M-series chips. The Apple Silicon equivalent of CUDA. |
|
|
| **Pre-training.** The stage of training described here: minimize next-token |
| prediction loss on a large corpus. Produces a *base model*. |
|
|
| **SFT (Supervised Fine-Tuning).** A subsequent training stage on |
| `(instruction, ideal response)` pairs. Teaches the model to follow |
| instructions. Not done for this model. |
|
|
| **RLHF (Reinforcement Learning from Human Feedback).** A further training |
| stage using preference data. Aligns model behavior with human preferences. |
| Not done for this model. |
|
|
| --- |
|
|
| ## Citation |
|
|
| If this model or its companion code helped you, please cite or link to: |
|
|
| ``` |
| @misc{microgpt, |
| author = {Brett Lee Hary}, |
| title = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories}, |
| year = {2026}, |
| howpublished = {\url{https://huggingface.co/brettleehari/microgpt}}, |
| } |
| ``` |
|
|
| ### Acknowledgements |
|
|
| - Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT): the |
| reference implementation that made this approachable. |
| - Eldan & Li (2023), [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759): the dataset and the insight that data quality can substitute for model scale. |
| - Vaswani et al. (2017), [Attention Is All You Need](https://arxiv.org/abs/1706.03762): the original transformer. |
| - The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for |
| the infrastructure that makes projects like this trivial to share. |
|
|