---
license: mit
language:
- en
tags:
- text-generation
- transformer
- educational
- tiny-llm
- from-scratch
- decoder-only
- gpt
datasets:
- roneneldan/TinyStories
pipeline_tag: text-generation
library_name: pytorch
model-index:
- name: microgpt
  results:
  - task:
      type: text-generation
      name: Story completion
    dataset:
      name: TinyStories (validation split)
      type: roneneldan/TinyStories
    metrics:
    - type: cross-entropy
      value: 2.25
      name: Validation cross-entropy loss
    - type: perplexity
      value: 9.49
      name: Validation perplexity
---

# microGPT

A **1.35M-parameter decoder-only transformer** trained from scratch on the
[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.
The entire training run took roughly 1.5 hours on an Apple Silicon laptop.
At ~130,000× smaller than GPT-3 (175B parameters), it can still produce
coherent, simple children's stories.

This is an **educational artifact**, not a production model. Its purpose is
to make every component of a modern LLM legible, debuggable, and rebuildable
on consumer hardware.

---

## Quick facts

| | |
|---|---|
| **Architecture** | Decoder-only transformer (GPT-style) |
| **Parameters** | 1,345,792 trainable (1.35M) |
| **File size on disk** | ~5.1 MB (float32) |
| **Training data** | ~327M tokens seen, drawn from the ~470M-token TinyStories train split |
| **Training compute** | ~1.5 hours on Apple Silicon (MPS) |
| **Final val loss** | 2.25 (perplexity 9.49) |
| **Context window** | 256 tokens |
| **Tokenizer** | Byte-level BPE, vocab=4096 |
| **License** | MIT |

---

## Architecture in detail

```
Input tokens (B, T)
    │
    ├─► Token Embedding   (4096 → 128)
    │                          │
    └─► Position Embedding ────┘ ← element-wise sum
            │
            ▼  (B, T, 128)
   ┌──── Block × 4 ─────────────────────────────────────────────────────┐
   │                                                                    │
   │   x = x + CausalSelfAttention(LayerNorm(x))   ← 4 heads            │
   │   x = x + MLP(LayerNorm(x))                   ← 128→512→128, GELU  │
   │                                                                    │
   └────────────────────────────────────────────────────────────────────┘
            │
            ▼  (B, T, 128)
        LayerNorm
            │
            ▼
   Linear (128 → 4096)   ← weight-tied with token embedding
            │
            ▼  (B, T, 4096)
        Logits
```

| Hyperparameter | Value | Notes |
|---|---|---|
| `n_layers` | 4 | Stacked transformer blocks |
| `d_model` | 128 | Hidden dimension |
| `n_heads` | 4 | Each head is 128/4 = 32 dim |
| `head_dim` | 32 | Per-head dimensionality |
| `ffn_dim` | 512 | MLP intermediate width (4×d_model) |
| `ctx_len` | 256 | Maximum input length in tokens |
| `vocab_size` | 4,096 | BPE-derived vocabulary |
| Normalization | LayerNorm | Pre-LN (applied before sublayers) |
| Position encoding | Learned | Absolute, additive |
| Activation | GELU | In the MLP |
| Attention | Multi-head, causal | Implemented via `F.scaled_dot_product_attention` |
| Embedding tying | Yes | Output projection shares weight with `tok_emb` |
| Bias on linear layers | No | Following common modern practice |
| Dropout | 0.1 (training) | 0.0 at inference |
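
As a rough illustration of how the pieces in the table fit together, the sketch below implements one pre-LN block in PyTorch. The class and attribute names (`Block`, `qkv`, `proj`, `fc1`, `fc2`) are assumptions chosen to mirror the parameter names used in this card, not the released source, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-LN transformer block (hypothetical names, sizes from the table)."""

    def __init__(self, d_model=128, n_heads=4, ffn_dim=512):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.ln2 = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, ffn_dim, bias=False)
        self.fc2 = nn.Linear(ffn_dim, d_model, bias=False)

    def attend(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (
            t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
            for t in (q, k, v)
        )
        # is_causal=True masks out attention to future positions
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge heads back: (B, n_heads, T, head_dim) -> (B, T, C)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

    def forward(self, x):
        # Pre-LN: normalize the sublayer input, add the result to the raw residual
        x = x + self.attend(self.ln1(x))
        x = x + self.fc2(F.gelu(self.fc1(self.ln2(x))))
        return x


block = Block()
x = torch.randn(2, 16, 128)
print(block(x).shape)                              # torch.Size([2, 16, 128])
print(sum(p.numel() for p in block.parameters()))  # 197120
```

The detail worth noticing is that LayerNorm is applied to the input of each sublayer, while the residual stream `x` itself is never overwritten by a normalized copy.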

### Parameter breakdown — where the 1.35M live

| Component | Shape | Params | % |
|---|---|---|---|
| Token embeddings (`tok_emb.weight`) | (4096, 128) | 524,288 | 39.0% |
| Position embeddings (`pos_emb.weight`) | (256, 128) | 32,768 | 2.4% |
| 4 × transformer block | — | 788,480 | 58.6% |
|     └─ Per block: `ln1` (γ, β) | (128,) × 2 | 256 | |
|     └─ Per block: `attn.qkv` | (384, 128) | 49,152 | |
|     └─ Per block: `attn.proj` | (128, 128) | 16,384 | |
|     └─ Per block: `ln2` (γ, β) | (128,) × 2 | 256 | |
|     └─ Per block: `mlp.fc1` | (512, 128) | 65,536 | |
|     └─ Per block: `mlp.fc2` | (128, 512) | 65,536 | |
| Final LayerNorm (`ln_f`) | (128,) × 2 | 256 | 0.02% |
| Output projection (`head.weight`) | (4096, 128) | 0 | tied |
| **Total** | | **1,345,792** | |

Two observations worth absorbing:

- **Embeddings are 41% of total parameters** at this scale. This is typical of small models, where the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
- **MLPs (`fc1` + `fc2`) account for about two-thirds of every block's params**: 131,072 of 197,120 = 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored. At frontier scale this stays roughly true.
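
The totals in the table can be reproduced with a few lines of arithmetic. This is only a sanity check using the hyperparameters listed above; the variable names are illustrative.

```python
d_model, vocab_size, ctx_len, ffn_dim, n_layers = 128, 4096, 256, 512, 4

tok_emb = vocab_size * d_model                    # shared with the output head (tied)
pos_emb = ctx_len * d_model
layer_norm = 2 * d_model                          # gamma and beta
attn = 3 * d_model * d_model + d_model * d_model  # qkv + proj, no biases
mlp = 2 * d_model * ffn_dim                       # fc1 + fc2, no biases
per_block = 2 * layer_norm + attn + mlp
total = tok_emb + pos_emb + n_layers * per_block + layer_norm

print(per_block)                                    # 197120
print(total)                                        # 1345792
print(round(100 * (tok_emb + pos_emb) / total, 1))  # 41.4
print(round(100 * mlp / per_block, 1))              # 66.5
```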

---

## Training

### Data

- **Dataset:** [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) (Eldan & Li, 2023)
- **Stories:** ~2.1M (train) + ~22K (validation)
- **Tokens (after BPE):** ~470M (train) + ~5M (validation)
- **Why TinyStories specifically:** synthetic dataset designed so vocabulary
  and grammar stay within what a 3–4 year-old understands, making coherent
  generation possible at very small model scales. Without this curation, a
  1.35M-param model on general web text produces gibberish.

### Tokenizer

- **Type:** byte-level Byte-Pair Encoding (BPE)
- **Vocabulary:** 4,096 tokens (including special tokens `<unk>`, `<eos>`)
- **Trained on:** 50,000 stories from the train split (vocab converges
  quickly; full corpus produces a near-identical tokenizer)
- **Avg compression:** ~4 characters per token on TinyStories text
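
For intuition, the BPE training loop can be sketched in plain Python. This is a toy illustration of the merge step on raw UTF-8 bytes, not the tokenizer code used for this model; the helper names `most_frequent_pair` and `merge` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Return the adjacent ID pair that occurs most often across all sequences."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0]


def merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


# Byte-level: the base vocabulary is the 256 possible byte values.
corpus = [list(s.encode("utf-8")) for s in ["the cat", "the hat", "the bat"]]
next_id = 256
for _ in range(3):  # a real run would continue until the vocab reaches 4,096
    pair = most_frequent_pair(corpus)
    corpus = [merge(seq, pair, next_id) for seq in corpus]
    next_id += 1

print([len(seq) for seq in corpus])  # [4, 4, 4], down from 7 bytes each
```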

### Optimization

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| β₁, β₂ | 0.9, 0.95 |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Linear warmup (200 steps) → cosine decay |
| Batch size (sequences) | 64 |
| Sequence length | 256 |
| Tokens per step | 16,384 |
| Total steps | 20,000 |
| Total tokens seen | ~327M |
| Gradient clipping | 1.0 (global L2 norm) |
| Random seed | 1337 |
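
The schedule row above might be implemented roughly as follows. The function name `lr_at` and the exact warmup formula (linear from near zero, reaching the peak at step 200) are assumptions; the training script may differ in such details.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 200, 20_000


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 199, 10_100, 20_000):
    print(step, f"{lr_at(step):.2e}")

# Token budget implied by the table
tokens_per_step = 64 * 256
print(tokens_per_step, tokens_per_step * TOTAL_STEPS)  # 16384 327680000
```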

### Hardware & wall-clock

| | |
|---|---|
| Hardware | Apple M-series laptop (MPS backend) |
| Precision | float32 |
| Wall-clock | ~1.5 hours |
| Peak memory | ~1.5 GB |
| Disk footprint | ~1 GB tokenized corpus + 5.1 MB checkpoint |

---

## Evaluation

### Held-out validation loss

| Step | Val loss | Perplexity |
|---|---|---|
| 0 (init) | 8.32 | 4096 |
| ~17,500 | 2.26 | 9.59 |
| ~20,000 | **2.25** | **9.49** |

For context: a uniform random predictor over the 4,096-token vocabulary
would have loss `ln(4096) = 8.32` and perplexity 4,096. A perplexity of
9.49 means the model has effectively narrowed each prediction down to
~9.5 plausible next-token choices, on average.
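
The numbers in this paragraph can be checked with a small from-scratch cross-entropy, shown here as a sketch in plain Python (the helper `cross_entropy` is illustrative, not part of the released code):

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# A model that knows nothing assigns equal logits to all 4,096 tokens.
uniform = [0.0] * 4096
loss = cross_entropy(uniform, target=7)
print(round(loss, 2), round(math.exp(loss)))  # 8.32 4096

# Perplexity is exp(loss); the final validation loss gives:
print(round(math.exp(2.25), 2))  # 9.49
```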

> **Note on missing checkpoint progression.** This release contains only the
> final checkpoint. The original plan was to capture step-tagged snapshots
> (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability *emergence*
> across training — gibberish → words → grammar → narrative. A monitoring
> script was started after training had already passed the early steps, so
> only the final checkpoint was preserved. A v0.2 release with multi-checkpoint
> snapshots and an emergence comparison is planned.

### Capability probes

Three diagnostic prompts designed to probe specific capabilities, evaluated
on the final checkpoint (step 20,000, T=0.8, top-k=40):

**Short-range coherence** — *"Lily picked up the red ball and"*
> ...put it on the floor. She threw the red ball with her hands and pulled
> it onto the floor. She tried to make it shine, but it made her feel good.
> "Oh no, no!" she shouted. "Sharing is very dangerous!"

Pronoun continuity holds; ball persists as object. ✓

**Causal reasoning** — *"Tom was very sad because"*
> ...he had no one to play. He thought that it would be fun and share his
> toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"

"because" produces a real cause of sadness; emotional consistency holds for
2–3 sentences before drifting. ✓ (with caveats)

**Narrative composition** — *"The little dragon wanted to fly but"*
> ...her mommy said no. The bear was very sad that he was gone. He wanted
> to fly anymore and get lost.

Initial obstacle is set up correctly, but the model loses track of which
character is which (dragon → bear → "he"). ✗

This pattern (local coherence holds, multi-sentence composition only partly)
is expected at this scale. A narrative arc requires tracking characters and
goals across many tokens, a capability that tends to emerge late as language
models scale up.

---

## Intended use

**In scope:**
- Educational reference for the GPT-style transformer architecture
- Demonstration of end-to-end LLM training on consumer hardware
- Generating short, simple, TinyStories-style English children's narratives
- Exploring how sampling parameters (temperature, top-k, top-p) affect output
- Comparison baseline for tiny-model research

**Out of scope:**
- General-purpose text generation (vocabulary is restricted to TinyStories)
- Question answering, instruction following, or chat (no SFT or RLHF stage)
- Anything requiring factual accuracy (no factual grounding)
- Non-English text (English-only training data)
- Long-form generation (256-token context window)

---

## Limitations and biases

- **Distribution lock-in:** Trained exclusively on synthetic children's
  stories. Generation outside this distribution (e.g., technical text,
  adult themes, dialogue formats) will be incoherent.
- **No instruction following:** This is a base model — pre-training only.
  It completes text; it does not answer questions or follow instructions.
- **Hallucination:** No factual grounding. The model has no concept of
  "I don't know" — it produces the most statistically plausible
  continuation, which is often false outside the training distribution.
- **Context window:** 256 tokens is too short to model long dependencies.
- **Synthetic data biases:** TinyStories was generated by GPT-3.5/4 with
  prompted constraints, so it inherits some of that generator's stylistic
  patterns and any biases encoded therein.
- **No safety training:** No RLHF, no Constitutional AI, no content
  filtering. While the training data is innocuous, prompts that
  push toward harmful outputs receive no safeguards.
- **Memorization vs generalization:** Some completions ("She was very
  happy and they played all day") are likely memorized stylistic
  patterns rather than novel generation.

---

## How to use

### Inference

```python
from inference import NanoSLMInference

slm = NanoSLMInference("ckpt.pt", "tokenizer.json")

text = slm.generate(
    "Once upon a time, there was a little",
    max_new_tokens=200,
    temperature=0.8,
    top_k=40,
)
print(text)
```

### Sampling parameters

| Parameter | Effect |
|---|---|
| `temperature` | Divides logits by this value before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0. |
| `top_k` | Keep only the *k* highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. |
| `top_p` (nucleus) | Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. |
| `seed` | Sets PyTorch RNG for reproducibility. |
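
A minimal sketch of how these parameters could interact for one decoding step, in plain Python. The function `sample_next` is hypothetical, and the order of operations (temperature, then top-k, then top-p) is one common choice; the released `generate` method may differ.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=None, top_p=None, rng=random):
    """Pick one token id from raw logits using temperature, top-k and top-p."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature: divide logits, then apply a numerically stable softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    weights = [math.exp(z - m) for z in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Candidate token ids, most probable first
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the surviving weights internally
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next(logits, temperature=0))             # 0 (greedy)
print(sample_next(logits, top_k=2, rng=random.Random(1337)))
```

Passing a seeded `random.Random` instance plays the role of the `seed` parameter: the same seed and logits always yield the same token.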

---

## How this model is served

A live demo is hosted on [Hugging Face Spaces](https://huggingface.co/spaces/brettleehari/microgpt-demo).
The serving stack is intentionally minimal:

```
User browser
    ↓ HTTPS
HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
    ↓
Gradio + FastAPI/uvicorn
    ↓
PyTorch eager-mode forward pass on CPU
    ↓
Autoregressive token generation, one token per pass
```

Approximate latency for 100 generated tokens: **~3 seconds on Spaces' free
CPU**, **~0.5 seconds on Apple M-series with MPS**.

What this serving setup deliberately does *not* implement (each is a separate
upgrade and a useful learning exercise):

- **KV-caching** — every generation step re-processes all prior tokens.
  A real implementation caches K/V tensors and pays only for the new token.
- **Continuous batching** — multiple users would queue serially. Production
  servers (vLLM, TGI) batch concurrent requests dynamically.
- **Quantization**: weights are float32. int8 would shrink memory ~4×, int4 ~8×.
- **Compiled graphs** — eager-mode PyTorch leaves performance on the table
  vs `torch.compile()`, ONNX Runtime, or a dedicated engine.

For a model this small the overheads don't matter. At any production scale,
*every one of the above becomes critical to unit economics*.
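
To make the KV-caching point concrete, the sketch below counts how many token positions are pushed through the network with and without a cache. The function `positions_processed` is a hypothetical helper, and the count ignores the per-step cost of attending over cached keys and values, so it illustrates the trend only.

```python
def positions_processed(prompt_len, new_tokens, kv_cache):
    """Total token positions fed through the model while generating."""
    if kv_cache:
        # The prompt is processed once; each later step feeds only the newest token.
        return prompt_len + (new_tokens - 1)
    # No cache: step i re-runs the full sequence of length prompt_len + i.
    return sum(prompt_len + i for i in range(new_tokens))


# A 10-token prompt and 100 generated tokens (fits the 256-token context)
print(positions_processed(10, 100, kv_cache=False))  # 5950
print(positions_processed(10, 100, kv_cache=True))   # 109
```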

---

## Comparison with frontier models

The architecture belongs to the same decoder-only family as GPT-2/3, Llama,
and Mistral, and presumably other frontier LLMs such as Claude, whose
architecture is not published. The differences below are evolutionary
refinements, not categorical changes: the core "decoder-only transformer
trained with next-token prediction" recipe is the same.

| | microGPT (this) | Llama 3 70B |
|---|---|---|
| Parameters | 1.35M | 70B (~52,000× larger) |
| Layers | 4 | 80 |
| `d_model` | 128 | 8,192 |
| Heads | 4 (multi-head) | 64 (grouped-query attention) |
| Context | 256 | 8,192 (128K in Llama 3.1) |
| Vocab | 4,096 | 128,256 |
| Position | Learned absolute | Rotary (RoPE) |
| Activation | GELU | SwiGLU |
| Normalization | LayerNorm | RMSNorm |
| Training tokens | ~327M | ~15T (~46,000× more) |
| Training compute | ~1.5 hours on one laptop (well under 1 kWh) | many MW-months on H100 clusters |

---

## Glossary

A short reference for the terminology used above. Worth absorbing — these
terms come up constantly in AI literature and interviews.

**Parameter / weight.** A single learnable number stored in the model.
Updated during training, read during inference. A "1.35M parameter model"
literally has 1.35M of these numbers.

**Embedding.** A learned vector representation of a discrete object (token,
position). Implemented as a lookup table.

**Token.** The atomic unit of text the model operates on. Produced by the
tokenizer; typically ~4 characters of English per token for byte-level BPE.

**Tokenizer.** The deterministic, reversible function that converts strings
to integer ID sequences and back. Decisions made here (vocab size, BPE
merges) propagate through the entire model.

**BPE (Byte-Pair Encoding).** A subword tokenization algorithm that
iteratively merges the most frequent adjacent pairs of symbols into new
vocabulary entries.

**Logits.** The raw, unnormalized scores the model outputs — one per
vocabulary token at each position. Becomes a probability distribution after
softmax.

**Softmax.** Function that converts logits to probabilities by exponentiating
and normalizing.

**Cross-entropy loss.** The training objective: how surprised the model is
by the correct next token. Equals 0 if the model assigned probability 1 to
the right answer; equals `ln(vocab_size)` if the model is uniformly
uninformed.

**Perplexity.** `exp(loss)`. The "effective number of choices" the model is
deciding between. Useful because it has a more intuitive scale than loss.

**Decoder-only / autoregressive.** The model only attends to past tokens
(causal mask), and generates one token at a time conditioned on what it has
already produced.

**Self-attention.** The mechanism by which each position computes a
weighted combination of all (allowed) other positions, where the weights
depend on the content at each position.

**Multi-head attention.** Self-attention computed in parallel across `n`
subspaces ("heads"), each with `d_model / n` dimensions. Different heads
empirically learn to specialize.

**KV cache.** At inference time, the Key and Value tensors from previous
tokens can be cached and reused, avoiding redundant computation. Critical
for production serving; not implemented in this model.

**Pre-LayerNorm.** Applying LayerNorm *before* the attention/MLP sublayers,
not after. Stabilizes training of deep transformers.

**Weight tying.** Sharing parameters between the input embedding matrix and
the output projection matrix. Saves memory; usually improves quality.

**Cosine learning-rate schedule.** Learning rate ramps up linearly during
warmup, then decays following a cosine curve. Standard for transformer
training.

**Gradient clipping.** Capping the global L2 norm of gradients during
backpropagation to prevent destabilizing weight updates.

**MPS (Metal Performance Shaders).** Apple's GPU compute framework. PyTorch's
`mps` backend uses it to run on M-series GPUs, making it the rough Apple
Silicon counterpart to CUDA.

**Pre-training.** The stage of training described here: minimize next-token
prediction loss on a large corpus. Produces a *base model*.

**SFT (Supervised Fine-Tuning).** A subsequent training stage on
`(instruction, ideal response)` pairs. Teaches the model to follow
instructions. Not done for this model.

**RLHF (Reinforcement Learning from Human Feedback).** A further training
stage using preference data. Aligns model behavior with human preferences.
Not done for this model.

---

## Citation

If this model or its companion code helped you, please cite or link to:

```
@misc{microgpt,
  author = {Brett Lee Hary},
  title  = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
}
```

### Acknowledgements

- Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT) — the
  reference implementation that made this approachable.
- Eldan & Li (2023), [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) — the dataset and the insight that data quality can substitute for model scale.
- Vaswani et al. (2017), [Attention Is All You Need](https://arxiv.org/abs/1706.03762) — the original transformer.
- The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for
  the infrastructure that makes projects like this trivial to share.