microGPT

A 1.35M-parameter decoder-only transformer trained from scratch on the TinyStories dataset. The entire training run took roughly 1.5 hours on an Apple Silicon laptop. At roughly 130,000× smaller than GPT-3 (175B), it can still produce coherent, simple children's stories.

This is an educational artifact, not a production model. Its purpose is to make every component of a modern LLM legible, debuggable, and rebuildable on consumer hardware.


Quick facts

Architecture Decoder-only transformer (GPT-style)
Parameters 1,345,792 trainable (1.35M)
File size on disk ~5.1 MB (float32)
Training data ~470M tokens of TinyStories
Training compute ~1.5 hours on Apple Silicon (MPS)
Final val loss 2.25 (perplexity 9.49)
Context window 256 tokens
Tokenizer Byte-level BPE, vocab=4096
License MIT

Architecture in detail

Input tokens (B, T)
    │
    ├─► Token Embedding    (4096 → 128)
    │                           │
    └─► Position Embedding ─────┘  ← element-wise sum
            │
            ▼  (B, T, 128)
   ┌──── Block × 4 ──────────────────────────────────┐
   │                                                 │
   │   x = x + CausalSelfAttention(LayerNorm(x))     │  ← 4 heads
   │   x = x + MLP(LayerNorm(x))                     │  ← 128→512→128, GELU
   │                                                 │
   └─────────────────────────────────────────────────┘
            │
            ▼  (B, T, 128)
        LayerNorm
            │
            ▼
   Linear (128 → 4096)   ← weight-tied with token embedding
            │
            ▼  (B, T, 4096)
        Logits
Hyperparameter      Value                Notes
n_layers            4                    Stacked transformer blocks
d_model             128                  Hidden dimension
n_heads             4                    Each head is 128/4 = 32 dims
head_dim            32                   Per-head dimensionality
ffn_dim             512                  MLP intermediate width (4 × d_model)
ctx_len             256                  Maximum input length in tokens
vocab_size          4,096                BPE-derived vocabulary
Normalization       LayerNorm            Pre-LN (applied before sublayers)
Position encoding   Learned              Absolute, additive
Activation          GELU                 In the MLP
Attention           Multi-head, causal   Implemented via F.scaled_dot_product_attention
Bias on linears     No                   Following common modern practice
Embedding tying     Yes                  Output projection shares weight with tok_emb
Dropout             0.1 (training)       0.0 at inference
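
The block in the diagram maps to PyTorch almost line for line. Below is a minimal sketch of one block, assuming PyTorch ≥ 2.0 for F.scaled_dot_product_attention; class and method names are illustrative, not necessarily the repo's actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One pre-LN transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, d_model=128, n_heads=4, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # fused Q, K, V: (384, 128)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.ln2 = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, 4 * d_model, bias=False)  # 128 -> 512
        self.fc2 = nn.Linear(4 * d_model, d_model, bias=False)  # 512 -> 128
        self.drop = nn.Dropout(dropout)

    def attn(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)
        # split channels into 4 heads of 32 dims each: (B, n_heads, T, head_dim)
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))

    def mlp(self, x):
        return self.drop(self.fc2(F.gelu(self.fc1(x))))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-LN: normalize before each sublayer
        x = x + self.mlp(self.ln2(x))
        return x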

Parameter breakdown: where the 1.35M parameters live

Component                              Shape         Params     %
Token embeddings (tok_emb.weight)      (4096, 128)   524,288    38.9%
Position embeddings (pos_emb.weight)   (256, 128)    32,768     2.4%
4 × transformer block                                788,480    58.6%
  └─ per block: ln1 (γ, β)             (128,) × 2    256
  └─ per block: attn.qkv               (384, 128)    49,152
  └─ per block: attn.proj              (128, 128)    16,384
  └─ per block: ln2 (γ, β)             (128,) × 2    256
  └─ per block: mlp.fc1                (512, 128)    65,536
  └─ per block: mlp.fc2                (128, 512)    65,536
Final LayerNorm (ln_f)                 (128,) × 2    256        0.02%
Output projection (head.weight)        (4096, 128)   0 (tied)
Total                                                1,345,792

Two observations worth absorbing:

  • Embeddings are 41% of total parameters at this scale. This is typical of small models: the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
  • MLPs (fc1 + fc2) account for two-thirds of every block's params: 131,072 of 197,120 ≈ 66% (both counts are verified in the snippet below). Recent interpretability research suggests MLPs are where most factual knowledge gets stored. At frontier scale this stays roughly true.
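
These counts are easy to verify by hand. A quick sanity check in plain Python, using only the shapes from the table above:

d, v, ctx, n_layers = 128, 4096, 256, 4

tok_emb = v * d                    # 524,288
pos_emb = ctx * d                  #  32,768
per_block = (2 * d                 # ln1 (gamma + beta)
             + 3 * d * d           # attn.qkv  (384, 128)
             + d * d               # attn.proj (128, 128)
             + 2 * d               # ln2
             + d * 4 * d           # mlp.fc1   (512, 128)
             + 4 * d * d)          # mlp.fc2   (128, 512)
ln_f = 2 * d                       # final LayerNorm
head = 0                           # output projection is weight-tied

total = tok_emb + pos_emb + n_layers * per_block + ln_f + head
print(per_block, total)            # 197120 1345792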

Training

Data

  • Dataset: roneneldan/TinyStories (Eldan & Li, 2023)
  • Stories: ~2.1M (train) + ~22K (validation)
  • Tokens (after BPE): ~470M (train) + ~5M (validation)
  • Why TinyStories specifically: a synthetic dataset designed so vocabulary and grammar stay within what a 3–4 year-old understands, making coherent generation possible at very small model scales. Without this curation, a 1.35M-param model on general web text produces gibberish.
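
For reference, the corpus is a standard Hugging Face dataset and loads in two lines. A sketch, assuming the datasets library is installed and the split/column names of the published dataset:

from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")   # downloads train + validation splits
print(len(ds["train"]), len(ds["validation"]))
print(ds["train"][0]["text"][:200])           # first 200 characters of story #0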

Tokenizer

  • Type: byte-level Byte-Pair Encoding (BPE)
  • Vocabulary: 4,096 tokens (including special tokens <unk>, <eos>)
  • Trained on: 50,000 stories from the train split (vocab converges quickly; full corpus produces a near-identical tokenizer)
  • Avg compression: ~4 characters per token on TinyStories text
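
One way to reproduce a comparable tokenizer is with the Hugging Face tokenizers library. This is a sketch, not the repo's actual training script; stories is assumed to be an iterator over the first 50,000 training stories:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    stories,                              # assumed: iterator of raw story strings
    vocab_size=4096,
    special_tokens=["<unk>", "<eos>"],
)
tokenizer.save("tokenizer.json")

ids = tokenizer.encode("Once upon a time").ids
print(ids, "->", tokenizer.decode(ids))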

Optimization

Hyperparameter Value
Optimizer AdamW
β₁, β₂ 0.9, 0.95
Weight decay 0.1
Peak learning rate 3e-4
Min learning rate 3e-5
Schedule Linear warmup (200 steps) → cosine decay
Batch size (sequences) 64
Sequence length 256
Tokens per step 16,384
Total steps 20,000
Total tokens seen ~327M
Gradient clipping 1.0 (global L2 norm)
Random seed 1337
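
Warmup-then-cosine is a few lines of arithmetic. A minimal sketch of the per-step learning rate under the hyperparameters above (the repo's implementation may differ in small details):

import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=200, total_steps=20_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (total_steps - warmup)  # goes 0 -> 1
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + cosine * (max_lr - min_lr)

# Applied each step before optimizer.step(), e.g.:
#   for g in optimizer.param_groups:
#       g["lr"] = lr_at(step)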

Hardware & wall-clock

Hardware Apple M-series laptop (MPS backend)
Precision float32
Wall-clock ~1.5 hours
Peak memory ~1.5 GB
Disk footprint ~1 GB tokenized corpus + 5.1 MB checkpoint

Evaluation

Held-out validation loss

Step Val loss Perplexity
0 (init) 8.32 4096
~17,500 2.26 9.59
~20,000 2.25 9.49

For context: a uniform random predictor over the 4,096-token vocabulary would have loss ln(4096) = 8.32 and perplexity 4,096. A perplexity of 9.49 means the model has effectively narrowed each prediction down to ~9.5 plausible next-token choices, on average.
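
Both numbers follow directly from the definitions (perplexity = exp(loss); a uniform predictor's loss is ln(vocab_size)):

import math

print(math.exp(2.25))   # 9.4877...  -> final val perplexity ~9.49
print(math.log(4096))   # 8.3178...  -> loss of a uniform predictor over 4,096 tokens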

Note on missing checkpoint progression. This release contains only the final checkpoint. The original plan was to capture step-tagged snapshots (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability emergence across training (gibberish → words → grammar → narrative). A monitoring script was started after training had already passed the early steps, so only the final checkpoint was preserved. A v0.2 release with multi-checkpoint snapshots and an emergence comparison is planned.

Capability probes

Three diagnostic prompts designed to probe specific capabilities, evaluated on the final checkpoint (step 20,000, T=0.8, top-k=40):

Short-range coherence: "Lily picked up the red ball and"

...put it on the floor. She threw the red ball with her hands and pulled it onto the floor. She tried to make it shine, but it made her feel good. "Oh no, no!" she shouted. "Sharing is very dangerous!"

Pronoun continuity holds; the ball persists as the object. ✓

Causal reasoning: "Tom was very sad because"

...he had no one to play. He thought that it would be fun and share his toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"

"because" produces a real cause of sadness; emotional consistency holds for 2–3 sentences before drifting. βœ“ (with caveats)

Narrative composition: "The little dragon wanted to fly but"

...her mommy said no. The bear was very sad that he was gone. He wanted to fly anymore and get lost.

Initial obstacle is set up correctly, but the model loses track of which character is which (dragon → bear → "he"). ✗

This pattern (local coherence ✓, multi-sentence composition only partial) is expected at this scale. A narrative arc requires planning across many tokens, which is one of the last capabilities to emerge as language models scale.


Intended use

In scope:

  • Educational reference for the GPT-style transformer architecture
  • Demonstration of end-to-end LLM training on consumer hardware
  • Generating short, simple, TinyStories-style English children's narratives
  • Exploring how sampling parameters (temperature, top-k, top-p) affect output
  • Comparison baseline for tiny-model research

Out of scope:

  • General-purpose text generation (vocabulary is restricted to TinyStories)
  • Question answering, instruction following, or chat (no SFT or RLHF stage)
  • Anything requiring factual accuracy (no factual grounding)
  • Non-English text (English-only training data)
  • Long-form generation (256-token context window)

Limitations and biases

  • Distribution lock-in: Trained exclusively on synthetic children's stories. Generation outside this distribution (e.g., technical text, adult themes, dialogue formats) will be incoherent.
  • No instruction following: This is a base model (pre-training only). It completes text; it does not answer questions or follow instructions.
  • Hallucination: No factual grounding. The model has no concept of "I don't know"; it produces the most statistically plausible continuation, which is often false outside the training distribution.
  • Context window: 256 tokens is too short to model long dependencies.
  • Synthetic data biases: TinyStories was generated by GPT-3.5/4 with prompted constraints, so it inherits some of that generator's stylistic patterns and any biases encoded therein.
  • No safety training: No RLHF, no Constitutional AI, no content filtering. While the training data is innocuous, prompts that push toward harmful outputs receive no safeguards.
  • Memorization vs generalization: Some completions ("She was very happy and they played all day") are likely memorized stylistic patterns rather than novel generation.

How to use

Inference

from inference import NanoSLMInference

# Load the final checkpoint and the trained BPE tokenizer
slm = NanoSLMInference("ckpt.pt", "tokenizer.json")

text = slm.generate(
    "Once upon a time, there was a little",
    max_new_tokens=200,   # generation budget (context window is 256 tokens)
    temperature=0.8,      # see "Sampling parameters" below
    top_k=40,
)
print(text)

Sampling parameters

Parameter Effect
temperature Scales logits before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. >1 = more random. Typical: 0.7–1.0.
top_k Keep only the k highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100.
top_p (nucleus) Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95.
seed Sets PyTorch RNG for reproducibility.
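
These knobs compose in a standard order: temperature first, then top-k, then top-p, then sample. A minimal sketch of one decoding step over a (vocab_size,) logits vector (illustrative; not necessarily the repo's exact code):

import torch

def sample_next(logits, temperature=0.8, top_k=40, top_p=None):
    """Return one sampled token id from a (vocab_size,) logits vector."""
    if temperature == 0:                         # greedy decoding
        return int(torch.argmax(logits))
    logits = logits / temperature                # flatten or sharpen the distribution
    if top_k is not None:                        # drop everything below the k-th logit
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:                        # nucleus: smallest set with mass >= p
        sorted_probs, idx = torch.sort(probs, descending=True)
        # keep tokens whose preceding cumulative mass is <= p
        keep = torch.cumsum(sorted_probs, dim=-1) - sorted_probs <= top_p
        sorted_probs = sorted_probs * keep       # zero out the tail
        probs = torch.zeros_like(probs).scatter_(-1, idx, sorted_probs)
        probs = probs / probs.sum()              # renormalize
    return int(torch.multinomial(probs, num_samples=1))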

How this model is served

A live demo is hosted on Hugging Face Spaces. The serving stack is intentionally minimal:

User browser
    ↓ HTTPS
HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
    ↓
Gradio + FastAPI/uvicorn
    ↓
PyTorch eager-mode forward pass on CPU
    ↓
Autoregressive token generation, one token per pass

Approximate latency for 100 generated tokens: ~3 seconds on Spaces' free CPU, ~0.5 seconds on Apple M-series with MPS.

What this serving setup deliberately does not implement (each is a separate upgrade and a useful learning exercise):

  • KV-caching (sketched below): every generation step re-processes all prior tokens. A real implementation caches K/V tensors and pays only for the new token.
  • Continuous batching: multiple users would queue serially. Production servers (vLLM, TGI) batch concurrent requests dynamically.
  • Quantization: weights are float32. int8 would shrink memory ~4× (int4, ~8×).
  • Compiled graphs: eager-mode PyTorch leaves performance on the table vs torch.compile(), ONNX Runtime, or a dedicated engine.

For a model this small the overheads don't matter. At any production scale, every one of the above becomes critical to unit economics.
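
To make the first bullet concrete, here is what the naive decoding loop looks like, with a comment marking where a KV cache would slot in (a sketch; model and the cache interface are hypothetical, not this repo's API):

import torch

@torch.no_grad()
def generate_naive(model, ids, n_new):
    """Naive decoding: re-run the full prefix for every new token."""
    for _ in range(n_new):
        logits = model(ids)               # (B, T, vocab): whole prefix recomputed
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

# With a KV cache, each step would feed only the newest token and reuse the
# K/V tensors already computed for the prefix (hypothetical interface):
#   logits, cache = model(next_id, past_kv=cache)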


Comparison with frontier models

The architecture is structurally identical to GPT-2/3, Llama, Mistral, and Claude. The differences below are evolutionary refinements, not categorical changes; the core "decoder-only transformer trained with next-token prediction" recipe is the same.

                     microGPT (this)        Llama 3.1 70B
Parameters           1.35M                  70B (~52,000× larger)
Layers               4                      80
d_model              128                    8,192
Heads                4 (multi-head)         64 (grouped-query attention)
Context              256                    128,000
Vocab                4,096                  128,256
Position             Learned absolute       Rotary (RoPE)
Activation           GELU                   SwiGLU
Normalization        LayerNorm              RMSNorm
Training tokens      ~327M                  15T (~46,000× more)
Training compute     ~0.1 kWh on a laptop   many MW-months on H100 clusters

Glossary

A short reference for the terminology used above. Worth absorbing: these terms come up constantly in AI literature and interviews.

Parameter / weight. A single learnable number stored in the model. Updated during training, read during inference. A "1.35M parameter model" literally has 1.35M of these numbers.

Embedding. A learned vector representation of a discrete object (token, position). Implemented as a lookup table.

Token. The atomic unit of text the model operates on. Produced by the tokenizer; typically ~4 characters of English per token for byte-level BPE.

Tokenizer. The deterministic, reversible function that converts strings to integer ID sequences and back. Decisions made here (vocab size, BPE merges) propagate through the entire model.

BPE (Byte-Pair Encoding). A subword tokenization algorithm that iteratively merges the most frequent adjacent pairs of symbols into new vocabulary entries.

Logits. The raw, unnormalized scores the model outputs, one per vocabulary token at each position. They become a probability distribution after softmax.

Softmax. Function that converts logits to probabilities by exponentiating and normalizing.

Cross-entropy loss. The training objective: how surprised the model is by the correct next token. Equals 0 if the model assigned probability 1 to the right answer; equals ln(vocab_size) if the model is uniformly uninformed.

Perplexity. exp(loss). The "effective number of choices" the model is deciding between. Useful because it has a more intuitive scale than loss.

Decoder-only / autoregressive. The model only attends to past tokens (causal mask), and generates one token at a time conditioned on what it has already produced.

Self-attention. The mechanism by which each position computes a weighted combination of all (allowed) other positions, where the weights depend on the content at each position.

Multi-head attention. Self-attention computed in parallel across n subspaces ("heads"), each with d_model / n dimensions. Different heads empirically learn to specialize.

KV cache. At inference time, the Key and Value tensors from previous tokens can be cached and reused, avoiding redundant computation. Critical for production serving; not implemented in this model.

Pre-LayerNorm. Applying LayerNorm before the attention/MLP sublayers, not after. Stabilizes training of deep transformers.

Weight tying. Sharing parameters between the input embedding matrix and the output projection matrix. Saves memory; usually improves quality.

Cosine learning-rate schedule. Learning rate ramps up linearly during warmup, then decays following a cosine curve. Standard for transformer training.

Gradient clipping. Capping the global L2 norm of gradients during backpropagation to prevent destabilizing weight updates.

MPS (Metal Performance Shaders). Apple's GPU acceleration backend for PyTorch on M-series chips. The Apple Silicon equivalent of CUDA.

Pre-training. The stage of training described here: minimize next-token prediction loss on a large corpus. Produces a base model.

SFT (Supervised Fine-Tuning). A subsequent training stage on (instruction, ideal response) pairs. Teaches the model to follow instructions. Not done for this model.

RLHF (Reinforcement Learning from Human Feedback). A further training stage using preference data. Aligns model behavior with human preferences. Not done for this model.


Citation

If this model or its companion code helped you, please cite or link to:

@misc{microgpt,
  author = {Brett Lee Hary},
  title  = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
}
