	---
	license: mit
	language:
	- en
	tags:
	- text-generation
	- transformer
	- educational
	- tiny-llm
	- from-scratch
	- decoder-only
	- gpt
	datasets:
	- roneneldan/TinyStories
	pipeline_tag: text-generation
	library_name: pytorch
	model-index:
	- name: microgpt
	  results:
	  - task:
	      type: text-generation
	      name: Story completion
	    dataset:
	      name: TinyStories (validation split)
	      type: roneneldan/TinyStories
	    metrics:
	    - type: cross-entropy
	      value: 2.25
	      name: Validation cross-entropy loss
	    - type: perplexity
	      value: 9.49
	      name: Validation perplexity
	---

	# microGPT

	A 1.35M-parameter decoder-only transformer trained from scratch on the
	[TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.
	The entire training run took roughly 1.5 hours on an Apple Silicon laptop.
	At ~130,000× smaller than GPT-3 (175B parameters), it can still produce
	coherent simple children's stories.

	This is an educational artifact, not a production model. Its purpose is
	to make every component of a modern LLM legible, debuggable, and rebuildable
	on consumer hardware.

	---

	## Quick facts

	\| \| \|
	\|---\|---\|
	\| Architecture \| Decoder-only transformer (GPT-style) \|
	\| Parameters \| 1,345,792 trainable (1.35M) \|
	\| File size on disk \| ~5.1 MB (float32) \|
	\| Training data \| TinyStories (~470M-token train split; ~327M tokens seen during training) \|
	\| Training compute \| ~1.5 hours on Apple Silicon (MPS) \|
	\| Final val loss \| 2.25 (perplexity 9.49) \|
	\| Context window \| 256 tokens \|
	\| Tokenizer \| Byte-level BPE, vocab=4096 \|
	\| License \| MIT \|

	---

	## Architecture in detail

	```
	Input tokens (B, T)
	  │
	  ├─► Token Embedding (4096 → 128) ────┐
	  │                                    │
	  └─► Position Embedding ──────────────┘  ← element-wise sum
	  │
	  ▼  (B, T, 128)
	┌──── Block × 4 ──────────────────────────────────────────┐
	│                                                         │
	│  x = x + CausalSelfAttention(LayerNorm(x))   ← 4 heads  │
	│  x = x + MLP(LayerNorm(x))   ← 128→512→128, GELU        │
	│                                                         │
	└─────────────────────────────────────────────────────────┘
	  │
	  ▼  (B, T, 128)
	LayerNorm
	  │
	  ▼
	Linear (128 → 4096)  ← weight-tied with token embedding
	  │
	  ▼  (B, T, 4096)
	Logits
	```

	\| Hyperparameter \| Value \| Notes \|
	\|---\|---\|---\|
	\| `n_layers` \| 4 \| Stacked transformer blocks \|
	\| `d_model` \| 128 \| Hidden dimension \|
	\| `n_heads` \| 4 \| Each head is 128/4 = 32 dim \|
	\| `head_dim` \| 32 \| Per-head dimensionality \|
	\| `ffn_dim` \| 512 \| MLP intermediate width (4×d_model) \|
	\| `ctx_len` \| 256 \| Maximum input length in tokens \|
	\| `vocab_size` \| 4,096 \| BPE-derived vocabulary \|
	\| Normalization \| LayerNorm \| Pre-LN (applied before sublayers) \|
	\| Position encoding \| Learned \| Absolute, additive \|
	\| Activation \| GELU \| In the MLP \|
	\| Attention \| Multi-head, causal \| Implemented via `F.scaled_dot_product_attention` \|
	\| Embedding tying \| Yes \| Output projection shares weight with `tok_emb` \|
	\| Bias on linear layers \| No \| Following common modern practice \|
	\| Dropout \| 0.1 (training) \| 0.0 at inference \|
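
The diagram and table above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the released training code: the class names `Block` and `MicroGPTSketch` are hypothetical, the attribute names only loosely follow the parameter names listed in this card, and dropout is omitted for brevity. With biases disabled on the linear layers and the output head tied to `tok_emb`, the parameter count should come out to 1,345,792.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One pre-LN transformer block (hypothetical sketch, dropout omitted)."""

    def __init__(self, d_model=128, n_heads=4, ffn_dim=512):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.ln2 = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, ffn_dim, bias=False)
        self.fc2 = nn.Linear(ffn_dim, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        # Pre-LN: normalize the input of each sublayer, keep the residual path raw.
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_heads, T, head_dim)
        q, k, v = (
            t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
            for t in (q, k, v)
        )
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).reshape(B, T, C)
        x = x + self.proj(y)
        x = x + self.fc2(F.gelu(self.fc1(self.ln2(x))))
        return x


class MicroGPTSketch(nn.Module):
    def __init__(self, vocab_size=4096, ctx_len=256, d_model=128,
                 n_layers=4, n_heads=4, ffn_dim=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(ctx_len, d_model)
        self.blocks = nn.ModuleList(
            [Block(d_model, n_heads, ffn_dim) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)  # element-wise sum
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # (B, T, vocab_size) logits


model = MicroGPTSketch()
# parameters() yields the tied weight only once, so it is not double-counted.
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.zeros(2, 16, dtype=torch.long))
print(n_params, tuple(logits.shape))
```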

	### Parameter breakdown — where the 1.35M live

	\| Component \| Shape \| Params \| % \|
	\|---\|---\|---\|---\|
	\| Token embeddings (`tok_emb.weight`) \| (4096, 128) \| 524,288 \| 38.9% \|
	\| Position embeddings (`pos_emb.weight`) \| (256, 128) \| 32,768 \| 2.4% \|
	\| 4 × transformer block \| — \| 788,480 \| 58.6% \|
	\| └─ Per block: `ln1` (γ, β) \| (128,) × 2 \| 256 \| \|
	\| └─ Per block: `attn.qkv` \| (384, 128) \| 49,152 \| \|
	\| └─ Per block: `attn.proj` \| (128, 128) \| 16,384 \| \|
	\| └─ Per block: `ln2` (γ, β) \| (128,) × 2 \| 256 \| \|
	\| └─ Per block: `mlp.fc1` \| (512, 128) \| 65,536 \| \|
	\| └─ Per block: `mlp.fc2` \| (128, 512) \| 65,536 \| \|
	\| Final LayerNorm (`ln_f`) \| (128,) × 2 \| 256 \| 0.02% \|
	\| Output projection (`head.weight`) \| (4096, 128) \| 0 \| tied \|
	\| Total \| \| 1,345,792 \| \|

	Two observations worth absorbing:

	- Embeddings are 41% of total parameters at this scale. This is typical of small models — the vocab × d_model matrix dominates. As models grow, the transformer blocks become the much larger fraction (frontier models are >90% transformer body, with embeddings a rounding error).
	- MLPs (`fc1` + `fc2`) account for about two-thirds of every block's params: 131,072 of 197,120 = 66%. Recent interpretability research suggests MLPs are where most factual knowledge gets stored. The MLP-heavy split stays roughly true at frontier scale.
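
The breakdown table can be reproduced with a few lines of arithmetic. The sketch below uses only the hyperparameters listed in this card; the variable names are illustrative. It also shows where the ~5.1 MB file size comes from (4 bytes per float32 weight, reported in MiB).

```python
# Hyperparameters taken from the tables above.
vocab_size, ctx_len, d_model, ffn_dim, n_layers = 4096, 256, 128, 512, 4

per_block = {
    "ln1": 2 * d_model,                 # gamma and beta
    "attn.qkv": 3 * d_model * d_model,  # no bias
    "attn.proj": d_model * d_model,
    "ln2": 2 * d_model,
    "mlp.fc1": ffn_dim * d_model,
    "mlp.fc2": d_model * ffn_dim,
}
block_params = sum(per_block.values())

totals = {
    "tok_emb": vocab_size * d_model,
    "pos_emb": ctx_len * d_model,
    "blocks": n_layers * block_params,
    "ln_f": 2 * d_model,
    "head": 0,                          # tied with tok_emb, adds nothing
}
total = sum(totals.values())

for name, n in totals.items():
    print(f"{name:8s} {n:>9,d} {100 * n / total:5.1f}%")
print(f"total    {total:>9,d}")

mlp_share = (per_block["mlp.fc1"] + per_block["mlp.fc2"]) / block_params
float32_mib = total * 4 / 2**20
print(f"MLP share of a block: {mlp_share:.0%}, float32 size: {float32_mib:.1f} MiB")
```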

	---

	## Training

	### Data

	- Dataset: [`roneneldan/TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories) (Eldan & Li, 2023)
	- Stories: ~2.1M (train) + ~22K (validation)
	- Tokens (after BPE): ~470M (train) + ~5M (validation)
	- Why TinyStories specifically: synthetic dataset designed so vocabulary
	and grammar stay within what a 3–4 year-old understands, making coherent
	generation possible at very small model scales. Without this curation, a
	1.35M-param model on general web text produces gibberish.

	### Tokenizer

	- Type: byte-level Byte-Pair Encoding (BPE)
	- Vocabulary: 4,096 tokens (including special tokens `<unk>`, `<eos>`)
	- Trained on: 50,000 stories from the train split (vocab converges
	quickly; full corpus produces a near-identical tokenizer)
	- Avg compression: ~4 characters per token on TinyStories text
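
For readers new to byte-level BPE, here is a toy trainer in plain Python. It is an illustrative sketch only, not the tokenizer code used for this model: the function names are hypothetical, special tokens are ignored, and a real run would learn enough merges to reach a vocabulary of 4,096 rather than the handful shown here.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base vocabulary: the 256 byte values
    merges = {}                       # (left_id, right_id) -> new_id
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids


def decode(ids, merges):
    """Expand merged IDs back to bytes, then to text."""
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dict order = merge order
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids).decode("utf-8")


sample = "the cat sat on the mat. the cat was happy."
merges, ids = train_bpe(sample, num_merges=10)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == sample)
```

Because every string can be expressed as bytes, this scheme is reversible by construction, which is the property the round-trip check at the end exercises.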

	### Optimization

	\| Hyperparameter \| Value \|
	\|---\|---\|
	\| Optimizer \| AdamW \|
	\| β₁, β₂ \| 0.9, 0.95 \|
	\| Weight decay \| 0.1 \|
	\| Peak learning rate \| 3e-4 \|
	\| Min learning rate \| 3e-5 \|
	\| Schedule \| Linear warmup (200 steps) → cosine decay \|
	\| Batch size (sequences) \| 64 \|
	\| Sequence length \| 256 \|
	\| Tokens per step \| 16,384 \|
	\| Total steps \| 20,000 \|
	\| Total tokens seen \| ~327M \|
	\| Gradient clipping \| 1.0 (global L2 norm) \|
	\| Random seed \| 1337 \|
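
The schedule and token counts in the table can be written out as a short sketch. The exact warmup formula (for example, whether step 0 starts at zero or at `peak / warmup`) is not stated in this card, so the version below is an assumption; the constants come from the table.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 200, 20_000
BATCH_SIZE, SEQ_LEN = 64, 256


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        # Assumed form: ramp so that the last warmup step reaches the peak.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


tokens_per_step = BATCH_SIZE * SEQ_LEN
total_tokens = tokens_per_step * TOTAL_STEPS

for step in (0, 199, 200, 10_100, 20_000):
    print(f"step {step:>6d}  lr = {lr_at(step):.2e}")
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
```

Since the train split holds roughly 470M tokens, 20,000 steps at 16,384 tokens each cover less than one full pass over the data.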

	### Hardware & wall-clock

	\| \| \|
	\|---\|---\|
	\| Hardware \| Apple M-series laptop (MPS backend) \|
	\| Precision \| float32 \|
	\| Wall-clock \| ~1.5 hours \|
	\| Peak memory \| ~1.5 GB \|
	\| Disk footprint \| ~1 GB tokenized corpus + 5.1 MB checkpoint \|

	---

	## Evaluation

	### Held-out validation loss

	\| Step \| Val loss \| Perplexity \|
	\|---\|---\|---\|
	\| 0 (init) \| 8.32 \| 4096 \|
	\| ~17,500 \| 2.26 \| 9.59 \|
	\| ~20,000 \| 2.25 \| 9.49 \|

	For context: a uniform random predictor over the 4,096-token vocabulary
	would have loss `ln(4096) = 8.32` and perplexity 4,096. A perplexity of
	9.49 means the model has effectively narrowed each prediction down to
	~9.5 plausible next-token choices, on average.
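
The numbers in this section follow directly from the definitions `loss = -ln(p)` and `perplexity = exp(loss)`. A small check, using only values quoted in this card:

```python
import math

vocab_size = 4096

# A uniform predictor assigns probability 1/vocab_size to the correct token.
uniform_loss = -math.log(1 / vocab_size)
uniform_ppl = math.exp(uniform_loss)

final_loss = 2.25
final_ppl = math.exp(final_loss)

print(f"uniform: loss {uniform_loss:.2f}, perplexity {uniform_ppl:.0f}")
print(f"final:   loss {final_loss:.2f}, perplexity {final_ppl:.2f}")
```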

	> Note on missing checkpoint progression. This release contains only the
	> final checkpoint. The original plan was to capture step-tagged snapshots
	> (steps 200, 1000, 3000, 8000, 20000) to demonstrate capability emergence
	> across training — gibberish → words → grammar → narrative. A monitoring
	> script was started after training had already passed the early steps, so
	> only the final checkpoint was preserved. A v0.2 release with multi-checkpoint
	> snapshots and an emergence comparison is planned.

	### Capability probes

	Three diagnostic prompts designed to probe specific capabilities, evaluated
	on the final checkpoint (step 20,000, T=0.8, top-k=40):

	Short-range coherence — "Lily picked up the red ball and"
	> ...put it on the floor. She threw the red ball with her hands and pulled
	> it onto the floor. She tried to make it shine, but it made her feel good.
	> "Oh no, no!" she shouted. "Sharing is very dangerous!"

	Pronoun continuity holds; ball persists as object. ✓

	Causal reasoning — "Tom was very sad because"
	> ...he had no one to play. He thought that it would be fun and share his
	> toys with Timmy. His mom took him inside and said, "I love you too, Timmy!"

	"because" produces a real cause of sadness; emotional consistency holds for
	2–3 sentences before drifting. ✓ (with caveats)

	Narrative composition — "The little dragon wanted to fly but"
	> ...her mommy said no. The bear was very sad that he was gone. He wanted
	> to fly anymore and get lost.

	Initial obstacle is set up correctly, but the model loses track of which
	character is which (dragon → bear → "he"). ✗

	This pattern (local coherence ✓, multi-sentence composition partial) is
	expected at this scale. Narrative arc requires planning across many tokens,
	a capability that tends to appear only in much larger models.

	---

	## Intended use

	In scope:
	- Educational reference for the GPT-style transformer architecture
	- Demonstration of end-to-end LLM training on consumer hardware
	- Generating short, simple, TinyStories-style English children's narratives
	- Exploring how sampling parameters (temperature, top-k, top-p) affect output
	- Comparison baseline for tiny-model research

	Out of scope:
	- General-purpose text generation (vocabulary is restricted to TinyStories)
	- Question answering, instruction following, or chat (no SFT or RLHF stage)
	- Anything requiring factual accuracy (no factual grounding)
	- Non-English text (English-only training data)
	- Long-form generation (256-token context window)

	---

	## Limitations and biases

	- Distribution lock-in: Trained exclusively on synthetic children's
	stories. Generation outside this distribution (e.g., technical text,
	adult themes, dialogue formats) will be incoherent.
	- No instruction following: This is a base model — pre-training only.
	It completes text; it does not answer questions or follow instructions.
	- Hallucination: No factual grounding. The model has no concept of
	"I don't know" — it produces the most statistically plausible
	continuation, which is often false outside the training distribution.
	- Context window: 256 tokens is too short to model long dependencies.
	- Synthetic data biases: TinyStories was generated by GPT-3.5/4 with
	prompted constraints, so it inherits some of that generator's stylistic
	patterns and any biases encoded therein.
	- No safety training: No RLHF, no Constitutional AI, no content
	filtering. While the training data is innocuous, prompts that
	push toward harmful outputs receive no safeguards.
	- Memorization vs generalization: Some completions ("She was very
	happy and they played all day") are likely memorized stylistic
	patterns rather than novel generation.

	---

	## How to use

	### Inference

	```python
	from inference import NanoSLMInference

	slm = NanoSLMInference("ckpt.pt", "tokenizer.json")

	text = slm.generate(
	    "Once upon a time, there was a little",
	    max_new_tokens=200,
	    temperature=0.8,
	    top_k=40,
	)
	print(text)
	```

	### Sampling parameters

	\| Parameter \| Effect \|
	\|---\|---\|
	\| `temperature` \| Divides logits by this value before softmax. 0 = greedy (deterministic, often repetitive). 1.0 = no scaling. <1 = sharper, >1 = more random. Typical: 0.7–1.0. \|
	\| `top_k` \| Keep only the k highest-probability tokens. Filters tail-of-distribution garbage. Typical: 40–100. \|
	\| `top_p` (nucleus) \| Keep the smallest set of tokens with cumulative probability ≥ p. Adapts the cutoff to distribution shape. Typical: 0.9–0.95. \|
	\| `seed` \| Sets PyTorch RNG for reproducibility. \|
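
To show how these parameters interact, here is a dependency-free sketch of one sampling step. It is not the code behind `slm.generate`; the function `sample_next` is hypothetical, and applying top-k before top-p is one common ordering chosen here as an assumption.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=None, top_p=None, rng=random):
    """Pick a token ID from raw logits using temperature, top-k, and top-p."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)

    scaled = [x / temperature for x in logits]
    # Token IDs ranked from most to least likely.
    order = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)
    if top_k is not None:
        order = order[:top_k]

    # Softmax over the surviving tokens (subtract the max for stability).
    top = scaled[order[0]]
    weights = [math.exp(scaled[i] - top) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cumulative, cutoff = 0.0, len(order)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        order, probs = order[:cutoff], probs[:cutoff]

    return rng.choices(order, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(1337)  # seeded generator for reproducible draws
greedy = sample_next(logits, temperature=0)
draws = [sample_next(logits, 0.8, top_k=2, rng=rng) for _ in range(1000)]
print(greedy, sorted(set(draws)))
```

With `top_k=2`, only the two highest-scoring token IDs can ever be drawn, regardless of temperature.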

	---

	## How this model is served

	A live demo is hosted on [Hugging Face Spaces](https://huggingface.co/spaces/brettleehari/microgpt-demo).
	The serving stack is intentionally minimal:

	```
	User browser
	↓ HTTPS
	HF Spaces (free CPU instance, 2 vCPU / 16 GB RAM)
	↓
	Gradio + FastAPI/uvicorn
	↓
	PyTorch eager-mode forward pass on CPU
	↓
	Autoregressive token generation, one token per pass
	```

	Approximate latency for 100 generated tokens: **~3 seconds on Spaces' free
	CPU, ~0.5 seconds on Apple M-series with MPS**.

	What this serving setup deliberately does not implement (each is a separate
	upgrade and a useful learning exercise):

	- KV-caching: every generation step re-processes all prior tokens.
	A real implementation caches K/V tensors and pays only for the new token.
	- Continuous batching: multiple users would queue serially. Production
	servers (vLLM, TGI) batch concurrent requests dynamically.
	- Quantization: weights are float32. int8 would shrink weight memory ~4×,
	int4 ~8×.
	- Compiled graphs: eager-mode PyTorch leaves performance on the table
	vs `torch.compile()`, ONNX Runtime, or a dedicated engine.

	For a model this small the overheads don't matter. At any production scale,
	every one of the above becomes critical to unit economics.
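
To get a feel for what the missing KV cache costs, the sketch below counts how many token positions pass through the model while generating. It is a simplified accounting under stated assumptions: the helper `positions_processed` is hypothetical, and the count ignores that attention over cached keys still grows with context length.

```python
def positions_processed(prompt_len, new_tokens, kv_cache):
    """Count token positions pushed through the model during generation."""
    total = 0
    for step in range(new_tokens):
        context = prompt_len + step  # tokens available before this step
        if kv_cache:
            # First step processes the whole prompt; later steps only the
            # single newest token, reusing cached K/V for everything else.
            total += context if step == 0 else 1
        else:
            # No cache: the full context is re-processed at every step.
            total += context
    return total


no_cache = positions_processed(prompt_len=10, new_tokens=100, kv_cache=False)
cached = positions_processed(prompt_len=10, new_tokens=100, kv_cache=True)
print(no_cache, cached, f"{no_cache / cached:.1f}x")
```

For a 10-token prompt and 100 generated tokens, the uncached loop handles 5,950 positions against 109 with a cache. The gap widens quadratically with sequence length, which is why the optimization matters far more for long contexts than for this model's 256-token window.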

	---

	## Comparison with frontier models

	The architecture belongs to the same family as GPT-2/3, Llama, and Mistral
	(and, as far as public information indicates, Claude). The differences below
	are evolutionary refinements, not categorical changes: the core "decoder-only
	transformer trained with next-token prediction" recipe is the same.

	\| \| microGPT (this) \| Llama 3 70B \|
	\|---\|---\|---\|
	\| Parameters \| 1.35M \| 70B (~52,000× larger) \|
	\| Layers \| 4 \| 80 \|
	\| `d_model` \| 128 \| 8,192 \|
	\| Heads \| 4 (multi-head) \| 64 (grouped-query attention) \|
	\| Context \| 256 \| 8,192 (128K in Llama 3.1) \|
	\| Vocab \| 4,096 \| 128,256 \|
	\| Position \| Learned absolute \| Rotary (RoPE) \|
	\| Activation \| GELU \| SwiGLU \|
	\| Normalization \| LayerNorm \| RMSNorm \|
	\| Training tokens \| ~327M \| ~15T (~46,000× more) \|
	\| Training compute \| <1 kWh on a laptop (~1.5 hours) \| many MW-months on H100 clusters \|

	---

	## Glossary

	A short reference for the terminology used above. Worth absorbing — these
	terms come up constantly in AI literature and interviews.

	Parameter / weight. A single learnable number stored in the model.
	Updated during training, read during inference. A "1.35M parameter model"
	literally has 1.35M of these numbers.

	Embedding. A learned vector representation of a discrete object (token,
	position). Implemented as a lookup table.

	Token. The atomic unit of text the model operates on. Produced by the
	tokenizer; typically ~4 characters of English per token for byte-level BPE.

	Tokenizer. The deterministic, reversible function that converts strings
	to integer ID sequences and back. Decisions made here (vocab size, BPE
	merges) propagate through the entire model.

	BPE (Byte-Pair Encoding). A subword tokenization algorithm that
	iteratively merges the most frequent adjacent pairs of symbols into new
	vocabulary entries.

	Logits. The raw, unnormalized scores the model outputs — one per
	vocabulary token at each position. Becomes a probability distribution after
	softmax.

	Softmax. Function that converts logits to probabilities by exponentiating
	and normalizing.

	Cross-entropy loss. The training objective: how surprised the model is
	by the correct next token. Equals 0 if the model assigned probability 1 to
	the right answer; equals `ln(vocab_size)` if the model is uniformly
	uninformed.

	Perplexity. `exp(loss)`. The "effective number of choices" the model is
	deciding between. Useful because it has a more intuitive scale than loss.

	Decoder-only / autoregressive. The model only attends to past tokens
	(causal mask), and generates one token at a time conditioned on what it has
	already produced.

	Self-attention. The mechanism by which each position computes a
	weighted combination of all (allowed) other positions, where the weights
	depend on the content at each position.

	Multi-head attention. Self-attention computed in parallel across `n`
	subspaces ("heads"), each with `d_model / n` dimensions. Different heads
	empirically learn to specialize.

	KV cache. At inference time, the Key and Value tensors from previous
	tokens can be cached and reused, avoiding redundant computation. Critical
	for production serving; not implemented in this model.

	Pre-LayerNorm. Applying LayerNorm before the attention/MLP sublayers,
	not after. Stabilizes training of deep transformers.

	Weight tying. Sharing parameters between the input embedding matrix and
	the output projection matrix. Saves memory; usually improves quality.

	Cosine learning-rate schedule. Learning rate ramps up linearly during
	warmup, then decays following a cosine curve. Standard for transformer
	training.

	Gradient clipping. Capping the global L2 norm of gradients during
	backpropagation to prevent destabilizing weight updates.
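
As a plain-Python illustration of global-norm clipping (the `1.0` setting in the optimization table), the sketch below treats each gradient tensor as a flat list of floats. The helper name is hypothetical; in PyTorch the equivalent operation is `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most `max_norm`."""
    # One norm over every element of every tensor, not one norm per tensor.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm = 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm_before, round(norm_after, 4))

small = [[0.3], [0.4]]            # global norm = 0.5, already under the cap
unchanged, _ = clip_by_global_norm(small, max_norm=1.0)
```

Gradients are rescaled together, so their direction is preserved and only the step size shrinks.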

	MPS (Metal Performance Shaders). Apple's GPU acceleration backend for
	PyTorch on M-series chips. The Apple Silicon equivalent of CUDA.

	Pre-training. The stage of training described here: minimize next-token
	prediction loss on a large corpus. Produces a base model.

	SFT (Supervised Fine-Tuning). A subsequent training stage on
	`(instruction, ideal response)` pairs. Teaches the model to follow
	instructions. Not done for this model.

	RLHF (Reinforcement Learning from Human Feedback). A further training
	stage using preference data. Aligns model behavior with human preferences.
	Not done for this model.

	---

	## Citation

	If this model or its companion code helped you, please cite or link to:

	```
	@misc{microgpt,
	  author = {Brett Lee Hary},
	  title = {microGPT: a 1.35M-parameter transformer trained from scratch on TinyStories},
	  year = {2026},
	  howpublished = {\url{https://huggingface.co/brettleehari/microgpt}},
	}
	```

	### Acknowledgements

	- Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT) — the
	reference implementation that made this approachable.
	- Eldan & Li (2023), [TinyStories: How Small Can Language Models Be and Still Speak Coherent English?](https://arxiv.org/abs/2305.07759) — the dataset and the insight that data quality can substitute for model scale.
	- Vaswani et al. (2017), [Attention Is All You Need](https://arxiv.org/abs/1706.03762) — the original transformer.
	- The Hugging Face `transformers`, `tokenizers`, and `datasets` teams for
	the infrastructure that makes projects like this trivial to share.