# tiny-gpt-shakespeare
A 10M-parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project; no pretrained weights or external libraries are used for the model itself.
## Model Description
- Architecture: Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- Parameters: 10.6M (modern) / 10.8M (vanilla)
- Training data: Tiny Shakespeare (~1.1MB, 65 unique characters)
- Tokenization: Character-level (65 tokens)
- Context length: 256 tokens
- License: MIT
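Character-level tokenization means the vocabulary is simply the sorted set of unique characters in the corpus (65 for Tiny Shakespeare). A minimal sketch of how such a tokenizer works; the repo's actual `tokenizer` module may differ in details:

```python
# Minimal character-level tokenizer sketch (illustrative, not the repo's exact code).

def build_vocab(text):
    """Map each unique character in the corpus to an integer id."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(s, stoi):
    """String -> list of integer token ids (one id per character)."""
    return [stoi[c] for c in s]

def decode(ids, itos):
    """List of integer token ids -> string."""
    return "".join(itos[i] for i in ids)
```

Encoding followed by decoding is an exact round trip, which is one reason character-level tokenizers are convenient for small educational models.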
## Architecture Details
| Component | Implementation |
|---|---|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with token embedding |
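The two less-standard layers can be sketched roughly as follows. This is a hedged sketch under common conventions (bias-free linears, pre-norm scaling), not necessarily the repo's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the features; no mean subtraction, no bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: silu(W1 x) * (W3 x), projected back down by W2 (384 -> 1024 -> 384 here)."""
    def __init__(self, dim=384, hidden=1024):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Note that SwiGLU uses three weight matrices instead of two, which is why the hidden width (1024) is smaller than the 4× expansion (1536) a vanilla FFN of the same parameter budget would use.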
## Training
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |
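With the corpus encoded as a 1-D tensor of character ids, each training step samples random windows of `block_size` characters, with targets shifted one character ahead. A sketch of that setup under the hyperparameters above (`get_batch` is an assumed helper name, not necessarily the repo's):

```python
import torch

def get_batch(data, block_size=256, batch_size=64, device="cpu"):
    """Sample random (input, target) windows; targets are inputs shifted by one character."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)

# Optimizer as listed in the table above:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```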
## Training Results
| Model | Parameters | Best Val Loss | Best Step |
|---|---|---|---|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |
Early stopping was used: the checkpoint with the lowest validation loss was kept.
## Component Comparison
Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):
| Component | Val Loss at Step 500 | vs Vanilla |
|---|---|---|
| Vanilla (baseline) | 1.99 | – |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |
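RoPE accounts for the largest single gain in this ablation. It injects position by rotating (even, odd) pairs of query/key dimensions through position-dependent angles, so attention scores depend on relative offsets. A self-contained sketch; the repo's implementation may precompute the angles differently:

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate dim pairs of x with shape (batch, seq, heads, head_dim) by position-dependent angles."""
    _, t, _, d = x.shape
    pos = torch.arange(t, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.outer(pos, freqs)          # (seq, head_dim/2)
    cos = angles.cos()[None, :, None, :]      # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin      # 2-D rotation applied per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is a pure rotation, vector norms are preserved and position 0 is left unchanged, which makes the function easy to sanity-check.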
## Intended Use
This is an educational model. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.
## Sample Outputs
Modern model, prompt `"ROMEO:"`, temperature=0.8:

```
ROMEO:
A gallant-house! what says the woe?

MERCUTIO:
Good madam, my lord.

ROMEO:
Villain, for I do not say it is true,
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```
Vanilla model, prompt `"ROMEO:"`, temperature=0.8:

```
ROMEO:
Good father, cousin, my lord, I could not need me.

First Servant:
Sir, but you came to this humour of the king,
Lest hear him withis heart flowers.
```
## Limitations
- Tiny dataset: Trained on only 1.1MB of text. The model overfits after ~2,500 steps.
- Character-level tokenization: Inefficient compared to BPE. Each character is a separate token.
- No instruction tuning: This is a base model; it completes text, it does not follow instructions or answer questions.
- Small context window: 256 tokens maximum.
- Quality: Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.
## How to Use
```python
import sys
import torch

sys.path.append("src")  # model and tokenizer modules live under src/
from model_modern import ModernGPT
from tokenizer import encode, decode

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint, which stores both the model config and the weights
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model = ModernGPT(**ckpt["config"]).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Encode a prompt and sample 200 characters
idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```
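The `temperature=0.8` argument sharpens the next-character distribution before sampling. The core of one sampling step can be sketched as follows (hedged: the repo's `generate` may add top-k filtering or the KV-cache path on top of this):

```python
import torch

@torch.no_grad()
def sample_next(logits, temperature=0.8):
    """Scale logits by 1/temperature, softmax into probabilities, draw one token id."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

Temperatures below 1.0 concentrate probability on the likeliest characters, trading diversity for coherence; at temperature 1.0 the model's raw distribution is sampled unchanged.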
## Source Code
Full implementation with detailed documentation: github.com/brianmeyer/tinyllm
## References
- Karpathy, A. build-nanogpt (primary reference)
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding
- Shazeer, N. (2020). GLU Variants Improve Transformer
- Zhang & Sennrich (2019). Root Mean Square Layer Normalization