---
license: mit
language:
- en
tags:
- pytorch
- transformer
- language-model
- from-scratch
- educational
- shakespeare
- rope
- swiglu
- rmsnorm
- kv-cache
datasets:
- tiny-shakespeare
pipeline_tag: text-generation
---

# tiny-gpt-shakespeare

A 10M-parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project; no pretrained weights or external libraries were used for the model itself.

## Model Description

- **Architecture:** Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- **Parameters:** 10.6M (modern) / 10.8M (vanilla)
- **Training data:** [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1 MB, 65 unique characters)
- **Tokenization:** Character-level (65-token vocabulary)
- **Context length:** 256 tokens
- **License:** MIT

## Architecture Details

| Component | Implementation |
|-----------|----------------|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with the token embedding |

## Training

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |

### Training Results

| Model | Parameters | Best Val Loss | Best Step |
|-------|------------|---------------|-----------|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |

Early stopping was used: the checkpoint with the lowest validation loss was kept.
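Two of the modern components, RMSNorm and the SwiGLU FFN, are compact enough to sketch here. This is an illustrative PyTorch sketch using the dimensions from the Architecture Details table (384 → 1024 → 384); the class names, the `eps` value, and the bias-free linear layers are assumptions, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square only (no mean subtraction, no bias), then scale."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # rsqrt of the mean of squares over the feature dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

class SwiGLU(nn.Module):
    """Gated FFN: silu(W1 x) * (W3 x), projected back down (384 -> 1024 -> 384)."""
    def __init__(self, dim=384, hidden=1024):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden, bias=False)  # value branch
        self.w2 = nn.Linear(hidden, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Compared with the vanilla block, RMSNorm drops LayerNorm's mean-centering and bias, and SwiGLU replaces the single ReLU expansion with a gated product, which is where the modern model saves its ~0.2M parameters despite the extra projection.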
### Component Comparison

Each modern component was tested in isolation against the vanilla baseline (2,000 training steps each):

| Component | Val Loss at Step 500 | vs Vanilla |
|-----------|----------------------|------------|
| Vanilla (baseline) | 1.99 | — |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |

## Intended Use

This is an **educational model**. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.

## Sample Outputs

**Modern model, prompt: "ROMEO:", temperature=0.8:**

```
ROMEO:
A gallant-house! what says the woe?

MERCUTIO:
Good madam, my lord.

ROMEO:
Villain, for I do not say it is true,
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```

**Vanilla model, prompt: "ROMEO:", temperature=0.8:**

```
ROMEO:
Good father, cousin, my lord, I could not need me.

First Servant:
Sir, but you came to this humour of the king,
Lest hear him withis heart flowers.
```

## Limitations

- **Tiny dataset:** Trained on only 1.1 MB of text. The model overfits after ~2,500 steps.
- **Character-level tokenization:** Inefficient compared to BPE; each character is a separate token.
- **No instruction tuning:** This is a base model. It completes text; it does not follow instructions or answer questions.
- **Small context window:** 256 tokens maximum.
- **Quality:** Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.
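RoPE gave the largest single improvement in the component comparison above. The idea is to encode position by rotating each (even, odd) pair of query/key dimensions through a position-dependent angle. A minimal sketch, assuming an interleaved pairing and the 64-dim heads from the architecture table (function names and layout are illustrative, not the repository's code):

```python
import torch

def rope_freqs(head_dim, seq_len, base=10000.0):
    # one rotation frequency per dimension pair, geometrically spaced
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    angles = torch.outer(t, inv_freq)            # (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); rotate each (x1, x2) pair by its angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation angle depends only on the token's position, attention scores between rotated queries and keys depend on relative offsets, which is why RoPE generalizes better than learned absolute position embeddings on a dataset this small.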
## How to Use

```python
import sys
import torch

sys.path.append('src')
from model_modern import ModernGPT
from tokenizer import encode, decode

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint (stores both the model config and the weights)
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model = ModernGPT(**ckpt["config"]).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Encode a prompt and sample 200 new characters (1 token = 1 character)
idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```

## Source Code

Full implementation with detailed documentation: [github.com/brianmeyer/tinyllm](https://github.com/brianmeyer/tinyllm)

## References

- Karpathy, A. [build-nanogpt](https://github.com/karpathy/build-nanogpt) (primary reference)
- Su et al. (2021). [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Shazeer, N. (2020). [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
- Zhang & Sennrich (2019). [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)