---
license: mit
language:
- en
tags:
- pytorch
- transformer
- language-model
- from-scratch
- educational
- shakespeare
- rope
- swiglu
- rmsnorm
- kv-cache
datasets:
- tiny-shakespeare
pipeline_tag: text-generation
---
# tiny-gpt-shakespeare
A 10M parameter decoder-only transformer trained on the Tiny Shakespeare dataset. Built from scratch in PyTorch as an educational project — no pretrained weights or external libraries used for the model itself.
## Model Description
- **Architecture:** Decoder-only transformer with modern components (RMSNorm, SwiGLU, RoPE, KV cache)
- **Parameters:** 10.6M (modern) / 10.8M (vanilla)
- **Training data:** [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) (~1.1MB, 65 unique characters)
- **Tokenization:** Character-level (65 tokens)
- **Context length:** 256 tokens
- **License:** MIT
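With only 65 unique characters in the corpus, the tokenizer can be very small. A minimal sketch of character-level encoding and decoding (the repo's `tokenizer.py` may differ in detail; here the vocabulary is simply the sorted set of unique characters in the training text):

```python
# Build a character-level vocabulary from the corpus.
text = "First Citizen:\nBefore we proceed any further, hear me speak."
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}      # token id -> char

def encode(s: str) -> list[int]:
    """Map each character to its integer token id."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map token ids back to characters and join them."""
    return "".join(itos[i] for i in ids)
```

Encoding and decoding are exact inverses, so `decode(encode(s)) == s` for any string whose characters appear in the corpus.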
## Architecture Details
| Component | Implementation |
|-----------|---------------|
| Layers | 6 transformer blocks |
| Attention | 6 heads, 64 dims each, with RoPE |
| FFN | SwiGLU (384 → 1024 → 384) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Inference | KV cache for autoregressive generation |
| Weight tying | lm_head shares weights with token embedding |
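RoPE rotates each (even, odd) dimension pair of a query or key vector by a position-dependent angle, so attention scores depend only on relative position. A pure-Python sketch for a single head vector (the model itself applies this to batched tensors, so this is an illustration, not the repo's code):

```python
import math

def rope(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive dimension pairs of one head vector by pos-dependent angles."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)  # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out
```

The key property: the dot product of a rotated query at position m and a rotated key at position n depends only on m − n, which is what lets attention encode relative offsets without learned position embeddings.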
## Training
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 3e-4 |
| Batch size | 64 |
| Block size | 256 |
| Dropout | 0.3 |
| Training steps | 5,000 (best checkpoint at step 2,500) |
| Hardware | Google Colab T4 GPU |
| Training time | ~64 minutes |
### Training Results
| Model | Parameters | Best Val Loss | Best Step |
|-------|-----------|-------------|-----------|
| Vanilla (LayerNorm + ReLU + learned pos) | 10.8M | 1.4804 | 3,000 |
| Modern (RMSNorm + SwiGLU + RoPE + KV cache) | 10.6M | 1.4754 | 2,500 |
Early stopping was used: training ran the full step budget, but the saved checkpoint is the one with the lowest validation loss.
### Component Comparison
Each modern component was tested in isolation against the vanilla baseline (2,000-step runs, with validation loss compared at step 500):
| Component | Val Loss at Step 500 | vs Vanilla |
|-----------|---------------------|-----------|
| Vanilla (baseline) | 1.99 | — |
| RMSNorm | 1.99 | No change |
| SwiGLU | 1.88 | -0.11 |
| RoPE | 1.68 | -0.31 |
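The SwiGLU gain comes from gating the FFN's up-projection with a SiLU-activated parallel projection. A minimal sketch of just the elementwise gating (the three projection matrices, 384 → 1024 → 384 in this model, are omitted for brevity):

```python
import math

def silu(x: float) -> float:
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """Elementwise SwiGLU gating: silu(W1 x) * (W3 x), projections assumed done."""
    return [silu(g) * u for g, u in zip(gate, up)]
```

In the full block, `gate` and `up` are two separate linear projections of the same input, and the gated product is projected back down to the model dimension.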
## Intended Use
This is an **educational model**. It is not intended for production use. It generates Shakespeare-style text and serves as a reference implementation for understanding transformer architectures.
## Sample Outputs
**Modern model, prompt: "ROMEO:", temperature=0.8:**
```
ROMEO:
A gallant-house! what says the woe?
MERCUTIO:
Good madam, my lord.
ROMEO:
Villain, for I do not say it is true,
Which hath a sin by him come to the crown,
That he is reports for me; for ever is he.
```
**Vanilla model, prompt: "ROMEO:", temperature=0.8:**
```
ROMEO:
Good father, cousin, my lord, I could not need me.
First Servant:
Sir, but you came to this humour of the king,
Lest hear him withis heart flowers.
```
## Limitations
- **Tiny dataset:** Trained on only 1.1MB of text. The model overfits after ~2,500 steps.
- **Character-level tokenization:** Inefficient compared to BPE. Each character is a separate token.
- **No instruction tuning:** This is a base model — it completes text, it does not follow instructions or answer questions.
- **Small context window:** 256 tokens maximum.
- **Quality:** Output is recognizably Shakespeare-like but contains grammatical errors and occasionally mixes characters from different plays.
## How to Use
```python
import sys
import torch

sys.path.append('src')  # model and tokenizer modules live in src/
from model_modern import ModernGPT
from tokenizer import encode, decode

device = "cuda" if torch.cuda.is_available() else "cpu"

# The checkpoint stores both the model config and the weights.
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model = ModernGPT(**ckpt["config"]).to(device)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Encode the prompt and sample 200 characters autoregressively.
idx = torch.tensor([encode("ROMEO:")], dtype=torch.long, device=device)
out = model.generate(idx, max_new_tokens=200, temperature=0.8)
print(decode(out[0].tolist()))
```
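A note on `temperature=0.8`: `generate` scales the logits before sampling. A sketch of how temperature sampling typically works (not necessarily this model's exact implementation):

```python
import math
import random

def sample_with_temperature(logits: list[float], temperature: float, rng=random) -> int:
    """Softmax over temperature-scaled logits, then sample one token id."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the categorical distribution.
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Temperatures below 1.0 sharpen the distribution toward the most likely characters; temperatures above 1.0 flatten it and produce more varied, noisier text.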
## Source Code
Full implementation with detailed documentation: [github.com/brianmeyer/tinyllm](https://github.com/brianmeyer/tinyllm)
## References
- Karpathy, A. [build-nanogpt](https://github.com/karpathy/build-nanogpt) — primary reference
- Su et al. (2021). [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Shazeer, N. (2020). [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
- Zhang & Sennrich (2019). [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)