Model specification

- Params: 21 million
- Architecture: decoder-only transformer
- Training data: 1.1 million tokens of Shakespeare text
- Context length: 256 tokens
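The 21M figure can be sanity-checked with a back-of-the-envelope parameter formula for a GPT-style decoder-only transformer. The layer count, embedding width, and vocabulary size below are hypothetical (the spec does not state them); they are chosen only to illustrate dims that land in this ballpark.

```python
def gpt_param_count(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    """Rough parameter count for a GPT-style decoder (biases/LayerNorms ignored)."""
    # Token and learned position embeddings.
    embeddings = vocab_size * n_embd + block_size * n_embd
    # Per transformer block: attention QKV + output projection (4 * n_embd^2)
    # plus a 4x-wide MLP (8 * n_embd^2), i.e. ~12 * n_embd^2 per layer.
    blocks = 12 * n_layer * n_embd ** 2
    return embeddings + blocks

# Hypothetical dims: 8 layers, width 464, char-level vocab of 65, context 256.
print(gpt_param_count(8, 464, 65, 256))  # 20817360, i.e. roughly 21 million
```

With a character-level Shakespeare vocabulary the embedding table is tiny, so nearly all of the budget sits in the transformer blocks.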