Model specification

- Params: 21 million
- Architecture: decoder-only transformer
- Training data: 1.1 million tokens of Shakespeare text
- Context length: 256 tokens
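The 21M figure can be sanity-checked with a back-of-the-envelope parameter formula for a GPT-style decoder-only transformer. The layer count, embedding width, and vocabulary size below are hypothetical (the spec does not state them); they are chosen only to illustrate dims that land in this ballpark.

```python
def gpt_param_count(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    """Rough parameter count for a GPT-style decoder (biases/LayerNorms ignored)."""
    # Token and learned position embeddings.
    embeddings = vocab_size * n_embd + block_size * n_embd
    # Per transformer block: attention QKV + output projection (4 * n_embd^2)
    # plus a 4x-wide MLP (8 * n_embd^2), i.e. ~12 * n_embd^2 per layer.
    blocks = 12 * n_layer * n_embd ** 2
    return embeddings + blocks

# Hypothetical dims: 8 layers, width 464, char-level vocab of 65, context 256.
print(gpt_param_count(8, 464, 65, 256))  # 20817360, i.e. roughly 21 million
```

With a character-level Shakespeare vocabulary the embedding table is tiny, so nearly all of the budget sits in the transformer blocks.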