# Mini Modern LLM
Mini Modern LLM is a compact, educational language model built with PyTorch. It takes the ideas implemented in the accompanying notebook and turns them into a small runnable codebase with reusable modules for tokenization, dataset preparation, training, and text generation.
The project is designed to be easy to read while still using several ideas from modern transformer architectures:
- RMSNorm instead of standard LayerNorm
- RoPE (Rotary Positional Embeddings) instead of learned positional embeddings
- Grouped Query Attention (GQA) to reduce KV head cost
- SwiGLU as the feed-forward activation block
## Architecture
The core model lives in `src/model.py` and is implemented as a pre-norm transformer.
### Model Flow
- BPE token ids are mapped to embeddings.
- Tokens pass through a stack of transformer blocks.
- Each block applies:
  - RMSNorm, then GroupedQueryAttention with RoPE, then a residual connection
  - RMSNorm, then SwiGLU, then a residual connection
- A final RMSNorm is applied.
- A tied output projection predicts the next token.
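The per-block flow can be sketched in a few lines. This toy numpy version is illustrative only (it is not the actual `src/model.py` implementation): the `toy_attention` helper replaces GroupedQueryAttention with a simple causal average so the shape of the pre-norm/residual computation stays visible.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Divide by the root-mean-square of the features, then apply a learned gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def toy_attention(x):
    # Stand-in for GroupedQueryAttention with RoPE: a causal average over
    # earlier positions, so each token only mixes with its past.
    T = x.shape[0]
    mask = np.tril(np.ones((T, T)))
    return (mask / mask.sum(axis=-1, keepdims=True)) @ x

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SiLU-gated feed-forward pathway.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, hidden = 8, 16                      # toy sizes; the real config is larger
gain = np.ones(d)
w_gate = rng.normal(size=(d, hidden))
w_up = rng.normal(size=(d, hidden))
w_down = rng.normal(size=(hidden, d))

x = rng.normal(size=(4, d))                                   # (seq, d_model)
x = x + toy_attention(rms_norm(x, gain))                      # attention sub-block
x = x + swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)   # FFN sub-block
```

Each sub-block normalizes its input first (pre-norm) and adds the result back onto the residual stream, which is what keeps deep stacks trainable.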
### Modern Components
- **RMSNorm**: Normalizes activations using root-mean-square statistics with a learned scale parameter.
- **RoPE**: Encodes token position by rotating query and key vectors, allowing attention to capture relative positions naturally.
- **Grouped Query Attention**: Uses more query heads than key/value heads, which keeps attention expressive while reducing KV projection cost.
- **SwiGLU**: Uses a SiLU-gated feed-forward pathway that is commonly used in modern LLMs.
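The RoPE idea can be sketched directly (illustrative, not the project's exact implementation): each query/key feature pair is rotated by an angle that grows with position, so dot products between queries and keys end up depending only on their relative offset.

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, head_dim), head_dim even. Pair feature i with i + half and
    # rotate each pair by a position-dependent angle.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # rotation speed per pair
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Place the same q and k at every position: after RoPE, the q.k dot product
# depends only on the offset between positions, not their absolute values.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)
Q = rope(np.tile(q, (4, 1)))
K = rope(np.tile(k, (4, 1)))
# Q[0] @ K[1] and Q[2] @ K[3] are numerically equal: both are offset 1.
```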
## Default Training Configuration
The current training script uses:
```
d_model        = 256
n_layers       = 4
n_heads        = 8
n_kv_heads     = 2
ffn_hidden_dim = 680
context_length = 256
batch_size     = 64
```
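These hyperparameters can be collected into a config object. The dataclass below is an illustrative sketch (field names follow the list above, not necessarily the actual code in `src/model.py`); it also makes the GQA constraint explicit: `n_heads` must be divisible by `n_kv_heads`, so here 8 query heads share 2 KV heads in groups of 4.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    d_model: int = 256
    n_layers: int = 4
    n_heads: int = 8
    n_kv_heads: int = 2        # GQA: query heads are grouped over KV heads
    ffn_hidden_dim: int = 680
    context_length: int = 256
    batch_size: int = 64

cfg = TrainConfig()
assert cfg.n_heads % cfg.n_kv_heads == 0   # 4 query heads per KV head
assert cfg.d_model % cfg.n_heads == 0      # per-head dim of 32
```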
## Dataset
The project trains on the TinyStories dataset from Hugging Face.
- Source: `roneneldan/TinyStories`
- Raw text file: `data/dataset.txt`
- Tokenization: byte-level BPE via `tokenizer.json`
- Split: 90% train / 10% validation over the token stream
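The 90/10 split over the token stream can be sketched as follows (a hypothetical helper, not the project's exact code): the stream is cut contiguously rather than shuffled, so validation tokens come from stories the model never trains on.

```python
import numpy as np

def split_tokens(token_ids, train_frac=0.9):
    # Contiguous split over the flat token stream: first 90% for training,
    # the remaining 10% for validation.
    tokens = np.asarray(token_ids)
    cut = int(len(tokens) * train_frac)
    return tokens[:cut], tokens[cut:]

train_ids, val_ids = split_tokens(np.arange(1000))
# 900 training tokens, 100 validation tokens
```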
The dataset is downloaded automatically the first time you run `download_dataset.py`, `train_tokenizer.py`, or `train.py`.
## Installation
Create a virtual environment if you want, then install dependencies:
```bash
pip install -r requirements.txt
```
For Google Colab, install only the missing libraries so you keep Colab's GPU-enabled PyTorch:
```bash
pip install tokenizers datasets numpy
```
## How To Train
Run the full pipeline:
```bash
python download_dataset.py
python train_tokenizer.py
python train.py
```
What the training script does:
- ensures `data/dataset.txt` exists
- loads `tokenizer.json`
- converts TinyStories into BPE token ids
- initializes `MiniLLM`
- trains with AdamW
- prints training loss every 100 steps
- saves a checkpoint to `model.pt`
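A common way a training loop like this turns the token stream into next-token prediction batches is to sample random windows of `context_length` tokens, with targets shifted one token to the right. The helper below is a hypothetical sketch (the actual `train.py` may differ):

```python
import numpy as np

def get_batch(tokens, batch_size, context_length, rng):
    # Sample random windows from the token stream; the target sequence is the
    # input sequence shifted one position to the right.
    starts = rng.integers(0, len(tokens) - context_length, size=batch_size)
    x = np.stack([tokens[s : s + context_length] for s in starts])
    y = np.stack([tokens[s + 1 : s + context_length + 1] for s in starts])
    return x, y

rng = np.random.default_rng(0)
tokens = np.arange(10_000)
xb, yb = get_batch(tokens, batch_size=64, context_length=256, rng=rng)
# with tokens = 0, 1, 2, ..., every target equals its input plus one
```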
The saved checkpoint includes:
- model weights
- model config
- tokenizer metadata
## Publish Notes
If you push this repository to Hugging Face, do not commit generated training artifacts such as `data/dataset.txt`, token caches, or `model.pt`. The included `.gitignore` already excludes them.
## How To Generate Text
After training, generate text with:
```bash
python generate.py --prompt "Once upon a time"
```
You can also control sampling temperature and output length:
```bash
python generate.py --prompt "Once upon a time" --temperature 0.8 --max-new-tokens 300
```
Useful flags:
- `--prompt`: starting text for generation
- `--temperature`: sampling temperature, where `0` is greedy decoding
- `--max-new-tokens`: number of tokens to generate
- `--model-path`: optional path to a checkpoint
- `--seed`: optional random seed for reproducible sampling
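The effect of the temperature flag can be sketched as follows (illustrative, assuming a standard softmax-with-temperature sampler rather than the exact `generate.py` code): temperature 0 means greedy argmax decoding, while higher values flatten the distribution and make sampling more diverse.

```python
import numpy as np

def sample_next(logits, temperature=0.8, rng=None):
    # temperature == 0: greedy decoding (always pick the top logit).
    if temperature == 0:
        return int(np.argmax(logits))
    rng = rng if rng is not None else np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([1.0, 3.0, 2.0])
sample_next(logits, temperature=0)   # greedy pick: index 1
```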
## Repository Structure
```
.
├── src/
│   ├── model.py
│   ├── tokenizer.py
│   └── dataset.py
├── notebooks/
│   └── first-mini-modern-llm.ipynb
├── train.py
├── generate.py
└── requirements.txt
```
## Summary
This repository is a small but modern LLM training project focused on clarity and learning. It is a good starting point for experimenting with transformer internals, BPE tokenization, and compact end-to-end language model workflows.