# Mini Modern LLM
Mini Modern LLM is a compact, educational language model built with PyTorch. It takes the ideas implemented in the accompanying notebook and turns them into a small runnable codebase with reusable modules for tokenization, dataset preparation, training, and text generation.
The project is designed to be easy to read while still using several ideas from modern transformer architectures:
- RMSNorm instead of standard LayerNorm
- RoPE (Rotary Positional Embeddings) instead of learned positional embeddings
- Grouped Query Attention (GQA) to reduce KV head cost
- SwiGLU as the feed-forward activation block
## Architecture
The core model lives in `src/model.py` and is implemented as a pre-norm transformer.
### Model Flow
- BPE token ids are mapped to embeddings.
- Tokens pass through a stack of transformer blocks.
- Each block applies:
  - RMSNorm, then GroupedQueryAttention with RoPE, then a residual connection
  - RMSNorm, then SwiGLU, then a residual connection
- A final RMSNorm is applied.
- A tied output projection predicts the next token.
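The per-block flow can be sketched in a few lines. This toy numpy version is illustrative only (it is not the actual `src/model.py` implementation): the `toy_attention` helper replaces GroupedQueryAttention with a simple causal average so the shape of the pre-norm/residual computation stays visible.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Divide by the root-mean-square of the features, then apply a learned gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def toy_attention(x):
    # Stand-in for GroupedQueryAttention with RoPE: a causal average over
    # earlier positions, so each token only mixes with its past.
    T = x.shape[0]
    mask = np.tril(np.ones((T, T)))
    return (mask / mask.sum(axis=-1, keepdims=True)) @ x

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SiLU-gated feed-forward pathway.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d, hidden = 8, 16                      # toy sizes; the real config is larger
gain = np.ones(d)
w_gate = rng.normal(size=(d, hidden))
w_up = rng.normal(size=(d, hidden))
w_down = rng.normal(size=(hidden, d))

x = rng.normal(size=(4, d))                                   # (seq, d_model)
x = x + toy_attention(rms_norm(x, gain))                      # attention sub-block
x = x + swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)   # FFN sub-block
```

Each sub-block normalizes its input first (pre-norm) and adds the result back onto the residual stream, which is what keeps deep stacks trainable.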
### Modern Components
- **RMSNorm**: Normalizes activations using root-mean-square statistics with a learned scale parameter.
- **RoPE**: Encodes token position by rotating query and key vectors, allowing attention to capture relative positions naturally.
- **Grouped Query Attention**: Uses more query heads than key/value heads, which keeps attention expressive while reducing KV projection cost.
- **SwiGLU**: Uses a SiLU-gated feed-forward pathway that is commonly used in modern LLMs.
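The RoPE idea can be sketched directly (illustrative, not the project's exact implementation): each query/key feature pair is rotated by an angle that grows with position, so dot products between queries and keys end up depending only on their relative offset.

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (seq_len, head_dim), head_dim even. Pair feature i with i + half and
    # rotate each pair by a position-dependent angle.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # rotation speed per pair
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Place the same q and k at every position: after RoPE, the q.k dot product
# depends only on the offset between positions, not their absolute values.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)
Q = rope(np.tile(q, (4, 1)))
K = rope(np.tile(k, (4, 1)))
# Q[0] @ K[1] and Q[2] @ K[3] are numerically equal: both are offset 1.
```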
## Default Training Configuration
The current training script uses:
```
d_model        = 256
n_layers       = 4
n_heads        = 8
n_kv_heads     = 2
ffn_hidden_dim = 680
context_length = 256
batch_size     = 64
```
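These hyperparameters can be collected into a config object. The dataclass below is an illustrative sketch (field names follow the list above, not necessarily the actual code in `src/model.py`); it also makes the GQA constraint explicit: `n_heads` must be divisible by `n_kv_heads`, so here 8 query heads share 2 KV heads in groups of 4.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    d_model: int = 256
    n_layers: int = 4
    n_heads: int = 8
    n_kv_heads: int = 2        # GQA: query heads are grouped over KV heads
    ffn_hidden_dim: int = 680
    context_length: int = 256
    batch_size: int = 64

cfg = TrainConfig()
assert cfg.n_heads % cfg.n_kv_heads == 0   # 4 query heads per KV head
assert cfg.d_model % cfg.n_heads == 0      # per-head dim of 32
```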
## Dataset
The project trains on the TinyStories dataset from Hugging Face.
- Source: `roneneldan/TinyStories`
- Raw text file: `data/dataset.txt`
- Tokenization: byte-level BPE via `tokenizer.json`
- Split: 90% train / 10% validation over the token stream
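The 90/10 split over the token stream can be sketched as follows (a hypothetical helper, not the project's exact code): the stream is cut contiguously rather than shuffled, so validation tokens come from stories the model never trains on.

```python
import numpy as np

def split_tokens(token_ids, train_frac=0.9):
    # Contiguous split over the flat token stream: first 90% for training,
    # the remaining 10% for validation.
    tokens = np.asarray(token_ids)
    cut = int(len(tokens) * train_frac)
    return tokens[:cut], tokens[cut:]

train_ids, val_ids = split_tokens(np.arange(1000))
# 900 training tokens, 100 validation tokens
```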
The dataset is downloaded automatically the first time you run `download_dataset.py`, `train_tokenizer.py`, or `train.py`.
## Installation
Create a virtual environment if you want, then install dependencies:
```bash
pip install -r requirements.txt
```
For Google Colab, install only the missing libraries so you keep Colab's GPU-enabled PyTorch:
```bash
pip install tokenizers datasets numpy
```
## How To Train
Run the full pipeline:
```bash
python download_dataset.py
python train_tokenizer.py
python train.py
```
What the training script does:
- ensures `data/dataset.txt` exists
- loads `tokenizer.json`
- converts TinyStories into BPE token ids
- initializes `MiniLLM`
- trains with AdamW
- prints training loss every 100 steps
- saves a checkpoint to `model.pt`
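A common way a training loop like this turns the token stream into next-token prediction batches is to sample random windows of `context_length` tokens, with targets shifted one token to the right. The helper below is a hypothetical sketch (the actual `train.py` may differ):

```python
import numpy as np

def get_batch(tokens, batch_size, context_length, rng):
    # Sample random windows from the token stream; the target sequence is the
    # input sequence shifted one position to the right.
    starts = rng.integers(0, len(tokens) - context_length, size=batch_size)
    x = np.stack([tokens[s : s + context_length] for s in starts])
    y = np.stack([tokens[s + 1 : s + context_length + 1] for s in starts])
    return x, y

rng = np.random.default_rng(0)
tokens = np.arange(10_000)
xb, yb = get_batch(tokens, batch_size=64, context_length=256, rng=rng)
# with tokens = 0, 1, 2, ..., every target equals its input plus one
```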
The saved checkpoint includes:
- model weights
- model config
- tokenizer metadata
## Publish Notes
If you push this repository to Hugging Face, do not commit generated training artifacts such as `data/dataset.txt`, token caches, or `model.pt`. The included `.gitignore` already excludes them.
## How To Generate Text
After training, generate text with:
```bash
python generate.py --prompt "Once upon a time"
```
You can also control sampling temperature and output length:
```bash
python generate.py --prompt "Once upon a time" --temperature 0.8 --max-new-tokens 300
```
Useful flags:
- `--prompt`: starting text for generation
- `--temperature`: sampling temperature, where `0` is greedy decoding
- `--max-new-tokens`: number of tokens to generate
- `--model-path`: optional path to a checkpoint
- `--seed`: optional random seed for reproducible sampling
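The effect of the temperature flag can be sketched as follows (illustrative, assuming a standard softmax-with-temperature sampler rather than the exact `generate.py` code): temperature 0 means greedy argmax decoding, while higher values flatten the distribution and make sampling more diverse.

```python
import numpy as np

def sample_next(logits, temperature=0.8, rng=None):
    # temperature == 0: greedy decoding (always pick the top logit).
    if temperature == 0:
        return int(np.argmax(logits))
    rng = rng if rng is not None else np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([1.0, 3.0, 2.0])
sample_next(logits, temperature=0)   # greedy pick: index 1
```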
## Repository Structure
```
.
├── src/
│   ├── model.py
│   ├── tokenizer.py
│   └── dataset.py
├── notebooks/
│   └── first-mini-modern-llm.ipynb
├── train.py
├── generate.py
└── requirements.txt
```
## Summary
This repository is a small but modern LLM training project focused on clarity and learning. It is a good starting point for experimenting with transformer internals, BPE tokenization, and compact end-to-end language model workflows.