SLM: Decoder-only Language Model with Curriculum Learning
A ~47-51M parameter decoder-only transformer trained from scratch using a 3-stage curriculum (easy → hard text). Built entirely in PyTorch.
Project Structure
slm/
├── model.py            Transformer architecture (RoPE / learnable pos, KV cache)
├── tokenizer.py        BPE tokenizer training (two vocab options)
├── dataset.py          Per-stage datasets + val set builders
├── train.py            Single-stage training loop (3 exit conditions + logging)
├── curriculum.py       Multi-stage orchestrator
├── logger.py           CSV + console logging
├── configs/
│   ├── stage0.yaml     TinyStories, seq=256, max=200M tokens
│   ├── stage1.yaml     SimpleWiki + BabyLM, seq=384, max=220M tokens
│   └── stage2.yaml     FineWeb-Edu, seq=512, max=500M tokens
├── notebooks/
│   └── run.ipynb       All training commands (run from here)
├── tokenizers/         Created after step 2
├── checkpoints/        Created during training
├── logs/               CSV logs per stage
└── cache/              Preprocessed token chunks (created automatically)
Architecture
| Parameter | Value |
|---|---|
| d_model | 512 |
| n_layers | 6 |
| n_heads | 8 |
| d_ff | 2048 (SwiGLU) |
| ctx_len | 512 |
| Norm | RMSNorm |
| Activation | SwiGLU |
| Position | Learnable OR RoPE (switchable) |
| Bias | False |
| Weight tying | True (lm_head = tok_emb) |
| Params (50k tok) | ~51M (learnable) / ~49M (RoPE) |
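As a rough cross-check of the table, the sketch below tallies weights from the listed dimensions. It assumes a vocabulary of exactly 50,000 tokens, four d_model × d_model attention projections per block, a three-matrix SwiGLU feed-forward, two RMSNorm gains per block, and one final norm. None of this is confirmed against model.py, so treat the result as an estimate; the figures in the table are the source's own and may reflect details this tally misses.

```python
def estimate_params(vocab_size, d_model=512, n_layers=6, d_ff=2048,
                    ctx_len=512, learnable_pos=True):
    """Back-of-envelope parameter count for a bias-free, weight-tied decoder."""
    tok_emb = vocab_size * d_model      # shared with lm_head (weight tying)
    pos_emb = ctx_len * d_model if learnable_pos else 0  # RoPE adds no weights
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    ffn = 3 * d_model * d_ff            # SwiGLU: gate, up, down (assumed layout)
    norms = 2 * d_model                 # two RMSNorm gain vectors per block
    final_norm = d_model
    return tok_emb + pos_emb + n_layers * (attn + ffn + norms) + final_norm


if __name__ == "__main__":
    for learnable in (True, False):
        n = estimate_params(50_000, learnable_pos=learnable)
        print(f"learnable_pos={learnable}: {n / 1e6:.2f}M parameters")
```

Under these assumptions the token embedding accounts for roughly half of the total, which is why the smaller corpus-derived vocabulary gives a leaner model.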
Curriculum Stages
| Stage | Dataset | Seq Len | Max Tokens | Val Source |
|---|---|---|---|---|
| 0 | TinyStories | 256 | 200M | TinyStories val |
| 1 | SimpleWiki + BabyLM | 384 | 220M | SimpleWiki sample |
| 2 | FineWeb-Edu (≥3 edu) | 512 | 500M | FineWeb sample |
All 3 val losses are logged at every eval step regardless of current stage. This lets you observe cross-stage forgetting in the logs.
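To get a feel for how long each stage can run, the token budgets in the table can be converted into optimizer steps. This is a minimal sketch that assumes batch=32 (the value quoted under Hardware Requirements) and no gradient accumulation; `steps_for_stage` is a hypothetical helper, not part of the repo.

```python
def steps_for_stage(max_tokens, seq_len, batch_size=32):
    """Upper bound on optimizer steps before the token budget is exhausted."""
    tokens_per_step = batch_size * seq_len
    return max_tokens // tokens_per_step


# (max_tokens, seq_len) per stage, taken from the Curriculum Stages table
STAGES = {0: (200_000_000, 256), 1: (220_000_000, 384), 2: (500_000_000, 512)}

if __name__ == "__main__":
    for stage, (budget, seq_len) in STAGES.items():
        print(f"stage {stage}: at most {steps_for_stage(budget, seq_len)} steps")
```

A stage may stop well before this bound if one of the early exit conditions fires first.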
Early Exit Conditions (any one triggers stage exit)
- Plateau: val loss on the current stage hasn't improved by >0.01 over patience evals
- Token budget: hard ceiling of max tokens per stage
- Loss spike: train loss rises > threshold over last K steps
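The three conditions above can be sketched as a single check run after each eval. This is an illustration, not the logic in train.py: apart from `patience` and `max_tokens`, the parameter names and default values (`min_delta`, `spike_window`, `spike_threshold`) are assumptions.

```python
def should_exit(val_history, train_history, tokens_seen, *, max_tokens,
                patience=3, min_delta=0.01, spike_window=5, spike_threshold=0.5):
    """Return the name of the first triggered exit condition, or None."""
    # Token budget: hard ceiling on tokens consumed in this stage.
    if tokens_seen >= max_tokens:
        return "token_budget"

    # Plateau: the last `patience` evals did not beat the earlier best
    # by more than `min_delta`.
    if len(val_history) > patience:
        best_before = min(val_history[:-patience])
        best_recent = min(val_history[-patience:])
        if best_before - best_recent <= min_delta:
            return "plateau"

    # Loss spike: train loss rose by more than the threshold over the last K steps.
    if len(train_history) > spike_window:
        if train_history[-1] - train_history[-1 - spike_window] > spike_threshold:
            return "loss_spike"

    return None
```

For example, `should_exit([3.0, 2.5, 2.495, 2.499, 2.497], [], 100, max_tokens=1000)` reports a plateau, because the last three evals improve on the earlier best of 2.5 by only 0.005.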
Step-by-Step: How to Run
Step 1: Install dependencies
pip install torch tokenizers datasets pyyaml
Or run Cell 0 in notebooks/run.ipynb.
Step 2: Train tokenizers
Train both tokenizers on a sample of all stage data combined.
# Tokenizer A: corpus-derived vocab (~32-40k)
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which corpus
# Tokenizer B: fixed 50k vocab
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which fixed
# Or both at once:
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which both
Output files:
- tokenizers/tokenizer_corpus.json
- tokenizers/tokenizer_50k.json
Step 3: Choose your config (one decision each)
Tokenizer:
- tokenizers/tokenizer_50k.json: 50k vocab, larger embedding table
- tokenizers/tokenizer_corpus.json: smaller natural vocab, leaner model
Positional encoding:
- --pos_type learnable: standard learned position embeddings
- --pos_type rope: RoPE (no learned params, better length extrapolation)
Step 4: Train Stage 0 (TinyStories)
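To illustrate why RoPE needs no learned parameters, here is a minimal pure-Python sketch of the rotation. It pairs adjacent dimensions and uses the common base of 10000; model.py may pair dimensions differently or use another base, so this is not a copy of the repo's implementation. The point is that the query-key score depends only on the relative offset between positions.

```python
import cmath


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (x0, x1), (x2, x3), ... by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)
```

Because position enters only through these fixed rotations, there is no position table to train.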
python train.py \
--stage 0 \
--config configs/stage0.yaml \
--tokenizer tokenizers/tokenizer_50k.json \
--pos_type learnable \
--checkpoint_dir checkpoints/ \
--log_dir logs/ \
--cache_dir cache/
Resume if interrupted:
python train.py --stage 0 --config configs/stage0.yaml \
--tokenizer tokenizers/tokenizer_50k.json \
--pos_type learnable \
--checkpoint_dir checkpoints/ --log_dir logs/ --cache_dir cache/ \
--resume
Output: checkpoints/stage0_best.pt
Step 5: Train Stage 1 (SimpleWiki + BabyLM)
Loads Stage 0 weights as starting point.
python train.py \
--stage 1 \
--config configs/stage1.yaml \
--tokenizer tokenizers/tokenizer_50k.json \
--pos_type learnable \
--checkpoint_dir checkpoints/ \
--log_dir logs/ \
--cache_dir cache/ \
--prev_checkpoint checkpoints/stage0_best.pt
Output: checkpoints/stage1_best.pt
Step 6: Train Stage 2 (FineWeb-Edu)
Loads Stage 1 weights as starting point.
python train.py \
--stage 2 \
--config configs/stage2.yaml \
--tokenizer tokenizers/tokenizer_50k.json \
--pos_type learnable \
--checkpoint_dir checkpoints/ \
--log_dir logs/ \
--cache_dir cache/ \
--prev_checkpoint checkpoints/stage1_best.pt
Output: checkpoints/stage2_best.pt
(Alternative) Run full curriculum automatically
python curriculum.py \
--tokenizer tokenizers/tokenizer_50k.json \
--pos_type learnable \
--checkpoint_dir checkpoints/ \
--log_dir logs/ \
--cache_dir cache/
Start from a specific stage:
python curriculum.py --start_stage 1 --tokenizer tokenizers/tokenizer_50k.json ...
Run specific stages only:
python curriculum.py --stages 1 2 --tokenizer tokenizers/tokenizer_50k.json ...
Step 7: Generate text
# In notebook: Cell 10
# Or directly:
python -c "
import torch
from tokenizers import Tokenizer
from model import SLM
tok = Tokenizer.from_file('tokenizers/tokenizer_50k.json')
ckpt = torch.load('checkpoints/stage2_best.pt', map_location='cuda')
m = SLM(ckpt['config']).cuda()
m.load_state_dict(ckpt['model_state'])
m.eval()
ids = tok.encode('Once upon a time').ids
out = m.generate(torch.tensor([ids]).cuda(), max_new=100, temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))
"
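The `temperature` and `top_k` arguments above control how the next token is sampled. The sketch below shows one common way to combine them on a plain list of logits. It is an assumption about what such a sampler typically does, not the code inside `m.generate`, and `sample_top_k` is a hypothetical name.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id using top-k filtering and temperature scaling."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits; everything else gets zero probability.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalised softmax
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.1, 5.0, 4.0, -1.0]
    print([sample_top_k(logits, top_k=2, rng=rng) for _ in range(10)])
```

With `top_k=1` this reduces to greedy decoding; larger values trade determinism for variety.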
Step 8: Benchmark KV cache vs no cache
Run Cell 11 in notebooks/run.ipynb after training.
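As a rough guide to what the benchmark should show, the sketch below counts how many token positions pass through the model during generation with and without a KV cache. It is a simplified cost model (it ignores the attention over cached keys, which still grows with sequence length), and `positions_processed` is a hypothetical helper rather than anything in the notebook.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Total token positions run through the model to generate `new_tokens` tokens."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # The prompt is processed once; each later step feeds only the newest token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-processes the whole sequence so far.
    return sum(prompt_len + t for t in range(new_tokens))


if __name__ == "__main__":
    for cached in (True, False):
        n = positions_processed(prompt_len=4, new_tokens=100, use_cache=cached)
        print(f"use_cache={cached}: {n} positions")
```

The gap between the two counts grows quadratically with the number of generated tokens, so the measured speedup should be larger for longer generations.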
Log Files
Each stage produces a CSV in logs/stage{N}_{timestamp}.csv:
step | stage | tokens_seen | train_loss | val_s0 | val_s1 | val_s2 | lr | note
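Since every row carries all three validation losses, cross-stage forgetting can be read straight from a log. The sketch below assumes the files are comma-separated with the column names shown above and measures how far an earlier stage's val loss has risen from its lowest logged value. The rows are synthetic, for illustration only, and are not real training results.

```python
import csv
import io


def forgetting(rows, val_col):
    """Rise in `val_col` from its lowest logged value to its last logged value."""
    vals = [float(r[val_col]) for r in rows if r[val_col] not in ("", None)]
    return vals[-1] - min(vals)


# Synthetic stage-1 log rows (made-up numbers, same columns as the real logs).
sample = io.StringIO(
    "step,stage,tokens_seen,train_loss,val_s0,val_s1,val_s2,lr,note\n"
    "100,1,1000000,3.10,1.80,3.20,4.10,0.0003,\n"
    "200,1,2000000,2.90,1.95,2.95,4.00,0.0003,\n"
    "300,1,3000000,2.80,2.05,2.85,3.95,0.0003,\n"
)
rows = list(csv.DictReader(sample))
print(f"val_s0 rose by {forgetting(rows, 'val_s0'):.2f} while training stage 1")
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")` on one of the files in logs/.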
Plot all curves with Cell 9 in the notebook:
# Or from command line:
python -c "
import glob, pandas as pd, matplotlib.pyplot as plt
# ... see notebook cell 9
"
Changing Configs
Edit configs/stage{N}.yaml to adjust:
- batch_size: reduce if OOM
- max_tokens: reduce for faster experiments
- patience: how many evals before plateau exit
- learning_rate: per-stage LR
- eval_interval: steps between val evaluations
Ablation: Curriculum vs Baseline
To run the baseline (random data order, same compute), train with all data mixed without staged configs. The easiest way is to:
- Concatenate all datasets in dataset.py's _train_iter for a single stage
- Use configs/stage2.yaml settings (full ctx_len, total token budget = sum of all stages)
- Compare val losses at matched tokens seen
The per-step CSV logs make this comparison straightforward in the notebook.
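Curriculum and baseline runs will rarely evaluate at exactly the same token counts, so comparing at matched tokens seen usually means interpolating. A minimal sketch, assuming `tokens_seen` is strictly increasing within each log; `loss_at` is a hypothetical helper, not part of the repo.

```python
from bisect import bisect_left


def loss_at(tokens, losses, target):
    """Linearly interpolate the val loss at `target` tokens seen."""
    if not tokens[0] <= target <= tokens[-1]:
        raise ValueError("target is outside the logged range")
    i = bisect_left(tokens, target)
    if tokens[i] == target:
        return losses[i]
    t0, t1 = tokens[i - 1], tokens[i]
    frac = (target - t0) / (t1 - t0)
    return losses[i - 1] + frac * (losses[i] - losses[i - 1])
```

Calling `loss_at` with the same `target` on the curriculum log and the baseline log gives two numbers that can be compared directly.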
Hardware Requirements
| Component | Spec |
|---|---|
| GPU | NVIDIA RTX 4060 Ti 16GB |
| VRAM used | ~2-3GB (batch=32) |
| Precision | bf16 (auto-detected) |
| Storage | ~10GB for cached data |