SLM — Decoder-only Language Model with Curriculum Learning

A decoder-only transformer with roughly 47-51M parameters (depending on the tokenizer and positional encoding), trained from scratch using a 3-stage curriculum (easy → hard text). Built entirely in PyTorch.

Project Structure

slm/
├── model.py              Transformer architecture (RoPE / learnable pos, KV cache)
├── tokenizer.py          BPE tokenizer training (two vocab options)
├── dataset.py            Per-stage datasets + val set builders
├── train.py              Single-stage training loop (3 exit conditions + logging)
├── curriculum.py         Multi-stage orchestrator
├── logger.py             CSV + console logging
├── configs/
│   ├── stage0.yaml       TinyStories  — seq=256, max=200M tokens
│   ├── stage1.yaml       SimpleWiki + BabyLM — seq=384, max=220M tokens
│   └── stage2.yaml       FineWeb-Edu — seq=512, max=500M tokens
├── notebooks/
│   └── run.ipynb         All training commands (run from here)
├── tokenizers/           Created after step 2
├── checkpoints/          Created during training
├── logs/                 CSV logs per stage
└── cache/                Preprocessed token chunks (created automatically)

Architecture

Parameter	Value
d_model	512
n_layers	6
n_heads	8
d_ff	2048 (SwiGLU)
ctx_len	512
Norm	RMSNorm
Activation	SwiGLU
Position	Learnable OR RoPE (switchable)
Bias	False
Weight tying	True (lm_head = tok_emb)
Params (50k tok)	~51M (learnable) / ~49M (RoPE)
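
As a rough cross-check of the table, the sketch below tallies weights from the listed dimensions. It assumes bias-free Q/K/V/O projections, a three-matrix SwiGLU feed-forward block, two RMSNorm gains per layer plus a final one, and a vocabulary of exactly 50,000. count_params is a hypothetical helper, not part of model.py, so the real totals may differ.

```python
def count_params(vocab_size, d_model=512, n_layers=6, d_ff=2048,
                 ctx_len=512, pos_type="learnable"):
    """Estimate trainable parameters from the architecture table (assumed layout)."""
    tok_emb = vocab_size * d_model        # shared with lm_head (weight tying)
    attn = 4 * d_model * d_model          # Q, K, V, O projections, no bias
    ffn = 3 * d_model * d_ff              # SwiGLU: gate, up, down matrices
    norms = 2 * d_model                   # two RMSNorm gain vectors per layer
    per_layer = attn + ffn + norms
    # Learnable positions add a ctx_len x d_model table; RoPE adds no weights.
    pos_emb = ctx_len * d_model if pos_type == "learnable" else 0
    final_norm = d_model
    return tok_emb + n_layers * per_layer + pos_emb + final_norm


for pos in ("learnable", "rope"):
    print(pos, f"{count_params(50_000, pos_type=pos) / 1e6:.1f}M")
```

Under these assumptions the learnable and RoPE variants differ only by the ctx_len × d_model position table (262,144 weights), and a smaller vocabulary shrinks only the embedding term. Any larger gap in reported figures would come from details this estimate does not capture.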

Curriculum Stages

Stage	Dataset	Seq Len	Max Tokens	Val Source
0	TinyStories	256	200M	TinyStories val
1	SimpleWiki + BabyLM	384	220M	SimpleWiki sample
2	FineWeb-Edu (edu score ≥ 3)	512	500M	FineWeb sample

All three validation losses are logged at every eval step regardless of the current stage, so cross-stage forgetting shows up in the logs as a rise in an earlier stage's validation loss.

Early Exit Conditions (any one triggers stage exit)

Plateau — val loss on current stage hasn't improved by >0.01 over patience evals
Token budget — hard ceiling of max tokens per stage
Loss spike — train loss rises > threshold over last K steps
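
The three conditions above can be sketched as a small monitor class. This is a minimal illustration, not the logic in train.py: the class name, method names, and default values are hypothetical, and "rises > threshold over last K steps" is read here as "current train loss minus the loss K steps earlier exceeds the threshold", which is only one plausible interpretation.

```python
from collections import deque


class StageExitMonitor:
    """Tracks the three stage-exit conditions; all names here are illustrative."""

    def __init__(self, max_tokens, patience=5, min_delta=0.01,
                 spike_threshold=0.5, spike_window=50):
        self.max_tokens = max_tokens
        self.patience = patience
        self.min_delta = min_delta
        self.spike_threshold = spike_threshold
        self.best_val = float("inf")
        self.stale_evals = 0
        # Holds the current train loss plus the K losses before it.
        self.recent = deque(maxlen=spike_window + 1)

    def on_train_step(self, train_loss, tokens_seen):
        """Return an exit reason after a training step, or None to continue."""
        self.recent.append(train_loss)
        if tokens_seen >= self.max_tokens:
            return "token_budget"
        if (len(self.recent) == self.recent.maxlen
                and self.recent[-1] - self.recent[0] > self.spike_threshold):
            return "loss_spike"
        return None

    def on_eval(self, val_loss):
        """Return 'plateau' once `patience` evals pass without enough improvement."""
        if self.best_val - val_loss > self.min_delta:
            self.best_val = val_loss
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return "plateau" if self.stale_evals >= self.patience else None
```

A training loop would call on_train_step after every optimizer step and on_eval after every validation pass on the current stage, leaving the stage as soon as either returns a reason.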

Step-by-Step: How to Run

Step 1 — Install dependencies

pip install torch tokenizers datasets pyyaml

Or run Cell 0 in notebooks/run.ipynb.

Step 2 — Train tokenizers

Train both tokenizers on a sample of all stage data combined.

# Tokenizer A: corpus-derived vocab (~32-40k)
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which corpus

# Tokenizer B: fixed 50k vocab
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which fixed

# Or both at once:
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which both

Output files:

tokenizers/tokenizer_corpus.json
tokenizers/tokenizer_50k.json

Step 3 — Choose your config (one decision each)

Tokenizer:

tokenizers/tokenizer_50k.json — 50k vocab, larger embedding table
tokenizers/tokenizer_corpus.json — smaller natural vocab, leaner model

Positional encoding:

--pos_type learnable — standard learned position embeddings
--pos_type rope — RoPE (no learned params, better length extrapolation)
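
To illustrate why RoPE needs no learned parameters, here is a minimal sketch on plain Python lists rather than tensors. It rotates consecutive pairs of a query or key vector by an angle proportional to the token position. The interleaved pairing and the base of 10000 are assumptions; model.py may lay out the pairs differently.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    assert len(vec) % 2 == 0, "head dimension must be even"
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
# The same offset (3) at different absolute positions gives the same score.
print(dot(rope(q, 5), rope(k, 2)), dot(rope(q, 105), rope(k, 102)))
```

The attention score between a rotated query and key depends only on their relative offset, which is the property the rotation is designed to provide.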

Step 4 — Train Stage 0 (TinyStories)

python train.py \
    --stage 0 \
    --config configs/stage0.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/

Resume if interrupted:

python train.py --stage 0 --config configs/stage0.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ --log_dir logs/ --cache_dir cache/ \
    --resume

Output: checkpoints/stage0_best.pt

Step 5 — Train Stage 1 (SimpleWiki + BabyLM)

Loads Stage 0 weights as starting point.

python train.py \
    --stage 1 \
    --config configs/stage1.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/ \
    --prev_checkpoint checkpoints/stage0_best.pt

Output: checkpoints/stage1_best.pt

Step 6 — Train Stage 2 (FineWeb-Edu)

Loads Stage 1 weights as starting point.

python train.py \
    --stage 2 \
    --config configs/stage2.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/ \
    --prev_checkpoint checkpoints/stage1_best.pt

Output: checkpoints/stage2_best.pt

(Alternative) Run full curriculum automatically

python curriculum.py \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/

Start from a specific stage:

python curriculum.py --start_stage 1 --tokenizer tokenizers/tokenizer_50k.json ...

Run specific stages only:

python curriculum.py --stages 1 2 --tokenizer tokenizers/tokenizer_50k.json ...
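
Conceptually, the orchestrator chains the per-stage commands from Steps 4-6, passing each stage's best checkpoint to the next. The sketch below builds those command lines using only the flags shown above. stage_commands is a hypothetical helper and is not necessarily how curriculum.py is implemented.

```python
import os


def stage_commands(stages, tokenizer, pos_type="learnable",
                   checkpoint_dir="checkpoints/", log_dir="logs/",
                   cache_dir="cache/"):
    """Build one train.py argument list per stage, chaining best checkpoints."""
    commands = []
    for stage in stages:
        cmd = ["python", "train.py",
               "--stage", str(stage),
               "--config", f"configs/stage{stage}.yaml",
               "--tokenizer", tokenizer,
               "--pos_type", pos_type,
               "--checkpoint_dir", checkpoint_dir,
               "--log_dir", log_dir,
               "--cache_dir", cache_dir]
        if stage > 0:
            # Warm-start from the previous stage's best checkpoint.
            prev = os.path.join(checkpoint_dir, f"stage{stage - 1}_best.pt")
            cmd += ["--prev_checkpoint", prev]
        commands.append(cmd)
    return commands


for cmd in stage_commands([0, 1, 2], "tokenizers/tokenizer_50k.json"):
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) to run the stages in order and stop at the first failure.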

Step 7 — Generate text

# In notebook: Cell 10
# Or directly:
python -c "
import torch
from tokenizers import Tokenizer
from model import SLM

# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tok  = Tokenizer.from_file('tokenizers/tokenizer_50k.json')
ckpt = torch.load('checkpoints/stage2_best.pt', map_location=device)
m    = SLM(ckpt['config']).to(device)
m.load_state_dict(ckpt['model_state'])
m.eval()

ids = tok.encode('Once upon a time').ids
# No gradients are needed for sampling
with torch.no_grad():
    out = m.generate(torch.tensor([ids], device=device), max_new=100, temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))
"

Step 8 — Benchmark KV cache vs no cache

Run Cell 11 in notebooks/run.ipynb after training.
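
As a back-of-the-envelope view of what that benchmark measures, the sketch below counts how many token positions pass through the model during generation with and without a KV cache. positions_processed is a hypothetical helper and the 4-token prompt is an arbitrary example. It counts work, not wall-clock time, so measured speedups will differ.

```python
def positions_processed(prompt_len, max_new, use_cache):
    """Count token positions fed through the model during generation."""
    if max_new <= 0:
        return 0
    if use_cache:
        # Prefill the prompt once, then feed one new token per step.
        return prompt_len + (max_new - 1)
    # Without a cache every step re-encodes the whole sequence so far.
    return sum(prompt_len + step for step in range(max_new))


no_cache = positions_processed(4, 100, use_cache=False)
cached = positions_processed(4, 100, use_cache=True)
print(no_cache, cached)
```

For a 4-token prompt and 100 new tokens this gives 5,350 positions without a cache versus 103 with one, which is why the gap widens as generations get longer.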

Log Files

Each stage produces a CSV in logs/stage{N}_{timestamp}.csv:

step | stage | tokens_seen | train_loss | val_s0 | val_s1 | val_s2 | lr | note
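
The columns above can be read back with the standard csv module. The sketch below is a minimal loader, assuming the files are comma-separated with exactly these header names and that validation cells are left empty on steps without an eval. Both function names are hypothetical, and the plotting helper needs matplotlib installed.

```python
import csv

VAL_COLUMNS = ("val_s0", "val_s1", "val_s2")


def load_val_curves(lines):
    """Parse log rows into {column: [(tokens_seen, loss), ...]} for eval rows."""
    curves = {col: [] for col in VAL_COLUMNS}
    for row in csv.DictReader(lines):
        for col in VAL_COLUMNS:
            cell = (row.get(col) or "").strip()
            if cell:  # skip steps where no eval was run
                curves[col].append((float(row["tokens_seen"]), float(cell)))
    return curves


def plot_val_curves(curves, out_path="val_curves.png"):
    """Optional: draw the three validation curves with matplotlib."""
    import matplotlib.pyplot as plt  # imported lazily so parsing works without it
    for col, points in curves.items():
        if points:
            xs, ys = zip(*points)
            plt.plot(xs, ys, label=col)
    plt.xlabel("tokens_seen")
    plt.ylabel("validation loss")
    plt.legend()
    plt.savefig(out_path)
```

load_val_curves accepts any iterable of text lines, so an open file handle for one of the logs/stage*.csv files works directly.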

Plot all curves with Cell 9 in the notebook:

# Or from command line:
python -c "
import glob, pandas as pd, matplotlib.pyplot as plt
# ... see notebook cell 9
"

Changing Configs

Edit configs/stage{N}.yaml to adjust:

batch_size — reduce if OOM
max_tokens — reduce for faster experiments
patience — how many evals before plateau exit
learning_rate — per-stage LR
eval_interval — steps between val evaluations
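
To see how batch_size and max_tokens interact, the sketch below converts each stage's token budget into a number of optimizer steps. It assumes every batch is fully packed (batch_size × seq_len tokens) and uses the batch=32 figure from the hardware table; grad_accum is a hypothetical knob included only for illustration.

```python
import math

# stage -> (seq_len, max_tokens), taken from the curriculum table
STAGES = {0: (256, 200_000_000), 1: (384, 220_000_000), 2: (512, 500_000_000)}


def max_steps(seq_len, max_tokens, batch_size=32, grad_accum=1):
    """Optimizer steps needed to consume a stage's token budget."""
    tokens_per_step = batch_size * seq_len * grad_accum
    return math.ceil(max_tokens / tokens_per_step)


for stage, (seq_len, budget) in STAGES.items():
    print(f"stage {stage}: {max_steps(seq_len, budget):,} steps at batch_size=32")
```

Halving batch_size to avoid OOM roughly doubles the number of steps needed to reach the same token budget, so eval_interval may need adjusting to keep evals equally spaced in tokens.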

Ablation: Curriculum vs Baseline

To run the baseline (random data order, same compute), train with all data mixed without staged configs. The easiest way is to:

Concatenate all datasets in dataset.py's _train_iter for a single stage
Use configs/stage2.yaml settings (full ctx_len, total token budget = sum of all stages, i.e. 200M + 220M + 500M = 920M)
Compare val losses at matched tokens seen

The per-step CSV logs make this comparison straightforward in the notebook.
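
Because the two runs will rarely evaluate at exactly the same token counts, one way to compare them at matched tokens seen is to interpolate the baseline curve at each curriculum eval point. The sketch below assumes each curve is a list of (tokens_seen, val_loss) pairs sorted by tokens_seen, with tokens_seen cumulative across stages. Both function names are hypothetical.

```python
from bisect import bisect_left


def loss_at(curve, tokens):
    """Linearly interpolate a [(tokens_seen, val_loss), ...] curve at `tokens`."""
    xs = [t for t, _ in curve]
    if not xs[0] <= tokens <= xs[-1]:
        raise ValueError("tokens outside the logged range")
    i = bisect_left(xs, tokens)
    if xs[i] == tokens:
        return curve[i][1]
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (tokens - x0) / (x1 - x0)


def compare_at_matched_tokens(curriculum, baseline):
    """Return (tokens, curriculum_loss, baseline_loss) at each curriculum eval point."""
    lo, hi = baseline[0][0], baseline[-1][0]
    return [(t, loss, loss_at(baseline, t))
            for t, loss in curriculum if lo <= t <= hi]
```

Curriculum eval points that fall outside the baseline's logged range are dropped rather than extrapolated.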

Hardware Requirements

Component	Spec
GPU	NVIDIA RTX 4060 Ti 16GB
VRAM used	~2-3GB (batch=32)
Precision	bf16 (auto-detected)
Storage	~10GB for cached data
