SLM: Decoder-only Language Model with Curriculum Learning

A ~47-51M parameter decoder-only transformer trained from scratch using a 3-stage curriculum (easy → hard text). Built entirely in PyTorch.


Project Structure

slm/
├── model.py              Transformer architecture (RoPE / learnable pos, KV cache)
├── tokenizer.py          BPE tokenizer training (two vocab options)
├── dataset.py            Per-stage datasets + val set builders
├── train.py              Single-stage training loop (3 exit conditions + logging)
├── curriculum.py         Multi-stage orchestrator
├── logger.py             CSV + console logging
├── configs/
│   ├── stage0.yaml       TinyStories: seq=256, max=200M tokens
│   ├── stage1.yaml       SimpleWiki + BabyLM: seq=384, max=220M tokens
│   └── stage2.yaml       FineWeb-Edu: seq=512, max=500M tokens
├── notebooks/
│   └── run.ipynb         All training commands (run from here)
├── tokenizers/           Created after step 2
├── checkpoints/          Created during training
├── logs/                 CSV logs per stage
└── cache/                Preprocessed token chunks (created automatically)

Architecture

Parameter         Value
d_model           512
n_layers          6
n_heads           8
d_ff              2048 (SwiGLU)
ctx_len           512
Norm              RMSNorm
Activation        SwiGLU
Position          Learnable or RoPE (switchable)
Bias              False
Weight tying      True (lm_head = tok_emb)
Params (50k tok)  ~51M (learnable) / ~49M (RoPE)
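
As a rough cross-check on the parameter figures above, the count can be estimated from the table's hyperparameters. This is a back-of-envelope sketch, not the project's own code: estimate_params is a hypothetical helper, and it assumes four bias-free attention projections per layer, a three-matrix SwiGLU feed-forward block, and tied embeddings, while ignoring RMSNorm gains. The real model.py may size layers differently, so gaps against the reported numbers are expected.

```python
def estimate_params(vocab=50_000, d_model=512, n_layers=6, d_ff=2048,
                    ctx_len=512, learnable_pos=True):
    """Rough count of weight-matrix parameters (norm gains ignored)."""
    embed = vocab * d_model        # token embedding, shared with lm_head (weight tying)
    attn = 4 * d_model * d_model   # assumed Q, K, V and output projections, no bias
    ffn = 3 * d_model * d_ff       # assumed SwiGLU layout: gate, up and down matrices
    # Learnable positions add one ctx_len x d_model table; RoPE adds nothing.
    pos = ctx_len * d_model if learnable_pos else 0
    return embed + n_layers * (attn + ffn) + pos


if __name__ == "__main__":
    for learnable in (True, False):
        n = estimate_params(learnable_pos=learnable)
        print("learnable" if learnable else "rope", f"{n / 1e6:.1f}M")
```

Under these assumptions the default settings land near 51M for learnable positions. Passing a smaller vocab shows how much of the total sits in the embedding table.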

Curriculum Stages

Stage  Dataset                     Seq Len  Max Tokens  Val Source
0      TinyStories                 256      200M        TinyStories val
1      SimpleWiki + BabyLM         384      220M        SimpleWiki sample
2      FineWeb-Edu (edu score ≥3)  512      500M        FineWeb sample

All 3 val losses are logged at every eval step regardless of current stage. This lets you observe cross-stage forgetting in the logs.

Early Exit Conditions (any one triggers stage exit)

  1. Plateau: val loss on the current stage hasn't improved by >0.01 over patience evals
  2. Token budget: hard ceiling of max tokens per stage
  3. Loss spike: train loss rises > threshold over last K steps
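
The three conditions above can be sketched as a small tracker. This is an illustration only, not the logic in train.py: StageExit and all of its argument names and default values are hypothetical, and "rises > threshold over last K steps" is read here as the newest train loss minus the oldest in a K-step window.

```python
from collections import deque


class StageExit:
    """Hypothetical tracker for the three stage-exit conditions."""

    def __init__(self, max_tokens, patience=3, min_delta=0.01,
                 spike_threshold=0.5, spike_window=50):
        self.max_tokens = max_tokens
        self.patience = patience
        self.min_delta = min_delta
        self.spike_threshold = spike_threshold
        self.best_val = float("inf")
        self.bad_evals = 0
        # Keep only the last K train losses for the spike check.
        self.recent_train = deque(maxlen=spike_window)

    def on_train_step(self, train_loss):
        self.recent_train.append(train_loss)

    def on_eval(self, val_loss):
        # An eval counts as progress only if it beats the best by > min_delta.
        if self.best_val - val_loss > self.min_delta:
            self.best_val = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1

    def reason(self, tokens_seen):
        """Return the name of the first triggered condition, or None."""
        if self.bad_evals >= self.patience:
            return "plateau"
        if tokens_seen >= self.max_tokens:
            return "token_budget"
        window = self.recent_train
        if (len(window) == window.maxlen
                and window[-1] - window[0] > self.spike_threshold):
            return "loss_spike"
        return None
```

A training loop would call on_train_step every step, on_eval at each eval interval, and stop the stage as soon as reason(tokens_seen) returns something other than None.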

Step-by-Step: How to Run

Step 1: Install dependencies

pip install torch tokenizers datasets pyyaml

Or run Cell 0 in notebooks/run.ipynb.


Step 2: Train tokenizers

Train both tokenizers on a sample of all stage data combined.

# Tokenizer A: corpus-derived vocab (~32-40k)
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which corpus

# Tokenizer B: fixed 50k vocab
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which fixed

# Or both at once:
python tokenizer.py --output_dir tokenizers/ --sample_size 2_000_000 --which both

Output files:

  • tokenizers/tokenizer_corpus.json
  • tokenizers/tokenizer_50k.json

Step 3: Choose your config (one decision each)

Tokenizer:

  • tokenizers/tokenizer_50k.json: 50k vocab, larger embedding table
  • tokenizers/tokenizer_corpus.json: smaller natural vocab, leaner model

Positional encoding:

  • --pos_type learnable: standard learned position embeddings
  • --pos_type rope: RoPE (no learned params, better length extrapolation)
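
To make the "no learned params" point concrete, here is a minimal pure-Python sketch of rotary position encoding. It is not the code in model.py: rope and dot are hypothetical helpers, the interleaved pairing of dimensions and base=10000 are assumptions, and a real implementation would operate on batched tensors.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that depend on `pos`."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own fixed frequency; nothing here is trained.
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.0, 0.2, -0.4, 0.9]
    # Same offset (2) at different absolute positions gives the same score.
    print(dot(rope(q, 5), rope(k, 3)))
    print(dot(rope(q, 12), rope(k, 10)))
```

Because the query-key score depends only on the distance between the two positions, the encoding needs no position table, unlike the learnable option.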

Step 4: Train Stage 0 (TinyStories)

python train.py \
    --stage 0 \
    --config configs/stage0.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/

Resume if interrupted:

python train.py --stage 0 --config configs/stage0.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ --log_dir logs/ --cache_dir cache/ \
    --resume

Output: checkpoints/stage0_best.pt


Step 5: Train Stage 1 (SimpleWiki + BabyLM)

Loads Stage 0 weights as starting point.

python train.py \
    --stage 1 \
    --config configs/stage1.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/ \
    --prev_checkpoint checkpoints/stage0_best.pt

Output: checkpoints/stage1_best.pt


Step 6: Train Stage 2 (FineWeb-Edu)

Loads Stage 1 weights as starting point.

python train.py \
    --stage 2 \
    --config configs/stage2.yaml \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/ \
    --prev_checkpoint checkpoints/stage1_best.pt

Output: checkpoints/stage2_best.pt


(Alternative) Run full curriculum automatically

python curriculum.py \
    --tokenizer tokenizers/tokenizer_50k.json \
    --pos_type learnable \
    --checkpoint_dir checkpoints/ \
    --log_dir logs/ \
    --cache_dir cache/

Start from a specific stage:

python curriculum.py --start_stage 1 --tokenizer tokenizers/tokenizer_50k.json ...

Run specific stages only:

python curriculum.py --stages 1 2 --tokenizer tokenizers/tokenizer_50k.json ...

Step 7: Generate text

# In notebook: Cell 10
# Or directly:
python -c "
import torch
from tokenizers import Tokenizer
from model import SLM

# Use the GPU when available, otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tok  = Tokenizer.from_file('tokenizers/tokenizer_50k.json')
ckpt = torch.load('checkpoints/stage2_best.pt', map_location=device)
m    = SLM(ckpt['config']).to(device)
m.load_state_dict(ckpt['model_state'])
m.eval()

# Encode the prompt, sample up to 100 new tokens, and decode the result.
ids = tok.encode('Once upon a time').ids
with torch.no_grad():
    out = m.generate(torch.tensor([ids], device=device), max_new=100, temperature=0.8, top_k=50)
print(tok.decode(out[0].tolist()))
"

Step 8: Benchmark KV cache vs no cache

Run Cell 11 in notebooks/run.ipynb after training.


Log Files

Each stage produces a CSV in logs/stage{N}_{timestamp}.csv:

step | stage | tokens_seen | train_loss | val_s0 | val_s1 | val_s2 | lr | note
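
Since every row carries all three val losses, cross-stage forgetting can be read straight from the logs. The sketch below is one way to do that with the standard library. It assumes the files are comma-separated with the column names listed above as the header row; load_rows and forgetting are hypothetical helpers, not part of logger.py.

```python
import csv
import glob


def load_rows(paths):
    """Read several stage logs, in the order given, into one list of dicts."""
    rows = []
    for path in paths:
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


def forgetting(rows, cols=("val_s0", "val_s1", "val_s2")):
    """For each val column, return last logged loss minus best logged loss."""
    out = {}
    for col in cols:
        # Skip rows where the column is missing or empty.
        vals = [float(r[col]) for r in rows if r.get(col) not in (None, "")]
        if vals:
            out[col] = vals[-1] - min(vals)
    return out


if __name__ == "__main__":
    rows = load_rows(sorted(glob.glob("logs/stage*_*.csv")))
    for col, gap in forgetting(rows).items():
        print(f"{col}: {gap:+.4f} above its best value")
```

A gap near zero means the model still performs at its best on that stage's val set. A large positive gap on val_s0 after Stage 2 would suggest the early material has been partly forgotten.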

Plot all curves with Cell 9 in the notebook:

# Or from command line:
python -c "
import glob, pandas as pd, matplotlib.pyplot as plt
# ... see notebook cell 9
"

Changing Configs

Edit configs/stage{N}.yaml to adjust:

  • batch_size: reduce if OOM
  • max_tokens: reduce for faster experiments
  • patience: how many evals before plateau exit
  • learning_rate: per-stage LR
  • eval_interval: steps between val evaluations
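
When changing batch_size or max_tokens it helps to know how many optimizer steps a stage will run. The arithmetic below is a rough sketch: steps_for_budget is a hypothetical helper, it assumes each step consumes exactly batch_size x seq_len tokens (no gradient accumulation or padding), and it uses batch=32 from the hardware table together with the per-stage sequence lengths and token budgets listed earlier.

```python
def steps_for_budget(max_tokens, batch_size, seq_len):
    """Whole optimizer steps that fit inside the token budget."""
    tokens_per_step = batch_size * seq_len
    return max_tokens // tokens_per_step


# (max_tokens, seq_len) per stage, taken from the curriculum table.
STAGES = {0: (200_000_000, 256), 1: (220_000_000, 384), 2: (500_000_000, 512)}

if __name__ == "__main__":
    for stage, (max_tokens, seq_len) in STAGES.items():
        steps = steps_for_budget(max_tokens, batch_size=32, seq_len=seq_len)
        print(f"stage {stage}: {steps} steps")
```

Halving batch_size doubles the step count for the same budget, which also changes how often eval_interval fires in terms of tokens seen.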

Ablation: Curriculum vs Baseline

To run the baseline (random data order, same compute), train with all data mixed without staged configs. The easiest way is to:

  1. Concatenate all datasets in dataset.py's _train_iter for a single stage
  2. Use configs/stage2.yaml settings (full ctx_len, total token budget = sum of all stages)
  3. Compare val losses at matched tokens seen

The per-step CSV logs make this comparison straightforward in the notebook.
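
A matched-token comparison needs both runs evaluated at the same tokens_seen, which the two logs will rarely share exactly. One option is linear interpolation, sketched here. loss_at and compare are hypothetical helpers, and the sketch assumes each curve is a list of (tokens_seen, val_loss) pairs sorted by tokens, with curriculum token counts already made cumulative across stages.

```python
from bisect import bisect_left


def loss_at(tokens, curve):
    """Linearly interpolate val loss at `tokens`; None if outside the curve."""
    xs = [t for t, _ in curve]
    if not xs or not xs[0] <= tokens <= xs[-1]:
        return None
    i = bisect_left(xs, tokens)
    if xs[i] == tokens:
        return curve[i][1]
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (tokens - x0) / (x1 - x0)


def compare(curriculum, baseline, checkpoints):
    """Return (tokens, curriculum_loss, baseline_loss, difference) rows."""
    rows = []
    for t in checkpoints:
        a, b = loss_at(t, curriculum), loss_at(t, baseline)
        # Only report points that both runs actually cover.
        if a is not None and b is not None:
            rows.append((t, a, b, a - b))
    return rows
```

A negative difference at a given token count means the curriculum run has the lower val loss at that point.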


Hardware Requirements

Component  Spec
GPU        NVIDIA RTX 4060 Ti 16GB
VRAM used  ~2-3GB (batch=32)
Precision  bf16 (auto-detected)
Storage    ~10GB for cached data