SLLM: Small Language Model from Scratch
A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), designed to fit on consumer GPUs with as little as 4 GB of VRAM (e.g. RTX 3050).
Features
- Architecture: Decoder-only transformer (GPT-style) with modern improvements
  - RMSNorm instead of LayerNorm (faster, no bias)
  - RoPE (Rotary Position Embeddings), as used in LLaMA, Mistral, Gemma
  - SwiGLU feed-forward network, which outperforms GELU at the same parameter count
  - Flash Attention via `F.scaled_dot_product_attention` (avoids O(T²) attention memory)
  - Weight-tied token embeddings + LM head (saves ~32M parameters)
- Training
  - bf16 mixed-precision with gradient accumulation
  - Gradient checkpointing for low-VRAM GPUs
  - Cosine LR schedule with linear warmup
  - Resumable checkpointing (`--resume`, `--extra_steps`)
  - JSONL metric logging + live training dashboard
- Custom BPE Tokenizer: trained on FineWeb-Edu with byte fallback (zero OOV)
- Supervised Fine-Tuning (SFT): chat model pipeline included in `finetune/`
Project Structure
sllm/
├── model/                      # Model architecture
│   ├── config.py               # ModelConfig dataclass (SLLM_100M, SLLM_150M presets)
│   ├── model.py                # SLLM: full model assembly, weight init, gradient checkpointing
│   ├── block.py                # TransformerBlock (pre-norm, residual)
│   ├── attention.py            # Causal multi-head self-attention + RoPE
│   ├── mlp.py                  # SwiGLU feed-forward network
│   ├── norm.py                 # RMSNorm
│   └── rope.py                 # Rotary Position Embeddings
│
├── tokenizer/                  # Custom BPE tokenizer
│   ├── normalizer.py           # HTML stripping, unicode NFC, whitespace cleanup
│   ├── pretokenizer.py         # Regex pre-tokenizer (code-aware, contraction-aware)
│   ├── bpe.py                  # BPE model config with byte fallback (32k vocab)
│   ├── traintokenizer.py       # Train on FineWeb-Edu stream
│   ├── post_processor.py       # Append <|endoftext|> to every sequence
│   ├── wrap_tokenizer.py       # Wrap into PreTrainedTokenizerFast
│   └── tokenize_dataset.py     # Pack tokens into flat binary .bin shards
│
├── data/
│   └── dataloader.py           # Memory-mapped shard dataloader
│
├── finetune/                   # Supervised fine-tuning (SFT) pipeline
│   ├── prepare_data.py         # Prepare chat data
│   ├── sft_train.py            # SFT training loop
│   ├── sft_dataset.py          # Chat dataset
│   └── chat.py                 # Interactive chat with the fine-tuned model
│
├── train.py                    # Pre-training loop
├── plot_training.py            # Training dashboard (static + live mode)
├── requirements.txt
├── model_explained.md          # Deep-dive into every model component
└── tokenizer_walkthrough.md    # Tokenizer design and pipeline walkthrough
Model Configs
| Config | d_model | Heads | Layers | Parameters |
|---|---|---|---|---|
| `SLLM_100M` | 768 | 12 | 12 | ~109.5M |
| `SLLM_150M` | 1024 | 16 | 9 | ~148.4M |
Both configs use:
- Context length: 1024 tokens
- Vocab size: 32,000 (custom BPE)
- SwiGLU d_ff: computed as `round_up_256(⌊2/3 × 4 × d_model⌋)`
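The d_ff rule and the parameter totals in the table can be checked with a few lines of arithmetic. The sketch below is not taken from `model/config.py`; it assumes bias-free linear layers, two RMSNorm gains per block plus one final norm, parameter-free RoPE, and a token embedding shared with the LM head. Under those assumptions it reproduces the table's figures, which suggests (but does not prove) that the real layout is similar. The function names are hypothetical.

```python
def swiglu_d_ff(d_model, multiple=256):
    """round_up_256(floor(2/3 * 4 * d_model)), using integer arithmetic."""
    hidden = (8 * d_model) // 3                      # floor(2/3 * 4 * d_model)
    return ((hidden + multiple - 1) // multiple) * multiple


def count_params(d_model, n_layers, vocab_size=32_000):
    """Parameter count under the assumptions stated in the text above."""
    d_ff = swiglu_d_ff(d_model)
    embed = vocab_size * d_model      # shared with the LM head (weight tying)
    attn = 4 * d_model * d_model      # Q, K, V and output projections, no bias
    mlp = 3 * d_model * d_ff          # SwiGLU gate, up and down projections
    norms = 2 * d_model               # two RMSNorm gain vectors per block
    final_norm = d_model
    return embed + n_layers * (attn + mlp + norms) + final_norm


print(swiglu_d_ff(768), swiglu_d_ff(1024))   # 2048 2816
print(count_params(768, 12))                 # 109529856  (~109.5M)
print(count_params(1024, 9))                 # 148392960  (~148.4M)
```

The `embed` term also shows where the weight-tying saving comes from: at d_model = 1024 an untied LM head would add another 32,000 × 1024 ≈ 32.8M parameters.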
Installation
Requires: Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended)
# Create and activate a conda environment
conda create -n pytorch python=3.11
conda activate pytorch
# Install dependencies
pip install -r requirements.txt
Training
Start a new run (RTX 3050 4GB recommended settings)
python train.py \
--config 150M \
--data_dir tokenizer/data \
--batch_size 2 \
--grad_accum 16 \
--grad_checkpoint \
--dtype bf16 \
--max_steps 5000 \
--run_dir runs/sllm_150m \
--log_every 10 \
--save_every 500 \
--val_every 500 \
--warmup_steps 200
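For orientation, a cosine schedule with linear warmup can be sketched as follows. This is a minimal illustration, not the code in `train.py`: the function name `lr_at` and the peak and floor learning rates (`3e-4`, `3e-5`) are assumptions, while `warmup_steps=200` and `max_steps=5000` mirror the command above.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, max_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; step 0 already gets a small non-zero rate.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical peak/floor values; the step counts match the command above.
for step in (0, 199, 2600, 5000):
    print(step, lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, max_steps=5000))
```

Halfway through the decay phase (step 2600 here) the rate sits at the midpoint between the peak and the floor.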
Resume from a checkpoint
python train.py \
--resume \
--run_dir runs/sllm_150m \
--extra_steps 5000 \
--data_dir tokenizer/data \
--batch_size 2 \
--grad_accum 16 \
--grad_checkpoint \
--dtype bf16
Key training flags
| Flag | Default | Description |
|---|---|---|
| `--config` | `100M` | Model size (100M or 150M) |
| `--batch_size` | `4` | Per-device micro-batch size |
| `--grad_accum` | `8` | Gradient accumulation steps |
| `--max_steps` | unlimited | Absolute step target |
| `--extra_steps` | - | Run N more steps from current checkpoint |
| `--resume` | - | Resume from latest checkpoint in `--run_dir` |
| `--grad_checkpoint` | - | Enable gradient checkpointing (saves VRAM) |
| `--dtype` | `bf16` | Mixed precision dtype (fp32, fp16, bf16) |
| `--synthetic` | - | Use random data (for testing without real shards) |
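The micro-batch size and the accumulation count together set how many tokens feed each optimizer step. The arithmetic below assumes a single GPU and full 1024-token sequences; `tokens_per_step` is a hypothetical helper, not a function from `train.py`.

```python
CONTEXT_LEN = 1024  # context length shared by both configs


def tokens_per_step(batch_size, grad_accum, context_len=CONTEXT_LEN):
    """Tokens consumed by one optimizer step on a single device."""
    return batch_size * grad_accum * context_len


print(tokens_per_step(4, 8))            # 32768 (flag defaults)
print(tokens_per_step(2, 16))           # 32768 (RTX 3050 settings)
print(5000 * tokens_per_step(2, 16))    # 163840000 tokens over 5000 steps
```

Halving `--batch_size` while doubling `--grad_accum` keeps the effective batch unchanged, which is why the low-VRAM settings train on the same number of tokens per step as the defaults.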
Training Dashboard
Visualize training metrics in a dark-mode 6-panel dashboard:
# Static plot
python plot_training.py --run_dir runs/sllm_150m
# Live mode: refresh every 30 seconds while training
python plot_training.py --run_dir runs/sllm_150m --live --interval 30
# Compare two runs
python plot_training.py --run_dir runs/run_a runs/run_b
# Save to file
python plot_training.py --run_dir runs/sllm_150m --save dashboard.png
Dashboard panels: Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm
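The "raw + EMA" loss panel pairs each logged value with an exponential moving average. A minimal sketch of that smoothing is shown below. The JSON field name `loss` and the smoothing factor `alpha` are assumptions, since the log schema and the settings in `plot_training.py` are not documented here.

```python
import json


def load_values(jsonl_lines, key="loss"):
    """Pull one numeric field out of JSONL log lines (key name is hypothetical)."""
    return [json.loads(line)[key] for line in jsonl_lines if line.strip()]


def ema(values, alpha=0.1):
    """Exponential moving average, seeded with the first value."""
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed


# Toy log lines standing in for <run_dir>/train_log.jsonl
lines = ['{"step": 10, "loss": 4.0}', '{"step": 20, "loss": 2.0}']
print(ema(load_values(lines), alpha=0.5))   # [4.0, 3.0]
```

A smaller `alpha` gives a smoother curve that reacts more slowly to changes in the raw loss.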
Fine-Tuning (Chat Model)
After pre-training, you can fine-tune with supervised instruction data:
# 1. Prepare chat data
python finetune/prepare_data.py
# 2. Fine-tune
python finetune/sft_train.py \
--base_ckpt runs/sllm_150m/ckpt_0011500.pt \
--run_dir runs/sllm_150m_chat \
--max_steps 2500 \
--batch_size 4 \
--grad_accum 8 \
--grad_checkpoint
# 3. Chat interactively
python finetune/chat.py --run_dir runs/sllm_150m_chat
Tokenizer
A custom BPE tokenizer trained on the educational subset of FineWeb-Edu:
- 32,000 token vocabulary
- Byte fallback: zero out-of-vocabulary tokens (even math symbols and emojis work)
- Code-aware: preserves `snake_case`, operators (`==`, `->`, `**`), and indentation
- Contraction-aware: `don't`, `I've`, `they're` are split correctly
- Packaged as a `PreTrainedTokenizerFast` (HuggingFace-compatible)
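Byte fallback is what guarantees zero OOV: anything the vocabulary cannot represent is emitted as its raw UTF-8 bytes, and all 256 byte values have a token. The toy encoder below illustrates the idea at character level only. It does not perform BPE merges, and the `<0xNN>` token naming is an assumption about this tokenizer rather than something read from `tokenizer/bpe.py`.

```python
def encode_with_byte_fallback(text, vocab):
    """Character-level toy: known characters pass through, others become byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One token per UTF-8 byte, so no input is ever unrepresentable.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


print(encode_with_byte_fallback("a∑", vocab={"a"}))
# ['a', '<0xE2>', '<0x88>', '<0x91>']
```

The summation sign is not in the toy vocabulary, so it comes out as its three UTF-8 bytes instead of an unknown-token placeholder.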
Training data is packed into flat binary .bin shards (np.uint16, 100M tokens each) for fast memory-mapped loading.
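A 32,000-entry vocabulary fits in an unsigned 16-bit integer (maximum 65,535), so a 100M-token shard occupies about 200 MB on disk. The sketch below shows one way such a shard could be read with `np.memmap`. It is not the code in `data/dataloader.py`; the helper names, the random-offset sampling and the toy shard are all assumptions made for illustration.

```python
import numpy as np


def write_toy_shard(path, n_tokens):
    """Write a fake shard: token ids 0..n_tokens-1 as raw native-endian uint16."""
    np.arange(n_tokens, dtype=np.uint16).tofile(path)


def sample_batch(path, batch_size, context_len, rng):
    """Draw random windows from a shard; targets are inputs shifted by one token."""
    data = np.memmap(path, dtype=np.uint16, mode="r")   # no full read into RAM
    starts = rng.integers(0, len(data) - context_len, size=batch_size)
    x = np.stack([data[s : s + context_len] for s in starts]).astype(np.int64)
    y = np.stack([data[s + 1 : s + 1 + context_len] for s in starts]).astype(np.int64)
    return x, y


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        shard = os.path.join(tmp, "toy_shard.bin")       # hypothetical file name
        write_toy_shard(shard, 1000)
        x, y = sample_batch(shard, batch_size=4, context_len=8,
                            rng=np.random.default_rng(0))
        print(x.shape, y.shape)
```

Because the file is memory-mapped, only the pages touched by the sampled windows are loaded, which keeps host memory use low regardless of shard size.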
See tokenizer_walkthrough.md for a full pipeline deep-dive.
Architecture Deep-Dive
See model_explained.md for a plain-language walkthrough of every model component, including:
- Why RMSNorm is faster than LayerNorm
- How RoPE encodes relative position without extra parameters
- Why SwiGLU outperforms GELU
- How weight tying saves 32M parameters
- Flash Attention and gradient checkpointing explained
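As a taste of the RoPE point above, the following pure-Python sketch rotates each pair of vector components by an angle proportional to the token position, then checks that the query-key dot product depends only on the distance between positions. It is an illustration, not the code in `model/rope.py`: the interleaved pairing of dimensions and the base of 10000 are assumptions, and the function names are hypothetical.

```python
import cmath


def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)        # lower frequency for later pairs
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same offset (2 positions apart), different absolute positions.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
print(abs(near - far) < 1e-9)   # True
```

The rotation angles are computed from the position alone, so nothing is learned or stored: this is why RoPE adds no parameters.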
Checkpoints & Logging
- Checkpoints are saved to `<run_dir>/ckpt_NNNNNNN.pt` every `--save_every` steps and on clean exit (Ctrl+C)
- Metrics are appended to `<run_dir>/train_log.jsonl` (one JSON line per log step)
- Each checkpoint stores: model weights, optimizer state, step number, loss, and config name
- Resuming auto-detects the correct model config from the checkpoint
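Given the `ckpt_NNNNNNN.pt` naming scheme, locating the most recent checkpoint in a run directory comes down to parsing the step number out of each file name. The helper below is a hypothetical sketch of that lookup, not the resume logic from `train.py`.

```python
import re
from pathlib import Path

CKPT_RE = re.compile(r"ckpt_(\d+)\.pt")


def latest_checkpoint(run_dir):
    """Return the checkpoint path with the highest step number, or None."""
    best_step, best_path = -1, None
    for path in Path(run_dir).glob("ckpt_*.pt"):
        match = CKPT_RE.fullmatch(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    print(latest_checkpoint("runs/sllm_150m"))   # None if the directory has no checkpoints
```

Comparing the parsed integers rather than the file names keeps the lookup correct even if the zero-padding width ever changes.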
Requirements
torch>=2.3.0
datasets>=2.14.0 # HuggingFace datasets (streaming)
tokenizers>=0.15.0 # Fast BPE tokenizer
transformers>=4.40.0 # PreTrainedTokenizerFast
numpy>=1.26.0
tqdm
matplotlib
License
This project is released for educational purposes.