SLLM — Small Language Model from Scratch

A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), designed to fit on consumer GPUs with as little as 4 GB of VRAM (e.g. RTX 3050).

✨ Features

Architecture: Decoder-only transformer (GPT-style) with modern improvements
- RMSNorm instead of LayerNorm (faster, no bias)
- RoPE (Rotary Position Embeddings) — used in LLaMA, Mistral, Gemma
- SwiGLU feed-forward network — outperforms GELU at the same parameter count
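- Flash Attention via F.scaled_dot_product_attention (the fused kernel, when available, avoids materializing the T×T attention matrix and its O(T²) memory)
- Weight-tied token embeddings + LM head (saves vocab_size × d_model parameters: ~24.6M for SLLM_100M, ~32.8M for SLLM_150M)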
Training
- bf16 mixed-precision with gradient accumulation
- Gradient checkpointing for low-VRAM GPUs
- Cosine LR schedule with linear warmup
- Resumable checkpointing (--resume, --extra_steps)
- JSONL metric logging + live training dashboard
Custom BPE Tokenizer — trained on FineWeb-Edu with byte fallback (zero OOV)
Supervised Fine-Tuning (SFT) — chat model pipeline included in finetune/
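
The cosine schedule with linear warmup listed under Training can be sketched in a few lines. This is an illustration, not the code in train.py: the helper name lr_at and the max_lr / min_lr values are assumptions, while warmup_steps=200 and max_steps=5000 are taken from the example training command.

```python
import math

def lr_at(step, max_lr, min_lr, warmup_steps, max_steps):
    """Learning rate for a given optimizer step (hypothetical helper)."""
    if step < warmup_steps:
        # Linear warmup: ramps up and reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# 3e-4 and 3e-5 are illustrative values, not the repo's defaults.
for s in (0, 199, 200, 2600, 5000):
    print(s, lr_at(s, 3e-4, 3e-5, warmup_steps=200, max_steps=5000))
```

The rate climbs during the first 200 steps, peaks at max_lr, passes the midpoint between max_lr and min_lr halfway through the decay, and ends at min_lr.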

🏗️ Project Structure

sllm/
├── model/                   # Model architecture
│   ├── config.py            # ModelConfig dataclass (SLLM_100M, SLLM_150M presets)
│   ├── model.py             # SLLM — full model assembly, weight init, gradient checkpointing
│   ├── block.py             # TransformerBlock (pre-norm, residual)
│   ├── attention.py         # Causal multi-head self-attention + RoPE
│   ├── mlp.py               # SwiGLU feed-forward network
│   ├── norm.py              # RMSNorm
│   └── rope.py              # Rotary Position Embeddings
│
├── tokenizer/               # Custom BPE tokenizer
│   ├── normalizer.py        # HTML stripping, unicode NFC, whitespace cleanup
│   ├── pretokenizer.py      # Regex pre-tokenizer (code-aware, contraction-aware)
│   ├── bpe.py               # BPE model config with byte fallback (32k vocab)
│   ├── traintokenizer.py    # Train on FineWeb-Edu stream
│   ├── post_processor.py    # Append <|endoftext|> to every sequence
│   ├── wrap_tokenizer.py    # Wrap into PreTrainedTokenizerFast
│   └── tokenize_dataset.py  # Pack tokens into flat binary .bin shards
│
├── data/
│   └── dataloader.py        # Memory-mapped shard dataloader
│
├── finetune/                # Supervised fine-tuning (SFT) pipeline
│   ├── prepare_data.py      # Prepare chat data
│   ├── sft_train.py         # SFT training loop
│   ├── sft_dataset.py       # Chat dataset
│   └── chat.py              # Interactive chat with the fine-tuned model
│
├── train.py                 # Pre-training loop
├── plot_training.py         # Training dashboard (static + live mode)
├── requirements.txt
├── model_explained.md       # Deep-dive into every model component
└── tokenizer_walkthrough.md # Tokenizer design and pipeline walkthrough

📐 Model Configs

Config	d_model	Heads	Layers	Parameters
`SLLM_100M`	768	12	12	~109.5M
`SLLM_150M`	1024	16	9	~148.4M

Both configs use:

Context length: 1024 tokens
Vocab size: 32,000 (custom BPE)
SwiGLU d_ff: computed as round_up_256(⌊2/3 × 4 × d_model⌋), which gives 2048 for d_model 768 and 2816 for d_model 1024
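
As a sanity check, the parameter counts in the table can be reproduced from the settings above. The sketch below is not the repo's code; it assumes no bias terms anywhere, four square attention projections (Q, K, V, output), three SwiGLU matrices, two RMSNorm weight vectors per block plus a final RMSNorm, and a single embedding matrix shared with the LM head. The function names are made up for illustration.

```python
def round_up(n, multiple):
    """Round n up to the nearest multiple."""
    return ((n + multiple - 1) // multiple) * multiple

def swiglu_d_ff(d_model):
    # round_up_256(floor(2/3 * 4 * d_model)), done in integer arithmetic.
    return round_up((8 * d_model) // 3, 256)

def count_params(d_model, n_layers, vocab_size=32_000):
    d_ff = swiglu_d_ff(d_model)
    attn = 4 * d_model * d_model   # Q, K, V and output projections (no biases assumed)
    mlp = 3 * d_model * d_ff       # gate, up and down projections of SwiGLU
    norms = 2 * d_model            # two RMSNorm weight vectors per block
    embed = vocab_size * d_model   # counted once because of weight tying
    return n_layers * (attn + mlp + norms) + embed + d_model  # + final RMSNorm

for name, d_model, n_layers in [("SLLM_100M", 768, 12), ("SLLM_150M", 1024, 9)]:
    print(name, swiglu_d_ff(d_model), f"{count_params(d_model, n_layers) / 1e6:.1f}M")
# SLLM_100M 2048 109.5M
# SLLM_150M 2816 148.4M
```

Under these assumptions the totals round to the figures in the table, and the embed term shows what weight tying saves: 32,000 × d_model parameters per config.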

⚙️ Installation

Requires: Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended)

# Create and activate a conda environment
conda create -n pytorch python=3.11
conda activate pytorch

# Install dependencies
pip install -r requirements.txt

🚀 Training

Start a new run (recommended settings for a 4 GB RTX 3050)

python train.py \
  --config 150M \
  --data_dir tokenizer/data \
  --batch_size 2 \
  --grad_accum 16 \
  --grad_checkpoint \
  --dtype bf16 \
  --max_steps 5000 \
  --run_dir runs/sllm_150m \
  --log_every 10 \
  --save_every 500 \
  --val_every 500 \
  --warmup_steps 200
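
To see what these flags add up to, here is the token arithmetic for one optimizer step. It assumes a single GPU and that every sequence in a micro-batch is a full 1024-token context; tokens_per_step is a hypothetical helper, not part of train.py.

```python
def tokens_per_step(batch_size, grad_accum, context_len=1024):
    """Tokens consumed by one optimizer step (micro-batches x accumulation x context)."""
    return batch_size * grad_accum * context_len

per_step = tokens_per_step(batch_size=2, grad_accum=16)
total = per_step * 5000  # --max_steps 5000
print(per_step, total)   # 32768 163840000

# The default flags (batch_size 4, grad_accum 8) give the same effective batch.
print(tokens_per_step(batch_size=4, grad_accum=8))  # 32768
```

Halving the micro-batch and doubling the accumulation steps keeps the effective batch unchanged while lowering peak activation memory, which is the point of the 4 GB settings.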

Resume from a checkpoint

python train.py \
  --resume \
  --run_dir runs/sllm_150m \
  --extra_steps 5000 \
  --data_dir tokenizer/data \
  --batch_size 2 \
  --grad_accum 16 \
  --grad_checkpoint \
  --dtype bf16

Key training flags

Flag	Default	Description
`--config`	`100M`	Model size (`100M` or `150M`)
`--batch_size`	`4`	Per-device micro-batch size
`--grad_accum`	`8`	Gradient accumulation steps
`--max_steps`	unlimited	Absolute step target
`--extra_steps`	—	Run N more steps from current checkpoint
`--resume`	—	Resume from latest checkpoint in `--run_dir`
`--grad_checkpoint`	—	Enable gradient checkpointing (saves VRAM)
`--dtype`	`bf16`	Mixed precision dtype (`fp32`, `fp16`, `bf16`)
`--synthetic`	—	Use random data (for testing without real shards)

📊 Training Dashboard

Visualize training metrics in a dark-mode 6-panel dashboard:

# Static plot
python plot_training.py --run_dir runs/sllm_150m

# Live mode — refresh every 30 seconds while training
python plot_training.py --run_dir runs/sllm_150m --live --interval 30

# Compare two runs
python plot_training.py --run_dir runs/run_a runs/run_b

# Save to file
python plot_training.py --run_dir runs/sllm_150m --save dashboard.png

Dashboard panels: Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm
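
The EMA curve in the Training Loss panel is an exponential moving average of the raw loss. A minimal sketch follows; the smoothing factor alpha and the function name are assumptions, and plot_training.py may use a different value or initialization.

```python
def ema(values, alpha=0.1):
    """Exponential moving average; the first value seeds the average."""
    smoothed = []
    avg = None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

raw = [4.0, 3.0, 5.0, 3.0]
print(ema(raw, alpha=0.5))  # [4.0, 3.5, 4.25, 3.625]
```

A smaller alpha gives a smoother curve that reacts more slowly to changes in the raw loss.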

💬 Fine-Tuning (Chat Model)

After pre-training, you can fine-tune with supervised instruction data:

# 1. Prepare chat data
python finetune/prepare_data.py

# 2. Fine-tune
python finetune/sft_train.py \
  --base_ckpt runs/sllm_150m/ckpt_0011500.pt \
  --run_dir runs/sllm_150m_chat \
  --max_steps 2500 \
  --batch_size 4 \
  --grad_accum 8 \
  --grad_checkpoint

# 3. Chat interactively
python finetune/chat.py --run_dir runs/sllm_150m_chat

🔡 Tokenizer

A custom BPE tokenizer trained on FineWeb-Edu, the educational subset of FineWeb:

32,000 token vocabulary
Byte fallback: zero out-of-vocabulary tokens (even math symbols and emojis can be encoded)
Code-aware: preserves snake_case, operators (==, ->, **), and indentation
Contraction-aware: contractions such as don't, I've, and they're are split correctly
Packaged as a PreTrainedTokenizerFast (HuggingFace-compatible)
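
Byte fallback is easiest to see on a toy example. The sketch below is a character-level simplification, not the repo's tokenizer: a real BPE model applies merges first, and the tiny vocabulary here is made up. It follows the common convention of naming byte tokens <0xNN>.

```python
def encode_with_byte_fallback(text, vocab):
    """Toy encoder: characters missing from vocab become UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One token per UTF-8 byte, so nothing is ever out-of-vocabulary.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

toy_vocab = {"a", "b", " "}  # hypothetical vocabulary
# "\u2211" is the summation sign, which is not in the toy vocabulary.
print(encode_with_byte_fallback("a \u2211", toy_vocab))
# ['a', ' ', '<0xE2>', '<0x88>', '<0x91>']
```

Because all 256 byte tokens are part of the vocabulary, any Unicode string has a valid encoding, at the cost of several tokens for rare characters.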

Training data is packed into flat binary .bin shards (np.uint16, 100M tokens each, roughly 200 MB per shard) for fast memory-mapped loading. np.uint16 is sufficient because every token ID is below 32,000.
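
The shard layout is simple enough to demonstrate end to end. The repo works with np.uint16 arrays; the sketch below uses only the standard library (array and mmap) so that it runs without dependencies, and the file name and token values are invented for the demo.

```python
import mmap
import os
import tempfile
from array import array

# "H" is an unsigned 16-bit integer, the same width as np.uint16.
tokens = array("H", [17, 31999, 0, 2048])

path = os.path.join(tempfile.mkdtemp(), "shard_demo.bin")
with open(path, "wb") as f:
    tokens.tofile(f)  # flat binary, no header

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm).cast("H")  # reinterpret the bytes as uint16 values
    window = list(view[1:3])         # only the touched pages are read from disk
    n_tokens = len(view)
    view.release()
    mm.close()

print(n_tokens, window)  # 4 [31999, 0]
```

With NumPy, the read side is usually a single np.memmap(path, dtype=np.uint16, mode="r") call, and slicing it returns a training window without loading the whole shard.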

See tokenizer_walkthrough.md for a full pipeline deep-dive.

🧠 Architecture Deep-Dive

See model_explained.md for a plain-language walkthrough of every model component, including:

Why RMSNorm is faster than LayerNorm
How RoPE encodes relative position without extra parameters
Why SwiGLU outperforms GELU
How weight tying saves ~24.6M (SLLM_100M) or ~32.8M (SLLM_150M) parameters
Flash Attention and gradient checkpointing explained
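
The RoPE point can be checked numerically: after rotation, the query-key dot product depends only on the distance between the two positions, and the rotation angles are computed from the position rather than learned. This is a plain-Python sketch rather than the code in rope.py; it assumes the interleaved pairing of dimensions and the usual base of 10000, and the vectors are arbitrary.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency for this pair, no learned weights
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
# Different offset (1 position apart).
score_c = dot(rope(q, 5), rope(k, 4))

print(abs(score_a - score_b) < 1e-9)  # True: only the offset matters
print(abs(score_a - score_c) > 1e-6)  # True: a different offset changes the score
```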

📋 Checkpoints & Logging

Checkpoints are saved to <run_dir>/ckpt_NNNNNNN.pt every --save_every steps and on clean exit (Ctrl+C)
Metrics are appended to <run_dir>/train_log.jsonl (one JSON line per log step)
Each checkpoint stores: model weights, optimizer state, step number, loss, and config name
Resuming auto-detects the correct model config from the checkpoint
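
The naming scheme makes it easy to inspect a run directory from your own scripts. The sketch below is an assumption about how one might do that, not the logic behind --resume: it picks the highest-numbered ckpt_NNNNNNN.pt file and parses train_log.jsonl line by line. The helper names and the record keys in the demo are hypothetical.

```python
import json
import re
import tempfile
from pathlib import Path

CKPT_RE = re.compile(r"ckpt_(\d{7})\.pt")

def latest_checkpoint(run_dir):
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    found = []
    for path in Path(run_dir).glob("ckpt_*.pt"):
        match = CKPT_RE.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None

def read_log(log_path):
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo on a throwaway directory; the record keys below are made up.
run_dir = Path(tempfile.mkdtemp())
for step in (500, 11500, 1000):
    (run_dir / f"ckpt_{step:07d}.pt").touch()
(run_dir / "train_log.jsonl").write_text(
    '{"step": 10, "loss": 9.1}\n{"step": 20, "loss": 8.4}\n', encoding="utf-8"
)

latest_step, latest_path = latest_checkpoint(run_dir)
print(latest_step, latest_path.name)                # 11500 ckpt_0011500.pt
print(len(read_log(run_dir / "train_log.jsonl")))   # 2
```

Zero-padding the step number to seven digits means a plain alphabetical sort of the file names would also return the checkpoints in step order.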

📦 Requirements

torch>=2.3.0
datasets>=2.14.0       # HuggingFace datasets (streaming)
tokenizers>=0.15.0     # Fast BPE tokenizer
transformers>=4.40.0   # PreTrainedTokenizerFast
numpy>=1.26.0
tqdm
matplotlib

📄 License

This project is released for educational purposes.
