	---
	language:
	- en
	license: mit
	tags:
	- pytorch
	- language-model
	- gpt
	- transformer
	- from-scratch
	- causal-lm
	pipeline_tag: text-generation
	---

	# SLLM — Small Language Model from Scratch

	A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), both designed to train on consumer GPUs with as little as 4 GB of VRAM (e.g. an RTX 3050).

	---

	## ✨ Features

	- Architecture: decoder-only transformer (GPT-style) with modern improvements
	  - RMSNorm instead of LayerNorm (no mean subtraction and no bias, so it is cheaper to compute)
	  - RoPE (Rotary Position Embeddings), as used in LLaMA, Mistral, and Gemma
	  - SwiGLU feed-forward network, which outperforms GELU at the same parameter count
	  - Flash Attention via `F.scaled_dot_product_attention`, which can use a fused kernel that never materializes the full T×T attention matrix (avoiding O(T²) attention memory)
	  - Weight-tied token embeddings and LM head (saves `vocab_size × d_model` parameters: ~24.6M for `SLLM_100M`, ~32.8M for `SLLM_150M`)
	- Training
	  - bf16 mixed precision with gradient accumulation
	  - Gradient checkpointing for low-VRAM GPUs
	  - Cosine LR schedule with linear warmup
	  - Resumable checkpointing (`--resume`, `--extra_steps`)
	  - JSONL metric logging and a live training dashboard
	- Custom BPE tokenizer: trained on FineWeb-Edu with byte fallback (zero OOV)
	- Supervised fine-tuning (SFT): chat model pipeline included in `finetune/`
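
The cosine schedule with linear warmup listed above can be sketched as follows. This is a minimal illustration, not the code in `train.py`: the function name `lr_at` and the `peak_lr` and `min_lr` values are assumptions chosen for the example, and only `warmup_steps=200` is taken from the training command in this README.

```python
import math

def lr_at(step, max_steps, warmup_steps=200, peak_lr=3e-4, min_lr=3e-5):
    """Learning rate at a given optimizer step.

    peak_lr and min_lr are illustrative values, not the project's defaults.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from near zero up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # Past the end of the schedule, hold the floor value.
        return min_lr
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With these assumed values the rate reaches `peak_lr` at the end of warmup, sits halfway between `peak_lr` and `min_lr` at the midpoint of the decay, and ends at `min_lr`.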

	---

	## 🏗️ Project Structure

	```
	sllm/
	├── model/                    # Model architecture
	│   ├── config.py             # ModelConfig dataclass (SLLM_100M, SLLM_150M presets)
	│   ├── model.py              # SLLM: full model assembly, weight init, gradient checkpointing
	│   ├── block.py              # TransformerBlock (pre-norm, residual)
	│   ├── attention.py          # Causal multi-head self-attention + RoPE
	│   ├── mlp.py                # SwiGLU feed-forward network
	│   ├── norm.py               # RMSNorm
	│   └── rope.py               # Rotary Position Embeddings
	│
	├── tokenizer/                # Custom BPE tokenizer
	│   ├── normalizer.py         # HTML stripping, unicode NFC, whitespace cleanup
	│   ├── pretokenizer.py       # Regex pre-tokenizer (code-aware, contraction-aware)
	│   ├── bpe.py                # BPE model config with byte fallback (32k vocab)
	│   ├── traintokenizer.py     # Train on FineWeb-Edu stream
	│   ├── post_processor.py     # Append <|endoftext|> to every sequence
	│   ├── wrap_tokenizer.py     # Wrap into PreTrainedTokenizerFast
	│   └── tokenize_dataset.py   # Pack tokens into flat binary .bin shards
	│
	├── data/
	│   └── dataloader.py         # Memory-mapped shard dataloader
	│
	├── finetune/                 # Supervised fine-tuning (SFT) pipeline
	│   ├── prepare_data.py       # Prepare chat data
	│   ├── sft_train.py          # SFT training loop
	│   ├── sft_dataset.py        # Chat dataset
	│   └── chat.py               # Interactive chat with the fine-tuned model
	│
	├── train.py                  # Pre-training loop
	├── plot_training.py          # Training dashboard (static + live mode)
	├── requirements.txt
	├── model_explained.md        # Deep-dive into every model component
	└── tokenizer_walkthrough.md  # Tokenizer design and pipeline walkthrough
	```

	---

	## 📐 Model Configs

	| Config      | d_model | Heads | Layers | Parameters |
	|-------------|---------|-------|--------|------------|
	| `SLLM_100M` | 768     | 12    | 12     | ~109.5M    |
	| `SLLM_150M` | 1024    | 16    | 9      | ~148.4M    |

	Both configs use:
	- Context length: 1024 tokens
	- Vocab size: 32,000 (custom BPE)
	- SwiGLU d_ff: computed as `round_up_256(⌊2/3 × 4 × d_model⌋)`
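
The parameter counts in the table can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes bias-free linear layers, standard multi-head attention with four `d_model × d_model` projections, two RMSNorm weight vectors per block plus one final norm, no learned position parameters (RoPE has none), and a single embedding matrix shared with the LM head. These assumptions are not stated in this README, but they yield the listed totals. The function names are hypothetical.

```python
def round_up(x, multiple=256):
    """Round x up to the next multiple (the round_up_256 step)."""
    return ((x + multiple - 1) // multiple) * multiple

def swiglu_d_ff(d_model):
    # floor(2/3 * 4 * d_model), then rounded up to a multiple of 256
    return round_up((2 * 4 * d_model) // 3)

def count_params(d_model, n_layers, vocab_size=32_000):
    d_ff = swiglu_d_ff(d_model)
    attn = 4 * d_model * d_model   # Q, K, V and output projections (assumed bias-free)
    mlp = 3 * d_model * d_ff       # SwiGLU gate, up and down projections
    norms = 2 * d_model            # two RMSNorm weight vectors per block
    embed = vocab_size * d_model   # counted once because of weight tying
    final_norm = d_model
    return n_layers * (attn + mlp + norms) + embed + final_norm

for name, d_model, n_layers in [("SLLM_100M", 768, 12), ("SLLM_150M", 1024, 9)]:
    total = count_params(d_model, n_layers)
    print(f"{name}: d_ff={swiglu_d_ff(d_model)}, params={total / 1e6:.1f}M")
```

Under these assumptions `d_ff` comes out as 2048 and 2816, and the totals are 109,529,856 and 148,392,960 parameters. An untied LM head would add another `32,000 × d_model` parameters (about 24.6M and 32.8M respectively).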

	---

	## ⚙️ Installation

	Requires: Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended)

	```bash
	# Create and activate a conda environment
	conda create -n pytorch python=3.11
	conda activate pytorch

	# Install dependencies
	pip install -r requirements.txt
	```

	---

	## 🚀 Training

	### Start a new run (recommended settings for an RTX 3050 with 4 GB VRAM)

	```bash
	python train.py \
	--config 150M \
	--data_dir tokenizer/data \
	--batch_size 2 \
	--grad_accum 16 \
	--grad_checkpoint \
	--dtype bf16 \
	--max_steps 5000 \
	--run_dir runs/sllm_150m \
	--log_every 10 \
	--save_every 500 \
	--val_every 500 \
	--warmup_steps 200
	```
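
To see what these settings mean in terms of data throughput, the arithmetic below works out the effective batch size. It assumes that one step counted by `--max_steps` is one optimizer update after gradient accumulation and that every sequence is packed to the full 1024-token context; neither is confirmed here, so treat the totals as estimates.

```python
batch_size = 2       # --batch_size (micro-batch)
grad_accum = 16      # --grad_accum
context_len = 1024   # context length shared by both configs
max_steps = 5000     # --max_steps

# Sequences and tokens that contribute to a single optimizer update.
sequences_per_step = batch_size * grad_accum
tokens_per_step = sequences_per_step * context_len
total_tokens = tokens_per_step * max_steps

print(f"{sequences_per_step} sequences/step")
print(f"{tokens_per_step:,} tokens/step")
print(f"{total_tokens:,} tokens over the run")
```

That is 32 sequences and 32,768 tokens per update, or roughly 164M tokens over 5000 steps.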

	### Resume from a checkpoint

	```bash
	python train.py \
	--resume \
	--run_dir runs/sllm_150m \
	--extra_steps 5000 \
	--data_dir tokenizer/data \
	--batch_size 2 \
	--grad_accum 16 \
	--grad_checkpoint \
	--dtype bf16
	```

	### Key training flags

	| Flag | Default | Description |
	|------|---------|-------------|
	| `--config` | `100M` | Model size (`100M` or `150M`) |
	| `--batch_size` | `4` | Per-device micro-batch size |
	| `--grad_accum` | `8` | Gradient accumulation steps |
	| `--max_steps` | unlimited | Absolute step target |
	| `--extra_steps` | not set | Run N more steps from current checkpoint |
	| `--resume` | not set | Resume from latest checkpoint in `--run_dir` |
	| `--grad_checkpoint` | not set | Enable gradient checkpointing (saves VRAM) |
	| `--dtype` | `bf16` | Mixed precision dtype (`fp32`, `fp16`, `bf16`) |
	| `--synthetic` | not set | Use random data (for testing without real shards) |

	---

	## 📊 Training Dashboard

	Visualize training metrics in a dark-mode 6-panel dashboard:

	```bash
	# Static plot
	python plot_training.py --run_dir runs/sllm_150m

	# Live mode — refresh every 30 seconds while training
	python plot_training.py --run_dir runs/sllm_150m --live --interval 30

	# Compare two runs
	python plot_training.py --run_dir runs/run_a runs/run_b

	# Save to file
	python plot_training.py --run_dir runs/sllm_150m --save dashboard.png
	```

	Dashboard panels: Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm
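
The EMA curve in the training-loss panel is an exponential moving average of the raw loss. A minimal sketch of that smoothing is shown here; the function name `ema` and the smoothing factor `alpha` are assumptions, since the value used by `plot_training.py` is not stated.

```python
def ema(values, alpha=0.1):
    """Exponential moving average; a smaller alpha gives a smoother curve."""
    smoothed = []
    avg = None
    for v in values:
        # Seed with the first value, then blend each new point into the average.
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed
```

For example, `ema([4.0, 2.0], alpha=0.5)` returns `[4.0, 3.0]`.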

	---

	## 💬 Fine-Tuning (Chat Model)

	After pre-training, you can fine-tune with supervised instruction data:

	```bash
	# 1. Prepare chat data
	python finetune/prepare_data.py

	# 2. Fine-tune
	python finetune/sft_train.py \
	--base_ckpt runs/sllm_150m/ckpt_0011500.pt \
	--run_dir runs/sllm_150m_chat \
	--max_steps 2500 \
	--batch_size 4 \
	--grad_accum 8 \
	--grad_checkpoint

	# 3. Chat interactively
	python finetune/chat.py --run_dir runs/sllm_150m_chat
	```

	---

	## 🔡 Tokenizer

	A custom BPE tokenizer trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), the educational subset of FineWeb:

	- 32,000 token vocabulary
	- Byte fallback — zero out-of-vocabulary tokens (even math symbols and emojis work)
	- Code-aware — preserves `snake_case`, operators (`==`, `->`, `**`), and indentation
	- Contraction-aware — `don't`, `I've`, `they're` are split correctly
	- Packaged as a `PreTrainedTokenizerFast` (HuggingFace-compatible)
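
Byte fallback is what guarantees zero out-of-vocabulary tokens: any character missing from the vocabulary is emitted as its UTF-8 bytes, each of which has a dedicated token. The toy sketch below illustrates the idea with a character-level vocabulary. The real tokenizer applies BPE merges first, and the `<0xNN>` token naming follows the convention used by SentencePiece-style byte fallback, which is assumed rather than confirmed for this project.

```python
def byte_fallback_tokens(text, vocab):
    """Toy encoder: known characters pass through, unknown ones become byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Fall back to one token per UTF-8 byte, e.g. "<0xE2>".
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

# "\u2211" is the summation sign, which is not in this tiny vocabulary.
print(byte_fallback_tokens("a\u2211", vocab={"a"}))
```

The summation sign is three bytes in UTF-8, so the call produces `['a', '<0xE2>', '<0x88>', '<0x91>']`. With 256 byte tokens in the vocabulary, every possible input is representable.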

	Training data is packed into flat binary `.bin` shards (`np.uint16`, 100M tokens each) for fast memory-mapped loading.
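
Because the vocabulary has 32,000 entries, every token ID fits in a `uint16`, so a 100M-token shard takes about 200 MB on disk. Reading such a shard through a memory map could look like the sketch below. The function names are hypothetical and this is not the code in `data/dataloader.py`; it only shows the general pattern with `np.memmap`.

```python
import numpy as np

def load_shard(path):
    # Memory-map the file: pages are read from disk only when sliced.
    return np.memmap(path, dtype=np.uint16, mode="r")

def get_batch(tokens, batch_size, context_len, rng):
    """Sample random windows; targets are the inputs shifted by one token."""
    starts = rng.integers(0, len(tokens) - context_len, size=batch_size)
    x = np.stack([tokens[s : s + context_len] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1 : s + 1 + context_len] for s in starts]).astype(np.int64)
    return x, y
```

A call such as `get_batch(load_shard("shard.bin"), 2, 1024, np.random.default_rng(0))` would return two `(2, 1024)` arrays; the file name here is a placeholder.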

	See [`tokenizer_walkthrough.md`](tokenizer_walkthrough.md) for a full pipeline deep-dive.

	---

	## 🧠 Architecture Deep-Dive

	See [`model_explained.md`](model_explained.md) for a plain-language walkthrough of every model component, including:
	- Why RMSNorm is faster than LayerNorm
	- How RoPE encodes relative position without extra parameters
	- Why SwiGLU outperforms GELU
	- How weight tying saves about 32M parameters in the 150M config (about 25M in the 100M config)
	- Flash Attention and gradient checkpointing explained
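
As a small taste of the RoPE point above, the sketch below rotates query and key vectors by a position-dependent angle and shows that the attention score depends only on the distance between the two positions. It is written in plain Python for clarity and assumes the interleaved pairing of dimensions and the common base of 10000; the project's `rope.py` may use a different layout.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def score(q, k, q_pos, k_pos):
    """Dot product of the position-rotated query and key."""
    return sum(a * b for a, b in zip(rope(q, q_pos), rope(k, k_pos)))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same offset (3) at very different absolute positions gives the same score.
print(score(q, k, 5, 2), score(q, k, 105, 102))
```

The two printed scores agree up to floating-point error, and no learned parameters are involved: the rotation angles are fixed functions of position.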

	---

	## 📋 Checkpoints & Logging

	- Checkpoints are saved to `<run_dir>/ckpt_NNNNNNN.pt` every `--save_every` steps and on clean exit (Ctrl+C)
	- Metrics are appended to `<run_dir>/train_log.jsonl` (one JSON line per log step)
	- Each checkpoint stores: model weights, optimizer state, step number, loss, and config name
	- Resuming auto-detects the correct model config from the checkpoint
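
Given the file layout described above, locating the newest checkpoint and loading the metric log takes only a few lines. The helpers below are a hypothetical sketch built on the `ckpt_NNNNNNN.pt` and `train_log.jsonl` names from this section; they are not the functions used by `train.py` or `plot_training.py`.

```python
import json
import re
from pathlib import Path

CKPT_RE = re.compile(r"ckpt_(\d+)\.pt$")

def latest_checkpoint(run_dir):
    """Return the checkpoint path with the highest step number, or None."""
    best = None
    for path in Path(run_dir).glob("ckpt_*.pt"):
        m = CKPT_RE.search(path.name)
        if m and (best is None or int(m.group(1)) > best[0]):
            best = (int(m.group(1)), path)
    return best[1] if best else None

def read_log(run_dir):
    """Parse train_log.jsonl into a list of dicts, skipping blank lines."""
    log_path = Path(run_dir) / "train_log.jsonl"
    with log_path.open() as f:
        return [json.loads(line) for line in f if line.strip()]
```

Comparing the parsed step numbers as integers keeps the result correct even if the zero-padding width ever changes.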

	---

	## 📦 Requirements

	```
	torch>=2.3.0
	datasets>=2.14.0 # HuggingFace datasets (streaming)
	tokenizers>=0.15.0 # Fast BPE tokenizer
	transformers>=4.40.0 # PreTrainedTokenizerFast
	numpy>=1.26.0
	tqdm
	matplotlib
	```

	---

	## 📄 License

	This project is released under the MIT license for educational purposes.