---
language:
- en
license: mit
tags:
- pytorch
- language-model
- gpt
- transformer
- from-scratch
- causal-lm
pipeline_tag: text-generation
---

# SLLM: Small Language Model from Scratch

A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), designed to train on consumer GPUs with as little as 4 GB of VRAM (e.g. RTX 3050).

---

## ✨ Features

- **Architecture**: Decoder-only transformer (GPT-style) with modern improvements
  - RMSNorm instead of LayerNorm (faster, no bias)
  - RoPE (Rotary Position Embeddings), as used in LLaMA, Mistral, and Gemma
  - SwiGLU feed-forward network, which outperforms GELU at the same parameter count
  - Flash Attention via `F.scaled_dot_product_attention` (avoids materializing the O(T²) attention matrix)
  - Weight-tied token embeddings + LM head (saves one 32,000 × d_model matrix: ~24.6M parameters for `SLLM_100M`, ~32.8M for `SLLM_150M`)
- **Training**
  - bf16 mixed-precision with gradient accumulation
  - Gradient checkpointing for low-VRAM GPUs
  - Cosine LR schedule with linear warmup
  - Resumable checkpointing (`--resume`, `--extra_steps`)
  - JSONL metric logging + live training dashboard
- **Custom BPE Tokenizer**: trained on FineWeb-Edu with byte fallback (zero OOV)
- **Supervised Fine-Tuning (SFT)**: chat model pipeline included in `finetune/`

---

## 🏗️ Project Structure

```
sllm/
├── model/                      # Model architecture
│   ├── config.py               # ModelConfig dataclass (SLLM_100M, SLLM_150M presets)
│   ├── model.py                # SLLM: full model assembly, weight init, gradient checkpointing
│   ├── block.py                # TransformerBlock (pre-norm, residual)
│   ├── attention.py            # Causal multi-head self-attention + RoPE
│   ├── mlp.py                  # SwiGLU feed-forward network
│   ├── norm.py                 # RMSNorm
│   └── rope.py                 # Rotary Position Embeddings
│
├── tokenizer/                  # Custom BPE tokenizer
│   ├── normalizer.py           # HTML stripping, unicode NFC, whitespace cleanup
│   ├── pretokenizer.py         # Regex pre-tokenizer (code-aware, contraction-aware)
│   ├── bpe.py                  # BPE model config with byte fallback (32k vocab)
│   ├── traintokenizer.py       # Train on FineWeb-Edu stream
│   ├── post_processor.py       # Append <|endoftext|> to every sequence
│   ├── wrap_tokenizer.py       # Wrap into PreTrainedTokenizerFast
│   └── tokenize_dataset.py     # Pack tokens into flat binary .bin shards
│
├── data/
│   └── dataloader.py           # Memory-mapped shard dataloader
│
├── finetune/                   # Supervised fine-tuning (SFT) pipeline
│   ├── prepare_data.py         # Prepare chat data
│   ├── sft_train.py            # SFT training loop
│   ├── sft_dataset.py          # Chat dataset
│   └── chat.py                 # Interactive chat with the fine-tuned model
│
├── train.py                    # Pre-training loop
├── plot_training.py            # Training dashboard (static + live mode)
├── requirements.txt
├── model_explained.md          # Deep-dive into every model component
└── tokenizer_walkthrough.md    # Tokenizer design and pipeline walkthrough
```

---

## 📐 Model Configs

| Config      | d_model | Heads | Layers | Parameters |
|-------------|---------|-------|--------|------------|
| `SLLM_100M` | 768     | 12    | 12     | ~109.5M    |
| `SLLM_150M` | 1024    | 16    | 9      | ~148.4M    |

Both configs use:

- Context length: **1024 tokens**
- Vocab size: **32,000** (custom BPE)
- SwiGLU d_ff: computed as `round_up_256(⌊2/3 × 4 × d_model⌋)`, which gives 2048 for `SLLM_100M` and 2816 for `SLLM_150M`

---

## ⚙️ Installation

**Requires:** Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended)

```bash
# Create and activate a conda environment
conda create -n pytorch python=3.11
conda activate pytorch

# Install dependencies
pip install -r requirements.txt
```

---

## 🚀 Training

### Start a new run (recommended settings for a 4 GB RTX 3050)

```bash
python train.py \
    --config 150M \
    --data_dir tokenizer/data \
    --batch_size 2 \
    --grad_accum 16 \
    --grad_checkpoint \
    --dtype bf16 \
    --max_steps 5000 \
    --run_dir runs/sllm_150m \
    --log_every 10 \
    --save_every 500 \
    --val_every 500 \
    --warmup_steps 200
```

### Resume from a checkpoint

```bash
python train.py \
    --resume \
    --run_dir runs/sllm_150m \
    --extra_steps 5000 \
    --data_dir tokenizer/data \
    --batch_size 2 \
    --grad_accum 16 \
    --grad_checkpoint \
    --dtype bf16
```

### Key training flags

| Flag | Default | Description |
|------|---------|-------------|
| `--config` | `100M` | Model size (`100M` or `150M`) |
| `--batch_size` | `4` | Per-device micro-batch size |
| `--grad_accum` | `8` | Gradient accumulation steps |
| `--max_steps` | unlimited | Absolute step target |
| `--extra_steps` | - | Run N more steps from the current checkpoint |
| `--resume` | - | Resume from the latest checkpoint in `--run_dir` |
| `--grad_checkpoint` | - | Enable gradient checkpointing (saves VRAM) |
| `--dtype` | `bf16` | Mixed precision dtype (`fp32`, `fp16`, `bf16`) |
| `--synthetic` | - | Use random data (for testing without real shards) |

---

## 📊 Training Dashboard

Visualize training metrics in a dark-mode 6-panel dashboard:

```bash
# Static plot
python plot_training.py --run_dir runs/sllm_150m

# Live mode: refresh every 30 seconds while training
python plot_training.py --run_dir runs/sllm_150m --live --interval 30

# Compare two runs
python plot_training.py --run_dir runs/run_a runs/run_b

# Save to file
python plot_training.py --run_dir runs/sllm_150m --save dashboard.png
```

**Dashboard panels:** Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm

---

## 💬 Fine-Tuning (Chat Model)

After pre-training, you can fine-tune with supervised instruction data:

```bash
# 1. Prepare chat data
python finetune/prepare_data.py

# 2. Fine-tune
python finetune/sft_train.py \
    --base_ckpt runs/sllm_150m/ckpt_0011500.pt \
    --run_dir runs/sllm_150m_chat \
    --max_steps 2500 \
    --batch_size 4 \
    --grad_accum 8 \
    --grad_checkpoint

# 3. Chat interactively
python finetune/chat.py --run_dir runs/sllm_150m_chat
```

---

## 🔡 Tokenizer

A custom BPE tokenizer trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), the educational subset of FineWeb:

- **32,000 token vocabulary**
- **Byte fallback**: zero out-of-vocabulary tokens (even math symbols and emojis work)
- **Code-aware**: preserves `snake_case`, operators (`==`, `->`, `**`), and indentation
- **Contraction-aware**: `don't`, `I've`, `they're` are split correctly
- Packaged as a `PreTrainedTokenizerFast` (HuggingFace-compatible)

Training data is packed into flat binary `.bin` shards (`np.uint16`, 100M tokens each, roughly 200 MB per shard) for fast memory-mapped loading.

See [`tokenizer_walkthrough.md`](tokenizer_walkthrough.md) for a full pipeline deep-dive.

---

## 🧠 Architecture Deep-Dive

See [`model_explained.md`](model_explained.md) for a plain-language walkthrough of every model component, including:

- Why RMSNorm is faster than LayerNorm
- How RoPE encodes relative position without extra parameters
- Why SwiGLU outperforms GELU
- How weight tying saves ~32M parameters (for `SLLM_150M`)
- Flash Attention and gradient checkpointing explained

---

## 📋 Checkpoints & Logging

- Checkpoints are saved under `--run_dir` as `ckpt_NNNNNNN.pt` (for example `runs/sllm_150m/ckpt_0011500.pt`) every `--save_every` steps and on clean exit (Ctrl+C)
- Metrics are appended to `train_log.jsonl` in the same run directory (one JSON line per log step)
- Each checkpoint stores: model weights, optimizer state, step number, loss, and config name
- Resuming auto-detects the correct model config from the checkpoint, so `--config` does not need to be passed again

---

## 📦 Requirements

```
torch>=2.3.0
datasets>=2.14.0        # HuggingFace datasets (streaming)
tokenizers>=0.15.0      # Fast BPE tokenizer
transformers>=4.40.0    # PreTrainedTokenizerFast
numpy>=1.26.0
tqdm
matplotlib
```

---

## 📄 License

This project is released under the MIT license, for educational purposes.
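
---

## 🧮 Worked Example: Parameter Counts

The parameter counts in the Model Configs table can be checked with a few lines of arithmetic. This is a minimal sketch, not the code in `model/config.py`: it assumes bias-free linear layers, four d_model × d_model attention projections (Q, K, V, output), three SwiGLU projections, two RMSNorm gain vectors per block plus one final norm, and a single embedding matrix shared with the LM head. The helper names (`round_up`, `swiglu_d_ff`, `count_params`) are hypothetical. Under those assumptions the totals land on the table's ~109.5M and ~148.4M.

```python
def round_up(x: int, multiple: int = 256) -> int:
    """Round x up to the nearest multiple (256 in the d_ff formula)."""
    return ((x + multiple - 1) // multiple) * multiple


def swiglu_d_ff(d_model: int) -> int:
    """round_up_256(floor(2/3 * 4 * d_model)), done in integer arithmetic."""
    return round_up((8 * d_model) // 3)


def count_params(d_model: int, n_layers: int, vocab_size: int = 32_000) -> int:
    """Estimate total parameters under the assumptions stated above."""
    d_ff = swiglu_d_ff(d_model)
    attn = 4 * d_model * d_model   # Q, K, V, output projections, no bias (assumed)
    mlp = 3 * d_model * d_ff       # gate, up, down projections, no bias (assumed)
    norms = 2 * d_model            # two RMSNorm gain vectors per block (assumed)
    embed = vocab_size * d_model   # counted once because of weight tying
    final_norm = d_model           # RMSNorm before the LM head (assumed)
    return n_layers * (attn + mlp + norms) + embed + final_norm


for name, d_model, n_layers in [("SLLM_100M", 768, 12), ("SLLM_150M", 1024, 9)]:
    print(name, swiglu_d_ff(d_model), f"{count_params(d_model, n_layers):,}")
# SLLM_100M 2048 109,529,856
# SLLM_150M 2816 148,392,960
```

The `embed` term also shows what weight tying saves: an untied LM head would add another 32,000 × d_model parameters, about 24.6M for `SLLM_100M` and 32.8M for `SLLM_150M`.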
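
---

## 🔢 Worked Example: Tokens per Optimizer Step

The recommended low-VRAM settings (`--batch_size 2 --grad_accum 16`) and the defaults (`--batch_size 4 --grad_accum 8`) trade micro-batch size against accumulation steps. The sketch below assumes every sequence is packed to the full 1024-token context; `tokens_per_step` is a hypothetical helper, not a function from `train.py`.

```python
CONTEXT_LENGTH = 1024  # tokens per sequence, from the Model Configs section


def tokens_per_step(batch_size: int, grad_accum: int,
                    context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens consumed per optimizer update, assuming fully packed sequences."""
    return batch_size * grad_accum * context_length


low_vram = tokens_per_step(batch_size=2, grad_accum=16)
default = tokens_per_step(batch_size=4, grad_accum=8)
print(low_vram, default)   # 32768 32768
print(f"{5000 * low_vram:,} tokens over --max_steps 5000")   # 163,840,000
```

Both settings give the same 32 sequences per update, so under this assumption the 4 GB configuration changes peak memory and throughput rather than the effective batch size.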
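
---

## 📉 Worked Example: Cosine Schedule with Linear Warmup

The Features list mentions a cosine LR schedule with linear warmup. One common way to write such a schedule is sketched below. The function name `lr_at` and the `max_lr` / `min_lr` values are assumptions for illustration only; the exact shape and learning rates used by `train.py` may differ. The `warmup_steps=200` and `max_steps=5000` values mirror the example training command.

```python
import math


def lr_at(step: int, max_lr: float, min_lr: float,
          warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        # Ramp linearly; reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine


# Hypothetical learning rates; only the step counts come from the README.
for step in (0, 199, 2600, 5000):
    print(step, lr_at(step, max_lr=3e-4, min_lr=3e-5,
                      warmup_steps=200, max_steps=5000))
```

With these numbers the rate peaks at the end of warmup, passes through the midpoint of `max_lr` and `min_lr` at step 2600, and stays at `min_lr` from step 5000 onward.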
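
---

## 💾 Worked Example: Reading a Token Shard

The Tokenizer section describes flat binary `.bin` shards of `np.uint16` token IDs, which works because a 32,000-token vocabulary fits below 65,536. The sketch below shows how such a shard could be memory-mapped and sampled for next-token prediction. It is an illustration under assumptions, not the code in `data/dataloader.py`: `load_shard` and `sample_batch` are hypothetical names, and random-offset sampling is one possible strategy among several.

```python
import numpy as np


def load_shard(path: str) -> np.memmap:
    """Memory-map a flat uint16 token shard without reading it into RAM."""
    return np.memmap(path, dtype=np.uint16, mode="r")


def sample_batch(tokens, batch_size: int, context_length: int,
                 rng: np.random.Generator):
    """Draw random windows; targets are the inputs shifted by one token."""
    # The exclusive upper bound leaves room for the one-token-shifted target.
    starts = rng.integers(0, len(tokens) - context_length, size=batch_size)
    x = np.stack([tokens[s:s + context_length] for s in starts])
    y = np.stack([tokens[s + 1:s + 1 + context_length] for s in starts])
    # Cast up from uint16 so the IDs can index an embedding table.
    return x.astype(np.int64), y.astype(np.int64)
```

A typical call would be `x, y = sample_batch(load_shard("shard.bin"), 2, 1024, np.random.default_rng(0))`, where `shard.bin` is a placeholder path. Only the sampled windows are paged in from disk, which is what keeps host memory use low even with 100M-token shards.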