---
language:
- en
license: mit
tags:
- pytorch
- language-model
- gpt
- transformer
- from-scratch
- causal-lm
pipeline_tag: text-generation
---
# SLLM: Small Language Model from Scratch
A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), designed to fit on consumer GPUs as small as a 4 GB VRAM card (e.g. RTX 3050).
---
## ✨ Features
- **Architecture**: Decoder-only transformer (GPT-style) with modern improvements
  - RMSNorm instead of LayerNorm (faster, no bias)
  - RoPE (Rotary Position Embeddings), as used in LLaMA, Mistral, and Gemma
  - SwiGLU feed-forward network, which typically outperforms a GELU MLP at the same parameter count
  - Flash Attention via `F.scaled_dot_product_attention` (avoids materializing the O(T²) attention matrix)
  - Weight-tied token embeddings + LM head (saves `vocab_size × d_model` parameters: ~25M for `SLLM_100M`, ~33M for `SLLM_150M`)
- **Training**
  - bf16 mixed-precision with gradient accumulation
  - Gradient checkpointing for low-VRAM GPUs
  - Cosine LR schedule with linear warmup
  - Resumable checkpointing (`--resume`, `--extra_steps`)
  - JSONL metric logging + live training dashboard
- **Custom BPE Tokenizer**: trained on FineWeb-Edu with byte fallback (zero OOV)
- **Supervised Fine-Tuning (SFT)**: chat model pipeline included in `finetune/`
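
The cosine schedule with linear warmup listed above can be sketched in plain Python. This is a minimal illustration, not the code in `train.py`: the function name `lr_at` and the `max_lr` / `min_lr` values are assumptions, and only `warmup_steps=200` and `max_steps=5000` are taken from the example training command in this README.

```python
import math


def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup_steps: int = 200, max_steps: int = 5000) -> float:
    """Learning rate for a 0-indexed optimizer step (illustrative values)."""
    if step < warmup_steps:
        # Linear warmup: ramp from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # Past the end of the schedule, hold the floor value.
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for step in (0, 199, 200, 2600, 5000):
    print(step, f"{lr_at(step):.6f}")
```

With these assumed values the rate peaks at the end of warmup, passes the midpoint between `max_lr` and `min_lr` halfway through the decay, and settles at `min_lr` once `max_steps` is reached.
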
---
## 🏗️ Project Structure
```
sllm/
├── model/                      # Model architecture
│   ├── config.py               # ModelConfig dataclass (SLLM_100M, SLLM_150M presets)
│   ├── model.py                # SLLM: full model assembly, weight init, gradient checkpointing
│   ├── block.py                # TransformerBlock (pre-norm, residual)
│   ├── attention.py            # Causal multi-head self-attention + RoPE
│   ├── mlp.py                  # SwiGLU feed-forward network
│   ├── norm.py                 # RMSNorm
│   └── rope.py                 # Rotary Position Embeddings
│
├── tokenizer/                  # Custom BPE tokenizer
│   ├── normalizer.py           # HTML stripping, unicode NFC, whitespace cleanup
│   ├── pretokenizer.py         # Regex pre-tokenizer (code-aware, contraction-aware)
│   ├── bpe.py                  # BPE model config with byte fallback (32k vocab)
│   ├── traintokenizer.py       # Train on FineWeb-Edu stream
│   ├── post_processor.py       # Append <|endoftext|> to every sequence
│   ├── wrap_tokenizer.py       # Wrap into PreTrainedTokenizerFast
│   └── tokenize_dataset.py     # Pack tokens into flat binary .bin shards
│
├── data/
│   └── dataloader.py           # Memory-mapped shard dataloader
│
├── finetune/                   # Supervised fine-tuning (SFT) pipeline
│   ├── prepare_data.py         # Prepare chat data
│   ├── sft_train.py            # SFT training loop
│   ├── sft_dataset.py          # Chat dataset
│   └── chat.py                 # Interactive chat with the fine-tuned model
│
├── train.py                    # Pre-training loop
├── plot_training.py            # Training dashboard (static + live mode)
├── requirements.txt
├── model_explained.md          # Deep-dive into every model component
└── tokenizer_walkthrough.md    # Tokenizer design and pipeline walkthrough
```
---
## 📐 Model Configs
| Config | d_model | Heads | Layers | Parameters |
|------------|---------|-------|--------|------------|
| `SLLM_100M` | 768 | 12 | 12 | ~109.5M |
| `SLLM_150M` | 1024 | 16 | 9 | ~148.4M |
Both configs use:
- Context length: **1024 tokens**
- Vocab size: **32,000** (custom BPE)
- SwiGLU d_ff: computed as `round_up_256(⌊2/3 × 4 × d_model⌋)`
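
The parameter totals in the table can be reproduced with a short back-of-the-envelope calculation. The sketch below is not taken from `model/config.py`; it assumes bias-free linear layers, parameter-free RoPE, one weight vector per RMSNorm (two per block plus a final norm), and a weight-tied LM head, and the helper names `swiglu_d_ff` and `count_params` are hypothetical.

```python
def swiglu_d_ff(d_model: int, multiple: int = 256) -> int:
    """floor(2/3 * 4 * d_model), rounded up to the next multiple of 256."""
    raw = (2 * 4 * d_model) // 3          # integer floor, avoids float rounding
    return ((raw + multiple - 1) // multiple) * multiple


def count_params(d_model: int, n_layers: int, vocab_size: int = 32_000) -> int:
    d_ff = swiglu_d_ff(d_model)
    attn = 4 * d_model * d_model          # Q, K, V and output projections (assumed bias-free)
    mlp = 3 * d_model * d_ff              # SwiGLU: gate, up and down projections
    norms = 2 * d_model                   # two RMSNorm weight vectors per block
    embed = vocab_size * d_model          # counted once because the LM head is tied
    final_norm = d_model
    return n_layers * (attn + mlp + norms) + embed + final_norm


for name, d_model, n_layers in [("SLLM_100M", 768, 12), ("SLLM_150M", 1024, 9)]:
    total = count_params(d_model, n_layers)
    print(f"{name}: d_ff={swiglu_d_ff(d_model)}, params={total / 1e6:.1f}M")
# SLLM_100M: d_ff=2048, params=109.5M
# SLLM_150M: d_ff=2816, params=148.4M
```

Under these assumptions the counts match the ~109.5M and ~148.4M figures above, which suggests the table counts the tied embedding matrix once.
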
---
## ⚙️ Installation
**Requires:** Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended)
```bash
# Create and activate a conda environment
conda create -n pytorch python=3.11
conda activate pytorch
# Install dependencies
pip install -r requirements.txt
```
---
## 🚀 Training
### Start a new run (recommended settings for a 4 GB RTX 3050)
```bash
python train.py \
--config 150M \
--data_dir tokenizer/data \
--batch_size 2 \
--grad_accum 16 \
--grad_checkpoint \
--dtype bf16 \
--max_steps 5000 \
--run_dir runs/sllm_150m \
--log_every 10 \
--save_every 500 \
--val_every 500 \
--warmup_steps 200
```
### Resume from a checkpoint
```bash
python train.py \
--resume \
--run_dir runs/sllm_150m \
--extra_steps 5000 \
--data_dir tokenizer/data \
--batch_size 2 \
--grad_accum 16 \
--grad_checkpoint \
--dtype bf16
```
### Key training flags
| Flag | Default | Description |
|------|---------|-------------|
| `--config` | `100M` | Model size (`100M` or `150M`) |
| `--batch_size` | `4` | Per-device micro-batch size |
| `--grad_accum` | `8` | Gradient accumulation steps |
| `--max_steps` | unlimited | Absolute step target |
| `--extra_steps` | - | Run N more steps from current checkpoint |
| `--resume` | - | Resume from latest checkpoint in `--run_dir` |
| `--grad_checkpoint` | - | Enable gradient checkpointing (saves VRAM) |
| `--dtype` | `bf16` | Mixed precision dtype (`fp32`, `fp16`, `bf16`) |
| `--synthetic` | - | Use random data (for testing without real shards) |
---
## 📊 Training Dashboard
Visualize training metrics in a dark-mode 6-panel dashboard:
```bash
# Static plot
python plot_training.py --run_dir runs/sllm_150m
# Live mode: refresh every 30 seconds while training
python plot_training.py --run_dir runs/sllm_150m --live --interval 30
# Compare two runs
python plot_training.py --run_dir runs/run_a runs/run_b
# Save to file
python plot_training.py --run_dir runs/sllm_150m --save dashboard.png
```
**Dashboard panels:** Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm
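
The EMA curve in the Training Loss panel is an exponential moving average of the raw loss. The sketch below shows the general technique only; the smoothing factor `alpha` and the function name `ema` are assumptions, and `plot_training.py` may use a different factor or initialization.

```python
def ema(values, alpha=0.1):
    """Exponential moving average, seeded with the first value."""
    smoothed = []
    running = None
    for v in values:
        # Blend the new point with the running average; higher alpha reacts faster.
        running = v if running is None else alpha * v + (1 - alpha) * running
        smoothed.append(running)
    return smoothed


raw_loss = [10.0, 8.0, 9.0, 6.0]
print(ema(raw_loss, alpha=0.5))   # [10.0, 9.0, 9.0, 7.5]
```

A smaller `alpha` gives a smoother curve that lags the raw loss more.
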
---
## 💬 Fine-Tuning (Chat Model)
After pre-training, you can fine-tune with supervised instruction data:
```bash
# 1. Prepare chat data
python finetune/prepare_data.py
# 2. Fine-tune
python finetune/sft_train.py \
--base_ckpt runs/sllm_150m/ckpt_0011500.pt \
--run_dir runs/sllm_150m_chat \
--max_steps 2500 \
--batch_size 4 \
--grad_accum 8 \
--grad_checkpoint
# 3. Chat interactively
python finetune/chat.py --run_dir runs/sllm_150m_chat
```
---
## 🔑 Tokenizer
A custom BPE tokenizer trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), the educational subset of FineWeb:
- **32,000 token vocabulary**
- **Byte fallback**: zero out-of-vocabulary tokens (even math symbols and emojis work)
- **Code-aware**: preserves `snake_case`, operators (`==`, `->`, `**`), and indentation
- **Contraction-aware**: `don't`, `I've`, `they're` are split correctly
- Packaged as a `PreTrainedTokenizerFast` (HuggingFace-compatible)
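
Byte fallback works because any string can be written as UTF-8 bytes, and a vocabulary that contains all 256 byte values can therefore encode it. The sketch below illustrates the idea without loading the tokenizer; the `<0xNN>` token naming follows a common SentencePiece-style convention and is an assumption about this vocabulary, as is the helper name `byte_fallback_tokens`.

```python
def byte_fallback_tokens(text: str) -> list[str]:
    """Spell out a string as one byte-level token per UTF-8 byte."""
    return [f"<0x{byte:02X}>" for byte in text.encode("utf-8")]


print(byte_fallback_tokens("∑"))         # ['<0xE2>', '<0x88>', '<0x91>']
print(len(byte_fallback_tokens("🙂")))   # 4
```

In practice the learned merges cover common text, so byte tokens only appear for characters the BPE vocabulary never saw.
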
Training data is packed into flat binary `.bin` shards (`np.uint16`, 100M tokens each) for fast memory-mapped loading.
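
The shard format implies some simple arithmetic, and a flat token array lends itself to the usual next-token windowing. The sketch below uses a plain list as a stand-in for a memory-mapped `np.uint16` array; the windowing scheme and the helper `next_token_pair` are assumptions about what `data/dataloader.py` does, not a copy of it.

```python
BYTES_PER_TOKEN = 2                  # np.uint16
TOKENS_PER_SHARD = 100_000_000
VOCAB_SIZE = 32_000

# Every token id (0..31,999) fits in an unsigned 16-bit integer.
assert VOCAB_SIZE - 1 <= 2**16 - 1
print(TOKENS_PER_SHARD * BYTES_PER_TOKEN / 1e6)   # 200.0 (MB per shard)


def next_token_pair(tokens, start, context_length):
    """Slice an input window and its one-step-shifted target window."""
    x = tokens[start : start + context_length]
    y = tokens[start + 1 : start + context_length + 1]
    return x, y


shard = list(range(10))              # stand-in for a memory-mapped uint16 array
x, y = next_token_pair(shard, start=3, context_length=4)
print(x, y)                          # [3, 4, 5, 6] [4, 5, 6, 7]
```
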
See [`tokenizer_walkthrough.md`](tokenizer_walkthrough.md) for a full pipeline deep-dive.
---
## 🧠 Architecture Deep-Dive
See [`model_explained.md`](model_explained.md) for a plain-language walkthrough of every model component, including:
- Why RMSNorm is faster than LayerNorm
- How RoPE encodes relative position without extra parameters
- Why SwiGLU outperforms GELU
- How weight tying saves 32M parameters
- Flash Attention and gradient checkpointing explained
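
As a small taste of the RoPE item above, the plain-Python sketch below rotates query and key vectors by position-dependent angles and checks that their dot product depends only on the position offset. It is an illustration of the general technique, not the code in `model/rope.py`: the interleaved pairing, the base of 10000, and the helper names are assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)      # later pairs rotate more slowly
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + 1]
        out += [a * cos_t - b * sin_t, a * sin_t + b * cos_t]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same offset (2) at different absolute positions gives the same score.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 105), rope(k, 103))
print(abs(score_near - score_far) < 1e-9)   # True
```

Because the rotation has no learned weights, this relative-position behavior comes without any extra parameters.
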
---
## 📋 Checkpoints & Logging
- Checkpoints are saved to `<run_dir>/ckpt_NNNNNNN.pt` every `--save_every` steps and on clean exit (Ctrl+C)
- Metrics are appended to `<run_dir>/train_log.jsonl` (one JSON line per log step)
- Each checkpoint stores: model weights, optimizer state, step number, loss, and config name
- Resuming auto-detects the correct model config from the checkpoint
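
For scripting around a run directory, the layout described above can be read with the standard library alone. The sketch below is a hypothetical helper, not part of the repo: it relies only on the `ckpt_NNNNNNN.pt` naming and the one-JSON-object-per-line log format stated in this section, and it returns raw dicts because the key names inside each log record are not documented here.

```python
import json
import re
from pathlib import Path

CKPT_PATTERN = re.compile(r"ckpt_(\d+)\.pt$")


def latest_checkpoint(run_dir):
    """Return (step, path) for the highest-numbered ckpt_NNNNNNN.pt, or None."""
    found = []
    for path in Path(run_dir).glob("ckpt_*.pt"):
        match = CKPT_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None


def read_metrics(run_dir):
    """Parse train_log.jsonl into a list of dicts, one per logged step."""
    log_path = Path(run_dir) / "train_log.jsonl"
    with log_path.open(encoding="utf-8") as f:
        # Skip blank lines so a partially written log still parses.
        return [json.loads(line) for line in f if line.strip()]
```

For example, `latest_checkpoint("runs/sllm_150m")` would return the step number and path of the newest checkpoint in that run, if any exist.
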
---
## 📦 Requirements
```
torch>=2.3.0
datasets>=2.14.0 # HuggingFace datasets (streaming)
tokenizers>=0.15.0 # Fast BPE tokenizer
transformers>=4.40.0 # PreTrainedTokenizerFast
numpy>=1.26.0
tqdm
matplotlib
```
---
## 📄 License
This project is released under the MIT license, for educational purposes.