| --- |
| language: |
| - en |
| license: mit |
| tags: |
| - pytorch |
| - language-model |
| - gpt |
| - transformer |
| - from-scratch |
| - causal-lm |
| pipeline_tag: text-generation |
| --- |
| |
| # SLLM: Small Language Model from Scratch |
|
|
| A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), both sized to fit on consumer GPUs with as little as 4 GB of VRAM (e.g. an RTX 3050). |
|
|
| --- |
|
|
| ## Features |
|
|
| - **Architecture**: Decoder-only transformer (GPT-style) with modern improvements |
|   - RMSNorm instead of LayerNorm (cheaper: no mean subtraction and no bias term) |
|   - RoPE (Rotary Position Embeddings), as used in LLaMA, Mistral, and Gemma |
|   - SwiGLU feed-forward network, which typically outperforms a GELU MLP at the same parameter count |
|   - Flash Attention via `F.scaled_dot_product_attention`, which avoids materializing the O(T²) attention matrix when a fused kernel is available |
|   - Weight-tied token embeddings + LM head (saves `vocab_size × d_model` parameters: ~24.6M for `SLLM_100M`, ~32.8M for `SLLM_150M`) |
| - **Training** |
|   - bf16 mixed-precision with gradient accumulation |
|   - Gradient checkpointing for low-VRAM GPUs |
|   - Cosine LR schedule with linear warmup |
|   - Resumable checkpointing (`--resume`, `--extra_steps`) |
|   - JSONL metric logging + live training dashboard |
| - **Custom BPE tokenizer**: trained on FineWeb-Edu, with byte fallback (no out-of-vocabulary tokens) |
| - **Supervised Fine-Tuning (SFT)**: chat-model pipeline included in `finetune/` |
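
The cosine schedule with linear warmup listed under Training can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in `train.py`: the function name `lr_at` and the `peak_lr` / `min_lr` values are assumptions chosen for the example, while `warmup_steps=200` and `max_steps=5000` match the sample training command in this README.

```python
import math


def lr_at(step, max_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    peak_lr and min_lr are illustrative values, not the project's defaults.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for s in (0, 199, 200, 2600, 4999):
    print(s, f"{lr_at(s, max_steps=5000, warmup_steps=200):.2e}")
```

Halfway through the decay phase (step 2600 here) the rate sits at the midpoint between `peak_lr` and `min_lr`.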
|
|
| --- |
|
|
| ## Project Structure |
|
|
| ``` |
| sllm/ |
| ├── model/                      # Model architecture |
| │   ├── config.py               # ModelConfig dataclass (SLLM_100M, SLLM_150M presets) |
| │   ├── model.py                # SLLM: full model assembly, weight init, gradient checkpointing |
| │   ├── block.py                # TransformerBlock (pre-norm, residual) |
| │   ├── attention.py            # Causal multi-head self-attention + RoPE |
| │   ├── mlp.py                  # SwiGLU feed-forward network |
| │   ├── norm.py                 # RMSNorm |
| │   └── rope.py                 # Rotary Position Embeddings |
| │ |
| ├── tokenizer/                  # Custom BPE tokenizer |
| │   ├── normalizer.py           # HTML stripping, unicode NFC, whitespace cleanup |
| │   ├── pretokenizer.py         # Regex pre-tokenizer (code-aware, contraction-aware) |
| │   ├── bpe.py                  # BPE model config with byte fallback (32k vocab) |
| │   ├── traintokenizer.py       # Train on FineWeb-Edu stream |
| │   ├── post_processor.py       # Append <|endoftext|> to every sequence |
| │   ├── wrap_tokenizer.py       # Wrap into PreTrainedTokenizerFast |
| │   └── tokenize_dataset.py     # Pack tokens into flat binary .bin shards |
| │ |
| ├── data/ |
| │   └── dataloader.py           # Memory-mapped shard dataloader |
| │ |
| ├── finetune/                   # Supervised fine-tuning (SFT) pipeline |
| │   ├── prepare_data.py         # Prepare chat data |
| │   ├── sft_train.py            # SFT training loop |
| │   ├── sft_dataset.py          # Chat dataset |
| │   └── chat.py                 # Interactive chat with the fine-tuned model |
| │ |
| ├── train.py                    # Pre-training loop |
| ├── plot_training.py            # Training dashboard (static + live mode) |
| ├── requirements.txt |
| ├── model_explained.md          # Deep-dive into every model component |
| └── tokenizer_walkthrough.md    # Tokenizer design and pipeline walkthrough |
| ``` |
|
|
| --- |
|
|
| ## Model Configs |
|
|
| | Config | d_model | Heads | Layers | Parameters | |
| |------------|---------|-------|--------|------------| |
| | `SLLM_100M` | 768 | 12 | 12 | ~109.5M | |
| | `SLLM_150M` | 1024 | 16 | 9 | ~148.4M | |
|
|
| Both configs use: |
| - Context length: **1024 tokens** |
| - Vocab size: **32,000** (custom BPE) |
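| - SwiGLU d_ff: computed as `round_up_256(2/3 × 4 × d_model)`, where the inner term is rounded to an integer before rounding up to a multiple of 256 (2048 for d_model 768, 2816 for d_model 1024) |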
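
The parameter totals in the table can be reproduced by hand. The sketch below is an assumption-laden reconstruction rather than the project's own code: it assumes bias-free linear layers, four attention projections (Q, K, V, output), three SwiGLU projections, two RMSNorm gain vectors per block plus a final one, no parameters for RoPE, and a single embedding matrix shared with the LM head. The helper names (`round_up`, `swiglu_d_ff`, `count_params`) are hypothetical. Under those assumptions the counts match the table.

```python
def round_up(x, multiple=256):
    """Round x up to the nearest multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple


def swiglu_d_ff(d_model):
    # 2/3 * 4 * d_model, truncated to an integer, then rounded up to a multiple of 256.
    # Truncation is an assumption; rounding the inner term up gives the same
    # result for both configs in the table.
    return round_up((8 * d_model) // 3)


def count_params(d_model, n_layers, vocab_size=32_000):
    d_ff = swiglu_d_ff(d_model)
    attn = 4 * d_model * d_model   # Q, K, V and output projections, no bias
    mlp = 3 * d_model * d_ff       # gate, up and down projections, no bias
    norms = 2 * d_model            # two RMSNorm gain vectors per block
    embed = vocab_size * d_model   # counted once: shared with the LM head
    final_norm = d_model
    return n_layers * (attn + mlp + norms) + embed + final_norm


for name, d_model, n_layers in [("SLLM_100M", 768, 12), ("SLLM_150M", 1024, 9)]:
    total = count_params(d_model, n_layers)
    print(name, swiglu_d_ff(d_model), f"{total / 1e6:.1f}M")
```

The `embed` term is also the amount saved by weight tying, since an untied LM head would add a second `vocab_size × d_model` matrix.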
|
|
| --- |
|
|
| ## Installation |
|
|
| **Requires:** Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended) |
|
|
| ```bash |
| # Create and activate a conda environment |
| conda create -n pytorch python=3.11 |
| conda activate pytorch |
| |
| # Install dependencies |
| pip install -r requirements.txt |
| ``` |
|
|
| --- |
|
|
| ## Training |
|
|
| ### Start a new run (recommended settings for a 4 GB RTX 3050) |
|
|
| ```bash |
| python train.py \ |
| --config 150M \ |
| --data_dir tokenizer/data \ |
| --batch_size 2 \ |
| --grad_accum 16 \ |
| --grad_checkpoint \ |
| --dtype bf16 \ |
| --max_steps 5000 \ |
| --run_dir runs/sllm_150m \ |
| --log_every 10 \ |
| --save_every 500 \ |
| --val_every 500 \ |
| --warmup_steps 200 |
| ``` |
|
|
| ### Resume from a checkpoint |
|
|
| ```bash |
| python train.py \ |
| --resume \ |
| --run_dir runs/sllm_150m \ |
| --extra_steps 5000 \ |
| --data_dir tokenizer/data \ |
| --batch_size 2 \ |
| --grad_accum 16 \ |
| --grad_checkpoint \ |
| --dtype bf16 |
| ``` |
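
How `--resume` locates a checkpoint is not shown in this README. As a rough sketch, assuming it picks the highest step among files named `ckpt_NNNNNNN.pt` (seven zero-padded digits, as in `ckpt_0011500.pt`) and that `--extra_steps` is added to that step, the logic could look like this. `latest_checkpoint` and `resume_target` are hypothetical helpers, not functions from `train.py`.

```python
import re

# Matches the checkpoint naming used in this README, e.g. ckpt_0011500.pt
CKPT_RE = re.compile(r"^ckpt_(\d{7})\.pt$")


def latest_checkpoint(filenames):
    """Return (step, filename) for the highest-numbered checkpoint, or None."""
    found = []
    for name in filenames:
        match = CKPT_RE.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return max(found) if found else None


def resume_target(current_step, extra_steps):
    """Treat --extra_steps as relative: train until current_step + extra_steps."""
    return current_step + extra_steps


files = ["ckpt_0000500.pt", "ckpt_0011500.pt", "train_log.jsonl", "ckpt_0001000.pt"]
step, name = latest_checkpoint(files)
print(name, resume_target(step, 5000))
```

Parsing the step as an integer, rather than sorting file names as strings, keeps the ordering correct even if the padding width ever changes.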
|
|
| ### Key training flags |
|
|
| | Flag | Default | Description | |
| |------|---------|-------------| |
| | `--config` | `100M` | Model size (`100M` or `150M`) | |
| | `--batch_size` | `4` | Per-device micro-batch size | |
| | `--grad_accum` | `8` | Gradient accumulation steps | |
| | `--max_steps` | unlimited | Absolute step target | |
| | `--extra_steps` | - | Run N more steps from current checkpoint | |
| | `--resume` | - | Resume from latest checkpoint in `--run_dir` | |
| | `--grad_checkpoint` | - | Enable gradient checkpointing (saves VRAM) | |
| | `--dtype` | `bf16` | Mixed precision dtype (`fp32`, `fp16`, `bf16`) | |
| | `--synthetic` | - | Use random data (for testing without real shards) | |
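
`--batch_size` and `--grad_accum` trade memory for wall-clock time without changing the effective batch. The arithmetic below is a sketch that assumes a single GPU, full 1024-token sequences, and one optimizer update per `grad_accum` micro-batches; `tokens_per_step` is a hypothetical helper.

```python
def tokens_per_step(batch_size, grad_accum, context_len=1024):
    """Tokens consumed per optimizer update on one device."""
    return batch_size * grad_accum * context_len


low_vram = tokens_per_step(batch_size=2, grad_accum=16)  # 4 GB settings from this README
defaults = tokens_per_step(batch_size=4, grad_accum=8)   # flag defaults from the table
print(low_vram, defaults)

# Total tokens seen in a 5000-step run at the low-VRAM settings.
print(low_vram * 5000)
```

Both settings come out to 32,768 tokens per update, so halving the micro-batch and doubling the accumulation steps leaves the optimization problem unchanged while lowering peak activation memory.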
|
|
| --- |
|
|
| ## Training Dashboard |
|
|
| Visualize training metrics in a dark-mode 6-panel dashboard: |
|
|
| ```bash |
| # Static plot |
| python plot_training.py --run_dir runs/sllm_150m |
| |
| # Live mode: refresh every 30 seconds while training |
| python plot_training.py --run_dir runs/sllm_150m --live --interval 30 |
| |
| # Compare two runs |
| python plot_training.py --run_dir runs/run_a runs/run_b |
| |
| # Save to file |
| python plot_training.py --run_dir runs/sllm_150m --save dashboard.png |
| ``` |
|
|
| **Dashboard panels:** Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm |
|
|
| --- |
|
|
| ## Fine-Tuning (Chat Model) |
|
|
| After pre-training, you can fine-tune with supervised instruction data: |
|
|
| ```bash |
| # 1. Prepare chat data |
| python finetune/prepare_data.py |
| |
| # 2. Fine-tune |
| python finetune/sft_train.py \ |
| --base_ckpt runs/sllm_150m/ckpt_0011500.pt \ |
| --run_dir runs/sllm_150m_chat \ |
| --max_steps 2500 \ |
| --batch_size 4 \ |
| --grad_accum 8 \ |
| --grad_checkpoint |
| |
| # 3. Chat interactively |
| python finetune/chat.py --run_dir runs/sllm_150m_chat |
| ``` |
|
|
| --- |
|
|
| ## Tokenizer |
|
|
| A custom BPE tokenizer trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), the educationally filtered subset of FineWeb: |
|
|
| - **32,000 token vocabulary** |
| - **Byte fallback**: no out-of-vocabulary tokens; characters missing from the vocabulary (rare math symbols, emoji) are encoded as raw UTF-8 bytes |
| - **Code-aware**: preserves `snake_case`, operators (`==`, `->`, `**`), and indentation |
| - **Contraction-aware**: `don't`, `I've`, `they're` are split correctly |
| - Packaged as a `PreTrainedTokenizerFast` (HuggingFace-compatible) |
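
Byte fallback is what guarantees that every input is encodable. The sketch below shows only the fallback path, in plain Python: a character with no vocabulary entry is spelled as one token per UTF-8 byte. The `<0xNN>` token naming follows the convention used by SentencePiece and the Hugging Face `tokenizers` BPE model; whether this project's vocabulary uses exactly those names is an assumption, and both helper functions are hypothetical.

```python
def byte_fallback_tokens(text):
    """Spell text as one <0xNN> token per UTF-8 byte (fallback path only)."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def decode_byte_tokens(tokens):
    """Invert byte_fallback_tokens: collect the bytes and decode as UTF-8."""
    return bytes(int(t[3:5], 16) for t in tokens).decode("utf-8")


tokens = byte_fallback_tokens("∑")
print(tokens)                      # ['<0xE2>', '<0x88>', '<0x91>']
print(decode_byte_tokens(tokens))  # ∑
```

Because there are only 256 possible byte values, reserving 256 vocabulary slots is enough to cover any string, at the cost of several tokens per rare character.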
|
|
| Training data is packed into flat binary `.bin` shards (`np.uint16`, 100M tokens each) for fast memory-mapped loading. |
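
A flat `np.uint16` shard has no header, so token `i` sits at byte offset `2 * i` and the file can be memory-mapped directly. The following is a minimal sketch of that idea, not the code in `data/dataloader.py`: `write_shard`, `get_batch`, and the file name `demo_shard.bin` are hypothetical, and the random-window sampling is an assumed strategy.

```python
import os
import tempfile

import numpy as np


def write_shard(path, token_ids):
    """Store token ids back to back as raw uint16 values, with no header."""
    np.asarray(token_ids, dtype=np.uint16).tofile(path)


def get_batch(path, batch_size, context_len, rng):
    """Sample (input, target) windows without loading the whole shard into RAM."""
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    # The highest start leaves room for context_len inputs plus one shifted target.
    starts = rng.integers(0, len(tokens) - context_len, size=batch_size)
    x = np.stack([tokens[s : s + context_len] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1 : s + 1 + context_len] for s in starts]).astype(np.int64)
    return x, y


with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "demo_shard.bin")     # hypothetical file name
    write_shard(shard, np.arange(10_000) % 32_000)  # stand-in for real token ids
    shard_bytes = os.path.getsize(shard)            # 2 bytes per token
    x, y = get_batch(shard, batch_size=4, context_len=16, rng=np.random.default_rng(0))

print(shard_bytes, x.shape, y.shape)
```

`uint16` holds values up to 65,535, which covers the 32,000-token vocabulary, and at 2 bytes per token a 100M-token shard occupies about 200 MB on disk.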
|
|
| See [`tokenizer_walkthrough.md`](tokenizer_walkthrough.md) for a full pipeline deep-dive. |
|
|
| --- |
|
|
| ## Architecture Deep-Dive |
|
|
| See [`model_explained.md`](model_explained.md) for a plain-language walkthrough of every model component, including: |
| - Why RMSNorm is faster than LayerNorm |
| - How RoPE encodes relative position without extra parameters |
| - Why SwiGLU outperforms GELU |
| - How weight tying saves up to ~32M parameters |
| - Flash Attention and gradient checkpointing explained |
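
The relative-position property of RoPE mentioned in the list above can be checked numerically. This sketch uses plain Python complex numbers instead of PyTorch to stay dependency-free. It pairs adjacent dimensions as in the original RoFormer formulation, which is an assumption about `model/rope.py` (some implementations pair the two halves of the vector instead), and `base=10000.0` is the commonly used default rather than a value confirmed by this README.

```python
import cmath


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # lower frequency for later pairs
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * pos * theta)
        out += [z.real, z.imag]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset (3) at very different absolute positions gives the same score.
s_near = dot(rope(q, 5), rope(k, 2))
s_far = dot(rope(q, 105), rope(k, 102))
# A different offset (1) gives a different score.
s_other = dot(rope(q, 5), rope(k, 4))
print(abs(s_near - s_far) < 1e-9, abs(s_near - s_other) > 0.1)
```

The rotation angles are fixed functions of position, so the attention score depends only on the distance between query and key, and no learned position parameters are needed.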
|
|
| --- |
|
|
| ## Checkpoints & Logging |
|
|
| - Checkpoints are saved to `<run_dir>/ckpt_NNNNNNN.pt` every `--save_every` steps and on clean exit (Ctrl+C) |
| - Metrics are appended to `<run_dir>/train_log.jsonl` (one JSON line per log step) |
| - Each checkpoint stores: model weights, optimizer state, step number, loss, and config name |
| - Resuming auto-detects the correct model config from the checkpoint |
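
Because `train_log.jsonl` holds one JSON object per line, it can be read while training is still running. The sketch below parses such a log and computes the kind of exponential moving average shown in the dashboard's loss panel. The field names `step` and `loss`, the smoothing factor, and both helper functions are assumptions for illustration; the actual schema is defined by `train.py`.

```python
import json


def read_metrics(lines):
    """Parse JSONL records, skipping blank or truncated lines (e.g. a write in progress)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return records


def ema(values, alpha=0.1):
    """Exponential moving average, seeded with the first value."""
    smoothed, current = [], None
    for v in values:
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


sample = [
    '{"step": 10, "loss": 10.0}',  # hypothetical field names
    '{"step": 20, "loss": 8.0}',
    '{"step": 30, "lo',            # truncated line from an in-progress write
]
records = read_metrics(sample)
print([r["step"] for r in records], ema([r["loss"] for r in records], alpha=0.5))
```

To read a real run, pass an open file handle for `<run_dir>/train_log.jsonl` as `lines`.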
|
|
| --- |
|
|
| ## Requirements |
|
|
| ``` |
| torch>=2.3.0 |
| datasets>=2.14.0 # HuggingFace datasets (streaming) |
| tokenizers>=0.15.0 # Fast BPE tokenizer |
| transformers>=4.40.0 # PreTrainedTokenizerFast |
| numpy>=1.26.0 |
| tqdm |
| matplotlib |
| ``` |
|
|
| --- |
|
|
| ## License |
|
|
| This project is released under the MIT license, as declared in the model card metadata, and is intended for educational use. |
|
|