---
language:
  - en
license: mit
tags:
  - pytorch
  - language-model
  - gpt
  - transformer
  - from-scratch
  - causal-lm
pipeline_tag: text-generation
---

# SLLM — Small Language Model from Scratch

A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), both designed to train on consumer GPUs with as little as 4 GB of VRAM (e.g. RTX 3050).

---

## ✨ Features

- **Architecture**: Decoder-only transformer (GPT-style) with modern improvements
  - RMSNorm instead of LayerNorm (faster, no bias)
  - RoPE (Rotary Position Embeddings) — used in LLaMA, Mistral, Gemma
  - SwiGLU feed-forward network — outperforms GELU at the same parameter count
  - Flash Attention via `F.scaled_dot_product_attention` (a fused kernel that avoids materializing the full O(T²) attention matrix)
  - Weight-tied token embeddings + LM head (saves vocab × d_model parameters: ~24.6M for `SLLM_100M`, ~32.8M for `SLLM_150M`)
- **Training**
  - bf16 mixed-precision with gradient accumulation
  - Gradient checkpointing for low-VRAM GPUs
  - Cosine LR schedule with linear warmup
  - Resumable checkpointing (`--resume`, `--extra_steps`)
  - JSONL metric logging + live training dashboard
- **Custom BPE Tokenizer** — trained on FineWeb-Edu with byte fallback (zero OOV)
- **Supervised Fine-Tuning (SFT)** — chat model pipeline included in `finetune/`
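
As an illustration of the cosine schedule with linear warmup listed above, here is a minimal sketch. The function name `lr_at` and the `peak_lr` / `min_lr` values are assumptions made for this example; the schedule and hyperparameters actually used in `train.py` may differ.

```python
import math


def lr_at(step, max_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5):
    """Learning rate at a given optimizer step (hypothetical helper).

    peak_lr and min_lr are illustrative values, not the repository's defaults.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # Past the end of the schedule, stay at the floor.
        return min_lr
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Print the schedule at a few points of a 5000-step run with 200 warmup steps.
for step in (0, 100, 200, 2600, 5000):
    print(step, lr_at(step, max_steps=5000, warmup_steps=200))
```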

---

## 🏗️ Project Structure

```
sllm/
├── model/                   # Model architecture
│   ├── config.py            # ModelConfig dataclass (SLLM_100M, SLLM_150M presets)
│   ├── model.py             # SLLM — full model assembly, weight init, gradient checkpointing
│   ├── block.py             # TransformerBlock (pre-norm, residual)
│   ├── attention.py         # Causal multi-head self-attention + RoPE
│   ├── mlp.py               # SwiGLU feed-forward network
│   ├── norm.py              # RMSNorm
│   └── rope.py              # Rotary Position Embeddings
│
├── tokenizer/               # Custom BPE tokenizer
│   ├── normalizer.py        # HTML stripping, unicode NFC, whitespace cleanup
│   ├── pretokenizer.py      # Regex pre-tokenizer (code-aware, contraction-aware)
│   ├── bpe.py               # BPE model config with byte fallback (32k vocab)
│   ├── traintokenizer.py    # Train on FineWeb-Edu stream
│   ├── post_processor.py    # Append <|endoftext|> to every sequence
│   ├── wrap_tokenizer.py    # Wrap into PreTrainedTokenizerFast
│   └── tokenize_dataset.py  # Pack tokens into flat binary .bin shards
│
├── data/
│   └── dataloader.py        # Memory-mapped shard dataloader
│
├── finetune/                # Supervised fine-tuning (SFT) pipeline
│   ├── prepare_data.py      # Prepare chat data
│   ├── sft_train.py         # SFT training loop
│   ├── sft_dataset.py       # Chat dataset
│   └── chat.py              # Interactive chat with the fine-tuned model
│
├── train.py                 # Pre-training loop
├── plot_training.py         # Training dashboard (static + live mode)
├── requirements.txt
├── model_explained.md       # Deep-dive into every model component
└── tokenizer_walkthrough.md # Tokenizer design and pipeline walkthrough
```

---

## 📐 Model Configs

| Config     | d_model | Heads | Layers | Parameters |
|------------|---------|-------|--------|------------|
| `SLLM_100M` | 768    | 12    | 12     | ~109.5M    |
| `SLLM_150M` | 1024   | 16    | 9      | ~148.4M    |

Both configs use:
- Context length: **1024 tokens**
- Vocab size: **32,000** (custom BPE)
- SwiGLU d_ff: computed as `round_up_256(⌊2/3 × 4 × d_model⌋)`
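
The parameter counts in the table can be reproduced from the `d_ff` formula with some arithmetic. The sketch below assumes bias-free linear layers, two RMSNorm gain vectors per block plus one final norm, tied embeddings, and no learned position parameters (RoPE has none). These assumptions are not confirmed by the source, but they do reproduce the listed totals. The helper names are hypothetical.

```python
def swiglu_d_ff(d_model, multiple=256):
    """round_up_256(floor(2/3 * 4 * d_model)) using integer arithmetic."""
    hidden = (2 * 4 * d_model) // 3
    return -(-hidden // multiple) * multiple  # ceiling division, then scale back up


def count_params(d_model, n_layers, vocab_size=32_000):
    """Approximate parameter count under the assumptions stated above."""
    d_ff = swiglu_d_ff(d_model)
    attention = 4 * d_model * d_model  # Q, K, V and output projections, no biases
    mlp = 3 * d_model * d_ff           # gate, up and down projections
    norms = 2 * d_model                # two RMSNorm gain vectors per block
    embedding = vocab_size * d_model   # shared with the LM head (weight tying)
    final_norm = d_model
    return n_layers * (attention + mlp + norms) + embedding + final_norm


print(swiglu_d_ff(768), count_params(768, 12))    # 2048 109529856
print(swiglu_d_ff(1024), count_params(1024, 9))   # 2816 148392960
```

The number of heads does not appear in the count because standard multi-head attention splits `d_model` across heads without adding parameters.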

---

## ⚙️ Installation

**Requires:** Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended)

```bash
# Create and activate a conda environment
conda create -n pytorch python=3.11
conda activate pytorch

# Install dependencies
pip install -r requirements.txt
```

---

## 🚀 Training

### Start a new run (recommended settings for a 4 GB RTX 3050)

```bash
python train.py \
  --config 150M \
  --data_dir tokenizer/data \
  --batch_size 2 \
  --grad_accum 16 \
  --grad_checkpoint \
  --dtype bf16 \
  --max_steps 5000 \
  --run_dir runs/sllm_150m \
  --log_every 10 \
  --save_every 500 \
  --val_every 500 \
  --warmup_steps 200
```

### Resume from a checkpoint

```bash
python train.py \
  --resume \
  --run_dir runs/sllm_150m \
  --extra_steps 5000 \
  --data_dir tokenizer/data \
  --batch_size 2 \
  --grad_accum 16 \
  --grad_checkpoint \
  --dtype bf16
```

### Key training flags

| Flag | Default | Description |
|------|---------|-------------|
| `--config` | `100M` | Model size (`100M` or `150M`) |
| `--batch_size` | `4` | Per-device micro-batch size |
| `--grad_accum` | `8` | Gradient accumulation steps |
| `--max_steps` | unlimited | Absolute step target |
| `--extra_steps` | — | Run N more steps from current checkpoint |
| `--resume` | — | Resume from latest checkpoint in `--run_dir` |
| `--grad_checkpoint` | — | Enable gradient checkpointing (saves VRAM) |
| `--dtype` | `bf16` | Mixed precision dtype (`fp32`, `fp16`, `bf16`) |
| `--synthetic` | — | Use random data (for testing without real shards) |
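
To see how `--batch_size` and `--grad_accum` interact, the following sketch computes the tokens consumed per optimizer step. It assumes a single GPU and that every sequence is packed to the full 1024-token context; `tokens_per_step` is a hypothetical helper, not part of `train.py`.

```python
def tokens_per_step(batch_size, grad_accum, context_len=1024):
    """Tokens seen per optimizer step: micro-batch x accumulation steps x context."""
    return batch_size * grad_accum * context_len


# The 4 GB settings above (batch 2, accumulate 16) and the defaults
# (batch 4, accumulate 8) give the same effective batch.
low_vram = tokens_per_step(batch_size=2, grad_accum=16)
defaults = tokens_per_step(batch_size=4, grad_accum=8)
print(low_vram, defaults)   # 32768 32768

# Total tokens for a 5000-step run at that effective batch size.
print(low_vram * 5000)      # 163840000
```

Halving the micro-batch while doubling accumulation keeps the optimization behaviour comparable and lowers peak activation memory, at the cost of more forward/backward passes per step.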

---

## 📊 Training Dashboard

Visualize training metrics in a dark-mode 6-panel dashboard:

```bash
# Static plot
python plot_training.py --run_dir runs/sllm_150m

# Live mode — refresh every 30 seconds while training
python plot_training.py --run_dir runs/sllm_150m --live --interval 30

# Compare two runs
python plot_training.py --run_dir runs/run_a runs/run_b

# Save to file
python plot_training.py --run_dir runs/sllm_150m --save dashboard.png
```

**Dashboard panels:** Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm
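
The "raw + EMA" loss curve can be sketched as an exponential moving average over the logged losses. The smoothing factor `alpha`, the `"loss"` key, and both helper names are assumptions for illustration; the fields written to `train_log.jsonl` and the smoothing used by `plot_training.py` may differ.

```python
import json


def read_losses(lines, key="loss"):
    """Extract one metric from JSONL lines, skipping records that lack it."""
    losses = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record:
            losses.append(float(record[key]))
    return losses


def ema(values, alpha=0.1):
    """Exponential moving average; the first value seeds the average."""
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed


# Two made-up log records standing in for lines of train_log.jsonl.
sample_log = ['{"step": 10, "loss": 4.0}', '{"step": 20, "loss": 2.0}']
print(ema(read_losses(sample_log), alpha=0.5))  # [4.0, 3.0]
```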

---

## 💬 Fine-Tuning (Chat Model)

After pre-training, you can fine-tune with supervised instruction data:

```bash
# 1. Prepare chat data
python finetune/prepare_data.py

# 2. Fine-tune
python finetune/sft_train.py \
  --base_ckpt runs/sllm_150m/ckpt_0011500.pt \
  --run_dir runs/sllm_150m_chat \
  --max_steps 2500 \
  --batch_size 4 \
  --grad_accum 8 \
  --grad_checkpoint

# 3. Chat interactively
python finetune/chat.py --run_dir runs/sllm_150m_chat
```

---

## 🔡 Tokenizer

A custom BPE tokenizer trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), the educational subset of FineWeb:

- **32,000 token vocabulary**
- **Byte fallback** — zero out-of-vocabulary tokens (even math symbols and emojis work)
- **Code-aware** — preserves `snake_case`, operators (`==`, `->`, `**`), and indentation
- **Contraction-aware** — `don't`, `I've`, `they're` are split correctly
- Packaged as a `PreTrainedTokenizerFast` (HuggingFace-compatible)
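
Byte fallback guarantees zero OOV because any string decomposes into UTF-8 bytes, and 256 byte tokens cover every possible byte. The sketch below shows only that decomposition, not the trained tokenizer. The `<0xNN>` token naming is an assumption based on the common convention for byte-fallback vocabularies, and both helpers are hypothetical.

```python
def byte_fallback_tokens(text):
    """Represent text purely as byte tokens, one per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def decode_byte_tokens(tokens):
    """Invert byte_fallback_tokens: parse the hex digits and decode as UTF-8."""
    return bytes(int(t[3:-1], 16) for t in tokens).decode("utf-8")


tokens = byte_fallback_tokens("∑")
print(tokens)                      # ['<0xE2>', '<0x88>', '<0x91>']
print(decode_byte_tokens(tokens))  # ∑
```

In a trained tokenizer, frequent characters and words get their own merged tokens, so byte tokens should appear only for text the learned vocabulary does not cover.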

Training data is packed into flat binary `.bin` shards (`np.uint16`, 100M tokens each) for fast memory-mapped loading.
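
A memory-mapped shard lets the loader read a training window without loading the whole file. The sketch below uses only the standard library to show the idea; it is not the code in `data/dataloader.py`, and `sample_window` is a hypothetical helper. It assumes the shard is a headerless array of native-endian 16-bit token IDs.

```python
import mmap
import random


def sample_window(shard_path, context_len, rng=random):
    """Return (inputs, targets), each of length context_len, from a uint16 shard."""
    with open(shard_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # View the mapped bytes as unsigned 16-bit integers without copying.
        with memoryview(mm) as raw, raw.cast("H") as tokens:
            start = rng.randrange(len(tokens) - context_len)
            # Only this slice is copied out; the OS pages in what it needs.
            window = tokens[start:start + context_len + 1].tolist()
    # Targets are the inputs shifted left by one token (next-token prediction).
    return window[:-1], window[1:]
```

With NumPy, `np.memmap(shard_path, dtype=np.uint16, mode="r")` gives an equivalent read-only view. At 2 bytes per token, a 100M-token shard occupies about 200 MB on disk, and `uint16` is sufficient because the 32,000-entry vocabulary fits below 65,536.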

See [`tokenizer_walkthrough.md`](tokenizer_walkthrough.md) for a full pipeline deep-dive.

---

## 🧠 Architecture Deep-Dive

See [`model_explained.md`](model_explained.md) for a plain-language walkthrough of every model component, including:
- Why RMSNorm is faster than LayerNorm
- How RoPE encodes relative position without extra parameters
- Why SwiGLU outperforms GELU
- How weight tying saves vocab × d_model parameters (~24.6M for `SLLM_100M`, ~32.8M for `SLLM_150M`)
- Flash Attention and gradient checkpointing explained
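
To give a flavour of the RoPE point above, this small pure-Python sketch rotates query and key vectors by position-dependent angles and shows that their dot product depends only on the distance between positions. It uses interleaved pairs and a base of 10000, which are common choices but assumptions here; `model/rope.py` may pair dimensions differently.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # earlier pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Same offset (3) at very different absolute positions: the scores agree
# up to floating-point error.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 103), rope(k, 100))
print(abs(near - far) < 1e-9)  # True
```

The rotation angles are fixed functions of position, so no parameters are learned for position encoding.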

---

## 📋 Checkpoints & Logging

- Checkpoints are saved to `<run_dir>/ckpt_NNNNNNN.pt` every `--save_every` steps and on a clean exit, including an interrupt with Ctrl+C
- Metrics are appended to `<run_dir>/train_log.jsonl` (one JSON line per log step)
- Each checkpoint stores: model weights, optimizer state, step number, loss, and config name
- Resuming auto-detects the correct model config from the checkpoint
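
Given the `ckpt_NNNNNNN.pt` naming scheme, locating the latest checkpoint for a resume could look like the sketch below. `latest_checkpoint` is a hypothetical helper written for this example; the lookup that `train.py --resume` performs may work differently.

```python
import re
from pathlib import Path

# Matches names like ckpt_0011500.pt and captures the step number.
CKPT_PATTERN = re.compile(r"ckpt_(\d+)\.pt$")


def latest_checkpoint(run_dir):
    """Return (step, path) for the highest-numbered checkpoint, or None."""
    best = None
    for path in Path(run_dir).glob("ckpt_*.pt"):
        match = CKPT_PATTERN.match(path.name)
        if match is None:
            continue
        step = int(match.group(1))
        if best is None or step > best[0]:
            best = (step, path)
    return best


found = latest_checkpoint("runs/sllm_150m")
if found is not None:
    step, path = found
    # With --extra_steps 5000, the new target would be step + 5000.
    print(f"resume from {path} at step {step}, target {step + 5000}")
```

Comparing the parsed integer rather than the file name keeps the ordering correct even if the step count ever outgrows the zero-padded width.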

---

## 📦 Requirements

```
torch>=2.3.0
datasets>=2.14.0       # HuggingFace datasets (streaming)
tokenizers>=0.15.0     # Fast BPE tokenizer
transformers>=4.40.0   # PreTrainedTokenizerFast
numpy>=1.26.0
tqdm
matplotlib
```

---

## 📄 License

This project is released under the MIT license (as declared in the model card metadata) and is intended for educational purposes.