SLLM: Small Language Model from Scratch

A GPT-style decoder-only transformer built and trained from scratch in PyTorch. Two model sizes are available (100M and 150M parameters), designed to fit on consumer GPUs as small as a 4 GB VRAM card (e.g. RTX 3050).


✨ Features

  • Architecture: Decoder-only transformer (GPT-style) with modern improvements
    • RMSNorm instead of LayerNorm (faster, no bias)
    • RoPE (Rotary Position Embeddings), as used in LLaMA, Mistral, Gemma
    • SwiGLU feed-forward network, which outperforms GELU at the same parameter count
    • Flash Attention via F.scaled_dot_product_attention (avoids O(T²) attention memory)
    • Weight-tied token embeddings + LM head (saves vocab size × d_model parameters: ~24.6M for the 100M config, ~32.8M for the 150M config)
  • Training
    • bf16 mixed-precision with gradient accumulation
    • Gradient checkpointing for low-VRAM GPUs
    • Cosine LR schedule with linear warmup
    • Resumable checkpointing (--resume, --extra_steps)
    • JSONL metric logging + live training dashboard
  • Custom BPE Tokenizer: trained on FineWeb-Edu with byte fallback (zero OOV)
  • Supervised Fine-Tuning (SFT): chat model pipeline included in finetune/
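
The cosine schedule with linear warmup listed under Training can be sketched as below. This is a minimal illustration, not the code in train.py: the function name lr_at_step and the peak_lr and min_lr values are assumptions chosen for the example.

```python
import math


def lr_at_step(step, max_steps, warmup_steps, peak_lr=3e-4, min_lr=3e-5):
    """Learning rate at a given optimizer step.

    peak_lr and min_lr are hypothetical values, not the project's defaults.
    """
    # Linear warmup: ramp from peak_lr / warmup_steps up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Past the end of the schedule, hold the floor value.
    if step >= max_steps:
        return min_lr
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

With --warmup_steps 200 and --max_steps 5000, this sketch reaches the peak rate at the end of warmup and decays to the floor at step 5000.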

🏗️ Project Structure

sllm/
├── model/                   # Model architecture
│   ├── config.py            # ModelConfig dataclass (SLLM_100M, SLLM_150M presets)
│   ├── model.py             # SLLM: full model assembly, weight init, gradient checkpointing
│   ├── block.py             # TransformerBlock (pre-norm, residual)
│   ├── attention.py         # Causal multi-head self-attention + RoPE
│   ├── mlp.py               # SwiGLU feed-forward network
│   ├── norm.py              # RMSNorm
│   └── rope.py              # Rotary Position Embeddings
│
├── tokenizer/               # Custom BPE tokenizer
│   ├── normalizer.py        # HTML stripping, unicode NFC, whitespace cleanup
│   ├── pretokenizer.py      # Regex pre-tokenizer (code-aware, contraction-aware)
│   ├── bpe.py               # BPE model config with byte fallback (32k vocab)
│   ├── traintokenizer.py    # Train on FineWeb-Edu stream
│   ├── post_processor.py    # Append <|endoftext|> to every sequence
│   ├── wrap_tokenizer.py    # Wrap into PreTrainedTokenizerFast
│   └── tokenize_dataset.py  # Pack tokens into flat binary .bin shards
│
├── data/
│   └── dataloader.py        # Memory-mapped shard dataloader
│
├── finetune/                # Supervised fine-tuning (SFT) pipeline
│   ├── prepare_data.py      # Prepare chat data
│   ├── sft_train.py         # SFT training loop
│   ├── sft_dataset.py       # Chat dataset
│   └── chat.py              # Interactive chat with the fine-tuned model
│
├── train.py                 # Pre-training loop
├── plot_training.py         # Training dashboard (static + live mode)
├── requirements.txt
├── model_explained.md       # Deep-dive into every model component
└── tokenizer_walkthrough.md # Tokenizer design and pipeline walkthrough

📐 Model Configs

Config      d_model   Heads   Layers   Parameters
SLLM_100M   768       12      12       ~109.5M
SLLM_150M   1024      16      9        ~148.4M

Both configs use:

  • Context length: 1024 tokens
  • Vocab size: 32,000 (custom BPE)
  • SwiGLU d_ff: computed as round_up_256(⌊2/3 × 4 × d_model⌋)
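
As a sanity check on the table, the d_ff rule and the parameter totals can be reproduced with a few lines of arithmetic. The sketch below assumes bias-free linear layers, two RMSNorm gain vectors per block plus one final norm, no learned position parameters (RoPE), and a single embedding matrix shared with the LM head. The helper names are illustrative and are not taken from model/config.py.

```python
def round_up(x, multiple=256):
    """Round x up to the next multiple of `multiple`."""
    return ((x + multiple - 1) // multiple) * multiple


def swiglu_d_ff(d_model):
    """round_up_256(floor(2/3 * 4 * d_model)), using exact integer arithmetic."""
    return round_up((2 * 4 * d_model) // 3)


def count_params(d_model, n_layers, vocab_size=32_000):
    """Parameter count under the assumptions stated in the lead-in."""
    d_ff = swiglu_d_ff(d_model)
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    mlp = 3 * d_model * d_ff       # gate, up and down projections
    norms = 2 * d_model            # two RMSNorm gains per block
    embed = vocab_size * d_model   # counted once because of weight tying
    final_norm = d_model
    return n_layers * (attn + mlp + norms) + embed + final_norm


print(swiglu_d_ff(768), count_params(768, 12))    # 2048 109529856
print(swiglu_d_ff(1024), count_params(1024, 9))   # 2816 148392960
```

Under these assumptions the totals round to the ~109.5M and ~148.4M figures in the table. The head count does not affect the total, since it only changes how d_model is split.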

⚙️ Installation

Requires: Python 3.10+, PyTorch 2.3+, CUDA-capable GPU (bf16 recommended)

# Create and activate a conda environment
conda create -n pytorch python=3.11
conda activate pytorch

# Install dependencies
pip install -r requirements.txt

🚀 Training

Start a new run (RTX 3050 4GB recommended settings)

python train.py \
  --config 150M \
  --data_dir tokenizer/data \
  --batch_size 2 \
  --grad_accum 16 \
  --grad_checkpoint \
  --dtype bf16 \
  --max_steps 5000 \
  --run_dir runs/sllm_150m \
  --log_every 10 \
  --save_every 500 \
  --val_every 500 \
  --warmup_steps 200

Resume from a checkpoint

python train.py \
  --resume \
  --run_dir runs/sllm_150m \
  --extra_steps 5000 \
  --data_dir tokenizer/data \
  --batch_size 2 \
  --grad_accum 16 \
  --grad_checkpoint \
  --dtype bf16

Key training flags

Flag                Default     Description
--config            100M        Model size (100M or 150M)
--batch_size        4           Per-device micro-batch size
--grad_accum        8           Gradient accumulation steps
--max_steps         unlimited   Absolute step target
--extra_steps       -           Run N more steps from current checkpoint
--resume            -           Resume from latest checkpoint in --run_dir
--grad_checkpoint   -           Enable gradient checkpointing (saves VRAM)
--dtype             bf16        Mixed precision dtype (fp32, fp16, bf16)
--synthetic         -           Use random data (for testing without real shards)
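
To see how --batch_size and --grad_accum interact, it can help to compute the number of tokens consumed per optimizer step. The sketch below assumes a single GPU and that every sequence is packed to the full 1024-token context. The function name is illustrative only.

```python
def tokens_per_step(batch_size, grad_accum, context_len=1024):
    """Tokens processed per optimizer step on one device."""
    return batch_size * grad_accum * context_len


# RTX 3050 settings from the example command: 2 x 16 x 1024
print(tokens_per_step(2, 16))          # 32768
# Default flags: 4 x 8 x 1024, the same effective batch
print(tokens_per_step(4, 8))           # 32768
# Tokens seen over a 5000-step run
print(5000 * tokens_per_step(2, 16))   # 163840000
```

Halving the micro-batch and doubling the accumulation steps keeps the effective batch unchanged while lowering peak activation memory.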

📊 Training Dashboard

Visualize training metrics in a dark-mode 6-panel dashboard:

# Static plot
python plot_training.py --run_dir runs/sllm_150m

# Live mode: refresh every 30 seconds while training
python plot_training.py --run_dir runs/sllm_150m --live --interval 30

# Compare two runs
python plot_training.py --run_dir runs/run_a runs/run_b

# Save to file
python plot_training.py --run_dir runs/sllm_150m --save dashboard.png

Dashboard panels: Training Loss (raw + EMA) · Validation Loss · Learning Rate · Tokens/sec · VRAM usage · Gradient norm


💬 Fine-Tuning (Chat Model)

After pre-training, you can fine-tune with supervised instruction data:

# 1. Prepare chat data
python finetune/prepare_data.py

# 2. Fine-tune
python finetune/sft_train.py \
  --base_ckpt runs/sllm_150m/ckpt_0011500.pt \
  --run_dir runs/sllm_150m_chat \
  --max_steps 2500 \
  --batch_size 4 \
  --grad_accum 8 \
  --grad_checkpoint

# 3. Chat interactively
python finetune/chat.py --run_dir runs/sllm_150m_chat

🔑 Tokenizer

A custom BPE tokenizer trained on FineWeb-Edu, the educational subset of FineWeb:

  • 32,000 token vocabulary
  • Byte fallback: zero out-of-vocabulary tokens (even math symbols and emojis work)
  • Code-aware: preserves snake_case, operators (==, ->, **), and indentation
  • Contraction-aware: don't, I've, and they're are split correctly
  • Packaged as a PreTrainedTokenizerFast (HuggingFace-compatible)

Training data is packed into flat binary .bin shards (np.uint16, 100M tokens each) for fast memory-mapped loading.
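
Reading such a shard can be sketched with np.memmap, which maps the file into memory without loading it all at once. This is an illustration under assumptions, not the code in data/dataloader.py: the function names and the random-offset sampling strategy are hypothetical.

```python
import numpy as np


def load_shard(path):
    """Memory-map a flat uint16 token shard in read-only mode."""
    return np.memmap(path, dtype=np.uint16, mode="r")


def sample_batch(tokens, batch_size, context_len, rng):
    """Draw random windows; targets are the inputs shifted by one token."""
    # Each window needs context_len + 1 tokens, so the last valid start
    # is len(tokens) - context_len - 1 (the upper bound is exclusive).
    starts = rng.integers(0, len(tokens) - context_len, size=batch_size)
    x = np.stack([tokens[s : s + context_len] for s in starts])
    y = np.stack([tokens[s + 1 : s + 1 + context_len] for s in starts])
    # Cast to int64 because embedding lookups generally expect 64-bit indices.
    return x.astype(np.int64), y.astype(np.int64)
```

uint16 is sufficient because the 32,000-token vocabulary fits below 65,536, which halves the on-disk size relative to int32.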

See tokenizer_walkthrough.md for a full pipeline deep-dive.


🧠 Architecture Deep-Dive

See model_explained.md for a plain-language walkthrough of every model component, including:

  • Why RMSNorm is faster than LayerNorm
  • How RoPE encodes relative position without extra parameters
  • Why SwiGLU outperforms GELU
  • How weight tying saves a full vocab size × d_model matrix (~32.8M parameters for SLLM_150M)
  • Flash Attention and gradient checkpointing explained
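
As a small illustration of the RoPE point, the NumPy sketch below rotates pairs of channels by a position-dependent angle and has no learned parameters. It is not the implementation in model/rope.py: the base of 10000 and the interleaved pairing of channels are common conventions assumed here. The head dimension of 64 follows from d_model / heads in both configs.

```python
import numpy as np


def rope_angles(seq_len, head_dim, base=10000.0):
    """Rotation angle for every (position, channel pair)."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(np.arange(seq_len), inv_freq)  # (seq_len, head_dim // 2)


def apply_rope(x, angles):
    """Rotate channel pairs (x[2i], x[2i+1]) of x with shape (seq_len, head_dim)."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
angles = rope_angles(seq_len=16, head_dim=64)
Q = apply_rope(np.tile(q, (16, 1)), angles)  # the same query placed at every position
K = apply_rope(np.tile(k, (16, 1)), angles)  # the same key placed at every position
# The attention score depends only on the offset between positions (here 2).
print(np.allclose(Q[5] @ K[3], Q[9] @ K[7]))  # True
```

Because each pair is rotated rather than shifted, vector norms are preserved and only the relative angle between query and key positions enters the dot product.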

📋 Checkpoints & Logging

  • Checkpoints are saved to <run_dir>/ckpt_NNNNNNN.pt every --save_every steps and on clean exit (Ctrl+C)
  • Metrics are appended to <run_dir>/train_log.jsonl (one JSON line per log step)
  • Each checkpoint stores: model weights, optimizer state, step number, loss, and config name
  • Resuming auto-detects the correct model config from the checkpoint
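
Given the ckpt_NNNNNNN.pt naming scheme, locating the newest checkpoint in a run directory can be sketched as follows. This is a hypothetical helper for inspecting runs from your own scripts, and it does not claim to mirror how --resume is implemented in train.py.

```python
import re
from pathlib import Path

# Matches names such as ckpt_0011500.pt and captures the step number.
CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.pt$")


def latest_checkpoint(run_dir):
    """Return the checkpoint path with the highest step, or None if there is none."""
    best_step, best_path = -1, None
    for path in Path(run_dir).iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Comparing the parsed step numbers rather than sorting filenames keeps the result correct even if the zero padding ever changes width.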

📦 Requirements

torch>=2.3.0
datasets>=2.14.0       # HuggingFace datasets (streaming)
tokenizers>=0.15.0     # Fast BPE tokenizer
transformers>=4.40.0   # PreTrainedTokenizerFast
numpy>=1.26.0
tqdm
matplotlib

📄 License

This project is released for educational purposes.
