# Tiny-GPT: 0.5B MoE Language Model
A clean, efficient implementation of a Mixture-of-Experts GPT that fits on modest GPUs (4GB VRAM) while training on large datasets.
## Main Goal

Generate proper English text, not gibberish!
## Quick Stats
| Metric | Value |
|---|---|
| Model Size | 0.5B parameters (520M) |
| Active per Token | 180M parameters (via MoE routing) |
| Architecture | 12 Transformer layers, 8 experts/layer, top-2 routing |
| Training Data | WikiText-103 (103M tokens, ~500MB) |
| GPU Memory | 0.97 GiB (model weights only) |
| Training Time | ~10-20 hours on RTX 2050 (10k steps) |
| Tokenizer | GPT-2 BPE (50,257 vocab via tiktoken) |
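The parameter counts in the table can be sanity-checked from the hyperparameters listed in the Configuration section. This back-of-the-envelope count assumes tied input/output embeddings and a 768 → 3072 → 768 FFN per expert (as described under Model Architecture); biases and LayerNorm weights are ignored:

```python
# Hyperparameters from main.py
VOCAB, D, BLOCK = 50_257, 768, 128
LAYERS, EXPERTS, TOP_K, FFN = 12, 8, 2, 3_072

embed = VOCAB * D + BLOCK * D        # token + positional embeddings (tied head)
attn = 4 * D * D                     # Q, K, V and output projections
expert = 2 * D * FFN                 # one expert's up- and down-projection
router = D * EXPERTS                 # routing weights

total = embed + LAYERS * (attn + EXPERTS * expert + router)
active = embed + LAYERS * (attn + TOP_K * expert + router)
print(f"total = {total / 1e6:.0f}M, active = {active / 1e6:.0f}M")
# → total = 520M, active = 180M
```

The totals line up with the table: all 8 experts exist in memory (520M), but each token only passes through 2 of them (180M active).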
## Quick Start
### 1. Prepare Dataset

```bash
python prepare_data.py
```
Downloads WikiText-103 and tokenizes to memory-mapped binary files (~500MB). This is a one-time operation that takes 10-30 minutes.
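The memory-mapped layout works roughly like this (an illustrative sketch of the common nanoGPT-style format; `prepare_data.py` may differ in details). Token ids fit in `uint16` because the vocabulary (50,257) is under 65,536, so each token costs 2 bytes on disk and batches can be read without loading the whole file into RAM:

```python
import os
import tempfile

import numpy as np

# Hypothetical example token ids (2 bytes each since 50257 < 2**16)
tokens = np.array([464, 2106, 286, 262, 6881], dtype=np.uint16)

path = os.path.join(tempfile.mkdtemp(), "train.bin")
tokens.tofile(path)  # write the raw uint16 binary

# Training code can memory-map the file instead of loading it into RAM:
data = np.memmap(path, dtype=np.uint16, mode="r")
window = data[0:4]   # slicing only touches the pages it needs
```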
### 2. Train

```bash
python main.py
```
Starts training from scratch with:
- Learning rate: 1.5e-4 (lowered for stability)
- Warmup: 500 steps (better convergence)
- Total steps: 10,000 (more thorough training)
- Batch size: 16 (gradient accumulation of 2x8)
Training progress shows in real-time via rich progress bar.
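The "2x8" accumulation above can be sketched as follows (a minimal numpy stand-in for the real training loop, with a fake gradient in place of `loss.backward()`): two micro-batches of 8 contribute scaled gradients, then one optimizer step fires per 16 examples.

```python
import numpy as np

MICRO_BATCH, ACCUM_STEPS = 8, 2    # 2 x 8 = effective batch of 16
LR = 1.5e-4

rng = np.random.default_rng(0)
weights = np.zeros(4)
batches = [rng.normal(size=(MICRO_BATCH, 4)) for _ in range(ACCUM_STEPS)]

grad_accum = np.zeros_like(weights)
for batch in batches:
    grad = batch.mean(axis=0)          # stand-in for one loss.backward() pass
    grad_accum += grad / ACCUM_STEPS   # scale so the sum equals the mean over 16
weights -= LR * grad_accum             # a single optimizer step per 16 examples
```

This is how a 4 GB GPU gets the gradient quality of batch 16 while only ever holding activations for 8 sequences.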
### 3. Generate Text

```bash
python run.py
```
## Use Hugging Face Hub (instead of local/GitHub checkpoints)
### 1. Upload checkpoints to HF Hub

```bash
pip install huggingface_hub
export HF_TOKEN=your_hf_token
python push_to_hf.py --repo-id yourname/Tiny-GPT
```
This uploads:
- `checkpoints/best.pt` → `best.pt`
- `checkpoints/latest.pt` → `latest.pt` (if present)
### 2. Run inference directly from HF Hub

```bash
python run.py --hf-repo yourname/Tiny-GPT --prompt "The future of AI is"
```
Optional flags:
- `--hf-filename best.pt`
- `--hf-revision main`
- `--hf-token <token>` (or use the `HF_TOKEN` env var)
## File Structure
```
Tiny-GPT/
├── main.py              # Training script
├── run.py               # Inference script (NEW)
├── prepare_data.py      # Dataset preparation
├── mini_gpt.py          # Deprecated v1 (reference only)
├── reset_training.sh    # Clean old checkpoints
├── wait_for_dataset.sh  # Monitor data preparation
│
├── data/
│   ├── train.bin        # ~1.8M examples → ~80M tokens
│   ├── val.bin          # ~3.7k examples → ~1.7M tokens
│   ├── test.bin         # ~4.3k examples → ~2.0M tokens
│   └── meta.txt         # Metadata
│
└── checkpoints/
    ├── latest.pt        # Most recent checkpoint
    └── best.pt          # Best validation loss checkpoint
```
## Configuration
All hyperparameters are defined in `main.py`:

```python
BLOCK_SIZE = 128      # Context window
EMBED_DIM = 768       # Model width
NUM_LAYERS = 12       # Transformer blocks
NUM_EXPERTS = 8       # Experts per MoE layer
TOP_K = 2             # Experts used per token
LR = 1.5e-4           # Learning rate (adjusted)
WARMUP_STEPS = 500    # Warmup schedule
MAX_ITERS = 10000     # Total training steps
GRAD_CLIP = 1.0       # Gradient clipping
```
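The warmup and step constants combine into a schedule roughly like this (a common linear-warmup/cosine-decay sketch; the exact decay shape in `main.py` is an assumption):

```python
import math

LR, WARMUP_STEPS, MAX_ITERS = 1.5e-4, 500, 10_000

def lr_at(step):
    # Linear warmup to LR over WARMUP_STEPS, then cosine decay toward 0.
    if step < WARMUP_STEPS:
        return LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_ITERS - WARMUP_STEPS)
    return 0.5 * LR * (1.0 + math.cos(math.pi * progress))
```

A longer warmup means the model spends more early steps at a tiny learning rate, which is exactly what prevents the NaN divergence described below.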
## Expected Training Progress
With the fixed hyperparameters:
- Step 1: Loss ~8.0
- Step 500: Loss ~6.5-7.0
- Step 2500: Loss ~4.5-5.0
- Step 5000: Loss ~3.8-4.2
- Step 10000: Loss ~3.5-3.8
**Quality indicator:** the model starts generating coherent English by step 2000+.
## What Changed?
### Before (Broken)

- Learning rate: 3e-4 (too high)
- Warmup: 200 steps (insufficient)
- Auto-resume: enabled (got stuck at NaN)
- Training loss: diverged to NaN
- Output: "hi defencesaternal Thirty shows allowanceBad Leh..." ❌
### After (Fixed)

- Learning rate: 1.5e-4 (stable)
- Warmup: 500 steps (better convergence)
- Auto-resume: disabled (start fresh)
- Training loss: smooth convergence
- Output: "The history of the universe began with the Big Bang..." ✅
## Model Architecture
```
Input Tokens
    ↓
Embedding + Positional Encoding (768-dim)
    ↓
[x12 Transformer Blocks]
 ├─ Multi-Head Attention (12 heads)
 │   └─ Output: 768-dim
 └─ Mixture-of-Experts Layer
     ├─ 8 Expert FFNs (768 → 3072 → 768)
     ├─ Router: selects top-2 experts per token
     └─ Load-balancing auxiliary loss
    ↓
Layer Norm
    ↓
Output Linear → Logits (50,257)
    ↓
Cross-Entropy Loss
```
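The top-2 routing step can be sketched for a single token like this (a numpy illustration with tiny stand-in dimensions, not the model's PyTorch code): the router scores all 8 experts, the best two are selected, their scores are softmaxed, and the token's output is the gate-weighted sum of those two expert FFNs.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 64                      # tiny stand-ins for the real 768 / 3072
NUM_EXPERTS, TOP_K = 8, 2

x = rng.normal(size=D)                               # one token's hidden state
router_w = rng.normal(size=(D, NUM_EXPERTS)) * 0.1   # router projection
experts = [(rng.normal(size=(D, H)) * 0.1,           # up:   D -> H
            rng.normal(size=(H, D)) * 0.1)           # down: H -> D
           for _ in range(NUM_EXPERTS)]

logits = x @ router_w
top = np.argsort(logits)[-TOP_K:]                    # indices of the top-2 experts
gates = np.exp(logits[top] - logits[top].max())
gates /= gates.sum()                                 # softmax over the chosen two

out = np.zeros(D)
for g, idx in zip(gates, top):
    w_up, w_down = experts[idx]
    out += g * (np.maximum(x @ w_up, 0.0) @ w_down)  # gated ReLU FFN
```

Only the two selected experts run a forward pass, which is why just 180M of the 520M parameters are active per token.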
**Memory trick:** the `CPUOffloadAdamW` optimizer keeps fp32 master weights plus momentum/variance state in CPU RAM to save GPU VRAM:
- GPU: fp16 model weights + fp16 gradients (~1 GB)
- CPU: fp32 master weights + fp32 m/v (~4 GB)
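One update of that split can be sketched in numpy (an illustrative stand-in for `CPUOffloadAdamW`, whose real internals are not shown here; weight decay is omitted): the fp16 gradient from the "GPU" is upcast, the Adam math runs in fp32 against the "CPU" master copy, and a fresh fp16 copy goes back for the next forward pass.

```python
import numpy as np

lr, b1, b2, eps, t = 1.5e-4, 0.9, 0.999, 1e-8, 1     # first optimizer step

rng = np.random.default_rng(0)
master = rng.normal(size=8).astype(np.float32)       # fp32 master weights ("CPU")
m = np.zeros_like(master)                            # fp32 momentum ("CPU")
v = np.zeros_like(master)                            # fp32 variance ("CPU")

grad_fp16 = rng.normal(size=8).astype(np.float16)    # fp16 gradient from the "GPU"
g = grad_fp16.astype(np.float32)                     # upcast before updating state

m = b1 * m + (1 - b1) * g                            # Adam moment updates
v = b2 * v + (1 - b2) * g * g
m_hat = m / (1 - b1 ** t)                            # bias correction
v_hat = v / (1 - b2 ** t)
master -= lr * m_hat / (np.sqrt(v_hat) + eps)

model_fp16 = master.astype(np.float16)               # fp16 copy back to the "GPU"
```

Keeping `master`, `m`, and `v` in fp32 avoids the precision loss that makes pure-fp16 Adam unstable, while the GPU only ever holds fp16 tensors.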
## Using run.py
### Interactive Mode (Default)

```bash
python run.py
```
Type prompts and press Enter. Commands:
- `/temp 0.8` - set temperature (higher = more random)
- `/len 150` - set max tokens
- `/topk 40` - enable top-k sampling
- `/topp 0.9` - set nucleus sampling threshold
- `quit` - exit
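The `/temp`, `/topk`, and `/topp` knobs interact roughly like this (a generic sampling sketch, not `run.py`'s actual code): temperature rescales the logits, top-k discards everything below the k-th best, and nucleus (top-p) keeps only the smallest set of tokens whose probabilities cover mass `p`.

```python
import numpy as np

def sample(logits, temperature=0.8, top_k=40, top_p=0.9, rng=None):
    """Temperature + top-k + nucleus sampling over a vector of logits."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k:
        cutoff = np.sort(logits)[-min(top_k, len(logits))]  # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p:
        order = np.argsort(probs)[::-1]             # most to least likely
        keep = np.cumsum(probs[order]) <= top_p     # nucleus of mass <= p
        keep[0] = True                              # always keep the best token
        mask = np.zeros(len(probs), dtype=bool)
        mask[order[keep]] = True
        probs = np.where(mask, probs, 0.0)
        probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

With `temperature` near 0 and `top_k=1` this collapses to greedy decoding; raising the temperature flattens the distribution and makes output more varied.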
### Single Prompt

```bash
python run.py --prompt "The future of AI is"
```
### Batch from File

```bash
python run.py --prompts prompts.txt  # One prompt per line
```
### Custom Checkpoint

```bash
python run.py --checkpoint checkpoints/best.pt
```
### Full Options

```bash
python run.py --help
```
## Monitoring Training
The training loop shows:

```
Step 5000 | Train 4.23 | Val 4.45 | LR 0.000097
```
Healthy indicators:
- ✅ Train loss smoothly decreases
- ✅ Val loss follows the same trend
- ✅ No NaN values
- ✅ Learning rate schedule works
- ✅ No gradient clipping (or occasional, < 10% of steps)
Red flags:
- ❌ Loss jumps or oscillates wildly
- ❌ NaN values appear
- ❌ Val loss stops improving (need more data or different hyperparameters)
- ❌ Constant gradient clipping (reduce the learning rate)
## Checkpointing
Saved automatically every 500 steps:
- `latest.pt`: most recent checkpoint (always usable)
- `best.pt`: best validation loss (for inference)
Load in Python:

```python
import torch

checkpoint = torch.load("checkpoints/best.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])          # model built from main.py's config
optimizer.load_state_dict(checkpoint["optimizer"])
step = checkpoint["step"]
```
## Troubleshooting
### Dataset not preparing

```bash
# Monitor progress
./wait_for_dataset.sh

# Check manually
ls -lh data/
```
### Training produces NaN

✅ Fixed: lowered the learning rate to 1.5e-4 and increased warmup.
### Model outputs gibberish

✅ Fixed: trained on a larger dataset (WikiText-103 instead of WikiText-2).
### Out of memory

- Reduce `MICRO_BATCH` to 1 (slower but less VRAM)
- Reduce `BLOCK_SIZE` to 64
- Enable gradient checkpointing (trades extra compute for lower VRAM)
### GPU not detected

```python
# Check in Python
import torch
print(torch.cuda.is_available())       # Should be True
print(torch.cuda.get_device_name(0))   # GPU name
```
## References
- Mixture of Experts: Switch Transformers
- GPT Architecture: Language Models are Unsupervised Multitask Learners
- Memory Optimization: Reducing Activation Recomputation in Large Transformer Models
- Tokenization: tiktoken
## License

MIT License - see the LICENSE file.
**Status:** ✅ Ready for training!
Next steps:
1. Wait for dataset preparation (`prepare_data.py`)
2. Run training (`python main.py`)
3. Generate text (`python run.py`)