YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

Tiny-GPT: 0.5B MoE Language Model

A clean, efficient implementation of a Mixture-of-Experts GPT that fits on modest GPUs (4GB VRAM) while training on large datasets.

๐ŸŽฏ Main Goal

Generate proper English text - not gibberish!

๐Ÿ“Š Quick Stats

Metric Value
Model Size 0.5B parameters (520M)
Active per Token 180M parameters (via MoE routing)
Architecture 12 Transformer layers, 8 experts/layer, top-2 routing
Training Data WikiText-103 (103M tokens, ~500MB)
GPU Memory 0.97 GiB (model weights only)
Training Time ~10-20 hours on RTX 2050 (10k steps)
Tokenizer GPT-2 BPE (50,257 vocab via tiktoken)

๐Ÿš€ Quick Start

1. Prepare Dataset

python prepare_data.py

Downloads WikiText-103 and tokenizes to memory-mapped binary files (~500MB). This is a one-time operation that takes 10-30 minutes.

2. Train

python main.py

Starts training from scratch with:

  • Learning rate: 1.5e-4 (lowered for stability)
  • Warmup: 500 steps (better convergence)
  • Total steps: 10,000 (more thorough training)
  • Batch size: 16 (gradient accumulation of 2x8)

Training progress shows in real-time via rich progress bar.

3. Generate Text

python run.py

๐Ÿค— Use Hugging Face Hub (instead of local/GitHub checkpoints)

1. Upload checkpoints to HF Hub

pip install huggingface_hub
export HF_TOKEN=your_hf_token
python push_to_hf.py --repo-id yourname/Tiny-GPT

This uploads:

  • checkpoints/best.pt โ†’ best.pt
  • checkpoints/latest.pt โ†’ latest.pt (if present)

2. Run inference directly from HF Hub

python run.py --hf-repo yourname/Tiny-GPT --prompt "The future of AI is"

Optional flags:

  • --hf-filename best.pt
  • --hf-revision main
  • --hf-token <token> (or use HF_TOKEN env var)

๐Ÿ“ File Structure

Tiny-GPT/
โ”œโ”€โ”€ main.py                  # Training script
โ”œโ”€โ”€ run.py                   # Inference script (NEW)
โ”œโ”€โ”€ prepare_data.py          # Dataset preparation
โ”œโ”€โ”€ mini_gpt.py              # Deprecated v1 (reference only)
โ”œโ”€โ”€ reset_training.sh        # Clean old checkpoints
โ”œโ”€โ”€ wait_for_dataset.sh      # Monitor data preparation
โ”‚
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ train.bin            # ~1.8M examples โ†’ ~80M tokens
โ”‚   โ”œโ”€โ”€ val.bin              # ~3.7k examples โ†’ ~1.7M tokens
โ”‚   โ”œโ”€โ”€ test.bin             # ~4.3k examples โ†’ ~2.0M tokens
โ”‚   โ””โ”€โ”€ meta.txt             # Metadata
โ”‚
โ””โ”€โ”€ checkpoints/
    โ”œโ”€โ”€ latest.pt            # Most recent checkpoint
    โ””โ”€โ”€ best.pt              # Best validation loss checkpoint

๐Ÿ”ง Configuration

All hyperparameters are defined in main.py:

BLOCK_SIZE    = 128              # Context window
EMBED_DIM     = 768              # Model width
NUM_LAYERS    = 12               # Transformer blocks
NUM_EXPERTS   = 8                # Experts per MoE layer
TOP_K         = 2                # Experts used per token
LR            = 1.5e-4           # Learning rate (adjusted)
WARMUP_STEPS  = 500              # Warmup schedule
MAX_ITERS     = 10000            # Total training steps
GRAD_CLIP     = 1.0              # Gradient clipping

๐Ÿ“ˆ Expected Training Progress

With fixed hyperparameters (new):

  • Step 1: Loss ~8.0
  • Step 500: Loss ~6.5-7.0
  • Step 2500: Loss ~4.5-5.0
  • Step 5000: Loss ~3.8-4.2
  • Step 10000: Loss ~3.5-3.8

Quality indicator: Model starts generating coherent English by step 2000+

๐Ÿ’ก What Changed?

Before (Broken)

Learning Rate: 3e-4 (too high)
Warmup: 200 steps (insufficient)
Auto-resume: Enabled (got stuck in NaN)
Trainer Loss: DIVERGES TO NAN
Output: "hi defencesaternal Thirty shows allowanceBad Leh..."  โŒ

After (Fixed)

Learning Rate: 1.5e-4 (stable)
Warmup: 500 steps (better convergence)
Auto-resume: Disabled (start fresh)
Training Loss: SMOOTH CONVERGENCE
Output: "The history of the universe began with the Big Bang..."  โœ“

๐Ÿง  Model Architecture

Input Tokens
    โ†“
Embedding + Positional Encoding (768-dim)
    โ†“
[x12 Transformer Blocks]
  โ”œโ”€ Multi-Head Attention (12 heads)
  โ”‚  โ””โ”€ Output: 768-dim
  โ””โ”€ Mixture-of-Experts Layer
     โ”œโ”€ 8 Expert FFNs (768โ†’3072โ†’768)
     โ”œโ”€ Router: Selects top-2 experts per token
     โ””โ”€ Load-balancing auxiliary loss
    โ†“
Layer Norm
    โ†“
Output Linear โ†’ Logits (50,257)
    โ†“
Cross-Entropy Loss

Memory Trick: The CPUOffloadAdamW optimizer keeps fp32 master weights + momentum/variance on CPU RAM to save GPU VRAM:

  • GPU: fp16 model weights + fp16 gradients (~1 GB)
  • CPU: fp32 master weights + fp32 m/v (~4 GB)

๐ŸŽฎ Using run.py

Interactive Mode (Default)

python run.py

Type prompts and press Enter. Commands:

  • /temp 0.8 - Set temperature (higher = more random)
  • /len 150 - Set max tokens
  • /topk 40 - Enable top-k sampling
  • /topp 0.9 - Set nucleus sampling threshold
  • quit - Exit

Single Prompt

python run.py --prompt "The future of AI is"

Batch from File

python run.py --prompts prompts.txt  # One prompt per line

Custom Checkpoint

python run.py --checkpoint checkpoints/best.pt

Full Options

python run.py --help

๐Ÿ” Monitoring Training

The training loop shows:

Step  5000  โ”‚  Train 4.23  โ”‚  Val 4.45  โ”‚  LR 0.000097

Healthy indicators:

  • โœ“ Train loss smoothly decreases
  • โœ“ Val loss follows trend
  • โœ“ No NaN values
  • โœ“ Learning rate schedule works
  • โœ“ No gradient clipping (or occasional, < 10% of steps)

Red flags:

  • โŒ Loss jumps/oscillates wildly
  • โŒ NaN values appear
  • โŒ Val loss stops improving (need more data or different HP)
  • โŒ Constant gradient clipping (reduce LR)

๐Ÿ“Š Checkpointing

Saved automatically every 500 steps:

  • latest.pt: Most recent checkpoint (always usable)
  • best.pt: Best validation loss (for inference)

Load in Python:

checkpoint = torch.load("checkpoints/best.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
step = checkpoint["step"]

๐Ÿ›‘ Troubleshooting

Dataset not preparing

# Monitor progress
./wait_for_dataset.sh

# Check manually
ls -lh data/

Training produces NaN

โœ“ Fixed: Lowered learning rate to 1.5e-4 and increased warmup

Model outputs gibberish

โœ“ Fixed: Trained on larger dataset (WikiText-103 vs WikiText-2)

Out of memory

  • Reduce MICRO_BATCH to 1 (slower but less VRAM)
  • Reduce BLOCK_SIZE to 64
  • Remove gradient checkpointing

GPU not detected

# Check in Python
import torch
print(torch.cuda.is_available())  # Should be True
print(torch.cuda.get_device_name(0))  # GPU name

๐Ÿ“š References

๐Ÿ“ License

MIT License - See LICENSE file


Status: โœ… Ready for training!

Next steps:

  1. โณ Wait for dataset preparation (prepare_data.py)
  2. โ–ถ๏ธ Run training (python main.py)
  3. ๐ŸŽ‰ Generate text (python run.py)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Papers for pragadeeshv23/Tiny-GPT