GPT-300M

A 334,808,064-parameter autoregressive transformer language model built entirely from scratch in PyTorch. No pretrained weights. No fine-tuning. Everything from zero.

Architecture

```
Input Token IDs
  ↓
Token Embedding (32,000 × 1,024)      — 32.8M params
  ↓
Rotary Position Embeddings (RoPE)     — 0 learned params
  ↓
┌─────────────────────────────────────────────────────┐
│  Transformer Block  × 24 layers  (12.6M each)       │
│                                                     │
│  RMSNorm → Multi-Head Attention → ⊕ Residual        │
│             16 heads × 64d                          │
│             4,194,304 params                        │
│                                                     │
│  RMSNorm → FFN (GELU) → ⊕ Residual                  │
│             1,024 → 4,096 → 1,024                   │
│             8,388,608 params                        │
└─────────────────────────────────────────────────────┘
  ↓
Final RMSNorm
  ↓
LM Head (weight-tied with embedding)  — 0 extra params
  ↓
Softmax → Next Token Probabilities
```
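
The diagram corresponds to a standard pre-norm residual layout. Below is a minimal PyTorch sketch of one block at these dimensions; class and module names are illustrative rather than the actual ones in model.py, and RoPE (applied to queries and keys inside attention) is omitted here and sketched under Model Details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Scale-only normalization: x / rms(x) * g, no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, heads, T, d_head); RoPE would rotate q and k here.
        q, k, v = [t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v)]
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, T, C))

class Block(nn.Module):
    """Pre-norm residual block: x + Attn(norm(x)), then x + FFN(norm(x))."""
    def __init__(self, d_model: int = 1024, n_heads: int = 16, d_ff: int = 4096):
        super().__init__()
        self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff, bias=False),
            nn.GELU(),
            nn.Linear(d_ff, d_model, bias=False),
        )

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))
```

With no biases, the four attention projections contribute 4 × 1,024² = 4,194,304 parameters per layer and the two FFN projections 2 × 1,024 × 4,096 = 8,388,608, matching the counts in the diagram.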

Parameter Breakdown

| Component | Parameters | Percentage |
|---|---|---|
| Token Embedding | 32,768,000 | 9.8% |
| Attention Layers (×24) | 100,663,296 | 30.1% |
| Feed-Forward Layers (×24) | 201,326,592 | 60.1% |
| RMSNorm (2 per layer + final) | 50,176 | <0.1% |
| LM Head | 0 (tied) | — |
| TOTAL | 334,808,064 | 100% |
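
These totals follow directly from the tensor shapes; a quick sanity check in Python:

```python
d_model, d_ff, vocab, n_layers = 1024, 4096, 32000, 24

embedding = vocab * d_model                 # 32,768,000 (reused by the tied LM head)
attention = n_layers * 4 * d_model**2       # Q, K, V, output projections: 100,663,296
ffn       = n_layers * 2 * d_model * d_ff   # up + down projections: 201,326,592
norms     = (2 * n_layers + 1) * d_model    # two per block plus the final norm: 50,176

print(embedding + attention + ffn + norms)  # 334,808,064
```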

Model Details

| Hyperparameter | Value |
|---|---|
| Hidden dimension (d_model) | 1,024 |
| Attention heads | 16 |
| Head dimension | 64 |
| Transformer layers | 24 |
| FFN dimension (d_ff) | 4,096 |
| Vocabulary size | 32,000 |
| Max sequence length | 2,048 |
| Position encoding | RoPE (θ = 10,000) |
| Activation | GELU |
| Normalization | RMSNorm (ε = 1e-5) |
| Weight tying | Yes (Embed ↔ LM Head) |
| Bias | None |
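
The position encoding deserves a note: RoPE adds no learned parameters. Each attention head rotates adjacent (even, odd) dimension pairs of its queries and keys by an angle proportional to the token position, with per-pair frequencies derived from the base θ = 10,000. A minimal sketch of the idea; the actual layout in model.py may differ:

```python
import torch

def apply_rope(x: torch.Tensor, theta: float = 10_000.0) -> torch.Tensor:
    """Rotate adjacent dimension pairs of x with shape (seq, n_heads, head_dim)
    by position-dependent angles; applied to queries and keys before attention."""
    seq_len, _, head_dim = x.shape
    # One frequency per dimension pair: theta^(-2i / head_dim)
    freqs = theta ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.arange(seq_len).float()[:, None] * freqs        # (seq, head_dim/2)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]  # broadcast over heads
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2D rotation of each (x1, x2) pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```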

Training Configuration

| Setting | Value |
|---|---|
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Linear warmup + cosine decay |
| Warmup steps | 2,000 |
| Weight decay | 0.1 |
| Batch size | 32 × 8 gradient-accumulation steps (effective 256) |
| Max training steps | 600,000 |
| Precision | bfloat16 |
| Gradient clipping | 1.0 |
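
The schedule rows combine into a single function of the step: linear warmup for the first 2,000 steps, then cosine decay from the peak to the minimum over the remaining steps. A sketch consistent with the table (the exact shape used in train.py may differ):

```python
import math

def lr_at(step: int, peak: float = 3e-4, floor: float = 3e-5,
          warmup: int = 2_000, max_steps: int = 600_000) -> float:
    """Linear warmup to the peak LR, then cosine decay down to the floor."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (max_steps - warmup)  # 0 -> 1 after warmup
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at(2_000), lr_at(600_000))  # 0.0003 3e-05
```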

Usage

Loading the Model

```python
from model import GPT300M
from config import GPT300MConfig
from tokenizer import BPETokenizer
import torch

# Load config, model, and tokenizer
config = GPT300MConfig()
model = GPT300M(config)

# Load trained weights
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

# Load tokenizer
tokenizer = BPETokenizer.load("tokenizer.json")
```

Chat with the Model

```python
from chat import ChatBot

chatbot = ChatBot(model, tokenizer, config)
response = chatbot.chat("Hello! What is machine learning?")
print(response)
```
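
Internally, chat.py's generation reduces to an autoregressive sampling loop. A hypothetical sketch, assuming tokenizer.encode returns a list of token IDs and that calling the model returns logits of shape (batch, seq, vocab); both are assumptions about this repo's API:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt: str,
             max_new_tokens: int = 100, temperature: float = 0.8) -> str:
    ids = torch.tensor([tokenizer.encode(prompt)])       # (1, seq)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature      # logits at the last position
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)          # append the sampled token
    return tokenizer.decode(ids[0].tolist())
```

This naive loop reprocesses the full prefix on every step; the KV-cache in model.py avoids that by reusing the keys and values of earlier positions.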

Interactive Chat

```bash
python chat.py --checkpoint pytorch_model.bin
```

Training from Scratch

```bash
# Quick test (tiny model)
python train.py --tiny

# Full 300M model
python train.py --data your_training_data.txt

# Multi-GPU
torchrun --nproc_per_node=4 train.py --data your_data.txt
```

Files

| File | Description |
|---|---|
| `config.json` | Model configuration (HuggingFace format) |
| `config.py` | Python config class with all hyperparameters |
| `model.py` | Full transformer architecture (RoPE, MHA, FFN, KV-cache) |
| `tokenizer.py` | BPE tokenizer built from scratch |
| `tokenizer_config.json` | Tokenizer settings |
| `special_tokens_map.json` | Special token definitions |
| `dataset.py` | Dataset classes and data loading |
| `train.py` | Training loop (DDP, mixed precision, LR scheduling) |
| `chat.py` | Interactive chatbot with streaming generation |
| `visual_nn_3d.py` | 3D matplotlib architecture visualization |
| `requirements.txt` | Python dependencies |
| `pytorch_model.bin` | Trained weights (upload after training) |
| `tokenizer.json` | Trained tokenizer (upload after training) |

Hardware Requirements

| Config | GPU Memory | Est. Training Time |
|---|---|---|
| Tiny (debug) | ~1 GB | Minutes |
| Full 300M | ~24 GB | ~3-5 days (4×A100) |

Key Features

- 100% from scratch — no pretrained weights, no HuggingFace Transformers dependency
- Rotary Position Embeddings — better length generalization than learned absolute positions
- RMSNorm — cheaper than LayerNorm (no mean subtraction, no bias) and equally effective
- Flash Attention — via PyTorch 2.0 SDPA (see the sketch after this list)
- KV-Cache — efficient autoregressive generation
- Weight tying — saves ~33M parameters (see the sketch after this list)
- Chat template — built-in support for multi-turn conversations
- torch.compile — ready for PyTorch 2.0+ compilation
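
Two of these features fit in a few lines of PyTorch. Below is a sketch of the SDPA call (which dispatches to a fused flash or memory-efficient kernel when one is available) and of weight tying; the shapes and variable names are illustrative, not taken from model.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Flash-style attention via PyTorch 2.x SDPA; is_causal=True applies the
# autoregressive mask without materializing the full attention matrix.
q = k = v = torch.randn(1, 16, 128, 64)   # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Weight tying: the LM head reuses the embedding matrix, so the
# 32,000 x 1,024 output projection adds no extra parameters.
embed = nn.Embedding(32_000, 1_024)
lm_head = nn.Linear(1_024, 32_000, bias=False)
lm_head.weight = embed.weight             # one shared (32,000, 1,024) tensor
```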

Citation

```bibtex
@misc{gpt300m,
  title={GPT-300M: A 300-Million Parameter Language Model From Scratch},
  year={2025},
  url={https://huggingface.co/YOUR_USERNAME/gpt-300m}
}
```

License

MIT
