# GPT-300M

A 334,808,064-parameter autoregressive transformer language model built entirely from scratch in PyTorch. No pretrained weights. No fine-tuning. Everything from zero.
## Architecture

```
Input Token IDs
        ↓
Token Embedding (32,000 × 1,024) · 32.8M params
        ↓
Rotary Position Embeddings (RoPE) · 0 learned params
        ↓
┌─────────────────────────────────────────────────────┐
│  Transformer Block × 24 layers (12.6M each)         │
│                                                     │
│  RMSNorm → Multi-Head Attention → ⊕ Residual        │
│            16 heads × 64d                           │
│            4,194,304 params                         │
│                                                     │
│  RMSNorm → FFN (GELU) → ⊕ Residual                  │
│            1,024 → 4,096 → 1,024                    │
│            8,388,608 params                         │
└─────────────────────────────────────────────────────┘
        ↓
Final RMSNorm
        ↓
LM Head (weight-tied with embedding) · 0 extra params
        ↓
Softmax → Next Token Probabilities
```
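The per-block numbers in the diagram (4,194,304 attention parameters, 8,388,608 FFN parameters, ≈12.6M per block) can be reproduced with a minimal pre-norm block. This is a sketch, not the repository's `model.py`: the class names are illustrative, RoPE is omitted for brevity, and PyTorch's SDPA stands in for the attention kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """RMSNorm: scale by reciprocal RMS, one learned gain per channel."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class Block(nn.Module):
    """Pre-norm block: RMSNorm -> MHA -> residual, RMSNorm -> FFN -> residual."""
    def __init__(self, d_model=1024, n_heads=16, d_ff=4096):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = RMSNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)   # Wq, Wk, Wv fused
        self.proj = nn.Linear(d_model, d_model, bias=False)      # Wo
        self.norm2 = RMSNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff, bias=False),
            nn.GELU(),
            nn.Linear(d_ff, d_model, bias=False),
        )

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim) for SDPA
        q, k, v = (z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, d))
        x = x + self.ffn(self.norm2(x))
        return x
```

Counting `Block().parameters()` gives 4,194,304 (attention) + 8,388,608 (FFN) + 2,048 (two norms), matching the diagram.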
Parameter Breakdown
| Component |
Parameters |
Percentage |
| Token Embedding |
32,768,000 |
9.8% |
| Attention Layers (Γ24) |
100,663,296 |
30.1% |
| Feed-Forward Layers (Γ24) |
201,326,592 |
60.1% |
| RMSNorm (Γ24 + final) |
50,176 |
0.0% |
| LM Head |
0 (tied) |
β |
| TOTAL |
334,808,064 |
100% |
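The table's totals follow directly from the hyperparameters. A quick arithmetic check (all linear layers are bias-free, and the tied LM head adds nothing):

```python
d_model, n_layers, d_ff, vocab = 1024, 24, 4096, 32_000

embed = vocab * d_model                   # token embedding, shared with LM head
attn = n_layers * 4 * d_model * d_model   # Wq, Wk, Wv, Wo per block
ffn = n_layers * 2 * d_model * d_ff       # up- and down-projection per block
norms = (2 * n_layers + 1) * d_model      # two RMSNorms per block + final norm

total = embed + attn + ffn + norms
print(total)  # 334808064
```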
## Model Details

| Hyperparameter | Value |
|---|---|
| Hidden dimension (d_model) | 1,024 |
| Attention heads | 16 |
| Head dimension | 64 |
| Transformer layers | 24 |
| FFN dimension (d_ff) | 4,096 |
| Vocabulary size | 32,000 |
| Max sequence length | 2,048 |
| Position encoding | RoPE (θ = 10,000) |
| Activation | GELU |
| Normalization | RMSNorm (ε = 1e-5) |
| Weight tying | Yes (Embedding ↔ LM Head) |
| Bias | None |
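To illustrate the RoPE entry above (θ = 10,000): each pair of channels in a query or key head is rotated by an angle proportional to its position. The following is a minimal standalone sketch, not the repository's implementation; `model.py` may pair channels or cache the angles differently.

```python
import torch


def rope(x, theta=10_000.0):
    """Apply rotary position embeddings to x of shape (..., seq, head_dim)."""
    seq, dim = x.shape[-2], x.shape[-1]
    # one frequency per channel pair: theta^(-2i/dim)
    freqs = theta ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    # rotate each (even, odd) channel pair by its position-dependent angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because RoPE is a pure rotation, it preserves vector norms and leaves position 0 unchanged, which is why the diagram counts it as 0 learned parameters.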
## Training Configuration

| Setting | Value |
|---|---|
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95) |
| Peak learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Schedule | Cosine decay + linear warmup |
| Warmup steps | 2,000 |
| Weight decay | 0.1 |
| Batch size | 32 × 8 gradient accumulation |
| Max training steps | 600,000 |
| Precision | bfloat16 |
| Gradient clipping | 1.0 |
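The schedule rows combine into a single step-to-learning-rate function. This sketch uses the table's values as defaults; the function name is illustrative, not taken from `train.py`.

```python
import math


def lr_at(step, peak=3e-4, floor=3e-5, warmup=2_000, max_steps=600_000):
    """Linear warmup to the peak LR, then cosine decay to the minimum LR."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (max_steps - warmup)  # 0 -> 1 over decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

For example, `lr_at(1_000)` is halfway up the warmup ramp (1.5e-4), `lr_at(2_000)` is the 3e-4 peak, and `lr_at(600_000)` has decayed to the 3e-5 floor.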
## Usage

### Loading the Model

```python
import torch

from config import GPT300MConfig
from model import GPT300M
from tokenizer import BPETokenizer

config = GPT300MConfig()
model = GPT300M(config)

checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()

tokenizer = BPETokenizer.load("tokenizer.json")
```
### Chat with the Model

```python
from chat import ChatBot

chatbot = ChatBot(model, tokenizer, config)
response = chatbot.chat("Hello! What is machine learning?")
print(response)
```
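Under the hood, autoregressive generation is a loop that appends one predicted token at a time. A minimal greedy-decoding sketch of that loop follows; the real `chat.py` adds the chat template, sampling, KV-cache, and streaming, and the `encode`/`decode` method names here are assumptions about `BPETokenizer`, not its confirmed API.

```python
import torch


@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50):
    """Greedy decoding: feed the prompt, append the argmax token each step."""
    ids = torch.tensor([tokenizer.encode(prompt)])   # shape (1, seq)
    for _ in range(max_new_tokens):
        logits = model(ids)                          # shape (1, seq, vocab)
        next_id = logits[0, -1].argmax()             # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0].tolist())
```

Re-running the full forward pass on the growing sequence is what the model's KV-cache avoids: cached keys and values let each step process only the newest token.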
### Interactive Chat

```bash
python chat.py --checkpoint pytorch_model.bin
```
### Training from Scratch

```bash
# Quick sanity check with the tiny debug configuration
python train.py --tiny

# Full single-GPU run on your own text corpus
python train.py --data your_training_data.txt

# Multi-GPU run with DDP (here, 4 GPUs)
torchrun --nproc_per_node=4 train.py --data your_data.txt
```
## Files

| File | Description |
|---|---|
| `config.json` | Model configuration (HuggingFace format) |
| `config.py` | Python config class with all hyperparameters |
| `model.py` | Full transformer architecture (RoPE, MHA, FFN, KV-cache) |
| `tokenizer.py` | BPE tokenizer built from scratch |
| `tokenizer_config.json` | Tokenizer settings |
| `special_tokens_map.json` | Special token definitions |
| `dataset.py` | Dataset classes and data loading |
| `train.py` | Training loop (DDP, mixed precision, scheduling) |
| `chat.py` | Interactive chatbot with streaming generation |
| `visual_nn_3d.py` | 3D matplotlib architecture visualization |
| `requirements.txt` | Python dependencies |
| `pytorch_model.bin` | Trained weights (upload after training) |
| `tokenizer.json` | Trained tokenizer (upload after training) |
## Hardware Requirements

| Config | GPU Memory | Est. Training Time |
|---|---|---|
| Tiny (debug) | ~1 GB | Minutes |
| Full 300M | ~24 GB | ~3–5 days (4 × A100) |
## Key Features

- **100% from scratch** – no pretrained weights, no HuggingFace Transformers dependency
- **Rotary Position Embeddings** – better length generalization than learned positions
- **RMSNorm** – faster than LayerNorm, equally effective
- **Flash Attention** – via PyTorch 2.0 SDPA
- **KV-Cache** – efficient autoregressive generation
- **Weight tying** – saves ~33M parameters
- **Chat template** – built-in support for multi-turn conversations
- **torch.compile** – ready for PyTorch 2.0+ compilation
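The weight-tying savings quoted above come from sharing one 32,000 × 1,024 matrix between the token embedding and the LM head. A minimal sketch of the idea (module names are illustrative):

```python
import torch.nn as nn

vocab, d_model = 32_000, 1024
embed = nn.Embedding(vocab, d_model)
lm_head = nn.Linear(d_model, vocab, bias=False)
lm_head.weight = embed.weight  # both modules now point at one (32,000 x 1,024) tensor

# counting unique tensors shows the head adds no parameters of its own
unique = {id(p): p.numel() for m in (embed, lm_head) for p in m.parameters()}
print(sum(unique.values()))  # 32768000 -- the ~33M an untied head would duplicate
```

This works because `nn.Embedding` and a bias-free `nn.Linear(d_model, vocab)` store weights with the same `(vocab, d_model)` shape, so the parameter tensor can simply be reassigned.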
## Citation

```bibtex
@misc{gpt300m,
  title={GPT-300M: A 300-Million Parameter Language Model From Scratch},
  year={2025},
  url={https://huggingface.co/YOUR_USERNAME/gpt-300m}
}
```
## License

MIT