license: mit
TinyGPT — GPT-2 Style LM (~163M) trained on FineWeb-Edu
A GPT-2 style decoder-only transformer pretrained from scratch on ~43B tokens of the FineWeb-Edu dataset, achieving a validation loss of 2.84.
I built this project to develop hands-on intuition for LLMs. It is inspired by Andrej Karpathy's nanoGPT.
Model Details
| Parameter | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-2 style) |
| Parameters | ~163M |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 1024 tokens |
| Vocab size | 50,257 |
| Tokenizer | GPT-2 BPE via tiktoken |
| Attention | Causal self-attention (Flash Attention via F.scaled_dot_product_attention) |
| LM head | Separate linear layer (not weight-tied) |
Why ~163M and not 124M? Standard GPT-2 124M ties the LM head weights to the token embedding table, which saves ~38.6M parameters (50,257 x 768). TinyGPT uses a separate nn.Linear head, resulting in ~163M total parameters.
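The arithmetic behind the two totals can be checked with a short count. This is a minimal sketch that assumes the standard GPT-2 layout (biases on every linear layer and LayerNorm, a 4x MLP expansion, learned position embeddings); the function name and its arguments are illustrative, and TinyGPT's actual layers may differ in small details.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024,
                     tied_head=True, head_bias=False):
    """Count parameters of a GPT-2 style decoder (assumed standard layout)."""
    wte = vocab_size * n_embd                    # token embedding table
    wpe = block_size * n_embd                    # learned position embeddings
    ln = 2 * n_embd                              # LayerNorm weight + bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd)    # fused q, k, v projection
    attn += n_embd * n_embd + n_embd             # attention output projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd)     # MLP up-projection
    mlp += 4 * n_embd * n_embd + n_embd          # MLP down-projection
    block = 2 * ln + attn + mlp                  # one transformer block
    total = wte + wpe + n_layer * block + ln     # plus the final LayerNorm
    if not tied_head:
        total += vocab_size * n_embd             # separate LM head weight
        if head_bias:
            total += vocab_size                  # LM head bias
    return total


tied = gpt2_param_count()
untied = gpt2_param_count(tied_head=False, head_bias=True)
print(f"tied head:   {tied:,}")
print(f"untied head: {untied:,}")
print(f"difference:  {untied - tied:,}")
```

Under these assumptions the tied count lands at about 124.4M and the untied count at about 163.1M, which matches the figures in the table.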
Training Details
| Detail | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-100BT subset) |
| Tokens trained | ~43B |
| Validation loss | 2.84 |
| Optimizer | AdamW (betas=(0.9, 0.95), eps=1e-8) |
| Learning rate | 6e-4 |
| LR schedule | Linear warmup (4000 steps) -> Cosine decay to 6e-5 |
| Effective batch size | 512 (micro-batch of 16 x 32 gradient accumulation steps) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 (bf16) |
| Max iterations | 600,000 |
| Dropout | 0.0 |
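The learning-rate schedule in the table can be sketched as a small function. This is an illustration, not the training script: the constant names are made up here, the exact warmup formula is an assumption, and the decay horizon is assumed to equal the 600,000 max iterations listed above.

```python
import math

MAX_LR = 6e-4          # peak learning rate (from the table)
MIN_LR = 6e-5          # floor reached at the end of the cosine decay
WARMUP_STEPS = 4000    # linear warmup length (from the table)
DECAY_STEPS = 600_000  # assumption: decay ends at the max iteration count


def get_lr(step):
    """Linear warmup followed by cosine decay from MAX_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        # Ramp linearly up to MAX_LR over the warmup steps.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        # Past the decay horizon, hold the floor value.
        return MIN_LR
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


for step in (0, 3999, 4000, 302_000, 600_000):
    print(step, f"{get_lr(step):.3e}")
```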
Format
Weights are saved in PyTorch native format — a plain state dict saved with
torch.save(), containing only model weights (no optimizer state, no
scheduler). The file is ~670MB.
To load, you need the TinyGPT model class (included below).
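For readers unfamiliar with the plain state-dict format, the round trip looks roughly like this. The sketch uses a tiny nn.Linear as a stand-in for the TinyGPT class and an in-memory buffer in place of the checkpoint file; both are placeholders chosen so the example is self-contained.

```python
import io

import torch
import torch.nn as nn

# Stand-in module; the real model class lives in tinygpt.py.
model = nn.Linear(8, 4)

# Save only the weights (no optimizer or scheduler state).
buffer = io.BytesIO()  # in-memory stand-in for the checkpoint file
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Load: the state dict is just a mapping from parameter names to tensors,
# so a model instance of the same class is needed to receive it.
state_dict = torch.load(buffer, map_location="cpu", weights_only=True)
restored = nn.Linear(8, 4)
restored.load_state_dict(state_dict)  # strict by default: keys must match
restored.eval()

print(sorted(state_dict.keys()))
```

This is why the model class is required: the file carries tensors keyed by name, not the code that defines the architecture.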
The model is also available in Hugging Face Transformers format in this repository. The HF-format files include:
- model.safetensors
- config.json
- generation_config.json
- tokenizer.json
- tokenizer_config.json
The HF-format model can be loaded with transformers and is useful for standard
Hugging Face workflows. Note that TinyGPT was trained with a separate,
non-weight-tied LM head that includes a trained bias. Standard
GPT2LMHeadModel.from_pretrained() loads the main model weights but treats
lm_head.bias as an unexpected key because the default GPT-2 head is biasless.
For exact TinyGPT inference, restore the LM-head bias as shown below or use
infer_hf.py from the GitHub repo.
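The effect of a dropped head bias can be reproduced in plain PyTorch. The sketch below is an analogy, not the Transformers loading code: it uses toy sizes and two nn.Linear layers to show that a biasless head reports the bias as an unexpected key under non-strict loading, and that the resulting logits differ by exactly that bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, vocab_size = 8, 16  # toy sizes for illustration only

# A "trained" head with a bias, standing in for TinyGPT's LM head.
trained_head = nn.Linear(n_embd, vocab_size, bias=True)
checkpoint = trained_head.state_dict()

# A biasless head, standing in for the default GPT-2 head.
biasless_head = nn.Linear(n_embd, vocab_size, bias=False)
result = biasless_head.load_state_dict(checkpoint, strict=False)
print("unexpected keys:", result.unexpected_keys)

# The weight loads fine, but every logit is off by the missing bias term.
hidden = torch.randn(1, n_embd)
with torch.no_grad():
    gap = trained_head(hidden) - biasless_head(hidden)
print("max logit gap:", gap.abs().max().item())
```

Because the gap is a fixed per-token offset, it shifts the output distribution on every step, which is why restoring the bias matters for exact inference.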
Usage
1. Install dependencies
Clone the repo and install requirements:
git clone https://github.com/hemantvirmani/tinygpt
cd tinygpt
pip install -r requirements.txt
2. Get the model class
The TinyGPT model class is available at:
https://github.com/hemantvirmani/tinygpt
Clone or download tinygpt.py and place it in your working directory.
3. Load weights and run inference
import tinygpt
model = tinygpt.load_model_for_inference()
prompts = [
    "Hello, I'm a language model,",
    "The human brain contains approximately",
    "Photosynthesis is the process by which plants",
    "The theory of relativity states that ",
    "The Roman Empire fell due to several factors including",
    "During the Industrial Revolution, workers ",
    "To solve a quadratic equation, you must first",
    "The key differences between mitosis and meiosis are ",
    "Once upon a time in ancient India, there lived a king who ",
]
for prompt in prompts:
    print(f"\n{'='*60}")
    print(f"PROMPT: {prompt}")
    print(f"{'='*60}")
    print(model.generate_text(start_text=prompt, max_tokens=500, temperature=0.7))
4. Load the Hugging Face format model
pip install torch transformers safetensors huggingface_hub
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model_id = "hemantvirmani/tinyGPT"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
# Restore TinyGPT's trained LM-head bias for exact inference.
weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
state_dict = load_file(weights_path, device="cpu")
if "lm_head.bias" in state_dict:
    lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True)
    lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"])
    lm_head.bias = torch.nn.Parameter(state_dict["lm_head.bias"])
    model.lm_head = lm_head
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
prompt = "Photosynthesis is the process by which plants"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.7,
        top_k=0,
        top_p=1.0,
        repetition_penalty=1.3,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
You can also run the helper script from the GitHub repo:
python infer_hf.py --model_dir hemantvirmani/tinyGPT --prompt "Photosynthesis is the process by which plants"
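The repetition_penalty=1.3 setting used in the generate() call can be illustrated with a short pure-Python sketch. It follows the commonly documented rule (positive logits of already-generated tokens are divided by the penalty, negative ones are multiplied by it); the function name is hypothetical and this is not the library's own implementation.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.3):
    """Return a copy of logits with previously generated tokens made less likely."""
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] < 0:
            out[tok] = out[tok] * penalty  # push negative logits further down
        else:
            out[tok] = out[tok] / penalty  # shrink positive logits
    return out


# Toy vocabulary of three tokens; tokens 0 and 1 were already generated.
logits = [2.6, -1.0, 0.5]
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1, 1]))
```

A penalty of 1.0 leaves the logits unchanged, so values above 1.0 are what discourage the model from repeating itself.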
Sample Outputs (temperature=0.7, 500 tokens)
Prompt: Photosynthesis is the process by which plants
Photosynthesis is the process by which plants take in sunlight, water, carbon dioxide and nutrients to produce energy for their cells. Humans depend on photosynthesis to provide their own energy, but many plants also use the energy of other organisms to produce food. The five types of...
Prompt: The Roman Empire fell due to several factors including
The Roman Empire fell due to several factors including the decline of the Roman army, the rise of the Papacy, and the threat of the Islamic invasion. The fall of the Roman Empire was the result of a series of civil wars in the late fourth century, and was led by the first emperor of the Roman Empire, Constantine the Great.
Limitations
- This is a base language model: it completes text and does not follow instructions or answer questions.
- Prone to repetition loops, especially at low temperature.
- Fine-tuning is required for instruction-following or domain-specific tasks.
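One way to see why low temperature encourages repetition is to look at how temperature reshapes the sampling distribution. The following is a minimal sketch with made-up logits and a hypothetical helper name; it only illustrates temperature scaling, not the model's actual sampling code.

```python
import math


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the top token, so sampling behaves more like greedy decoding and the model is more likely to fall into a loop.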
Thanks to
- Andrej Karpathy's nanoGPT - Video and Code
- Dataset: HuggingFace FineWeb-Edu