license: mit

TinyGPT — GPT-2 Style LM (~163M) trained on FineWeb-Edu

A GPT-2 style decoder-only transformer pretrained from scratch on ~43B tokens of the FineWeb-Edu dataset, achieving a validation loss of 2.84.

I built this project to develop hands-on intuition for LLMs. It is inspired by Andrej Karpathy's nanoGPT.


Model Details

| Parameter | Value |
|---|---|
| Architecture | Decoder-only Transformer (GPT-2 style) |
| Parameters | ~163M |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 1024 tokens |
| Vocab size | 50,257 |
| Tokenizer | GPT-2 BPE via tiktoken |
| Attention | Causal self-attention (Flash Attention via F.scaled_dot_product_attention) |
| LM head | Separate linear layer (not weight-tied) |

Why ~163M and not 124M? Standard GPT-2 124M ties the LM head weights to the token embedding table, so the 50,257 x 768 matrix (~38.6M parameters) is stored only once. TinyGPT uses a separate nn.Linear head, which adds that matrix back and brings the total to ~163M parameters.
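The arithmetic behind those two totals can be checked with a few lines of Python. This is a rough sketch rather than the repository's own counting code. It assumes the standard GPT-2 layout (learned positional embeddings, biases on every linear layer and layer norm) and an LM head with a bias, as described in the Format section.

```python
# Parameter count for a GPT-2 style model, tied vs. untied LM head.
# Assumption: standard GPT-2 layout with biases on all Linear/LayerNorm layers.
n_layer, n_embd, vocab_size, block_size = 12, 768, 50257, 1024


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer (weights plus optional bias)."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * n_embd  # scale and shift vectors

per_block = (
    layer_norm                     # norm before attention
    + linear(n_embd, 3 * n_embd)   # fused q, k, v projection
    + linear(n_embd, n_embd)       # attention output projection
    + layer_norm                   # norm before the MLP
    + linear(n_embd, 4 * n_embd)   # MLP up-projection
    + linear(4 * n_embd, n_embd)   # MLP down-projection
)

token_emb = vocab_size * n_embd
pos_emb = block_size * n_embd

# Tied head: the LM head reuses token_emb, so it adds no parameters.
tied_total = token_emb + pos_emb + n_layer * per_block + layer_norm
# Untied head: a separate Linear(n_embd, vocab_size) with bias.
untied_total = tied_total + linear(n_embd, vocab_size, bias=True)

print(f"tied head:   {tied_total / 1e6:.1f}M")
print(f"untied head: {untied_total / 1e6:.1f}M")
print(f"difference:  {(untied_total - tied_total) / 1e6:.1f}M")
```

Under these assumptions the tied count comes out at about 124.4M and the untied count at about 163.1M, which matches the figures quoted above.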


Training Details

| Detail | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-100BT subset) |
| Tokens trained | ~43B |
| Validation loss | 2.84 |
| Optimizer | AdamW (betas=(0.9, 0.95), eps=1e-8) |
| Learning rate | 6e-4 |
| LR schedule | Linear warmup (4000 steps) -> Cosine decay to 6e-5 |
| Effective batch size | 512 (16 x 32 gradient accumulation steps) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 (bf16) |
| Max iterations | 600,000 |
| Dropout | 0.0 |
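As a rough illustration of the schedule and batch settings listed above, the sketch below implements linear warmup followed by cosine decay and works out the tokens consumed per optimizer step. It is not taken from the training script. Two things are assumptions: that the cosine decay ends at the max-iteration count (`DECAY_STEPS`), and that the batch size is counted in sequences of the full 1024-token context. The function and constant names are illustrative.

```python
import math

MAX_LR = 6e-4
MIN_LR = 6e-5
WARMUP_STEPS = 4000
DECAY_STEPS = 600_000  # assumption: decay horizon equals the max-iteration setting


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from MAX_LR / WARMUP_STEPS up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        # After the decay horizon, hold at the floor.
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR.
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


# Assumption: batch size is measured in sequences of 1024 tokens each.
micro_batch, grad_accum, context_len = 16, 32, 1024
tokens_per_step = micro_batch * grad_accum * context_len
steps_for_43b = 43e9 / tokens_per_step

for step in (0, 3999, 4000, 302_000, 600_000):
    print(f"step {step:>7}: lr = {lr_at(step):.2e}")
print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"optimizer steps for ~43B tokens: {steps_for_43b:,.0f}")
```

If the batch-size assumption holds, each optimizer step processes 524,288 tokens, so ~43B tokens corresponds to roughly 82,000 optimizer steps.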

Format

Weights are saved in PyTorch native format — a plain state dict saved with torch.save(), containing only model weights (no optimizer state, no scheduler). The file is ~670MB.

To load, you need the TinyGPT model class (included below).

The model is also available in Hugging Face Transformers format in this repository. The HF-format files include:

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer.json
  • tokenizer_config.json

The HF-format model can be loaded with transformers and is useful for standard Hugging Face workflows. Note that TinyGPT was trained with a separate, non-weight-tied LM head that includes a trained bias. Standard GPT2LMHeadModel.from_pretrained() loads the main model weights but treats lm_head.bias as an unexpected key because the default GPT-2 head is biasless. For exact TinyGPT inference, restore the LM-head bias as shown below or use infer_hf.py from the GitHub repo.


Usage

1. Install dependencies

Clone the repo and install requirements:

git clone https://github.com/hemantvirmani/tinygpt
cd tinygpt
pip install -r requirements.txt

2. Get the model class

The TinyGPT model class is available at: https://github.com/hemantvirmani/tinygpt

If you cloned the repository in step 1, tinygpt.py is already available. Otherwise, download tinygpt.py from the repository and place it in your working directory.

3. Load weights and run inference

import tinygpt

model = tinygpt.load_model_for_inference()

prompts = [
    "Hello, I'm a language model,",
    "The human brain contains approximately",
    "Photosynthesis is the process by which plants",
    "The theory of relativity states that ",
    "The Roman Empire fell due to several factors including",
    "During the Industrial Revolution, workers ",
    "To solve a quadratic equation, you must first",
    "The key differences between mitosis and meiosis are ",
    "Once upon a time in ancient India, there lived a king who ",
]

for prompt in prompts:
    print(f"\n{'='*60}")
    print(f"PROMPT: {prompt}")
    print(f"{'='*60}")
    print(model.generate_text(start_text=prompt, max_tokens=500, temperature=0.7))

4. Load the Hugging Face format model

Install the required packages:

pip install torch transformers safetensors huggingface_hub

Then load the model and restore the LM-head bias in Python:

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "hemantvirmani/tinyGPT"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Restore TinyGPT's trained LM-head bias for exact inference.
weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
state_dict = load_file(weights_path, device="cpu")
if "lm_head.bias" in state_dict:
    lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True)
    lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"])
    lm_head.bias = torch.nn.Parameter(state_dict["lm_head.bias"])
    model.lm_head = lm_head

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Photosynthesis is the process by which plants"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.7,
        top_k=0,
        top_p=1.0,
        repetition_penalty=1.3,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

You can also run the helper script from the GitHub repo:

python infer_hf.py --model_dir hemantvirmani/tinyGPT --prompt "Photosynthesis is the process by which plants"

Sample Outputs (temperature=0.7, 500 tokens)

Prompt: Photosynthesis is the process by which plants

Photosynthesis is the process by which plants take in sunlight, water, carbon dioxide and nutrients to produce energy for their cells. Humans depend on photosynthesis to provide their own energy, but many plants also use the energy of other organisms to produce food. The five types of...

Prompt: The Roman Empire fell due to several factors including

The Roman Empire fell due to several factors including the decline of the Roman army, the rise of the Papacy, and the threat of the Islamic invasion. The fall of the Roman Empire was the result of a series of civil wars in the late fourth century, and was led by the first emperor of the Roman Empire, Constantine the Great.


Limitations

  • This is a base language model: it completes text, but it does not follow instructions or answer questions.
  • It is prone to repetition loops, especially at low temperature.
  • Fine-tuning is required for instruction-following or domain-specific tasks.
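To see why low temperature encourages repetition and how a repetition penalty counteracts it, here is a small, self-contained sketch. The logits are made-up numbers for a four-token vocabulary, and the helper names are illustrative. The penalty rule (divide positive scores, multiply negative ones) follows my understanding of how the `repetition_penalty` option in Transformers behaves. Treat it as an approximation, not as that library's exact code.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, seen_token_ids, penalty=1.3):
    """Lower the scores of tokens that were already generated."""
    out = list(logits)
    for i in set(seen_token_ids):
        # Positive scores shrink toward zero, negative scores move further down.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores, not real model output

# Lower temperature concentrates probability on the top-scoring token.
for temperature in (1.0, 0.7, 0.1):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Penalizing token 0 (pretend it was just generated) makes it less likely.
penalized = apply_repetition_penalty(logits, seen_token_ids=[0])
print("penalized:", [round(p, 3) for p in softmax(penalized, 0.7)])
```

As the temperature drops, nearly all probability mass lands on the single best token, which is what lets a base model lock into a loop. The penalty pushes recently used tokens back down.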

Thanks to

  • Andrej Karpathy's nanoGPT - Video and Code
  • Dataset: HuggingFace FineWeb-Edu