license: mit

TinyGPT — GPT-2 Style LM (~163M) trained on FineWeb-Edu

A GPT-2 style decoder-only transformer pretrained from scratch on ~43B tokens of the FineWeb-Edu dataset, achieving a validation loss of 2.84.

I built this project to develop hands-on intuition for LLMs. It is inspired by Andrej Karpathy's nanoGPT.

Model Details

Parameter	Value
Architecture	Decoder-only Transformer (GPT-2 style)
Parameters	~163M
Layers	12
Attention heads	12
Embedding dim	768
Context length	1024 tokens
Vocab size	50,257
Tokenizer	GPT-2 BPE via `tiktoken`
Attention	Causal self-attention (Flash Attention via `F.scaled_dot_product_attention`)
LM head	Separate linear layer (not weight-tied)

Why ~163M and not 124M? Standard GPT-2 124M ties the LM head weights to the token embedding table, which saves 50,257 x 768 ≈ 38.6M parameters. TinyGPT uses a separate nn.Linear head, so those weights are counted again, giving ~163M parameters in total.
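The arithmetic behind that gap can be checked with a short back-of-the-envelope count. This sketch assumes the standard GPT-2 layout (learned position embeddings, biased linear layers and LayerNorms, a 4x MLP expansion). TinyGPT's actual module layout may differ slightly, so treat the totals as approximate.

```python
# Back-of-the-envelope parameter count for a GPT-2 style model.
# Assumption: standard GPT-2 layout; TinyGPT's exact modules may differ.
VOCAB, CTX, DIM, LAYERS = 50_257, 1024, 768, 12


def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameters in a dense layer: weight matrix plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(dim: int) -> int:
    """Parameters in a LayerNorm: one scale and one shift per feature."""
    return 2 * dim


# One transformer block: attention (QKV + output projection) and MLP.
block = (
    layer_norm(DIM)
    + linear(DIM, 3 * DIM)   # fused query/key/value projection
    + linear(DIM, DIM)       # attention output projection
    + layer_norm(DIM)
    + linear(DIM, 4 * DIM)   # MLP up-projection
    + linear(4 * DIM, DIM)   # MLP down-projection
)

token_emb = VOCAB * DIM
pos_emb = CTX * DIM

# Tied head: the LM head reuses token_emb, so it adds nothing.
tied_total = token_emb + pos_emb + LAYERS * block + layer_norm(DIM)
# Untied head: a second VOCAB x DIM matrix is stored for the LM head.
untied_total = tied_total + linear(DIM, VOCAB, bias=False)

print(f"tied head:   {tied_total / 1e6:.1f}M")
print(f"untied head: {untied_total / 1e6:.1f}M")
```

Under these assumptions the totals come to about 124.4M and 163.0M. A bias on the LM head would add a further 50,257 parameters.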

Training Details

Detail	Value
Dataset	FineWeb-Edu (`sample-100BT` subset)
Tokens trained	~43B
Validation loss	2.84
Optimizer	AdamW (betas=(0.9, 0.95), eps=1e-8)
Learning rate	6e-4
LR schedule	Linear warmup (4000 steps) -> Cosine decay to 6e-5
Effective batch size	512 (16 x 32 gradient accumulation steps)
Weight decay	0.1
Gradient clipping	1.0
Precision	bfloat16 (bf16)
Max iterations	600,000
Dropout	0.0
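The learning-rate schedule in the table can be sketched as a plain function of the step index. This is a minimal illustration, not the training script itself: the function name `get_lr`, the exact warmup formula, and the choice to end the cosine decay at the 600,000 max-iteration mark are all assumptions.

```python
import math

MAX_LR = 6e-4          # peak learning rate (from the table)
MIN_LR = 6e-5          # floor after cosine decay (from the table)
WARMUP_STEPS = 4000    # linear warmup length (from the table)
DECAY_STEPS = 600_000  # assumption: decay ends at the max iteration count


def get_lr(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        # Ramp up linearly; step + 1 avoids a zero learning rate at step 0.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


for s in (0, 2000, 4000, 300_000, 600_000):
    print(f"step {s:>7}: lr = {get_lr(s):.2e}")
```

In a PyTorch training loop, a function like this would typically be called once per optimizer step and its result written into each entry of `optimizer.param_groups`.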

Format

Weights are saved in PyTorch native format — a plain state dict saved with torch.save(), containing only model weights (no optimizer state, no scheduler). The file is ~670MB.

To load them, you need the TinyGPT model class from the GitHub repo (see Usage below).
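For readers unfamiliar with the state-dict format, the round trip looks roughly like this. The sketch uses a tiny stand-in module and a hypothetical filename (`toy_weights.pt`) instead of the real TinyGPT class and checkpoint, so it only illustrates the save/load mechanics.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in module; the real class is TinyGPT from the GitHub repo.
toy = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_weights.pt")  # hypothetical filename
    # A plain state dict: parameter names mapped to tensors, nothing else.
    torch.save(toy.state_dict(), path)
    # weights_only=True restricts unpickling to tensors and basic containers.
    state_dict = torch.load(path, map_location="cpu", weights_only=True)

# Loading needs a freshly constructed module with the same architecture.
restored = nn.Linear(4, 2)
result = restored.load_state_dict(state_dict, strict=True)

print(sorted(state_dict.keys()))
print(result.missing_keys, result.unexpected_keys)
```

Because the file holds only tensors keyed by parameter name, the model class has to be constructed first, which is why the TinyGPT class is needed.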

The model is also available in Hugging Face Transformers format in this repository. The HF-format files include:

model.safetensors
config.json
generation_config.json
tokenizer.json
tokenizer_config.json

The HF-format model can be loaded with transformers and is useful for standard Hugging Face workflows. Note that TinyGPT was trained with a separate, non-weight-tied LM head that includes a trained bias. Standard GPT2LMHeadModel.from_pretrained() loads the main model weights but treats lm_head.bias as an unexpected key because the default GPT-2 head is biasless. For exact TinyGPT inference, restore the LM-head bias as shown below or use infer_hf.py from the GitHub repo.

Usage

1. Install dependencies

Clone the repo and install requirements:

git clone https://github.com/hemantvirmani/tinygpt
cd tinygpt
pip install -r requirements.txt

2. Get the model class

The TinyGPT model class is defined in tinygpt.py, available at: https://github.com/hemantvirmani/tinygpt

If you cloned the repo in step 1, you already have it; otherwise download tinygpt.py and place it in your working directory.

3. Load weights and run inference

import tinygpt

model = tinygpt.load_model_for_inference()

prompts = [
    "Hello, I'm a language model,",
    "The human brain contains approximately",
    "Photosynthesis is the process by which plants",
    "The theory of relativity states that ",
    "The Roman Empire fell due to several factors including",
    "During the Industrial Revolution, workers ",
    "To solve a quadratic equation, you must first",
    "The key differences between mitosis and meiosis are ",
    "Once upon a time in ancient India, there lived a king who ",
]

for prompt in prompts:
    print(f"\n{'='*60}")
    print(f"PROMPT: {prompt}")
    print(f"{'='*60}")
    print(model.generate_text(start_text=prompt, max_tokens=500, temperature=0.7))

4. Load the Hugging Face format model

pip install torch transformers safetensors huggingface_hub

import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "hemantvirmani/tinyGPT"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Restore TinyGPT's trained LM-head bias for exact inference.
weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
state_dict = load_file(weights_path, device="cpu")
if "lm_head.bias" in state_dict:
    lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True)
    lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"])
    lm_head.bias = torch.nn.Parameter(state_dict["lm_head.bias"])
    model.lm_head = lm_head

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Photosynthesis is the process by which plants"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.7,
        top_k=0,
        top_p=1.0,
        repetition_penalty=1.3,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

You can also run the helper script from the GitHub repo:

python infer_hf.py --model_dir hemantvirmani/tinyGPT --prompt "Photosynthesis is the process by which plants"

Sample Outputs (temperature=0.7, 500 tokens)

Prompt: Photosynthesis is the process by which plants

Photosynthesis is the process by which plants take in sunlight, water, carbon dioxide and nutrients to produce energy for their cells. Humans depend on photosynthesis to provide their own energy, but many plants also use the energy of other organisms to produce food. The five types of...

Prompt: The Roman Empire fell due to several factors including

The Roman Empire fell due to several factors including the decline of the Roman army, the rise of the Papacy, and the threat of the Islamic invasion. The fall of the Roman Empire was the result of a series of civil wars in the late fourth century, and was led by the first emperor of the Roman Empire, Constantine the Great.

Limitations

This is a base language model: it completes text and does not follow instructions or answer questions.
It is prone to repetition loops, especially at low temperature.
Fine-tuning is required for instruction-following or domain-specific tasks.
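The link between low temperature and repetition can be seen in how temperature reshapes the next-token distribution. The sketch below shows the common way temperature is applied (dividing logits before the softmax). The logits are made up for illustration, and the sketch is not taken from TinyGPT's own sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens

for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token probability = {max(probs):.2f}")
```

As the temperature drops, probability mass concentrates on the top-scoring token, so sampling behaves more like greedy decoding, which tends to fall into loops more easily.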

Thanks to

Andrej Karpathy's nanoGPT - Video and Code
Dataset: HuggingFace FineWeb-Edu