---
license: mit
---
# TinyGPT — GPT-2 Style LM (~163M) trained on FineWeb-Edu
A GPT-2 style decoder-only transformer pretrained from scratch on ~43B tokens
of the FineWeb-Edu dataset, achieving a validation loss of **2.84**.
I built this project to develop hands-on intuition for LLMs, inspired by Andrej Karpathy's nanoGPT.
---
## Model Details
| Parameter | Value |
|-----------|-------|
| Architecture | Decoder-only Transformer (GPT-2 style) |
| Parameters | ~163M |
| Layers | 12 |
| Attention heads | 12 |
| Embedding dim | 768 |
| Context length | 1024 tokens |
| Vocab size | 50,257 |
| Tokenizer | GPT-2 BPE via `tiktoken` |
| Attention | Causal self-attention (Flash Attention via `F.scaled_dot_product_attention`) |
| LM head | Separate linear layer (not weight-tied) |
> **Why ~163M and not 124M?** Standard GPT-2 124M ties the LM head weights
> to the token embedding table, which saves 50,257 × 768 ≈ 38.6M parameters.
> TinyGPT uses a separate `nn.Linear` head, resulting in ~163M total parameters.
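
The arithmetic behind that gap can be checked with a short sketch. It assumes the standard GPT-2 small layout (biases on every linear layer, learned position embeddings, 4x MLP expansion); the function name `gpt2_param_count` is illustrative, and TinyGPT's actual module layout may differ slightly.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257,
                     block_size=1024, tied_head=True):
    """Count parameters for a GPT-2 style model (assumed standard layout)."""
    wte = vocab_size * n_embd            # token embedding table
    wpe = block_size * n_embd            # learned position embeddings
    ln = 2 * n_embd                      # LayerNorm weight + bias
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: up-projection to 4 * n_embd and down-projection, both with bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = ln + attn + ln + mlp
    total = wte + wpe + n_layer * block + ln  # trailing ln is the final LayerNorm
    if not tied_head:
        total += vocab_size * n_embd     # separate LM head weight (bias not counted)
    return total


tied = gpt2_param_count(tied_head=True)
untied = gpt2_param_count(tied_head=False)
print(f"tied head:   {tied:,}")
print(f"untied head: {untied:,}")
print(f"difference:  {untied - tied:,}")
```

Under these assumptions the untied head accounts for the whole difference between the two totals; an LM-head bias would add a further 50,257 parameters.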
---
## Training Details
| Detail | Value |
|--------|-------|
| Dataset | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (`sample-100BT` subset) |
| Tokens trained | ~43B |
| Validation loss | 2.84 |
| Optimizer | AdamW (betas=(0.9, 0.95), eps=1e-8) |
| Learning rate | 6e-4 |
| LR schedule | Linear warmup (4000 steps) -> Cosine decay to 6e-5 |
| Effective batch size | 512 (16 x 32 gradient accumulation steps) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 (bf16) |
| Max iterations | 600,000 |
| Dropout | 0.0 |
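
The warmup and cosine decay rows can be read as the schedule sketched below. This is a minimal illustration, not the training script: the function name `get_lr` is hypothetical, and it assumes that the decay runs over the full 600,000 iterations and that warmup counts from step 0.

```python
import math

MAX_LR = 6e-4
MIN_LR = 6e-5
WARMUP_STEPS = 4000
MAX_STEPS = 600_000  # assumption: the decay horizon equals max iterations


def get_lr(step):
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        # Ramp linearly; step + 1 avoids a learning rate of exactly zero.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)


for step in (0, 3999, 4000, 302_000, 600_000):
    print(step, f"{get_lr(step):.2e}")
```

In a PyTorch loop, a value like this would typically be written into each optimizer parameter group's `lr` before the optimizer step.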
---
## Format
Weights are saved in **PyTorch native format** — a plain state dict saved with
`torch.save()`, containing only model weights (no optimizer state, no
scheduler). The file is ~670MB.
To load them, you need the `TinyGPT` model class from the GitHub repo (see Usage).
The model is also available in **Hugging Face Transformers format** in this
repository. The HF-format files include:
- `model.safetensors`
- `config.json`
- `generation_config.json`
- `tokenizer.json`
- `tokenizer_config.json`
The HF-format model can be loaded with `transformers` and is useful for standard
Hugging Face workflows. Note that TinyGPT was trained with a separate,
non-weight-tied LM head that includes a trained bias. Standard
`GPT2LMHeadModel.from_pretrained()` loads the main model weights but treats
`lm_head.bias` as an unexpected key because the default GPT-2 head is biasless.
For exact TinyGPT inference, restore the LM-head bias as shown below or use
`infer_hf.py` from the GitHub repo.
---
## Usage
### 1. Install dependencies
Clone the repo and install requirements:
```bash
git clone https://github.com/hemantvirmani/tinygpt
cd tinygpt
pip install -r requirements.txt
```
### 2. Get the model class
The `TinyGPT` model class is available at:
**[https://github.com/hemantvirmani/tinygpt](https://github.com/hemantvirmani/tinygpt)**
Clone or download `tinygpt.py` and place it in your working directory.
### 3. Load weights and run inference
```python
import tinygpt

model = tinygpt.load_model_for_inference()

prompts = [
    "Hello, I'm a language model,",
    "The human brain contains approximately",
    "Photosynthesis is the process by which plants",
    "The theory of relativity states that ",
    "The Roman Empire fell due to several factors including",
    "During the Industrial Revolution, workers ",
    "To solve a quadratic equation, you must first",
    "The key differences between mitosis and meiosis are ",
    "Once upon a time in ancient India, there lived a king who ",
]

for prompt in prompts:
    print(f"\n{'='*60}")
    print(f"PROMPT: {prompt}")
    print(f"{'='*60}")
    print(model.generate_text(start_text=prompt, max_tokens=500, temperature=0.7))
```
### 4. Load the Hugging Face format model
```bash
pip install torch transformers safetensors huggingface_hub
```
```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_id = "hemantvirmani/tinyGPT"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)

# Restore TinyGPT's trained LM-head bias for exact inference.
weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
state_dict = load_file(weights_path, device="cpu")
if "lm_head.bias" in state_dict:
    lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True)
    lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"])
    lm_head.bias = torch.nn.Parameter(state_dict["lm_head.bias"])
    model.lm_head = lm_head

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

prompt = "Photosynthesis is the process by which plants"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,
        temperature=0.7,
        top_k=0,
        top_p=1.0,
        repetition_penalty=1.3,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
You can also run the helper script from the GitHub repo:
```bash
python infer_hf.py --model_dir hemantvirmani/tinyGPT --prompt "Photosynthesis is the process by which plants"
```
---
## Sample Outputs (temperature=0.7, 500 tokens)
**Prompt:** `Photosynthesis is the process by which plants`
> Photosynthesis is the process by which plants take in sunlight, water,
> carbon dioxide and nutrients to produce energy for their cells. Humans
> depend on photosynthesis to provide their own energy, but many plants
> also use the energy of other organisms to produce food. The five types of...
**Prompt:** `The Roman Empire fell due to several factors including`
> The Roman Empire fell due to several factors including the decline of the
> Roman army, the rise of the Papacy, and the threat of the Islamic invasion.
> The fall of the Roman Empire was the result of a series of civil wars in
> the late fourth century, and was led by the first emperor of the Roman
> Empire, Constantine the Great.
---
## Limitations
- This is a **base language model**: it completes text but does not follow
  instructions or answer questions.
- It is prone to repetition loops, especially at low temperature.
- Fine-tuning is required for instruction-following or domain-specific tasks.
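
The link between low temperature and repetition comes from how temperature rescales logits before sampling. The sketch below uses made-up logits and a hypothetical helper, `softmax_with_temperature`; it illustrates the general technique rather than TinyGPT's own sampling code.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5, 0.1]  # illustrative values, not real model output
for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, so sampling behaves more like greedy decoding and is more likely to fall into the same continuation repeatedly.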
---
## Thanks to
- Andrej Karpathy's nanoGPT (video and code)
- Dataset: Hugging Face [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)