| --- |
| license: mit |
| --- |
| # TinyGPT — GPT-2 Style LM (~163M) trained on FineWeb-Edu |
|
|
| A GPT-2 style decoder-only transformer pretrained from scratch on ~43B tokens |
| of the FineWeb-Edu dataset, achieving a validation loss of **2.84**. |
|
|
| I built this project to develop hands-on intuition for LLMs. It is inspired by Andrej Karpathy's nanoGPT. |
|
|
| --- |
|
|
| ## Model Details |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Architecture | Decoder-only Transformer (GPT-2 style) | |
| | Parameters | ~163M | |
| | Layers | 12 | |
| | Attention heads | 12 | |
| | Embedding dim | 768 | |
| | Context length | 1024 tokens | |
| | Vocab size | 50,257 | |
| | Tokenizer | GPT-2 BPE via `tiktoken` | |
| | Attention | Causal self-attention (Flash Attention via `F.scaled_dot_product_attention`) | |
| | LM head | Separate linear layer (not weight-tied) | |
|
|
| > **Why ~163M and not 124M?** Standard GPT-2 124M ties the LM head weights |
| > to the token embedding table, saving ~38.6M parameters (50,257 x 768). |
| > TinyGPT uses a separate `nn.Linear` head, resulting in ~163M total parameters. |
|
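The arithmetic behind the 124M versus ~163M figures can be checked with a short sketch. It assumes the standard GPT-2 layout (learned position embeddings, biased linear layers, a 4x MLP expansion); TinyGPT's exact layers may differ slightly, and `gpt2_param_count` is a hypothetical helper written for this illustration, not a function from the repo.

```python
# Parameter count for a GPT-2 style model.
# Assumption: standard GPT-2 layout; TinyGPT's exact layer details may differ.

def gpt2_param_count(n_layer=12, d_model=768, vocab=50257, ctx=1024,
                     tied_head=True, head_bias=False):
    tok_emb = vocab * d_model                     # token embedding table
    pos_emb = ctx * d_model                       # learned position embeddings
    layer_norm = 2 * d_model                      # scale + shift
    attn = d_model * 3 * d_model + 3 * d_model    # fused QKV projection
    attn += d_model * d_model + d_model           # attention output projection
    mlp = d_model * 4 * d_model + 4 * d_model     # MLP up projection
    mlp += 4 * d_model * d_model + d_model        # MLP down projection
    block = 2 * layer_norm + attn + mlp           # one transformer block
    total = tok_emb + pos_emb + n_layer * block + layer_norm  # + final LayerNorm
    if not tied_head:
        total += vocab * d_model                  # separate LM head weight
        if head_bias:
            total += vocab                        # optional LM head bias
    return total

tied = gpt2_param_count(tied_head=True)
untied = gpt2_param_count(tied_head=False)
print(f"tied head:   {tied:,}")           # 124,439,808
print(f"untied head: {untied:,}")         # 163,037,184
print(f"difference:  {untied - tied:,}")  # 38,597,376
```

Passing `head_bias=True` adds another 50,257 parameters, which does not change the rounded ~163M total.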
|
| --- |
|
|
| ## Training Details |
|
|
| | Detail | Value | |
| |--------|-------| |
| | Dataset | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (`sample-100BT` subset) | |
| | Tokens trained | ~43B | |
| | Validation loss | 2.84 | |
| | Optimizer | AdamW (betas=(0.9, 0.95), eps=1e-8) | |
| | Learning rate | 6e-4 | |
| | LR schedule | Linear warmup (4000 steps) -> Cosine decay to 6e-5 | |
| Effective batch size | 512 sequences (micro-batch of 16 x 32 gradient accumulation steps) | |
| | Weight decay | 0.1 | |
| | Gradient clipping | 1.0 | |
| | Precision | bfloat16 (bf16) | |
| | Max iterations | 600,000 | |
| | Dropout | 0.0 | |
|
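The learning-rate schedule in the table can be sketched as below. This is a minimal illustration in the style of nanoGPT, not the training code from the repo. It assumes that the cosine decay runs until the "Max iterations" value, that iterations are optimizer steps, and that the batch size counts sequences of 1,024 tokens; `get_lr` and the constant names are hypothetical.

```python
import math

MAX_LR = 6e-4           # peak learning rate (from the table)
MIN_LR = 6e-5           # floor reached at the end of cosine decay
WARMUP_STEPS = 4000     # linear warmup length
DECAY_STEPS = 600_000   # assumption: decay horizon equals "Max iterations"

def get_lr(step):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup from MAX_LR / WARMUP_STEPS up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= DECAY_STEPS:
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR.
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

# Tokens consumed per optimizer step, assuming the batch size counts sequences.
MICRO_BATCH, GRAD_ACCUM, CONTEXT = 16, 32, 1024
tokens_per_step = MICRO_BATCH * GRAD_ACCUM * CONTEXT
print(f"tokens per optimizer step: {tokens_per_step:,}")  # 524,288

for step in (0, 2000, 4000, 300_000, 600_000):
    print(f"step {step:>7}: lr = {get_lr(step):.2e}")
```

Under these assumptions, ~43B tokens corresponds to roughly 82,000 optimizer steps.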
|
| --- |
|
|
| ## Format |
|
|
| Weights are saved in **PyTorch native format** — a plain state dict saved with |
| `torch.save()`, containing only model weights (no optimizer state, no |
| scheduler). The file is ~670MB. |
|
|
| To load, you need the `TinyGPT` model class from the GitHub repo (see Usage below). |
|
|
| The model is also available in **Hugging Face Transformers format** in this |
| repository. The HF-format files include: |
|
|
| - `model.safetensors` |
| - `config.json` |
| - `generation_config.json` |
| - `tokenizer.json` |
| - `tokenizer_config.json` |
|
|
| The HF-format model can be loaded with `transformers` and is useful for standard |
| Hugging Face workflows. Note that TinyGPT was trained with a separate, |
| non-weight-tied LM head that includes a trained bias. Standard |
| `GPT2LMHeadModel.from_pretrained()` loads the main model weights but treats |
| `lm_head.bias` as an unexpected key because the default GPT-2 head is biasless. |
| For exact TinyGPT inference, restore the LM-head bias as shown below or use |
| `infer_hf.py` from the GitHub repo. |
|
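To see why the missing bias matters, recall that the LM head computes `logits = W @ h + b`. Dropping `b` shifts each token's logit by a different amount, which changes the next-token distribution. The toy sketch below uses made-up numbers (a 3-token vocabulary and a 2-dimensional hidden state) purely for illustration; none of the values come from TinyGPT.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(hidden, weight, bias=None):
    """Linear LM head: one logit per vocabulary row, plus an optional bias."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    if bias is not None:
        logits = [l + b for l, b in zip(logits, bias)]
    return logits

# Toy values, made up for illustration only.
hidden = [1.0, -0.5]
weight = [[0.2, 0.1], [0.0, 0.4], [-0.3, 0.2]]
bias = [0.0, 1.5, -0.5]

with_bias = softmax(lm_head(hidden, weight, bias))
without_bias = softmax(lm_head(hidden, weight))

print("with bias:   ", [round(p, 3) for p in with_bias])
print("without bias:", [round(p, 3) for p in without_bias])
# In this toy case the most likely token differs between the two.
```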
|
| --- |
|
|
| ## Usage |
|
|
| ### 1. Install dependencies |
|
|
| Clone the repo and install requirements: |
|
|
| ```bash |
| git clone https://github.com/hemantvirmani/tinygpt |
| cd tinygpt |
| pip install -r requirements.txt |
| ``` |
|
|
| ### 2. Get the model class |
|
|
| The `TinyGPT` model class is available at: |
| **[https://github.com/hemantvirmani/tinygpt](https://github.com/hemantvirmani/tinygpt)** |
|
|
| If you did not clone the repo in step 1, download `tinygpt.py` and place it in your working directory. |
|
|
| ### 3. Load weights and run inference |
|
|
| ```python |
| import tinygpt |
| |
| model = tinygpt.load_model_for_inference() |
| |
| prompts = [ |
|     "Hello, I'm a language model,", |
|     "The human brain contains approximately", |
|     "Photosynthesis is the process by which plants", |
|     "The theory of relativity states that ", |
|     "The Roman Empire fell due to several factors including", |
|     "During the Industrial Revolution, workers ", |
|     "To solve a quadratic equation, you must first", |
|     "The key differences between mitosis and meiosis are ", |
|     "Once upon a time in ancient India, there lived a king who ", |
| ] |
|
| for prompt in prompts: |
|     print(f"\n{'='*60}") |
|     print(f"PROMPT: {prompt}") |
|     print(f"{'='*60}") |
|     print(model.generate_text(start_text=prompt, max_tokens=500, temperature=0.7)) |
| ``` |
|
|
| ### 4. Load the Hugging Face format model |
|
|
| ```bash |
| pip install torch transformers safetensors huggingface_hub |
| ``` |
|
|
| ```python |
| import torch |
| from huggingface_hub import hf_hub_download |
| from safetensors.torch import load_file |
| from transformers import GPT2LMHeadModel, GPT2Tokenizer |
| |
| model_id = "hemantvirmani/tinyGPT" |
| |
| tokenizer = GPT2Tokenizer.from_pretrained(model_id) |
| model = GPT2LMHeadModel.from_pretrained(model_id) |
| |
| # Restore TinyGPT's trained LM-head bias for exact inference. |
| weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors") |
| state_dict = load_file(weights_path, device="cpu") |
| if "lm_head.bias" in state_dict: |
|     lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True) |
|     lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"]) |
|     lm_head.bias = torch.nn.Parameter(state_dict["lm_head.bias"]) |
|     model.lm_head = lm_head |
| |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| model = model.to(device) |
| model.eval() |
| |
| prompt = "Photosynthesis is the process by which plants" |
| inputs = tokenizer(prompt, return_tensors="pt").to(device) |
| |
| with torch.no_grad(): |
|     output_ids = model.generate( |
|         **inputs, |
|         max_new_tokens=500, |
|         do_sample=True, |
|         temperature=0.7, |
|         top_k=0, |
|         top_p=1.0, |
|         repetition_penalty=1.3, |
|         pad_token_id=tokenizer.eos_token_id, |
|     ) |
| |
| print(tokenizer.decode(output_ids[0], skip_special_tokens=True)) |
| ``` |
|
|
| You can also run the helper script from the GitHub repo: |
|
|
| ```bash |
| python infer_hf.py --model_dir hemantvirmani/tinyGPT --prompt "Photosynthesis is the process by which plants" |
| ``` |
|
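For intuition about what `temperature=0.7` and `repetition_penalty=1.3` do in the `generate` call above, here is a minimal pure-Python sketch of one sampling step. It follows the commonly used rule of dividing positive logits and multiplying negative logits of already-seen tokens by the penalty, then scaling by temperature. It is an illustration, not a copy of the `transformers` implementation; the function names and toy logits are hypothetical.

```python
import math
import random

def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make tokens that already appear in the sequence less likely."""
    out = list(logits)
    for tok in set(seen_ids):
        # Positive logits shrink, negative logits move further from zero.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def sample_next(logits, seen_ids, temperature=0.7, repetition_penalty=1.3, rng=random):
    """Sample one token id from penalized, temperature-scaled logits."""
    penalized = apply_repetition_penalty(logits, seen_ids, repetition_penalty)
    scaled = [x / temperature for x in penalized]   # temperature < 1 sharpens
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]     # unnormalized softmax
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

# Toy example: 3-token vocabulary; tokens 0 and 2 were already generated.
toy_logits = [2.0, 1.0, -1.0]
penalized = apply_repetition_penalty(toy_logits, seen_ids=[0, 2], penalty=1.3)
print([round(x, 3) for x in penalized])  # [1.538, 1.0, -1.3]
print(sample_next(toy_logits, seen_ids=[0, 2], rng=random.Random(0)))
```

With `top_k=0` and `top_p=1.0`, no tokens are filtered out, so sampling draws from the full vocabulary as in this sketch.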
|
| --- |
|
|
| ## Sample Outputs (temperature=0.7, 500 tokens) |
|
|
| **Prompt:** `Photosynthesis is the process by which plants` |
| > Photosynthesis is the process by which plants take in sunlight, water, |
| > carbon dioxide and nutrients to produce energy for their cells. Humans |
| > depend on photosynthesis to provide their own energy, but many plants |
| > also use the energy of other organisms to produce food. The five types of... |
|
|
| **Prompt:** `The Roman Empire fell due to several factors including` |
| > The Roman Empire fell due to several factors including the decline of the |
| > Roman army, the rise of the Papacy, and the threat of the Islamic invasion. |
| > The fall of the Roman Empire was the result of a series of civil wars in |
| > the late fourth century, and was led by the first emperor of the Roman |
| > Empire, Constantine the Great. |
|
|
| --- |
|
|
| ## Limitations |
|
|
| - This is a **base language model**: it completes text and does not follow |
| instructions or answer questions. |
| - It is prone to repetition loops, especially at low temperature. |
| - Fine-tuning is required for instruction-following or domain-specific tasks. |
|
|
| --- |
|
|
| ## Thanks to |
|
|
| - Andrej Karpathy's nanoGPT (video and code) |
| - Dataset: Hugging Face [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
|
|