pretrained and finetuned tinyGPT dataset

Files changed (16)

.gitattributes +5 -0
finetuning alpaca/README.md +230 -0
finetuning alpaca/checkpoint/tinygpt_finetuned_checkpoint_alpaca.pt +3 -0
finetuning alpaca/huggingface/config.json +34 -0
finetuning alpaca/huggingface/generation_config.json +9 -0
finetuning alpaca/huggingface/model.safetensors +3 -0
finetuning alpaca/huggingface/tokenizer.json +0 -0
finetuning alpaca/huggingface/tokenizer_config.json +12 -0
pretraining/PyTorch native/tinygpt_pretrained_weights.pt +3 -0
pretraining/README.md +210 -0
pretraining/checkpoint/tinygpt_pretrained_checkpoint_438k.pt +3 -0
pretraining/tinygpt huggingface/config.json +34 -0
pretraining/tinygpt huggingface/generation_config.json +9 -0
pretraining/tinygpt huggingface/model.safetensors +3 -0
pretraining/tinygpt huggingface/tokenizer.json +0 -0
pretraining/tinygpt huggingface/tokenizer_config.json +12 -0

.gitattributes ADDED

	@@ -0,0 +1,5 @@

+finetuning[[:space:]]alpaca/checkpoint/tinygpt_finetuned_checkpoint_alpaca.pt filter=lfs diff=lfs merge=lfs -text
+finetuning[[:space:]]alpaca/huggingface/model.safetensors filter=lfs diff=lfs merge=lfs -text
+pretraining/checkpoint/tinygpt_pretrained_checkpoint_438k.pt filter=lfs diff=lfs merge=lfs -text
+pretraining/PyTorch[[:space:]]native/tinygpt_pretrained_weights.pt filter=lfs diff=lfs merge=lfs -text
+pretraining/tinygpt[[:space:]]huggingface/model.safetensors filter=lfs diff=lfs merge=lfs -text
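Reviewer note: in `.gitattributes`, whitespace separates the path pattern from its attributes, so the spaces in directory names such as `finetuning alpaca` are written as `[[:space:]]`. A minimal sketch that rebuilds the same lines from plain paths; the helper name `lfs_attr_line` is hypothetical and not part of Git or Git LFS:

```python
LFS_ATTRS = "filter=lfs diff=lfs merge=lfs -text"

def lfs_attr_line(path: str) -> str:
    """Return a .gitattributes line that tracks `path` with Git LFS."""
    # A literal space would end the pattern, so each space is written
    # as the POSIX character class [[:space:]].
    pattern = path.replace(" ", "[[:space:]]")
    return f"{pattern} {LFS_ATTRS}"

if __name__ == "__main__":
    print(lfs_attr_line("pretraining/PyTorch native/tinygpt_pretrained_weights.pt"))
```

In practice `git lfs track` writes these lines itself; the sketch only shows how the escaped form maps back to the real paths.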

finetuning alpaca/README.md ADDED

	@@ -0,0 +1,230 @@

+---
+license: mit
+---
+# TinyGPT-Alpaca — Instruction-Tuned GPT-2 Style LM (~163M)
+The TinyGPT pretrained base model (~163M params, val loss 2.84), instruction
+fine-tuned on the [Alpaca Cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
+dataset (52K examples). Trained with a custom PyTorch loop as a full fine-tune:
+no LoRA, no PEFT.
+I built this project to develop hands-on intuition for LLMs, inspired by Andrej Karpathy's nanoGPT.
+---
+## Model Details
+| Parameter | Value |
+| --- | --- |
+| Architecture | Decoder-only Transformer (GPT-2 style) |
+| Parameters | ~163M |
+| Layers | 12 |
+| Attention heads | 12 |
+| Embedding dim | 768 |
+| Context length | 1024 tokens (512 used during fine-tuning) |
+| Vocab size | 50,257 |
+| Tokenizer | GPT-2 BPE via `tiktoken` |
+| Attention | Causal self-attention (Flash Attention via `F.scaled_dot_product_attention`) |
+| LM head | Separate linear layer with bias (not weight-tied) |
+| Base model | TinyGPT pretrained on FineWeb-Edu `sample-100BT` (val loss 2.84) |
+---
+## Fine-Tuning Details
+| Detail | Value |
+| --- | --- |
+| Dataset | [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) (52K instruction-response pairs) |
+| Prompt template | `### Instruction / ### Input (optional) / ### Response` |
+| Max sequence length | 512 tokens |
+| Val split | 10% (held out from the 52K) |
+| Best val loss | **1.8405** (step 3,600 of 5,000) |
+| Optimizer | AdamW (betas=(0.9, 0.95), eps=1e-8) |
+| Learning rate | 1e-4 with linear warmup (100 steps) → cosine decay |
+| Effective batch size | 64 (4 micro-batch × 16 gradient accumulation steps) |
+| Weight decay | 0.01 |
+| Gradient clipping | 1.0 |
+| Dropout | **0.1** (critical — without it, train/val gap exceeded 0.80 within 2,000 steps) |
+| Precision | bfloat16 (bf16) |
+| Hardware | Kaggle T4 GPU |
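Reviewer note: the "Effective batch size" row relies on gradient accumulation. The sketch below is not the repo's training loop. It is a pure-Python toy (a 1-D linear model on made-up data) showing why 16 micro-batches of 4, each scaled by 1/16, give the same gradient as one batch of 64. It assumes equal-sized micro-batches and a mean-reduced loss; all function names are hypothetical.

```python
import math

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, micro_batch, accum_steps):
    """Sum per-micro-batch gradients, each scaled by 1 / accum_steps."""
    total = 0.0
    for step in range(accum_steps):
        lo = step * micro_batch
        hi = lo + micro_batch
        # Scaling each micro-batch gradient keeps the sum equal to the
        # gradient of one batch of micro_batch * accum_steps examples.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total

MICRO_BATCH, ACCUM_STEPS = 4, 16                 # values from the table above
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS

xs = [0.1 * i for i in range(EFFECTIVE_BATCH)]   # toy data, not Alpaca
ys = [3.0 * x + 1.0 for x in xs]

full = grad_mse(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, MICRO_BATCH, ACCUM_STEPS)
print(EFFECTIVE_BATCH, math.isclose(full, accumulated, rel_tol=1e-9))
```

With a token-level loss and variable-length sequences the match is only approximate, since micro-batches then contribute different numbers of tokens.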
+---
+## Format
+Two formats are provided:
+**1. Full training checkpoint** (`tinygpt_finetuned_checkpoint_alpaca.pt`)
+A dict with keys: `model_state`, `optimizer_state`, `scheduler_state`, `step`, `val_loss`.
+Useful if you want to resume training or inspect training metadata.
+The file is ~2 GB (includes optimizer state).
+**2. HuggingFace format** (`model.safetensors` + `config.json`)
+Exported via `export_to_hf_alpaca.py` from the GitHub repo. Loadable with
+`transformers`. Same `lm_head.bias` caveat as the pretrained model applies here
+(see Usage below).
+---
+## Prompt Template
+This model was trained on the Alpaca instruction format. Always wrap prompts in
+this template — the model has learned to respond after `### Response:`.
+**Without input context:**
+```
+### Instruction:
+{your instruction here}
+### Response:
+```
+**With input context:**
+```
+### Instruction:
+{your instruction here}
+### Input:
+{additional context here}
+### Response:
+```
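Reviewer note: the template can be built and parsed with two small helpers. This is a sketch rather than code from the repo. `build_prompt` and `extract_response` are hypothetical names, and the blank line between sections is taken from the `ask` helper in this README's Usage section.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Wrap an instruction (and optional input) in the Alpaca template."""
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)

def extract_response(generated: str) -> str:
    """Return the text after the first '### Response:' header.

    The answer is cut at the next '### ' header, which trims outputs where
    the model starts a new section. Returns '' if the header is missing.
    """
    _, _, tail = generated.partition("### Response:")
    answer, _, _ = tail.partition("\n### ")
    return answer.strip()

if __name__ == "__main__":
    print(build_prompt("Summarize the following text.",
                       input_text="The moon orbits Earth once every 27 days."))
```

Cutting at the next header is one simple way to handle the repeated `### Response:` headers listed under Limitations; it does not replace `repetition_penalty`.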
+---
+## Usage
+### 1. Install dependencies
+```bash
+git clone https://github.com/hemantvirmani/tinygpt
+cd tinygpt
+pip install torch tiktoken
+```
+### 2. Load PyTorch checkpoint and run inference
+```python
+import torch
+import tiktoken
+import tinygpt
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Load the full checkpoint and extract model weights
+ckpt = torch.load("tinygpt_finetuned_checkpoint_alpaca.pt", map_location=device, weights_only=False)
+state_dict = ckpt["model_state"]
+print(f"Loaded checkpoint — step: {ckpt['step']} | val loss: {ckpt['val_loss']:.4f}")
+# Strip _orig_mod. prefix if checkpoint came from a torch.compile() run
+if any(k.startswith("_orig_mod.") for k in state_dict):
+    state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
+enc = tiktoken.get_encoding("gpt2")
+state = tinygpt.State(tokenizer=enc, train_data=None, val_data=None, vocab_size=enc.n_vocab)
+model = tinygpt.TinyGPT(state).to(device)
+model.load_state_dict(state_dict)
+model.eval()
+# Run inference with instruction template
+def ask(instruction, input_text="", max_tokens=200, temperature=0.7):
+    if input_text:
+        prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
+    else:
+        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
+    return model.generate_text(start_text=prompt, max_tokens=max_tokens, temperature=temperature)
+print(ask("What is photosynthesis?"))
+print(ask("Explain the water cycle in simple terms."))
+print(ask("Summarize the following text.", input_text="The moon orbits Earth once every 27 days."))
+```
+### 3. Load the HuggingFace format model
+```bash
+pip install torch transformers safetensors huggingface_hub
+```
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+from transformers import GPT2LMHeadModel, GPT2Tokenizer
+model_id = "hemantvirmani/tinyGPT-alpaca"
+tokenizer = GPT2Tokenizer.from_pretrained(model_id)
+model = GPT2LMHeadModel.from_pretrained(model_id)
+# Restore TinyGPT's trained LM-head bias for exact inference
+weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
+state_dict = load_file(weights_path, device="cpu")
+if "lm_head.bias" in state_dict:
+    lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True)
+    lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"])
+    lm_head.bias   = torch.nn.Parameter(state_dict["lm_head.bias"])
+    model.lm_head  = lm_head
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device).eval()
+instruction = "What is photosynthesis?"
+prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
+inputs = tokenizer(prompt, return_tensors="pt").to(device)
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        max_new_tokens=200,
+        do_sample=True,
+        temperature=0.7,
+        top_k=50,
+        repetition_penalty=1.3,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+```
+---
+## Sample Outputs (temperature=0.7, after fine-tuning)
+**Prompt:** `What is photosynthesis?`
+> Photosynthesis is the process by which plants, algae, and some bacteria
+> convert sunlight, water, and carbon dioxide into glucose, oxygen, and other
+> chemical compounds. It is a crucial process for the survival and reproduction
+> of life on Earth and is critical for the greenhouse effect, water cycle, and
+> the carbon cycle.
+**Prompt:** `Explain the water cycle in simple terms.`
+> The water cycle, also known as the hydrologic cycle, is the continuous
+> movement of water on, above, and below the surface of the earth. It starts
+> with the evaporation of water from the ground and rises into the atmosphere
+> through the process of precipitation. The water in the oceans and other bodies
+> of water evaporates from the surface of the earth in order to be returned to
+> the earth's surface through precipitation.
+---
+## Limitations
+- **163M parameters** — factual accuracy is limited. The model learns the
+  instruction-response *format* quickly (within the first 100 steps) but
+  factual depth is constrained by model capacity.
+- **Not RLHF-tuned** — no safety guardrails, no preference alignment.
+- **Trained on Alpaca Cleaned (52K)** — may not generalize well to complex,
+  multi-step, or domain-specific instructions.
+- Can degenerate on some questions (e.g., repeating `### Response:` headers).
+  Use `repetition_penalty=1.3` to mitigate.
+- The base model was trained on formal educational text (FineWeb-Edu); that
+  bias carries through to instruction-following.
+---
+## Thanks to
+- Andrej Karpathy's nanoGPT — architecture inspiration
+- Dataset: [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned)
+- Base model: [hemantvirmani/tinyGPT](https://huggingface.co/hemantvirmani/tinyGPT)
+- Compute: Kaggle (T4 GPU)

finetuning alpaca/checkpoint/tinygpt_finetuned_checkpoint_alpaca.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b07938beb5b22b699314cb101ec6ac101f48fa8ae47355e1ee1907d31f4ac9b1
+size 2006993967
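Reviewer note: the three lines above are a Git LFS pointer, not the checkpoint itself. They record the hash and byte size of the real file, so a downloaded copy can be checked against them. A minimal sketch using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names:

```python
import hashlib
from pathlib import Path

def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}

def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and hash against a parsed pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        # Hash in chunks so a ~2 GB checkpoint never sits fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b07938beb5b22b699314cb101ec6ac101f48fa8ae47355e1ee1907d31f4ac9b1
size 2006993967
"""
pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

Usage would look like `matches_pointer(Path("tinygpt_finetuned_checkpoint_alpaca.pt"), pointer)` after downloading the file.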

finetuning alpaca/huggingface/config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "activation_function": "gelu",
+  "add_cross_attention": false,
+  "architectures": [
+    "GPT2LMHeadModel"
+  ],
+  "attn_pdrop": 0.0,
+  "bos_token_id": 50256,
+  "dtype": "float32",
+  "embd_pdrop": 0.0,
+  "eos_token_id": 50256,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "model_type": "gpt2",
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": null,
+  "n_layer": 12,
+  "n_positions": 1024,
+  "pad_token_id": null,
+  "reorder_and_upcast_attn": false,
+  "resid_pdrop": 0.0,
+  "scale_attn_by_inverse_layer_idx": false,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.1,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "tie_word_embeddings": false,
+  "transformers_version": "5.3.0",
+  "use_cache": true,
+  "vocab_size": 50257
+}

finetuning alpaca/huggingface/generation_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 50256,
+  "eos_token_id": 50256,
+  "output_attentions": false,
+  "output_hidden_states": false,
+  "transformers_version": "5.3.0",
+  "use_cache": true
+}

finetuning alpaca/huggingface/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:19486203f84dd502fc571085190e6f90794de219677cd7ecfd00c46d39c24011
+size 652365020

finetuning alpaca/huggingface/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

finetuning alpaca/huggingface/tokenizer_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "is_local": false,
+  "model_max_length": 1024,
+  "pad_token": null,
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}

pretraining/PyTorch native/tinygpt_pretrained_weights.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5519339ae282c0a32db9934589a501908e5acb498047397558767e17f9a9856e
+size 702547947

pretraining/README.md ADDED

	@@ -0,0 +1,210 @@

+---
+license: mit
+---
+# TinyGPT — GPT-2 Style LM (~163M) trained on FineWeb-Edu
+A GPT-2 style decoder-only transformer pretrained from scratch on ~43B tokens
+of the FineWeb-Edu dataset, achieving a validation loss of **2.84**.
+I built this project to develop hands-on intuition for LLMs, inspired by Andrej Karpathy's nanoGPT.
+---
+## Model Details
+| Parameter | Value |
+|-----------|-------|
+| Architecture | Decoder-only Transformer (GPT-2 style) |
+| Parameters | ~163M |
+| Layers | 12 |
+| Attention heads | 12 |
+| Embedding dim | 768 |
+| Context length | 1024 tokens |
+| Vocab size | 50,257 |
+| Tokenizer | GPT-2 BPE via `tiktoken` |
+| Attention | Causal self-attention (Flash Attention via `F.scaled_dot_product_attention`) |
+| LM head | Separate linear layer (not weight-tied) |
+> **Why ~163M and not 124M?** Standard GPT-2 124M ties the LM head weights
+> with the token embedding table, saving ~38M parameters. TinyGPT uses a
+> separate `nn.Linear` head, resulting in ~163M total parameters.
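Reviewer note: the ~124M versus ~163M figures can be checked by hand from the hyperparameters in the table. The sketch below assumes TinyGPT's blocks follow the standard GPT-2 layout (biases on every linear layer and LayerNorm, 4x MLP expansion) and that the untied head carries a bias, as the fine-tuning README states. `gpt2_param_count` is a hypothetical helper, not a function from the repo.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257,
                     n_positions=1024, tied_head=True, head_bias=False):
    """Count parameters of a GPT-2 style decoder with biased linear layers."""
    embeddings = vocab_size * n_embd + n_positions * n_embd
    layer_norm = 2 * n_embd                                # scale + shift
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    total = embeddings + n_layer * block + layer_norm      # + final LayerNorm
    if not tied_head:
        total += vocab_size * n_embd                       # separate LM head
        if head_bias:
            total += vocab_size
    return total

tied = gpt2_param_count()
untied = gpt2_param_count(tied_head=False, head_bias=True)
print(f"tied head:   {tied:,}")       # 124,439,808
print(f"untied head: {untied:,}")     # 163,087,441
print(f"head weight: {50257 * 768:,}")  # 38,597,376
```

At 4 bytes per float32 parameter the untied count comes to 652,349,764 bytes, within about 15 KB of the 652,365,020-byte `model.safetensors` in this commit, which would be consistent with float32 storage plus a small file header.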
+---
+## Training Details
+| Detail | Value |
+|--------|-------|
+| Dataset | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (`sample-100BT` subset) |
+| Tokens trained | ~43B |
+| Validation loss | 2.84 |
+| Optimizer | AdamW (betas=(0.9, 0.95), eps=1e-8) |
+| Learning rate | 6e-4 |
+| LR schedule | Linear warmup (4000 steps) -> Cosine decay to 6e-5 |
+| Effective batch size | 512 (16 micro-batch x 32 gradient accumulation steps) |
+| Weight decay | 0.1 |
+| Gradient clipping | 1.0 |
+| Precision | bfloat16 (bf16) |
+| Max iterations | 600,000 |
+| Dropout | 0.0 |
+---
+## Format
+Weights are saved in **PyTorch native format** — a plain state dict saved with
+`torch.save()`, containing only model weights (no optimizer state, no
+scheduler). The file is ~670MB.
+To load, you need the `TinyGPT` model class from the GitHub repo (see Usage below).
+The model is also available in **Hugging Face Transformers format** in this
+repository. The HF-format files include:
+- `model.safetensors`
+- `config.json`
+- `generation_config.json`
+- `tokenizer.json`
+- `tokenizer_config.json`
+The HF-format model can be loaded with `transformers` and is useful for standard
+Hugging Face workflows. Note that TinyGPT was trained with a separate,
+non-weight-tied LM head that includes a trained bias. Standard
+`GPT2LMHeadModel.from_pretrained()` loads the main model weights but treats
+`lm_head.bias` as an unexpected key because the default GPT-2 head is biasless.
+For exact TinyGPT inference, restore the LM-head bias as shown below or use
+`infer_hf.py` from the GitHub repo.
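Reviewer note: whether a given `model.safetensors` actually stores `lm_head.bias` can be checked without loading any weights, because the file begins with a JSON header listing every tensor. A minimal standard-library sketch; the two function names are hypothetical:

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as fh:
        # The file starts with an unsigned 64-bit little-endian integer
        # giving the byte length of the JSON header that follows it.
        (header_len,) = struct.unpack("<Q", fh.read(8))
        return json.loads(fh.read(header_len))

def has_lm_head_bias(path: str) -> bool:
    """True if the checkpoint stores a separate `lm_head.bias` tensor."""
    return "lm_head.bias" in read_safetensors_header(path)
```

Calling `has_lm_head_bias("model.safetensors")` on a local copy should tell you whether the bias-restoring step in the Usage section is needed for that file.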
+---
+## Usage
+### 1. Install dependencies
+Clone the repo and install requirements:
+```bash
+git clone https://github.com/hemantvirmani/tinygpt
+cd tinygpt
+pip install -r requirements.txt
+```
+### 2. Get the model class
+The `TinyGPT` model class is available at:
+**[https://github.com/hemantvirmani/tinygpt](https://github.com/hemantvirmani/tinygpt)**
+Clone or download `tinygpt.py` and place it in your working directory.
+### 3. Load weights and run inference
+```python
+import tinygpt
+model = tinygpt.load_model_for_inference()
+prompts = [
+    "Hello, I'm a language model,",
+    "The human brain contains approximately",
+    "Photosynthesis is the process by which plants",
+    "The theory of relativity states that ",
+    "The Roman Empire fell due to several factors including",
+    "During the Industrial Revolution, workers ",
+    "To solve a quadratic equation, you must first",
+    "The key differences between mitosis and meiosis are ",
+    "Once upon a time in ancient India, there lived a king who ",
+]
+for prompt in prompts:
+    print(f"\n{'='*60}")
+    print(f"PROMPT: {prompt}")
+    print(f"{'='*60}")
+    print(model.generate_text(start_text=prompt, max_tokens=500, temperature=0.7))
+```
+### 4. Load the Hugging Face format model
+```bash
+pip install torch transformers safetensors huggingface_hub
+```
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+from transformers import GPT2LMHeadModel, GPT2Tokenizer
+model_id = "hemantvirmani/tinyGPT"
+tokenizer = GPT2Tokenizer.from_pretrained(model_id)
+model = GPT2LMHeadModel.from_pretrained(model_id)
+# Restore TinyGPT's trained LM-head bias for exact inference.
+weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
+state_dict = load_file(weights_path, device="cpu")
+if "lm_head.bias" in state_dict:
+    lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True)
+    lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"])
+    lm_head.bias = torch.nn.Parameter(state_dict["lm_head.bias"])
+    model.lm_head = lm_head
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model = model.to(device)
+model.eval()
+prompt = "Photosynthesis is the process by which plants"
+inputs = tokenizer(prompt, return_tensors="pt").to(device)
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        max_new_tokens=500,
+        do_sample=True,
+        temperature=0.7,
+        top_k=0,
+        top_p=1.0,
+        repetition_penalty=1.3,
+        pad_token_id=tokenizer.eos_token_id,
+    )
+print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+```
+You can also run the helper script from the GitHub repo:
+```bash
+python infer_hf.py --model_dir hemantvirmani/tinyGPT --prompt "Photosynthesis is the process by which plants"
+```
+---
+## Sample Outputs (temperature=0.7, 500 tokens)
+**Prompt:** `Photosynthesis is the process by which plants`
+> Photosynthesis is the process by which plants take in sunlight, water,
+> carbon dioxide and nutrients to produce energy for their cells. Humans
+> depend on photosynthesis to provide their own energy, but many plants
+> also use the energy of other organisms to produce food. The five types of...
+**Prompt:** `The Roman Empire fell due to several factors including`
+> The Roman Empire fell due to several factors including the decline of the
+> Roman army, the rise of the Papacy, and the threat of the Islamic invasion.
+> The fall of the Roman Empire was the result of a series of civil wars in
+> the late fourth century, and was led by the first emperor of the Roman
+> Empire, Constantine the Great.
+---
+## Limitations
+- This is a **base language model** — it completes text, it does not follow
+  instructions or answer questions.
+- Prone to repetition loops, especially at low temperature.
+- Fine-tuning required for instruction-following or domain-specific tasks.
+---
+## Thanks to
+- Andrej Karpathy's nanoGPT - Video and Code
+- Dataset: HuggingFace [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

pretraining/checkpoint/tinygpt_pretrained_checkpoint_438k.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:653308efbfbf616df128e652c48c3eac1ba72694d4cafed2aaae07e415c0a045
+size 2006991266

pretraining/tinygpt huggingface/config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "activation_function": "gelu",
+  "add_cross_attention": false,
+  "architectures": [
+    "GPT2LMHeadModel"
+  ],
+  "attn_pdrop": 0.0,
+  "bos_token_id": 50256,
+  "dtype": "float32",
+  "embd_pdrop": 0.0,
+  "eos_token_id": 50256,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "model_type": "gpt2",
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": null,
+  "n_layer": 12,
+  "n_positions": 1024,
+  "pad_token_id": null,
+  "reorder_and_upcast_attn": false,
+  "resid_pdrop": 0.0,
+  "scale_attn_by_inverse_layer_idx": false,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.1,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "tie_word_embeddings": false,
+  "transformers_version": "5.3.0",
+  "use_cache": true,
+  "vocab_size": 50257
+}

pretraining/tinygpt huggingface/generation_config.json ADDED

	@@ -0,0 +1,9 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 50256,
+  "eos_token_id": 50256,
+  "output_attentions": false,
+  "output_hidden_states": false,
+  "transformers_version": "5.3.0",
+  "use_cache": true
+}

pretraining/tinygpt huggingface/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d89305ea93a964f09e6ed382eb3f24726997bf564601fa46e2f6d226cfc0cf53
+size 652365020

pretraining/tinygpt huggingface/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

pretraining/tinygpt huggingface/tokenizer_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": "<|endoftext|>",
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "is_local": false,
+  "model_max_length": 1024,
+  "pad_token": null,
+  "tokenizer_class": "GPT2Tokenizer",
+  "unk_token": "<|endoftext|>"
+}