	---
	license: mit
	---
	# TinyGPT — GPT-2 Style LM (~163M) trained on FineWeb-Edu

	A GPT-2 style decoder-only transformer pretrained from scratch on ~43B tokens
	of the FineWeb-Edu dataset, achieving a validation loss of 2.84.

	I built this project to develop hands-on intuition for LLMs. It is inspired by Andrej Karpathy's nanoGPT.

	---

	## Model Details

	| Parameter | Value |
	|-----------|-------|
	| Architecture | Decoder-only Transformer (GPT-2 style) |
	| Parameters | ~163M |
	| Layers | 12 |
	| Attention heads | 12 |
	| Embedding dim | 768 |
	| Context length | 1024 tokens |
	| Vocab size | 50,257 |
	| Tokenizer | GPT-2 BPE via `tiktoken` |
	| Attention | Causal self-attention (Flash Attention via `F.scaled_dot_product_attention`) |
	| LM head | Separate linear layer (not weight-tied) |

	> Why ~163M and not 124M? Standard GPT-2 124M ties the LM head weights
	> with the token embedding table, saving ~38M parameters. TinyGPT uses a
	> separate `nn.Linear` head, resulting in ~163M total parameters.
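The arithmetic behind that note can be checked with a short sketch. It assumes the standard GPT-2 layout (learned position embeddings, a bias on every linear layer and LayerNorm, a 4x MLP expansion); TinyGPT's exact layer definitions are not shown on this card, so treat the totals as approximate. The function name `gpt2_param_count` is illustrative and not part of the repo.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab=50257, ctx=1024,
                     tied_head=True, head_bias=False):
    """Count parameters of a GPT-2 style model (assumed standard layout)."""
    wte = vocab * n_embd                      # token embedding table
    wpe = ctx * n_embd                        # learned position embeddings
    ln = 2 * n_embd                           # LayerNorm weight + bias
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, both with bias.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = ln + attn + ln + mlp
    total = wte + wpe + n_layer * block + ln  # trailing ln is the final LayerNorm
    if not tied_head:
        total += vocab * n_embd               # separate LM-head weight matrix
    if head_bias:
        total += vocab                        # optional LM-head bias
    return total


tied = gpt2_param_count(tied_head=True)
untied = gpt2_param_count(tied_head=False)
untied_bias = gpt2_param_count(tied_head=False, head_bias=True)

print(f"tied head:          {tied:,}")
print(f"separate head:      {untied:,}")
print(f"separate head+bias: {untied_bias:,}")
```

Under these assumptions the untied head adds 50,257 x 768 = 38,597,376 parameters, which accounts for the gap between ~124M and ~163M.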

	---

	## Training Details

	| Detail | Value |
	|--------|-------|
	| Dataset | [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (`sample-100BT` subset) |
	| Tokens trained | ~43B |
	| Validation loss | 2.84 |
	| Optimizer | AdamW (betas=(0.9, 0.95), eps=1e-8) |
	| Learning rate | 6e-4 |
	| LR schedule | Linear warmup (4000 steps) -> Cosine decay to 6e-5 |
	| Effective batch size | 512 (16 x 32 gradient accumulation steps) |
	| Weight decay | 0.1 |
	| Gradient clipping | 1.0 |
	| Precision | bfloat16 (bf16) |
	| Max iterations | 600,000 |
	| Dropout | 0.0 |
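The schedule and batch figures in the table can be sketched in a few lines. This is a minimal illustration, not the training script: it assumes the cosine decay runs from the end of warmup to the "Max iterations" value, and that the effective batch size counts sequences of the full 1024-token context. The names `lr_at`, `tokens_per_step`, and `steps_for_43b` are made up for this example.

```python
import math

MAX_LR, MIN_LR = 6e-4, 6e-5
WARMUP_STEPS = 4_000
MAX_STEPS = 600_000  # assumption: the decay horizon equals "Max iterations"


def lr_at(step):
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    if step >= MAX_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)


# Assumption: micro-batch of 16 sequences x 32 accumulation steps x 1024 tokens.
tokens_per_step = 16 * 32 * 1024
steps_for_43b = 43e9 / tokens_per_step

for step in (0, 2_000, 4_000, 302_000, 600_000):
    print(f"step {step:>7,}: lr = {lr_at(step):.2e}")
print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"approx. steps for 43B tokens: {steps_for_43b:,.0f}")
```

If those assumptions hold, each optimizer step consumes 524,288 tokens, so ~43B tokens correspond to roughly 82,000 steps.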

	---

	## Format

	Weights are saved in PyTorch native format — a plain state dict saved with
	`torch.save()`, containing only model weights (no optimizer state, no
	scheduler). The file is ~670MB.

	To load the weights, you need the `TinyGPT` model class from the GitHub repo (see Usage).

	The model is also available in Hugging Face Transformers format in this
	repository. The HF-format files include:

	- `model.safetensors`
	- `config.json`
	- `generation_config.json`
	- `tokenizer.json`
	- `tokenizer_config.json`

	The HF-format model can be loaded with `transformers` and is useful for standard
	Hugging Face workflows. Note that TinyGPT was trained with a separate,
	non-weight-tied LM head that includes a trained bias. Standard
	`GPT2LMHeadModel.from_pretrained()` loads the main model weights but treats
	`lm_head.bias` as an unexpected key because the default GPT-2 head is biasless.
	For exact TinyGPT inference, restore the LM-head bias as shown below or use
	`infer_hf.py` from the GitHub repo.

	---

	## Usage

	### 1. Install dependencies

	Clone the repo and install requirements:

	```bash
	git clone https://github.com/hemantvirmani/tinygpt
	cd tinygpt
	pip install -r requirements.txt
	```

	### 2. Get the model class

	The `TinyGPT` model class is available at:
	[https://github.com/hemantvirmani/tinygpt](https://github.com/hemantvirmani/tinygpt)

	Clone or download `tinygpt.py` and place it in your working directory.

	### 3. Load weights and run inference

	```python
	import tinygpt

	model = tinygpt.load_model_for_inference()

	prompts = [
	    "Hello, I'm a language model,",
	    "The human brain contains approximately",
	    "Photosynthesis is the process by which plants",
	    "The theory of relativity states that ",
	    "The Roman Empire fell due to several factors including",
	    "During the Industrial Revolution, workers ",
	    "To solve a quadratic equation, you must first",
	    "The key differences between mitosis and meiosis are ",
	    "Once upon a time in ancient India, there lived a king who ",
	]

	# Generate a completion for each prompt and print it under a banner.
	for prompt in prompts:
	    print(f"\n{'='*60}")
	    print(f"PROMPT: {prompt}")
	    print(f"{'='*60}")
	    print(model.generate_text(start_text=prompt, max_tokens=500, temperature=0.7))
	```

	### 4. Load the Hugging Face format model

	```bash
	pip install torch transformers safetensors huggingface_hub
	```

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from transformers import GPT2LMHeadModel, GPT2Tokenizer

	model_id = "hemantvirmani/tinyGPT"

	tokenizer = GPT2Tokenizer.from_pretrained(model_id)
	model = GPT2LMHeadModel.from_pretrained(model_id)

	# Restore TinyGPT's trained LM-head bias for exact inference.
	weights_path = hf_hub_download(repo_id=model_id, filename="model.safetensors")
	state_dict = load_file(weights_path, device="cpu")
	if "lm_head.bias" in state_dict:
	    lm_head = torch.nn.Linear(model.config.n_embd, model.config.vocab_size, bias=True)
	    lm_head.weight = torch.nn.Parameter(state_dict["lm_head.weight"])
	    lm_head.bias = torch.nn.Parameter(state_dict["lm_head.bias"])
	    model.lm_head = lm_head

	device = "cuda" if torch.cuda.is_available() else "cpu"
	model = model.to(device)
	model.eval()

	prompt = "Photosynthesis is the process by which plants"
	inputs = tokenizer(prompt, return_tensors="pt").to(device)

	with torch.no_grad():
	    output_ids = model.generate(
	        **inputs,
	        max_new_tokens=500,
	        do_sample=True,
	        temperature=0.7,
	        top_k=0,
	        top_p=1.0,
	        repetition_penalty=1.3,
	        pad_token_id=tokenizer.eos_token_id,
	    )

	print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
	```

	You can also run the helper script from the GitHub repo:

	```bash
	python infer_hf.py --model_dir hemantvirmani/tinyGPT --prompt "Photosynthesis is the process by which plants"
	```

	---

	## Sample Outputs (temperature=0.7, 500 tokens)

	Prompt: `Photosynthesis is the process by which plants`
	> Photosynthesis is the process by which plants take in sunlight, water,
	> carbon dioxide and nutrients to produce energy for their cells. Humans
	> depend on photosynthesis to provide their own energy, but many plants
	> also use the energy of other organisms to produce food. The five types of...

	Prompt: `The Roman Empire fell due to several factors including`
	> The Roman Empire fell due to several factors including the decline of the
	> Roman army, the rise of the Papacy, and the threat of the Islamic invasion.
	> The fall of the Roman Empire was the result of a series of civil wars in
	> the late fourth century, and was led by the first emperor of the Roman
	> Empire, Constantine the Great.

	---

	## Limitations

	- This is a base language model: it completes text but does not follow
	  instructions or answer questions.
	- It is prone to repetition loops, especially at low temperature.
	- Fine-tuning is required for instruction-following or domain-specific tasks.
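One way to see why low temperature encourages repetition is to look at how temperature reshapes the next-token distribution. The sketch below uses made-up logits and a plain softmax; it is a toy illustration, not TinyGPT's sampling code, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.0]  # toy values, not taken from the model

for t in (1.0, 0.7, 0.2):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As the temperature drops, probability mass concentrates on the top token, so sampling behaves more like greedy decoding and the model is more likely to fall into a loop. Raising the temperature or applying a repetition penalty, as in the Hugging Face example above, are common ways to counteract this.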

	---

	## Thanks to

	- Andrej Karpathy's nanoGPT (video and code)
	- Dataset: Hugging Face [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)