	---
	language: en
	license: mit
	base_model: openai-community/gpt2-medium
	tags:
	- gpt2
	- instruction-tuning
	- text-generation
	- alpaca
	- pytorch-lightning
	- causal-lm
	datasets:
	- yahma/alpaca-cleaned
	pipeline_tag: text-generation
	model_name: gpt2-medium-instruct
	---

	# GPT-2 Medium Instruct

	A 355M-parameter GPT-2 Medium model fine-tuned on the `yahma/alpaca-cleaned` instruction dataset, using a custom training pipeline written from scratch in PyTorch Lightning.

	---

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | `openai-community/gpt2-medium` |
	| Parameters | ~355M |
	| Architecture | GPT-2 (decoder-only transformer) |
	| Fine-tuning dataset | `yahma/alpaca-cleaned` (10,000 training samples) |
	| Context length | 1,024 tokens |
	| Vocabulary size | 50,257 tokens |
	| Embedding dim | 1,024 |
	| Transformer layers | 24 |
	| Attention heads | 16 |
	| Tokenizer | GPT-2 BPE (via `tiktoken` / HF `GPT2Tokenizer`) |

	---

	## Training Details

	### Dataset

	The model was fine-tuned on the [`yahma/alpaca-cleaned`](https://huggingface.co/datasets/yahma/alpaca-cleaned) dataset, a cleaned version of Stanford Alpaca's 52K instruction-following examples generated with `text-davinci-003`.

	| Split | Samples |
	|---|---|
	| Train | 10,000 |
	| Validation | 1,000 |
	| Test | 1,000 |
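
The card does not say how the three splits were drawn. One plausible approach, shown here only as a sketch, is to shuffle row indices with a fixed seed and slice them. The function name `split_indices`, the seed, and the total of 52,000 rows (rounded from the 52K figure above) are assumptions for illustration, not the author's actual procedure.

```python
import random


def split_indices(n_total, n_train=10_000, n_val=1_000, n_test=1_000, seed=123):
    """Return disjoint train/val/test index lists drawn from range(n_total)."""
    needed = n_train + n_val + n_test
    if needed > n_total:
        raise ValueError(f"need {needed} rows but only {n_total} are available")
    indices = list(range(n_total))
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:needed]
    return train, val, test


# 52_000 is an approximation of the dataset size (assumption).
train_idx, val_idx, test_idx = split_indices(n_total=52_000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index lists could then be used to select rows from the loaded dataset.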

	### Prompt Format

	The model uses the standard Alpaca prompt template:

	```
	Below is an instruction that describes a task. Write a response that appropriately completes the request.

	### Instruction:
	{instruction}

	### Input:
	{input}   ← the whole "### Input:" section is omitted when the input is empty

	### Response:
	{output}
	```

	During training, the prompt tokens (instruction and input) are set to `-100` in the targets, so the loss is computed only on the response tokens. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. This is a common technique that trains the model to produce responses rather than to reproduce the prompt.
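
The masking step can be sketched in plain Python on lists of token IDs. This is a minimal illustration, not the repository's actual dataset code: the function name `build_inputs_and_targets` and the example token IDs are made up, and the one-position shift assumes the usual next-token setup where `targets[i]` is the token that follows `inputs[i]`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_inputs_and_targets(prompt_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Build next-token inputs/targets with the prompt positions masked out."""
    full = list(prompt_ids) + list(response_ids)
    inputs = full[:-1]   # everything except the last token
    targets = full[1:]   # the same sequence shifted left by one

    # After the shift, the first len(prompt_ids) - 1 targets are still prompt
    # tokens. The target at the last prompt position is the first response
    # token, so it stays unmasked.
    n_masked = min(max(len(prompt_ids) - 1, 0), len(targets))
    targets = [ignore_index] * n_masked + targets[n_masked:]
    return inputs, targets


# Hypothetical token IDs: three prompt tokens, then a response ending in
# GPT-2's end-of-text token (50256).
inputs, targets = build_inputs_and_targets([10, 11, 12], [20, 21, 50256])
print(inputs)   # [10, 11, 12, 20, 21]
print(targets)  # [-100, -100, 20, 21, 50256]
```

Positions holding `-100` contribute nothing to the loss, so gradients come only from the response tokens.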

	### Optimizer

	| Hyperparameter | Value |
	|---|---|
	| Optimizer | AdamW |
	| Learning rate | `3e-5` |
	| Weight decay | `0.1` |
	| Beta1 / Beta2 | `0.9` / `0.95` |
	| Gradient clip | `1.0` |
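
As a rough illustration, the hyperparameters in this table map onto PyTorch as follows. The tiny `nn.Linear` is only a stand-in for the real model, and treating the gradient clip as a global-norm clip is an assumption (the table does not say whether clipping is by norm or by value).

```python
import torch
from torch import nn

model = nn.Linear(8, 8)  # stand-in for the real GPT model (assumption)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)

# One illustrative optimization step on random data.
loss = model(torch.randn(2, 8)).pow(2).mean()
loss.backward()

# Rescale gradients so their global L2 norm is at most 1.0
# (assumed to be norm-based clipping).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
optimizer.zero_grad()
```

In a PyTorch Lightning setup, the optimizer would typically be returned from `configure_optimizers`, and clipping would be requested through the Trainer's `gradient_clip_val` argument rather than called by hand.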

	### Training Config

	| Setting | Value |
	|---|---|
	| Framework | PyTorch Lightning |
	| Epochs | 2 (+ 1 continuation epoch) |
	| Batch size (per device) | 2 |
	| Gradient accumulation steps | 4 |
	| Effective batch size | 8 |
	| Precision | `16-mixed` (FP16 + FP32) |
	| Hardware | Single GPU (Colab) |
	| Early stopping patience | 3 validation checks |
	| Checkpoint metric | `val_loss_eval` (minimize) |
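
The batch-size figures in this table follow from simple arithmetic, worked through below. The `trainer_kwargs` dictionary shows how the settings could be passed to Lightning's `Trainer`; the exact arguments the author used are not given in this card, so treat it as a hedged sketch.

```python
import math

train_samples = 10_000
per_device_batch = 2
accumulate_grad_batches = 4
num_devices = 1

# Samples that contribute to each optimizer step.
effective_batch = per_device_batch * accumulate_grad_batches * num_devices
# Forward/backward passes per epoch.
batches_per_epoch = math.ceil(train_samples / per_device_batch)
# Weight updates per epoch.
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)

print(effective_batch)            # 8
print(batches_per_epoch)          # 5000
print(optimizer_steps_per_epoch)  # 1250

# Keyword arguments that could be passed as lightning.Trainer(**trainer_kwargs).
# Early stopping (patience=3) and checkpointing on "val_loss_eval" would be
# added through the EarlyStopping and ModelCheckpoint callbacks.
trainer_kwargs = {
    "max_epochs": 2,
    "accumulate_grad_batches": accumulate_grad_batches,
    "precision": "16-mixed",
    "gradient_clip_val": 1.0,
    "devices": num_devices,
}
```

With these numbers, two epochs amount to roughly 2,500 weight updates.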

	---

	## Usage

	### Basic Inference

	```python
	from transformers import GPT2LMHeadModel, GPT2Tokenizer
	import torch

	model_id = "snehangshu511/gpt2-medium-instruct"

	tokenizer = GPT2Tokenizer.from_pretrained(model_id)
	model = GPT2LMHeadModel.from_pretrained(model_id)
	model.eval()

	def build_prompt(instruction, input_text=""):
	    base = (
	        "Below is an instruction that describes a task. "
	        "Write a response that appropriately completes the request.\n\n"
	    )
	    if input_text.strip():
	        return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
	    return f"{base}### Instruction:\n{instruction}\n\n### Response:\n"

	prompt = build_prompt("Explain what machine learning is in simple terms.")
	inputs = tokenizer(prompt, return_tensors="pt")

	with torch.no_grad():
	    output_ids = model.generate(
	        **inputs,
	        max_new_tokens=200,
	        do_sample=True,
	        temperature=0.7,
	        top_p=0.9,
	        top_k=50,
	        repetition_penalty=1.2,
	        pad_token_id=tokenizer.eos_token_id,
	        eos_token_id=tokenizer.eos_token_id,
	    )

	# Decode only the newly generated tokens (strip the prompt)
	input_len = inputs["input_ids"].shape[1]
	response = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True)
	print(response)
	```

	### With Optional Input Context

	```python
	prompt = build_prompt(
	    instruction="Summarize the following text.",
	    input_text="The Industrial Revolution began in Britain in the 18th century...",
	)
	```

	### Recommended Generation Settings

	| Setting | Recommended range | Effect |
	|---|---|---|
	| `temperature` | 0.6 – 0.9 | Higher = more creative, lower = more deterministic |
	| `top_p` | 0.85 – 0.95 | Nucleus sampling: keeps the smallest set of tokens whose cumulative probability reaches `top_p` |
	| `top_k` | 40 – 60 | Restricts sampling to the K most likely tokens at each step |
	| `repetition_penalty` | 1.1 – 1.3 | Higher = less repetition in output |
	| `max_new_tokens` | 100 – 300 | Keep under 800 so that prompt plus output stays within the 1,024-token context window |
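
Because the prompt and the generated tokens share the same 1,024-token window, one way to stay within it is to cap `max_new_tokens` by the space left after the prompt. The helper below is a sketch; `clamp_max_new_tokens` is a hypothetical name, not part of `transformers`.

```python
CONTEXT_LENGTH = 1024  # GPT-2 Medium context window, per the Model Details table


def clamp_max_new_tokens(prompt_len, requested, context_length=CONTEXT_LENGTH):
    """Return the largest generation budget that still fits in the context."""
    if prompt_len >= context_length:
        raise ValueError("prompt already fills the context window; truncate it first")
    return min(requested, context_length - prompt_len)


print(clamp_max_new_tokens(prompt_len=300, requested=200))  # 200
print(clamp_max_new_tokens(prompt_len=900, requested=200))  # 124
```

In the inference example above, the prompt length would be `inputs["input_ids"].shape[1]`, and the returned value would be passed as `max_new_tokens`.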

	---

	## Architecture Notes

	The training code uses a custom `GPTModel` class implemented from scratch (no `AutoModel` during training) and initialized from the pretrained GPT-2 Medium weights. For this Hub upload, the weights were converted from the custom format to the HF-compatible `GPT2LMHeadModel` format.

	Key architectural decisions:

	- Weight tying disabled (`tie_word_embeddings=False`): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, `lm_head.weight` was explicitly cloned to avoid shared-memory issues with `safetensors`. The config reflects this.

	- QKV separation: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard `c_attn` format that `GPT2LMHeadModel` expects.

	- Drop rate = 0.0: Dropout is disabled during fine-tuning, a common choice when fine-tuning a pretrained model for a small number of epochs.
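
The QKV re-fusing step can be sketched as follows. This is an illustration under stated assumptions, not the repository's conversion script: `fuse_qkv` is a hypothetical helper, the separate projections are assumed to be `nn.Linear` layers with biases, and a small embedding size stands in for the real 1,024. The point to get right is the layout: `nn.Linear` stores its weight as `(out_features, in_features)`, while the `Conv1D` layer behind HF's `c_attn` stores `(in_features, out_features)`, so each weight is transposed before concatenation.

```python
import torch
from torch import nn


def fuse_qkv(w_query: nn.Linear, w_key: nn.Linear, w_value: nn.Linear):
    """Fuse separate Q, K, V linear layers into c_attn-style weight and bias."""
    # nn.Linear computes x @ W.T + b; HF's Conv1D computes x @ W + b.
    # Transposing each weight and concatenating along the output dimension
    # gives a (emb_dim, 3 * emb_dim) matrix in Q, K, V order.
    weight = torch.cat(
        [w_query.weight.T, w_key.weight.T, w_value.weight.T], dim=1
    )
    bias = torch.cat([w_query.bias, w_key.bias, w_value.bias], dim=0)
    return weight.detach().contiguous(), bias.detach().contiguous()


torch.manual_seed(0)
emb_dim = 16  # small stand-in; GPT-2 Medium uses 1,024
q_proj, k_proj, v_proj = (nn.Linear(emb_dim, emb_dim) for _ in range(3))
c_attn_weight, c_attn_bias = fuse_qkv(q_proj, k_proj, v_proj)

# Check that the fused projection reproduces the separate ones.
x = torch.randn(2, 5, emb_dim)
fused_out = x @ c_attn_weight + c_attn_bias
q_out, k_out, v_out = fused_out.split(emb_dim, dim=-1)
print(torch.allclose(q_out, q_proj(x), atol=1e-5))
print(torch.allclose(v_out, v_proj(x), atol=1e-5))
```

For GPT-2 Medium, the fused weight would have shape `(1024, 3072)` and the fused bias `(3072,)`.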

	---

	## Files in This Repository

	| File | Description |
	|---|---|
	| `model.safetensors` | Model weights in safetensors format (recommended) |
	| `pytorch_model.bin` | Model weights in legacy `.bin` format |
	| `config.json` | `GPT2Config`: model architecture definition |
	| `generation_config.json` | Default generation settings |
	| `tokenizer.json` | Fast tokenizer file |
	| `tokenizer_config.json` | Tokenizer configuration |
	| `checkpoints/model.ckpt` | Original PyTorch Lightning training checkpoint |

	---

	## Limitations

	- Small training subset: Only 10,000 of the available ~52,000 Alpaca samples were used. A full dataset run would likely yield noticeably better results.
	- GPT-2 base: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks.
	- No RLHF: The model is instruction-tuned via supervised fine-tuning only — no reinforcement learning from human feedback. It may produce responses that are grammatically correct but factually wrong.
	- Context length: Hard-limited to 1,024 tokens. The prompt and the generated tokens must fit in this window together, so long prompts have to be truncated.
	- No safety alignment: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures.

	---

	## Training Pipeline Summary

	```
	yahma/alpaca-cleaned (52K rows)
	↓ load 10K rows
	Alpaca prompt formatting
	↓
	tiktoken BPE tokenization
	↓ -100 masking on prompt tokens
	Custom PyTorch Dataset + DataLoader (dynamic padding)
	↓
	GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium
	↓
	PyTorch Lightning fine-tuning
	  - AdamW, lr=3e-5, 2 epochs
	  - FP16 mixed precision
	  - Gradient accumulation (eff. batch = 8)
	  - Checkpoint on best val_loss_eval
	↓
	Lightning prefix stripped → raw GPTModel state dict
	↓
	Custom → HF format conversion (QKV fusing, key renaming)
	↓
	Saved as model.safetensors + pytorch_model.bin
	↓
	Pushed to snehangshu511/gpt2-medium-instruct
	```
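
The "Lightning prefix stripped" step refers to the fact that a Lightning checkpoint stores weights under `checkpoint["state_dict"]`, with keys prefixed by the attribute name under which the model lives inside the `LightningModule`. A minimal sketch of the stripping, assuming the prefix is `model.` (the actual attribute name is not stated in this card) and using made-up keys with placeholder values instead of tensors:

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    stripped = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in stripped:
            raise KeyError(f"duplicate key after stripping prefix: {new_key}")
        stripped[new_key] = value
    return stripped


# Hypothetical keys; real values would be tensors loaded with torch.load.
lightning_state = {
    "model.tok_emb.weight": "tensor-0",
    "model.trf_blocks.0.att.W_query.weight": "tensor-1",
    "model.out_head.weight": "tensor-2",
}
raw_state = strip_prefix(lightning_state)
print(sorted(raw_state))
```

The resulting dictionary can be loaded into the bare custom model before the QKV fusing and key renaming.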

	---

	## Citation

	If you use this model, please also cite the resources it was built from:

	```bibtex
	@book{raschka2024llms,
	  title = {Build a Large Language Model (From Scratch)},
	  author = {Sebastian Raschka},
	  year = {2024},
	  publisher = {Manning Publications}
	}

	@misc{alpaca,
	  title = {Stanford Alpaca: An Instruction-following LLaMA model},
	  author = {Taori et al.},
	  year = {2023},
	  url = {https://github.com/tatsu-lab/stanford_alpaca}
	}
	```

	---

	## Author

	Snehangshu Bhuin — Data Scientist
	GitHub: [snehangshu2002](https://github.com/snehangshu2002)
	Built as part of ongoing LLM learning and portfolio development.