---
language: en
license: mit
base_model: openai-community/gpt2-medium
tags:
- gpt2
- instruction-tuning
- text-generation
- alpaca
- pytorch-lightning
- causal-lm
datasets:
- yahma/alpaca-cleaned
pipeline_tag: text-generation
model_name: gpt2-medium-instruct
---
# GPT-2 Medium Instruct
A **355M parameter GPT-2 Medium** model fine-tuned on the `yahma/alpaca-cleaned` instruction dataset, starting from the pretrained `openai-community/gpt2-medium` weights, with a custom training pipeline written from scratch in PyTorch Lightning.
---
## Model Details
| Property | Value |
|---|---|
| **Base model** | `openai-community/gpt2-medium` |
| **Parameters** | ~355M |
| **Architecture** | GPT-2 (decoder-only transformer) |
| **Fine-tuning dataset** | `yahma/alpaca-cleaned` (10,000 training samples) |
| **Context length** | 1,024 tokens |
| **Vocabulary size** | 50,257 tokens |
| **Embedding dim** | 1,024 |
| **Transformer layers** | 24 |
| **Attention heads** | 16 |
| **Tokenizer** | GPT-2 BPE (via `tiktoken` / HF `GPT2Tokenizer`) |
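
The ~355M figure can be reproduced from the dimensions in the table. The sketch below is a back-of-the-envelope count, assuming the standard GPT-2 block layout (biases on every linear layer, a 4x MLP expansion, learned position embeddings, and a final LayerNorm). Those layout details are assumptions rather than something the table states, and `gpt2_param_count` is an illustrative name. The number of attention heads does not change the count.

```python
def gpt2_param_count(vocab_size, context_length, emb_dim, n_layers, tied_head=True):
    """Count parameters of a GPT-2 style decoder (assumed standard layout)."""
    token_emb = vocab_size * emb_dim
    pos_emb = context_length * emb_dim

    attn_qkv = emb_dim * 3 * emb_dim + 3 * emb_dim  # fused Q, K, V projection + bias
    attn_out = emb_dim * emb_dim + emb_dim          # attention output projection + bias
    mlp_up = emb_dim * 4 * emb_dim + 4 * emb_dim    # MLP expansion + bias
    mlp_down = 4 * emb_dim * emb_dim + emb_dim      # MLP contraction + bias
    layer_norms = 2 * 2 * emb_dim                   # two LayerNorms, scale + shift each
    per_layer = attn_qkv + attn_out + mlp_up + mlp_down + layer_norms

    final_norm = 2 * emb_dim
    total = token_emb + pos_emb + n_layers * per_layer + final_norm
    if not tied_head:
        # An untied output head stores its own (vocab_size x emb_dim) matrix.
        total += vocab_size * emb_dim
    return total


tied = gpt2_param_count(50257, 1024, 1024, 24)
untied = gpt2_param_count(50257, 1024, 1024, 24, tied_head=False)
print(f"{tied:,}")    # 354,823,168
print(f"{untied:,}")  # 406,286,336
```

Under these assumptions, a checkpoint that stores `lm_head.weight` as a separate tensor would hold roughly 406M values, even though the architecture is still the 355M GPT-2 Medium.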
---
## Training Details
### Dataset
The model was fine-tuned on the [`yahma/alpaca-cleaned`](https://huggingface.co/datasets/yahma/alpaca-cleaned) dataset, a cleaned version of Stanford Alpaca's 52K instruction-following examples generated with `text-davinci-003`.
| Split | Samples |
|---|---|
| Train | 10,000 |
| Validation | 1,000 |
| Test | 1,000 |
### Prompt Format
The model uses the standard **Alpaca prompt template**:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Input:
{input}          ← omitted if empty
### Response:
{output}
```
During training, the **instruction + input portion is masked** with `-100` in the targets so the loss is only computed on the response tokens. This is the standard technique to make the model learn *how to respond* rather than memorize the prompt structure.
### Optimizer
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | `3e-5` |
| Weight decay | `0.1` |
| Beta1 / Beta2 | `0.9` / `0.95` |
| Gradient clip | `1.0` |
### Training Config
| Setting | Value |
|---|---|
| Framework | PyTorch Lightning |
| Epochs | 2 (+ 1 continuation epoch) |
| Batch size (per device) | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 |
| Precision | `16-mixed` (FP16 + FP32) |
| Hardware | Single GPU (Colab) |
| Early stopping patience | 3 validation checks |
| Checkpoint metric | `val_loss_eval` (minimize) |
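
The effective batch size follows directly from the per-device batch size and the accumulation steps. The short calculation below also estimates the number of optimizer steps, assuming a single GPU and that no training samples are dropped; the variable names are illustrative and the step counts are derived, not reported by the training run.

```python
import math

train_samples = 10_000
batch_size = 2            # per-device batch size
accumulation_steps = 4    # micro-batches accumulated per optimizer step
epochs = 2

effective_batch_size = batch_size * accumulation_steps
micro_batches_per_epoch = math.ceil(train_samples / batch_size)
optimizer_steps_per_epoch = math.ceil(micro_batches_per_epoch / accumulation_steps)

print(effective_batch_size)                # 8
print(micro_batches_per_epoch)             # 5000
print(optimizer_steps_per_epoch)           # 1250
print(optimizer_steps_per_epoch * epochs)  # 2500
```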
---
## Usage
### Basic Inference
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_id = "snehangshu511/gpt2-medium-instruct"
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()


def build_prompt(instruction, input_text=""):
    base = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    # Include the "### Input:" section only when an input is provided
    if input_text.strip():
        return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    return f"{base}### Instruction:\n{instruction}\n\n### Response:\n"


prompt = build_prompt("Explain what machine learning is in simple terms.")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (strip the prompt)
input_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True)
print(response)
```
### With Optional Input Context
```python
prompt = build_prompt(
    instruction="Summarize the following text.",
    input_text="The Industrial Revolution began in Britain in the 18th century...",
)
```
### Recommended Generation Settings
| Setting | Recommended range | Effect |
|---|---|---|
| `temperature` | 0.6–0.9 | Higher = more creative, lower = more deterministic |
| `top_p` | 0.85–0.95 | Nucleus sampling: samples only from the smallest set of tokens whose cumulative probability reaches `top_p` |
| `top_k` | 40–60 | Limits candidates to the K most likely tokens at each step |
| `repetition_penalty` | 1.1–1.3 | Higher = less repetition in output |
| `max_new_tokens` | 100–300 | Prompt length plus `max_new_tokens` must not exceed the 1,024-token context window |
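
Because the prompt and the generated tokens share the 1,024-token context window, it can help to clamp `max_new_tokens` to the remaining budget before calling `generate`. The helper below is a hypothetical sketch (the name `clamp_max_new_tokens` is not part of this repository or of `transformers`); it only does the arithmetic and leaves prompt truncation to the caller.

```python
CONTEXT_LENGTH = 1024  # context window of GPT-2 Medium


def clamp_max_new_tokens(prompt_len, requested, context_length=CONTEXT_LENGTH):
    """Return the largest generation budget that still fits in the context window."""
    remaining = context_length - prompt_len
    if remaining <= 0:
        raise ValueError(
            f"Prompt uses {prompt_len} tokens; nothing left of the "
            f"{context_length}-token window. Shorten the prompt first."
        )
    return min(requested, remaining)


print(clamp_max_new_tokens(prompt_len=224, requested=200))  # 200
print(clamp_max_new_tokens(prompt_len=900, requested=200))  # 124
```

With the inference snippet above, `prompt_len` would be `inputs["input_ids"].shape[1]`.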
---
## Architecture Notes
The architecture was implemented **from scratch** as a custom `GPTModel` class (no `AutoModel` during training) and initialized with the pretrained GPT-2 Medium weights. After fine-tuning, the weights were converted from the custom format to the HF-compatible `GPT2LMHeadModel` format for this Hub upload.
Key architectural decisions:
- **Weight tying disabled** (`tie_word_embeddings=False`): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, `lm_head.weight` was explicitly cloned to avoid shared-memory issues with `safetensors`. The config reflects this.
- **QKV separation**: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard `c_attn` format that `GPT2LMHeadModel` expects.
- **Drop rate = 0.0**: Dropout is disabled during fine-tuning, a common choice when starting from pretrained weights and training for only a few epochs.
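
The QKV re-fusing step can be sketched as follows. This is an illustration under stated assumptions, not the repository's conversion script: it uses NumPy arrays in place of tensors, assumes the separate projections follow the `torch.nn.Linear` layout `(out_features, in_features)` and carry biases, and `fuse_qkv` is a hypothetical name. The HF GPT-2 `c_attn` layer stores its weight as `(in_features, 3 * out_features)` and splits its output into query, key, and value in that order, so each matrix is transposed before concatenation.

```python
import numpy as np


def fuse_qkv(w_q, w_k, w_v, b_q, b_k, b_v):
    """Fuse separate Q/K/V Linear parameters into a c_attn-style weight and bias.

    Each w_* has shape (out_features, in_features), as in torch.nn.Linear.
    The fused weight has shape (in_features, 3 * out_features).
    """
    weight = np.concatenate([w_q.T, w_k.T, w_v.T], axis=1)
    bias = np.concatenate([b_q, b_k, b_v])
    return weight, bias


# Toy check with a small hidden size (the real model uses 1,024).
rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
b_q, b_k, b_v = (rng.normal(size=d) for _ in range(3))
x = rng.normal(size=(2, d))

weight, bias = fuse_qkv(w_q, w_k, w_v, b_q, b_k, b_v)
# The fused layer computes x @ weight + bias, then splits into thirds.
q, k, v = np.split(x @ weight + bias, 3, axis=-1)

# Each third matches the corresponding separate Linear projection.
assert np.allclose(q, x @ w_q.T + b_q)
assert np.allclose(k, x @ w_k.T + b_k)
assert np.allclose(v, x @ w_v.T + b_v)
```

The same operation on real tensors would use `torch.cat` in place of `np.concatenate`.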
---
## Files in This Repository
| File | Description |
|---|---|
| `model.safetensors` | Model weights in safetensors format (recommended) |
| `pytorch_model.bin` | Model weights in legacy `.bin` format |
| `config.json` | `GPT2Config`: model architecture definition |
| `generation_config.json` | Default generation settings |
| `tokenizer.json` | Fast tokenizer file |
| `tokenizer_config.json` | Tokenizer configuration |
| `checkpoints/model.ckpt` | Original PyTorch Lightning training checkpoint |
---
## Limitations
- **Small training subset**: Only 10,000 of the available ~52,000 Alpaca samples were used. A full dataset run would likely yield noticeably better results.
- **GPT-2 base**: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks.
- **No RLHF**: The model is instruction-tuned via supervised fine-tuning only, with no reinforcement learning from human feedback. It may produce responses that are grammatically correct but factually wrong.
- **Context length**: Hard-limited to 1,024 tokens. Long prompts can get truncated.
- **No safety alignment**: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures.
---
## Training Pipeline Summary
```
yahma/alpaca-cleaned (52K rows)
        ↓  load 10K rows
Alpaca prompt formatting
        ↓
tiktoken BPE tokenization
        ↓  -100 masking on prompt tokens
Custom PyTorch Dataset + DataLoader (dynamic padding)
        ↓
GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium
        ↓
PyTorch Lightning fine-tuning
    - AdamW, lr=3e-5, 2 epochs
    - FP16 mixed precision
    - Gradient accumulation (eff. batch = 8)
    - Checkpoint on best val_loss
        ↓
Lightning prefix stripped → raw GPTModel state dict
        ↓
Custom → HF format conversion (QKV fusing, key renaming)
        ↓
Saved as model.safetensors + pytorch_model.bin
        ↓
Pushed to snehangshu511/gpt2-medium-instruct
```
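
The "Lightning prefix stripped" step can be illustrated with a small dictionary transform. In a Lightning checkpoint the weights sit under the `state_dict` key, and each parameter name is prefixed with the attribute name that holds the wrapped module. The prefix `model.` and the parameter names below are hypothetical placeholders, since the actual names depend on how the training module was written.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Strings stand in for tensors; the key names are illustrative only.
lightning_state = {
    "model.tok_emb.weight": "tensor-0",
    "model.trf_blocks.0.att.W_query.weight": "tensor-1",
}
raw_state = strip_prefix(lightning_state)
print(sorted(raw_state))
# ['tok_emb.weight', 'trf_blocks.0.att.W_query.weight']
```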
---
## Citation
If you use this model, please also cite the resources it was built from:
```bibtex
@book{raschka2024llms,
  title     = {Build a Large Language Model (From Scratch)},
  author    = {Sebastian Raschka},
  year      = {2024},
  publisher = {Manning Publications}
}
@misc{alpaca,
  title  = {Stanford Alpaca: An Instruction-following LLaMA model},
  author = {Taori et al.},
  year   = {2023},
  url    = {https://github.com/tatsu-lab/stanford_alpaca}
}
```
---
## Author
**Snehangshu Bhuin**, Data Scientist
GitHub: [snehangshu2002](https://github.com/snehangshu2002)
Built as part of ongoing LLM learning and portfolio development.