---
language: en
license: mit
base_model: openai-community/gpt2-medium
tags:
  - gpt2
  - instruction-tuning
  - text-generation
  - alpaca
  - pytorch-lightning
  - causal-lm
datasets:
  - yahma/alpaca-cleaned
pipeline_tag: text-generation
model_name: gpt2-medium-instruct
---

# GPT-2 Medium Instruct

A **355M-parameter GPT-2 Medium** model fine-tuned on the `yahma/alpaca-cleaned` instruction dataset. The model code and training pipeline were written from scratch in PyTorch Lightning; the starting weights are the pretrained `openai-community/gpt2-medium` checkpoint.

---

## Model Details

| Property | Value |
|---|---|
| **Base model** | `openai-community/gpt2-medium` |
| **Parameters** | ~355M |
| **Architecture** | GPT-2 (decoder-only transformer) |
| **Fine-tuning dataset** | `yahma/alpaca-cleaned` (10,000 training samples) |
| **Context length** | 1,024 tokens |
| **Vocabulary size** | 50,257 tokens |
| **Embedding dim** | 1,024 |
| **Transformer layers** | 24 |
| **Attention heads** | 16 |
| **Tokenizer** | GPT-2 BPE (`tiktoken` during training, HF `GPT2Tokenizer` for inference) |

---

## Training Details

### Dataset

The model was fine-tuned on the [`yahma/alpaca-cleaned`](https://huggingface.co/datasets/yahma/alpaca-cleaned) dataset, a cleaned version of Stanford Alpaca's 52K instruction-following examples, which were generated with `text-davinci-003`.

| Split | Samples |
|---|---|
| Train | 10,000 |
| Validation | 1,000 |
| Test | 1,000 |

### Prompt Format

The model uses the standard **Alpaca prompt template**:

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}   ← the whole "### Input:" section is omitted when the input is empty

### Response:
{output}
```

During training, the **instruction + input portion is masked** with `-100` in the targets so the loss is only computed on the response tokens. This is the standard technique to make the model learn *how to respond* rather than memorize the prompt structure.
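
The masking can be sketched as below. This is a minimal illustration, not the training code: it assumes the next-token layout common in custom PyTorch pipelines (inputs are the sequence without its last token, targets are the same sequence shifted left by one) and that an end-of-text token is appended to the response. The function name `build_example` and the toy token IDs are hypothetical. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, and `50256` is GPT-2's `<|endoftext|>` ID.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default
EOS_ID = 50256       # GPT-2 <|endoftext|> token ID


def build_example(prompt_ids, response_ids, eos_id=EOS_ID):
    """Return (inputs, targets) for next-token prediction with the prompt masked."""
    full = list(prompt_ids) + list(response_ids) + [eos_id]
    inputs = full[:-1]   # everything except the last token
    targets = full[1:]   # targets[i] is the token that follows inputs[i]
    # The first len(prompt_ids) - 1 targets are still prompt tokens,
    # so they are replaced with IGNORE_INDEX and excluded from the loss.
    n_masked = max(len(prompt_ids) - 1, 0)
    targets = [IGNORE_INDEX] * n_masked + targets[n_masked:]
    return inputs, targets


inputs, targets = build_example(prompt_ids=[10, 11, 12], response_ids=[20, 21])
print(inputs)   # [10, 11, 12, 20, 21]
print(targets)  # [-100, -100, 20, 21, 50256]
```

Only the last three target positions contribute to the loss here: the two response tokens and the end-of-text token.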

### Optimizer

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | `3e-5` |
| Weight decay | `0.1` |
| Beta1 / Beta2 | `0.9` / `0.95` |
| Gradient clip | `1.0` |

### Training Config

| Setting | Value |
|---|---|
| Framework | PyTorch Lightning |
| Epochs | 2 (+ 1 continuation epoch) |
| Batch size (per device) | 2 |
| Gradient accumulation steps | 4 |
| Effective batch size | 8 (2 × 4) |
| Precision | `16-mixed` (FP16 + FP32) |
| Hardware | Single GPU (Colab) |
| Early stopping patience | 3 validation checks |
| Checkpoint metric | `val_loss_eval` (minimize) |

---

## Usage

### Basic Inference

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

model_id = "snehangshu511/gpt2-medium-instruct"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model     = GPT2LMHeadModel.from_pretrained(model_id)
model.eval()

def build_prompt(instruction, input_text=""):
    base = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
    )
    if input_text.strip():
        return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    return f"{base}### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_prompt("Explain what machine learning is in simple terms.")
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (strip the prompt)
input_len = inputs["input_ids"].shape[1]
response  = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True)
print(response)
```

### With Optional Input Context

```python
prompt = build_prompt(
    instruction="Summarize the following text.",
    input_text="The Industrial Revolution began in Britain in the 18th century..."
)
```

### Recommended Generation Settings

| Setting | Recommended range | Effect |
|---|---|---|
| `temperature` | 0.6 – 0.9 | Higher = more creative, lower = more deterministic |
| `top_p` | 0.85 – 0.95 | Nucleus sampling: samples only from the smallest set of tokens whose cumulative probability reaches `top_p` |
| `top_k` | 40 – 60 | Restricts sampling to the K most likely tokens at each step |
| `repetition_penalty` | 1.1 – 1.3 | Higher = less repetition in output |
| `max_new_tokens` | 100 – 300 | Prompt tokens plus `max_new_tokens` must not exceed the 1,024-token context window |
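
The context-window constraint on `max_new_tokens` can be checked before calling `generate`. The helper below is a hypothetical sketch (the name `clamp_new_tokens` is not part of this repository); it only encodes the rule that prompt tokens and generated tokens share the 1,024-token window.

```python
CONTEXT_LENGTH = 1024  # GPT-2 context window, from the Model Details table


def clamp_new_tokens(prompt_len, requested_new_tokens, context_length=CONTEXT_LENGTH):
    """Largest max_new_tokens that keeps prompt + generation inside the window."""
    if prompt_len >= context_length:
        raise ValueError("prompt alone fills the context window; shorten it first")
    return min(requested_new_tokens, context_length - prompt_len)


print(clamp_new_tokens(prompt_len=50, requested_new_tokens=200))   # 200
print(clamp_new_tokens(prompt_len=900, requested_new_tokens=200))  # 124
```

In the inference example above, `prompt_len` would be `inputs["input_ids"].shape[1]`.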

---

## Architecture Notes

The architecture was implemented **from scratch** as a custom `GPTModel` class (no `AutoModel` during training) and initialized with the pretrained GPT-2 Medium weights. After fine-tuning, the weights were converted from the custom format to the HF-compatible `GPT2LMHeadModel` format for this Hub upload.

Key architectural decisions:

- **Weight tying disabled** (`tie_word_embeddings=False`): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, `lm_head.weight` was explicitly cloned to avoid shared-memory issues with `safetensors`. The config reflects this.

- **QKV separation**: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard `c_attn` format that `GPT2LMHeadModel` expects.

- **Drop rate = 0.0**: Dropout is disabled during fine-tuning, a common choice for short fine-tuning runs that start from pretrained weights.
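
The QKV re-fusing step can be sketched as follows. This is an illustration under assumptions, not the conversion script used for this model: `fuse_qkv` and the variable names are hypothetical, and the separate projections are assumed to be `torch.nn.Linear` layers with biases. HF's GPT-2 stores `c_attn.weight` with shape `(in_features, 3 * out_features)`, the transpose of the `nn.Linear` layout, so each matrix is transposed before concatenation.

```python
import torch
from torch import nn


def fuse_qkv(w_query: nn.Linear, w_key: nn.Linear, w_value: nn.Linear):
    """Fuse three Linear projections into c_attn-style (weight, bias) tensors."""
    # nn.Linear stores weight as (out, in); c_attn expects (in, 3 * out).
    weight = torch.cat(
        [w_query.weight.T, w_key.weight.T, w_value.weight.T], dim=1
    )
    bias = torch.cat([w_query.bias, w_key.bias, w_value.bias], dim=0)
    # Detach and clone so the fused tensors own their memory.
    return weight.detach().clone(), bias.detach().clone()


torch.manual_seed(0)
emb_dim = 8  # toy size; GPT-2 Medium uses 1,024
q, k, v = (nn.Linear(emb_dim, emb_dim) for _ in range(3))
weight, bias = fuse_qkv(q, k, v)

# A c_attn-style layer computes x @ weight + bias, then splits into Q, K, V.
x = torch.randn(2, 5, emb_dim)
fused_out = x @ weight + bias
q_out, k_out, v_out = fused_out.split(emb_dim, dim=-1)
print(weight.shape, torch.allclose(q_out, q(x), atol=1e-5))
```

The final check confirms that the first third of the fused output matches the separate query projection, which is the property the conversion has to preserve.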

---

## Files in This Repository

| File | Description |
|---|---|
| `model.safetensors` | Model weights in safetensors format (recommended) |
| `pytorch_model.bin` | Model weights in legacy `.bin` format |
| `config.json` | GPT2Config — model architecture definition |
| `generation_config.json` | Default generation settings |
| `tokenizer.json` | Fast tokenizer file |
| `tokenizer_config.json` | Tokenizer configuration |
| `checkpoints/model.ckpt` | Original PyTorch Lightning training checkpoint |

---

## Limitations

- **Small training subset**: Only 10,000 of the available ~52,000 Alpaca samples were used. A full dataset run would likely yield noticeably better results.
- **GPT-2 base**: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks.
- **No RLHF**: The model is instruction-tuned via supervised fine-tuning only — no reinforcement learning from human feedback. It may produce responses that are grammatically correct but factually wrong.
- **Context length**: Hard-limited to 1,024 tokens, shared between the prompt and the generated text. Longer prompts must be truncated before generation.
- **No safety alignment**: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures.

---

## Training Pipeline Summary

```
yahma/alpaca-cleaned (52K rows)
        ↓ load 10K rows
Alpaca prompt formatting
        ↓
tiktoken BPE tokenization
        ↓ -100 masking on prompt tokens
Custom PyTorch Dataset + DataLoader (dynamic padding)
        ↓
GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium
        ↓
PyTorch Lightning fine-tuning
  - AdamW, lr=3e-5, 2 epochs
  - FP16 mixed precision
  - Gradient accumulation (eff. batch = 8)
  - Checkpoint on best val_loss_eval
        ↓
Lightning prefix stripped → raw GPTModel state dict
        ↓
Custom → HF format conversion (QKV fusing, key renaming)
        ↓
Saved as model.safetensors + pytorch_model.bin
        ↓
Pushed to snehangshu511/gpt2-medium-instruct
```
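
The "Lightning prefix stripped" step can be sketched as below. PyTorch Lightning checkpoints keep parameters under a `state_dict` entry, with each key prefixed by the attribute name the wrapped model has inside the `LightningModule`. The prefix `model.` and the parameter names in this example are assumptions for illustration, and `strip_prefix` is a hypothetical helper, not the script used for this repository.

```python
def strip_prefix(state_dict, prefix="model."):
    """Return a copy of state_dict with `prefix` removed from keys that have it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy stand-in for checkpoint["state_dict"]; real values would be tensors.
lightning_state = {
    "model.tok_emb.weight": "tensor-0",
    "model.trf_blocks.0.att.W_query.weight": "tensor-1",
}
print(list(strip_prefix(lightning_state)))
# ['tok_emb.weight', 'trf_blocks.0.att.W_query.weight']
```

The resulting dictionary can then be loaded into the bare model class before the HF format conversion.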

---

## Citation

If you use this model, please also cite the resources it was built from:

```bibtex
@book{raschka2024llms,
  title     = {Build a Large Language Model (From Scratch)},
  author    = {Sebastian Raschka},
  year      = {2024},
  publisher = {Manning Publications}
}

@misc{alpaca,
  title  = {Stanford Alpaca: An Instruction-following LLaMA model},
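  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto},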
  year   = {2023},
  url    = {https://github.com/tatsu-lab/stanford_alpaca}
}
```

---

## Author

**Snehangshu Bhuin** — Data Scientist  
GitHub: [snehangshu2002](https://github.com/snehangshu2002)  
Built as part of ongoing LLM learning and portfolio development.