--- language: en license: mit base_model: openai-community/gpt2-medium tags: - gpt2 - instruction-tuning - text-generation - alpaca - pytorch-lightning - causal-lm datasets: - yahma/alpaca-cleaned pipeline_tag: text-generation model_name: gpt2-medium-instruct --- # GPT-2 Medium Instruct A **355M parameter GPT-2 Medium** model fine-tuned from scratch on the `yahma/alpaca-cleaned` instruction dataset, with a full custom training pipeline in PyTorch Lightning. --- ## Model Details | Property | Value | |---|---| | **Base model** | `openai-community/gpt2-medium` | | **Parameters** | ~355M | | **Architecture** | GPT-2 (decoder-only transformer) | | **Fine-tuning dataset** | `yahma/alpaca-cleaned` (10,000 training samples) | | **Context length** | 1,024 tokens | | **Vocabulary size** | 50,257 tokens | | **Embedding dim** | 1,024 | | **Transformer layers** | 24 | | **Attention heads** | 16 | | **Tokenizer** | GPT-2 BPE (via `tiktoken` / HF `GPT2Tokenizer`) | --- ## Training Details ### Dataset The model was fine-tuned on the [`yahma/alpaca-cleaned`](https://huggingface.co/datasets/yahma/alpaca-cleaned) dataset — a cleaned version of Stanford Alpaca's 52K instruction-following data generated from `text-davinci-003`. | Split | Samples | |---|---| | Train | 10,000 | | Validation | 1,000 | | Test | 1,000 | ### Prompt Format The model uses the standard **Alpaca prompt template**: ``` Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: {instruction} ### Input: {input} ← omitted if empty ### Response: {output} ``` During training, the **instruction + input portion is masked** with `-100` in the targets so the loss is only computed on the response tokens. This is the standard technique to make the model learn *how to respond* rather than memorize the prompt structure. ### Optimizer | Hyperparameter | Value | |---|---| | Optimizer | AdamW | | Learning rate | `3e-5` | | Weight decay | `0.1` | | Beta1 / Beta2 | `0.9` / `0.95` | | Gradient clip | `1.0` | ### Training Config | Setting | Value | |---|---| | Framework | PyTorch Lightning | | Epochs | 2 (+ 1 continuation epoch) | | Batch size (per device) | 2 | | Gradient accumulation steps | 4 | | Effective batch size | 8 | | Precision | `16-mixed` (FP16 + FP32) | | Hardware | Single GPU (Colab) | | Early stopping patience | 3 validation checks | | Checkpoint metric | `val_loss_eval` (minimize) | --- ## Usage ### Basic Inference ```python from transformers import GPT2LMHeadModel, GPT2Tokenizer import torch model_id = "snehangshu511/gpt2-medium-instruct" tokenizer = GPT2Tokenizer.from_pretrained(model_id) model = GPT2LMHeadModel.from_pretrained(model_id) model.eval() def build_prompt(instruction, input_text=""): base = ( "Below is an instruction that describes a task. " "Write a response that appropriately completes the request.\n\n" ) if input_text.strip(): return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n" return f"{base}### Instruction:\n{instruction}\n\n### Response:\n" prompt = build_prompt("Explain what machine learning is in simple terms.") inputs = tokenizer(prompt, return_tensors="pt") with torch.no_grad(): output_ids = model.generate( **inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9, top_k=50, repetition_penalty=1.2, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id, ) # Decode only the newly generated tokens (strip the prompt) input_len = inputs["input_ids"].shape[1] response = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True) print(response) ``` ### With Optional Input Context ```python prompt = build_prompt( instruction="Summarize the following text.", input_text="The Industrial Revolution began in Britain in the 18th century..." ) ``` ### Recommended Generation Settings | Setting | Recommended range | Effect | |---|---|---| | `temperature` | 0.6 – 0.9 | Higher = more creative, lower = more deterministic | | `top_p` | 0.85 – 0.95 | Nucleus sampling — limits token pool to top P% probability mass | | `top_k` | 40 – 60 | Hard limits candidate tokens to top K at each step | | `repetition_penalty` | 1.1 – 1.3 | Higher = less repetition in output | | `max_new_tokens` | 100 – 300 | Keep under 800 to stay within the 1024 context window | --- ## Architecture Notes This model was built **from scratch** using a custom `GPTModel` class (no `AutoModel` during training). The weights were converted from the custom format to HF-compatible `GPT2LMHeadModel` format for this Hub upload. Key architectural decisions: - **Weight tying disabled** (`tie_word_embeddings=False`): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, `lm_head.weight` was explicitly cloned to avoid shared-memory issues with `safetensors`. The config reflects this. - **QKV separation**: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard `c_attn` format that `GPT2LMHeadModel` expects. - **Drop rate = 0.0**: Dropout is disabled during fine-tuning, which is standard practice when working with pretrained models on relatively small datasets. --- ## Files in This Repository | File | Description | |---|---| | `model.safetensors` | Model weights in safetensors format (recommended) | | `pytorch_model.bin` | Model weights in legacy `.bin` format | | `config.json` | GPT2Config — model architecture definition | | `generation_config.json` | Default generation settings | | `tokenizer.json` | Fast tokenizer file | | `tokenizer_config.json` | Tokenizer configuration | | `checkpoints/model.ckpt` | Original PyTorch Lightning training checkpoint | --- ## Limitations - **Small training subset**: Only 10,000 of the available ~52,000 Alpaca samples were used. A full dataset run would likely yield noticeably better results. - **GPT-2 base**: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks. - **No RLHF**: The model is instruction-tuned via supervised fine-tuning only — no reinforcement learning from human feedback. It may produce responses that are grammatically correct but factually wrong. - **Context length**: Hard-limited to 1,024 tokens. Long prompts can get truncated. - **No safety alignment**: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures. --- ## Training Pipeline Summary ``` yahma/alpaca-cleaned (52K rows) ↓ load 10K rows Alpaca prompt formatting ↓ tiktoken BPE tokenization ↓ -100 masking on prompt tokens Custom PyTorch Dataset + DataLoader (dynamic padding) ↓ GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium ↓ PyTorch Lightning fine-tuning - AdamW, lr=3e-5, 2 epochs - FP16 mixed precision - Gradient accumulation (eff. batch = 8) - Checkpoint on best val_loss ↓ Lightning prefix stripped → raw GPTModel state dict ↓ Custom → HF format conversion (QKV fusing, key renaming) ↓ Saved as model.safetensors + pytorch_model.bin ↓ Pushed to snehangshu511/gpt2-medium-instruct ``` --- ## Citation If you use this model, please also cite the resources it was built from: ```bibtex @book{raschka2024llms, title = {Build a Large Language Model (From Scratch)}, author = {Sebastian Raschka}, year = {2024}, publisher = {Manning Publications} } @misc{alpaca, title = {Stanford Alpaca: An Instruction-following LLaMA model}, author = {Taori et al.}, year = {2023}, url = {https://github.com/tatsu-lab/stanford_alpaca} } ``` --- ## Author **Snehangshu Bhuin** — Data Scientist GitHub: [snehangshu2002](https://github.com/snehangshu2002) Built as part of ongoing LLM learning and portfolio development.