| --- |
| language: en |
| license: mit |
| base_model: openai-community/gpt2-medium |
| tags: |
| - gpt2 |
| - instruction-tuning |
| - text-generation |
| - alpaca |
| - pytorch-lightning |
| - causal-lm |
| datasets: |
| - yahma/alpaca-cleaned |
| pipeline_tag: text-generation |
| model_name: gpt2-medium-instruct |
| --- |
| |
| # GPT-2 Medium Instruct |
|
|
| A **355M parameter GPT-2 Medium** model fine-tuned from scratch on the `yahma/alpaca-cleaned` instruction dataset, with a full custom training pipeline in PyTorch Lightning. |
| --- |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |---|---| |
| | **Base model** | `openai-community/gpt2-medium` | |
| | **Parameters** | ~355M | |
| | **Architecture** | GPT-2 (decoder-only transformer) | |
| | **Fine-tuning dataset** | `yahma/alpaca-cleaned` (10,000 training samples) | |
| | **Context length** | 1,024 tokens | |
| | **Vocabulary size** | 50,257 tokens | |
| | **Embedding dim** | 1,024 | |
| | **Transformer layers** | 24 | |
| | **Attention heads** | 16 | |
| | **Tokenizer** | GPT-2 BPE (via `tiktoken` / HF `GPT2Tokenizer`) | |
|
|
| --- |
|
|
| ## Training Details |
|
|
| ### Dataset |
|
|
| The model was fine-tuned on the [`yahma/alpaca-cleaned`](https://huggingface.co/datasets/yahma/alpaca-cleaned) dataset β a cleaned version of Stanford Alpaca's 52K instruction-following data generated from `text-davinci-003`. |
|
|
| | Split | Samples | |
| |---|---| |
| | Train | 10,000 | |
| | Validation | 1,000 | |
| | Test | 1,000 | |
|
|
| ### Prompt Format |
|
|
| The model uses the standard **Alpaca prompt template**: |
|
|
| ``` |
| Below is an instruction that describes a task. Write a response that appropriately completes the request. |
| |
| ### Instruction: |
| {instruction} |
| |
| ### Input: |
| {input} β omitted if empty |
| |
| ### Response: |
| {output} |
| ``` |
|
|
| During training, the **instruction + input portion is masked** with `-100` in the targets so the loss is only computed on the response tokens. This is the standard technique to make the model learn *how to respond* rather than memorize the prompt structure. |
|
|
| ### Optimizer |
|
|
| | Hyperparameter | Value | |
| |---|---| |
| | Optimizer | AdamW | |
| | Learning rate | `3e-5` | |
| | Weight decay | `0.1` | |
| | Beta1 / Beta2 | `0.9` / `0.95` | |
| | Gradient clip | `1.0` | |
|
|
| ### Training Config |
|
|
| | Setting | Value | |
| |---|---| |
| | Framework | PyTorch Lightning | |
| | Epochs | 2 (+ 1 continuation epoch) | |
| | Batch size (per device) | 2 | |
| | Gradient accumulation steps | 4 | |
| | Effective batch size | 8 | |
| | Precision | `16-mixed` (FP16 + FP32) | |
| | Hardware | Single GPU (Colab) | |
| | Early stopping patience | 3 validation checks | |
| | Checkpoint metric | `val_loss_eval` (minimize) | |
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Basic Inference |
|
|
| ```python |
| from transformers import GPT2LMHeadModel, GPT2Tokenizer |
| import torch |
| |
| model_id = "snehangshu511/gpt2-medium-instruct" |
| |
| tokenizer = GPT2Tokenizer.from_pretrained(model_id) |
| model = GPT2LMHeadModel.from_pretrained(model_id) |
| model.eval() |
| |
| def build_prompt(instruction, input_text=""): |
| base = ( |
| "Below is an instruction that describes a task. " |
| "Write a response that appropriately completes the request.\n\n" |
| ) |
| if input_text.strip(): |
| return f"{base}### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n" |
| return f"{base}### Instruction:\n{instruction}\n\n### Response:\n" |
| |
| prompt = build_prompt("Explain what machine learning is in simple terms.") |
| inputs = tokenizer(prompt, return_tensors="pt") |
| |
| with torch.no_grad(): |
| output_ids = model.generate( |
| **inputs, |
| max_new_tokens=200, |
| do_sample=True, |
| temperature=0.7, |
| top_p=0.9, |
| top_k=50, |
| repetition_penalty=1.2, |
| pad_token_id=tokenizer.eos_token_id, |
| eos_token_id=tokenizer.eos_token_id, |
| ) |
| |
| # Decode only the newly generated tokens (strip the prompt) |
| input_len = inputs["input_ids"].shape[1] |
| response = tokenizer.decode(output_ids[0][input_len:], skip_special_tokens=True) |
| print(response) |
| ``` |
|
|
| ### With Optional Input Context |
|
|
| ```python |
| prompt = build_prompt( |
| instruction="Summarize the following text.", |
| input_text="The Industrial Revolution began in Britain in the 18th century..." |
| ) |
| ``` |
|
|
| ### Recommended Generation Settings |
|
|
| | Setting | Recommended range | Effect | |
| |---|---|---| |
| | `temperature` | 0.6 β 0.9 | Higher = more creative, lower = more deterministic | |
| | `top_p` | 0.85 β 0.95 | Nucleus sampling β limits token pool to top P% probability mass | |
| | `top_k` | 40 β 60 | Hard limits candidate tokens to top K at each step | |
| | `repetition_penalty` | 1.1 β 1.3 | Higher = less repetition in output | |
| | `max_new_tokens` | 100 β 300 | Keep under 800 to stay within the 1024 context window | |
|
|
| --- |
|
|
| ## Architecture Notes |
|
|
| This model was built **from scratch** using a custom `GPTModel` class (no `AutoModel` during training). The weights were converted from the custom format to HF-compatible `GPT2LMHeadModel` format for this Hub upload. |
|
|
| Key architectural decisions: |
|
|
| - **Weight tying disabled** (`tie_word_embeddings=False`): In standard GPT-2, the output head shares weights with the embedding layer. During conversion, `lm_head.weight` was explicitly cloned to avoid shared-memory issues with `safetensors`. The config reflects this. |
|
|
| - **QKV separation**: The custom training model stores Q, K, V as separate linear layers. During HF conversion, they are re-fused into the standard `c_attn` format that `GPT2LMHeadModel` expects. |
|
|
| - **Drop rate = 0.0**: Dropout is disabled during fine-tuning, which is standard practice when working with pretrained models on relatively small datasets. |
|
|
| --- |
|
|
| ## Files in This Repository |
|
|
| | File | Description | |
| |---|---| |
| | `model.safetensors` | Model weights in safetensors format (recommended) | |
| | `pytorch_model.bin` | Model weights in legacy `.bin` format | |
| | `config.json` | GPT2Config β model architecture definition | |
| | `generation_config.json` | Default generation settings | |
| | `tokenizer.json` | Fast tokenizer file | |
| | `tokenizer_config.json` | Tokenizer configuration | |
| | `checkpoints/model.ckpt` | Original PyTorch Lightning training checkpoint | |
|
|
| --- |
|
|
| ## Limitations |
|
|
| - **Small training subset**: Only 10,000 of the available ~52,000 Alpaca samples were used. A full dataset run would likely yield noticeably better results. |
| - **GPT-2 base**: GPT-2 Medium, while a solid model, is much smaller than modern instruction-tuned LLMs. Responses can be inconsistent or drift from the prompt on complex tasks. |
| - **No RLHF**: The model is instruction-tuned via supervised fine-tuning only β no reinforcement learning from human feedback. It may produce responses that are grammatically correct but factually wrong. |
| - **Context length**: Hard-limited to 1,024 tokens. Long prompts can get truncated. |
| - **No safety alignment**: There is no safety filtering or RLHF alignment. Do not deploy in production without additional safety measures. |
|
|
| --- |
|
|
| ## Training Pipeline Summary |
|
|
| ``` |
| yahma/alpaca-cleaned (52K rows) |
| β load 10K rows |
| Alpaca prompt formatting |
| β |
| tiktoken BPE tokenization |
| β -100 masking on prompt tokens |
| Custom PyTorch Dataset + DataLoader (dynamic padding) |
| β |
| GPT-2 Medium pretrained weights loaded from openai-community/gpt2-medium |
| β |
| PyTorch Lightning fine-tuning |
| - AdamW, lr=3e-5, 2 epochs |
| - FP16 mixed precision |
| - Gradient accumulation (eff. batch = 8) |
| - Checkpoint on best val_loss |
| β |
| Lightning prefix stripped β raw GPTModel state dict |
| β |
| Custom β HF format conversion (QKV fusing, key renaming) |
| β |
| Saved as model.safetensors + pytorch_model.bin |
| β |
| Pushed to snehangshu511/gpt2-medium-instruct |
| ``` |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model, please also cite the resources it was built from: |
|
|
| ```bibtex |
| @book{raschka2024llms, |
| title = {Build a Large Language Model (From Scratch)}, |
| author = {Sebastian Raschka}, |
| year = {2024}, |
| publisher = {Manning Publications} |
| } |
| |
| @misc{alpaca, |
| title = {Stanford Alpaca: An Instruction-following LLaMA model}, |
| author = {Taori et al.}, |
| year = {2023}, |
| url = {https://github.com/tatsu-lab/stanford_alpaca} |
| } |
| ``` |
|
|
| --- |
|
|
| ## Author |
|
|
| **Snehangshu Bhuin** β Data Scientist |
| GitHub: [snehangshu2002](https://github.com/snehangshu2002) |
| Built as part of ongoing LLM learning and portfolio development. |