---
library_name: transformers
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
tags:
- GPT
- GPT-3 Small
- GPT-3 Medium
- GPT-3 Large
- GPT-3 XL
- GPT-3 2.7B
- GPT-3 6.7B
- GPT-3 13B
- GPT-3 175B
- GPT-3
- GPT-2
- GPT-2 124M
- transformers
- mit
- HuggingFace
- fineweb-edu
- Decoder-Only
---
# Model Card for GPT-124M
## Overview
GPT-124M is a decoder-only transformer model based on OpenAI’s GPT-2 architecture. It is trained for text generation and other natural language processing (NLP) tasks. The model is designed for general-purpose language modeling, making it useful for applications such as text completion.
- **Library:** 🤗 `transformers`
- **License:** MIT
- **Datasets:** `HuggingFaceFW/fineweb-edu`
- **Language:** English
- **Base Model:** `openai-community/gpt2`
- **Pipeline Tag:** `text-generation`
- **Developer:** Samkeet Sangai
- **Funded By:** Samkeet Sangai
- **Shared By:** Samkeet Sangai
- **Model Type:** GPT Decoder-Only
## Model Sources
- **Paper:** [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- **Paper:** [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165)
- **Paper:** [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556)
- **Video:** [Andrej Karpathy-Let's reproduce GPT-2 (124M)](https://youtu.be/l8pRSuU81PU?si=KAo1y9dHYQAGJmj5)
- **Demo:** [GPT 124M Demo](https://huggingface.co/spaces/samkeet/GPT_124M)
- **GitHub:** [SamkeetSangai/GPT_124M](https://github.com/SamkeetSangai/GPT_124M)
## Model Details
### Model Description
GPT-124M is a lightweight generative language model trained on the `fineweb-edu` dataset. It can generate coherent and contextually relevant text, but it is not tuned for instruction following, safety, or factual accuracy.
### Training Configuration
- **Block Size:** `1024`
- **Vocabulary Size:** `50304`
- **Number of Layers:** `12`
- **Number of Attention Heads:** `12`
- **Embedding Size:** `768`
- **Hardware:** `8x NVIDIA RTX 4090 GPUs`
- **Training Duration:** `13 hours`
- **Dataset:** `fineweb-edu` (10 billion tokens)
- **Training Date:** `January 2025`
- **Validation Dataset:** 100 million tokens of HuggingFaceFW/fineweb-edu
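
As a rough sanity check on the "124M" in the model name, the parameter count can be estimated from the configuration above. The sketch below assumes the standard GPT-2 layout (learned position embeddings, a 4x-wide MLP, biases on all linear layers, and output weights tied to the token embedding). These layout details are assumptions rather than something this card confirms, and the function name is illustrative.

```python
def gpt2_param_count(vocab_size, block_size, n_layer, n_embd):
    """Estimate the parameter count of a GPT-2 style decoder (assumed layout)."""
    wte = vocab_size * n_embd   # token embedding, assumed tied with the output head
    wpe = block_size * n_embd   # learned position embedding
    ln = 2 * n_embd             # LayerNorm weight + bias
    # Attention: fused QKV projection plus output projection, both with biases
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back, both with biases
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_block = ln + attn + ln + mlp
    return wte + wpe + n_layer * per_block + ln  # trailing ln is the final LayerNorm

total = gpt2_param_count(vocab_size=50304, block_size=1024, n_layer=12, n_embd=768)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate lands at roughly 124 million parameters, which matches the model name.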
## Usage
You can use this model for text generation using the `transformers` library.
### Method 1: Using Pipeline
```python
# Import the pipeline helper and tokenizer class from transformers
from transformers import pipeline, AutoTokenizer

# Load tokenizer
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Create text generation pipeline (runs on CPU)
pipe = pipeline("text-generation", model=model_name, tokenizer=tokenizer, trust_remote_code=True, device="cpu")

# Generate text
result = pipe("Earth revolves around the", do_sample=True, max_length=40, temperature=0.9, top_p=0.5, top_k=50)
print("Pipeline Output:", result)
```
### Method 2: Direct Generation
```python
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Function for direct tokenization and text generation
def generate_text(input_text, device='cpu'):
    tokens = tokenizer.encode(input_text, return_tensors='pt').to(device)
    model.to(device)
    # Generate output without tracking gradients
    with torch.no_grad():
        output = model.generate(
            tokens,
            do_sample=True,
            max_length=40,
            temperature=0.9,
            top_p=0.5,
            top_k=50,
        )
    # Decode the first (and only) generated sequence in the batch
    generated_sentence = tokenizer.decode(output[0])
    return generated_sentence

# Generate text
input_text = "Earth revolves around the"
print("Direct Output:", generate_text(input_text))
```
### Fine-tuning & Downstream Use
This model can be fine-tuned for specific NLP applications like:
- Dialogue generation
- Text summarization
- Creative writing
- Code generation
## Limitations & Risks
### Out-of-Scope Use
- The model is **not instruction-tuned** for safety, ethics, or factual accuracy.
- It may produce **biased, misleading, or unsafe outputs**.
- It should **not** be used for tasks requiring high reliability, such as medical, legal, or financial applications.
### Bias, Risks, and Limitations
- The dataset may contain biases present in public web data.
- The model does not filter or detect offensive content.
- The model may **hallucinate** incorrect facts.
### Recommendations
- Always **verify** generated content before use.
- Implement **content filtering mechanisms** for deployment.
- Use in supervised environments only.
## Evaluation
### Training & Validation Loss
Validation was conducted using `100 million tokens` from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss graph indicates a stable convergence with minimal overfitting. The training loss achieved a minimum value of 2.88, while the validation loss stabilized at 2.97.
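
If the reported losses are mean per-token cross-entropy values in nats (an assumption, since the card does not state the unit), they can be converted to perplexity by exponentiation. A minimal sketch of that conversion, with an illustrative helper name:

```python
import math

def perplexity(cross_entropy_nats):
    """Convert a mean per-token cross-entropy (assumed to be in nats) to perplexity."""
    return math.exp(cross_entropy_nats)

# Loss values taken from the paragraph above
train_ppl = perplexity(2.88)
val_ppl = perplexity(2.97)
print(f"train perplexity ~ {train_ppl:.1f}, validation perplexity ~ {val_ppl:.1f}")
```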

### Results
The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both roughly 124M parameters). Despite being trained on only `10 billion tokens`, compared to GPT-3 Small's `300 billion tokens`, GPT-124M outperformed both models on the `HellaSwag` evaluation. This advantage is likely attributable to the specialized training data (educational content), in contrast to GPT-3 Small’s broader multilingual and multi-domain training data.
According to the Chinchilla scaling laws, the compute-optimal ratio is roughly 20 training tokens per parameter, so a 124M-parameter model would ideally be trained on about `2.48 billion tokens`. The additional training tokens used for GPT-3 Small might have led to diminishing returns in performance.
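
The `2.48 billion` figure can be reproduced with the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter. The sketch below treats that ratio as an assumption (the fitted optimum varies with the compute budget), and the helper name is illustrative.

```python
TOKENS_PER_PARAM = 20  # assumed rule-of-thumb ratio, not an exact constant

def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal training tokens for a given parameter count."""
    return n_params * tokens_per_param

n_params = 124_000_000
optimal = compute_optimal_tokens(n_params)

# Ratio actually used for this model: 10 billion training tokens
actual_ratio = 10_000_000_000 / n_params

print(f"compute-optimal tokens: {optimal / 1e9:.2f}B")
print(f"tokens per parameter used here: {actual_ratio:.1f}")
```

By this estimate, the 10 billion tokens used here already exceed the compute-optimal budget by about a factor of four.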

### Key Insights from Evaluation
- **Efficient Training:** The model performs well relative to its training token count, suggesting efficient use of the training data. Training was parallelized across GPUs with the Distributed Data Parallel (DDP) technique.
- **Data-Specific Advantage:** Training exclusively on educational data may have given GPT-124M an edge in evaluation metrics like `HellaSwag`.
- **Scaling Considerations:** GPT-3 Small, despite being trained on 300B tokens, does not exhibit proportionally better performance, which is consistent with diminishing returns from additional tokens at this model size.
## Environmental Impact
- **Hardware Used:** `8x NVIDIA RTX 4090 GPUs`
- **Training Time:** `13 hours -> 104 GPU hours`
- **Estimated Carbon Emissions:** `13.48 kg CO2 eq.`
- **Equivalent to:**
  - `54.5 km` driven by an average ICE car
  - `6.75 kg` of coal burned
  - `0.22` tree seedlings sequestering carbon for 10 years
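
The GPU-hour figure follows directly from the hardware count and the wall-clock time. The sketch below reproduces it and derives the emission rate per GPU-hour implied by the numbers above. The card does not state the power draw or grid carbon intensity behind the estimate, so the derived rate is only a back-calculation, and the helper names are illustrative.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total GPU-hours for a run that keeps all GPUs busy for the whole duration."""
    return num_gpus * wall_clock_hours

def implied_kg_co2_per_gpu_hour(total_kg_co2, total_gpu_hours):
    """Back out the per-GPU-hour emission rate implied by a reported total."""
    return total_kg_co2 / total_gpu_hours

hours = gpu_hours(num_gpus=8, wall_clock_hours=13)
rate = implied_kg_co2_per_gpu_hour(total_kg_co2=13.48, total_gpu_hours=hours)
print(f"{hours} GPU-hours, ~{rate:.3f} kg CO2 eq. per GPU-hour")
```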
## Technical Specifications
### Model Architecture
GPT-124M follows the architecture of OpenAI's GPT-2, which consists of:
- **Transformer-based decoder model**
- **Self-attention mechanism**
- **Layer normalization & feed-forward networks**
### Compute Infrastructure
- **Hardware:** 8x NVIDIA RTX 4090 GPUs
- **Software:** PyTorch, Hugging Face Transformers
- **Precision:** FP32
## Instruction-Tuned Model
### Training Data
The instruction-tuned GPT-124M is fine-tuned on the **`tatsu-lab/alpaca`** dataset, containing high-quality instruction–response pairs across reasoning, explanation, summarization, and creative tasks. Samples are **length-filtered** to fit the 1024-token context window, counting instruction, input, response, and EOS tokens.
### Prompt & Objective
Training follows an Alpaca-style format:
```
### Instruction:
<instruction and optional input>
### Response:
<target output>
```
Causal language modeling is used, with **loss applied only to response tokens** (prompt tokens masked with `-100`) and an explicit EOS token appended.
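
The masking and length filtering described above can be sketched on plain lists of token IDs. This is a minimal illustration, not the project's actual preprocessing code: the function name, the toy token IDs, and the EOS ID `50256` (the GPT-2 tokenizer's end-of-text ID) are assumptions.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_example(prompt_ids, response_ids, eos_id, max_len=1024):
    """Build input_ids and labels so that loss is computed on the response only.

    Returns None when the full sequence (prompt + response + EOS) does not fit
    into the context window, which mimics the length filter.
    """
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    if len(input_ids) > max_len:
        return None
    # Prompt positions are masked; response tokens and EOS keep their IDs as targets
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}

# Toy IDs standing in for a tokenized prompt and response
example = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=50256)
print(example)
```

Keeping `input_ids` and `labels` the same length lets a causal LM shift them internally, while the `-100` entries drop out of the loss.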
### Training Setup
* **Platform:** Kaggle (GPU-backed notebooks)
* **Framework:** PyTorch
* **Precision:** FP32
* **Optimizer:** AdamW with warmup + cosine decay
* **Stability:** Gradient clipping and fixed-length batching
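
A warmup plus cosine decay schedule of the kind listed above can be written as a small function of the step index. This is a sketch under assumed hyperparameters: the learning rates and step counts are hypothetical, since the card does not list the values used.

```python
import math

def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp; step + 1 avoids a zero learning rate on the first step
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)

# Hypothetical values for illustration only
for step in (0, 9, 10, 55, 100):
    lr = lr_at_step(step, max_lr=1e-3, min_lr=1e-4, warmup_steps=10, total_steps=100)
    print(step, f"{lr:.6f}")
```

The returned value would typically be written into each optimizer parameter group before the optimizer step.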
### Fine-Tuning Choices
* Supports **full fine-tuning** and **LoRA-based parameter-efficient tuning**
* LoRA can be merged into base weights for a standalone instruct model
* Supervised fine-tuning (SFT) chosen for simplicity and reproducibility
* No RLHF or safety-specific tuning applied
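
Merging a LoRA adapter into the base weights amounts to adding the scaled low-rank product to the frozen matrix, `W' = W + (alpha / r) * B @ A` in the standard LoRA formulation. The sketch below shows this on a toy example in pure Python. The rank, `alpha`, and matrix values are illustrative, as the card does not specify the adapter configuration or the library used.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 weight with a rank-1 adapter (values are illustrative only)
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape (out_features, rank)
A = [[0.5, -0.5]]    # shape (rank, in_features)
merged = merge_lora(W, B, A, alpha=2.0, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is what makes a standalone instruct checkpoint possible.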
### Outcome
Instruction tuning improves instruction following, output structure, and task performance while preserving the base model’s generative capabilities. The model remains non-aligned and may hallucinate.
## Citation
If you use this model, please cite:
```bibtex
@article{gpt124m,
title={GPT-124M: A Compact Transformer Model for NLP},
author={Samkeet Sangai},
year={2024},
url={https://huggingface.co/samkeet/GPT_124M}
}
```
## Contact
For inquiries, contact [Samkeet Sangai](https://www.linkedin.com/in/samkeet-sangai/).