---
library_name: transformers
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
tags:
- GPT
- GPT-3 Small
- GPT-3 Medium
- GPT-3 Large
- GPT-3 XL
- GPT-3 2.7B
- GPT-3 6.7B
- GPT-3 13B
- GPT-3 175B
- GPT-3
- GPT-2
- GPT-2 124M
- transformers
- mit
- HuggingFace
- fineweb-edu
- Decoder-Only
---
# Model Card for GPT-124M

## Overview

GPT-124M is a decoder-only transformer model based on OpenAI’s GPT-2 architecture. It is trained as a general-purpose language model for text generation, which makes it useful for applications such as text completion.

- **Library:** 🤗 `transformers`
- **License:** MIT
- **Datasets:** `HuggingFaceFW/fineweb-edu`
- **Language:** English
- **Base Model:** `openai-community/gpt2`
- **Pipeline Tag:** `text-generation`
- **Developer:** Samkeet Sangai
- **Funded By:** Samkeet Sangai
- **Shared By:** Samkeet Sangai
- **Model Type:** GPT Decoder-Only

## Model Sources

- **Paper:** [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- **Paper:** [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165)
- **Paper:** [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556)
- **Video:** [Andrej Karpathy-Let's reproduce GPT-2 (124M)](https://youtu.be/l8pRSuU81PU?si=KAo1y9dHYQAGJmj5)
- **Demo:** [GPT 124M Demo](https://huggingface.co/spaces/samkeet/GPT_124M)
- **GitHub:** [SamkeetSangai/GPT_124M](https://github.com/SamkeetSangai/GPT_124M)

## Model Details

### Model Description
GPT-124M is a lightweight generative language model trained on the `fineweb-edu` dataset. It can generate coherent and contextually relevant text, but the base model is not tuned for instruction following, safety, or factual accuracy.

### Training Configuration
- **Block Size:** `1024`
- **Vocabulary Size:** `50304`
- **Number of Layers:** `12`
- **Number of Attention Heads:** `12`
- **Embedding Size:** `768`
- **Hardware:** `8x NVIDIA RTX 4090 GPUs`
- **Training Duration:** `13 hours`
- **Dataset:** `fineweb-edu` (10 billion tokens)
- **Training Date:** `January 2025`
- **Validation Dataset:** 100 million tokens of HuggingFaceFW/fineweb-edu
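
As a rough sanity check on the name "124M", the configuration above can be turned into a parameter count. The sketch below assumes the standard GPT-2 layout (learned position embeddings, biases on every linear layer, a 4x MLP expansion, and an output head that shares weights with the token embedding). These details are assumptions based on the GPT-2 architecture, not read from this model's code, and `GPTConfig` and `count_parameters` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    # Values taken from the training configuration listed above.
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768


def count_parameters(cfg: GPTConfig) -> int:
    """Count parameters for an assumed GPT-2 style layout with tied embeddings."""
    d = cfg.n_embd
    token_emb = cfg.vocab_size * d   # also used as the output head (assumed weight tying)
    pos_emb = cfg.block_size * d     # learned position embeddings
    layer_norm = 2 * d               # gain and bias
    attn = (d * 3 * d + 3 * d) + (d * d + d)       # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)    # up projection + down projection
    block = 2 * layer_norm + attn + mlp
    return token_emb + pos_emb + cfg.n_layer * block + layer_norm  # plus final layer norm


total = count_parameters(GPTConfig())
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes out at roughly 124 million, most of it in the 12 transformer blocks and the token embedding.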
  
## Usage
You can use this model for text generation using the `transformers` library.

### Method 1: Using Pipeline
```python
# Import necessary modules from transformers
from transformers import pipeline, AutoTokenizer

# Load tokenizer
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Create text generation pipeline
pipe = pipeline("text-generation", model=model_name, tokenizer=tokenizer, trust_remote_code=True, device="cpu")

# Generate text
result = pipe("Earth revolves around the", do_sample=True, max_length=40, temperature=0.9, top_p=0.5, top_k=50)
print("Pipeline Output:", result)
```

### Method 2: Direct Generation
```python
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Function for direct tokenization and text generation
def generate_text(input_text, device='cpu'):
    tokens = tokenizer.encode(input_text, return_tensors='pt').to(device)
    model.to(device)

    # Generate output without tracking gradients
    with torch.no_grad():
        output = model.generate(
            tokens,
            do_sample=True,
            max_length=40,
            temperature=0.9,
            top_p=0.5,
            top_k=50,
        )

    # Decode the first (and only) sequence in the batch
    generated_sentence = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_sentence

# Generate text
input_text = "Earth revolves around the"
print("Direct Output:", generate_text(input_text))
```

### Fine-tuning & Downstream Use
This model can be fine-tuned for specific NLP applications like:
- Dialogue generation
- Text summarization
- Creative writing
- Code generation

## Limitations & Risks

### Out-of-Scope Use
- The base model is **not instruction-tuned** and has no tuning for safety, ethics, or factual accuracy.
- It may produce **biased, misleading, or unsafe outputs**.
- It should **not** be used for tasks requiring high reliability, such as medical, legal, or financial applications.

### Bias, Risks, and Limitations
- The dataset may contain biases present in public web data.
- The model does not filter or detect offensive content.
- The model may **hallucinate** incorrect facts.

### Recommendations
- Always **verify** generated content before use.
- Implement **content filtering mechanisms** for deployment.
- Use in supervised environments only.

## Evaluation

### Training & Validation Loss
Validation was conducted using `100 million tokens` from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss curves converge stably with minimal overfitting. The training loss reached a minimum of 2.88, while the validation loss stabilized at 2.97.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/670142e648894dfbedacacaf/fAwiSHr4f9SmO9PYiCntY.png)
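
The losses above can also be read as perplexities. The sketch below assumes the reported values are mean per-token cross-entropy in nats (the PyTorch default). The card does not state the unit, so treat this as an illustration rather than a reported metric.

```python
import math

# Loss values quoted above; the unit (nats per token) is an assumption.
train_loss = 2.88
val_loss = 2.97


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(loss)


print(f"train perplexity: {perplexity(train_loss):.1f}")
print(f"val perplexity:   {perplexity(val_loss):.1f}")
```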

### Results
The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both ~124M parameters). Despite being trained on only `10 billion tokens`, compared to GPT-3 Small's `300 billion tokens`, GPT-124M outperformed both models on the `HellaSwag` evaluation. This advantage is likely due to the specialized training data (educational content), in contrast to GPT-3 Small’s broader multilingual and multi-domain training data.

According to Chinchilla’s scaling laws, an optimal token-to-parameter ratio suggests that a 124M-parameter model ideally requires `2.48 billion tokens` for training. The excess training tokens used in GPT-3 Small might have led to diminishing returns in performance.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/670142e648894dfbedacacaf/Ne2MYAB2C0yHWFJLjCww3.png)
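
The `2.48 billion tokens` figure follows from the commonly cited Chinchilla rule of thumb of about 20 training tokens per parameter. The sketch below reproduces that arithmetic and compares it with the two token budgets mentioned above. The ratio of 20 is an approximation, and the function name is illustrative.

```python
# Rule-of-thumb ratio associated with the Chinchilla scaling laws (approximate).
TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Approximate compute-optimal number of training tokens for a model size."""
    return n_params * ratio


n_params = 124_000_000
optimal = chinchilla_optimal_tokens(n_params)
print(f"compute-optimal budget: {optimal / 1e9:.2f}B tokens")

# Token budgets quoted in this section.
budgets = {"GPT-124M": 10_000_000_000, "GPT-3 Small": 300_000_000_000}
for name, tokens in budgets.items():
    print(f"{name}: {tokens / optimal:.1f}x the compute-optimal budget")
```

By this estimate both models are trained past the compute-optimal point, GPT-124M by roughly 4x and GPT-3 Small by roughly 120x.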

### Key Insights from Evaluation
- **Efficient Training:** The model performs well relative to its training token count. Training was spread across the 8 GPUs with Distributed Data Parallel (DDP), which shortens wall-clock time but does not by itself change how much the model learns per token.
- **Data-Specific Advantage:** Training exclusively on educational data may have given GPT-124M an edge in evaluation metrics like `HellaSwag`.
- **Scaling Considerations:** GPT-3 Small, despite being trained on 300B tokens, does not show proportionally better performance, which is consistent with diminishing returns beyond the compute-optimal token budget.

## Environmental Impact

- **Hardware Used:** `8x NVIDIA RTX 4090 GPUs`
- **Training Time:** `13 hours -> 104 GPU hours`
- **Estimated Carbon Emissions:** `13.48 kg CO2 eq.`
- **Equivalent to:**
  - `54.5 km` driven by an average ICE car
  - `6.75 kg` of coal burned
  - `0.22` tree seedlings sequestering carbon for 10 years

## Technical Specifications

### Model Architecture
GPT-124M follows the architecture of OpenAI's GPT-2, which consists of:
- **Transformer-based decoder model**
- **Self-attention mechanism**
- **Layer normalization & feed-forward networks**

### Compute Infrastructure
- **Hardware:** 8x NVIDIA RTX 4090 GPUs
- **Software:** PyTorch, Hugging Face Transformers
- **Precision:** FP32

## Instruction-Tuned Model

### Training Data

The instruction-tuned GPT-124M is fine-tuned on the **`tatsu-lab/alpaca`** dataset, containing high-quality instruction–response pairs across reasoning, explanation, summarization, and creative tasks. Samples are **length-filtered** to fit the 1024-token context window, counting instruction, input, response, and EOS tokens.

### Prompt & Objective

Training follows an Alpaca-style format:

```
### Instruction:
<instruction and optional input>

### Response:
<target output>
```

Causal language modeling is used, with **loss applied only to response tokens** (prompt tokens masked with `-100`) and an explicit EOS token appended.
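
A minimal sketch of how one training example could be assembled under this scheme: the prompt is formatted, prompt positions are masked with `-100` (the default `ignore_index` of PyTorch's cross-entropy loss), an EOS token is appended, and over-length samples are dropped. The helper names, the way instruction and input are joined, and the toy character-level tokenizer are all assumptions for illustration, not the author's code.

```python
IGNORE_INDEX = -100   # positions with this label are skipped by the loss
MAX_LEN = 1024        # context window of the model


def format_prompt(instruction: str, input_text: str = "") -> str:
    """Build the prompt part of the template (joining with a blank line is assumed)."""
    body = instruction if not input_text else f"{instruction}\n\n{input_text}"
    return f"### Instruction:\n{body}\n\n### Response:\n"


def build_example(prompt_ids, response_ids, eos_id, max_len=MAX_LEN):
    """Return input_ids/labels for one sample, or None if it exceeds max_len."""
    input_ids = prompt_ids + response_ids + [eos_id]
    if len(input_ids) > max_len:
        return None  # length filter: prompt, response, and EOS are all counted
    # Only response tokens and the EOS token contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


def toy_encode(text: str):
    """Toy stand-in for a real tokenizer: one id per character."""
    return [ord(c) for c in text]


prompt_ids = toy_encode(format_prompt("Name the largest planet."))
example = build_example(prompt_ids, toy_encode("Jupiter"), eos_id=0)
print(example["labels"][-8:])
```

In a real pipeline the toy encoder would be replaced by the model's tokenizer. Most causal LM implementations shift labels by one position inside the loss computation, so the labels here stay aligned with `input_ids`.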

### Training Setup

* **Platform:** Kaggle (GPU-backed notebooks)
* **Framework:** PyTorch
* **Precision:** FP32
* **Optimizer:** AdamW with warmup + cosine decay
* **Stability:** Gradient clipping and fixed-length batching
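
The warmup + cosine decay schedule can be sketched as a plain function of the step index. The exact hyperparameters used for this model are not stated here, so the values below (`max_lr`, `min_lr`, `warmup_steps`, `max_steps`) are hypothetical placeholders.

```python
import math


def lr_at(step: int, max_lr: float, min_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)


# Hypothetical settings, chosen only to show the shape of the curve.
for step in (0, 9, 10, 55, 100):
    print(step, round(lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=10, max_steps=100), 6))
```

The returned value would typically be written into each optimizer parameter group before every step. Gradient clipping is applied separately, for example with `torch.nn.utils.clip_grad_norm_`.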

### Fine-Tuning Choices

* Supports **full fine-tuning** and **LoRA-based parameter-efficient tuning**
* LoRA can be merged into base weights for a standalone instruct model
* Supervised fine-tuning (SFT) chosen for simplicity and reproducibility
* No RLHF or safety-specific tuning applied
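
Merging LoRA into the base weights means folding the low-rank update into each adapted matrix, so that no adapter is needed at inference time. The sketch below uses the standard LoRA formulation, `W_merged = W + (alpha / r) * B @ A`, with tiny hand-written matrices. It illustrates the idea only and is not taken from this model's training code.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update (alpha / r) * B @ A into the base weight W."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy shapes: W is (out=2, in=3), A is (r=1, in=3), B is (out=2, r=1).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
merged = merge_lora(W, A, B, alpha=2.0, r=1)

# The merged matrix gives the same output as base path + adapter path.
x = [[1.0], [1.0], [1.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate = [[b[0] + 2.0 * a[0]] for b, a in zip(base_out, adapter_out)]
print(matmul(merged, x), separate)
```

With the `peft` library, the equivalent step is usually `merge_and_unload()` on the adapted model.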

### Outcome

Instruction tuning improves instruction following, output structure, and task performance while preserving the base model’s generative capabilities. The model remains non-aligned and may hallucinate.

## Citation

If you use this model, please cite:

```bibtex
@article{gpt124m,
  title={GPT-124M: A Compact Transformer Model for NLP},
  author={Samkeet Sangai},
  year={2024},
  url={https://huggingface.co/samkeet/GPT_124M}
}
```

## Contact
For inquiries, contact [Samkeet Sangai](https://www.linkedin.com/in/samkeet-sangai/).