---
pipeline_tag: text-generation
language:
- en
tags:
- gguf
- llama.cpp
- gpt2
- nanogpt
license: other
---
# GPT
This is my own implementation of GPT-2.
It covers both tokenization and pretraining from scratch.
To make the project a bit more end to end, I also implemented SFT with LoRA using transformers, quantization with llama.cpp, and inference with llama.cpp and Hugging Face. I'll try implementing those pieces from scratch in future projects.
I skipped RLHF; that's probably material for a better project :)
I used a single-A100 instance and an 8×A100 GPU cluster, both attached to a shared filesystem. This setup cost me roughly 80 dollars.
Setup, tokenization, pretraining, and SFT take roughly 4 hours.
You can run the model with llama.cpp, or, if you want to try it on an iOS/Android device, I recommend PocketPal.
---
## Transformer Architecture
![Training model architecture](train-model-architecture.png)
## `train-model.py`
This file contains **both** the model implementation **and** the training loop.
### 1) Model code
#### Key classes and components
This section maps the main code in `train-model.py` to the conceptual pieces of the model and training pipeline.
### `GPTConfig`
Holds model hyperparameters:
- `block_size` (context length, max T)
- `vocab_size` (V)
- `n_layer` (L)
- `n_head` (H)
- `n_embd` (C)
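A minimal sketch of such a config as a dataclass; the default values here are the standard GPT-2 small sizes and are my assumption, not necessarily the exact values in `train-model.py`:
```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max context length T
    vocab_size: int = 50257  # GPT-2 BPE vocabulary size V
    n_layer: int = 12        # number of Transformer blocks L
    n_head: int = 12         # attention heads H
    n_embd: int = 768        # embedding width C
```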
### `GPT`
Top-level decoder-only Transformer.
Key submodules:
- `wte`: token embedding matrix $W_E \in \mathbb{R}^{V\times C}$
- `wpe`: position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$
- `h`: stack of `Block` modules (length `n_layer`)
- `ln_f`: final LayerNorm
- `lm_head`: output projection to vocab (**tied** to `wte`)
### `Block`
One Transformer block (pre-LN residual):
- `ln_1` → `attn` → residual add
- `ln_2` → `mlp` → residual add
### `CausalSelfAttention`
Multi-head causal self-attention:
- projects $X$ into $Q,K,V$
- uses causal masking ($s>t$ masked) via fused attention
- projects concatenated heads back to $C$
### `MLP`
Feed-forward network:
- $C\to 4C\to C$ with GELU
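A minimal, self-contained sketch of these three modules, assuming a nanoGPT-style layout (names match the lists above; details in `train-model.py` may differ):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # fused Q, K, V projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection W_O

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, H, T, d) so each head attends independently
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused attention with the causal mask (positions s > t are masked)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # concatenate heads
        return self.c_proj(y)

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # C -> 4C
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # 4C -> C

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # pre-LN residual: normalize, transform, add back
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```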
### `DataLoaderLite`
Loads tokenized `.npy` shards and yields `(x, y)` pairs for next-token prediction.
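A simplified single-process sketch of the idea (the actual class in `train-model.py` may also handle multi-GPU rank offsets and other details):
```python
import os
import numpy as np
import torch

class DataLoaderLite:
    def __init__(self, data_root, split, B, T):
        # gather the tokenized .npy shards belonging to this split ("train" or "val")
        shards = sorted(os.listdir(data_root))
        self.shards = [os.path.join(data_root, s) for s in shards if split in s]
        self.B, self.T = B, T
        self.shard_idx, self.pos = 0, 0
        self.tokens = self._load(self.shards[self.shard_idx])

    def _load(self, path):
        return torch.from_numpy(np.load(path).astype(np.int64))

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)  # inputs
        y = buf[1:].view(B, T)   # targets: the same stream shifted one token right
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):  # move on to the next shard
            self.shard_idx = (self.shard_idx + 1) % len(self.shards)
            self.tokens = self._load(self.shards[self.shard_idx])
            self.pos = 0
        return x, y
```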
---
## fineweb.py
- downloads FineWeb-Edu split
- tokenizes with GPT-2 BPE (tiktoken)
- writes token shards to disk as `.npy` (first shard often used as validation)
---
## evals.py
Standalone HellaSwag evaluator (multiple-choice by loss).
To use it with your trained model, swap HF model loading for your `GPT` checkpoint.
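The core idea, sketched with a hypothetical helper (the real `evals.py` interface and batching may differ): score each candidate ending by its average token loss under the model and pick the lowest.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def choose_ending(model, ctx_tokens, endings_tokens):
    """Hypothetical helper: return the index of the ending with the lowest loss.
    `model` is assumed to be an HF-style causal LM returning `.logits`;
    `ctx_tokens` / `endings_tokens` are lists of token IDs on the model's device."""
    avg_losses = []
    for ending in endings_tokens:
        ids = torch.tensor([ctx_tokens + ending])
        logits = model(ids).logits                          # (1, T, V)
        # next-token loss over the ending region only
        shift_logits = logits[0, len(ctx_tokens) - 1 : -1]  # predicts ending tokens
        targets = ids[0, len(ctx_tokens):]
        avg_losses.append(F.cross_entropy(shift_logits, targets).item())
    return int(torch.tensor(avg_losses).argmin())
```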
---
## Transformer formulas
**Notation:** batch $B$, sequence length $T$, embedding dim $C$, heads $H$, head dim $d=C/H$, vocab $V$, layers $L$.
### Token + positional embeddings
Let token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ and position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$.
With token IDs $\mathrm{idx}\in\{0,\dots,V-1\}^{B\times T}$ and positions $p_t=t$:
$$
\begin{aligned}
x_0[b,t,:] = W_E[\mathrm{idx}_{b,t}] + W_P[p_t]
\quad \in \mathbb{R}^{C}.
\end{aligned}
$$
### LayerNorm (per token, over channels)
For $u\in\mathbb{R}^{C}$:
$$
\begin{aligned}
\mu(u) &= \frac{1}{C}\sum_{i=1}^{C} u_i, \\
\sigma^2(u) &= \frac{1}{C}\sum_{i=1}^{C}\left(u_i-\mu(u)\right)^2, \\
\mathrm{LN}(u) &= \gamma \odot \frac{u-\mu(u)}{\sqrt{\sigma^2(u)+\varepsilon}} + \beta,
\end{aligned}
$$
with learnable $\gamma,\beta\in\mathbb{R}^{C}$.
### Transformer block (Pre-LN residual), repeated for $l=0,\dots,L-1$
$$
\begin{aligned}
x_{l+\frac12} &= x_l + \mathrm{Attn}\!\left(\mathrm{LN}(x_l)\right),\\
x_{l+1} &= x_{l+\frac12} + \mathrm{MLP}\!\left(\mathrm{LN}(x_{l+\frac12})\right).
\end{aligned}
$$
### Multi-head causal self-attention
Given $X=\mathrm{LN}(x)\in\mathbb{R}^{B\times T\times C}$, per head $h\in\{1,\dots,H\}$:
$$
\begin{aligned}
Q^{(h)} &= X W_Q^{(h)} + b_Q^{(h)},\\
K^{(h)} &= X W_K^{(h)} + b_K^{(h)},\\
V^{(h)} &= X W_V^{(h)} + b_V^{(h)},
\end{aligned}
$$
where $Q^{(h)},K^{(h)},V^{(h)}\in\mathbb{R}^{B\times T\times d}$.
Causal masked scores:
$$
\begin{aligned}
S^{(h)}_{t,s} &= \frac{\langle Q^{(h)}_{t},K^{(h)}_{s}\rangle}{\sqrt{d}} + M_{t,s},\\
M_{t,s} &=
\begin{cases}
0, & s \le t,\\
-\infty, & s > t.
\end{cases}
\end{aligned}
$$
Attention weights and output:
$$
\begin{aligned}
A^{(h)}_{t,s} &= \mathrm{softmax}_s\!\left(S^{(h)}_{t,s}\right),\\
O^{(h)}_{t} &= \sum_{s=0}^{T-1} A^{(h)}_{t,s}\, V^{(h)}_{s}.
\end{aligned}
$$
Concatenate heads and project:
$$
\begin{aligned}
O &= \mathrm{Concat}\left(O^{(1)},\dots,O^{(H)}\right)\in\mathbb{R}^{B\times T\times C},\\
\mathrm{Attn}(X) &= O W_O + b_O \in \mathbb{R}^{B\times T\times C}.
\end{aligned}
$$
### MLP (feed-forward)
Uses expansion ratio $4\times$:
$$
\begin{aligned}
\mathrm{MLP}(X) &= \mathrm{GELU}(XW_1+b_1)\,W_2 + b_2,\\
W_1 &\in \mathbb{R}^{C\times 4C},\quad W_2\in \mathbb{R}^{4C\times C}.
\end{aligned}
$$
### GELU (tanh approximation)
$$
\mathrm{GELU}(z)=\frac12 z\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(z+0.044715 z^3\right)\right)\right).
$$
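The same formula as plain Python, for reference (PyTorch's `nn.GELU(approximate="tanh")` implements this approximation):
```python
import math

def gelu_tanh(z: float) -> float:
    # tanh approximation of GELU, matching the formula above
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z**3)))
```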
### Final LayerNorm + LM head (tied weights)
Let $\hat{x}=\mathrm{LN}_f(x_L)$:
$$
\begin{aligned}
\mathrm{logits} &= \hat{x} W_U + b_U \in \mathbb{R}^{B\times T\times V},\\
W_U &= W_E^\top \quad \text{(weight tying)}.
\end{aligned}
$$
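In PyTorch the tie is typically realized by sharing one tensor between the embedding and the output projection; a small sketch (the sizes are illustrative):
```python
import torch.nn as nn

n_embd, vocab_size = 768, 50257                      # illustrative sizes
wte = nn.Embedding(vocab_size, n_embd)               # W_E, stored as (V, C)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # maps C -> V
# share the tensor; nn.Linear applies its weight transposed, realizing W_U = W_E^T
lm_head.weight = wte.weight
```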
### Training objective (next-token cross entropy)
With targets $y\in\{0,\dots,V-1\}^{B\times T}$:
$$
\begin{aligned}
p(y_{b,t}\mid x) &= \mathrm{softmax}(\mathrm{logits}_{b,t,:})_{y_{b,t}},\\
\mathcal{L} &= -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log p(y_{b,t}\mid x).
\end{aligned}
$$
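In PyTorch this is a single `F.cross_entropy` over the flattened logits; a minimal sketch with random tensors standing in for the model output and targets:
```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 50257                       # example sizes (assumptions)
logits = torch.randn(B, T, V)               # stand-in for model output
y = torch.randint(0, V, (B, T))             # next-token targets
loss = F.cross_entropy(logits.view(B * T, V), y.view(B * T))  # mean over all B*T positions
```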
---
## Walkthrough: `fineweb.py`
This script builds **tokenized training shards** from FineWeb-Edu.
What it does:
* Loads dataset:
* `load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")`
* Tokenizes with `tiktoken.get_encoding("gpt2")`
* Prepends `<|endoftext|>` token to each document
* Writes `.npy` shards of **`shard_size = 1e8` tokens** each
* Uses multiprocessing (`os.cpu_count()//2` workers)
* Naming:
* first shard is `"val"` (so you always have a validation shard)
* later shards are `"train"`
* output files look like: `edufineweb_train_000001.npy` (numpy adds `.npy`)
What to change most often:
* `remote_name` (`sample-10BT` → bigger subsets if you want)
* `shard_size` (smaller if disk is tight / you want more shards)
* `local_dir` (just make it match `train-model.py`’s `data_root`)
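A condensed sketch of the per-document tokenization step (simplified; the multiprocessing pool, shard-writing loop, and the uint16 packing assumed here live in `src/fineweb.py` and may differ in detail):
```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
# ID of the <|endoftext|> delimiter (50256 for GPT-2 BPE)
eot = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

def tokenize(doc):
    # prepend <|endoftext|> so documents stay separated in the flat token stream
    tokens = [eot] + enc.encode_ordinary(doc["text"])
    tokens = np.array(tokens)
    assert (0 <= tokens).all() and (tokens < 2**16).all(), "token ids must fit in uint16"
    return tokens.astype(np.uint16)  # packed into .npy shards of shard_size tokens
```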
---
## Start to End Execution
## Repo layout
* **Training / pipeline code (all in `src/`)**
* `fineweb.py` → tokenization to `src/edu_fineweb10B/`
* `train-model.py` → pretraining, outputs to `src/log/`
  * `convert_ckpt_to_hf.py` → `.pt` checkpoint → HF folder
* `evals.py` → eval HF folder on HellaSwag
* `sft.py` → LoRA SFT (outputs `src/hf_sft_lora/`)
* `merge_lora.py` → merge LoRA into base (outputs merged HF folder)
* `user_assistant.jinja` → your llama.cpp chat template
* **Artifacts produced** (these are not included in the git repo, but you can find them on Hugging Face)
* `src/log/model_19072.pt` (final pretrain ckpt)
* `src/hf_pretrained/` (HF-exported base)
* `src/hf_sft_lora/` (LoRA adapters + checkpoints)
* `src/hf_sft_merged/` (merged HF model)
* `src/gpt2-Q4_K_M_2.gguf` (quantized GGUF)
---
## Two-instance architecture (what runs where)
I used **two Lambda GPU instances** mounted to the **same shared filesystem**:
* **Instance A (8×A100)**: **Pretraining**
* **Instance B (1×A100)**: **Tokenization + Eval + SFT + Merge + GGUF conversion + Quantization + Upload**
> I attach the filesystem in Lambda’s UI. After attaching, verify both machines see the same path (example below).
### Verify shared mount (both instances)
```bash
df -h
ls -lah /home/ubuntu/GPT
```
Clone the repo:
```bash
export REPO=/home/ubuntu/GPT
git clone https://github.com/ShrithikShahapure/GPT.git
cd $REPO
```
---
## 1) Environment setup (run on BOTH instances)
### System deps
```bash
sudo apt-get update
sudo apt-get install -y git git-lfs python3-venv python3-pip build-essential cmake pkg-config
git lfs install
```
### Python venv + packages
```bash
cd $REPO
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install numpy tqdm requests tiktoken datasets transformers accelerate safetensors peft trl huggingface_hub
pip install torch --index-url https://download.pytorch.org/whl/cu121
```
### (Optional but recommended) Put HF caches on the shared filesystem
This avoids re-downloading datasets/models separately on each instance:
```bash
export HF_HOME=$REPO/.hf
export HF_DATASETS_CACHE=$HF_HOME/datasets
export TRANSFORMERS_CACHE=$HF_HOME/transformers
mkdir -p "$HF_HOME"
```
---
## 2) Build llama.cpp (do on the machine you’ll convert/quantize/test on)
```bash
cd $REPO
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build -j
```
`llama.cpp` expects models in **GGUF** and provides conversion scripts. ([GitHub][1])
---
## 3) Tokenization (run on **Instance B: 1×A100**)
### 3.1 Run FineWeb tokenization
This writes shards under `src/edu_fineweb10B/`:
```bash
cd $REPO
source .venv/bin/activate
python src/fineweb.py
```
### 3.2 Ensure `hellaswag.py` exists (only if missing)
`train-model.py` imports `hellaswag`. If `src/hellaswag.py` doesn’t exist, do:
```bash
cd $REPO/src
cp -f evals.py hellaswag.py
```
---
## 4) Pretraining (run on **Instance A: 8×A100**)
From your repo root:
```bash
cd $REPO
source .venv/bin/activate
cd src
torchrun --standalone --nproc_per_node=8 train-model.py
```
### Watch training logs
```bash
tail -n 200 -F $REPO/src/log/log.txt
```
### Checkpoints
Your checkpoints land in:
```bash
ls -lah $REPO/src/log/model_*.pt
```
Example checkpoints from my run:
* `model_05000.pt`
* `model_10000.pt`
* `model_15000.pt`
* `model_19072.pt`
---
## 5) Convert checkpoint → Hugging Face model (run on **Instance B**)
Convert the final checkpoint into HF format (this produces `src/hf_pretrained`):
```bash
cd $REPO
source .venv/bin/activate
python3 src/convert_ckpt_to_hf.py \
--ckpt src/log/model_19072.pt \
--out src/hf_pretrained
```
---
## 6) Evaluate HF model (run on **Instance B**)
```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_pretrained -d cuda
```
---
## 7) SFT (LoRA) (run on **Instance B**)
Your repo has `src/sft.py`. Run it:
```bash
cd $REPO
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 python3 src/sft.py
```
Expected output:
* `src/hf_sft_lora/` (adapters + checkpoints)
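For orientation, here is a hedged sketch of what a LoRA setup with `peft` typically looks like for a GPT-2-style model; the hyperparameters, dataset, and training loop in `src/sft.py` may well differ:
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("src/hf_pretrained")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["c_attn"],               # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()           # only the LoRA adapters are trainable
# ... run the SFT loop / Trainer here, then save just the adapters:
model.save_pretrained("src/hf_sft_lora")
```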
---
## 8) Merge LoRA → merged HF model (run on **Instance B**)
Run:
```bash
cd $REPO
source .venv/bin/activate
python3 src/merge_lora.py
```
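Under the hood, merging LoRA adapters back into the base weights with `peft` typically looks like this sketch (`src/merge_lora.py` may differ in details):
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("src/hf_pretrained")
merged = PeftModel.from_pretrained(base, "src/hf_sft_lora").merge_and_unload()
merged.save_pretrained("src/hf_sft_merged")
AutoTokenizer.from_pretrained("src/hf_pretrained").save_pretrained("src/hf_sft_merged")
```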
Evaluate merged model:
```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_sft_merged -d cuda
```
---
## 9) Convert merged HF → GGUF + Quantize (run on **Instance B**)
### 9.1 HF → fp16 GGUF
```bash
cd $REPO
source .venv/bin/activate
python llama.cpp/convert_hf_to_gguf.py \
src/hf_sft_merged \
--outfile src/gpt2-f16.gguf
```
### 9.2 Quantize to Q4_K_M
```bash
cd $REPO
./llama.cpp/build/bin/llama-quantize \
src/gpt2-f16.gguf \
src/gpt2-Q4_K_M_2.gguf \
Q4_K_M
```
---
## 10) Upload the GGUF to Hugging Face (run on **Instance B**)
### Login
```bash
pip install -U "huggingface_hub[cli]"
hf auth login
```
### Upload the GGUF
Set your repo:
```bash
export HF_REPO="your-username/your-gguf-repo"
```
Upload:
```bash
hf upload "$HF_REPO" src/gpt2-Q4_K_M_2.gguf gpt2-Q4_K_M_2.gguf --repo-type model
```
Optionally, upload your chat template too:
```bash
hf upload "$HF_REPO" src/user_assistant.jinja user_assistant.jinja --repo-type model
```
---
## 11) Inference with llama.cpp (the “ONE paragraph + <END>” flow)
From the repo root (the GGUF is `src/gpt2-Q4_K_M_2.gguf`):
```bash
cd $REPO
./llama.cpp/build/bin/llama-cli \
-m src/gpt2-Q4_K_M_2.gguf \
--jinja --chat-template-file src/user_assistant.jinja \
-cnv -st \
-sys "Answer in exactly ONE paragraph (no blank lines). End your answer with: <END>" \
-r "<END>" \
-n 200 --temp 0.2 --top-p 0.95 \
--repeat-penalty 1.15 --repeat-last-n 128
```
---
## 12) Pull the GGUF from Hugging Face (Mac/Linux) and use it
### Download via HF CLI
```bash
pip install -U "huggingface_hub[cli]"
hf download "$HF_REPO" gpt2-Q4_K_M_2.gguf --local-dir .
```
### Or let llama.cpp pull from HF (caching supported)
Hugging Face documents running GGUFs with llama.cpp by pointing to the HF repo/file, and llama.cpp caches downloads (cache path controlled by `LLAMA_CACHE`). ([Hugging Face][2])
---
## 13) PocketPal (iPhone) — load GGUF from Hugging Face and run offline
PocketPal supports downloading **GGUF model weights** (including from Hugging Face) and chatting offline. ([App Store][3])
PocketPal’s project also notes **Hugging Face Hub integration** (browse/download models inside the app). ([GitHub][4])
### Recommended flow
1. Upload `gpt2-Q4_K_M_2.gguf` to Hugging Face (Section 10).
2. On iPhone:
* Open PocketPal
* Go to Models / Download / Hugging Face (wording varies by version)
* Search your repo (`your-username/your-gguf-repo`)
* Download `gpt2-Q4_K_M_2.gguf`
3. Start a chat offline.
> If PocketPal asks for a chat template, choose one that matches the “User / Assistant” formatting, or keep prompts simple (`user_assistant.jinja` is the same template used with llama.cpp above).
---