|
|
--- |
|
|
pipeline_tag: text-generation |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- gguf |
|
|
- llama.cpp |
|
|
- gpt2 |
|
|
- nanogpt |
|
|
license: other |
|
|
--- |
|
|
|
|
|
# GPT |
|
|
|
|
|
This is my own implementation of GPT-2.
|
|
|
|
|
|
|
|
It covers both tokenization and pretraining from scratch.
|
|
|
|
|
To make the project a little more end to end, I also implemented SFT with LoRA using `transformers`, quantization with llama.cpp, and inference with llama.cpp and Hugging Face. I'll try implementing these from scratch in future projects.
|
|
|
|
|
I skipped RLHF. That's probably best saved for a future project :)
|
|
|
|
|
I used a single A100 and an 8×A100 GPU cluster, both attached to a shared filesystem. This setup cost me roughly 80 dollars.
|
|
|
|
|
Setup, tokenization, pretraining, and SFT take roughly 4 hours.
|
|
|
|
|
You can run the model with llama.cpp, or if you want to try it on your iOS/Android device, I recommend PocketPal.
|
|
|
|
|
|
|
|
--- |
|
|
## Transformer Architecture |
|
|
|
|
|
 |
|
|
|
|
|
|
|
|
|
|
|
## `train-model.py` |
|
|
|
|
|
This file contains **both** the model implementation **and** the training loop. |
|
|
|
|
|
### Key classes and components

This section maps the main code in the repo to the conceptual pieces of the model and training pipeline.
|
|
|
|
|
### `GPTConfig` |
|
|
Holds model hyperparameters (a minimal config sketch follows the list):
|
|
|
|
|
- `block_size` (context length, max T) |
|
|
- `vocab_size` (V) |
|
|
- `n_layer` (L) |
|
|
- `n_head` (H) |
|
|
- `n_embd` (C) |
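A minimal sketch of what this config can look like; the defaults shown are the standard GPT-2-small values and are my assumption, not read from the repo:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max context length (T_max)
    vocab_size: int = 50257  # V, the GPT-2 BPE vocabulary
    n_layer: int = 12        # L
    n_head: int = 12         # H
    n_embd: int = 768        # C
```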
|
|
|
|
|
### `GPT` |
|
|
Top-level decoder-only Transformer. |
|
|
|
|
|
Key submodules (a forward-pass sketch follows the list):
|
|
|
|
|
- `wte`: token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ |
|
|
- `wpe`: position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$ |
|
|
- `h`: stack of `Block` modules (length `n_layer`) |
|
|
- `ln_f`: final LayerNorm |
|
|
- `lm_head`: output projection to vocab (**tied** to `wte`) |
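A hedged sketch of the forward pass these submodules imply (dropout, initialization, and other details omitted; `Block` is sketched in the next subsection):

```python
import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),  # W_E
            wpe=nn.Embedding(config.block_size, config.n_embd),  # W_P
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.transformer.wte.weight  # weight tying

    def forward(self, idx):                       # idx: (B, T) token IDs
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)  # positions 0..T-1
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        return self.lm_head(x)                    # (B, T, V) logits
```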
|
|
|
|
|
### `Block` |
|
|
One Transformer block (pre-LN residual); a sketch follows the list:
|
|
|
|
|
- `ln_1` → `attn` → residual add |
|
|
- `ln_2` → `mlp` → residual add |
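As a sketch; note that the residual stream itself is never normalized, only each sublayer's input:

```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # ln_1 -> attn -> residual add
        x = x + self.mlp(self.ln_2(x))   # ln_2 -> mlp -> residual add
        return x
```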
|
|
|
|
|
### `CausalSelfAttention` |
|
|
Multi-head causal self-attention (sketched after the list):
|
|
|
|
|
- projects $X$ into $Q,K,V$ |
|
|
- uses causal masking ($s>t$ masked) via fused attention |
|
|
- projects concatenated heads back to $C$ |
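A sketch using PyTorch's fused kernel, `F.scaled_dot_product_attention` with `is_causal=True`, which applies the $s>t$ mask internally (the single fused QKV projection is a common GPT-2 idiom and an assumption here):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # W_O output projection
        self.n_head = config.n_head

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape each to (B, H, T, d) with head dim d = C // H
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused causal attention
        y = y.transpose(1, 2).contiguous().view(B, T, C)             # concat heads back to C
        return self.c_proj(y)
```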
|
|
|
|
|
### `MLP` |
|
|
Feed-forward network (sketched below):
|
|
|
|
|
- $C\to 4C\to C$ with GELU |
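As a sketch; the tanh-approximate GELU matches the formula given later on this page:

```python
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # C -> 4C
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # 4C -> C

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```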
|
|
|
|
|
### `DataLoaderLite` |
|
|
Loads tokenized `.npy` shards and yields `(x, y)` pairs for next-token prediction. |
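Conceptually it does something like the following (a simplified sketch: one shard, no multi-process offsets):

```python
import numpy as np
import torch

def batches(shard_path, B, T):
    tokens = torch.from_numpy(np.load(shard_path).astype(np.int64))
    pos = 0
    while pos + B * T + 1 <= len(tokens):
        buf = tokens[pos : pos + B * T + 1]
        x = buf[:-1].view(B, T)  # inputs
        y = buf[1:].view(B, T)   # targets = inputs shifted one token left
        yield x, y
        pos += B * T
```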
|
|
|
|
|
--- |
|
|
|
|
|
## fineweb.py |
|
|
|
|
|
- downloads FineWeb-Edu split |
|
|
- tokenizes with GPT-2 BPE (tiktoken) |
|
|
- writes token shards to disk as `.npy` (first shard often used as validation) |
|
|
|
|
|
--- |
|
|
|
|
|
## evals.py |
|
|
|
|
|
Standalone HellaSwag evaluator (multiple-choice by loss). |
|
|
To use it with your trained model, swap HF model loading for your `GPT` checkpoint. |
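"Multiple-choice by loss" means each candidate ending is scored by the model's average cross-entropy on the ending tokens, and the lowest-loss ending wins. A sketch, with token IDs assumed already prepared:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_ending(model, ctx_ids, endings):
    """ctx_ids: list[int]; endings: list[list[int]]; returns index of best ending."""
    losses = []
    for ending in endings:
        ids = torch.tensor(ctx_ids + ending)[None]  # (1, T)
        logits = model(ids[:, :-1])                 # predict token t+1 from each prefix
        targets = ids[:, 1:]
        ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
        losses.append(ce[0, -len(ending):].mean())  # average loss over ending tokens only
    return int(torch.stack(losses).argmin())
```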
|
|
|
|
|
--- |
|
|
|
|
|
## Transformer formulas |
|
|
|
|
|
|
|
|
**Notation:** batch $B$, sequence length $T$, embedding dim $C$, heads $H$, head dim $d=C/H$, vocab $V$, layers $L$. |
|
|
|
|
|
### Token + positional embeddings |
|
|
|
|
|
Let token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ and position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$. |
|
|
With token IDs $\mathrm{idx}\in\{0,\dots,V-1\}^{B\times T}$ and positions $p_t=t$: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
x_0[b,t,:] = W_E[\mathrm{idx}_{b,t}] + W_P[p_t] |
|
|
\quad \in \mathbb{R}^{C}. |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
### LayerNorm (per token, over channels) |
|
|
|
|
|
For $u\in\mathbb{R}^{C}$: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
\mu(u) &= \frac{1}{C}\sum_{i=1}^{C} u_i, \\ |
|
|
\sigma^2(u) &= \frac{1}{C}\sum_{i=1}^{C}\left(u_i-\mu(u)\right)^2, \\ |
|
|
\mathrm{LN}(u) &= \gamma \odot \frac{u-\mu(u)}{\sqrt{\sigma^2(u)+\varepsilon}} + \beta, |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
with learnable $\gamma,\beta\in\mathbb{R}^{C}$. |
|
|
|
|
|
### Transformer block (Pre-LN residual), repeated for $l=0,\dots,L-1$ |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
x_{l+\frac12} &= x_l + \mathrm{Attn}\!\left(\mathrm{LN}(x_l)\right),\\ |
|
|
x_{l+1} &= x_{l+\frac12} + \mathrm{MLP}\!\left(\mathrm{LN}(x_{l+\frac12})\right). |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
### Multi-head causal self-attention |
|
|
|
|
|
Given $X=\mathrm{LN}(x)\in\mathbb{R}^{B\times T\times C}$, per head $h\in\{1,\dots,H\}$: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
Q^{(h)} &= X W_Q^{(h)} + b_Q^{(h)},\\ |
|
|
K^{(h)} &= X W_K^{(h)} + b_K^{(h)},\\ |
|
|
V^{(h)} &= X W_V^{(h)} + b_V^{(h)}, |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
where $Q^{(h)},K^{(h)},V^{(h)}\in\mathbb{R}^{B\times T\times d}$. |
|
|
|
|
|
Causal masked scores: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
S^{(h)}_{t,s} &= \frac{\langle Q^{(h)}_{t},K^{(h)}_{s}\rangle}{\sqrt{d}} + M_{t,s},\\ |
|
|
M_{t,s} &= |
|
|
\begin{cases} |
|
|
0, & s \le t,\\ |
|
|
-\infty, & s > t. |
|
|
\end{cases} |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
Attention weights and output: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
A^{(h)}_{t,s} &= \mathrm{softmax}_s\!\left(S^{(h)}_{t,s}\right),\\ |
|
|
O^{(h)}_{t} &= \sum_{s=0}^{T-1} A^{(h)}_{t,s}\, V^{(h)}_{s}. |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
Concatenate heads and project: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
O &= \mathrm{Concat}\left(O^{(1)},\dots,O^{(H)}\right)\in\mathbb{R}^{B\times T\times C},\\ |
|
|
\mathrm{Attn}(X) &= O W_O + b_O \in \mathbb{R}^{B\times T\times C}. |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
### MLP (feed-forward) |
|
|
|
|
|
Uses expansion ratio $4\times$: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
\mathrm{MLP}(X) &= \mathrm{GELU}(XW_1+b_1)\,W_2 + b_2,\\ |
|
|
W_1 &\in \mathbb{R}^{C\times 4C},\quad W_2\in \mathbb{R}^{4C\times C}. |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
### GELU (tanh approximation) |
|
|
|
|
|
$$ |
|
|
\mathrm{GELU}(z)=\frac12 z\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(z+0.044715 z^3\right)\right)\right). |
|
|
$$ |
|
|
|
|
|
### Final LayerNorm + LM head (tied weights) |
|
|
|
|
|
Let $\hat{x}=\mathrm{LN}_f(x_L)$: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
\mathrm{logits} &= \hat{x} W_U + b_U \in \mathbb{R}^{B\times T\times V},\\ |
|
|
W_U &= W_E^\top \quad \text{(weight tying)}. |
|
|
\end{aligned} |
|
|
$$ |
|
|
|
|
|
### Training objective (next-token cross entropy) |
|
|
|
|
|
With targets $y\in\{0,\dots,V-1\}^{B\times T}$: |
|
|
|
|
|
$$ |
|
|
\begin{aligned} |
|
|
p(y_{b,t}\mid x) &= \mathrm{softmax}(\mathrm{logits}_{b,t,:})_{y_{b,t}},\\ |
|
|
\mathcal{L} &= -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log p(y_{b,t}\mid x). |
|
|
\end{aligned} |
|
|
$$ |
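In PyTorch this objective is a single call (with `logits` and `y` shaped as above):

```python
import torch.nn.functional as F

# logits: (B, T, V) model outputs, y: (B, T) integer targets
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
```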
|
|
|
|
|
--- |
|
|
|
|
|
## Walkthrough: `fineweb.py` |
|
|
|
|
|
This script builds **tokenized training shards** from FineWeb-Edu. |
|
|
|
|
|
What it does (the core tokenize step is sketched after this list):
|
|
|
|
|
* Loads dataset: |
|
|
|
|
|
* `load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")` |
|
|
* Tokenizes with `tiktoken.get_encoding("gpt2")` |
|
|
* Prepends `<|endoftext|>` token to each document |
|
|
* Writes `.npy` shards of **`shard_size = 1e8` tokens** each |
|
|
* Uses multiprocessing (`os.cpu_count()//2` workers) |
|
|
* Naming: |
|
|
|
|
|
* first shard is `"val"` (so you always have a validation shard) |
|
|
* later shards are `"train"` |
|
|
* output files look like: `edufineweb_train_000001.npy` (numpy adds `.npy`) |
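The core per-document step looks roughly like this (a sketch; `uint16` works because the GPT-2 vocab size is below 2^16):

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

def tokenize(doc):
    # prepend <|endoftext|> so documents stay separated in the token stream
    tokens = [eot] + enc.encode_ordinary(doc["text"])
    return np.array(tokens, dtype=np.uint16)
```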
|
|
|
|
|
What to change most often: |
|
|
|
|
|
* `remote_name` (`sample-10BT` → bigger subsets if you want) |
|
|
* `shard_size` (smaller if disk is tight / you want more shards) |
|
|
* `local_dir` (just make it match `train-model.py`’s `data_root`) |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## Start-to-End Execution
|
|
|
|
|
## Repo layout |
|
|
|
|
|
* **Training / pipeline code (all in `src/`)** |
|
|
|
|
|
* `fineweb.py` → tokenization to `src/edu_fineweb10B/` |
|
|
* `train-model.py` → pretraining, outputs to `src/log/` |
|
|
* `convert_ckpt_to_hf.py` → `.pt` checkpoint → HF folder |
|
|
* `evals.py` → eval HF folder on HellaSwag |
|
|
* `sft.py` → LoRA SFT (outputs `src/hf_sft_lora/`) |
|
|
* `merge_lora.py` → merge LoRA into base (outputs merged HF folder) |
|
|
* `user_assistant.jinja` → your llama.cpp chat template |
|
|
* **Artifacts produced** (not included in the git repo, but available on Hugging Face)
|
|
|
|
|
* `src/log/model_19072.pt` (final pretrain ckpt) |
|
|
* `src/hf_pretrained/` (HF-exported base) |
|
|
* `src/hf_sft_lora/` (LoRA adapters + checkpoints) |
|
|
* `src/hf_sft_merged/` (merged HF model) |
|
|
* `src/gpt2-Q4_K_M_2.gguf` (quantized GGUF) |
|
|
|
|
|
--- |
|
|
|
|
|
## Two-instance architecture (what runs where) |
|
|
|
|
|
I used **two Lambda GPU instances** mounted to the **same shared filesystem**: |
|
|
|
|
|
* **Instance A (8×A100)**: **Pretraining** |
|
|
* **Instance B (1×A100)**: **Tokenization + Eval + SFT + Merge + GGUF conversion + Quantization + Upload** |
|
|
|
|
|
> I attach the filesystem in Lambda’s UI. After attaching, verify both machines see the same path (example below). |
|
|
|
|
|
### Verify shared mount (both instances) |
|
|
|
|
|
```bash |
|
|
df -h |
|
|
ls -lah /home/ubuntu/GPT |
|
|
``` |
|
|
|
|
|
Clone the repo:
|
|
|
|
|
```bash |
|
|
export REPO=/home/ubuntu/GPT |
|
|
|
|
|
git clone https://github.com/ShrithikShahapure/GPT.git "$REPO"
|
|
cd $REPO |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 1) Environment setup (run on BOTH instances) |
|
|
|
|
|
### System deps |
|
|
|
|
|
```bash |
|
|
sudo apt-get update |
|
|
sudo apt-get install -y git git-lfs python3-venv python3-pip build-essential cmake pkg-config |
|
|
git lfs install |
|
|
``` |
|
|
|
|
|
### Python venv + packages |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
python3 -m venv .venv |
|
|
source .venv/bin/activate |
|
|
python -m pip install -U pip |
|
|
|
|
|
pip install numpy tqdm requests tiktoken datasets transformers accelerate safetensors peft trl huggingface_hub |
|
|
pip install torch --index-url https://download.pytorch.org/whl/cu121 |
|
|
``` |
|
|
|
|
|
### (Optional but recommended) Put HF caches on the shared filesystem |
|
|
|
|
|
This avoids re-downloading datasets/models separately on each instance: |
|
|
|
|
|
```bash |
|
|
export HF_HOME=$REPO/.hf |
|
|
export HF_DATASETS_CACHE=$HF_HOME/datasets |
|
|
export TRANSFORMERS_CACHE=$HF_HOME/transformers |
|
|
mkdir -p "$HF_HOME" |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 2) Build llama.cpp (on whichever machine you’ll convert/quantize/test on)
|
|
|
|
|
|
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
git clone https://github.com/ggml-org/llama.cpp |
|
|
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON |
|
|
cmake --build llama.cpp/build -j |
|
|
``` |
|
|
|
|
|
`llama.cpp` expects models in **GGUF** and provides conversion scripts. ([GitHub][1]) |
|
|
|
|
|
--- |
|
|
|
|
|
## 3) Tokenization (run on **Instance B: 1×A100**) |
|
|
|
|
|
### 3.1 Run FineWeb tokenization |
|
|
|
|
|
This writes shards under `src/edu_fineweb10B/`: |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
python src/fineweb.py |
|
|
``` |
|
|
|
|
|
### 3.2 Ensure `hellaswag.py` exists (only if missing)
|
|
|
|
|
`train-model.py` imports `hellaswag`. If `src/hellaswag.py` doesn’t exist, do: |
|
|
|
|
|
```bash |
|
|
cd $REPO/src |
|
|
cp -f evals.py hellaswag.py |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 4) Pretraining (run on **Instance A: 8×A100**) |
|
|
|
|
|
From your repo root: |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
cd src |
|
|
torchrun --standalone --nproc_per_node=8 train-model.py |
|
|
``` |
|
|
|
|
|
### Watch training logs |
|
|
|
|
|
```bash |
|
|
tail -n 200 -F $REPO/src/log/log.txt |
|
|
``` |
|
|
|
|
|
### Checkpoints |
|
|
|
|
|
Your checkpoints land in: |
|
|
|
|
|
```bash |
|
|
ls -lah $REPO/src/log/model_*.pt |
|
|
``` |
|
|
|
|
|
For example, after a full run:
|
|
|
|
|
* `model_05000.pt` |
|
|
* `model_10000.pt` |
|
|
* `model_15000.pt` |
|
|
* `model_19072.pt` |
|
|
|
|
|
--- |
|
|
|
|
|
## 5) Convert checkpoint → Hugging Face model (run on **Instance B**) |
|
|
|
|
|
Convert your final checkpoint into HF format (this produces `src/hf_pretrained`):
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
python3 src/convert_ckpt_to_hf.py \ |
|
|
--ckpt src/log/model_19072.pt \ |
|
|
--out src/hf_pretrained |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 6) Evaluate HF model (run on **Instance B**) |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
python3 src/evals.py -m src/hf_pretrained -d cuda
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 7) SFT (LoRA) (run on **Instance B**)
|
|
|
|
|
Run the SFT script, `src/sft.py`:
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
CUDA_VISIBLE_DEVICES=0 python3 src/sft.py |
|
|
``` |
|
|
|
|
|
Expected output: |
|
|
|
|
|
* `src/hf_sft_lora/` (adapters + checkpoints) |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 8) Merge LoRA → merged HF model (run on **Instance B**) |
|
|
|
|
|
Run: |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
python3 src/merge_lora.py |
|
|
``` |
|
|
|
|
|
|
|
|
Evaluate merged model: |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
python3 src/evals.py -m src/hf_sft_merged -d cuda |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 9) Convert merged HF → GGUF + Quantize (run on **Instance B**) |
|
|
|
|
|
### 9.1 HF → fp16 GGUF |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
source .venv/bin/activate |
|
|
|
|
|
python llama.cpp/convert_hf_to_gguf.py \ |
|
|
src/hf_sft_merged \ |
|
|
--outfile src/gpt2-f16.gguf |
|
|
``` |
|
|
|
|
|
### 9.2 Quantize to Q4_K_M |
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
./llama.cpp/build/bin/llama-quantize \ |
|
|
src/gpt2-f16.gguf \ |
|
|
src/gpt2-Q4_K_M_2.gguf \ |
|
|
Q4_K_M |
|
|
``` |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## 10) Upload to Hugging Face (run on **Instance B**)

### Login
|
|
|
|
|
```bash |
|
|
pip install -U "huggingface_hub[cli]" |
|
|
hf auth login |
|
|
``` |
|
|
|
|
|
### Upload the GGUF |
|
|
|
|
|
Set your repo: |
|
|
|
|
|
```bash |
|
|
export HF_REPO="your-username/your-gguf-repo" |
|
|
``` |
|
|
|
|
|
Upload: |
|
|
|
|
|
```bash |
|
|
hf upload "$HF_REPO" src/gpt2-Q4_K_M_2.gguf gpt2-Q4_K_M_2.gguf --repo-type model |
|
|
``` |
|
|
|
|
|
(Optionally upload your template too) |
|
|
|
|
|
```bash |
|
|
hf upload "$HF_REPO" src/user_assistant.jinja user_assistant.jinja --repo-type model |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 11) Inference with llama.cpp (“ONE paragraph + `<END>`” flow)
|
|
|
|
|
From the repo root (note: the GGUF is `src/gpt2-Q4_K_M_2.gguf`):
|
|
|
|
|
```bash |
|
|
cd $REPO |
|
|
|
|
|
./llama.cpp/build/bin/llama-cli \ |
|
|
-m src/gpt2-Q4_K_M_2.gguf \ |
|
|
--jinja --chat-template-file src/user_assistant.jinja \ |
|
|
-cnv -st \ |
|
|
-sys "Answer in exactly ONE paragraph (no blank lines). End your answer with: <END>" \ |
|
|
-r "<END>" \ |
|
|
-n 200 --temp 0.2 --top-p 0.95 \ |
|
|
--repeat-penalty 1.15 --repeat-last-n 128 |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## 12) Pull the GGUF from Hugging Face (Mac/Linux) and use it |
|
|
|
|
|
### Download via HF CLI |
|
|
|
|
|
```bash |
|
|
pip install -U "huggingface_hub[cli]" |
|
|
hf download "$HF_REPO" gpt2-Q4_K_M_2.gguf --local-dir . |
|
|
``` |
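The same download from Python, if you prefer (a sketch using `huggingface_hub`; substitute your actual repo id):

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="your-username/your-gguf-repo",  # same value as $HF_REPO above
    filename="gpt2-Q4_K_M_2.gguf",
)
print(path)  # local path to the cached GGUF
```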
|
|
|
|
|
### Or let llama.cpp pull from HF (caching supported) |
|
|
|
|
|
Hugging Face documents running GGUFs with llama.cpp by pointing to the HF repo/file, and llama.cpp caches downloads (cache path controlled by `LLAMA_CACHE`). ([Hugging Face][2]) |
|
|
|
|
|
--- |
|
|
|
|
|
## 13) PocketPal (iPhone) — load GGUF from Hugging Face and run offline |
|
|
|
|
|
PocketPal supports downloading **GGUF model weights** (including from Hugging Face) and chatting offline. ([App Store][3]) |
|
|
PocketPal’s project also notes **Hugging Face Hub integration** (browse/download models inside the app). ([GitHub][4]) |
|
|
|
|
|
### Recommended flow |
|
|
|
|
|
1. Upload `gpt2-Q4_K_M_2.gguf` to Hugging Face (Section 10). |
|
|
2. On iPhone: |
|
|
|
|
|
* Open PocketPal |
|
|
* Go to Models / Download / Hugging Face (wording varies by version) |
|
|
* Search your repo (`your-username/your-gguf-repo`) |
|
|
* Download `gpt2-Q4_K_M_2.gguf` |
|
|
3. Start a chat offline. |
|
|
|
|
|
> If PocketPal asks for a chat template, choose one that matches the “User / Assistant” formatting, or keep prompts simple (`user_assistant.jinja` is the template used with llama.cpp above).
|
|
|
|
|
--- |
|
|
|
|
|
|