---
pipeline_tag: text-generation
language:
- en
tags:
- gguf
- llama.cpp
- gpt2
- nanogpt
license: other
---

# GPT

This is my own implementation of GPT-2, covering both tokenization and pretraining from scratch. I wanted to make it a bit more end to end by adding SFT with LoRA using transformers, quantization with llama.cpp, and inference with llama.cpp and Hugging Face; I'll try implementing those pieces from scratch in future projects. I skipped RLHF. That's probably best saved for a future project :)

I used an A100 and an 8xA100 GPU cluster, attached to a shared filesystem. This setup cost me roughly 80 dollars. Setup, tokenization, pretraining, and SFT take roughly 4 hours.

You can run the model with llama.cpp, or if you want to try it on your iOS/Android device, I recommend PocketPal.

---

## Transformer Architecture

![Training model architecture](train-model-architecture.png)

## `train-model.py`

This file contains **both** the model implementation **and** the training loop.

### 1) Model code: key classes and components

This section maps the main code in this repo to the conceptual pieces of the model and training pipeline.

#### `GPTConfig`

Holds model hyperparameters:

- `block_size` (context length, max T)
- `vocab_size` (V)
- `n_layer` (L)
- `n_head` (H)
- `n_embd` (C)

#### `GPT`

Top-level decoder-only Transformer. Key submodules:

- `wte`: token embedding matrix $W_E \in \mathbb{R}^{V\times C}$
- `wpe`: position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$
- `h`: stack of `Block` modules (length `n_layer`)
- `ln_f`: final LayerNorm
- `lm_head`: output projection to vocab (**tied** to `wte`)

#### `Block`

One Transformer block (pre-LN residual):

- `ln_1` → `attn` → residual add
- `ln_2` → `mlp` → residual add

#### `CausalSelfAttention`

Multi-head causal self-attention:

- projects $X$ into $Q,K,V$
- uses causal masking ($s>t$ masked) via fused attention
- projects concatenated heads back to $C$

#### `MLP`

Feed-forward network:

- $C\to 4C\to C$ with GELU

#### `DataLoaderLite`

Loads tokenized `.npy` shards and yields `(x, y)` pairs for next-token prediction.

---

## fineweb.py

- downloads FineWeb-Edu split
- tokenizes with GPT-2 BPE (tiktoken)
- writes token shards to disk as `.npy` (first shard often used as validation)

---

## evals.py

Standalone HellaSwag evaluator (multiple-choice by loss). To use it with your trained model, swap HF model loading for your `GPT` checkpoint.

---

## Transformer formulas

**Notation:** batch $B$, sequence length $T$, embedding dim $C$, heads $H$, head dim $d=C/H$, vocab $V$, layers $L$.

### Token + positional embeddings

Let token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ and position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$. With token IDs $\mathrm{idx}\in\{0,\dots,V-1\}^{B\times T}$ and positions $p_t=t$:

$$
\begin{aligned}
x_0[b,t,:] = W_E[\mathrm{idx}_{b,t}] + W_P[p_t] \quad \in \mathbb{R}^{C}.
\end{aligned}
$$

### LayerNorm (per token, over channels)

For $u\in\mathbb{R}^{C}$:

$$
\begin{aligned}
\mu(u) &= \frac{1}{C}\sum_{i=1}^{C} u_i, \\
\sigma^2(u) &= \frac{1}{C}\sum_{i=1}^{C}\left(u_i-\mu(u)\right)^2, \\
\mathrm{LN}(u) &= \gamma \odot \frac{u-\mu(u)}{\sqrt{\sigma^2(u)+\varepsilon}} + \beta,
\end{aligned}
$$

with learnable $\gamma,\beta\in\mathbb{R}^{C}$.

### Transformer block (Pre-LN residual), repeated for $l=0,\dots,L-1$

$$
\begin{aligned}
x_{l+\frac12} &= x_l + \mathrm{Attn}\!\left(\mathrm{LN}(x_l)\right),\\
x_{l+1} &= x_{l+\frac12} + \mathrm{MLP}\!\left(\mathrm{LN}(x_{l+\frac12})\right).
\end{aligned}
$$
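To make the wiring concrete, here is a minimal PyTorch sketch of one pre-LN block, anticipating the attention and MLP formulas spelled out below. The submodule names `ln_1`, `attn`, `ln_2`, `mlp` follow the class list above; the `c_attn`/`c_fc`/`c_proj` projection names are assumptions in the usual GPT-2 style, not a verbatim excerpt of `train-model.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention with a fused QKV projection and fused attention kernel."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # projects X to Q, K, V in one matmul
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection W_O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, H, T, d) so attention runs per head
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # softmax(QK^T / sqrt(d) + causal mask) V, computed by the fused kernel
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # concatenate heads
        return self.c_proj(y)


class MLP(nn.Module):
    """Feed-forward network: C -> 4C -> C with GELU (tanh approximation)."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.gelu(self.c_fc(x), approximate="tanh"))


class Block(nn.Module):
    """One pre-LN Transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln_1(x))  # x_{l+1/2} = x_l + Attn(LN(x_l))
        x = x + self.mlp(self.ln_2(x))   # x_{l+1} = x_{l+1/2} + MLP(LN(x_{l+1/2}))
        return x
```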
### Multi-head causal self-attention

Given $X=\mathrm{LN}(x)\in\mathbb{R}^{B\times T\times C}$, per head $h\in\{1,\dots,H\}$:

$$
\begin{aligned}
Q^{(h)} &= X W_Q^{(h)} + b_Q^{(h)},\\
K^{(h)} &= X W_K^{(h)} + b_K^{(h)},\\
V^{(h)} &= X W_V^{(h)} + b_V^{(h)},
\end{aligned}
$$

where $Q^{(h)},K^{(h)},V^{(h)}\in\mathbb{R}^{B\times T\times d}$.

Causal masked scores:

$$
\begin{aligned}
S^{(h)}_{t,s} &= \frac{\langle Q^{(h)}_{t},K^{(h)}_{s}\rangle}{\sqrt{d}} + M_{t,s},\\
M_{t,s} &= \begin{cases} 0, & s \le t,\\ -\infty, & s > t. \end{cases}
\end{aligned}
$$

Attention weights and output:

$$
\begin{aligned}
A^{(h)}_{t,s} &= \mathrm{softmax}_s\!\left(S^{(h)}_{t,s}\right),\\
O^{(h)}_{t} &= \sum_{s=0}^{T-1} A^{(h)}_{t,s}\, V^{(h)}_{s}.
\end{aligned}
$$

Concatenate heads and project:

$$
\begin{aligned}
O &= \mathrm{Concat}\left(O^{(1)},\dots,O^{(H)}\right)\in\mathbb{R}^{B\times T\times C},\\
\mathrm{Attn}(X) &= O W_O + b_O \in \mathbb{R}^{B\times T\times C}.
\end{aligned}
$$

### MLP (feed-forward)

Uses expansion ratio $4\times$:

$$
\begin{aligned}
\mathrm{MLP}(X) &= \mathrm{GELU}(XW_1+b_1)\,W_2 + b_2,\\
W_1 &\in \mathbb{R}^{C\times 4C},\quad W_2\in \mathbb{R}^{4C\times C}.
\end{aligned}
$$

### GELU (tanh approximation)

$$
\mathrm{GELU}(z)=\frac12 z\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(z+0.044715 z^3\right)\right)\right).
$$

### Final LayerNorm + LM head (tied weights)

Let $\hat{x}=\mathrm{LN}_f(x_L)$:

$$
\begin{aligned}
\mathrm{logits} &= \hat{x} W_U + b_U \in \mathbb{R}^{B\times T\times V},\\
W_U &= W_E^\top \quad \text{(weight tying)}.
\end{aligned}
$$

### Training objective (next-token cross entropy)

With targets $y\in\{0,\dots,V-1\}^{B\times T}$:

$$
\begin{aligned}
p(y_{b,t}\mid x) &= \mathrm{softmax}(\mathrm{logits}_{b,t,:})_{y_{b,t}},\\
\mathcal{L} &= -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log p(y_{b,t}\mid x).
\end{aligned}
$$
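The last two formulas reduce to a couple of lines of PyTorch. Below is a small illustrative sketch; the tensor names and sizes are placeholders, and the bias $b_U$ is omitted for brevity:

```python
import torch
import torch.nn.functional as F

B, T, C, V = 2, 8, 768, 50257          # illustrative sizes

wte = torch.nn.Embedding(V, C)         # token embedding W_E (tied to the LM head)
x_hat = torch.randn(B, T, C)           # stand-in for LN_f(x_L)
y = torch.randint(V, (B, T))           # next-token targets

# Weight tying: W_U = W_E^T, so logits = x_hat @ W_E^T with shape (B, T, V)
logits = x_hat @ wte.weight.t()

# Mean cross entropy over all B*T positions = -(1/BT) sum log p(y | x)
loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
print(loss.item())
```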
---

## Walkthrough: `fineweb.py`

This script builds **tokenized training shards** from FineWeb-Edu.

What it does:

* Loads the dataset:
  * `load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")`
* Tokenizes with `tiktoken.get_encoding("gpt2")`
* Prepends the `<|endoftext|>` token to each document
* Writes `.npy` shards of **`shard_size = 1e8` tokens** each
* Uses multiprocessing (`os.cpu_count()//2` workers)
* Naming:
  * the first shard is `"val"` (so you always have a validation shard)
  * later shards are `"train"`
  * output files look like `edufineweb_train_000001.npy` (numpy adds `.npy`)

What to change most often:

* `remote_name` (`sample-10BT` → bigger subsets if you want)
* `shard_size` (smaller if disk is tight / you want more shards)
* `local_dir` (just make it match `train-model.py`'s `data_root`)
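As a reference for the per-document encoding step described above (delimiter token, GPT-2 BPE, compact integer storage), here is a minimal sketch. It mirrors what `fineweb.py` does but is not a verbatim excerpt; the function name and the `uint16` dtype are illustrative assumptions.

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens["<|endoftext|>"]  # document delimiter, prepended to every doc

def tokenize(doc: dict) -> np.ndarray:
    """Encode one document as <|endoftext|> followed by its GPT-2 BPE tokens."""
    tokens = [eot]
    tokens.extend(enc.encode_ordinary(doc["text"]))
    arr = np.array(tokens)
    # GPT-2's vocab (50257) fits in 16 bits, so shards can be stored compactly
    assert (0 <= arr).all() and (arr < 2**16).all(), "token ids must fit in uint16"
    return arr.astype(np.uint16)

# A shard is just a flat array of token ids saved with np.save
shard = tokenize({"text": "Hello world."})
np.save("edufineweb_example", shard)  # numpy appends the .npy extension
```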
---

## Start to End Execution

## Repo layout

* **Training / pipeline code (all in `src/`)**
  * `fineweb.py` → tokenization to `src/edu_fineweb10B/`
  * `train-model.py` → pretraining, outputs to `src/log/`
  * `convert_ckpt_to_hf.py` → `.pt` checkpoint → HF folder
  * `evals.py` → eval HF folder on HellaSwag
  * `sft.py` → LoRA SFT (outputs `src/hf_sft_lora/`)
  * `merge_lora.py` → merge LoRA into base (outputs merged HF folder)
  * `user_assistant.jinja` → your llama.cpp chat template
* **Artifacts produced** (these are not included in the git repo, but you can find them on Hugging Face)
  * `src/log/model_19072.pt` (final pretrain ckpt)
  * `src/hf_pretrained/` (HF-exported base)
  * `src/hf_sft_lora/` (LoRA adapters + checkpoints)
  * `src/hf_sft_merged/` (merged HF model)
  * `src/gpt2-Q4_K_M_2.gguf` (quantized GGUF)

---

## Two-instance architecture (what runs where)

I used **two Lambda GPU instances** mounted to the **same shared filesystem**:

* **Instance A (8×A100)**: **Pretraining**
* **Instance B (1×A100)**: **Tokenization + Eval + SFT + Merge + GGUF conversion + Quantization + Upload**

> I attach the filesystem in Lambda's UI. After attaching, verify both machines see the same path (example below).

### Verify shared mount (both instances)

```bash
df -h
ls -lah /home/ubuntu/GPT
```

Clone the repo:

```bash
export REPO=/home/ubuntu/GPT
git clone https://github.com/ShrithikShahapure/GPT.git
cd $REPO
```

---

## 1) Environment setup (run on BOTH instances)

### System deps

```bash
sudo apt-get update
sudo apt-get install -y git git-lfs python3-venv python3-pip build-essential cmake pkg-config
git lfs install
```

### Python venv + packages

```bash
cd $REPO
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install numpy tqdm requests tiktoken datasets transformers accelerate safetensors peft trl huggingface_hub
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### (Optional but recommended) Put HF caches on the shared filesystem

This avoids re-downloading datasets/models separately on each instance:

```bash
export HF_HOME=$REPO/.hf
export HF_DATASETS_CACHE=$HF_HOME/datasets
export TRANSFORMERS_CACHE=$HF_HOME/transformers
mkdir -p "$HF_HOME"
```

---

## 2) Build llama.cpp (do on the machine you'll convert/quantize/test on)

```bash
cd $REPO
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build -j
```

`llama.cpp` expects models in **GGUF** and provides conversion scripts. ([GitHub][1])

---

## 3) Tokenization (run on **Instance B: 1×A100**)

### 3.1 Run FineWeb tokenization

This writes shards under `src/edu_fineweb10B/`:

```bash
cd $REPO
source .venv/bin/activate
python src/fineweb.py
```

### 3.2 Ensure `hellaswag.py` exists (only if missing)

`train-model.py` imports `hellaswag`. If `src/hellaswag.py` doesn't exist, do:

```bash
cd $REPO/src
cp -f evals.py hellaswag.py
```

---

## 4) Pretraining (run on **Instance A: 8×A100**)

From your repo root:

```bash
cd $REPO
source .venv/bin/activate
cd src
torchrun --standalone --nproc_per_node=8 train-model.py
```

### Watch training logs

```bash
tail -n 200 -F $REPO/src/log/log.txt
```

### Checkpoints

Your checkpoints land in:

```bash
ls -lah $REPO/src/log/model_*.pt
```

Example you currently have:

* `model_05000.pt`
* `model_10000.pt`
* `model_15000.pt`
* `model_19072.pt`

---

## 5) Convert checkpoint → Hugging Face model (run on **Instance B**)

Convert your final checkpoint into HF format (you already have `hf_pretrained`, but here's the exact command):

```bash
cd $REPO
source .venv/bin/activate
python3 src/convert_ckpt_to_hf.py \
  --ckpt src/log/model_19072.pt \
  --out src/hf_pretrained
```

---

## 6) Evaluate HF model (run on **Instance B**)

```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_pretrained -d cuda
```

---

## 7) SFT (LoRA) (run on **Instance A**)

Your repo has `src/sft.py`. Run it:

```bash
cd $REPO
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 python3 src/sft.py
```

Expected output:

* `src/hf_sft_lora/` (adapters + checkpoints)

---

## 8) Merge LoRA → merged HF model (run on **Instance B**)

Run:

```bash
cd $REPO
source .venv/bin/activate
python3 src/merge_lora.py
```

Evaluate merged model:

```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_sft_merged -d cuda
```

---

## 9) Convert merged HF → GGUF + Quantize (run on **Instance B**)

### 9.1 HF → fp16 GGUF

```bash
cd $REPO
source .venv/bin/activate
python llama.cpp/convert_hf_to_gguf.py \
  src/hf_sft_merged \
  --outfile src/gpt2-f16.gguf
```

### 9.2 Quantize to Q4_K_M

```bash
cd $REPO
./llama.cpp/build/bin/llama-quantize \
  src/gpt2-f16.gguf \
  src/gpt2-Q4_K_M_2.gguf \
  Q4_K_M
```

---

## 10) Upload the GGUF to Hugging Face (run on **Instance B**)

### Login

```bash
pip install -U "huggingface_hub[cli]"
hf auth login
```

### Upload the GGUF

Set your repo:

```bash
export HF_REPO="your-username/your-gguf-repo"
```

Upload:

```bash
hf upload "$HF_REPO" src/gpt2-Q4_K_M_2.gguf gpt2-Q4_K_M_2.gguf --repo-type model
```

(Optionally upload your template too)

```bash
hf upload "$HF_REPO" src/user_assistant.jinja user_assistant.jinja --repo-type model
```

---

## 11) Inference with llama.cpp (your exact "ONE paragraph + " flow)

From repo root (note: your GGUF is `src/gpt2-Q4_K_M_2.gguf`):

```bash
cd $REPO
./llama.cpp/build/bin/llama-cli \
  -m src/gpt2-Q4_K_M_2.gguf \
  --jinja --chat-template-file src/user_assistant.jinja \
  -cnv -st \
  -sys "Answer in exactly ONE paragraph (no blank lines). End your answer with: " \
  -r "" \
  -n 200 --temp 0.2 --top-p 0.95 \
  --repeat-penalty 1.15 --repeat-last-n 128
```

---

## 12) Pull the GGUF from Hugging Face (Mac/Linux) and use it

### Download via HF CLI

```bash
pip install -U "huggingface_hub[cli]"
hf download "$HF_REPO" gpt2-Q4_K_M_2.gguf --local-dir .
```

### Or let llama.cpp pull from HF (caching supported)

Hugging Face documents running GGUFs with llama.cpp by pointing to the HF repo/file, and llama.cpp caches downloads (cache path controlled by `LLAMA_CACHE`). ([Hugging Face][2])
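If you'd rather pull the file from Python instead of the CLI, here's a small sketch using `huggingface_hub`; the repo id below is a placeholder for the same value you exported as `$HF_REPO`:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id; use the same value you exported as $HF_REPO above.
path = hf_hub_download(
    repo_id="your-username/your-gguf-repo",
    filename="gpt2-Q4_K_M_2.gguf",
    local_dir=".",
)
print(path)  # local path to the downloaded GGUF
```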
---

## 13) PocketPal (iPhone) — load GGUF from Hugging Face and run offline

PocketPal supports downloading **GGUF model weights** (including from Hugging Face) and chatting offline. ([App Store][3]) PocketPal's project also notes **Hugging Face Hub integration** (browse/download models inside the app). ([GitHub][4])

### Recommended flow

1. Upload `gpt2-Q4_K_M_2.gguf` to Hugging Face (Section 10).
2. On iPhone:
   * Open PocketPal
   * Go to Models / Download / Hugging Face (wording varies by version)
   * Search your repo (`your-username/your-gguf-repo`)
   * Download `gpt2-Q4_K_M_2.gguf`
3. Start a chat offline.

> If PocketPal asks for a chat template, choose the one that matches your "User / Assistant" formatting, or keep prompts simple (your `user_assistant.jinja` template is what you used in llama.cpp).

---