---
pipeline_tag: text-generation
language:
  - en
tags:
  - gguf
  - llama.cpp
  - gpt2
  - nanogpt
license: other
---

# GPT

This is my own implementation of GPT-2.

It covers both tokenization and pretraining from scratch.

To make the project a little more end to end, I also implemented SFT with LoRA using transformers, quantization with llama.cpp, and inference with both llama.cpp and Hugging Face. I'll try implementing these from scratch in future projects.

I skipped RLHF. That's probably material for a future project :)

I used a 1xA100 instance and an 8xA100 GPU cluster, both attached to a shared filesystem. This setup cost me roughly 80 dollars.

Setup, tokenization, pretraining, and SFT take roughly 4 hours in total.

You can run the model with llama.cpp, or if you want to try it on your iOS/Android device, I recommend PocketPal.


---
## Transformer Architecture 

![Training model architecture](train-model-architecture.png)



## `train-model.py`

This file contains **both** the model implementation **and** the training loop. The subsections below map the main classes to the conceptual pieces of the model and training pipeline.

### `GPTConfig`
Holds model hyperparameters:

- `block_size` (context length, max T)
- `vocab_size` (V)
- `n_layer` (L)
- `n_head` (H)
- `n_embd` (C)
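
As a sketch, this is roughly the shape of the config (the defaults shown are GPT-2-small values, assumed here rather than read from the repo):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # max context length T
    vocab_size: int = 50257  # V (GPT-2 BPE vocabulary)
    n_layer: int = 12        # L
    n_head: int = 12         # H
    n_embd: int = 768        # C
```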

### `GPT`
Top-level decoder-only Transformer.

Key submodules:

- `wte`: token embedding matrix $W_E \in \mathbb{R}^{V\times C}$
- `wpe`: position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$
- `h`: stack of `Block` modules (length `n_layer`)
- `ln_f`: final LayerNorm
- `lm_head`: output projection to vocab (**tied** to `wte`)

### `Block`
One Transformer block (pre-LN residual):

- `ln_1` → `attn` → residual add
- `ln_2` → `mlp` → residual add
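
A minimal sketch of this wiring (the `CausalSelfAttention` and `MLP` classes are sketched in the next two subsections):

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # pre-LN: normalize, transform, then add back to the residual stream
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```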

### `CausalSelfAttention`
Multi-head causal self-attention:

- projects $X$ into $Q,K,V$
- uses causal masking ($s>t$ masked) via fused attention
- projects concatenated heads back to $C$
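
A minimal sketch using PyTorch's fused `scaled_dot_product_attention`, which applies the causal mask internally (the repo's class may differ in details such as initialization):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection
        self.n_head = config.n_head

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, H, T, d) with d = C / H
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # fused causal attention: positions s > t are masked internally
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # concatenate heads
        return self.c_proj(y)
```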

### `MLP`
Feed-forward network:

- $C\to 4C\to C$ with GELU
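
Sketched in the same style (the tanh-approximate GELU matches the formula given later):

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # C -> 4C
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # 4C -> C

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```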

### `DataLoaderLite`
Loads tokenized `.npy` shards and yields `(x, y)` pairs for next-token prediction.
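
The core of the loader is a one-token offset between inputs and targets; a minimal sketch (shard rotation and multi-process offsets in the real class are omitted):

```python
import numpy as np
import torch

def next_batch(tokens: np.ndarray, pos: int, B: int, T: int):
    """Slice B*T+1 tokens; x is the input window, y is x shifted by one."""
    buf = torch.from_numpy(tokens[pos : pos + B * T + 1].astype(np.int64))
    x = buf[:-1].view(B, T)   # predict from these tokens...
    y = buf[1:].view(B, T)    # ...each position's target is the next token
    return x, y, pos + B * T
```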

---

## fineweb.py

- downloads FineWeb-Edu split
- tokenizes with GPT-2 BPE (tiktoken)
- writes token shards to disk as `.npy` (first shard often used as validation)

---

## evals.py

Standalone HellaSwag evaluator (multiple-choice by loss).
To use it with your trained model, swap HF model loading for your `GPT` checkpoint.
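
The selection rule amounts to: score each candidate ending by its mean next-token loss under the model and pick the lowest. A sketch of the idea (not the script's exact code; assumes an HF-style model whose output has `.logits`, and the helper name `pick_ending` is hypothetical):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_ending(model, ctx_ids: torch.Tensor, endings: list) -> int:
    """Return the index of the ending with the lowest mean token loss."""
    losses = []
    for end_ids in endings:
        ids = torch.cat([ctx_ids, end_ids]).unsqueeze(0)   # (1, T)
        logits = model(ids[:, :-1]).logits                 # (1, T-1, V)
        per_tok = F.cross_entropy(
            logits.squeeze(0),   # (T-1, V) predictions
            ids[0, 1:],          # next-token targets
            reduction="none",
        )
        losses.append(per_tok[-end_ids.numel():].mean())   # loss over the ending only
    return int(torch.stack(losses).argmin())
```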

---

## Transformer formulas 


**Notation:** batch $B$, sequence length $T$, embedding dim $C$, heads $H$, head dim $d=C/H$, vocab $V$, layers $L$.

### Token + positional embeddings

Let token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ and position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$.
With token IDs $\mathrm{idx}\in\{0,\dots,V-1\}^{B\times T}$ and positions $p_t=t$:

$$
\begin{aligned}
x_0[b,t,:] = W_E[\mathrm{idx}_{b,t}] + W_P[p_t]
\quad \in \mathbb{R}^{C}.
\end{aligned}
$$
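
In PyTorch this is a broadcasted add over the batch; a runnable sketch with assumed GPT-2-small dimensions:

```python
import torch
import torch.nn as nn

B, T, V, C = 2, 8, 50257, 768
wte = nn.Embedding(V, C)      # token embedding W_E
wpe = nn.Embedding(1024, C)   # position embedding W_P (T_max = 1024)
idx = torch.randint(V, (B, T))

pos = torch.arange(T)         # (T,)
x0 = wte(idx) + wpe(pos)      # (B, T, C); the (T, C) term broadcasts over the batch
```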

### LayerNorm (per token, over channels)

For $u\in\mathbb{R}^{C}$:

$$
\begin{aligned}
\mu(u) &= \frac{1}{C}\sum_{i=1}^{C} u_i, \\
\sigma^2(u) &= \frac{1}{C}\sum_{i=1}^{C}\left(u_i-\mu(u)\right)^2, \\
\mathrm{LN}(u) &= \gamma \odot \frac{u-\mu(u)}{\sqrt{\sigma^2(u)+\varepsilon}} + \beta,
\end{aligned}
$$

with learnable $\gamma,\beta\in\mathbb{R}^{C}$.
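
A quick numeric check of these formulas against PyTorch's built-in (with $\gamma=1$, $\beta=0$, $\varepsilon=10^{-5}$):

```python
import torch
import torch.nn.functional as F

u = torch.randn(5, 768)                            # 5 tokens, C = 768 channels
mu = u.mean(dim=-1, keepdim=True)
var = u.var(dim=-1, unbiased=False, keepdim=True)  # population variance, as in the formula
ln_manual = (u - mu) / torch.sqrt(var + 1e-5)
assert torch.allclose(ln_manual, F.layer_norm(u, (768,)), atol=1e-5)
```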

### Transformer block (Pre-LN residual), repeated for $l=0,\dots,L-1$

$$
\begin{aligned}
x_{l+\frac12} &= x_l + \mathrm{Attn}\!\left(\mathrm{LN}(x_l)\right),\\
x_{l+1} &= x_{l+\frac12} + \mathrm{MLP}\!\left(\mathrm{LN}(x_{l+\frac12})\right).
\end{aligned}
$$

### Multi-head causal self-attention

Given $X=\mathrm{LN}(x)\in\mathbb{R}^{B\times T\times C}$, per head $h\in\{1,\dots,H\}$:

$$
\begin{aligned}
Q^{(h)} &= X W_Q^{(h)} + b_Q^{(h)},\\
K^{(h)} &= X W_K^{(h)} + b_K^{(h)},\\
V^{(h)} &= X W_V^{(h)} + b_V^{(h)},
\end{aligned}
$$

where $Q^{(h)},K^{(h)},V^{(h)}\in\mathbb{R}^{B\times T\times d}$.

Causal masked scores:

$$
\begin{aligned}
S^{(h)}_{t,s} &= \frac{\langle Q^{(h)}_{t},K^{(h)}_{s}\rangle}{\sqrt{d}} + M_{t,s},\\
M_{t,s} &=
\begin{cases}
0, & s \le t,\\
-\infty, & s > t.
\end{cases}
\end{aligned}
$$

Attention weights and output:

$$
\begin{aligned}
A^{(h)}_{t,s} &= \mathrm{softmax}_s\!\left(S^{(h)}_{t,s}\right),\\
O^{(h)}_{t} &= \sum_{s=0}^{T-1} A^{(h)}_{t,s}\, V^{(h)}_{s}.
\end{aligned}
$$

Concatenate heads and project:

$$
\begin{aligned}
O &= \mathrm{Concat}\left(O^{(1)},\dots,O^{(H)}\right)\in\mathbb{R}^{B\times T\times C},\\
\mathrm{Attn}(X) &= O W_O + b_O \in \mathbb{R}^{B\times T\times C}.
\end{aligned}
$$

### MLP (feed-forward)

Uses expansion ratio $4\times$:

$$
\begin{aligned}
\mathrm{MLP}(X) &= \mathrm{GELU}(XW_1+b_1)\,W_2 + b_2,\\
W_1 &\in \mathbb{R}^{C\times 4C},\quad W_2\in \mathbb{R}^{4C\times C}.
\end{aligned}
$$

### GELU (tanh approximation)

$$
\mathrm{GELU}(z)=\frac12 z\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(z+0.044715 z^3\right)\right)\right).
$$

### Final LayerNorm + LM head (tied weights)

Let $\hat{x}=\mathrm{LN}_f(x_L)$:

$$
\begin{aligned}
\mathrm{logits} &= \hat{x} W_U + b_U \in \mathbb{R}^{B\times T\times V},\\
W_U &= W_E^\top \quad \text{(weight tying)}.
\end{aligned}
$$
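
In code, the tying is typically a single shared-tensor assignment inside `GPT.__init__` (attribute names per the submodule list above):

```python
# one tensor serves as both W_E (input embedding) and W_U (output projection)
self.lm_head.weight = self.wte.weight
```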

### Training objective (next-token cross entropy)

With targets $y\in\{0,\dots,V-1\}^{B\times T}$:

$$
\begin{aligned}
p(y_{b,t}\mid x) &= \mathrm{softmax}(\mathrm{logits}_{b,t,:})_{y_{b,t}},\\
\mathcal{L} &= -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log p(y_{b,t}\mid x).
\end{aligned}
$$
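
PyTorch's `cross_entropy` computes exactly this mean over flattened logits and targets; a runnable sketch:

```python
import torch
import torch.nn.functional as F

B, T, V = 2, 8, 50257
logits = torch.randn(B, T, V)    # model output
y = torch.randint(V, (B, T))     # next-token targets
# flatten to (B*T, V) and (B*T,); mean reduction gives -1/(BT) * sum log p
loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
```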

---

## Walkthrough: `fineweb.py`

This script builds **tokenized training shards** from FineWeb-Edu.

What it does:

* Loads dataset:

  * `load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")`
* Tokenizes with `tiktoken.get_encoding("gpt2")`
* Prepends `<|endoftext|>` token to each document
* Writes `.npy` shards of **`shard_size = 1e8` tokens** each
* Uses multiprocessing (`os.cpu_count()//2` workers)
* Naming:

  * first shard is `"val"` (so you always have a validation shard)
  * later shards are `"train"`
  * output files look like: `edufineweb_train_000001.npy` (numpy adds `.npy`)

What to change most often:

* `remote_name` (`sample-10BT` → bigger subsets if you want)
* `shard_size` (smaller if disk is tight / you want more shards)
* `local_dir` (just make it match `train-model.py`’s `data_root`)
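
For reference, the per-document tokenization step described above is roughly the following (a sketch; the script's exact function may differ). `uint16` works because the GPT-2 vocabulary fits in 16 bits:

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # id of <|endoftext|>

def tokenize(doc: dict) -> np.ndarray:
    """Tokenize one document, prepending <|endoftext|> as a separator."""
    tokens = [eot]
    tokens.extend(enc.encode_ordinary(doc["text"]))
    arr = np.array(tokens)
    assert (0 <= arr).all() and (arr < 2**16).all(), "token id out of uint16 range"
    return arr.astype(np.uint16)
```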


---

## Start to End Execution 

## Repo layout 

* **Training / pipeline code (all in `src/`)**

  * `fineweb.py` → tokenization to `src/edu_fineweb10B/`
  * `train-model.py` → pretraining, outputs to `src/log/`
  * `convert_ckpt_to_hf.py``.pt` checkpoint → HF folder
  * `evals.py` → eval HF folder on HellaSwag
  * `sft.py` → LoRA SFT (outputs `src/hf_sft_lora/`)
  * `merge_lora.py` → merge LoRA into base (outputs merged HF folder)
  * `user_assistant.jinja` → your llama.cpp chat template
* **Artifacts produced** (not included in the git repo, but available on Hugging Face)

  * `src/log/model_19072.pt` (final pretrain ckpt)
  * `src/hf_pretrained/` (HF-exported base)
  * `src/hf_sft_lora/` (LoRA adapters + checkpoints)
  * `src/hf_sft_merged/` (merged HF model)
  * `src/gpt2-Q4_K_M_2.gguf` (quantized GGUF)

---

## Two-instance architecture (what runs where)

I used **two Lambda GPU instances** mounted to the **same shared filesystem**:

* **Instance A (8×A100)**: **Pretraining**
* **Instance B (1×A100)**: **Tokenization + Eval + SFT + Merge + GGUF conversion + Quantization + Upload**

> I attach the filesystem in Lambda’s UI. After attaching, verify both machines see the same path (example below).

### Verify shared mount (both instances)

```bash
df -h
ls -lah /home/ubuntu/GPT
```

Clone the repo:

```bash
export REPO=/home/ubuntu/GPT

git clone https://github.com/ShrithikShahapure/GPT.git "$REPO"
cd "$REPO"
```

---

## 1) Environment setup (run on BOTH instances)

### System deps

```bash
sudo apt-get update
sudo apt-get install -y git git-lfs python3-venv python3-pip build-essential cmake pkg-config
git lfs install
```

### Python venv + packages

```bash
cd $REPO
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

pip install numpy tqdm requests tiktoken datasets transformers accelerate safetensors peft trl huggingface_hub
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### (Optional but recommended) Put HF caches on the shared filesystem

This avoids re-downloading datasets/models separately on each instance:

```bash
export HF_HOME=$REPO/.hf
export HF_DATASETS_CACHE=$HF_HOME/datasets
export TRANSFORMERS_CACHE=$HF_HOME/transformers
mkdir -p "$HF_HOME"
```

---

## 2) Build llama.cpp (do on the machine you’ll convert/quantize/test on)



```bash
cd $REPO
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build -j
```

`llama.cpp` expects models in **GGUF** and provides conversion scripts. ([GitHub][1])

---

## 3) Tokenization (run on **Instance B: 1×A100**)

### 3.1 Run FineWeb tokenization

This writes shards under `src/edu_fineweb10B/`:

```bash
cd $REPO
source .venv/bin/activate
python src/fineweb.py
```

### 3.2 Ensure `hellaswag.py` exists (only if missing)

`train-model.py` imports `hellaswag`. If `src/hellaswag.py` doesn’t exist, do:

```bash
cd $REPO/src
cp -f evals.py hellaswag.py
```

---

## 4) Pretraining (run on **Instance A: 8×A100**)

From your repo root:

```bash
cd $REPO
source .venv/bin/activate
cd src
torchrun --standalone --nproc_per_node=8 train-model.py
```

### Watch training logs

```bash
tail -n 200 -F $REPO/src/log/log.txt
```

### Checkpoints

Your checkpoints land in:

```bash
ls -lah $REPO/src/log/model_*.pt
```

From my run, for example:

* `model_05000.pt`
* `model_10000.pt`
* `model_15000.pt`
* `model_19072.pt`

---

## 5) Convert checkpoint → Hugging Face model (run on **Instance B**)

Convert the final checkpoint into HF format:

```bash
cd $REPO
source .venv/bin/activate
python3 src/convert_ckpt_to_hf.py \
  --ckpt src/log/model_19072.pt \
  --out  src/hf_pretrained
```

---

## 6) Evaluate HF model (run on **Instance B**)

```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_pretrained -d cuda
```

---

## 7) SFT (LoRA) (run on **Instance A**)

Run `src/sft.py`:

```bash
cd $REPO
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 python3 src/sft.py
```

Expected output:

* `src/hf_sft_lora/` (adapters + checkpoints)


---

## 8) Merge LoRA → merged HF model (run on **Instance B**)

Run:

```bash
cd $REPO
source .venv/bin/activate
python3 src/merge_lora.py
```


Evaluate merged model:

```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_sft_merged -d cuda
```

---

## 9) Convert merged HF → GGUF + Quantize (run on **Instance B**)

### 9.1 HF → fp16 GGUF

```bash
cd $REPO
source .venv/bin/activate

python llama.cpp/convert_hf_to_gguf.py \
  src/hf_sft_merged \
  --outfile src/gpt2-f16.gguf
```

### 9.2 Quantize to Q4_K_M

```bash
cd $REPO
./llama.cpp/build/bin/llama-quantize \
  src/gpt2-f16.gguf \
  src/gpt2-Q4_K_M_2.gguf \
  Q4_K_M
```


---


## 10) Upload to Hugging Face (run on **Instance B**)

### Login

```bash
pip install -U "huggingface_hub[cli]"
hf auth login
```

### Upload the GGUF

Set your repo:

```bash
export HF_REPO="your-username/your-gguf-repo"
```

Upload:

```bash
hf upload "$HF_REPO" src/gpt2-Q4_K_M_2.gguf gpt2-Q4_K_M_2.gguf --repo-type model
```

(Optionally upload your template too)

```bash
hf upload "$HF_REPO" src/user_assistant.jinja user_assistant.jinja --repo-type model
```

---

## 11) Inference with llama.cpp (the "ONE paragraph + <END>" flow)

From the repo root (note: the GGUF is `src/gpt2-Q4_K_M_2.gguf`):

```bash
cd $REPO

./llama.cpp/build/bin/llama-cli \
  -m src/gpt2-Q4_K_M_2.gguf \
  --jinja --chat-template-file src/user_assistant.jinja \
  -cnv -st \
  -sys "Answer in exactly ONE paragraph (no blank lines). End your answer with: <END>" \
  -r "<END>" \
  -n 200 --temp 0.2 --top-p 0.95 \
  --repeat-penalty 1.15 --repeat-last-n 128
```

---

## 12) Pull the GGUF from Hugging Face (Mac/Linux) and use it

### Download via HF CLI

```bash
pip install -U "huggingface_hub[cli]"
hf download "$HF_REPO" gpt2-Q4_K_M_2.gguf --local-dir .
```

### Or let llama.cpp pull from HF (caching supported)

Hugging Face documents running GGUFs with llama.cpp by pointing to the HF repo/file, and llama.cpp caches downloads (cache path controlled by `LLAMA_CACHE`). ([Hugging Face][2])
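
As a sketch, assuming a reasonably recent llama.cpp build (flag names have changed across versions; check `llama-cli --help` for yours):

```bash
# downloads the GGUF from the Hub on first run, then serves it from the cache
./llama.cpp/build/bin/llama-cli \
  --hf-repo "$HF_REPO" \
  --hf-file gpt2-Q4_K_M_2.gguf \
  -p "Hello"
```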

---

## 13) PocketPal (iPhone) — load GGUF from Hugging Face and run offline

PocketPal supports downloading **GGUF model weights** (including from Hugging Face) and chatting offline. ([App Store][3])
PocketPal’s project also notes **Hugging Face Hub integration** (browse/download models inside the app). ([GitHub][4])

### Recommended flow

1. Upload `gpt2-Q4_K_M_2.gguf` to Hugging Face (Section 10).
2. On iPhone:

   * Open PocketPal
   * Go to Models / Download / Hugging Face (wording varies by version)
   * Search your repo (`your-username/your-gguf-repo`)
   * Download `gpt2-Q4_K_M_2.gguf`
3. Start a chat offline.

> If PocketPal asks for a chat template, choose one that matches the "User / Assistant" formatting, or keep prompts simple (the `user_assistant.jinja` template is the same one used with llama.cpp above).

---