---
pipeline_tag: text-generation
language:
- en
tags:
- gguf
- llama.cpp
- gpt2
- nanogpt
license: other
---

# GPT

This is my own implementation of GPT-2, covering both tokenization and pretraining from scratch. I wanted to make it a bit more end to end by adding SFT with LoRA using transformers, quantization with llama.cpp, and inference with llama.cpp and Hugging Face; I'll try implementing those pieces from scratch in future projects. I skipped RLHF. That's probably best saved for a future project :)

I used an A100 and an 8xA100 GPU cluster, attached to a shared filesystem. This setup cost me roughly 80 dollars. Setup, tokenization, pretraining, and SFT take roughly 4 hours.

You can run the model with llama.cpp, or if you want to try it on your iOS/Android device, I recommend PocketPal.

---

## Transformer Architecture

![Training model architecture](train-model-architecture.png)

## `train-model.py`

This file contains **both** the model implementation **and** the training loop.

### 1) Model code: key classes and components

This section maps the main code in this repo to the conceptual pieces of the model and training pipeline.

#### `GPTConfig`

Holds model hyperparameters:

- `block_size` (context length, max T)
- `vocab_size` (V)
- `n_layer` (L)
- `n_head` (H)
- `n_embd` (C)

#### `GPT`

Top-level decoder-only Transformer. Key submodules:

- `wte`: token embedding matrix $W_E \in \mathbb{R}^{V\times C}$
- `wpe`: position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$
- `h`: stack of `Block` modules (length `n_layer`)
- `ln_f`: final LayerNorm
- `lm_head`: output projection to vocab (**tied** to `wte`)

#### `Block`

One Transformer block (pre-LN residual):

- `ln_1` → `attn` → residual add
- `ln_2` → `mlp` → residual add

#### `CausalSelfAttention`

Multi-head causal self-attention:

- projects $X$ into $Q,K,V$
- uses causal masking ($s>t$ masked) via fused attention
- projects concatenated heads back to $C$

#### `MLP`

Feed-forward network:

- $C\to 4C\to C$ with GELU

#### `DataLoaderLite`

Loads tokenized `.npy` shards and yields `(x, y)` pairs for next-token prediction.

---

## fineweb.py

- downloads FineWeb-Edu split
- tokenizes with GPT-2 BPE (tiktoken)
- writes token shards to disk as `.npy` (first shard often used as validation)

---

## evals.py

Standalone HellaSwag evaluator (multiple-choice by loss). To use it with your trained model, swap HF model loading for your `GPT` checkpoint.

---

## Transformer formulas

**Notation:** batch $B$, sequence length $T$, embedding dim $C$, heads $H$, head dim $d=C/H$, vocab $V$, layers $L$.

### Token + positional embeddings

Let token embedding matrix $W_E \in \mathbb{R}^{V\times C}$ and position embedding matrix $W_P \in \mathbb{R}^{T_{\max}\times C}$. With token IDs $\mathrm{idx}\in\{0,\dots,V-1\}^{B\times T}$ and positions $p_t=t$:

$$
\begin{aligned}
x_0[b,t,:] = W_E[\mathrm{idx}_{b,t}] + W_P[p_t] \quad \in \mathbb{R}^{C}.
\end{aligned}
$$

### LayerNorm (per token, over channels)

For $u\in\mathbb{R}^{C}$:

$$
\begin{aligned}
\mu(u) &= \frac{1}{C}\sum_{i=1}^{C} u_i, \\
\sigma^2(u) &= \frac{1}{C}\sum_{i=1}^{C}\left(u_i-\mu(u)\right)^2, \\
\mathrm{LN}(u) &= \gamma \odot \frac{u-\mu(u)}{\sqrt{\sigma^2(u)+\varepsilon}} + \beta,
\end{aligned}
$$

with learnable $\gamma,\beta\in\mathbb{R}^{C}$.

### Transformer block (Pre-LN residual), repeated for $l=0,\dots,L-1$

$$
\begin{aligned}
x_{l+\frac12} &= x_l + \mathrm{Attn}\!\left(\mathrm{LN}(x_l)\right),\\
x_{l+1} &= x_{l+\frac12} + \mathrm{MLP}\!\left(\mathrm{LN}(x_{l+\frac12})\right).
\end{aligned}
$$
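To make the wiring concrete, here is a minimal PyTorch sketch of one pre-LN block, anticipating the attention and MLP formulas spelled out below. The submodule names `ln_1`, `attn`, `ln_2`, `mlp` follow the class list above; the `c_attn`/`c_fc`/`c_proj` projection names are assumptions in the usual GPT-2 style, not a verbatim excerpt of `train-model.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention with a fused QKV projection and fused attention kernel."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # projects X to Q, K, V in one matmul
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection W_O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # reshape to (B, H, T, d) so attention runs per head
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # softmax(QK^T / sqrt(d) + causal mask) V, computed by the fused kernel
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # concatenate heads
        return self.c_proj(y)


class MLP(nn.Module):
    """Feed-forward network: C -> 4C -> C with GELU (tanh approximation)."""

    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c_proj(F.gelu(self.c_fc(x), approximate="tanh"))


class Block(nn.Module):
    """One pre-LN Transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln_1(x))  # x_{l+1/2} = x_l + Attn(LN(x_l))
        x = x + self.mlp(self.ln_2(x))   # x_{l+1} = x_{l+1/2} + MLP(LN(x_{l+1/2}))
        return x
```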
### Multi-head causal self-attention

Given $X=\mathrm{LN}(x)\in\mathbb{R}^{B\times T\times C}$, per head $h\in\{1,\dots,H\}$:

$$
\begin{aligned}
Q^{(h)} &= X W_Q^{(h)} + b_Q^{(h)},\\
K^{(h)} &= X W_K^{(h)} + b_K^{(h)},\\
V^{(h)} &= X W_V^{(h)} + b_V^{(h)},
\end{aligned}
$$

where $Q^{(h)},K^{(h)},V^{(h)}\in\mathbb{R}^{B\times T\times d}$.

Causal masked scores:

$$
\begin{aligned}
S^{(h)}_{t,s} &= \frac{\langle Q^{(h)}_{t},K^{(h)}_{s}\rangle}{\sqrt{d}} + M_{t,s},\\
M_{t,s} &= \begin{cases} 0, & s \le t,\\ -\infty, & s > t. \end{cases}
\end{aligned}
$$

Attention weights and output:

$$
\begin{aligned}
A^{(h)}_{t,s} &= \mathrm{softmax}_s\!\left(S^{(h)}_{t,s}\right),\\
O^{(h)}_{t} &= \sum_{s=0}^{T-1} A^{(h)}_{t,s}\, V^{(h)}_{s}.
\end{aligned}
$$

Concatenate heads and project:

$$
\begin{aligned}
O &= \mathrm{Concat}\left(O^{(1)},\dots,O^{(H)}\right)\in\mathbb{R}^{B\times T\times C},\\
\mathrm{Attn}(X) &= O W_O + b_O \in \mathbb{R}^{B\times T\times C}.
\end{aligned}
$$

### MLP (feed-forward)

Uses expansion ratio $4\times$:

$$
\begin{aligned}
\mathrm{MLP}(X) &= \mathrm{GELU}(XW_1+b_1)\,W_2 + b_2,\\
W_1 &\in \mathbb{R}^{C\times 4C},\quad W_2\in \mathbb{R}^{4C\times C}.
\end{aligned}
$$

### GELU (tanh approximation)

$$
\mathrm{GELU}(z)=\frac12 z\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(z+0.044715 z^3\right)\right)\right).
$$

### Final LayerNorm + LM head (tied weights)

Let $\hat{x}=\mathrm{LN}_f(x_L)$:

$$
\begin{aligned}
\mathrm{logits} &= \hat{x} W_U + b_U \in \mathbb{R}^{B\times T\times V},\\
W_U &= W_E^\top \quad \text{(weight tying)}.
\end{aligned}
$$

### Training objective (next-token cross entropy)

With targets $y\in\{0,\dots,V-1\}^{B\times T}$:

$$
\begin{aligned}
p(y_{b,t}\mid x) &= \mathrm{softmax}(\mathrm{logits}_{b,t,:})_{y_{b,t}},\\
\mathcal{L} &= -\frac{1}{BT}\sum_{b=1}^{B}\sum_{t=1}^{T}\log p(y_{b,t}\mid x).
\end{aligned}
$$
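The last two formulas reduce to a couple of lines of PyTorch. Below is a small illustrative sketch; the tensor names and sizes are placeholders, and the bias $b_U$ is omitted for brevity:

```python
import torch
import torch.nn.functional as F

B, T, C, V = 2, 8, 768, 50257          # illustrative sizes

wte = torch.nn.Embedding(V, C)         # token embedding W_E (tied to the LM head)
x_hat = torch.randn(B, T, C)           # stand-in for LN_f(x_L)
y = torch.randint(V, (B, T))           # next-token targets

# Weight tying: W_U = W_E^T, so logits = x_hat @ W_E^T with shape (B, T, V)
logits = x_hat @ wte.weight.t()

# Mean cross entropy over all B*T positions = -(1/BT) sum log p(y | x)
loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
print(loss.item())
```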
---

## Walkthrough: `fineweb.py`

This script builds **tokenized training shards** from FineWeb-Edu.

What it does:

* Loads the dataset:
  * `load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")`
* Tokenizes with `tiktoken.get_encoding("gpt2")`
* Prepends the `<|endoftext|>` token to each document
* Writes `.npy` shards of **`shard_size = 1e8` tokens** each
* Uses multiprocessing (`os.cpu_count()//2` workers)
* Naming:
  * the first shard is `"val"` (so you always have a validation shard)
  * later shards are `"train"`
  * output files look like `edufineweb_train_000001.npy` (numpy adds `.npy`)

What to change most often:

* `remote_name` (`sample-10BT` → bigger subsets if you want)
* `shard_size` (smaller if disk is tight / you want more shards)
* `local_dir` (just make it match `train-model.py`'s `data_root`)
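As a reference for the per-document encoding step described above (delimiter token, GPT-2 BPE, compact integer storage), here is a minimal sketch. It mirrors what `fineweb.py` does but is not a verbatim excerpt; the function name and the `uint16` dtype are illustrative assumptions.

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
eot = enc._special_tokens["<|endoftext|>"]  # document delimiter, prepended to every doc

def tokenize(doc: dict) -> np.ndarray:
    """Encode one document as <|endoftext|> followed by its GPT-2 BPE tokens."""
    tokens = [eot]
    tokens.extend(enc.encode_ordinary(doc["text"]))
    arr = np.array(tokens)
    # GPT-2's vocab (50257) fits in 16 bits, so shards can be stored compactly
    assert (0 <= arr).all() and (arr < 2**16).all(), "token ids must fit in uint16"
    return arr.astype(np.uint16)

# A shard is just a flat array of token ids saved with np.save
shard = tokenize({"text": "Hello world."})
np.save("edufineweb_example", shard)  # numpy appends the .npy extension
```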
---

## Start to End Execution

## Repo layout

* **Training / pipeline code (all in `src/`)**
  * `fineweb.py` → tokenization to `src/edu_fineweb10B/`
  * `train-model.py` → pretraining, outputs to `src/log/`
  * `convert_ckpt_to_hf.py` → `.pt` checkpoint → HF folder
  * `evals.py` → eval HF folder on HellaSwag
  * `sft.py` → LoRA SFT (outputs `src/hf_sft_lora/`)
  * `merge_lora.py` → merge LoRA into base (outputs merged HF folder)
  * `user_assistant.jinja` → your llama.cpp chat template
* **Artifacts produced** (these are not included in the git repo, but you can find them on Hugging Face)
  * `src/log/model_19072.pt` (final pretrain ckpt)
  * `src/hf_pretrained/` (HF-exported base)
  * `src/hf_sft_lora/` (LoRA adapters + checkpoints)
  * `src/hf_sft_merged/` (merged HF model)
  * `src/gpt2-Q4_K_M_2.gguf` (quantized GGUF)

---

## Two-instance architecture (what runs where)

I used **two Lambda GPU instances** mounted to the **same shared filesystem**:

* **Instance A (8×A100)**: **Pretraining**
* **Instance B (1×A100)**: **Tokenization + Eval + SFT + Merge + GGUF conversion + Quantization + Upload**

> I attach the filesystem in Lambda's UI. After attaching, verify both machines see the same path (example below).

### Verify shared mount (both instances)

```bash
df -h
ls -lah /home/ubuntu/GPT
```

Clone the repo:

```bash
export REPO=/home/ubuntu/GPT
git clone https://github.com/ShrithikShahapure/GPT.git
cd $REPO
```

---

## 1) Environment setup (run on BOTH instances)

### System deps

```bash
sudo apt-get update
sudo apt-get install -y git git-lfs python3-venv python3-pip build-essential cmake pkg-config
git lfs install
```

### Python venv + packages

```bash
cd $REPO
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install numpy tqdm requests tiktoken datasets transformers accelerate safetensors peft trl huggingface_hub
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

### (Optional but recommended) Put HF caches on the shared filesystem

This avoids re-downloading datasets/models separately on each instance:

```bash
export HF_HOME=$REPO/.hf
export HF_DATASETS_CACHE=$HF_HOME/datasets
export TRANSFORMERS_CACHE=$HF_HOME/transformers
mkdir -p "$HF_HOME"
```

---

## 2) Build llama.cpp (do on the machine you'll convert/quantize/test on)

```bash
cd $REPO
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
cmake --build llama.cpp/build -j
```

`llama.cpp` expects models in **GGUF** and provides conversion scripts. ([GitHub][1])

---

## 3) Tokenization (run on **Instance B: 1×A100**)

### 3.1 Run FineWeb tokenization

This writes shards under `src/edu_fineweb10B/`:

```bash
cd $REPO
source .venv/bin/activate
python src/fineweb.py
```

### 3.2 Ensure `hellaswag.py` exists (only if missing)

`train-model.py` imports `hellaswag`. If `src/hellaswag.py` doesn't exist, do:

```bash
cd $REPO/src
cp -f evals.py hellaswag.py
```

---

## 4) Pretraining (run on **Instance A: 8×A100**)

From your repo root:

```bash
cd $REPO
source .venv/bin/activate
cd src
torchrun --standalone --nproc_per_node=8 train-model.py
```

### Watch training logs

```bash
tail -n 200 -F $REPO/src/log/log.txt
```

### Checkpoints

Your checkpoints land in:

```bash
ls -lah $REPO/src/log/model_*.pt
```

Example you currently have:

* `model_05000.pt`
* `model_10000.pt`
* `model_15000.pt`
* `model_19072.pt`

---

## 5) Convert checkpoint → Hugging Face model (run on **Instance B**)

Convert your final checkpoint into HF format (you already have `hf_pretrained`, but here's the exact command):

```bash
cd $REPO
source .venv/bin/activate
python3 src/convert_ckpt_to_hf.py \
  --ckpt src/log/model_19072.pt \
  --out src/hf_pretrained
```

---

## 6) Evaluate HF model (run on **Instance B**)

```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_pretrained -d cuda
```

---

## 7) SFT (LoRA) (run on **Instance A**)

Your repo has `src/sft.py`. Run it:

```bash
cd $REPO
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 python3 src/sft.py
```

Expected output:

* `src/hf_sft_lora/` (adapters + checkpoints)

---

## 8) Merge LoRA → merged HF model (run on **Instance B**)

Run:

```bash
cd $REPO
source .venv/bin/activate
python3 src/merge_lora.py
```

Evaluate merged model:

```bash
cd $REPO
source .venv/bin/activate
python3 src/evals.py -m src/hf_sft_merged -d cuda
```

---

## 9) Convert merged HF → GGUF + Quantize (run on **Instance B**)

### 9.1 HF → fp16 GGUF

```bash
cd $REPO
source .venv/bin/activate
python llama.cpp/convert_hf_to_gguf.py \
  src/hf_sft_merged \
  --outfile src/gpt2-f16.gguf
```

### 9.2 Quantize to Q4_K_M

```bash
cd $REPO
./llama.cpp/build/bin/llama-quantize \
  src/gpt2-f16.gguf \
  src/gpt2-Q4_K_M_2.gguf \
  Q4_K_M
```

---

## 10) Upload the GGUF to Hugging Face (run on **Instance B**)

### Login

```bash
pip install -U "huggingface_hub[cli]"
hf auth login
```

### Upload the GGUF

Set your repo:

```bash
export HF_REPO="your-username/your-gguf-repo"
```

Upload:

```bash
hf upload "$HF_REPO" src/gpt2-Q4_K_M_2.gguf gpt2-Q4_K_M_2.gguf --repo-type model
```

(Optionally upload your template too)

```bash
hf upload "$HF_REPO" src/user_assistant.jinja user_assistant.jinja --repo-type model
```

---

## 11) Inference with llama.cpp (your exact "ONE paragraph + " flow)

From repo root (note: your GGUF is `src/gpt2-Q4_K_M_2.gguf`):

```bash
cd $REPO
./llama.cpp/build/bin/llama-cli \
  -m src/gpt2-Q4_K_M_2.gguf \
  --jinja --chat-template-file src/user_assistant.jinja \
  -cnv -st \
  -sys "Answer in exactly ONE paragraph (no blank lines). End your answer with: " \
  -r "" \
  -n 200 --temp 0.2 --top-p 0.95 \
  --repeat-penalty 1.15 --repeat-last-n 128
```

---

## 12) Pull the GGUF from Hugging Face (Mac/Linux) and use it

### Download via HF CLI

```bash
pip install -U "huggingface_hub[cli]"
hf download "$HF_REPO" gpt2-Q4_K_M_2.gguf --local-dir .
```

### Or let llama.cpp pull from HF (caching supported)

Hugging Face documents running GGUFs with llama.cpp by pointing to the HF repo/file, and llama.cpp caches downloads (cache path controlled by `LLAMA_CACHE`). ([Hugging Face][2])
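If you'd rather pull the file from Python instead of the CLI, here's a small sketch using `huggingface_hub`; the repo id below is a placeholder for the same value you exported as `$HF_REPO`:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id; use the same value you exported as $HF_REPO above.
path = hf_hub_download(
    repo_id="your-username/your-gguf-repo",
    filename="gpt2-Q4_K_M_2.gguf",
    local_dir=".",
)
print(path)  # local path to the downloaded GGUF
```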
---

## 13) PocketPal (iPhone) — load GGUF from Hugging Face and run offline

PocketPal supports downloading **GGUF model weights** (including from Hugging Face) and chatting offline. ([App Store][3]) PocketPal's project also notes **Hugging Face Hub integration** (browse/download models inside the app). ([GitHub][4])

### Recommended flow

1. Upload `gpt2-Q4_K_M_2.gguf` to Hugging Face (Section 10).
2. On iPhone:
   * Open PocketPal
   * Go to Models / Download / Hugging Face (wording varies by version)
   * Search your repo (`your-username/your-gguf-repo`)
   * Download `gpt2-Q4_K_M_2.gguf`
3. Start a chat offline.

> If PocketPal asks for a chat template, choose the one that matches your "User / Assistant" formatting, or keep prompts simple (your `user_assistant.jinja` template is what you used in llama.cpp).

---