
# GPT-2 Small (124M) — Built From Scratch

A full GPT-2 Small (124M parameter) language model built entirely from scratch in PyTorch: no HuggingFace model classes, no pretrained weights, no shortcuts. Trained on 25B tokens of FineWeb-Edu (20B pretraining plus 5B low-LR refinement) using 2× A100 SXM4 80GB GPUs with DDP.

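**This model beats OpenAI's original GPT-2 Small on 3 of the 4 comparisons below: WikiText-103 perplexity, HellaSwag accuracy, and a qualitative text-coherence check. OpenAI's model still leads on LAMBADA.**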

---

## Benchmark Results

| Benchmark | This Model | OpenAI GPT-2 | Winner |
|---|---|---|---|
| **WikiText-103 Perplexity** ↓ | **45.93** | 48.84 | ✅ Ours |
| **HellaSwag Accuracy** ↑ | **38.5%** | 31.0% | ✅ Ours (+7.5 pts) |
| **LAMBADA Accuracy** ↑ | 61.5% | 65.5% | ❌ GPT-2 (+4.0 pts) |
| **Text Coherence** | Low repetition | Higher repetition | ✅ Ours |

> Evaluated head-to-head against HuggingFace's `gpt2` (OpenAI's official weights) using identical prompts, datasets, and evaluation code.
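
For reference, perplexity is the exponential of the mean per-token negative log-likelihood (in nats). The sketch below uses made-up per-token losses, and `perplexity` is a hypothetical helper name; the actual `evaluate.py` may differ, for example by scoring WikiText-103 with a sliding window.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Illustrative values only (not measured from this model).
example_nlls = [3.9, 3.7, 4.0, 3.8]
print(f"perplexity = {perplexity(example_nlls):.2f}")
```

Under this definition, a perplexity of 45.93 corresponds to a mean loss of about 3.83 nats per token.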

---

## Architecture

Standard GPT-2 decoder-only transformer with:

- **12 layers**, **12 attention heads**, **768 embedding dim**
- Fused QKV projection (single matmul instead of three)
- Flash Attention via `scaled_dot_product_attention` with `is_causal=True`
- GELU activation (tanh approximation)
- Pre-norm architecture (LayerNorm before attention/FFN)
- Weight tying (output projection shares weights with token embedding)
- GPT-2 style initialization: weights drawn from a normal distribution with mean 0 and std 0.02, with scaled residual paths using std `0.02 / √(2n)`, where `n` is the number of layers
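
As a rough illustration of the scaled residual initialization, the sketch below computes the reduced std for 12 layers. It assumes, as in GPT-2, that each block adds two contributions to the residual stream (attention and FFN) and that the scaling applies to their output projections; the constant names are hypothetical.

```python
import math

N_LAYERS = 12      # from the architecture list above
BASE_STD = 0.02    # std of the default init

# Each block adds two residual contributions (attention and FFN), so the
# residual stream is a sum of 2 * N_LAYERS terms.
n_residual_adds = 2 * N_LAYERS
resid_std = BASE_STD / math.sqrt(n_residual_adds)

# If the contributions were independent with this std, the std of their
# sum would equal BASE_STD regardless of depth.
summed_std = math.sqrt(n_residual_adds * resid_std**2)

print(f"residual projection std: {resid_std:.6f}")
print(f"std of the summed contributions: {summed_std:.4f}")
```

The independence assumption is a simplification, but it shows why the factor is `√(2n)` rather than `√n`.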

```
Model params : 124.4M
Decay params : 124.3M
No-decay params : 0.1M
Context length : 1024 tokens
Vocab size : 50,257 (GPT-2 BPE)
```
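
The figures above can be reproduced with a short calculation. This is a sketch that assumes biased linear layers, learned positional embeddings, a 4× FFN expansion, and a tied output projection (so the output layer adds no parameters); `gpt2_param_counts` is a hypothetical helper, not part of the repository.

```python
def gpt2_param_counts(n_layer=12, d_model=768, vocab=50257, ctx=1024):
    """Return (decay, no_decay) parameter counts under the stated assumptions."""
    d_ff = 4 * d_model
    # Matrices: token and position embeddings plus all linear weights.
    embeddings = vocab * d_model + ctx * d_model
    block_matrices = (
        d_model * 3 * d_model   # fused QKV projection
        + d_model * d_model     # attention output projection
        + d_model * d_ff        # FFN up-projection
        + d_ff * d_model        # FFN down-projection
    )
    decay = embeddings + n_layer * block_matrices
    # Vectors: biases and LayerNorm scale/shift, typically not weight-decayed.
    block_vectors = (
        3 * d_model + d_model   # attention biases (QKV + output)
        + d_ff + d_model        # FFN biases
        + 2 * 2 * d_model       # two LayerNorms (weight + bias each)
    )
    no_decay = n_layer * block_vectors + 2 * d_model  # plus the final LayerNorm
    return decay, no_decay

decay, no_decay = gpt2_param_counts()
print(f"total    : {(decay + no_decay) / 1e6:.1f}M")
print(f"decay    : {decay / 1e6:.1f}M")
print(f"no-decay : {no_decay / 1e6:.1f}M")
```

Splitting matrices from vectors mirrors the common practice of applying weight decay only to tensors with two or more dimensions.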

---

## Training

### Phase 1: Pretraining (20B tokens, ~17 hours)

| Setting | Value |
|---|---|
| Hardware | 2× NVIDIA A100 SXM4 80GB |
| Data | [FineWeb-Edu](https://huggingface.co/datasets/karpathy/fineweb-edu-100B-gpt2-token-shards) (pretokenized `.bin` shards) |
| Batch size | 32/GPU × 2 GPUs × 8 accum = 512 sequences (524K tokens/step) |
| Optimizer | AdamW (fused), β=(0.9, 0.95), weight decay 0.1 |
| LR schedule | Linear warmup (715 steps) → Cosine decay (6e-4 → 6e-5) |
| Precision | BF16 mixed precision with `torch.autocast` |
| Compilation | `torch.compile()` for kernel fusion |
| Throughput | ~316K tokens/sec sustained |
| Final val loss | 3.0821 |
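
The learning-rate schedule in the table can be sketched as a plain function of the step index. The warmup length and LR bounds come from the table; `MAX_STEPS = 38_500` is an assumption taken from the "End pretraining" row of the Training Curve section, and `get_lr` is a hypothetical name rather than the function used in `train.py`.

```python
import math

MAX_LR = 6e-4
MIN_LR = 6e-5
WARMUP_STEPS = 715
MAX_STEPS = 38_500  # assumption: last pretraining step

def get_lr(step):
    """Linear warmup followed by cosine decay from MAX_LR to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:
        return MIN_LR
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

for step in (0, 714, 715, 20_000, 38_500):
    print(f"step {step:>6}: lr = {get_lr(step):.2e}")
```

The returned value would typically be written into each optimizer parameter group before every step.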

### Phase 2: Refinement (5B tokens, ~5 hours)

Low learning rate cooldown phase for final polish:

| Setting | Value |
|---|---|
| Resume from | Best pretrain checkpoint |
| LR schedule | Cosine decay (5e-5 → 5e-6) |
| Warmup | 50 steps |
| Improvement | 3.0821 → **2.9597** (-0.1224) |

### Phase 3: Long-Context Refinement (Optional)

To improve long-range dependencies and story coherence (specifically targeting our LAMBADA score of 61.5%), we run a secondary refinement phase purely on classical literature (Project Gutenberg via `sedthh/gutenberg_english`).
* Extremely low learning rate (`5e-6` → `1e-6`) limits forgetting of what the model learned during pretraining (the "alignment tax")
* Operates on dense, context-heavy sequences

### Phase 4: Supervised Fine-Tuning - SFT (Chatbot)

We convert the raw autocomplete model into an instruction-following assistant by fine-tuning on the **Alpaca dataset** (52K instruction-response examples).
* **Loss Masking:** Prompt tokens are given the label `-100`, which cross-entropy ignores, so the model learns only to *generate responses*, not to mimic prompts.
* **Evaluation:** An automated suite (`finetuning/eval_sft.py`) validates constraint-following (JSON outputs, counting) and uses **LLM-as-a-Judge** (`gpt-4o-mini`) to grade response coherence.
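
A minimal sketch of the loss-masking idea, using plain lists in place of tensors. The token IDs are made up, `build_sft_example` is a hypothetical helper, and ending the response with token `50256` (GPT-2's `<|endoftext|>`) is an assumption about the data format rather than something confirmed by `finetuning/sft.py`.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_sft_example(prompt_ids, response_ids):
    """Return (inputs, labels) for next-token prediction with prompt masking."""
    tokens = list(prompt_ids) + list(response_ids)
    # Targets start as a copy of the tokens, with prompt positions masked out.
    targets = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    # Shift by one: position i predicts token i + 1.
    inputs = tokens[:-1]
    labels = targets[1:]
    return inputs, labels

# Hypothetical token IDs, for illustration only.
prompt = [11, 12, 13]
response = [21, 22, 50256]
inputs, labels = build_sft_example(prompt, response)
print(inputs)  # [11, 12, 13, 21, 22]
print(labels)  # [-100, -100, 21, 22, 50256]
```

Note that the last prompt position keeps a real label, because it is the position that predicts the first response token.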

### Training Curve

```
Step 0 │ loss 10.97 │ Random initialization
Step 5,000 │ loss 3.39 │ Learning language structure
Step 15,000 │ loss 3.20 │ Passed GPT-2 val loss (3.11)
Step 30,000 │ loss 3.12 │ Diminishing returns at high LR
Step 38,500 │ loss 3.08 │ End pretraining
Refine │ loss 2.96 │ Cooldown phase (5e-5 → 5e-6)
```
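
As a consistency check, the step count in the curve can be related to the stated token budget and wall-clock time. All inputs below are taken from the pretraining table and the curve; the variable names are hypothetical.

```python
SEQS_PER_GPU = 32
N_GPUS = 2
GRAD_ACCUM = 8
CONTEXT = 1024
PRETRAIN_STEPS = 38_500   # "End pretraining" step in the curve
THROUGHPUT = 316_000      # tokens/sec, from the pretraining table

seqs_per_step = SEQS_PER_GPU * N_GPUS * GRAD_ACCUM
tokens_per_step = seqs_per_step * CONTEXT
total_tokens = tokens_per_step * PRETRAIN_STEPS
hours = total_tokens / THROUGHPUT / 3600

print(f"{seqs_per_step} sequences/step, {tokens_per_step:,} tokens/step")
print(f"{total_tokens / 1e9:.1f}B tokens over {PRETRAIN_STEPS:,} steps")
print(f"~{hours:.1f} hours at {THROUGHPUT:,} tokens/sec")
```

This gives roughly 20B tokens and between 17 and 18 hours, in line with the Phase 1 figures.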

---

## Project Structure

```
├── Architecture.py # Model definition (MHA, FFN, TransformerBlock, LanguageModel)
├── Generator.py # Text generation with temperature, top-k, top-p sampling
├── GeneratorHF.py # HuggingFace-compatible generator wrapper (for generation config)
├── train.py # DDP pretraining script (torchrun)
├── refine.py # Low-LR refinement/cooldown phase
├── prepare_longctx.py # Download and stream Project Gutenberg data for long context
├── refine_longctx.py # Refinement script for long-context book corpus
├── dataloader.py # DDP-aware streaming shard loader (.bin format)
├── preparedata.py # Download pretokenized shards from HuggingFace
├── evaluate.py # Base Benchmark suite (WikiText-103, HellaSwag, LAMBADA)
├── chat.py # Interactive terminal generation
│
├── finetuning/
│ ├── sft.py # Supervised Fine-Tuning on Alpaca dataset
│ ├── chat_sft.py # Clean chatbot terminal interface for the SFT model
│ └── eval_sft.py # Automated constraints and LLM-as-a-judge benchmarking
│
├── Trainer.py # Legacy trainer (from notebook prototype)
├── Main.ipynb # Original prototype notebook
└── Tokenization.ipynb # BPE tokenization experiments
```

---

## Quick Start

### Generate Text

```bash
# Interactive generation
python chat.py --checkpoint best-gpt2small.pth

# Adjust sampling
python chat.py --checkpoint best-gpt2small.pth --temperature 0.9 --top-k 50
```
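
For readers unfamiliar with the sampling options, here is a dependency-free sketch of temperature, top-k, and top-p (nucleus) sampling over a toy vocabulary. It is not the code in `Generator.py`; `sample_next_token` and its parameters are hypothetical, and a real implementation would operate on tensors.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token index from raw logits (temperature must be > 0)."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    # Sort (probability, index) pairs from most to least likely.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if top_p is not None:
        kept, mass = [], 0.0
        for p, i in probs:
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        probs = kept
    # random.choices renormalizes the surviving weights before drawing.
    weights = [p for p, _ in probs]
    indices = [i for _, i in probs]
    return rng.choices(indices, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy 4-token vocabulary
print(sample_next_token(logits, temperature=0.9, top_k=2))
```

With `top_k=1` the function reduces to greedy decoding, which is a quick way to sanity-check it.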

### Run Benchmarks

```bash
pip install transformers datasets tiktoken

# Full evaluation (side-by-side with OpenAI's GPT-2)
python evaluate.py --checkpoint best-gpt2small.pth

# Individual tests
python evaluate.py --checkpoint best-gpt2small.pth --test hellaswag
python evaluate.py --checkpoint best-gpt2small.pth --test perplexity
python evaluate.py --checkpoint best-gpt2small.pth --test lambada
```
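
HellaSwag is multiple-choice, so a base language model needs a rule for picking an ending. A common approach, sketched below, is to choose the candidate with the lowest mean per-token loss given the shared context. Whether `evaluate.py` uses exactly this rule is an assumption, and the losses here are made up.

```python
def pick_ending(per_token_losses):
    """Return the index of the candidate ending with the lowest mean loss.

    per_token_losses[i] holds the cross-entropy of each token in ending i,
    conditioned on the shared context (context tokens are excluded).
    """
    mean_losses = [sum(losses) / len(losses) for losses in per_token_losses]
    return min(range(len(mean_losses)), key=mean_losses.__getitem__)

# Made-up losses for four candidate endings of one example.
candidates = [
    [3.1, 2.8, 3.5],
    [2.2, 2.0, 2.4, 2.1],
    [4.0, 3.9],
    [2.9, 3.3, 3.0],
]
prediction = pick_ending(candidates)
label = 1  # hypothetical gold ending
print(f"predicted ending {prediction}, correct: {prediction == label}")
```

Averaging over tokens, rather than summing, keeps longer endings from being penalized simply for their length.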

### Train From Scratch

```bash
# 1. Download pretokenized data (~80 GB for full 40B tokens)
python preparedata.py

# 2. Pretrain with DDP (2+ GPUs recommended)
torchrun --nproc_per_node=2 train.py

# 3. Refinement phase (resume from best checkpoint)
torchrun --nproc_per_node=2 refine.py
```

### SFT Fine-Tuning (Chatbot)

```bash
# 1. Run the fine-tuning pipeline
python finetuning/sft.py

# 2. Chat with your custom assistant!
python finetuning/chat_sft.py --checkpoint finetuning/sft_output/best_sft.pth

# 3. Automated Check
# Deterministic checks only:
python finetuning/eval_sft.py
# Deterministic checks + GPT-4o-mini coherence scoring:
python finetuning/eval_sft.py --openai-key sk-xxxxxx
```

---

## Sample Outputs

**Prompt:** *"In a shocking finding, scientists discovered"*

> In a shocking finding, scientists discovered how the brain's decision-making process is affected by the amount of a certain protein molecule (a protein called MMP1). The findings were published in the journal Brain. "Brain plasticity is necessary for learning to take place," said study co-author Pau...

**Prompt:** *"The theory of relativity states that"*

> The theory of relativity states that the apparent movement of objects relative to one another is governed by a speed limit defined by a mathematical formula which we have already seen in reference to the speed of light. The speed of light is proportional to the change in speed of objects relative to one another. This formula is known as the Einstein formula.

**Prompt:** *"Climate change is one of the most"*

> Climate change is one of the most challenging issues facing the world, and it is one that we cannot ignore. And with the increasing frequency of the hottest days, it is a threat that we can all equally agree on. The world is warming in the most extreme ways, and in the most important ways, it is mak...

---

## Key Design Decisions

1. **No dropout during pretraining**: Following GPT-3 practice. Dropout tends to hurt when training data is plentiful relative to model size.
2. **Flash Attention**: PyTorch's `scaled_dot_product_attention` with `is_causal=True` can dispatch to a fused FlashAttention kernel that never materializes the full T×T attention matrix, so attention memory grows linearly rather than quadratically with sequence length.
3. **Fused QKV**: A single linear projection is split into Q, K, and V, replacing three matmuls with one and reducing memory traffic.
4. **Weight tying**: The output projection shares the token-embedding matrix (50,257 × 768), saving ~38.6M parameters.
5. **Scaled residual init**: `0.02 / √(2 × n_layers)` keeps the variance of the residual stream from growing with depth.
6. **Fresh optimizer for refinement**: Old momentum from high-LR pretraining would fight the low-LR cooldown.
7. **FineWeb-Edu over WebText**: Curated educational data likely contributes to better commonsense reasoning (HellaSwag +7.5 points).
8. **Loss Masking in SFT**: Prompt-token labels are set to `-100` during fine-tuning so cross-entropy is computed only on the assistant's responses.
---

## References

- Radford et al., *"Language Models are Unsupervised Multitask Learners"* (GPT-2 paper)
- Raschka, *"Build a Large Language Model From Scratch"*
- Karpathy, [llm.c](https://github.com/karpathy/llm.c) & [nanoGPT](https://github.com/karpathy/nanoGPT)
- [FineWeb-Edu Dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

---

## Hardware

Trained on [Vast.ai](https://cloud.vast.ai/?ref_id=112020) using 2× NVIDIA A100 SXM4 80GB GPUs. Total training time: ~22 hours (17h pretrain + 5h refinement).


@@ -1,3 +1,12 @@
----
-license: mit
----

+---
+license: mit
+datasets:
+- karpathy/fineweb-edu-100b-shuffle
+language:
+- en
+metrics:
+- perplexity
+pipeline_tag: text-generation
+tags:
+- text-generation-inference
+---