import gradio as gr
# ============================================================
# CONTENT: each chapter stored as a Python string (Markdown)
# ============================================================
OVERVIEW = r"""
# The Complete Guide to Post-Training of Large Language Models
### From Pretraining to Alignment: Everything You Need to Know
---
> **Who is this for?** You've learned how pretraining works: you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens *after* pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini.
---
## Roadmap
| # | Chapter | What You'll Learn |
|---|---------|-------------------|
| 1 | **The Big Picture** | Why pretrained models aren't useful yet; the 3-stage pipeline |
| 2 | **SFT** | Supervised Fine-Tuning: loss function, data formats, key papers |
| 3 | **RLHF** | Reward models, PPO, KL divergence, reward hacking |
| 4 | **DPO** | Direct Preference Optimization: RLHF without RL |
| 5 | **Preference Zoo** | KTO, ORPO, SimPO, CPO, IPO, Online DPO |
| 6 | **GRPO & Reasoning** | DeepSeek-R1, reward functions, the reasoning revolution |
| 7 | **PEFT** | LoRA, QLoRA: fine-tune on consumer GPUs |
| 8 | **Toolbox** | TRL, Transformers, vLLM, Accelerate, DeepSpeed |
| 9 | **Datasets** | What to train on: curated lists with Hub links |
| 10 | **Evaluation** | Benchmarks, LLM-as-Judge, human eval |
| 11 | **Full Recipe** | End-to-end pipeline with code |
| 12 | **Reading List** | 18 must-read papers in 4 tiers |
---
### The Three Stages of Post-Training
```
┌──────────────┐     ┌───────────────────┐     ┌──────────────────────────┐
│ STAGE 1: SFT │ ──> │ STAGE 2: Reward   │ ──> │ STAGE 3: RL              │
│              │     │ Model Training    │     │ (PPO / DPO / GRPO)       │
│ Teach format │     │ Learn preferences │     │ Optimize for preferences │
│ & behavior   │     │ from comparisons  │     │ while staying close to   │
│              │     │                   │     │ the SFT model            │
└──────────────┘     └───────────────────┘     └──────────────────────────┘
Input: Pretrained LM                              Output: Aligned Assistant
```
### The Evolution Timeline
| Year | Method | Key Idea |
|------|--------|----------|
| 2017 | RLHF (original) | Human preferences → reward model → RL |
| 2020 | RLHF for LLMs | Applied to text summarization |
| 2022 | **InstructGPT** | Full SFT → RM → PPO pipeline |
| 2022 | Constitutional AI | AI feedback replaces human feedback |
| 2023 | **DPO** | No reward model needed; direct optimization |
| 2024 | KTO / ORPO | Binary feedback / combined SFT+preference |
| 2024 | **GRPO** | Group-based RL for reasoning (DeepSeek) |
| 2025 | DeepSeek-R1 | RL teaches chain-of-thought from scratch |
"""
CH1_BIG_PICTURE = r"""
# Chapter 1: The Big Picture - Why Post-Training Exists
## 1.1 The Gap Between Pretraining and Usefulness
You've pretrained a language model. It can predict the next token with impressive accuracy. It has absorbed vast knowledge from the internet. But try asking it a question:
```
User: What is the capital of France?
Model: What is the capital of Germany? What is the capital of Italy? What is the...
```
The model doesn't *answer*; it *continues*. The pretraining objective (`P(next_token | context)`) optimizes for predicting what comes next in web text, not for being helpful.
> *"Large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users."*
> -- **InstructGPT** (Ouyang et al., 2022)
## 1.2 What Post-Training Does
Post-training is everything after pretraining that makes a model useful, safe, and aligned. The classic pipeline has three stages:
**Stage 1: SFT (Supervised Fine-Tuning).** Teach the model the *format* of helpful responses using demonstrations.
**Stage 2: Reward Modeling.** Train a model to predict which response a human would prefer.
**Stage 3: RL Optimization.** Optimize the SFT model to generate responses the reward model scores highly.
## 1.3 The Superficial Alignment Hypothesis
> *"A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users."*
> -- **LIMA** (Zhou et al., 2023)
This is the key claim: under this hypothesis, **post-training mostly doesn't teach new knowledge**; it teaches the model to *surface existing knowledge in the right way*. The pretrained model already "knows" the capital of France; SFT teaches it to answer the question rather than generate more questions.
## 1.4 Why This Matters
The difference is dramatic:
- **InstructGPT 1.3B** (post-trained) was preferred over **GPT-3 175B** (pretrained only) by human evaluators
- That's a model more than 100× smaller winning because of post-training
- Post-training is what turns a text predictor into an assistant
"""
CH2_SFT = r"""
# Chapter 2: Supervised Fine-Tuning (SFT)
## 2.1 What SFT Does
SFT is the bridge between a pretrained language model and a useful assistant.
**Before SFT:**
```
Input: "Explain quantum computing in simple terms."
Output: "Explain quantum computing to a 5-year-old. Explain quantum computing..."
```
**After SFT:**
```
Input: "Explain quantum computing in simple terms."
Output: "Quantum computing uses the principles of quantum mechanics to process
information. Unlike classical computers that use bits (0 or 1),
quantum computers use qubits that can be both 0 and 1 simultaneously..."
```
## 2.2 The SFT Loss Function
If you understand pretraining, you understand SFT, with one crucial difference:
**Pretraining loss** (on ALL tokens):
```
L_pretrain = -Σ log P(token_i | token_1, ..., token_{i-1})   for ALL tokens
```
**SFT loss** (on RESPONSE tokens only):
```
L_SFT = -Σ log P(c_i | prompt, c_1, ..., c_{i-1})   for COMPLETION tokens only
```
The prompt tokens are masked from the loss. We don't want the model to learn to generate instructions; we want it to learn to *respond* to them.
```
Sequence:  [User: What is 2+2?] [Assistant: 4]
Loss mask: [---- IGNORED -----] [  COMPUTED  ]
```
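
The masking above can be sketched in a few lines of plain Python. This is only an illustration, not how any particular trainer implements it: the function name `sft_loss`, the made-up log-probabilities, and the choice to average (rather than sum) over the kept tokens are all assumptions for the example.

```python
import math

def sft_loss(token_logprobs, loss_mask):
    # Mean negative log-likelihood over positions where loss_mask is 1.
    # token_logprobs[i] stands for log P(token_i | previous tokens).
    kept = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    if not kept:
        raise ValueError("loss_mask selects no tokens")
    return -sum(kept) / len(kept)

# Hypothetical per-token log-probs: 3 prompt tokens, then 2 completion tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5),
            math.log(0.8), math.log(0.9)]
pretrain_mask = [1, 1, 1, 1, 1]  # pretraining: every token counts
sft_mask = [0, 0, 0, 1, 1]       # SFT: prompt tokens are ignored

pretrain_loss = sft_loss(logprobs, pretrain_mask)
completion_loss = sft_loss(logprobs, sft_mask)
```

With the SFT mask, the poorly predicted prompt tokens no longer contribute, so only the two completion tokens drive the gradient.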
## 2.3 Chat Templates & Data Format
Modern SFT uses structured conversations:
```python
{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
    ]
}
```
Each model family converts this to its own template:
```
# ChatML (Qwen, etc.):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
# Llama-3:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>
```
The `transformers` library handles this automatically:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
# For training: render the full conversation, including the assistant turn
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# For inference: render only the user turn and append the assistant header
text = tokenizer.apply_chat_template(messages[:1], tokenize=False, add_generation_prompt=True)
```
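
To make the role of `add_generation_prompt` concrete, here is a hand-written sketch of a ChatML-style renderer. It is an assumption-laden toy: `render_chatml` is a hypothetical helper, and real templates ship with each tokenizer and may differ in whitespace and special tokens.

```python
def render_chatml(messages, add_generation_prompt=False):
    # Render a list of {"role", "content"} dicts in ChatML style.
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
training_text = render_chatml(messages)
inference_text = render_chatml(messages[:1], add_generation_prompt=True)
```

The training string ends with the closed assistant turn, while the inference string ends with an open assistant header for the model to complete.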
## 2.4 Key SFT Papers
### FLAN (2021): Instruction Tuning Works
*"Finetuned Language Models Are Zero-Shot Learners"*, [arXiv:2109.01652](https://arxiv.org/abs/2109.01652)
- 62 NLP datasets formatted as instructions → fine-tune LaMDA-PT 137B
- **Result:** Surpassed zero-shot GPT-3 175B on 20 of 25 evaluated datasets
- **Key insight:** Instructions matter; the same tasks without instructions give much weaker results
- **Recipe:** Adafactor, lr=3e-5, 30K steps, batch 8192 tokens
### Self-Instruct (2022): Synthetic Data
*"Self-Instruct"*, [arXiv:2212.10560](https://arxiv.org/abs/2212.10560)
- 175 seed tasks → GPT-3 generates 52K instructions + responses
- **Result:** 33% absolute improvement over vanilla GPT-3 on Super-NaturalInstructions
- **Led to:** Stanford Alpaca (LLaMA + 52K GPT-generated instructions for <$600)
### InstructGPT (2022): SFT as Stage 1
*"Training Language Models to Follow Instructions with Human Feedback"*, [arXiv:2203.02155](https://arxiv.org/abs/2203.02155)
- ~13K human demonstrations, 16 epochs, cosine LR decay, residual dropout 0.2
- **Key finding:** SFT overfits on validation loss after 1 epoch, but training for more epochs still improves RM score and human preference ratings
- **Result:** 1.3B InstructGPT preferred over 175B GPT-3
### LIMA (2023): Less Is More
*"LIMA: Less Is More for Alignment"*, [arXiv:2305.11206](https://arxiv.org/abs/2305.11206)
- Just **1,000 curated examples** → competitive with GPT-3.5 (DaVinci003)
- **Recipe:** AdamW, lr 1e-5 → 1e-6 (linear decay), 15 epochs, batch 32, max len 2048
- **Takeaway:** Data quality >> data quantity
## 2.5 SFT Code Example
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
config = SFTConfig(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    max_seq_length=2048,  # renamed to `max_length` in recent TRL releases
    gradient_checkpointing=True,
    bf16=True,
    logging_steps=10,
    push_to_hub=True,  # requires being logged in to the Hugging Face Hub
    hub_model_id="your-username/your-sft-model",
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    train_dataset=dataset,
)
trainer.train()
```
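SFTTrainer automatically detects the `messages` column and applies the chat template. Whether prompt tokens are masked from the loss depends on the TRL version and configuration: for conversational data, recent releases expose an `assistant_only_loss` option in `SFTConfig`, so check the documentation for the version you have installed.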
"""
CH3_RLHF = r"""
# Chapter 3: Reinforcement Learning from Human Feedback (RLHF)
## 3.1 Why SFT Isn't Enough
SFT teaches format and basic behavior, but:
- It only learns from demonstrations (can't be better than its training data)
- It can't express preferences (treats all tokens equally)
- It can learn bad habits from imperfect training data
RLHF addresses this by training on **which outputs are better**, not what specific tokens to generate.
## 3.2 Step 1: Train a Reward Model
A **reward model (RM)** scores how good a response is (prompt + response → scalar score).
**Training process:**
1. Generate multiple responses per prompt using the SFT model
2. Humans rank responses (e.g., A > B)
3. Train RM to predict these rankings
**The Bradley-Terry preference model:**
```
P(A preferred over B) = σ(r(A) - r(B))
Loss: L_RM = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]
```
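
As a rough sketch of the Bradley-Terry loss above, assuming the reward model has already produced scalar scores for each pair (the function names and example scores here are made up for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_probability(reward_a, reward_b):
    # Bradley-Terry: P(A preferred over B) = sigmoid(r(A) - r(B))
    return sigmoid(reward_a - reward_b)

def reward_model_loss(pairs):
    # pairs: list of (reward_chosen, reward_rejected) scalar scores.
    # Returns the mean of -log sigmoid(r_chosen - r_rejected).
    losses = [-math.log(sigmoid(rc - rr)) for rc, rr in pairs]
    return sum(losses) / len(losses)

# Hypothetical scores: the first pair is ranked correctly, the second is not.
example_loss = reward_model_loss([(2.0, 0.0), (0.0, 2.0)])
```

The loss only depends on the difference between the two scores, which is why the absolute scale of a reward model's output is arbitrary.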
**Architecture:** Same as the LM, but the output head is replaced by a single scalar. InstructGPT used a 6B RM (not 175B; training the larger RM was unstable).
## 3.3 Step 2: Optimize with PPO
**The RLHF objective:**
```
maximize  E[RM(prompt, response)]  -  β · KL(π_θ || π_ref)
          ↑ score high                ↑ don't deviate too far
            on reward model             from original SFT model
```
The **KL penalty** limits **reward hacking**: without it, the model can drift into gibberish that tricks the RM.
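
One common way to apply the penalty is to fold it into the reward for each sampled response. The sketch below assumes a single-sample KL estimate (the summed difference of token log-probabilities under the policy and the reference); `kl_penalized_reward` and its inputs are hypothetical names, and the default β of 0.02 is the InstructGPT value.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    # Single-sample KL estimate for a response sampled from the policy:
    # sum over tokens of log pi_theta(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical response whose tokens the policy likes more than the reference does.
shaped = kl_penalized_reward(1.0, [-1.0, -1.0], [-2.0, -2.0])
```

A response that the policy rates far more likely than the reference does pays a larger penalty, so it needs a correspondingly higher RM score to be worth reinforcing.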
**PPO (Proximal Policy Optimization) loop:**
1. **Generate** responses from the current model
2. **Score** them with the reward model
3. **Compute advantage** (how much better than expected)
4. **Update** weights to favor high-advantage responses
5. **Clip** updates for stability
```
L_PPO = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)]
where ratio = π_θ(a|s) / π_old(a|s) and Â is the advantage estimate
```
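
The clipped objective can be sketched directly from the formula. This toy version works on per-sample log-probabilities and precomputed advantages; the function name and the default ε of 0.2 are assumptions, and real implementations operate on batched tensors.

```python
import math

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, epsilon=0.2):
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_theta / pi_old
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Take the more pessimistic of the unclipped and clipped objectives.
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return -sum(terms) / len(terms)

# The policy doubled the probability of an action with positive advantage:
# the gain is capped at (1 + epsilon) * advantage.
capped = ppo_clipped_loss([math.log(2.0)], [0.0], [1.0])
```

Clipping removes the incentive to push the ratio beyond 1 ± ε in the direction the advantage favors, which is what keeps each update step small.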
## 3.4 InstructGPT Training Details
- β = 0.02 for KL penalty
- PPO-ptx mixes pretraining gradients into the PPO updates to limit capability regression; the SFT initialization also mixed in 10% pretraining data
- LR scanned from 2.55e-6 to 2.55e-5 (runs above 8.05e-6 diverged)
- 256K PPO episodes total
- 4 models in memory simultaneously: policy, reference, reward, value
## 3.5 Why RLHF is Hard
| Challenge | Description |
|-----------|-------------|
| **Complexity** | 4 models in memory simultaneously |
| **Instability** | PPO is sensitive to hyperparameters |
| **Reward hacking** | Model exploits RM rather than genuinely improving |
| **Cost** | Human preference data is expensive |
| **Reproducibility** | Small changes → very different outcomes |
These challenges motivated the development of DPO.
## 3.6 Constitutional AI (RLAIF)
*"Constitutional AI"* (Bai et al., 2022), [arXiv:2212.08073](https://arxiv.org/abs/2212.08073)
**Key idea:** Replace human feedback with **AI feedback**. An AI evaluates responses against principles (the "constitution"). This reduces the cost of feedback and lets the process scale.
"""
CH4_DPO = r"""
# Chapter 4: Direct Preference Optimization (DPO)
## 4.1 The Key Insight
*"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"*, [arXiv:2305.18290](https://arxiv.org/abs/2305.18290)
**You don't need a separate reward model or RL training loop.** The language model itself implicitly represents a reward model.
The optimal solution to the RLHF objective can be expressed in closed form:
```
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x,y))
```
Rearranging to express reward in terms of the policy:
```
r(x,y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
```
Since Bradley-Terry only uses **reward differences**, the partition function Z(x) cancels:
```
L_DPO = -E[log σ(β · log(π_θ(y_w|x)/π_ref(y_w|x)) - β · log(π_θ(y_l|x)/π_ref(y_l|x)))]
```
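
For a single preference pair, the loss above reduces to a few arithmetic steps on sequence log-probabilities. The sketch below assumes those four log-probabilities have already been computed; `dpo_loss` and the example numbers are illustrative, not a library API.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the margin is 0.
initial_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After the policy raises the chosen response's log-probability:
improved_loss = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Note that Z(x) never appears: both responses share the same prompt, so it drops out of the margin.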
## 4.2 DPO vs RLHF
| Aspect | RLHF (PPO) | DPO |
|--------|-------------|-----|
| Models in memory | 4 | 2 (policy + reference) |
| Training loop | Complex RL with generation | Simple supervised training |
| Hyperparameters | Many | Few (mainly β) |
| Stability | Often unstable | Generally more stable |
| Sampling during training | Required | Not required |
| Performance | Strong | Often comparable |
## 4.3 The DPO Gradient β Intuition
```
∇L_DPO ∝ -β · [weight] · [∇log π(y_w|x) - ∇log π(y_l|x)]
```
- **Increase** likelihood of preferred response y_w
- **Decrease** likelihood of rejected response y_l
- **Weight** by how wrong the model currently is
If the model already prefers y_w correctly → small gradient (don't fix what isn't broken).
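
In the DPO paper the weight is a sigmoid of the implicit reward gap, taken in the "wrong" direction. A small sketch, where each log-ratio stands for log π_θ(y|x) - log π_ref(y|x) and the function names are illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_gradient_weight(chosen_logratio, rejected_logratio, beta=0.1):
    # Weight = sigmoid(implicit_reward(rejected) - implicit_reward(chosen)).
    # Close to 1 when the model ranks the pair wrongly, close to 0 when right.
    return sigmoid(beta * (rejected_logratio - chosen_logratio))

undecided = dpo_gradient_weight(0.0, 0.0)
already_right = dpo_gradient_weight(10.0, -10.0, beta=1.0)
badly_wrong = dpo_gradient_weight(-10.0, 10.0, beta=1.0)
```

Pairs the model already ranks correctly contribute almost nothing, so training effort concentrates on the pairs it still gets wrong.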
## 4.4 DPO Data Format
```python
# Each example needs: prompt + chosen + rejected
{
    "prompt": [{"role": "user", "content": "Explain gravity"}],
    "chosen": [{"role": "assistant", "content": "Gravity is a fundamental force..."}],
    "rejected": [{"role": "assistant", "content": "Gravity is when things fall down."}]
}
```
## 4.5 DPO Code Example
```python
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,  # much lower than SFT
    beta=0.1,  # strength of the implicit KL constraint toward the reference model
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/your-dpo-model",
)
trainer = DPOTrainer(
    model="your-sft-model",  # SFT model from stage 1
    args=config,
    train_dataset=dataset,
)
trainer.train()
```
## 4.6 Key Hyperparameters
| Parameter | Typical Range | Notes |
|-----------|--------------|-------|
| **β (beta)** | 0.01 to 0.5 | Higher = stay closer to reference. Default: 0.1 |
| **Learning rate** | 1e-7 to 5e-6 | Much lower than SFT. Very sensitive. |
| **Epochs** | 1 to 3 | Overfitting is common with more |
"""
CH5_PREFERENCE_ZOO = r"""
# Chapter 5: The Preference Optimization Zoo
After DPO, researchers developed many variants. Here's what each adds.
## 5.1 IPO β Identity Preference Optimization
**Problem:** DPO can overfit, especially with noisy preferences.
**Solution:** Adds regularization without assuming the Bradley-Terry model:
```
L_IPO = E[(log(π_θ(y_w)/π_ref(y_w)) - log(π_θ(y_l)/π_ref(y_l)) - 1/(2β))²]
```
**When to use:** Noisy preference data, DPO overfitting.
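
A per-pair sketch of the IPO loss, using the same log-ratio convention as the formula (each log-ratio is log π_θ(y) - log π_ref(y)); the function name and example values are illustrative assumptions:

```python
def ipo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    # Squared distance between the log-ratio gap and the target 1 / (2 * beta).
    target_gap = 1.0 / (2.0 * beta)
    return (chosen_logratio - rejected_logratio - target_gap) ** 2

at_target = ipo_loss(5.0, 0.0, beta=0.1)   # gap equals the target of 5
overshoot = ipo_loss(10.0, 0.0, beta=0.1)  # gap pushed well past the target
```

Unlike the DPO loss, which keeps decreasing as the gap grows, the squared form penalizes overshooting the target gap. That is the regularizing effect described above.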
## 5.2 KTO β Kahneman-Tversky Optimization
[arXiv:2402.01306](https://arxiv.org/abs/2402.01306)
**Problem:** DPO needs *paired* preferences (chosen AND rejected per prompt), which are expensive to collect.
**Solution:** Works with **unpaired binary feedback**: just "good" or "bad" per response. Based on prospect theory: losses hurt more than equivalent gains.
```python
{"prompt": "...", "completion": "...", "label": True}   # 👍
{"prompt": "...", "completion": "...", "label": False}  # 👎
```
**When to use:** Thumbs-up/down feedback, no pairwise comparisons.
## 5.3 ORPO β Odds Ratio Preference Optimization
**Problem:** DPO needs a separate SFT stage + reference model.
**Solution:** Combines SFT and preference optimization in **one step**:
```
L_ORPO = L_SFT + λ · L_OR
```
**When to use:** Simplest possible pipeline.
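
A rough sketch of how the two terms combine, following the ORPO paper's formulation as I understand it: the odds of a response are p / (1 - p), where p is its length-normalized likelihood, and L_OR is -log σ of the log odds ratio between chosen and rejected. The function names, the λ value, and the inputs (average token log-probabilities, which must be negative) are assumptions for illustration.

```python
import math

def log_odds(avg_logprob):
    # log(p / (1 - p)) with p = exp(avg_logprob); requires avg_logprob < 0.
    return avg_logprob - math.log1p(-math.exp(avg_logprob))

def orpo_loss(chosen_avg_logprob, rejected_avg_logprob, lam=0.1):
    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -chosen_avg_logprob
    # Odds-ratio term: -log sigmoid(log odds(chosen) - log odds(rejected)).
    log_odds_ratio = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    odds_ratio_term = math.log1p(math.exp(-log_odds_ratio))
    return sft_term + lam * odds_ratio_term

half = math.log(0.5)
tied = orpo_loss(half, half)                 # model cannot tell the pair apart
separated = orpo_loss(half, math.log(0.1))   # rejected response is less likely
```

No reference model appears anywhere: the SFT term anchors the policy to the chosen responses, and the odds-ratio term pushes the rejected ones down.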
## 5.4 SimPO β Simple Preference Optimization
**Problem:** DPO needs a reference model in memory (doubles GPU needs).
**Solution:** Eliminates reference model. Uses **average log probability** as the implicit reward (length-normalized).
**When to use:** GPU memory constraints.
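
A sketch of the length-normalized reward in a pairwise loss. The target margin `gamma` comes from the SimPO paper rather than from the text above, and the function name and default values of `beta` and `gamma` here are purely illustrative.

```python
import math

def simpo_loss(chosen_token_logprobs, rejected_token_logprobs,
               beta=2.0, gamma=0.5):
    # Implicit reward: beta times the average token log-probability,
    # so longer responses are not rewarded just for being long.
    chosen_reward = beta * sum(chosen_token_logprobs) / len(chosen_token_logprobs)
    rejected_reward = beta * sum(rejected_token_logprobs) / len(rejected_token_logprobs)
    margin = chosen_reward - rejected_reward - gamma
    # -log sigmoid(margin), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

short_pair = simpo_loss([-1.0, -1.0], [-2.0])
long_pair = simpo_loss([-1.0] * 4, [-2.0, -2.0])
```

Both pairs have the same per-token quality, so they get the same loss even though the second pair is twice as long. Only policy log-probabilities are needed, so no reference model has to be kept in memory.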
## 5.5 CPO β Contrastive Preference Optimization
Removes reference model using a contrastive loss. Similar to SimPO, different formulation.
## 5.6 Online DPO
**Problem:** Standard DPO uses a static dataset, so the data goes stale as the model changes.
**Solution:** Generates new completions from the *current* model during training, scored by a reward model.
## 5.7 Comparison Table
| Method | Ref Model? | Paired Data? | Needs RM? | Separate SFT? | Key Advantage |
|--------|-----------|-------------|-----------|---------------|---------------|
| PPO | Yes | No | **Yes** | Yes | Gold standard |
| DPO | Yes | **Yes** | No | Yes | Simple, stable |
| IPO | Yes | Yes | No | Yes | Robust to noise |
| KTO | Yes | **No** | No | Yes | Binary feedback |
| ORPO | **No** | Yes | No | **No** | Simplest pipeline |
| SimPO | **No** | Yes | No | Yes | Memory efficient |
| CPO | **No** | Yes | No | Yes | Memory efficient |
| Online DPO | Yes | Online | **Yes** | Yes | Fresh data |
| GRPO | Yes (soft) | No | **Yes** / funcs | Yes | Best for reasoning |
"""
CH6_GRPO = r"""
# Chapter 6: GRPO and the Reasoning Revolution
## 6.1 What is GRPO?
*"DeepSeekMath"* (Shao et al., 2024), [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
**Group Relative Policy Optimization** is a PPO variant that is more memory-efficient and well suited to reasoning tasks.
**Key idea:** Instead of a separate value model (critic), GRPO generates **multiple completions per prompt** and uses the group average as the baseline.
## 6.2 How GRPO Works
```
For each prompt:
  1. Generate G completions (e.g., G=16)
  2. Score each with a reward function
  3. Compute advantage: Â_i = (r_i - mean(r)) / std(r)
  4. Update model: ↑ probability of high-advantage completions
                   ↓ probability of low-advantage completions
```
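
Step 3 above can be sketched with the standard library. Two details are assumptions of this sketch rather than something the text specifies: it uses the population standard deviation, and it adds a small `eps` so that a group with identical rewards does not divide by zero.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: scores for the G completions sampled for ONE prompt.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 4 completions with a 0/1 correctness reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# If every completion gets the same reward, there is no learning signal.
no_signal = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the baseline is the group mean, no value network is needed. The flip side is that prompts where every sample succeeds (or every sample fails) contribute no gradient.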
**The GRPO loss:**
```
L = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)] + β · KL
```
"Group Relative" = advantage computed *relative to the group* for the same prompt.
## 6.3 The DeepSeek-R1 Story
*"DeepSeek-R1"* (DeepSeek-AI, 2025), [arXiv:2501.12948](https://arxiv.org/abs/2501.12948)
GRPO was used to train DeepSeek-R1. Its precursor, DeepSeek-R1-Zero, learned chain-of-thought reasoning **purely through RL**, with no SFT stage; DeepSeek-R1 itself adds a small cold-start SFT stage before RL.
With the right reward (accuracy on math/code), the model **spontaneously developed:**
- Chain-of-thought reasoning
- Self-verification ("Let me check...")
- Error correction
- Problem decomposition
No one explicitly taught these behaviors; they emerged from the reward signal.
## 6.4 Reward Functions
GRPO is flexible. Rewards can be:
1. **Python functions** (rule-based): Is the math answer correct? Does the code pass tests?
2. **Reward models** (learned): A neural network scoring responses
3. **Multiple functions**: accuracy + format + length
```python
import re
def accuracy_reward(completions, ground_truth, **kwargs):
    # Extract the contents of the first \boxed{...} in each completion
    matches = [re.search(r"\\boxed\{(.*?)\}", c) for c in completions]
    contents = [m.group(1) if m else "" for m in matches]
    # Reward 1.0 for an exact string match with the reference answer, else 0.0
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
def format_reward(completions, **kwargs):
    pattern = r"
From Pretraining to Alignment: SFT · RLHF · DPO · GRPO · LoRA and more