NeuralNoble's picture
Add Gradio app with 12 chapters + search
84f8da4 verified
import gradio as gr
# ============================================================
# CONTENT — each chapter stored as a Python string (Markdown)
# ============================================================
OVERVIEW = r"""
# 📖 The Complete Guide to Post-Training of Large Language Models
### From Pretraining to Alignment — Everything You Need to Know
---
> **Who is this for?** You've learned how pretraining works — you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens *after* pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini.
---
## 🗺️ Roadmap
| # | Chapter | What You'll Learn |
|---|---------|-------------------|
| 1 | **The Big Picture** | Why pretrained models aren't useful yet; the 3-stage pipeline |
| 2 | **SFT** | Supervised Fine-Tuning — loss function, data formats, key papers |
| 3 | **RLHF** | Reward models, PPO, KL divergence, reward hacking |
| 4 | **DPO** | Direct Preference Optimization — RLHF without RL |
| 5 | **Preference Zoo** | KTO, ORPO, SimPO, CPO, IPO, Online DPO |
| 6 | **GRPO & Reasoning** | DeepSeek-R1, reward functions, the reasoning revolution |
| 7 | **PEFT** | LoRA, QLoRA — fine-tune on consumer GPUs |
| 8 | **Toolbox** | TRL, Transformers, vLLM, Accelerate, DeepSpeed |
| 9 | **Datasets** | What to train on — curated lists with Hub links |
| 10 | **Evaluation** | Benchmarks, LLM-as-Judge, human eval |
| 11 | **Full Recipe** | End-to-end pipeline with code |
| 12 | **Reading List** | 18 must-read papers in 4 tiers |
---
### The Three Stages of Post-Training
```
┌─────────────┐ ┌──────────────────┐ ┌─────────────────────────┐
│ STAGE 1: SFT │ ──> │ STAGE 2: Reward │ ──> │ STAGE 3: RL │
│ │ │ Model Training │ │ (PPO / DPO / GRPO) │
│ Teach format │ │ Learn preferences│ │ Optimize for preferences│
│ & behavior │ │ from comparisons │ │ while staying close to │
│ │ │ │ │ the SFT model │
└─────────────┘ └──────────────────┘ └─────────────────────────┘
Input: Pretrained LM Output: Aligned Assistant
```
### The Evolution Timeline
| Year | Method | Key Idea |
|------|--------|----------|
| 2017 | RLHF (original) | Human preferences → reward model → RL |
| 2020 | RLHF for LLMs | Applied to text summarization |
| 2022 | **InstructGPT** | Full SFT → RM → PPO pipeline |
| 2022 | Constitutional AI | AI feedback replaces human feedback |
| 2023 | **DPO** | No reward model needed — direct optimization |
| 2024 | KTO / ORPO | Binary feedback / combined SFT+preference |
| 2024 | **GRPO** | Group-based RL for reasoning (DeepSeek) |
| 2025 | DeepSeek-R1 | RL teaches chain-of-thought from scratch |
"""
CH1_BIG_PICTURE = r"""
# Chapter 1: The Big Picture — Why Post-Training Exists
## 1.1 The Gap Between Pretraining and Usefulness
You've pretrained a language model. It can predict the next token with impressive accuracy. It has absorbed vast knowledge from the internet. But try asking it a question:
```
User: What is the capital of France?
Model: What is the capital of Germany? What is the capital of Italy? What is the...
```
The model doesn't *answer* — it *continues*. The pretraining objective (`P(next_token | context)`) optimizes for predicting what comes next in web text, not for being helpful.
> *"Large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users."*
> — **InstructGPT** (Ouyang et al., 2022)
## 1.2 What Post-Training Does
Post-training is everything after pretraining that makes a model useful, safe, and aligned. It has three stages:
**Stage 1 — SFT (Supervised Fine-Tuning):** Teach the model the *format* of helpful responses using demonstrations.
**Stage 2 — Reward Modeling:** Train a model to predict which response a human would prefer.
**Stage 3 — RL Optimization:** Optimize the SFT model to generate responses the reward model scores highly.
## 1.3 The Superficial Alignment Hypothesis
> *"A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users."*
> — **LIMA** (Zhou et al., 2023)
This is the key insight: **post-training doesn't teach new knowledge** — it teaches the model to *surface existing knowledge in the right way*. The pretrained model already "knows" the capital of France; SFT teaches it to respond to questions rather than generate more questions.
## 1.4 Why This Matters
The difference is dramatic:
- **InstructGPT 1.3B** (post-trained) was preferred over **GPT-3 175B** (pretrained only) by human evaluators
- That's a 100× smaller model winning because of post-training
- Post-training is what turns a text predictor into an assistant
"""
CH2_SFT = r"""
# Chapter 2: Supervised Fine-Tuning (SFT)
## 2.1 What SFT Does
SFT is the bridge between a pretrained language model and a useful assistant.
**Before SFT:**
```
Input: "Explain quantum computing in simple terms."
Output: "Explain quantum computing to a 5-year-old. Explain quantum computing..."
```
**After SFT:**
```
Input: "Explain quantum computing in simple terms."
Output: "Quantum computing uses the principles of quantum mechanics to process
information. Unlike classical computers that use bits (0 or 1),
quantum computers use qubits that can be both 0 and 1 simultaneously..."
```
## 2.2 The SFT Loss Function
If you understand pretraining, you understand SFT — with one crucial difference:
**Pretraining loss** (on ALL tokens):
```
L_pretrain = -Σ log P(token_i | token_1, ..., token_{i-1}) for ALL tokens
```
**SFT loss** (on RESPONSE tokens only):
```
L_SFT = -Σ log P(c_i | prompt, c_1, ..., c_{i-1}) for COMPLETION tokens only
```
The prompt tokens are masked from the loss. We don't want the model to learn to generate instructions — we want it to learn to *respond* to them.
```
Sequence: [User: What is 2+2?] [Assistant: 4]
Loss mask: [ ----IGNORED---- ] [COMPUTED HERE ]
```
## 2.3 Chat Templates & Data Format
Modern SFT uses structured conversations:
```python
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."}
]
}
```
Each model family converts this to its own template:
```
# ChatML (Qwen, etc.):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
# Llama-3:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>
```
The `transformers` library handles this automatically:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"}
]
# For training:
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# For inference:
text = tokenizer.apply_chat_template(messages[:1], tokenize=False, add_generation_prompt=True)
```
## 2.4 Key SFT Papers
### FLAN (2021) — Instruction Tuning Works
📄 *"Finetuned Language Models Are Zero-Shot Learners"* — [arXiv:2109.01652](https://arxiv.org/abs/2109.01652)
- 62 NLP datasets formatted as instructions → fine-tune LaMDA-PT 137B
- **Result:** Surpassed zero-shot GPT-3 175B on 20/25 tasks
- **Key insight:** Instructions matter — same tasks without instructions = much weaker
- **Recipe:** Adafactor, lr=3e-5, 30K steps, batch 8192 tokens
### Self-Instruct (2022) — Synthetic Data
📄 *"Self-Instruct"* — [arXiv:2212.10560](https://arxiv.org/abs/2212.10560)
- 175 seed tasks → GPT-3 generates 52K instructions + responses
- **Result:** +33% over vanilla GPT-3 on SuperNaturalInstructions
- **Led to:** Stanford Alpaca (LLaMA + 52K GPT instructions for <$600)
### InstructGPT (2022) — SFT as Stage 1
📄 *"Training Language Models to Follow Instructions with Human Feedback"* — [arXiv:2203.02155](https://arxiv.org/abs/2203.02155)
- ~13K human demonstrations, 16 epochs, cosine LR, dropout 0.2
- **Key finding:** SFT overfits on val loss after 1 epoch, but more epochs ↑ RM score
- **Result:** 1.3B InstructGPT preferred over 175B GPT-3
### LIMA (2023) — Less Is More
📄 *"LIMA: Less Is More for Alignment"* — [arXiv:2305.11206](https://arxiv.org/abs/2305.11206)
- Just **1,000 curated examples** → competitive with GPT-3.5 (DaVinci003)
- **Recipe:** AdamW, lr 1e-5→1e-6, 15 epochs, batch 32, max len 2048
- **Takeaway:** Data quality >> data quantity
## 2.5 SFT Code Example
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
config = SFTConfig(
output_dir="./sft-output",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-5,
max_seq_length=2048,
gradient_checkpointing=True,
bf16=True,
logging_steps=10,
push_to_hub=True,
hub_model_id="your-username/your-sft-model",
)
trainer = SFTTrainer(
model="Qwen/Qwen3-0.6B",
args=config,
train_dataset=dataset,
)
trainer.train()
```
SFTTrainer automatically detects the `messages` column, applies the chat template, and masks prompt tokens from the loss.
"""
CH3_RLHF = r"""
# Chapter 3: Reinforcement Learning from Human Feedback (RLHF)
## 3.1 Why SFT Isn't Enough
SFT teaches format and basic behavior, but:
- It only learns from demonstrations (can't be better than its training data)
- It can't express preferences (treats all tokens equally)
- It can learn bad habits from imperfect training data
RLHF addresses this by training on **which outputs are better**, not what specific tokens to generate.
## 3.2 Step 1: Train a Reward Model
A **reward model (RM)** scores how good a response is (prompt + response → scalar score).
**Training process:**
1. Generate multiple responses per prompt using the SFT model
2. Humans rank responses (e.g., A > B)
3. Train RM to predict these rankings
**The Bradley-Terry preference model:**
```
P(A preferred over B) = σ(r(A) - r(B))
Loss: L_RM = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]
```
**Architecture:** Same as the LM, but output head → single scalar. InstructGPT used a 6B RM (not 175B — larger was unstable).
## 3.3 Step 2: Optimize with PPO
**The RLHF objective:**
```
maximize E[RM(prompt, response)] - β · KL(π_θ || π_ref)
↑ score high ↑ don't deviate too far
on reward model from original SFT model
```
The **KL penalty** prevents **reward hacking** — without it, the model generates gibberish that tricks the RM.
**PPO (Proximal Policy Optimization) loop:**
1. **Generate** responses from the current model
2. **Score** them with the reward model
3. **Compute advantage** (how much better than expected)
4. **Update** weights to favor high-advantage responses
5. **Clip** updates for stability
```
L_PPO = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)]
where ratio = π_θ(a|s) / π_old(a|s)
```
## 3.4 InstructGPT Training Details
- β = 0.02 for KL penalty
- Mixed 10% pretraining data during PPO (prevents capability regression)
- LR range: 2.55e-6 to 2.55e-5 (>8.05e-6 diverged)
- 256K PPO episodes total
- 4 models in memory simultaneously: policy, reference, reward, value
## 3.5 Why RLHF is Hard
| Challenge | Description |
|-----------|-------------|
| **Complexity** | 4 models in memory simultaneously |
| **Instability** | PPO is sensitive to hyperparameters |
| **Reward hacking** | Model exploits RM rather than genuinely improving |
| **Cost** | Human preference data is expensive |
| **Reproducibility** | Small changes → very different outcomes |
These challenges motivated the development of DPO.
## 3.6 Constitutional AI (RLAIF)
📄 *"Constitutional AI"* (Bai et al., 2022) — [arXiv:2212.08073](https://arxiv.org/abs/2212.08073)
**Key idea:** Replace human feedback with **AI feedback**. An AI evaluates responses against principles (the "constitution"). Dramatically reduces cost and scales the feedback process.
"""
CH4_DPO = r"""
# Chapter 4: Direct Preference Optimization (DPO)
## 4.1 The Key Insight
📄 *"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"* — [arXiv:2305.18290](https://arxiv.org/abs/2305.18290)
**You don't need a separate reward model or RL training loop.** The language model itself implicitly represents a reward model.
The optimal solution to the RLHF objective can be expressed in closed form:
```
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x,y))
```
Rearranging to express reward in terms of the policy:
```
r(x,y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
```
Since Bradley-Terry only uses **reward differences**, the partition function Z(x) cancels:
```
L_DPO = -E[log σ(β · log(π_θ(y_w|x)/π_ref(y_w|x)) - β · log(π_θ(y_l|x)/π_ref(y_l|x)))]
```
## 4.2 DPO vs RLHF
| Aspect | RLHF (PPO) | DPO |
|--------|-------------|-----|
| Models in memory | 4 | 2 (policy + reference) |
| Training loop | Complex RL with generation | Simple supervised training |
| Hyperparameters | Many | Few (mainly β) |
| Stability | Often unstable | Very stable |
| Sampling during training | Required | Not required |
| Performance | Strong | Comparable or better |
## 4.3 The DPO Gradient — Intuition
```
∇L_DPO ∝ -β · [weight] · [∇log π(y_w|x) - ∇log π(y_l|x)]
```
- **Increase** likelihood of preferred response y_w
- **Decrease** likelihood of rejected response y_l
- **Weight** by how wrong the model currently is
If the model already prefers y_w correctly → small gradient (don't fix what isn't broken).
## 4.4 DPO Data Format
```python
# Each example needs: prompt + chosen + rejected
{
"prompt": [{"role": "user", "content": "Explain gravity"}],
"chosen": [{"role": "assistant", "content": "Gravity is a fundamental force..."}],
"rejected": [{"role": "assistant", "content": "Gravity is when things fall down."}]
}
```
## 4.5 DPO Code Example
```python
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
config = DPOConfig(
output_dir="./dpo-output",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=5e-7, # Much lower than SFT!
beta=0.1, # KL penalty strength
bf16=True,
gradient_checkpointing=True,
push_to_hub=True,
hub_model_id="your-username/your-dpo-model",
)
trainer = DPOTrainer(
model="your-sft-model", # SFT model from stage 1
args=config,
train_dataset=dataset,
)
trainer.train()
```
## 4.6 Key Hyperparameters
| Parameter | Typical Range | Notes |
|-----------|--------------|-------|
| **β (beta)** | 0.01 – 0.5 | Higher = stay closer to reference. Default: 0.1 |
| **Learning rate** | 1e-7 – 5e-6 | Much lower than SFT. Very sensitive. |
| **Epochs** | 1 – 3 | Overfitting is common with more |
"""
CH5_PREFERENCE_ZOO = r"""
# Chapter 5: The Preference Optimization Zoo
After DPO, researchers developed many variants. Here's what each adds.
## 5.1 IPO — Identity Preference Optimization
**Problem:** DPO can overfit, especially with noisy preferences.
**Solution:** Adds regularization without assuming the Bradley-Terry model:
```
L_IPO = E[(log(π_θ(y_w)/π_ref(y_w)) - log(π_θ(y_l)/π_ref(y_l)) - 1/(2β))²]
```
**When to use:** Noisy preference data, DPO overfitting.
## 5.2 KTO — Kahneman-Tversky Optimization
📄 [arXiv:2402.01306](https://arxiv.org/abs/2402.01306)
**Problem:** DPO needs *paired* preferences (chosen AND rejected per prompt). Expensive to collect.
**Solution:** Works with **unpaired binary feedback** — just "good" or "bad" per response. Based on prospect theory: losses hurt more than equivalent gains.
```python
{"prompt": "...", "completion": "...", "label": True} # 👍
{"prompt": "...", "completion": "...", "label": False} # 👎
```
**When to use:** Thumbs-up/down feedback, no pairwise comparisons.
## 5.3 ORPO — Odds Ratio Preference Optimization
**Problem:** DPO needs a separate SFT stage + reference model.
**Solution:** Combines SFT and preference optimization in **one step**:
```
L_ORPO = L_SFT + λ · L_OR
```
**When to use:** Simplest possible pipeline.
## 5.4 SimPO — Simple Preference Optimization
**Problem:** DPO needs a reference model in memory (doubles GPU needs).
**Solution:** Eliminates reference model. Uses **average log probability** as the implicit reward (length-normalized).
**When to use:** GPU memory constraints.
## 5.5 CPO — Contrastive Preference Optimization
Removes reference model using a contrastive loss. Similar to SimPO, different formulation.
## 5.6 Online DPO
**Problem:** Standard DPO uses a static dataset — data goes stale as the model changes.
**Solution:** Generates new completions from the *current* model during training, scored by a reward model.
## 5.7 Comparison Table
| Method | Ref Model? | Paired Data? | Needs RM? | Separate SFT? | Key Advantage |
|--------|-----------|-------------|-----------|---------------|---------------|
| PPO | Yes | No | **Yes** | Yes | Gold standard |
| DPO | Yes | **Yes** | No | Yes | Simple, stable |
| IPO | Yes | Yes | No | Yes | Robust to noise |
| KTO | Yes | **No** | No | Yes | Binary feedback |
| ORPO | **No** | Yes | No | **No** | Simplest pipeline |
| SimPO | **No** | Yes | No | Yes | Memory efficient |
| CPO | **No** | Yes | No | Yes | Memory efficient |
| Online DPO | Yes | Online | **Yes** | Yes | Fresh data |
| GRPO | Yes (soft) | No | **Yes** / funcs | Yes | Best for reasoning |
"""
CH6_GRPO = r"""
# Chapter 6: GRPO and the Reasoning Revolution
## 6.1 What is GRPO?
📄 *"DeepSeekMath"* (Shao et al., 2024) — [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
**Group Relative Policy Optimization** — a PPO variant that's more memory-efficient and great for reasoning.
**Key idea:** Instead of a separate value model (critic), GRPO generates **multiple completions per prompt** and uses the group average as the baseline.
## 6.2 How GRPO Works
```
For each prompt:
1. Generate G completions (e.g., G=16)
2. Score each with a reward function
3. Compute advantage: Â_i = (r_i - mean(r)) / std(r)
4. Update model: ↑ probability of high-advantage completions
↓ probability of low-advantage completions
```
**The GRPO loss:**
```
L = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)] + β · KL
```
"Group Relative" = advantage computed *relative to the group* for the same prompt.
## 6.3 The DeepSeek-R1 Story
📄 *"DeepSeek-R1"* (DeepSeek-AI, 2025) — [arXiv:2501.12948](https://arxiv.org/abs/2501.12948)
GRPO was used to train DeepSeek-R1 — a model that learned chain-of-thought reasoning **purely through RL**.
With the right reward (accuracy on math/code), the model **spontaneously developed:**
- Chain-of-thought reasoning
- Self-verification ("Let me check...")
- Error correction
- Problem decomposition
No one explicitly taught these behaviors — they emerged from the reward signal.
## 6.4 Reward Functions
GRPO is flexible — rewards can be:
1. **Python functions** (rule-based): Is the math answer correct? Does the code pass tests?
2. **Reward models** (learned): A neural network scoring responses
3. **Multiple functions**: accuracy + format + length
```python
import re
def accuracy_reward(completions, ground_truth, **kwargs):
matches = [re.search(r"\\boxed\{(.*?)\}", c) for c in completions]
contents = [m.group(1) if m else "" for m in matches]
return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
def format_reward(completions, **kwargs):
pattern = r"<think>.*?</think>.*<answer>.*?</answer>"
return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]
```
## 6.5 GRPO Code Example
```python
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
from trl.rewards import accuracy_reward
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")
config = GRPOConfig(
output_dir="./grpo-output",
learning_rate=1e-6,
per_device_train_batch_size=4,
num_generations=16, # G completions per prompt
max_completion_length=512,
bf16=True,
gradient_checkpointing=True,
)
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
reward_funcs=accuracy_reward,
args=config,
train_dataset=dataset,
)
trainer.train()
```
"""
CH7_PEFT = r"""
# Chapter 7: Parameter-Efficient Fine-Tuning (PEFT)
## 7.1 The Memory Problem
Fine-tuning 7B parameters requires:
| Component | Memory |
|-----------|--------|
| Model weights (bf16) | 14 GB |
| Gradients | 14 GB |
| Optimizer states (AdamW) | 28 GB |
| Activations | 10-30 GB |
| **Total** | **~60-80 GB** |
That's one A100 for SFT alone. RLHF with PPO (4 models) = 4× this.
## 7.2 LoRA: Low-Rank Adaptation
📄 [arXiv:2106.09685](https://arxiv.org/abs/2106.09685)
**Insight:** Fine-tuning weight updates have low rank — approximate them with small matrices.
```
W' = W + α · B × A
W: original frozen weight (d × d) — NOT trained
A: down projection (d × r) — trained
B: up projection (r × d) — trained
r: rank (typically 8-32) r << d
```
**Example:** 4096 × 4096 weight matrix:
- Full fine-tuning: **16.7M** parameters
- LoRA (r=16): 2 × 4096 × 16 = **131K** parameters (128× fewer!)
## 7.3 QLoRA: Quantized LoRA
📄 [arXiv:2305.14314](https://arxiv.org/abs/2305.14314)
Quantize the frozen base model to **4-bit**, add LoRA adapters in bf16.
- 7B model: ~4 GB (quantized) + small LoRA adapters
- Fine-tune 7B on a single RTX 4090 (24 GB)!
## 7.4 LoRA with TRL
```python
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling (usually 2×r)
lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
task_type="CAUSAL_LM",
)
config = SFTConfig(
output_dir="./sft-lora",
num_train_epochs=3,
learning_rate=2e-4, # Higher LR for LoRA
bf16=True,
gradient_checkpointing=True,
)
trainer = SFTTrainer(
model="meta-llama/Llama-3.1-8B",
args=config,
train_dataset=dataset,
peft_config=lora_config, # ← pass LoRA config
)
trainer.train()
```
## 7.5 When to Use What
| Scenario | Recommendation |
|----------|---------------|
| Limited GPU memory | LoRA / QLoRA |
| Quick experiment | LoRA |
| Maximum quality | Full fine-tuning |
| Multiple adapters from same base | LoRA (swap adapters) |
| Very small dataset | LoRA (acts as regularizer) |
**LoRA is ~95-99% as good as full fine-tuning** at a fraction of the compute.
"""
CH8_TOOLBOX = r"""
# Chapter 8: The Toolbox
## 8.1 TRL — Transformers Reinforcement Learning
🔗 [github.com/huggingface/trl](https://github.com/huggingface/trl) · [Docs](https://huggingface.co/docs/trl)
The central library for all post-training methods:
| Trainer | Method | Dataset Type |
|---------|--------|-------------|
| `SFTTrainer` | Supervised Fine-Tuning | `messages` or `prompt`+`completion` |
| `DPOTrainer` | Direct Preference Optimization | `prompt`+`chosen`+`rejected` |
| `GRPOTrainer` | Group Relative Policy Optimization | `prompt` only |
| `RLOOTrainer` | REINFORCE Leave-One-Out | `prompt` only |
| `RewardTrainer` | Reward Model Training | `prompt`+`chosen`+`rejected` |
| `KTOTrainer` | Kahneman-Tversky Optimization | `prompt`+`completion`+`label` |
| `ORPOTrainer` | Odds Ratio Preference Opt. | `prompt`+`chosen`+`rejected` |
| `CPOTrainer` | Contrastive Preference Opt. | `prompt`+`chosen`+`rejected` |
| `OnlineDPOTrainer` | Online DPO | `prompt` only |
| `PPOTrainer` | Proximal Policy Optimization | tokenized |
| `XPOTrainer` | Exploratory Preference Opt. | `prompt` only |
| `NashMDTrainer` | Nash Mirror Descent | `prompt` only |
| `PRMTrainer` | Process Reward Model | stepwise supervision |
## 8.2 Core Libraries
| Library | Purpose |
|---------|---------|
| **transformers** | Model loading, tokenization, chat templates |
| **datasets** | Dataset loading and processing |
| **peft** | LoRA, QLoRA, adapters |
| **accelerate** | Distributed training, DeepSpeed |
| **bitsandbytes** | 4/8-bit quantization |
## 8.3 Inference & Serving
| Tool | Purpose |
|------|---------|
| **vLLM** | High-throughput inference (5-10× faster generation) |
| **TGI** | HuggingFace Text Generation Inference |
| **Unsloth** | 2-5× faster LoRA training |
TRL integrates vLLM directly:
```python
config = GRPOConfig(use_vllm=True, vllm_mode="colocate")
```
## 8.4 Experiment Tracking & Evaluation
| Tool | Purpose |
|------|---------|
| **Weights & Biases** | Experiment tracking |
| **Trackio** | HF-native tracking |
| **lm-eval-harness** | Standardized LLM benchmarks |
| **AlpacaEval** | GPT-4 based evaluation |
## 8.5 TRL CLI Commands
```bash
# Install
pip install trl
# SFT from command line
trl sft --model_name_or_path Qwen/Qwen3-0.6B \
--dataset_name trl-lib/Capybara --output_dir ./sft
# DPO from command line
trl dpo --model_name_or_path your-sft-model \
--dataset_name trl-lib/ultrafeedback_binarized --output_dir ./dpo
# GRPO from command line
trl grpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name trl-lib/DeepMath-103K --output_dir ./grpo
# Multi-GPU
accelerate launch --num_processes 4 train.py
# DeepSpeed ZeRO-3
accelerate launch --config_file deepspeed_zero3.yaml train.py
```
"""
CH9_DATASETS = r"""
# Chapter 9: Datasets — What to Train On
## 9.1 SFT Datasets
| Dataset | Size | Description |
|---------|------|-------------|
| [**trl-lib/Capybara**](https://huggingface.co/datasets/trl-lib/Capybara) | ~90K | Multi-turn conversations, high quality |
| [**HuggingFaceH4/ultrachat_200k**](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | 200K | Diverse multi-turn conversations |
| [**allenai/tulu-3-sft-mixture**](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) | ~1.3M | Large-scale SFT mixture |
| [**OpenAssistant/oasst1**](https://huggingface.co/datasets/OpenAssistant/oasst1) | 161K | Crowdsourced conversation trees |
| [**tatsu-lab/alpaca**](https://huggingface.co/datasets/tatsu-lab/alpaca) | 52K | GPT-generated instructions |
| [**teknium/OpenHermes-2.5**](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M | Large synthetic instructions |
## 9.2 Preference Datasets (DPO / KTO / ORPO)
| Dataset | Size | Description |
|---------|------|-------------|
| [**trl-lib/ultrafeedback_binarized**](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized) | 60K | Binarized UltraFeedback |
| [**Anthropic/hh-rlhf**](https://huggingface.co/datasets/Anthropic/hh-rlhf) | 170K | Helpful + harmless preferences |
| [**argilla/ultrafeedback-binarized-preferences**](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences) | 60K | Cleaned UltraFeedback |
## 9.3 Prompt-Only Datasets (GRPO / RLOO)
| Dataset | Size | Description |
|---------|------|-------------|
| [**trl-lib/DeepMath-103K**](https://huggingface.co/datasets/trl-lib/DeepMath-103K) | 103K | Math with verifiable answers |
| [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) | ~70K | Competition math |
## 9.4 How to Choose
1. **First experiment:** `trl-lib/Capybara` (SFT) or `trl-lib/ultrafeedback_binarized` (DPO)
2. **Quality > Quantity:** LIMA: 1K great > 52K mediocre
3. **Match your use case:** Math model → math data, chat model → diverse conversations
4. **Always inspect first:**
```python
from datasets import load_dataset
ds = load_dataset("trl-lib/Capybara", split="train")
print(ds[0]) # Look at the data!
```
"""
CH10_EVALUATION = r"""
# Chapter 10: Evaluation — How to Know If It Worked
## 10.1 Why Evaluation is Hard
- Open-ended outputs can be correct in many ways
- Perplexity doesn't correlate with usefulness (LIMA proved this)
- Benchmark scores ≠ real-world performance
- Human evaluation is expensive and subjective
## 10.2 Automated Benchmarks
| Benchmark | Measures | How |
|-----------|----------|-----|
| **MMLU** | Knowledge (57 subjects) | Multiple choice |
| **HellaSwag** | Commonsense reasoning | Sentence completion |
| **ARC** | Science reasoning | Multiple choice |
| **TruthfulQA** | Truthfulness | Trick questions |
| **GSM8K** | Math reasoning | Grade-school word problems |
| **MATH** | Advanced math | Competition-level |
| **HumanEval** | Code generation | Python problems |
| **MBPP** | Code generation | Basic Python |
| **IFEval** | Instruction following | Verifiable constraints |
**Run with lm-eval-harness:**
```bash
lm_eval --model hf \
--model_args pretrained=your-model \
--tasks mmlu,gsm8k,hellaswag \
--batch_size 8
```
## 10.3 LLM-as-Judge
| Evaluation | Description |
|------------|-------------|
| **AlpacaEval** | GPT-4 compares to reference |
| **MT-Bench** | Multi-turn dialogue, GPT-4 judged |
| **Arena Hard** | Challenging prompts, GPT-4 judged |
## 10.4 Human Evaluation
- **Side-by-side:** Show two responses, pick the better one
- **Likert scale:** Rate helpfulness, accuracy, harmlessness (1-7)
- **Chatbot Arena:** [lmarena.ai](https://lmarena.ai/) — crowdsourced blind comparisons
## 10.5 The Open LLM Leaderboard
🔗 [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
Standardized evaluation of open-source models. The primary way the community tracks progress.
"""
CH11_RECIPE = r"""
# Chapter 11: Putting It All Together
## 11.1 The Standard Recipe (2024-2025)
```
Step 1: Choose Base Model
├── Qwen3 (0.6B–235B) — Top-performing
├── Llama 3.1/3.2 (1B–405B) — Meta
├── Gemma 3/4 (1B–27B) — Google
└── Mistral/Mixtral — Efficient
Step 2: SFT
├── Dataset: trl-lib/Capybara or ultrachat_200k
├── Method: SFTTrainer + LoRA (or full)
├── Epochs: 2-3, LR: 2e-5 (full) / 2e-4 (LoRA)
└── Output → reference model for Stage 3
Step 3: Preference Optimization
├── DPO (simplest): β=0.1, LR=5e-7, 1-2 epochs
├── GRPO (reasoning): 16 generations, LR=1e-6
└── KTO (binary feedback): unpaired data
Step 4: Evaluation
├── lm-eval-harness (MMLU, GSM8K)
├── MT-Bench / AlpacaEval
└── Manual testing
```
## 11.2 Full Code: SFT → DPO
```python
# ═══ STAGE 1: SFT ═══
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
sft_dataset = load_dataset("trl-lib/Capybara", split="train")
sft_config = SFTConfig(
output_dir="./sft-model",
num_train_epochs=2,
per_device_train_batch_size=4,
learning_rate=2e-5,
max_seq_length=2048,
bf16=True,
gradient_checkpointing=True,
push_to_hub=True,
hub_model_id="your-username/my-sft-model",
)
SFTTrainer(model="Qwen/Qwen3-0.6B", args=sft_config,
train_dataset=sft_dataset).train()
# ═══ STAGE 2: DPO ═══
from trl import DPOTrainer, DPOConfig
dpo_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
dpo_config = DPOConfig(
output_dir="./dpo-model",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=5e-7,
beta=0.1,
bf16=True,
gradient_checkpointing=True,
push_to_hub=True,
hub_model_id="your-username/my-dpo-model",
)
DPOTrainer(model="your-username/my-sft-model", args=dpo_config,
train_dataset=dpo_dataset).train()
```
## 11.3 Hardware Guidelines
| Model Size | Minimum | Recommended | With LoRA |
|-----------|---------|-------------|-----------|
| 0.5–3B | 1× A10G (24GB) | 1× A100 (80GB) | 1× T4 (16GB) |
| 7–8B | 1× A100 (80GB) | 2× A100 | 1× A10G (24GB) |
| 13B | 2× A100 | 4× A100 | 1× A100 (80GB) |
| 70B | 4× A100 | 8× A100 | 2× A100 |
"""
CH12_READING_LIST = r"""
# Chapter 12: The Reading List
## Tier 1: Must-Read (The Foundations)
| # | Paper | Year | Why Read It |
|---|-------|------|------------|
| 1 | [**InstructGPT**](https://arxiv.org/abs/2203.02155) — *Training LMs to Follow Instructions with Human Feedback* | 2022 | The SFT→RM→PPO pipeline. Everything starts here. |
| 2 | [**DPO**](https://arxiv.org/abs/2305.18290) — *Direct Preference Optimization* | 2023 | No reward model, no RL. Most widely used method. |
| 3 | [**LoRA**](https://arxiv.org/abs/2106.09685) — *Low-Rank Adaptation of Large Language Models* | 2021 | Made fine-tuning accessible. Used everywhere. |
| 4 | [**DeepSeek-R1**](https://arxiv.org/abs/2501.12948) — *Incentivizing Reasoning via RL* | 2025 | RL teaches reasoning from scratch. The reasoning era. |
## Tier 2: Important (Deeper Understanding)
| # | Paper | Year | Why Read It |
|---|-------|------|------------|
| 5 | [**LIMA**](https://arxiv.org/abs/2305.11206) — *Less Is More for Alignment* | 2023 | Data quality >> quantity. |
| 6 | [**Constitutional AI**](https://arxiv.org/abs/2212.08073) — *Harmlessness from AI Feedback* | 2022 | AI feedback replaces human feedback (RLAIF). |
| 7 | [**FLAN**](https://arxiv.org/abs/2109.01652) — *Finetuned LMs Are Zero-Shot Learners* | 2021 | Proved instruction tuning works. |
| 8 | [**Self-Instruct**](https://arxiv.org/abs/2212.10560) — *Aligning LMs with Self-Generated Instructions* | 2022 | Synthetic data for SFT. Led to Alpaca. |
| 9 | [**DeepSeekMath**](https://arxiv.org/abs/2402.03300) — *Pushing the Limits of Mathematical Reasoning* | 2024 | Introduced GRPO. |
| 10 | [**QLoRA**](https://arxiv.org/abs/2305.14314) — *Efficient Finetuning of Quantized LMs* | 2023 | 7B fine-tuning on consumer GPUs. |
## Tier 3: Advanced (Cutting Edge)
| # | Paper | Year |
|---|-------|------|
| 11 | [**KTO**](https://arxiv.org/abs/2402.01306) — *Prospect Theoretic Optimization* | 2024 |
| 12 | [**ORPO**](https://arxiv.org/abs/2403.07691) — *Monolithic Preference Optimization* | 2024 |
| 13 | [**SimPO**](https://arxiv.org/abs/2405.14734) — *Reference-Free Reward* | 2024 |
| 14 | **Tulu 3** — *Open Language Model Post-Training* | 2024 |
| 15 | [**Zephyr**](https://arxiv.org/abs/2310.16944) — *Direct Distillation of LM Alignment* | 2023 |
## Tier 4: RL Foundations
| # | Paper | Year |
|---|-------|------|
| 16 | [**PPO**](https://arxiv.org/abs/1707.06347) — *Proximal Policy Optimization* | 2017 |
| 17 | [**Learning to Summarize from Human Feedback**](https://arxiv.org/abs/2009.01325) | 2020 |
| 18 | [**Fine-Tuning LMs from Human Preferences**](https://arxiv.org/abs/1909.08593) | 2019 |
"""
GLOSSARY = r"""
# Glossary & Quick Reference
## Key Terms
| Term | Definition |
|------|-----------|
| **Alignment** | Making a model behave according to human intentions and values |
| **RLHF** | Reinforcement Learning from Human Feedback |
| **RLAIF** | RL from AI Feedback (AI replaces human annotators) |
| **SFT** | Supervised Fine-Tuning on instruction-response pairs |
| **DPO** | Direct Preference Optimization — no reward model needed |
| **GRPO** | Group Relative Policy Optimization — for reasoning |
| **PPO** | Proximal Policy Optimization — the RL algorithm in RLHF |
| **Reward Model** | Scores responses based on human preferences |
| **Policy** | The language model being trained (RL terminology) |
| **Reference Model (π_ref)** | SFT model used as baseline |
| **KL Divergence** | Measures how far the policy drifts from the reference |
| **Bradley-Terry** | Preference model: P(A>B) = σ(score(A) - score(B)) |
| **Reward Hacking** | Model exploits the reward model instead of improving |
| **LoRA** | Low-Rank Adaptation — parameter-efficient fine-tuning |
| **QLoRA** | 4-bit quantized base model + LoRA adapters |
| **Chat Template** | Format for structuring conversations (special tokens + roles) |
| **On-policy** | Training on data from the *current* model |
| **Off-policy** | Training on data from a *different* model |
| **Advantage** | How much better an action is vs. the expected value |
## Dataset Formats by Trainer
```python
# SFT
{"messages": [{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}]}
# DPO / ORPO / CPO
{"prompt": "...", "chosen": "...", "rejected": "..."}
# GRPO / RLOO / Online DPO
{"prompt": "..."}
# KTO
{"prompt": "...", "completion": "...", "label": True}
# Reward Model
{"prompt": "...", "chosen": "...", "rejected": "..."}
# PRM (Process Reward Model)
{"prompt": "...", "completions": ["step1", "step2"],
"labels": [True, False]}
```
"""
SEARCH_CONTENT = {
"Overview & Roadmap": OVERVIEW,
"Ch 1: The Big Picture": CH1_BIG_PICTURE,
"Ch 2: SFT": CH2_SFT,
"Ch 3: RLHF": CH3_RLHF,
"Ch 4: DPO": CH4_DPO,
"Ch 5: Preference Zoo": CH5_PREFERENCE_ZOO,
"Ch 6: GRPO & Reasoning": CH6_GRPO,
"Ch 7: PEFT (LoRA)": CH7_PEFT,
"Ch 8: Toolbox": CH8_TOOLBOX,
"Ch 9: Datasets": CH9_DATASETS,
"Ch 10: Evaluation": CH10_EVALUATION,
"Ch 11: Full Recipe": CH11_RECIPE,
"Ch 12: Reading List": CH12_READING_LIST,
"Glossary & Reference": GLOSSARY,
}
def search_guide(query):
if not query or not query.strip():
return "Type a keyword to search across all chapters (e.g., *DPO*, *reward*, *LoRA*, *dataset*)."
query_lower = query.lower().strip()
results = []
for title, content in SEARCH_CONTENT.items():
lines = content.split("\n")
for i, line in enumerate(lines):
if query_lower in line.lower():
start = max(0, i - 1)
end = min(len(lines), i + 4)
snippet = "\n".join(lines[start:end])
results.append(f"### 📍 {title}\n\n{snippet}\n\n---\n")
break # one match per chapter
if results:
return f"## Found in {len(results)} chapter(s)\n\n" + "\n".join(results)
return f"No results for **'{query}'**. Try: *SFT*, *DPO*, *GRPO*, *LoRA*, *reward*, *PPO*, *dataset*, *evaluation*."
# ============================================================
# GRADIO UI
# ============================================================
CUSTOM_CSS = """
.chapter-content {
max-width: 900px;
margin: 0 auto;
font-size: 16px;
line-height: 1.7;
}
.gradio-container {
max-width: 1100px !important;
}
footer { display: none !important; }
"""
THEME = gr.themes.Soft(
primary_hue="blue",
secondary_hue="slate",
font=[gr.themes.GoogleFont("Inter"), "system-ui", "sans-serif"],
)
with gr.Blocks(title="Post-Training Guide for LLMs") as demo:
gr.HTML("""
<div style="text-align:center; padding: 20px 0 5px 0;">
<h1 style="margin-bottom:4px;">📖 The Complete Guide to Post-Training of LLMs</h1>
<p style="color:#666; font-size:17px; margin-top:0;">
From Pretraining to Alignment — SFT · RLHF · DPO · GRPO · LoRA and more
</p>
</div>
""")
with gr.Tabs():
with gr.Tab("🏠 Overview"):
gr.Markdown(OVERVIEW, elem_classes="chapter-content")
with gr.Tab("1️⃣ Big Picture"):
gr.Markdown(CH1_BIG_PICTURE, elem_classes="chapter-content")
with gr.Tab("2️⃣ SFT"):
gr.Markdown(CH2_SFT, elem_classes="chapter-content")
with gr.Tab("3️⃣ RLHF"):
gr.Markdown(CH3_RLHF, elem_classes="chapter-content")
with gr.Tab("4️⃣ DPO"):
gr.Markdown(CH4_DPO, elem_classes="chapter-content")
with gr.Tab("5️⃣ Preference Zoo"):
gr.Markdown(CH5_PREFERENCE_ZOO, elem_classes="chapter-content")
with gr.Tab("6️⃣ GRPO"):
gr.Markdown(CH6_GRPO, elem_classes="chapter-content")
with gr.Tab("7️⃣ PEFT"):
gr.Markdown(CH7_PEFT, elem_classes="chapter-content")
with gr.Tab("8️⃣ Toolbox"):
gr.Markdown(CH8_TOOLBOX, elem_classes="chapter-content")
with gr.Tab("9️⃣ Datasets"):
gr.Markdown(CH9_DATASETS, elem_classes="chapter-content")
with gr.Tab("🔟 Evaluation"):
gr.Markdown(CH10_EVALUATION, elem_classes="chapter-content")
with gr.Tab("📋 Full Recipe"):
gr.Markdown(CH11_RECIPE, elem_classes="chapter-content")
with gr.Tab("📚 Reading List"):
gr.Markdown(CH12_READING_LIST, elem_classes="chapter-content")
with gr.Tab("📖 Glossary"):
gr.Markdown(GLOSSARY, elem_classes="chapter-content")
with gr.Tab("🔍 Search"):
gr.Markdown("## Search the Guide\nFind any topic across all chapters.")
search_input = gr.Textbox(
placeholder="Type a keyword (e.g. DPO, reward, LoRA, dataset, PPO)...",
label="Search",
lines=1,
)
search_output = gr.Markdown(elem_classes="chapter-content")
search_input.submit(search_guide, inputs=search_input, outputs=search_output)
search_input.change(search_guide, inputs=search_input, outputs=search_output)
gr.HTML("""
<div style="text-align:center; padding:15px; color:#888; font-size:13px; border-top:1px solid #eee; margin-top:20px;">
Built from primary research papers & official HF documentation · All code uses TRL v1.2+ APIs ·
<a href="https://huggingface.co/docs/trl" target="_blank">TRL Docs</a> ·
<a href="https://github.com/huggingface/trl" target="_blank">TRL GitHub</a> ·
Last updated April 2026
</div>
""")
demo.launch(theme=THEME, css=CUSTOM_CSS)