import gradio as gr

# ============================================================
# CONTENT — each chapter stored as a Python string (Markdown)
# ============================================================

OVERVIEW = r"""
# 📖 The Complete Guide to Post-Training of Large Language Models
### From Pretraining to Alignment — Everything You Need to Know

---

> **Who is this for?** You've learned how pretraining works — you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens *after* pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini.

---

## 🗺️ Roadmap

| # | Chapter | What You'll Learn |
|---|---------|-------------------|
| 1 | **The Big Picture** | Why pretrained models aren't useful yet; the 3-stage pipeline |
| 2 | **SFT** | Supervised Fine-Tuning — loss function, data formats, key papers |
| 3 | **RLHF** | Reward models, PPO, KL divergence, reward hacking |
| 4 | **DPO** | Direct Preference Optimization — RLHF without RL |
| 5 | **Preference Zoo** | KTO, ORPO, SimPO, CPO, IPO, Online DPO |
| 6 | **GRPO & Reasoning** | DeepSeek-R1, reward functions, the reasoning revolution |
| 7 | **PEFT** | LoRA, QLoRA — fine-tune on consumer GPUs |
| 8 | **Toolbox** | TRL, Transformers, vLLM, Accelerate, DeepSpeed |
| 9 | **Datasets** | What to train on — curated lists with Hub links |
| 10 | **Evaluation** | Benchmarks, LLM-as-Judge, human eval |
| 11 | **Full Recipe** | End-to-end pipeline with code |
| 12 | **Reading List** | 18 must-read papers in 4 tiers |

---

### The Three Stages of Post-Training

```
┌─────────────┐     ┌──────────────────┐     ┌─────────────────────────┐
│ STAGE 1: SFT │ ──> │ STAGE 2: Reward  │ ──> │ STAGE 3: RL             │
│              │     │ Model Training   │     │ (PPO / DPO / GRPO)      │
│ Teach format │     │ Learn preferences│     │ Optimize for preferences│
│ & behavior   │     │ from comparisons │     │ while staying close to  │
│              │     │                  │     │ the SFT model           │
└─────────────┘     └──────────────────┘     └─────────────────────────┘

Input: Pretrained LM                         Output: Aligned Assistant
```

### The Evolution Timeline

| Year | Method | Key Idea |
|------|--------|----------|
| 2017 | RLHF (original) | Human preferences → reward model → RL |
| 2020 | RLHF for LLMs | Applied to text summarization |
| 2022 | **InstructGPT** | Full SFT → RM → PPO pipeline |
| 2022 | Constitutional AI | AI feedback replaces human feedback |
| 2023 | **DPO** | No reward model needed — direct optimization |
| 2024 | KTO / ORPO | Binary feedback / combined SFT+preference |
| 2024 | **GRPO** | Group-based RL for reasoning (DeepSeek) |
| 2025 | DeepSeek-R1 | Large-scale RL elicits chain-of-thought reasoning (R1-Zero: no SFT at all) |
"""

CH1_BIG_PICTURE = r"""
# Chapter 1: The Big Picture — Why Post-Training Exists

## 1.1 The Gap Between Pretraining and Usefulness

You've pretrained a language model. It can predict the next token with impressive accuracy. It has absorbed vast knowledge from the internet. But try asking it a question:

```
User: What is the capital of France?
Model: What is the capital of Germany? What is the capital of Italy? What is the...
```

The model doesn't *answer* — it *continues*. The pretraining objective (`P(next_token | context)`) optimizes for predicting what comes next in web text, not for being helpful.

> *"Large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users."*
> — **InstructGPT** (Ouyang et al., 2022)

## 1.2 What Post-Training Does

Post-training is everything after pretraining that makes a model useful, safe, and aligned. It has three stages:

**Stage 1 — SFT (Supervised Fine-Tuning):** Teach the model the *format* of helpful responses using demonstrations.

**Stage 2 — Reward Modeling:** Train a model to predict which response a human would prefer.

**Stage 3 — RL Optimization:** Optimize the SFT model to generate responses the reward model scores highly.

## 1.3 The Superficial Alignment Hypothesis

> *"A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users."*
> — **LIMA** (Zhou et al., 2023)

If the hypothesis holds, **post-training mostly doesn't teach new knowledge**: it teaches the model to *surface existing knowledge in the right way*. The pretrained model already "knows" the capital of France; SFT teaches it to respond to questions rather than generate more questions.

## 1.4 Why This Matters

The difference is dramatic:
- **InstructGPT 1.3B** (post-trained) was preferred over **GPT-3 175B** (pretrained only) by human evaluators
- That's a model more than 100× smaller (175B / 1.3B ≈ 135×) winning because of post-training
- Post-training is what turns a text predictor into an assistant
"""

CH2_SFT = r"""
# Chapter 2: Supervised Fine-Tuning (SFT)

## 2.1 What SFT Does

SFT is the bridge between a pretrained language model and a useful assistant.

**Before SFT:**
```
Input:  "Explain quantum computing in simple terms."
Output: "Explain quantum computing to a 5-year-old. Explain quantum computing..."
```

**After SFT:**
```
Input:  "Explain quantum computing in simple terms."
Output: "Quantum computing uses the principles of quantum mechanics to process
         information. Unlike classical computers that use bits (0 or 1),
         quantum computers use qubits that can be both 0 and 1 simultaneously..."
```

## 2.2 The SFT Loss Function

If you understand pretraining, you understand SFT — with one crucial difference:

**Pretraining loss** (on ALL tokens):
```
L_pretrain = -Σ log P(token_i | token_1, ..., token_{i-1})   for ALL tokens
```

**SFT loss** (on RESPONSE tokens only):
```
L_SFT = -Σ log P(c_i | prompt, c_1, ..., c_{i-1})           for COMPLETION tokens only
```

The prompt tokens are masked from the loss. We don't want the model to learn to generate instructions — we want it to learn to *respond* to them.

```
Sequence:  [User: What is 2+2?] [Assistant: 4]
Loss mask: [  ----IGNORED----  ] [COMPUTED HERE ]
```
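As a minimal sketch of the masking above (plain Python, made-up per-token log-probabilities, and a hypothetical helper name `masked_nll`), the only difference between the two losses is which positions are counted. The sketch averages over the kept tokens instead of summing, which is a common convention but an assumption here.

```python
import math

def masked_nll(token_logprobs, loss_mask):
    # Average negative log-likelihood over positions where the mask is 1
    kept = [-lp for lp, m in zip(token_logprobs, loss_mask) if m == 1]
    if not kept:
        raise ValueError("loss mask selects no tokens")
    return sum(kept) / len(kept)

# Hypothetical log-probabilities for three prompt tokens and one response token
token_logprobs = [math.log(0.2), math.log(0.1), math.log(0.5), math.log(0.9)]
sft_mask = [0, 0, 0, 1]        # prompt ignored, response counted
pretrain_mask = [1, 1, 1, 1]   # every token counted

loss_sft = masked_nll(token_logprobs, sft_mask)
loss_pretrain = masked_nll(token_logprobs, pretrain_mask)
print(loss_sft, loss_pretrain)
```

With the SFT mask, a model that is poor at predicting the prompt is not penalized for it; only the response token contributes.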

## 2.3 Chat Templates & Data Format

Modern SFT uses structured conversations:

```python
{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
    ]
}
```

Each model family converts this to its own template:

```
# ChatML (Qwen, etc.):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

# Llama-3:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>
```

The `transformers` library handles this automatically:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"}
]
# For training:
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# For inference:
text = tokenizer.apply_chat_template(messages[:1], tokenize=False, add_generation_prompt=True)
```
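To make the ChatML layout shown earlier concrete, here is a hand-written renderer. It is only a sketch: `render_chatml` is a hypothetical helper, and real chat templates are Jinja templates shipped with each tokenizer that may add details (for example a default system prompt) this version ignores.

```python
def render_chatml(messages, add_generation_prompt=False):
    # Each turn: <|im_start|>{role}\n{content}<|im_end|>\n
    parts = []
    for message in messages:
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
training_text = render_chatml(messages)                                   # full conversation
inference_text = render_chatml(messages[:1], add_generation_prompt=True)  # prompt only
print(inference_text)
```

The training string ends with the assistant's closed turn, while the inference string ends with an open assistant header, which mirrors the `add_generation_prompt` flag above.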

## 2.4 Key SFT Papers

### FLAN (2021) — Instruction Tuning Works
📄 *"Finetuned Language Models Are Zero-Shot Learners"* — [arXiv:2109.01652](https://arxiv.org/abs/2109.01652)

- 62 NLP datasets formatted as instructions → fine-tune LaMDA-PT 137B
- **Result:** Surpassed zero-shot GPT-3 175B on 20/25 tasks
- **Key insight:** Instructions matter — same tasks without instructions = much weaker
- **Recipe:** Adafactor, lr=3e-5, 30K steps, batch 8192 tokens

### Self-Instruct (2022) — Synthetic Data
📄 *"Self-Instruct"* — [arXiv:2212.10560](https://arxiv.org/abs/2212.10560)

- 175 seed tasks → GPT-3 generates 52K instructions + responses
- **Result:** +33% over vanilla GPT-3 on SuperNaturalInstructions
- **Led to:** Stanford Alpaca (LLaMA + 52K GPT instructions for <$600)

### InstructGPT (2022) — SFT as Stage 1
📄 *"Training Language Models to Follow Instructions with Human Feedback"* — [arXiv:2203.02155](https://arxiv.org/abs/2203.02155)

- ~13K human demonstrations, 16 epochs, cosine LR, dropout 0.2
- **Key finding:** SFT overfits on val loss after 1 epoch, but more epochs ↑ RM score
- **Result:** 1.3B InstructGPT preferred over 175B GPT-3

### LIMA (2023) — Less Is More
📄 *"LIMA: Less Is More for Alignment"* — [arXiv:2305.11206](https://arxiv.org/abs/2305.11206)

- Just **1,000 curated examples** → competitive with GPT-3.5 (DaVinci003)
- **Recipe:** AdamW, lr 1e-5→1e-6, 15 epochs, batch 32, max len 2048
- **Takeaway:** Data quality >> data quantity

## 2.5 SFT Code Example

```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    max_length=2048,            # called max_seq_length in older TRL releases
    gradient_checkpointing=True,
    bf16=True,
    logging_steps=10,
    push_to_hub=True,
    hub_model_id="your-username/your-sft-model",
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

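SFTTrainer automatically detects the `messages` column and applies the chat template. Whether prompt tokens are masked depends on the TRL version and data format: in recent releases, `prompt`+`completion` data is trained on completion tokens only by default, while `messages` data is trained on all tokens unless `assistant_only_loss=True` is set in `SFTConfig`.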
"""

CH3_RLHF = r"""
# Chapter 3: Reinforcement Learning from Human Feedback (RLHF)

## 3.1 Why SFT Isn't Enough

SFT teaches format and basic behavior, but:
- It only imitates demonstrations, so output quality is bounded by the quality of the training data
- It has no notion of preference: it never sees that one response is better than another
- It can learn bad habits from imperfect training data

RLHF addresses this by training on **which outputs are better**, not what specific tokens to generate.

## 3.2 Step 1: Train a Reward Model

A **reward model (RM)** scores how good a response is (prompt + response → scalar score).

**Training process:**
1. Generate multiple responses per prompt using the SFT model
2. Humans rank responses (e.g., A > B)
3. Train RM to predict these rankings

**The Bradley-Terry preference model:**
```
P(A preferred over B) = σ(r(A) - r(B))

Loss: L_RM = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]
```
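A small numeric sketch of the Bradley-Terry formulas above, in plain Python. The reward values are made up, and the helper names are hypothetical; a real reward model would produce the scalars from a neural network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_probability(r_a, r_b):
    # P(A preferred over B) = sigmoid(r(A) - r(B))
    return sigmoid(r_a - r_b)

def reward_model_loss(pairs):
    # pairs: list of (r_chosen, r_rejected); mean of -log sigmoid(difference)
    losses = [-math.log(sigmoid(rc - rr)) for rc, rr in pairs]
    return sum(losses) / len(losses)

pairs = [(2.0, 0.5), (0.1, 0.3), (1.0, 1.0)]
print(preference_probability(2.0, 0.5))
print(reward_model_loss(pairs))
```

Only the difference between the two scores matters, so adding a constant to every reward leaves both the probability and the loss unchanged.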

**Architecture:** Same as the LM, but output head → single scalar. InstructGPT used a 6B RM (not 175B — larger was unstable).

## 3.3 Step 2: Optimize with PPO

**The RLHF objective:**
```
maximize  E[RM(prompt, response)]  -  β · KL(π_θ || π_ref)
           ↑ score high              ↑ don't deviate too far
           on reward model            from original SFT model
```

The **KL penalty** limits **reward hacking**: without it, the policy can drift toward degenerate text that scores well on the RM but that humans would not actually prefer.
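A sketch of how the penalty changes which response wins, assuming the KL term is estimated per sample as `log π_θ(y|x) - log π_ref(y|x)` and using the β = 0.02 that Section 3.4 reports for InstructGPT. The scores and log-probabilities are invented for illustration.

```python
def penalized_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    # Single-sample estimate of KL(pi_theta || pi_ref) for this response
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate

# Response that games the RM: high score, but very unlikely under the reference model
drifted = penalized_reward(rm_score=5.0, logp_policy=-10.0, logp_ref=-200.0)
# Response that stays close to the reference model with a lower raw score
on_distribution = penalized_reward(rm_score=3.0, logp_policy=-40.0, logp_ref=-42.0)
print(drifted, on_distribution)
```

The raw RM score prefers the first response, but after the penalty the second one has the higher objective value.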

**PPO (Proximal Policy Optimization) loop:**
1. **Generate** responses from the current model
2. **Score** them with the reward model
3. **Compute advantage** (how much better than expected)
4. **Update** weights to favor high-advantage responses
5. **Clip** updates for stability

```
L_PPO = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)]
where ratio = π_θ(a|s) / π_old(a|s)
```
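The clipped objective can be sketched in a few lines of plain Python. This is an illustration over per-sample log-probabilities and advantages (hypothetical inputs), not a full PPO implementation: there is no value function, no advantage estimation, and no optimizer.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # pi_theta / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))  # pessimistic of the two
    return -sum(terms) / len(terms)

# Ratio of 2 with a positive advantage: the gain is capped at 1 + eps
print(ppo_clip_loss([math.log(2.0)], [0.0], [1.0]))
# Ratio of 2 with a negative advantage: no cap, the full penalty applies
print(ppo_clip_loss([math.log(2.0)], [0.0], [-1.0]))
```

The asymmetry is the point of the `min`: the policy gets no extra credit for moving far in a good direction, but is still fully penalized for moving far in a bad one.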

## 3.4 InstructGPT Training Details

- β = 0.02 for KL penalty
- Pretraining gradients mixed into the PPO updates (the PPO-ptx variant) to limit capability regression; the SFT initialization also mixed in 10% pretraining data
- LR range: 2.55e-6 to 2.55e-5 (>8.05e-6 diverged)
- 256K PPO episodes total
- 4 models in memory simultaneously: policy, reference, reward, value

## 3.5 Why RLHF is Hard

| Challenge | Description |
|-----------|-------------|
| **Complexity** | 4 models in memory simultaneously |
| **Instability** | PPO is sensitive to hyperparameters |
| **Reward hacking** | Model exploits RM rather than genuinely improving |
| **Cost** | Human preference data is expensive |
| **Reproducibility** | Small changes → very different outcomes |

These challenges motivated the development of DPO.

## 3.6 Constitutional AI (RLAIF)

📄 *"Constitutional AI"* (Bai et al., 2022) — [arXiv:2212.08073](https://arxiv.org/abs/2212.08073)

**Key idea:** Replace human feedback with **AI feedback**. An AI evaluates responses against principles (the "constitution"). Dramatically reduces cost and scales the feedback process.
"""

CH4_DPO = r"""
# Chapter 4: Direct Preference Optimization (DPO)

## 4.1 The Key Insight

📄 *"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"* — [arXiv:2305.18290](https://arxiv.org/abs/2305.18290)

**You don't need a separate reward model or RL training loop.** The language model itself implicitly represents a reward model.

The optimal solution to the RLHF objective can be expressed in closed form:

```
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x,y))
```

Rearranging to express reward in terms of the policy:

```
r(x,y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
```

Since Bradley-Terry only uses **reward differences**, the partition function Z(x) cancels:

```
L_DPO = -E[log σ(β · log(π_θ(y_w|x)/π_ref(y_w|x)) - β · log(π_θ(y_l|x)/π_ref(y_l|x)))]
```
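A minimal sketch of the loss for a single preference pair, assuming the four inputs are summed sequence log-probabilities that have already been computed (the numbers below are invented). The helper names are hypothetical.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # pi_*: log pi_theta(y|x), ref_*: log pi_ref(y|x); w = chosen, l = rejected
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -log_sigmoid(margin)

at_init = dpo_loss(pi_w=-10.0, pi_l=-12.0, ref_w=-10.0, ref_l=-12.0)
improved = dpo_loss(pi_w=-8.0, pi_l=-14.0, ref_w=-10.0, ref_l=-12.0)
print(at_init, improved)
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2; raising the chosen response and lowering the rejected one relative to the reference reduces it.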

## 4.2 DPO vs RLHF

| Aspect | RLHF (PPO) | DPO |
|--------|-------------|-----|
| Models in memory | 4 | 2 (policy + reference) |
| Training loop | Complex RL with generation | Simple supervised training |
| Hyperparameters | Many | Few (mainly β) |
| Stability | Often unstable | Generally stable |
| Sampling during training | Required | Not required |
| Performance | Strong | Often comparable (task-dependent) |

## 4.3 The DPO Gradient — Intuition

```
∇L_DPO ∝ -β · [weight] · [∇log π(y_w|x) - ∇log π(y_l|x)]
```

- **Increase** likelihood of preferred response y_w
- **Decrease** likelihood of rejected response y_l
- **Weight** by how wrong the model currently is

If the model already prefers y_w correctly → small gradient (don't fix what isn't broken).
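The `[weight]` term can be made concrete. In the DPO paper it is the sigmoid of the implicit-reward gap in the *wrong* direction, `σ(r̂(y_l) - r̂(y_w))` with `r̂ = β · log(π_θ/π_ref)`. The sketch below uses invented log-probabilities and a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_gradient_weight(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    reward_w = beta * (pi_w - ref_w)   # implicit reward of the chosen response
    reward_l = beta * (pi_l - ref_l)   # implicit reward of the rejected response
    return sigmoid(reward_l - reward_w)

already_right = dpo_gradient_weight(pi_w=-5.0, pi_l=-40.0, ref_w=-10.0, ref_l=-12.0)
undecided = dpo_gradient_weight(pi_w=-10.0, pi_l=-12.0, ref_w=-10.0, ref_l=-12.0)
currently_wrong = dpo_gradient_weight(pi_w=-40.0, pi_l=-5.0, ref_w=-10.0, ref_l=-12.0)
print(already_right, undecided, currently_wrong)
```

The weight is 0.5 when the policy matches the reference, shrinks toward 0 for pairs the policy already orders correctly, and grows toward 1 for pairs it orders incorrectly.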

## 4.4 DPO Data Format

```python
# Each example needs: prompt + chosen + rejected
{
    "prompt": [{"role": "user", "content": "Explain gravity"}],
    "chosen": [{"role": "assistant", "content": "Gravity is a fundamental force..."}],
    "rejected": [{"role": "assistant", "content": "Gravity is when things fall down."}]
}
```

## 4.5 DPO Code Example

```python
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,      # Much lower than SFT!
    beta=0.1,                # KL penalty strength
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/your-dpo-model",
)

trainer = DPOTrainer(
    model="your-sft-model",   # SFT model from stage 1
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

## 4.6 Key Hyperparameters

| Parameter | Typical Range | Notes |
|-----------|--------------|-------|
| **β (beta)** | 0.01 – 0.5 | Higher = stay closer to reference. Default: 0.1 |
| **Learning rate** | 1e-7 – 5e-6 | Much lower than SFT. Very sensitive. |
| **Epochs** | 1 – 3 | Overfitting is common with more |
"""

CH5_PREFERENCE_ZOO = r"""
# Chapter 5: The Preference Optimization Zoo

After DPO, researchers developed many variants. Here's what each adds.

## 5.1 IPO — Identity Preference Optimization

**Problem:** DPO can overfit, especially with noisy preferences.

**Solution:** Adds regularization without assuming the Bradley-Terry model:
```
L_IPO = E[(log(π_θ(y_w)/π_ref(y_w)) - log(π_θ(y_l)/π_ref(y_l)) - 1/(2β))²]
```

**When to use:** Noisy preference data, DPO overfitting.
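A sketch of the IPO loss for one pair, under the same assumption as before that the inputs are precomputed sequence log-probabilities (invented values, hypothetical helper name). It shows why the squared form acts as a regularizer.

```python
def ipo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # h: gap between the chosen and rejected log-ratios
    h = (pi_w - ref_w) - (pi_l - ref_l)
    target = 1.0 / (2.0 * beta)
    return (h - target) ** 2

print(ipo_loss(-10.0, -12.0, -10.0, -12.0))   # h = 0, below the target gap
print(ipo_loss(-7.5, -14.5, -10.0, -12.0))    # h = 5, exactly on target
print(ipo_loss(-5.0, -17.0, -10.0, -12.0))    # h = 10, past the target
```

Unlike the DPO loss, which keeps decreasing as the gap grows, the IPO loss rises again once the gap overshoots `1/(2β)`, so the policy is not pushed to separate the pair without bound.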

## 5.2 KTO — Kahneman-Tversky Optimization
📄 [arXiv:2402.01306](https://arxiv.org/abs/2402.01306)

**Problem:** DPO needs *paired* preferences (chosen AND rejected per prompt). Expensive to collect.

**Solution:** Works with **unpaired binary feedback** — just "good" or "bad" per response. Based on prospect theory: losses hurt more than equivalent gains.

```python
{"prompt": "...", "completion": "...", "label": True}   # 👍
{"prompt": "...", "completion": "...", "label": False}  # 👎
```

**When to use:** Thumbs-up/down feedback, no pairwise comparisons.

## 5.3 ORPO — Odds Ratio Preference Optimization

**Problem:** DPO needs a separate SFT stage + reference model.

**Solution:** Combines SFT and preference optimization in **one step**:
```
L_ORPO = L_SFT + λ · L_OR
```

**When to use:** Simplest possible pipeline.
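A sketch of the two terms, following the ORPO paper's definition of odds on the length-normalized likelihood: `odds = p / (1 - p)` with `p = exp(average token log-probability)`, and `L_OR = -log σ(log odds(y_w) - log odds(y_l))`. The inputs are invented, and the weight `lam=0.1` is only an illustrative value.

```python
import math

def log_sigmoid(x):
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_odds(avg_logp):
    # log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    sft_term = -avg_logp_chosen   # NLL on the chosen response
    or_term = -log_sigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))
    return sft_term + lam * or_term

print(orpo_loss(-0.5, -0.5))   # no separation between chosen and rejected
print(orpo_loss(-0.3, -1.2))   # chosen is more likely than rejected
```

No reference-model log-probabilities appear anywhere in the computation, which is why ORPO needs neither a reference model nor a separate SFT stage.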

## 5.4 SimPO — Simple Preference Optimization

**Problem:** DPO needs a reference model in memory (doubles GPU needs).

**Solution:** Eliminates reference model. Uses **average log probability** as the implicit reward (length-normalized).

**When to use:** GPU memory constraints.
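A sketch of the SimPO loss for one pair. Beyond the average log-probability reward described above, the SimPO paper also subtracts a target margin γ inside the sigmoid; the `beta` and `gamma` defaults below are illustrative values, and the inputs are invented.

```python
import math

def log_sigmoid(x):
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def simpo_loss(sum_logp_w, len_w, sum_logp_l, len_l, beta=2.0, gamma=1.0):
    reward_w = beta * sum_logp_w / len_w   # length-normalized implicit reward
    reward_l = beta * sum_logp_l / len_l
    return -log_sigmoid(reward_w - reward_l - gamma)

# Same per-token log-probability (-1.0), different lengths: rewards are equal
same_quality = simpo_loss(sum_logp_w=-10.0, len_w=10, sum_logp_l=-20.0, len_l=20)
# Chosen response is more likely per token than the rejected one
separated = simpo_loss(sum_logp_w=-5.0, len_w=10, sum_logp_l=-40.0, len_l=20)
print(same_quality, separated)
```

Because the reward is an average, a response is not favored or penalized merely for being longer, and no reference log-probabilities are needed.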

## 5.5 CPO — Contrastive Preference Optimization

Removes reference model using a contrastive loss. Similar to SimPO, different formulation.

## 5.6 Online DPO

**Problem:** Standard DPO uses a static dataset — data goes stale as the model changes.

**Solution:** Generates new completions from the *current* model during training, scored by a reward model.

## 5.7 Comparison Table

| Method | Ref Model? | Paired Data? | Needs RM? | Separate SFT? | Key Advantage |
|--------|-----------|-------------|-----------|---------------|---------------|
| PPO | Yes | No | **Yes** | Yes | Gold standard |
| DPO | Yes | **Yes** | No | Yes | Simple, stable |
| IPO | Yes | Yes | No | Yes | Robust to noise |
| KTO | Yes | **No** | No | Yes | Binary feedback |
| ORPO | **No** | Yes | No | **No** | Simplest pipeline |
| SimPO | **No** | Yes | No | Yes | Memory efficient |
| CPO | **No** | Yes | No | Yes | Memory efficient |
| Online DPO | Yes | Online | **Yes** | Yes | Fresh data |
| GRPO | Yes (soft) | No | **Yes** / funcs | Yes | Best for reasoning |
"""

CH6_GRPO = r"""
# Chapter 6: GRPO and the Reasoning Revolution

## 6.1 What is GRPO?

📄 *"DeepSeekMath"* (Shao et al., 2024) — [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)

**Group Relative Policy Optimization** — a PPO variant that's more memory-efficient and great for reasoning.

**Key idea:** Instead of a separate value model (critic), GRPO generates **multiple completions per prompt** and uses the group average as the baseline.

## 6.2 How GRPO Works

```
For each prompt:
  1. Generate G completions (e.g., G=16)
  2. Score each with a reward function
  3. Compute advantage:  Â_i = (r_i - mean(r)) / std(r)
  4. Update model: ↑ probability of high-advantage completions
                   ↓ probability of low-advantage completions
```
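Step 3 above is easy to sketch in plain Python. The version below uses the population standard deviation and adds a small `eps` to avoid dividing by zero when every completion in the group gets the same reward; both choices are assumptions of this sketch, and `group_advantages` is a hypothetical helper name.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    # Normalize each reward against the other completions for the same prompt
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

mixed_group = group_advantages([1.0, 0.0, 0.0, 1.0])    # two correct, two wrong
uniform_group = group_advantages([1.0, 1.0, 1.0, 1.0])  # all correct
print(mixed_group)
print(uniform_group)
```

A group where every completion scores the same produces zero advantages, so prompts that are always solved (or never solved) contribute no learning signal.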

**The GRPO loss:**
```
L = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)] + β · KL
```

"Group Relative" = advantage computed *relative to the group* for the same prompt.

## 6.3 The DeepSeek-R1 Story

📄 *"DeepSeek-R1"* (DeepSeek-AI, 2025) — [arXiv:2501.12948](https://arxiv.org/abs/2501.12948)

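GRPO was used to train DeepSeek-R1. Its precursor, DeepSeek-R1-Zero, applied RL directly to the base model with no SFT stage and still learned chain-of-thought reasoning **purely through RL**; DeepSeek-R1 itself adds a small cold-start SFT set and further training stages around the RL.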

With the right reward (accuracy on math/code), the model **spontaneously developed:**
- Chain-of-thought reasoning
- Self-verification ("Let me check...")
- Error correction
- Problem decomposition

No one explicitly taught these behaviors — they emerged from the reward signal.

## 6.4 Reward Functions

GRPO is flexible — rewards can be:

1. **Python functions** (rule-based): Is the math answer correct? Does the code pass tests?
2. **Reward models** (learned): A neural network scoring responses
3. **Multiple functions**: accuracy + format + length

```python
import re

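def accuracy_reward(completions, ground_truth, **kwargs):
    # Extract the contents of the first \boxed{...}. The match is non-greedy, so
    # nested braces such as \boxed{\frac{1}{2}} are cut at the first closing brace.
    matches = [re.search(r"\\boxed\{(.*?)\}", c) for c in completions]
    contents = [m.group(1) if m else "" for m in matches]
    # Exact string comparison against the reference answer
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]

def format_reward(completions, **kwargs):
    # The completion must start with a <think> block, later followed by an <answer> block
    pattern = r"<think>.*?</think>.*<answer>.*?</answer>"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]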
```
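For the "multiple functions" case, one simple option is a weighted sum of the per-completion scores. The sketch below is hand-rolled for illustration: `combine_rewards`, `boxed_answer_reward`, and `length_penalty` are hypothetical helpers, and the weights and the 200-character limit are arbitrary.

```python
import re

def boxed_answer_reward(completions, ground_truth, **kwargs):
    scores = []
    for text, truth in zip(completions, ground_truth):
        match = re.search(r"\\boxed\{(.*?)\}", text)
        scores.append(1.0 if match and match.group(1) == truth else 0.0)
    return scores

def length_penalty(completions, max_chars=200, **kwargs):
    # Small negative reward for overly long completions
    return [0.0 if len(text) <= max_chars else -0.5 for text in completions]

def combine_rewards(reward_funcs, weights, completions, **kwargs):
    totals = [0.0] * len(completions)
    for func, weight in zip(reward_funcs, weights):
        for i, score in enumerate(func(completions, **kwargs)):
            totals[i] += weight * score
    return totals

completions = [r"<think>2+2</think> \boxed{4}", r"\boxed{5}" + " padding" * 50]
totals = combine_rewards([boxed_answer_reward, length_penalty], [1.0, 1.0],
                         completions, ground_truth=["4", "4"])
print(totals)
```

Every function accepts `**kwargs`, so extra dataset columns such as `ground_truth` can be passed to all of them and ignored by those that do not need them.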

## 6.5 GRPO Code Example

```python
from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

config = GRPOConfig(
    output_dir="./grpo-output",
    learning_rate=1e-6,
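    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # recent TRL needs the generation batch (4 × 4 on one GPU) divisible by num_generations
    num_generations=16,        # G completions per prompt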
    max_completion_length=512,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```
"""

CH7_PEFT = r"""
# Chapter 7: Parameter-Efficient Fine-Tuning (PEFT)

## 7.1 The Memory Problem

Fine-tuning 7B parameters requires:

| Component | Memory |
|-----------|--------|
| Model weights (bf16) | 14 GB |
| Gradients | 14 GB |
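| Optimizer states (AdamW, two moments in bf16) | 28 GB |
| Activations | 10-30 GB |
| **Total** | **~66-86 GB** |

That fills a single 80 GB A100 for SFT alone, and keeping the AdamW moments in fp32 (56 GB) pushes it past one GPU. RLHF with PPO keeps four models resident (policy, value, reward, reference), so it needs several times more memory.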
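The first three rows of the table are per-parameter arithmetic. A sketch, assuming 2 bytes per parameter for bf16 weights and gradients and 4 bytes per parameter for the two AdamW moments (which is what the 28 GB figure implies); activations are left out because they depend on batch size and sequence length. The function name is hypothetical.

```python
def full_finetune_memory_gb(n_params, bytes_weights=2, bytes_grads=2, bytes_optimizer=4):
    gb = 1e9
    return {
        "weights": n_params * bytes_weights / gb,
        "gradients": n_params * bytes_grads / gb,
        "optimizer": n_params * bytes_optimizer / gb,
    }

mem_7b = full_finetune_memory_gb(7e9)
mem_7b_fp32_adam = full_finetune_memory_gb(7e9, bytes_optimizer=8)  # fp32 moments
print(mem_7b, sum(mem_7b.values()))
print(mem_7b_fp32_adam["optimizer"])
```

Changing `bytes_optimizer` to 8 models fp32 AdamW moments, which doubles the optimizer row.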

## 7.2 LoRA: Low-Rank Adaptation

📄 [arXiv:2106.09685](https://arxiv.org/abs/2106.09685)

**Insight:** Fine-tuning weight updates have low rank — approximate them with small matrices.

```
W' = W + (α / r) · B × A

W: original frozen weight   (d × d)     (NOT trained)
A: down projection          (r × d)     (trained)
B: up projection            (d × r)     (trained; initialized to zero)
r: rank (typically 8-32)    r << d
```

**Example:** 4096 × 4096 weight matrix:
- Full fine-tuning: 4096 × 4096 = **16.8M** parameters
- LoRA (r=16): 2 × 4096 × 16 = **131K** parameters (128× fewer!)
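The parameter count, and the low-rank update itself, can be checked with a few lines of plain Python. This sketch follows the LoRA paper's convention: A has shape r × d, B has shape d × r and starts at zero, and the update is scaled by α / r. The tiny 3 × 3 example and the helper names are made up for illustration.

```python
def lora_param_count(d_in, d_out, r):
    # A is (r x d_in), B is (d_out x r)
    return r * d_in + d_out * r

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)                # frozen path: W x
    update = matvec(B, matvec(A, x))   # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

full = 4096 * 4096
lora = lora_param_count(4096, 4096, 16)
reduction = full / lora
print(full, lora, reduction)

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, -0.5, 1.0]]             # r x d with r = 1
B_zero = [[0.0], [0.0], [0.0]]     # d x r, the initial state
B_trained = [[1.0], [0.0], [0.0]]
x = [1.0, 2.0, 3.0]
y_init = lora_forward(W, A, B_zero, x, alpha=2, r=1)
y_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)
print(y_init, y_trained)
```

With B at zero the adapted layer reproduces the frozen layer exactly, so training starts from the base model's behavior.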

## 7.3 QLoRA: Quantized LoRA

📄 [arXiv:2305.14314](https://arxiv.org/abs/2305.14314)

Quantize the frozen base model to **4-bit**, add LoRA adapters in bf16.

- 7B model: ~4 GB (quantized) + small LoRA adapters
- Fine-tune 7B on a single RTX 4090 (24 GB)!

## 7.4 LoRA with TRL

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("trl-lib/Capybara", split="train")

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Scaling (usually 2×r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

config = SFTConfig(
    output_dir="./sft-lora",
    num_train_epochs=3,
    learning_rate=2e-4,      # Higher LR for LoRA
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",
    args=config,
    train_dataset=dataset,
    peft_config=lora_config,  # ← pass LoRA config
)
trainer.train()
```

## 7.5 When to Use What

| Scenario | Recommendation |
|----------|---------------|
| Limited GPU memory | LoRA / QLoRA |
| Quick experiment | LoRA |
| Maximum quality | Full fine-tuning |
| Multiple adapters from same base | LoRA (swap adapters) |
| Very small dataset | LoRA (acts as regularizer) |

**LoRA often gets close to full fine-tuning quality** at a fraction of the memory; how close depends on the task, the rank, and how much data you train on.
"""

CH8_TOOLBOX = r"""
# Chapter 8: The Toolbox

## 8.1 TRL — Transformer Reinforcement Learning

🔗 [github.com/huggingface/trl](https://github.com/huggingface/trl) · [Docs](https://huggingface.co/docs/trl)

The central library for all post-training methods:

| Trainer | Method | Dataset Type |
|---------|--------|-------------|
| `SFTTrainer` | Supervised Fine-Tuning | `messages` or `prompt`+`completion` |
| `DPOTrainer` | Direct Preference Optimization | `prompt`+`chosen`+`rejected` |
| `GRPOTrainer` | Group Relative Policy Optimization | `prompt` only |
| `RLOOTrainer` | REINFORCE Leave-One-Out | `prompt` only |
| `RewardTrainer` | Reward Model Training | `prompt`+`chosen`+`rejected` |
| `KTOTrainer` | Kahneman-Tversky Optimization | `prompt`+`completion`+`label` |
| `ORPOTrainer` | Odds Ratio Preference Opt. | `prompt`+`chosen`+`rejected` |
| `CPOTrainer` | Contrastive Preference Opt. | `prompt`+`chosen`+`rejected` |
| `OnlineDPOTrainer` | Online DPO | `prompt` only |
| `PPOTrainer` | Proximal Policy Optimization | tokenized |
| `XPOTrainer` | Exploratory Preference Opt. | `prompt` only |
| `NashMDTrainer` | Nash Mirror Descent | `prompt` only |
| `PRMTrainer` | Process Reward Model | stepwise supervision |

## 8.2 Core Libraries

| Library | Purpose |
|---------|---------|
| **transformers** | Model loading, tokenization, chat templates |
| **datasets** | Dataset loading and processing |
| **peft** | LoRA, QLoRA, adapters |
| **accelerate** | Distributed training, DeepSpeed |
| **bitsandbytes** | 4/8-bit quantization |

## 8.3 Inference & Serving

| Tool | Purpose |
|------|---------|
| **vLLM** | High-throughput inference (PagedAttention, continuous batching) |
| **TGI** | HuggingFace Text Generation Inference |
| **Unsloth** | 2-5× faster LoRA training |

TRL integrates vLLM directly:
```python
from trl import GRPOConfig
config = GRPOConfig(use_vllm=True, vllm_mode="colocate")  # generate with vLLM inside the training process
```

## 8.4 Experiment Tracking & Evaluation

| Tool | Purpose |
|------|---------|
| **Weights & Biases** | Experiment tracking |
| **Trackio** | HF-native tracking |
| **lm-eval-harness** | Standardized LLM benchmarks |
| **AlpacaEval** | GPT-4 based evaluation |

## 8.5 TRL CLI Commands

```bash
# Install
pip install trl

# SFT from command line
trl sft --model_name_or_path Qwen/Qwen3-0.6B \
    --dataset_name trl-lib/Capybara --output_dir ./sft

# DPO from command line
trl dpo --model_name_or_path your-sft-model \
    --dataset_name trl-lib/ultrafeedback_binarized --output_dir ./dpo

# GRPO from command line
trl grpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/DeepMath-103K --output_dir ./grpo

# Multi-GPU
accelerate launch --num_processes 4 train.py

# DeepSpeed ZeRO-3
accelerate launch --config_file deepspeed_zero3.yaml train.py
```
"""

CH9_DATASETS = r"""
# Chapter 9: Datasets — What to Train On

## 9.1 SFT Datasets

| Dataset | Size | Description |
|---------|------|-------------|
| [**trl-lib/Capybara**](https://huggingface.co/datasets/trl-lib/Capybara) | ~16K | Multi-turn conversations, high quality |
| [**HuggingFaceH4/ultrachat_200k**](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | 200K | Diverse multi-turn conversations |
| [**allenai/tulu-3-sft-mixture**](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) | ~939K | Large-scale SFT mixture |
| [**OpenAssistant/oasst1**](https://huggingface.co/datasets/OpenAssistant/oasst1) | 161K messages | Crowdsourced conversation trees |
| [**tatsu-lab/alpaca**](https://huggingface.co/datasets/tatsu-lab/alpaca) | 52K | GPT-generated instructions |
| [**teknium/OpenHermes-2.5**](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M | Large synthetic instructions |

## 9.2 Preference Datasets (DPO / KTO / ORPO)

| Dataset | Size | Description |
|---------|------|-------------|
| [**trl-lib/ultrafeedback_binarized**](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized) | 60K | Binarized UltraFeedback |
| [**Anthropic/hh-rlhf**](https://huggingface.co/datasets/Anthropic/hh-rlhf) | 170K | Helpful + harmless preferences |
| [**argilla/ultrafeedback-binarized-preferences**](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences) | 60K | Cleaned UltraFeedback |

## 9.3 Prompt-Only Datasets (GRPO / RLOO)

| Dataset | Size | Description |
|---------|------|-------------|
| [**trl-lib/DeepMath-103K**](https://huggingface.co/datasets/trl-lib/DeepMath-103K) | 103K | Math with verifiable answers |
| [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) | ~70K | Competition math |

## 9.4 How to Choose

1. **First experiment:** `trl-lib/Capybara` (SFT) or `trl-lib/ultrafeedback_binarized` (DPO)
2. **Quality > Quantity:** LIMA: 1K great > 52K mediocre
3. **Match your use case:** Math model → math data, chat model → diverse conversations
4. **Always inspect first:**

```python
from datasets import load_dataset
ds = load_dataset("trl-lib/Capybara", split="train")
print(ds[0])  # Look at the data!
```
"""

CH10_EVALUATION = r"""
# Chapter 10: Evaluation — How to Know If It Worked

## 10.1 Why Evaluation is Hard

- Open-ended outputs can be correct in many ways
- Perplexity is a poor proxy for usefulness (LIMA observed held-out perplexity getting worse while response quality improved)
- Benchmark scores ≠ real-world performance
- Human evaluation is expensive and subjective

## 10.2 Automated Benchmarks

| Benchmark | Measures | How |
|-----------|----------|-----|
| **MMLU** | Knowledge (57 subjects) | Multiple choice |
| **HellaSwag** | Commonsense reasoning | Sentence completion |
| **ARC** | Science reasoning | Multiple choice |
| **TruthfulQA** | Truthfulness | Trick questions |
| **GSM8K** | Math reasoning | Grade-school word problems |
| **MATH** | Advanced math | Competition-level |
| **HumanEval** | Code generation | Python problems |
| **MBPP** | Code generation | Basic Python |
| **IFEval** | Instruction following | Verifiable constraints |

**Run with lm-eval-harness:**
```bash
lm_eval --model hf \
    --model_args pretrained=your-model \
    --tasks mmlu,gsm8k,hellaswag \
    --batch_size 8
```

## 10.3 LLM-as-Judge

| Evaluation | Description |
|------------|-------------|
| **AlpacaEval** | GPT-4 compares to reference |
| **MT-Bench** | Multi-turn dialogue, GPT-4 judged |
| **Arena Hard** | Challenging prompts, GPT-4 judged |

## 10.4 Human Evaluation

- **Side-by-side:** Show two responses, pick the better one
- **Likert scale:** Rate helpfulness, accuracy, harmlessness (1-7)
- **Chatbot Arena:** [lmarena.ai](https://lmarena.ai/) — crowdsourced blind comparisons

## 10.5 The Open LLM Leaderboard

🔗 [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Standardized evaluation of open-source models on a fixed benchmark suite, and for a long time one of the main ways the community tracked progress.
"""

CH11_RECIPE = r"""
# Chapter 11: Putting It All Together

## 11.1 The Standard Recipe (2024-2025)

```
Step 1: Choose Base Model
├── Qwen3 (0.6B–235B) — Top-performing
├── Llama 3.1/3.2 (1B–405B) — Meta
├── Gemma 3/4 (1B–27B) — Google
└── Mistral/Mixtral — Efficient

Step 2: SFT
├── Dataset: trl-lib/Capybara or ultrachat_200k
├── Method: SFTTrainer + LoRA (or full)
├── Epochs: 2-3,  LR: 2e-5 (full) / 2e-4 (LoRA)
└── Output → reference model for Step 3

Step 3: Preference Optimization
├── DPO (simplest): β=0.1, LR=5e-7, 1-2 epochs
├── GRPO (reasoning): 16 generations, LR=1e-6
└── KTO (binary feedback): unpaired data

Step 4: Evaluation
├── lm-eval-harness (MMLU, GSM8K)
├── MT-Bench / AlpacaEval
└── Manual testing
```

## 11.2 Full Code: SFT → DPO

```python
# ═══ STAGE 1: SFT ═══
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

sft_dataset = load_dataset("trl-lib/Capybara", split="train")

sft_config = SFTConfig(
    output_dir="./sft-model",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    max_length=2048,            # called max_seq_length in older TRL releases
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/my-sft-model",
)
SFTTrainer(model="Qwen/Qwen3-0.6B", args=sft_config,
           train_dataset=sft_dataset).train()

# ═══ STAGE 2: DPO ═══
from trl import DPOTrainer, DPOConfig

dpo_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

dpo_config = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    beta=0.1,
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/my-dpo-model",
)
DPOTrainer(model="your-username/my-sft-model", args=dpo_config,
           train_dataset=dpo_dataset).train()
```

## 11.3 Hardware Guidelines

| Model Size | Minimum | Recommended | With LoRA |
|-----------|---------|-------------|-----------|
| 0.5–3B | 1× A10G (24GB) | 1× A100 (80GB) | 1× T4 (16GB) |
| 7–8B | 1× A100 (80GB) | 2× A100 | 1× A10G (24GB) |
| 13B | 2× A100 | 4× A100 | 1× A100 (80GB) |
| 70B | 4× A100 | 8× A100 | 2× A100 |
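
As a back-of-the-envelope check on the table, the sketch below estimates the memory taken by the model weights alone at different precisions. The byte counts per parameter are standard, but the helper name and the chosen sizes are illustrative assumptions. The result is only a lower bound, since it leaves out gradients, optimizer state, and activations.

```python
# Approximate storage cost per parameter; the "nf4" entry ignores the small
# overhead of quantization constants.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "nf4": 0.5}

def weight_memory_gb(n_params_billion, dtype="bf16"):
    # Memory for the weights alone, in GB (1 GB = 1e9 bytes).
    return n_params_billion * BYTES_PER_PARAM[dtype]

for size in (3, 8, 13, 70):
    print(f"{size}B: bf16 {weight_memory_gb(size):.1f} GB, "
          f"nf4 {weight_memory_gb(size, 'nf4'):.1f} GB")
```

Full fine-tuning adds gradients and optimizer state for every parameter on top of this, whereas LoRA trains only small adapter matrices, which is why the LoRA column gets by with smaller GPUs.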
"""

CH12_READING_LIST = r"""
# Chapter 12: The Reading List

## Tier 1: Must-Read (The Foundations)

| # | Paper | Year | Why Read It |
|---|-------|------|------------|
| 1 | [**InstructGPT**](https://arxiv.org/abs/2203.02155) — *Training LMs to Follow Instructions with Human Feedback* | 2022 | The SFT→RM→PPO pipeline. Everything starts here. |
| 2 | [**DPO**](https://arxiv.org/abs/2305.18290) — *Direct Preference Optimization* | 2023 | No reward model, no RL loop. One of the most widely used preference-optimization methods. |
| 3 | [**LoRA**](https://arxiv.org/abs/2106.09685) — *Low-Rank Adaptation of Large Language Models* | 2021 | Made fine-tuning accessible. Used everywhere. |
| 4 | [**DeepSeek-R1**](https://arxiv.org/abs/2501.12948) — *Incentivizing Reasoning via RL* | 2025 | Large-scale RL on a pretrained base model elicits reasoning without prior SFT (R1-Zero). Kicked off the reasoning era. |

## Tier 2: Important (Deeper Understanding)

| # | Paper | Year | Why Read It |
|---|-------|------|------------|
| 5 | [**LIMA**](https://arxiv.org/abs/2305.11206) — *Less Is More for Alignment* | 2023 | Data quality >> quantity. |
| 6 | [**Constitutional AI**](https://arxiv.org/abs/2212.08073) — *Harmlessness from AI Feedback* | 2022 | AI feedback replaces human feedback (RLAIF). |
| 7 | [**FLAN**](https://arxiv.org/abs/2109.01652) — *Finetuned LMs Are Zero-Shot Learners* | 2021 | Showed that instruction tuning improves zero-shot performance on unseen tasks. |
| 8 | [**Self-Instruct**](https://arxiv.org/abs/2212.10560) — *Aligning LMs with Self-Generated Instructions* | 2022 | Synthetic data for SFT. Led to Alpaca. |
| 9 | [**DeepSeekMath**](https://arxiv.org/abs/2402.03300) — *Pushing the Limits of Mathematical Reasoning* | 2024 | Introduced GRPO. |
| 10 | [**QLoRA**](https://arxiv.org/abs/2305.14314) — *Efficient Finetuning of Quantized LMs* | 2023 | 7B fine-tuning on consumer GPUs. |

## Tier 3: Advanced (Cutting Edge)

| # | Paper | Year |
|---|-------|------|
| 11 | [**KTO**](https://arxiv.org/abs/2402.01306) — *Prospect Theoretic Optimization* | 2024 |
| 12 | [**ORPO**](https://arxiv.org/abs/2403.07691) — *Monolithic Preference Optimization* | 2024 |
| 13 | [**SimPO**](https://arxiv.org/abs/2405.14734) — *Reference-Free Reward* | 2024 |
| 14 | [**Tulu 3**](https://arxiv.org/abs/2411.15124) — *Open Language Model Post-Training* | 2024 |
| 15 | [**Zephyr**](https://arxiv.org/abs/2310.16944) — *Direct Distillation of LM Alignment* | 2023 |

## Tier 4: RL Foundations

| # | Paper | Year |
|---|-------|------|
| 16 | [**PPO**](https://arxiv.org/abs/1707.06347) — *Proximal Policy Optimization* | 2017 |
| 17 | [**Learning to Summarize from Human Feedback**](https://arxiv.org/abs/2009.01325) | 2020 |
| 18 | [**Fine-Tuning LMs from Human Preferences**](https://arxiv.org/abs/1909.08593) | 2019 |
"""

GLOSSARY = r"""
# Glossary & Quick Reference

## Key Terms

| Term | Definition |
|------|-----------|
| **Alignment** | Making a model behave according to human intentions and values |
| **RLHF** | Reinforcement Learning from Human Feedback |
| **RLAIF** | RL from AI Feedback (AI replaces human annotators) |
| **SFT** | Supervised Fine-Tuning on instruction-response pairs |
| **DPO** | Direct Preference Optimization — no reward model needed |
| **GRPO** | Group Relative Policy Optimization — critic-free RL that scores each completion against the others sampled for the same prompt; common for reasoning |
| **PPO** | Proximal Policy Optimization — the RL algorithm in RLHF |
| **Reward Model** | Scores responses based on human preferences |
| **Policy** | The language model being trained (RL terminology) |
| **Reference Model (π_ref)** | Frozen copy of the SFT model that the policy is kept close to via the KL term |
| **KL Divergence** | Measures how far the policy drifts from the reference |
| **Bradley-Terry** | Preference model: P(A>B) = σ(score(A) - score(B)) |
| **Reward Hacking** | The policy exploits flaws in the reward model to score highly without actually getting better |
| **LoRA** | Low-Rank Adaptation — parameter-efficient fine-tuning |
| **QLoRA** | 4-bit quantized base model + LoRA adapters |
| **Chat Template** | Format for structuring conversations (special tokens + roles) |
| **On-policy** | Training on data from the *current* model |
| **Off-policy** | Training on data from a *different* model |
| **Advantage** | How much better an action is vs. the expected value |
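
Two of the definitions above are easier to grasp with numbers. The sketch below evaluates the Bradley-Terry preference probability and the KL divergence between two small discrete distributions. The function names and toy values are illustrative assumptions. In real training both quantities are computed over token sequences, not three-outcome toy distributions.

```python
import math

def bradley_terry_prob(score_a, score_b):
    # P(A > B) = sigmoid(score(A) - score(B))
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions over the same outcomes.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(bradley_terry_prob(2.0, 2.0))   # equal scores give 0.5
print(bradley_terry_prob(3.0, 1.0))   # A is scored higher than B

policy = [0.7, 0.2, 0.1]      # toy next-token distribution of the policy
reference = [0.5, 0.3, 0.2]   # toy distribution of the reference model
print(kl_divergence(policy, reference))     # positive: the policy has drifted
print(kl_divergence(reference, reference))  # identical distributions
```

Only the score difference matters in Bradley-Terry, so adding a constant to every reward leaves the preference probabilities unchanged.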

## Dataset Formats by Trainer

```python
# SFT
{"messages": [{"role": "user", "content": "..."},
              {"role": "assistant", "content": "..."}]}

# DPO / ORPO / CPO
{"prompt": "...", "chosen": "...", "rejected": "..."}

# GRPO / RLOO / Online DPO
{"prompt": "..."}

# KTO
{"prompt": "...", "completion": "...", "label": True}

# Reward Model
{"prompt": "...", "chosen": "...", "rejected": "..."}

# PRM (Process Reward Model)
{"prompt": "...", "completions": ["step1", "step2"],
 "labels": [True, False]}
```
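
The paired and unpaired formats above are closely related. As a hedged sketch, one preference record can be split into two KTO-style records by labelling the chosen completion as desirable and the rejected one as undesirable. The helper name and the sample record are hypothetical and only illustrate the mapping between the two formats.

```python
def unpair_preference_record(record):
    # One paired example becomes two unpaired examples:
    # "chosen" is labelled desirable, "rejected" undesirable.
    return [
        {"prompt": record["prompt"], "completion": record["chosen"], "label": True},
        {"prompt": record["prompt"], "completion": record["rejected"], "label": False},
    ]

paired = {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"}
rows = unpair_preference_record(paired)
for row in rows:
    print(row)
```

The reverse direction is not generally possible, because unpaired feedback carries no information about which two completions were compared.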
"""

SEARCH_CONTENT = {
    "Overview & Roadmap": OVERVIEW,
    "Ch 1: The Big Picture": CH1_BIG_PICTURE,
    "Ch 2: SFT": CH2_SFT,
    "Ch 3: RLHF": CH3_RLHF,
    "Ch 4: DPO": CH4_DPO,
    "Ch 5: Preference Zoo": CH5_PREFERENCE_ZOO,
    "Ch 6: GRPO & Reasoning": CH6_GRPO,
    "Ch 7: PEFT (LoRA)": CH7_PEFT,
    "Ch 8: Toolbox": CH8_TOOLBOX,
    "Ch 9: Datasets": CH9_DATASETS,
    "Ch 10: Evaluation": CH10_EVALUATION,
    "Ch 11: Full Recipe": CH11_RECIPE,
    "Ch 12: Reading List": CH12_READING_LIST,
    "Glossary & Reference": GLOSSARY,
}

def search_guide(query):
    """Return a Markdown snippet for the first matching line in each chapter."""
    if not query or not query.strip():
        return "Type a keyword to search across all chapters (e.g., *DPO*, *reward*, *LoRA*, *dataset*)."
    query_lower = query.lower().strip()
    results = []
    for title, content in SEARCH_CONTENT.items():
        lines = content.split("\n")
        for i, line in enumerate(lines):
            if query_lower in line.lower():
                # Show one line of context before the hit and three after it.
                start = max(0, i - 1)
                end = min(len(lines), i + 4)
                snippet = "\n".join(lines[start:end])
                # Close a code fence left open by the cut so it cannot
                # swallow the results that follow.
                if snippet.count("```") % 2 == 1:
                    snippet += "\n```"
                results.append(f"### 📍 {title}\n\n{snippet}\n\n---\n")
                break  # one match per chapter
    if results:
        return f"## Found in {len(results)} chapter(s)\n\n" + "\n".join(results)
    return f"No results for **'{query}'**. Try: *SFT*, *DPO*, *GRPO*, *LoRA*, *reward*, *PPO*, *dataset*, *evaluation*."


# ============================================================
# GRADIO UI
# ============================================================

CUSTOM_CSS = """
.chapter-content {
    max-width: 900px;
    margin: 0 auto;
    font-size: 16px;
    line-height: 1.7;
}
.gradio-container {
    max-width: 1100px !important;
}
footer { display: none !important; }
"""

THEME = gr.themes.Soft(
    primary_hue="blue",
    secondary_hue="slate",
    font=[gr.themes.GoogleFont("Inter"), "system-ui", "sans-serif"],
)

with gr.Blocks(title="Post-Training Guide for LLMs") as demo:

    gr.HTML("""
    <div style="text-align:center; padding: 20px 0 5px 0;">
        <h1 style="margin-bottom:4px;">📖 The Complete Guide to Post-Training of LLMs</h1>
        <p style="color:#666; font-size:17px; margin-top:0;">
            From Pretraining to Alignment — SFT · RLHF · DPO · GRPO · LoRA and more
        </p>
    </div>
    """)

    with gr.Tabs():
        with gr.Tab("🏠 Overview"):
            gr.Markdown(OVERVIEW, elem_classes="chapter-content")

        with gr.Tab("1️⃣ Big Picture"):
            gr.Markdown(CH1_BIG_PICTURE, elem_classes="chapter-content")

        with gr.Tab("2️⃣ SFT"):
            gr.Markdown(CH2_SFT, elem_classes="chapter-content")

        with gr.Tab("3️⃣ RLHF"):
            gr.Markdown(CH3_RLHF, elem_classes="chapter-content")

        with gr.Tab("4️⃣ DPO"):
            gr.Markdown(CH4_DPO, elem_classes="chapter-content")

        with gr.Tab("5️⃣ Preference Zoo"):
            gr.Markdown(CH5_PREFERENCE_ZOO, elem_classes="chapter-content")

        with gr.Tab("6️⃣ GRPO"):
            gr.Markdown(CH6_GRPO, elem_classes="chapter-content")

        with gr.Tab("7️⃣ PEFT"):
            gr.Markdown(CH7_PEFT, elem_classes="chapter-content")

        with gr.Tab("8️⃣ Toolbox"):
            gr.Markdown(CH8_TOOLBOX, elem_classes="chapter-content")

        with gr.Tab("9️⃣ Datasets"):
            gr.Markdown(CH9_DATASETS, elem_classes="chapter-content")

        with gr.Tab("🔟 Evaluation"):
            gr.Markdown(CH10_EVALUATION, elem_classes="chapter-content")

        with gr.Tab("📋 Full Recipe"):
            gr.Markdown(CH11_RECIPE, elem_classes="chapter-content")

        with gr.Tab("📚 Reading List"):
            gr.Markdown(CH12_READING_LIST, elem_classes="chapter-content")

        with gr.Tab("📖 Glossary"):
            gr.Markdown(GLOSSARY, elem_classes="chapter-content")

        with gr.Tab("🔍 Search"):
            gr.Markdown("## Search the Guide\nFind any topic across all chapters.")
            search_input = gr.Textbox(
                placeholder="Type a keyword (e.g. DPO, reward, LoRA, dataset, PPO)...",
                label="Search",
                lines=1,
            )
            search_output = gr.Markdown(elem_classes="chapter-content")
            search_input.submit(search_guide, inputs=search_input, outputs=search_output)
            search_input.change(search_guide, inputs=search_input, outputs=search_output)

    gr.HTML("""
    <div style="text-align:center; padding:15px; color:#888; font-size:13px; border-top:1px solid #eee; margin-top:20px;">
        Built from primary research papers & official HF documentation · All code uses TRL v1.2+ APIs ·
        <a href="https://huggingface.co/docs/trl" target="_blank">TRL Docs</a> ·
        <a href="https://github.com/huggingface/trl" target="_blank">TRL GitHub</a> ·
        Last updated April 2026
    </div>
    """)

demo.launch(theme=THEME, css=CUSTOM_CSS)