| import gradio as gr | |
| # ============================================================ | |
| # CONTENT — each chapter stored as a Python string (Markdown) | |
| # ============================================================ | |
| OVERVIEW = r""" | |
| # 📖 The Complete Guide to Post-Training of Large Language Models | |
| ### From Pretraining to Alignment — Everything You Need to Know | |
| --- | |
| > **Who is this for?** You've learned how pretraining works — you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens *after* pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini. | |
| --- | |
| ## 🗺️ Roadmap | |
| | # | Chapter | What You'll Learn | | |
| |---|---------|-------------------| | |
| | 1 | **The Big Picture** | Why pretrained models aren't useful yet; the 3-stage pipeline | | |
| | 2 | **SFT** | Supervised Fine-Tuning — loss function, data formats, key papers | | |
| | 3 | **RLHF** | Reward models, PPO, KL divergence, reward hacking | | |
| | 4 | **DPO** | Direct Preference Optimization — RLHF without RL | | |
| | 5 | **Preference Zoo** | KTO, ORPO, SimPO, CPO, IPO, Online DPO | | |
| | 6 | **GRPO & Reasoning** | DeepSeek-R1, reward functions, the reasoning revolution | | |
| | 7 | **PEFT** | LoRA, QLoRA — fine-tune on consumer GPUs | | |
| | 8 | **Toolbox** | TRL, Transformers, vLLM, Accelerate, DeepSpeed | | |
| | 9 | **Datasets** | What to train on — curated lists with Hub links | | |
| | 10 | **Evaluation** | Benchmarks, LLM-as-Judge, human eval | | |
| | 11 | **Full Recipe** | End-to-end pipeline with code | | |
| | 12 | **Reading List** | 18 must-read papers in 4 tiers | | |
| --- | |
| ### The Three Stages of Post-Training | |
| ``` | |
| ┌──────────────┐     ┌───────────────────┐     ┌──────────────────────────┐ | |
| │ STAGE 1: SFT │ ──> │ STAGE 2: Reward   │ ──> │ STAGE 3: RL              │ | |
| │              │     │ Model Training    │     │ (PPO / DPO / GRPO)       │ | |
| │ Teach format │     │ Learn preferences │     │ Optimize for preferences │ | |
| │ & behavior   │     │ from comparisons  │     │ while staying close to   │ | |
| │              │     │                   │     │ the SFT model            │ | |
| └──────────────┘     └───────────────────┘     └──────────────────────────┘ | |
| Input: Pretrained LM                              Output: Aligned Assistant | |
| ``` | |
| ### The Evolution Timeline | |
| | Year | Method | Key Idea | | |
| |------|--------|----------| | |
| | 2017 | RLHF (original) | Human preferences → reward model → RL | | |
| | 2020 | RLHF for LLMs | Applied to text summarization | | |
| | 2022 | **InstructGPT** | Full SFT → RM → PPO pipeline | | |
| | 2022 | Constitutional AI | AI feedback replaces human feedback | | |
| | 2023 | **DPO** | No reward model needed — direct optimization | | |
| | 2024 | KTO / ORPO | Binary feedback / combined SFT+preference | | |
| | 2024 | **GRPO** | Group-based RL for reasoning (DeepSeek) | | |
| | 2025 | DeepSeek-R1 | RL teaches chain-of-thought from scratch | | |
| """ | |
| CH1_BIG_PICTURE = r""" | |
| # Chapter 1: The Big Picture — Why Post-Training Exists | |
| ## 1.1 The Gap Between Pretraining and Usefulness | |
| You've pretrained a language model. It can predict the next token with impressive accuracy. It has absorbed vast knowledge from the internet. But try asking it a question: | |
| ``` | |
| User: What is the capital of France? | |
| Model: What is the capital of Germany? What is the capital of Italy? What is the... | |
| ``` | |
| The model doesn't *answer* — it *continues*. The pretraining objective (`P(next_token | context)`) optimizes for predicting what comes next in web text, not for being helpful. | |
| > *"Large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users."* | |
| > — **InstructGPT** (Ouyang et al., 2022) | |
| ## 1.2 What Post-Training Does | |
| Post-training is everything after pretraining that makes a model useful, safe, and aligned. It has three stages: | |
| **Stage 1 — SFT (Supervised Fine-Tuning):** Teach the model the *format* of helpful responses using demonstrations. | |
| **Stage 2 — Reward Modeling:** Train a model to predict which response a human would prefer. | |
| **Stage 3 — RL Optimization:** Optimize the SFT model to generate responses the reward model scores highly. | |
| ## 1.3 The Superficial Alignment Hypothesis | |
| > *"A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users."* | |
| > — **LIMA** (Zhou et al., 2023) | |
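| This is the key idea: on this view, **post-training adds little new knowledge**; it mostly teaches the model to *surface existing knowledge in the right way*. The pretrained model already "knows" the capital of France; SFT teaches it to answer the question rather than generate more questions. Treat it as a hypothesis, not a law: the RL methods in Chapter 6 elicit reasoning behaviors that were never demonstrated explicitly. | |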
| ## 1.4 Why This Matters | |
| The difference is dramatic: | |
| - **InstructGPT 1.3B** (post-trained) was preferred over **GPT-3 175B** (pretrained only) by human evaluators | |
| - That's a model with over 100× fewer parameters winning because of post-training | |
| - Post-training is what turns a text predictor into an assistant | |
| """ | |
| CH2_SFT = r""" | |
| # Chapter 2: Supervised Fine-Tuning (SFT) | |
| ## 2.1 What SFT Does | |
| SFT is the bridge between a pretrained language model and a useful assistant. | |
| **Before SFT:** | |
| ``` | |
| Input: "Explain quantum computing in simple terms." | |
| Output: "Explain quantum computing to a 5-year-old. Explain quantum computing..." | |
| ``` | |
| **After SFT:** | |
| ``` | |
| Input: "Explain quantum computing in simple terms." | |
| Output: "Quantum computing uses the principles of quantum mechanics to process | |
| information. Unlike classical computers that use bits (0 or 1), | |
| quantum computers use qubits that can be both 0 and 1 simultaneously..." | |
| ``` | |
| ## 2.2 The SFT Loss Function | |
| If you understand pretraining, you understand SFT — with one crucial difference: | |
| **Pretraining loss** (on ALL tokens): | |
| ``` | |
| L_pretrain = -Σ log P(token_i | token_1, ..., token_{i-1}) for ALL tokens | |
| ``` | |
| **SFT loss** (on RESPONSE tokens only): | |
| ``` | |
| L_SFT = -Σ log P(c_i | prompt, c_1, ..., c_{i-1}) for COMPLETION tokens only | |
| ``` | |
| The prompt tokens are masked from the loss. We don't want the model to learn to generate instructions — we want it to learn to *respond* to them. | |
| ``` | |
| Sequence: [User: What is 2+2?] [Assistant: 4] | |
| Loss mask: [ ----IGNORED---- ] [COMPUTED HERE ] | |
| ``` | |
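The masking above can be sketched in plain Python. The per-token probabilities are made-up numbers and `masked_nll` is a hypothetical helper, not a TRL function. It averages over the selected tokens, as trainers usually report the loss, instead of summing. Real training code typically gets the same effect by setting the labels of prompt tokens to an ignore index such as `-100`.

```python
import math

def masked_nll(token_probs, loss_mask):
    # Average negative log-likelihood over positions where loss_mask is 1.
    # token_probs[i] is the probability the model gave to the correct token i.
    picked = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m == 1]
    if not picked:
        raise ValueError("loss_mask selects no tokens")
    return sum(picked) / len(picked)

# Hypothetical sequence: 4 prompt tokens followed by 2 completion tokens.
token_probs = [0.10, 0.20, 0.05, 0.30, 0.80, 0.90]
pretrain_mask = [1, 1, 1, 1, 1, 1]  # pretraining: every token counts
sft_mask = [0, 0, 0, 0, 1, 1]       # SFT: only completion tokens count

pretrain_loss = masked_nll(token_probs, pretrain_mask)
sft_loss = masked_nll(token_probs, sft_mask)
print(f"pretraining-style loss: {pretrain_loss:.3f}")
print(f"SFT loss (prompt masked): {sft_loss:.3f}")
```

The prompt tokens still sit in the context and shape the predictions; they just contribute nothing to the gradient.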
| ## 2.3 Chat Templates & Data Format | |
| Modern SFT uses structured conversations: | |
| ```python | |
| { | |
| "messages": [ | |
| {"role": "system", "content": "You are a helpful assistant."}, | |
| {"role": "user", "content": "What is the capital of France?"}, | |
| {"role": "assistant", "content": "The capital of France is Paris."} | |
| ] | |
| } | |
| ``` | |
| Each model family converts this to its own template: | |
| ``` | |
| # ChatML (Qwen, etc.): | |
| <|im_start|>system | |
| You are a helpful assistant.<|im_end|> | |
| <|im_start|>user | |
| What is the capital of France?<|im_end|> | |
| <|im_start|>assistant | |
| The capital of France is Paris.<|im_end|> | |
| # Llama-3: | |
| <|begin_of_text|><|start_header_id|>system<|end_header_id|> | |
| You are a helpful assistant.<|eot_id|> | |
| <|start_header_id|>user<|end_header_id|> | |
| What is the capital of France?<|eot_id|> | |
| ``` | |
| The `transformers` library handles this automatically: | |
| ```python | |
| from transformers import AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct") | |
| messages = [ | |
| {"role": "user", "content": "What is 2+2?"}, | |
| {"role": "assistant", "content": "4"} | |
| ] | |
| # For training: | |
| text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) | |
| # For inference: | |
| text = tokenizer.apply_chat_template(messages[:1], tokenize=False, add_generation_prompt=True) | |
| ``` | |
| ## 2.4 Key SFT Papers | |
| ### FLAN (2021) — Instruction Tuning Works | |
| 📄 *"Finetuned Language Models Are Zero-Shot Learners"* — [arXiv:2109.01652](https://arxiv.org/abs/2109.01652) | |
| - 62 NLP datasets formatted as instructions → fine-tune LaMDA-PT 137B | |
| - **Result:** Surpassed zero-shot GPT-3 175B on 20/25 tasks | |
| - **Key insight:** Instructions matter — same tasks without instructions = much weaker | |
| - **Recipe:** Adafactor, lr=3e-5, 30K steps, batch 8192 tokens | |
| ### Self-Instruct (2022) — Synthetic Data | |
| 📄 *"Self-Instruct"* — [arXiv:2212.10560](https://arxiv.org/abs/2212.10560) | |
| - 175 seed tasks → GPT-3 generates 52K instructions + responses | |
| - **Result:** +33% over vanilla GPT-3 on SuperNaturalInstructions | |
| - **Led to:** Stanford Alpaca (LLaMA + 52K GPT instructions for <$600) | |
| ### InstructGPT (2022) — SFT as Stage 1 | |
| 📄 *"Training Language Models to Follow Instructions with Human Feedback"* — [arXiv:2203.02155](https://arxiv.org/abs/2203.02155) | |
| - ~13K human demonstrations, 16 epochs, cosine LR, dropout 0.2 | |
| - **Key finding:** SFT overfits on val loss after 1 epoch, but more epochs ↑ RM score | |
| - **Result:** 1.3B InstructGPT preferred over 175B GPT-3 | |
| ### LIMA (2023) — Less Is More | |
| 📄 *"LIMA: Less Is More for Alignment"* — [arXiv:2305.11206](https://arxiv.org/abs/2305.11206) | |
| - Just **1,000 curated examples** → competitive with GPT-3.5 (DaVinci003) | |
| - **Recipe:** AdamW, lr 1e-5→1e-6, 15 epochs, batch 32, max len 2048 | |
| - **Takeaway:** Data quality >> data quantity | |
| ## 2.5 SFT Code Example | |
| ```python | |
| from trl import SFTTrainer, SFTConfig | |
| from datasets import load_dataset | |
| dataset = load_dataset("trl-lib/Capybara", split="train") | |
| config = SFTConfig( | |
| output_dir="./sft-output", | |
| num_train_epochs=3, | |
| per_device_train_batch_size=4, | |
| learning_rate=2e-5, | |
| max_length=2048, # named max_seq_length in older TRL releases | |
| gradient_checkpointing=True, | |
| bf16=True, | |
| logging_steps=10, | |
| push_to_hub=True, | |
| hub_model_id="your-username/your-sft-model", | |
| ) | |
| trainer = SFTTrainer( | |
| model="Qwen/Qwen3-0.6B", | |
| args=config, | |
| train_dataset=dataset, | |
| ) | |
| trainer.train() | |
| ``` | |
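| `SFTTrainer` detects the `messages` column and applies the chat template automatically. Check how your TRL version handles loss masking: recent releases train on completion tokens only for `prompt`+`completion` data, while conversational `messages` data needs `assistant_only_loss=True` in `SFTConfig` to restrict the loss to assistant turns. | |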
| """ | |
| CH3_RLHF = r""" | |
| # Chapter 3: Reinforcement Learning from Human Feedback (RLHF) | |
| ## 3.1 Why SFT Isn't Enough | |
| SFT teaches format and basic behavior, but: | |
| - It only learns from demonstrations (can't be better than its training data) | |
| - It can't express preferences (treats all tokens equally) | |
| - It can learn bad habits from imperfect training data | |
| RLHF addresses this by training on **which outputs are better**, not what specific tokens to generate. | |
| ## 3.2 Step 1: Train a Reward Model | |
| A **reward model (RM)** scores how good a response is (prompt + response → scalar score). | |
| **Training process:** | |
| 1. Generate multiple responses per prompt using the SFT model | |
| 2. Humans rank responses (e.g., A > B) | |
| 3. Train RM to predict these rankings | |
| **The Bradley-Terry preference model:** | |
| ``` | |
| P(A preferred over B) = σ(r(A) - r(B)) | |
| Loss: L_RM = -E[log σ(r(x, y_chosen) - r(x, y_rejected))] | |
| ``` | |
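As a minimal sketch of that loss, here it is on a few scalar scores. The scores are invented for illustration; in practice they come from the reward model's scalar head, and `bradley_terry_loss` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_loss(chosen_scores, rejected_scores):
    # Mean of -log sigmoid(r_chosen - r_rejected) over preference pairs.
    losses = [
        -math.log(sigmoid(c - r))
        for c, r in zip(chosen_scores, rejected_scores)
    ]
    return sum(losses) / len(losses)

# Hypothetical reward-model scores for three (chosen, rejected) pairs.
chosen = [2.0, 0.5, 1.0]
rejected = [0.0, 0.5, 3.0]

print("P(A > B) at equal scores:", sigmoid(0.0))
print("loss on the three pairs:", round(bradley_terry_loss(chosen, rejected), 4))
```

Only the score difference matters, so the RM is free to shift all its scores by a constant. The third pair, ranked the wrong way round, dominates the loss.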
| **Architecture:** Same as the LM, but output head → single scalar. InstructGPT used a 6B RM (not 175B — larger was unstable). | |
| ## 3.3 Step 2: Optimize with PPO | |
| **The RLHF objective:** | |
| ``` | |
| maximize E[RM(prompt, response)] - β · KL(π_θ || π_ref) | |
| ↑ score high ↑ don't deviate too far | |
| on reward model from original SFT model | |
| ``` | |
| The **KL penalty** limits **reward hacking**: without it, the policy can drift toward degenerate outputs (even gibberish) that the RM happens to score highly. | |
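One common way to apply the penalty, sketched here under assumptions, is to estimate the KL term from the sampled response itself: sum the per-token log-probability gap between the policy and the reference, then subtract β times that from the RM score. The log-probabilities below are invented, `kl_penalized_reward` is a hypothetical helper, and β = 0.02 is the InstructGPT value quoted later in this chapter.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    # Single-sample KL estimate for a response drawn from the policy:
    # sum over tokens of log pi_theta(token) - log pi_ref(token).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate, kl_estimate

# Hypothetical per-token log-probs for a 3-token response.
policy_lp = [-0.1, -0.2, -0.1]  # the policy is very confident in these tokens
ref_lp = [-1.0, -1.5, -0.9]     # the reference (SFT) model is much less so

shaped, kl = kl_penalized_reward(1.5, policy_lp, ref_lp)
print(f"KL estimate: {kl:.2f}, shaped reward: {shaped:.3f}")
```

The further the policy moves from the reference on the text it generates, the more of the RM score is taxed away.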
| **PPO (Proximal Policy Optimization) loop:** | |
| 1. **Generate** responses from the current model | |
| 2. **Score** them with the reward model | |
| 3. **Compute advantage** (how much better than expected) | |
| 4. **Update** weights to favor high-advantage responses | |
| 5. **Clip** updates for stability | |
| ``` | |
| L_PPO = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)] | |
| where ratio = π_θ(a|s) / π_old(a|s) | |
| ``` | |
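A small sketch of the clipped objective, assuming per-action log-probabilities and advantages are already computed. `ppo_clip_loss` is a hypothetical helper and ε = 0.2 is a commonly used default, not a value taken from this guide.

```python
import math

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, epsilon=0.2):
    # Negative clipped surrogate, averaged over sampled actions (tokens).
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_theta / pi_old
        clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

# The ratio has grown to 1.5 on a good action: the gain is capped at 1.2.
print(ppo_clip_loss([math.log(1.5)], [0.0], [1.0]))
# The ratio has shrunk to 0.5 on a bad action: the term is held at 0.8.
print(ppo_clip_loss([math.log(0.5)], [0.0], [-1.0]))
```

Once the ratio leaves the `[1-ε, 1+ε]` band in the direction the advantage favors, the term stops changing, so a single batch cannot push the policy arbitrarily far.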
| ## 3.4 InstructGPT Training Details | |
| - β = 0.02 for KL penalty | |
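| - PPO-ptx variant: pretraining gradients mixed into the PPO updates (limits capability regression); the SFT model that initializes PPO was itself trained with 10% pretraining data mixed in | |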
| - LR range: 2.55e-6 to 2.55e-5 (>8.05e-6 diverged) | |
| - 256K PPO episodes total | |
| - 4 models in memory simultaneously: policy, reference, reward, value | |
| ## 3.5 Why RLHF is Hard | |
| | Challenge | Description | | |
| |-----------|-------------| | |
| | **Complexity** | 4 models in memory simultaneously | | |
| | **Instability** | PPO is sensitive to hyperparameters | | |
| | **Reward hacking** | Model exploits RM rather than genuinely improving | | |
| | **Cost** | Human preference data is expensive | | |
| | **Reproducibility** | Small changes → very different outcomes | | |
| These challenges motivated the development of DPO. | |
| ## 3.6 Constitutional AI (RLAIF) | |
| 📄 *"Constitutional AI"* (Bai et al., 2022) — [arXiv:2212.08073](https://arxiv.org/abs/2212.08073) | |
| **Key idea:** Replace human feedback with **AI feedback**. An AI evaluates responses against principles (the "constitution"). Dramatically reduces cost and scales the feedback process. | |
| """ | |
| CH4_DPO = r""" | |
| # Chapter 4: Direct Preference Optimization (DPO) | |
| ## 4.1 The Key Insight | |
| 📄 *"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"* — [arXiv:2305.18290](https://arxiv.org/abs/2305.18290) | |
| **You don't need a separate reward model or RL training loop.** The language model itself implicitly represents a reward model. | |
| The optimal solution to the RLHF objective can be expressed in closed form: | |
| ``` | |
| π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x,y)) | |
| ``` | |
| Rearranging to express reward in terms of the policy: | |
| ``` | |
| r(x,y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x) | |
| ``` | |
| Since Bradley-Terry only uses **reward differences**, the partition function Z(x) cancels: | |
| ``` | |
| L_DPO = -E[log σ(β · log(π_θ(y_w|x)/π_ref(y_w|x)) - β · log(π_θ(y_l|x)/π_ref(y_l|x)))] | |
| ``` | |
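The loss for a single preference pair fits in a few lines. This is a sketch: the inputs are summed token log-probabilities of each full response under the policy and the reference, the numbers are invented, and `dpo_loss` is a hypothetical helper rather than TRL's implementation.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin)), margin

# At initialization the policy equals the reference: margin 0, loss log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# After training, chosen is more likely and rejected less likely than under the reference.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The reference log-probabilities can be computed once and cached, which is why DPO needs no sampling during training.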
| ## 4.2 DPO vs RLHF | |
| | Aspect | RLHF (PPO) | DPO | | |
| |--------|-------------|-----| | |
| | Models in memory | 4 | 2 (policy + reference) | | |
| | Training loop | Complex RL with generation | Simple supervised training | | |
| | Hyperparameters | Many | Few (mainly β) | | |
| | Stability | Often unstable | Very stable | | |
| | Sampling during training | Required | Not required | | |
| | Performance | Strong | Comparable or better | | |
| ## 4.3 The DPO Gradient — Intuition | |
| ``` | |
| ∇L_DPO ∝ -β · [weight] · [∇log π(y_w|x) - ∇log π(y_l|x)] | |
| ``` | |
| - **Increase** likelihood of preferred response y_w | |
| - **Decrease** likelihood of rejected response y_l | |
| - **Weight** by how wrong the model currently is | |
| If the model already prefers y_w correctly → small gradient (don't fix what isn't broken). | |
| ## 4.4 DPO Data Format | |
| ```python | |
| # Each example needs: prompt + chosen + rejected | |
| { | |
| "prompt": [{"role": "user", "content": "Explain gravity"}], | |
| "chosen": [{"role": "assistant", "content": "Gravity is a fundamental force..."}], | |
| "rejected": [{"role": "assistant", "content": "Gravity is when things fall down."}] | |
| } | |
| ``` | |
| ## 4.5 DPO Code Example | |
| ```python | |
| from trl import DPOTrainer, DPOConfig | |
| from datasets import load_dataset | |
| dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train") | |
| config = DPOConfig( | |
| output_dir="./dpo-output", | |
| num_train_epochs=1, | |
| per_device_train_batch_size=4, | |
| learning_rate=5e-7, # Much lower than SFT! | |
| beta=0.1, # KL penalty strength | |
| bf16=True, | |
| gradient_checkpointing=True, | |
| push_to_hub=True, | |
| hub_model_id="your-username/your-dpo-model", | |
| ) | |
| trainer = DPOTrainer( | |
| model="your-sft-model", # SFT model from stage 1 | |
| args=config, | |
| train_dataset=dataset, | |
| ) | |
| trainer.train() | |
| ``` | |
| ## 4.6 Key Hyperparameters | |
| | Parameter | Typical Range | Notes | | |
| |-----------|--------------|-------| | |
| | **β (beta)** | 0.01 – 0.5 | Higher = stay closer to reference. Default: 0.1 | | |
| | **Learning rate** | 1e-7 – 5e-6 | Much lower than SFT. Very sensitive. | | |
| | **Epochs** | 1 – 3 | Overfitting is common with more | | |
| """ | |
| CH5_PREFERENCE_ZOO = r""" | |
| # Chapter 5: The Preference Optimization Zoo | |
| After DPO, researchers developed many variants. Here's what each adds. | |
| ## 5.1 IPO — Identity Preference Optimization | |
| **Problem:** DPO can overfit, especially with noisy preferences. | |
| **Solution:** Adds regularization without assuming the Bradley-Terry model: | |
| ``` | |
| L_IPO = E[(log(π_θ(y_w)/π_ref(y_w)) - log(π_θ(y_l)/π_ref(y_l)) - 1/(2β))²] | |
| ``` | |
| **When to use:** Noisy preference data, DPO overfitting. | |
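To see how the squared loss differs from DPO, here is a sketch for one pair, using the same summed log-probabilities as inputs. The numbers are invented and `ipo_loss` is a hypothetical helper.

```python
def ipo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # h: how much more the policy favors chosen over rejected than the reference does.
    h = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    target = 1.0 / (2.0 * beta)
    return (h - target) ** 2

# With beta = 0.1 the target gap is 5.
print(ipo_loss(-10.0, -10.0, -10.0, -10.0))  # gap 0: loss 25
print(ipo_loss(-5.0, -10.0, -10.0, -10.0))   # gap 5: loss 0
print(ipo_loss(-2.0, -14.0, -12.0, -14.0))   # gap 10: loss 25 again
```

Overshooting the target is penalized as much as undershooting it. DPO's log-sigmoid, by contrast, keeps rewarding an ever larger gap, which is where its overfitting comes from.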
| ## 5.2 KTO — Kahneman-Tversky Optimization | |
| 📄 [arXiv:2402.01306](https://arxiv.org/abs/2402.01306) | |
| **Problem:** DPO needs *paired* preferences (chosen AND rejected per prompt). Expensive to collect. | |
| **Solution:** Works with **unpaired binary feedback** — just "good" or "bad" per response. Based on prospect theory: losses hurt more than equivalent gains. | |
| ```python | |
| {"prompt": "...", "completion": "...", "label": True} # 👍 | |
| {"prompt": "...", "completion": "...", "label": False} # 👎 | |
| ``` | |
| **When to use:** Thumbs-up/down feedback, no pairwise comparisons. | |
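If you already hold paired data, one simple way to try KTO on it is to split each pair into two labeled rows. The helper below is a hypothetical sketch of that conversion, not a library function.

```python
def unpair_preferences(paired_rows):
    # Turn each {prompt, chosen, rejected} row into two
    # {prompt, completion, label} rows in the unpaired format.
    unpaired = []
    for row in paired_rows:
        unpaired.append(
            {"prompt": row["prompt"], "completion": row["chosen"], "label": True}
        )
        unpaired.append(
            {"prompt": row["prompt"], "completion": row["rejected"], "label": False}
        )
    return unpaired

paired = [
    {
        "prompt": "Explain gravity",
        "chosen": "Gravity is a fundamental force...",
        "rejected": "Gravity is when things fall down.",
    }
]
for example in unpair_preferences(paired):
    print(example)
```

Going the other way is not possible in general, since thumbs-up and thumbs-down feedback rarely covers two responses to the same prompt.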
| ## 5.3 ORPO — Odds Ratio Preference Optimization | |
| **Problem:** DPO needs a separate SFT stage + reference model. | |
| **Solution:** Combines SFT and preference optimization in **one step**: | |
| ``` | |
| L_ORPO = L_SFT + λ · L_OR | |
| ``` | |
| **When to use:** Simplest possible pipeline. | |
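A rough sketch of the two terms, assuming the odds-ratio form from the ORPO paper as I understand it: `odds = p / (1 - p)`, with `p` the length-normalized likelihood of a response, and `L_OR = -log σ(log odds(chosen) - log odds(rejected))`. The helper names, the inputs, and λ = 0.1 are illustrative assumptions.

```python
import math

def log_odds(avg_logprob):
    # p = exp(average token log-probability); requires avg_logprob < 0.
    p = math.exp(avg_logprob)
    return avg_logprob - math.log1p(-p)  # log(p) - log(1 - p)

def orpo_loss(chosen_avg_lp, rejected_avg_lp, lam=0.1):
    sft_term = -chosen_avg_lp  # NLL of the chosen response
    log_odds_ratio = log_odds(chosen_avg_lp) - log_odds(rejected_avg_lp)
    or_term = math.log1p(math.exp(-log_odds_ratio))  # -log sigmoid(...)
    return sft_term + lam * or_term, sft_term, or_term

total, sft_term, or_term = orpo_loss(-0.5, -1.5)
print(f"SFT term {sft_term:.3f} + 0.1 * OR term {or_term:.3f} = {total:.3f}")
```

No reference model appears anywhere: the SFT term anchors the model to the chosen responses while the odds-ratio term pushes the rejected ones down.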
| ## 5.4 SimPO — Simple Preference Optimization | |
| **Problem:** DPO needs a reference model in memory (doubles GPU needs). | |
| **Solution:** Eliminates reference model. Uses **average log probability** as the implicit reward (length-normalized). | |
| **When to use:** GPU memory constraints. | |
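The length normalization is easy to see in code. This sketch assumes the SimPO form with a target margin γ between the two rewards. The values β = 2.0 and γ = 0.5 are placeholders, and the helper names are hypothetical.

```python
import math

def simpo_reward(token_logprobs, beta=2.0):
    # Implicit reward: beta times the AVERAGE token log-probability.
    return beta * sum(token_logprobs) / len(token_logprobs)

def simpo_loss(chosen_token_lps, rejected_token_lps, beta=2.0, gamma=0.5):
    margin = (simpo_reward(chosen_token_lps, beta)
              - simpo_reward(rejected_token_lps, beta)
              - gamma)
    return math.log1p(math.exp(-margin))  # -log sigmoid(margin)

short = [-1.0] * 3
long = [-1.0] * 30
print("summed log-probs:", sum(short), sum(long))                # long looks far worse
print("SimPO rewards:   ", simpo_reward(short), simpo_reward(long))  # identical
print("loss:", simpo_loss([-0.5, -0.5], [-1.5] * 4))
```

With summed log-probabilities a long response is penalized just for being long; averaging removes that bias and requires no reference model.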
| ## 5.5 CPO — Contrastive Preference Optimization | |
| Removes reference model using a contrastive loss. Similar to SimPO, different formulation. | |
| ## 5.6 Online DPO | |
| **Problem:** Standard DPO uses a static dataset — data goes stale as the model changes. | |
| **Solution:** Generates new completions from the *current* model during training, scored by a reward model. | |
| ## 5.7 Comparison Table | |
| | Method | Ref Model? | Paired Data? | Needs RM? | Separate SFT? | Key Advantage | | |
| |--------|-----------|-------------|-----------|---------------|---------------| | |
| | PPO | Yes | No | **Yes** | Yes | Gold standard | | |
| | DPO | Yes | **Yes** | No | Yes | Simple, stable | | |
| | IPO | Yes | Yes | No | Yes | Robust to noise | | |
| | KTO | Yes | **No** | No | Yes | Binary feedback | | |
| | ORPO | **No** | Yes | No | **No** | Simplest pipeline | | |
| | SimPO | **No** | Yes | No | Yes | Memory efficient | | |
| | CPO | **No** | Yes | No | Yes | Memory efficient | | |
| | Online DPO | Yes | Online | **Yes** | Yes | Fresh data | | |
| | GRPO | Yes (soft) | No | **Yes** / funcs | Yes | Best for reasoning | | |
| """ | |
| CH6_GRPO = r""" | |
| # Chapter 6: GRPO and the Reasoning Revolution | |
| ## 6.1 What is GRPO? | |
| 📄 *"DeepSeekMath"* (Shao et al., 2024) — [arXiv:2402.03300](https://arxiv.org/abs/2402.03300) | |
| **Group Relative Policy Optimization** — a PPO variant that's more memory-efficient and great for reasoning. | |
| **Key idea:** Instead of a separate value model (critic), GRPO generates **multiple completions per prompt** and uses the group average as the baseline. | |
| ## 6.2 How GRPO Works | |
| ``` | |
| For each prompt: | |
| 1. Generate G completions (e.g., G=16) | |
| 2. Score each with a reward function | |
| 3. Compute advantage: Â_i = (r_i - mean(r)) / std(r) | |
| 4. Update model: ↑ probability of high-advantage completions | |
| ↓ probability of low-advantage completions | |
| ``` | |
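Step 3 can be sketched as follows. The small `eps` in the denominator and the choice of population standard deviation are assumptions added to keep the example well-defined when every completion gets the same reward. The group here has 4 completions instead of 16 for readability.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    # Normalize each reward against the other completions for the SAME prompt.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical 0/1 accuracy rewards for a group of 4 completions.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every completion is right (or every one is wrong), there is nothing to learn from.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The group mean plays the role that the value model plays in PPO, which is where the memory saving comes from.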
| **The GRPO loss:** | |
| ``` | |
| L = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)] + β · KL | |
| ``` | |
| "Group Relative" = advantage computed *relative to the group* for the same prompt. | |
| ## 6.3 The DeepSeek-R1 Story | |
| 📄 *"DeepSeek-R1"* (DeepSeek-AI, 2025) — [arXiv:2501.12948](https://arxiv.org/abs/2501.12948) | |
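| GRPO was used to train DeepSeek-R1. Its precursor, **DeepSeek-R1-Zero**, learned chain-of-thought reasoning **purely through RL** on the base model, with no SFT stage; DeepSeek-R1 itself adds a small cold-start SFT set before RL. | |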
| With the right reward (accuracy on math/code), the model **spontaneously developed:** | |
| - Chain-of-thought reasoning | |
| - Self-verification ("Let me check...") | |
| - Error correction | |
| - Problem decomposition | |
| No one explicitly taught these behaviors — they emerged from the reward signal. | |
| ## 6.4 Reward Functions | |
| GRPO is flexible — rewards can be: | |
| 1. **Python functions** (rule-based): Is the math answer correct? Does the code pass tests? | |
| 2. **Reward models** (learned): A neural network scoring responses | |
| 3. **Multiple functions**: accuracy + format + length | |
| ```python | |
| import re | |
| def accuracy_reward(completions, ground_truth, **kwargs): | |
|     # 1.0 when the text inside \boxed{...} equals the reference answer exactly | |
|     matches = [re.search(r"\\boxed\{(.*?)\}", c) for c in completions] | |
|     contents = [m.group(1) if m else "" for m in matches] | |
|     return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)] | |
| def format_reward(completions, **kwargs): | |
|     # 1.0 when the completion starts with <think>...</think> followed by <answer>...</answer> | |
|     pattern = r"<think>.*?</think>.*<answer>.*?</answer>" | |
|     return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions] | |
| ``` | |
| ## 6.5 GRPO Code Example | |
| ```python | |
| from trl import GRPOTrainer, GRPOConfig | |
| from datasets import load_dataset | |
| from trl.rewards import accuracy_reward | |
| dataset = load_dataset("trl-lib/DeepMath-103K", split="train") | |
| config = GRPOConfig( | |
| output_dir="./grpo-output", | |
| learning_rate=1e-6, | |
| per_device_train_batch_size=16, # effective batch must be divisible by num_generations | |
| num_generations=16, # G completions per prompt | |
| max_completion_length=512, | |
| bf16=True, | |
| gradient_checkpointing=True, | |
| ) | |
| trainer = GRPOTrainer( | |
| model="Qwen/Qwen2.5-0.5B-Instruct", | |
| reward_funcs=accuracy_reward, | |
| args=config, | |
| train_dataset=dataset, | |
| ) | |
| trainer.train() | |
| ``` | |
| """ | |
| CH7_PEFT = r""" | |
| # Chapter 7: Parameter-Efficient Fine-Tuning (PEFT) | |
| ## 7.1 The Memory Problem | |
| Fine-tuning 7B parameters requires: | |
| | Component | Memory | | |
| |-----------|--------| | |
| | Model weights (bf16) | 14 GB | | |
| | Gradients | 14 GB | | |
| | Optimizer states (AdamW) | 28 GB | | |
| | Activations | 10-30 GB | | |
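| | **Total** | **~66-86 GB** | |
| That already fills a single 80 GB A100 for SFT alone. RLHF with PPO holds 4 models at once (policy and value trainable, reference and reward frozen), so it needs several times more. | |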
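The table rows are simple arithmetic, sketched below. The byte counts are assumptions matching the table: bf16 weights and gradients at 2 bytes per parameter, and two AdamW moments at 2 bytes each. Keeping the optimizer states in fp32, as many setups do, would double the optimizer line. `full_finetune_memory_gb` is a hypothetical helper.

```python
def full_finetune_memory_gb(n_params_billion, bytes_weights=2, bytes_grads=2,
                            bytes_optimizer=4, activations_gb=(10, 30)):
    # 1 billion parameters at 1 byte each is about 1 GB (decimal).
    weights = n_params_billion * bytes_weights
    grads = n_params_billion * bytes_grads
    optimizer = n_params_billion * bytes_optimizer  # two AdamW moments
    fixed = weights + grads + optimizer
    return fixed + activations_gb[0], fixed + activations_gb[1]

low, high = full_finetune_memory_gb(7)
print(f"7B full fine-tuning: roughly {low}-{high} GB")

# Same model with fp32 optimizer states (8 bytes per parameter).
print(full_finetune_memory_gb(7, bytes_optimizer=8))
```

Activation memory is the loosest term: it scales with batch size and sequence length, and gradient checkpointing trades some of it for extra compute.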
| ## 7.2 LoRA: Low-Rank Adaptation | |
| 📄 [arXiv:2106.09685](https://arxiv.org/abs/2106.09685) | |
| **Insight:** Fine-tuning weight updates have low rank — approximate them with small matrices. | |
| ``` | |
| W' = W + (α/r) · B × A | |
| W: original frozen weight (d × d), NOT trained | |
| A: down projection (r × d), trained | |
| B: up projection (d × r), trained | |
| r: rank (typically 8-32) r << d | |
| ``` | |
| **Example:** 4096 × 4096 weight matrix: | |
| - Full fine-tuning: 4096 × 4096 = **16.8M** parameters | |
| - LoRA (r=16): 2 × 4096 × 16 = **131K** parameters (128× fewer!) | |
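Both the parameter count and the merged update can be checked with a few lines of plain Python. This is a toy sketch: it assumes the convention from the LoRA paper, where `A` has shape `r × d_in`, `B` has shape `d_out × r`, and the update is scaled by `α/r`. The matrices and helper names are invented for illustration.

```python
def lora_param_count(d_out, d_in, r):
    # A is (r x d_in), B is (d_out x r).
    return r * d_in + d_out * r

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merged_weight(W, A, B, alpha, r):
    # W' = W + (alpha / r) * B @ A
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

full = 4096 * 4096
lora = lora_param_count(4096, 4096, 16)
print(full, lora, full // lora)

# Toy 2x2 example with rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x d_in  = 1 x 2
B = [[0.5], [0.0]]  # d_out x r = 2 x 1
print(lora_merged_weight(W, A, B, alpha=2, r=1))
```

Because the update can be merged into `W` after training, a LoRA-tuned model adds no inference latency once the adapter is folded in.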
| ## 7.3 QLoRA: Quantized LoRA | |
| 📄 [arXiv:2305.14314](https://arxiv.org/abs/2305.14314) | |
| Quantize the frozen base model to **4-bit**, add LoRA adapters in bf16. | |
| - 7B model: ~4 GB (quantized) + small LoRA adapters | |
| - Fine-tune 7B on a single RTX 4090 (24 GB)! | |
| ## 7.4 LoRA with TRL | |
| ```python | |
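| from datasets import load_dataset | |
| from peft import LoraConfig | |
| from trl import SFTTrainer, SFTConfig | |
| dataset = load_dataset("trl-lib/Capybara", split="train") | |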
| lora_config = LoraConfig( | |
| r=16, # Rank | |
| lora_alpha=32, # Scaling (usually 2×r) | |
| lora_dropout=0.05, | |
| target_modules=["q_proj", "v_proj", "k_proj", "o_proj", | |
| "gate_proj", "up_proj", "down_proj"], | |
| task_type="CAUSAL_LM", | |
| ) | |
| config = SFTConfig( | |
| output_dir="./sft-lora", | |
| num_train_epochs=3, | |
| learning_rate=2e-4, # Higher LR for LoRA | |
| bf16=True, | |
| gradient_checkpointing=True, | |
| ) | |
| trainer = SFTTrainer( | |
| model="meta-llama/Llama-3.1-8B", | |
| args=config, | |
| train_dataset=dataset, | |
| peft_config=lora_config, # ← pass LoRA config | |
| ) | |
| trainer.train() | |
| ``` | |
| ## 7.5 When to Use What | |
| | Scenario | Recommendation | | |
| |----------|---------------| | |
| | Limited GPU memory | LoRA / QLoRA | | |
| | Quick experiment | LoRA | | |
| | Maximum quality | Full fine-tuning | | |
| | Multiple adapters from same base | LoRA (swap adapters) | | |
| | Very small dataset | LoRA (acts as regularizer) | | |
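| **In many reported comparisons LoRA comes close to full fine-tuning quality** at a fraction of the memory; the size of the gap depends on the task, the rank, and which modules you target. | |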
| """ | |
| CH8_TOOLBOX = r""" | |
| # Chapter 8: The Toolbox | |
| ## 8.1 TRL — Transformer Reinforcement Learning | |
| 🔗 [github.com/huggingface/trl](https://github.com/huggingface/trl) · [Docs](https://huggingface.co/docs/trl) | |
| The central library for all post-training methods: | |
| | Trainer | Method | Dataset Type | | |
| |---------|--------|-------------| | |
| | `SFTTrainer` | Supervised Fine-Tuning | `messages` or `prompt`+`completion` | | |
| | `DPOTrainer` | Direct Preference Optimization | `prompt`+`chosen`+`rejected` | | |
| | `GRPOTrainer` | Group Relative Policy Optimization | `prompt` only | | |
| | `RLOOTrainer` | REINFORCE Leave-One-Out | `prompt` only | | |
| | `RewardTrainer` | Reward Model Training | `prompt`+`chosen`+`rejected` | | |
| | `KTOTrainer` | Kahneman-Tversky Optimization | `prompt`+`completion`+`label` | | |
| | `ORPOTrainer` | Odds Ratio Preference Opt. | `prompt`+`chosen`+`rejected` | | |
| | `CPOTrainer` | Contrastive Preference Opt. | `prompt`+`chosen`+`rejected` | | |
| | `OnlineDPOTrainer` | Online DPO | `prompt` only | | |
| | `PPOTrainer` | Proximal Policy Optimization | tokenized | | |
| | `XPOTrainer` | Exploratory Preference Opt. | `prompt` only | | |
| | `NashMDTrainer` | Nash Mirror Descent | `prompt` only | | |
| | `PRMTrainer` | Process Reward Model | stepwise supervision | | |
| ## 8.2 Core Libraries | |
| | Library | Purpose | | |
| |---------|---------| | |
| | **transformers** | Model loading, tokenization, chat templates | | |
| | **datasets** | Dataset loading and processing | | |
| | **peft** | LoRA, QLoRA, adapters | | |
| | **accelerate** | Distributed training, DeepSpeed | | |
| | **bitsandbytes** | 4/8-bit quantization | | |
| ## 8.3 Inference & Serving | |
| | Tool | Purpose | | |
| |------|---------| | |
| | **vLLM** | High-throughput inference (5-10× faster generation) | | |
| | **TGI** | HuggingFace Text Generation Inference | | |
| | **Unsloth** | 2-5× faster LoRA training | | |
| TRL integrates vLLM directly: | |
| ```python | |
| config = GRPOConfig(use_vllm=True, vllm_mode="colocate") | |
| ``` | |
| ## 8.4 Experiment Tracking & Evaluation | |
| | Tool | Purpose | | |
| |------|---------| | |
| | **Weights & Biases** | Experiment tracking | | |
| | **Trackio** | HF-native tracking | | |
| | **lm-eval-harness** | Standardized LLM benchmarks | | |
| | **AlpacaEval** | GPT-4 based evaluation | | |
| ## 8.5 TRL CLI Commands | |
| ```bash | |
| # Install | |
| pip install trl | |
| # SFT from command line | |
| trl sft --model_name_or_path Qwen/Qwen3-0.6B \ | |
| --dataset_name trl-lib/Capybara --output_dir ./sft | |
| # DPO from command line | |
| trl dpo --model_name_or_path your-sft-model \ | |
| --dataset_name trl-lib/ultrafeedback_binarized --output_dir ./dpo | |
| # GRPO from command line | |
| trl grpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \ | |
| --dataset_name trl-lib/DeepMath-103K --output_dir ./grpo | |
| # Multi-GPU | |
| accelerate launch --num_processes 4 train.py | |
| # DeepSpeed ZeRO-3 | |
| accelerate launch --config_file deepspeed_zero3.yaml train.py | |
| ``` | |
| """ | |
| CH9_DATASETS = r""" | |
| # Chapter 9: Datasets — What to Train On | |
| ## 9.1 SFT Datasets | |
| | Dataset | Size | Description | | |
| |---------|------|-------------| | |
| | [**trl-lib/Capybara**](https://huggingface.co/datasets/trl-lib/Capybara) | ~16K | Multi-turn conversations, high quality | |
| | [**HuggingFaceH4/ultrachat_200k**](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) | 200K | Diverse multi-turn conversations | | |
| | [**allenai/tulu-3-sft-mixture**](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) | ~939K | Large-scale SFT mixture | |
| | [**OpenAssistant/oasst1**](https://huggingface.co/datasets/OpenAssistant/oasst1) | 161K | Crowdsourced conversation trees | | |
| | [**tatsu-lab/alpaca**](https://huggingface.co/datasets/tatsu-lab/alpaca) | 52K | GPT-generated instructions | | |
| | [**teknium/OpenHermes-2.5**](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M | Large synthetic instructions | | |
| ## 9.2 Preference Datasets (DPO / KTO / ORPO) | |
| | Dataset | Size | Description | | |
| |---------|------|-------------| | |
| | [**trl-lib/ultrafeedback_binarized**](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized) | 60K | Binarized UltraFeedback | | |
| | [**Anthropic/hh-rlhf**](https://huggingface.co/datasets/Anthropic/hh-rlhf) | 170K | Helpful + harmless preferences | | |
| | [**argilla/ultrafeedback-binarized-preferences**](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences) | 60K | Cleaned UltraFeedback | | |
| ## 9.3 Prompt-Only Datasets (GRPO / RLOO) | |
| | Dataset | Size | Description | | |
| |---------|------|-------------| | |
| | [**trl-lib/DeepMath-103K**](https://huggingface.co/datasets/trl-lib/DeepMath-103K) | 103K | Math with verifiable answers | | |
| | [**AI-MO/NuminaMath-TIR**](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) | ~70K | Competition math | | |
| ## 9.4 How to Choose | |
| 1. **First experiment:** `trl-lib/Capybara` (SFT) or `trl-lib/ultrafeedback_binarized` (DPO) | |
| 2. **Quality > Quantity:** LIMA: 1K great > 52K mediocre | |
| 3. **Match your use case:** Math model → math data, chat model → diverse conversations | |
| 4. **Always inspect first:** | |
| ```python | |
| from datasets import load_dataset | |
| ds = load_dataset("trl-lib/Capybara", split="train") | |
| print(ds[0]) # Look at the data! | |
| ``` | |
| """ | |
| CH10_EVALUATION = r""" | |
| # Chapter 10: Evaluation — How to Know If It Worked | |
| ## 10.1 Why Evaluation is Hard | |
| - Open-ended outputs can be correct in many ways | |
| - Perplexity is a poor proxy for usefulness (LIMA saw held-out perplexity rise while response quality kept improving) | |
| - Benchmark scores ≠ real-world performance | |
| - Human evaluation is expensive and subjective | |
| ## 10.2 Automated Benchmarks | |
| | Benchmark | Measures | How | | |
| |-----------|----------|-----| | |
| | **MMLU** | Knowledge (57 subjects) | Multiple choice | | |
| | **HellaSwag** | Commonsense reasoning | Sentence completion | | |
| | **ARC** | Science reasoning | Multiple choice | | |
| | **TruthfulQA** | Truthfulness | Trick questions | | |
| | **GSM8K** | Math reasoning | Grade-school word problems | | |
| | **MATH** | Advanced math | Competition-level | | |
| | **HumanEval** | Code generation | Python problems | | |
| | **MBPP** | Code generation | Basic Python | | |
| | **IFEval** | Instruction following | Verifiable constraints | | |
| **Run with lm-eval-harness:** | |
| ```bash | |
| lm_eval --model hf \ | |
| --model_args pretrained=your-model \ | |
| --tasks mmlu,gsm8k,hellaswag \ | |
| --batch_size 8 | |
| ``` | |
| ## 10.3 LLM-as-Judge | |
| | Evaluation | Description | | |
| |------------|-------------| | |
| | **AlpacaEval** | GPT-4 compares your model's responses against a reference model's responses and reports a win rate | |
| | **MT-Bench** | Multi-turn dialogue, GPT-4 judged | | |
| | **Arena Hard** | Challenging prompts, GPT-4 judged | | |
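
Judge-based evaluations like those above typically reduce many pairwise verdicts to a single win rate. The snippet below is a minimal sketch of that aggregation step only, not the scoring code of any specific benchmark. The `win_rate` helper, the `"win"` / `"tie"` / `"loss"` labels, and the half-credit rule for ties are assumptions made for illustration.

```python
from collections import Counter

def win_rate(verdicts):
    # Hypothetical helper: fraction of comparisons the candidate wins.
    # Assumption: a tie counts as half a win.
    counts = Counter(verdicts)
    total = counts["win"] + counts["tie"] + counts["loss"]
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts["win"] + 0.5 * counts["tie"]) / total

# Made-up judge verdicts for five prompts (candidate vs. reference).
verdicts = ["win", "loss", "win", "tie", "win"]
print(f"win rate: {win_rate(verdicts):.2f}")
```

In practice it helps to query the judge twice per prompt with the response order swapped, since LLM judges can favor whichever answer appears first.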
| ## 10.4 Human Evaluation | |
| - **Side-by-side:** Show two responses, pick the better one | |
| - **Likert scale:** Rate helpfulness, accuracy, harmlessness (1-7) | |
| - **Chatbot Arena:** [lmarena.ai](https://lmarena.ai/) — crowdsourced blind comparisons | |
| ## 10.5 The Open LLM Leaderboard | |
| 🔗 [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) | |
| Standardized evaluation of open-weight models on a fixed benchmark suite. It has been one of the main ways the community tracks progress. | |
| """ | |
| CH11_RECIPE = r""" | |
| # Chapter 11: Putting It All Together | |
| ## 11.1 The Standard Recipe (2024-2025) | |
| ``` | |
| Step 1: Choose Base Model | |
| ├── Qwen3 (0.6B–235B) — Top-performing | |
| ├── Llama 3.1/3.2 (1B–405B) — Meta | |
| ├── Gemma 3/4 (1B–27B) — Google | |
| └── Mistral/Mixtral — Efficient | |
| Step 2: SFT | |
| ├── Dataset: trl-lib/Capybara or ultrachat_200k | |
| ├── Method: SFTTrainer + LoRA (or full) | |
| ├── Epochs: 2-3, LR: 2e-5 (full) / 2e-4 (LoRA) | |
| └── Output → reference model for Step 3 | |
| Step 3: Preference Optimization | |
| ├── DPO (simplest): β=0.1, LR=5e-7, 1-2 epochs | |
| ├── GRPO (reasoning): 16 generations, LR=1e-6 | |
| └── KTO (binary feedback): unpaired data | |
| Step 4: Evaluation | |
| ├── lm-eval-harness (MMLU, GSM8K) | |
| ├── MT-Bench / AlpacaEval | |
| └── Manual testing | |
| ``` | |
| ## 11.2 Full Code: SFT → DPO | |
| ```python | |
| # ═══ STAGE 1: SFT ═══ | |
| from trl import SFTTrainer, SFTConfig | |
| from datasets import load_dataset | |
| sft_dataset = load_dataset("trl-lib/Capybara", split="train") | |
| sft_config = SFTConfig( | |
|     output_dir="./sft-model", | |
|     num_train_epochs=2, | |
|     per_device_train_batch_size=4, | |
|     learning_rate=2e-5, | |
|     max_length=2048,  # called max_seq_length in older TRL releases | |
|     bf16=True, | |
|     gradient_checkpointing=True, | |
|     push_to_hub=True, | |
|     hub_model_id="your-username/my-sft-model", | |
| ) | |
| sft_trainer = SFTTrainer(model="Qwen/Qwen3-0.6B", args=sft_config, | |
|                          train_dataset=sft_dataset) | |
| sft_trainer.train() | |
| sft_trainer.push_to_hub()  # upload the final weights so Stage 2 can load them | |
| # ═══ STAGE 2: DPO ═══ | |
| from trl import DPOTrainer, DPOConfig | |
| dpo_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train") | |
| dpo_config = DPOConfig( | |
|     output_dir="./dpo-model", | |
|     num_train_epochs=1, | |
|     per_device_train_batch_size=4, | |
|     learning_rate=5e-7, | |
|     beta=0.1,  # strength of the pull toward the reference (SFT) model | |
|     bf16=True, | |
|     gradient_checkpointing=True, | |
|     push_to_hub=True, | |
|     hub_model_id="your-username/my-dpo-model", | |
| ) | |
| # Start from the SFT checkpoint pushed in Stage 1 | |
| dpo_trainer = DPOTrainer(model="your-username/my-sft-model", args=dpo_config, | |
|                          train_dataset=dpo_dataset) | |
| dpo_trainer.train() | |
| dpo_trainer.push_to_hub() | |
| ``` | |
| ## 11.3 Hardware Guidelines | |
| | Model Size | Minimum | Recommended | With LoRA | | |
| |-----------|---------|-------------|-----------| | |
| | 0.5–3B | 1× A10G (24GB) | 1× A100 (80GB) | 1× T4 (16GB) | | |
| | 7–8B | 1× A100 (80GB) | 2× A100 | 1× A10G (24GB) | | |
| | 13B | 2× A100 | 4× A100 | 1× A100 (80GB) | | |
| | 70B | 4× A100 | 8× A100 | 2× A100 | | |
| """ | |
| CH12_READING_LIST = r""" | |
| # Chapter 12: The Reading List | |
| ## Tier 1: Must-Read (The Foundations) | |
| | # | Paper | Year | Why Read It | | |
| |---|-------|------|------------| | |
| | 1 | [**InstructGPT**](https://arxiv.org/abs/2203.02155) — *Training LMs to Follow Instructions with Human Feedback* | 2022 | The SFT→RM→PPO pipeline. Everything starts here. | | |
| | 2 | [**DPO**](https://arxiv.org/abs/2305.18290) — *Direct Preference Optimization* | 2023 | No reward model, no RL. Most widely used method. | | |
| | 3 | [**LoRA**](https://arxiv.org/abs/2106.09685) — *Low-Rank Adaptation of Large Language Models* | 2021 | Made fine-tuning accessible. Used everywhere. | | |
| | 4 | [**DeepSeek-R1**](https://arxiv.org/abs/2501.12948) — *Incentivizing Reasoning via RL* | 2025 | RL teaches reasoning from scratch. The reasoning era. | | |
| ## Tier 2: Important (Deeper Understanding) | |
| | # | Paper | Year | Why Read It | | |
| |---|-------|------|------------| | |
| | 5 | [**LIMA**](https://arxiv.org/abs/2305.11206) — *Less Is More for Alignment* | 2023 | Data quality >> quantity. | | |
| | 6 | [**Constitutional AI**](https://arxiv.org/abs/2212.08073) — *Harmlessness from AI Feedback* | 2022 | AI feedback replaces human feedback (RLAIF). | | |
| | 7 | [**FLAN**](https://arxiv.org/abs/2109.01652) — *Finetuned LMs Are Zero-Shot Learners* | 2021 | Proved instruction tuning works. | | |
| | 8 | [**Self-Instruct**](https://arxiv.org/abs/2212.10560) — *Aligning LMs with Self-Generated Instructions* | 2022 | Synthetic data for SFT. Led to Alpaca. | | |
| | 9 | [**DeepSeekMath**](https://arxiv.org/abs/2402.03300) — *Pushing the Limits of Mathematical Reasoning* | 2024 | Introduced GRPO. | | |
| | 10 | [**QLoRA**](https://arxiv.org/abs/2305.14314) — *Efficient Finetuning of Quantized LMs* | 2023 | 7B fine-tuning on consumer GPUs. | | |
| ## Tier 3: Advanced (Cutting Edge) | |
| | # | Paper | Year | | |
| |---|-------|------| | |
| | 11 | [**KTO**](https://arxiv.org/abs/2402.01306) — *Prospect Theoretic Optimization* | 2024 | | |
| | 12 | [**ORPO**](https://arxiv.org/abs/2403.07691) — *Monolithic Preference Optimization* | 2024 | | |
| | 13 | [**SimPO**](https://arxiv.org/abs/2405.14734) — *Reference-Free Reward* | 2024 | | |
| | 14 | **Tulu 3** — *Open Language Model Post-Training* | 2024 | | |
| | 15 | [**Zephyr**](https://arxiv.org/abs/2310.16944) — *Direct Distillation of LM Alignment* | 2023 | | |
| ## Tier 4: RL Foundations | |
| | # | Paper | Year | | |
| |---|-------|------| | |
| | 16 | [**PPO**](https://arxiv.org/abs/1707.06347) — *Proximal Policy Optimization* | 2017 | | |
| | 17 | [**Learning to Summarize from Human Feedback**](https://arxiv.org/abs/2009.01325) | 2020 | | |
| | 18 | [**Fine-Tuning LMs from Human Preferences**](https://arxiv.org/abs/1909.08593) | 2019 | | |
| """ | |
| GLOSSARY = r""" | |
| # Glossary & Quick Reference | |
| ## Key Terms | |
| | Term | Definition | | |
| |------|-----------| | |
| | **Alignment** | Making a model behave according to human intentions and values | | |
| | **RLHF** | Reinforcement Learning from Human Feedback | | |
| | **RLAIF** | RL from AI Feedback (AI replaces human annotators) | | |
| | **SFT** | Supervised Fine-Tuning on instruction-response pairs | | |
| | **DPO** | Direct Preference Optimization — no reward model needed | | |
| | **GRPO** | Group Relative Policy Optimization — critic-free RL that scores each sampled response relative to the others in its group; widely used for reasoning | |
| | **PPO** | Proximal Policy Optimization — the RL algorithm in RLHF | | |
| | **Reward Model** | Scores responses based on human preferences | | |
| | **Policy** | The language model being trained (RL terminology) | | |
| | **Reference Model (π_ref)** | Frozen copy of the SFT model that the policy is kept close to during RL or DPO | |
| | **KL Divergence** | Measures how far the policy drifts from the reference | | |
| | **Bradley-Terry** | Preference model: P(A>B) = σ(score(A) - score(B)) | | |
| | **Reward Hacking** | Model exploits the reward model instead of improving | | |
| | **LoRA** | Low-Rank Adaptation — parameter-efficient fine-tuning | | |
| | **QLoRA** | 4-bit quantized base model + LoRA adapters | | |
| | **Chat Template** | Format for structuring conversations (special tokens + roles) | | |
| | **On-policy** | Training on data from the *current* model | | |
| | **Off-policy** | Training on data from a *different* model | | |
| | **Advantage** | How much better an action is vs. the expected value | | |
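
Two of the definitions above are easier to grasp with numbers. The sketch below evaluates the Bradley-Terry formula from the table and computes a group-relative advantage in the style of GRPO. The function names, the example scores, and the choice to standardize rewards by the group's mean and standard deviation are assumptions made for illustration, not code from any library.

```python
import math
from statistics import mean, pstdev

def bradley_terry_prob(score_a, score_b):
    # P(A > B) = sigmoid(score(A) - score(B))
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def group_relative_advantages(rewards):
    # Assumption: each reward is standardized within its own group of samples.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so none is better than the group average.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Made-up reward-model scores for two responses to the same prompt.
print(bradley_terry_prob(2.0, 0.0))

# Made-up 0/1 correctness rewards for four samples of one prompt.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Equal scores give a preference probability of exactly 0.5, and the advantages within a group always sum to zero, so above-average responses are pushed up only as much as below-average ones are pushed down.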
| ## Dataset Formats by Trainer | |
| ```python | |
| # SFT | |
| {"messages": [{"role": "user", "content": "..."}, | |
| {"role": "assistant", "content": "..."}]} | |
| # DPO / ORPO / CPO | |
| {"prompt": "...", "chosen": "...", "rejected": "..."} | |
| # GRPO / RLOO / Online DPO | |
| {"prompt": "..."} | |
| # KTO | |
| {"prompt": "...", "completion": "...", "label": True} | |
| # Reward Model | |
| {"prompt": "...", "chosen": "...", "rejected": "..."} | |
| # PRM (Process Reward Model) | |
| {"prompt": "...", "completions": ["step1", "step2"], | |
| "labels": [True, False]} | |
| ``` | |
| """ | |
| SEARCH_CONTENT = { | |
| "Overview & Roadmap": OVERVIEW, | |
| "Ch 1: The Big Picture": CH1_BIG_PICTURE, | |
| "Ch 2: SFT": CH2_SFT, | |
| "Ch 3: RLHF": CH3_RLHF, | |
| "Ch 4: DPO": CH4_DPO, | |
| "Ch 5: Preference Zoo": CH5_PREFERENCE_ZOO, | |
| "Ch 6: GRPO & Reasoning": CH6_GRPO, | |
| "Ch 7: PEFT (LoRA)": CH7_PEFT, | |
| "Ch 8: Toolbox": CH8_TOOLBOX, | |
| "Ch 9: Datasets": CH9_DATASETS, | |
| "Ch 10: Evaluation": CH10_EVALUATION, | |
| "Ch 11: Full Recipe": CH11_RECIPE, | |
| "Ch 12: Reading List": CH12_READING_LIST, | |
| "Glossary & Reference": GLOSSARY, | |
| } | |
| def search_guide(query): | |
|     if not query or not query.strip(): | |
|         return "Type a keyword to search across all chapters (e.g., *DPO*, *reward*, *LoRA*, *dataset*)." | |
|     query_lower = query.lower().strip() | |
|     results = [] | |
|     for title, content in SEARCH_CONTENT.items(): | |
|         lines = content.split("\n") | |
|         for i, line in enumerate(lines): | |
|             if query_lower in line.lower(): | |
|                 start = max(0, i - 1) | |
|                 end = min(len(lines), i + 4) | |
|                 snippet = "\n".join(lines[start:end]) | |
|                 results.append(f"### 📍 {title}\n\n{snippet}\n\n---\n") | |
|                 break  # one match per chapter | |
|     if results: | |
|         return f"## Found in {len(results)} chapter(s)\n\n" + "\n".join(results) | |
|     return f"No results for **'{query}'**. Try: *SFT*, *DPO*, *GRPO*, *LoRA*, *reward*, *PPO*, *dataset*, *evaluation*." | |
| # ============================================================ | |
| # GRADIO UI | |
| # ============================================================ | |
| CUSTOM_CSS = """ | |
| .chapter-content { | |
| max-width: 900px; | |
| margin: 0 auto; | |
| font-size: 16px; | |
| line-height: 1.7; | |
| } | |
| .gradio-container { | |
| max-width: 1100px !important; | |
| } | |
| footer { display: none !important; } | |
| """ | |
| THEME = gr.themes.Soft( | |
| primary_hue="blue", | |
| secondary_hue="slate", | |
| font=[gr.themes.GoogleFont("Inter"), "system-ui", "sans-serif"], | |
| ) | |
| with gr.Blocks(title="Post-Training Guide for LLMs") as demo: | |
|     gr.HTML(""" | |
|     <div style="text-align:center; padding: 20px 0 5px 0;"> | |
|     <h1 style="margin-bottom:4px;">📖 The Complete Guide to Post-Training of LLMs</h1> | |
|     <p style="color:#666; font-size:17px; margin-top:0;"> | |
|     From Pretraining to Alignment — SFT · RLHF · DPO · GRPO · LoRA and more | |
|     </p> | |
|     </div> | |
|     """) | |
|     with gr.Tabs(): | |
|         with gr.Tab("🏠 Overview"): | |
|             gr.Markdown(OVERVIEW, elem_classes="chapter-content") | |
|         with gr.Tab("1️⃣ Big Picture"): | |
|             gr.Markdown(CH1_BIG_PICTURE, elem_classes="chapter-content") | |
|         with gr.Tab("2️⃣ SFT"): | |
|             gr.Markdown(CH2_SFT, elem_classes="chapter-content") | |
|         with gr.Tab("3️⃣ RLHF"): | |
|             gr.Markdown(CH3_RLHF, elem_classes="chapter-content") | |
|         with gr.Tab("4️⃣ DPO"): | |
|             gr.Markdown(CH4_DPO, elem_classes="chapter-content") | |
|         with gr.Tab("5️⃣ Preference Zoo"): | |
|             gr.Markdown(CH5_PREFERENCE_ZOO, elem_classes="chapter-content") | |
|         with gr.Tab("6️⃣ GRPO"): | |
|             gr.Markdown(CH6_GRPO, elem_classes="chapter-content") | |
|         with gr.Tab("7️⃣ PEFT"): | |
|             gr.Markdown(CH7_PEFT, elem_classes="chapter-content") | |
|         with gr.Tab("8️⃣ Toolbox"): | |
|             gr.Markdown(CH8_TOOLBOX, elem_classes="chapter-content") | |
|         with gr.Tab("9️⃣ Datasets"): | |
|             gr.Markdown(CH9_DATASETS, elem_classes="chapter-content") | |
|         with gr.Tab("🔟 Evaluation"): | |
|             gr.Markdown(CH10_EVALUATION, elem_classes="chapter-content") | |
|         with gr.Tab("📋 Full Recipe"): | |
|             gr.Markdown(CH11_RECIPE, elem_classes="chapter-content") | |
|         with gr.Tab("📚 Reading List"): | |
|             gr.Markdown(CH12_READING_LIST, elem_classes="chapter-content") | |
|         with gr.Tab("📖 Glossary"): | |
|             gr.Markdown(GLOSSARY, elem_classes="chapter-content") | |
|         with gr.Tab("🔍 Search"): | |
|             gr.Markdown("## Search the Guide\nFind any topic across all chapters.") | |
|             search_input = gr.Textbox( | |
|                 placeholder="Type a keyword (e.g. DPO, reward, LoRA, dataset, PPO)...", | |
|                 label="Search", | |
|                 lines=1, | |
|             ) | |
|             search_output = gr.Markdown(elem_classes="chapter-content") | |
|             search_input.submit(search_guide, inputs=search_input, outputs=search_output) | |
|             search_input.change(search_guide, inputs=search_input, outputs=search_output) | |
|     gr.HTML(""" | |
|     <div style="text-align:center; padding:15px; color:#888; font-size:13px; border-top:1px solid #eee; margin-top:20px;"> | |
|     Built from primary research papers & official HF documentation · All code uses TRL v1.2+ APIs · | |
|     <a href="https://huggingface.co/docs/trl" target="_blank">TRL Docs</a> · | |
|     <a href="https://github.com/huggingface/trl" target="_blank">TRL GitHub</a> · | |
|     Last updated April 2026 | |
|     </div> | |
|     """) | |
| demo.launch(theme=THEME, css=CUSTOM_CSS) | |