# The Complete Guide to Post-Training of Large Language Models
### From Pretraining to Alignment: Everything You Need to Know
---
> **Who is this for?** You've learned how pretraining works: you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens *after* pretraining, that is, how raw language models become helpful assistants like ChatGPT, Claude, and Gemini. This guide takes you from zero knowledge of post-training to a deep understanding of every major method, with pointers to the key papers, tools, and code.
---
## Table of Contents
1. [The Big Picture: Why Post-Training Exists](#chapter-1-the-big-picture--why-post-training-exists)
2. [Supervised Fine-Tuning (SFT): Teaching Models to Follow Instructions](#chapter-2-supervised-fine-tuning-sft--teaching-models-to-follow-instructions)
3. [Reinforcement Learning from Human Feedback (RLHF): The Breakthrough](#chapter-3-reinforcement-learning-from-human-feedback-rlhf--the-breakthrough)
4. [Direct Preference Optimization (DPO): RLHF Without RL](#chapter-4-direct-preference-optimization-dpo--rlhf-without-rl)
5. [The Preference Optimization Zoo: KTO, ORPO, SimPO, CPO, and More](#chapter-5-the-preference-optimization-zoo)
6. [GRPO and the Reasoning Revolution: DeepSeek-R1 and Beyond](#chapter-6-grpo-and-the-reasoning-revolution)
7. [Parameter-Efficient Fine-Tuning: LoRA, QLoRA, and Adapters](#chapter-7-parameter-efficient-fine-tuning-peft--lora-qlora-and-adapters)
8. [The Toolbox: Libraries, Frameworks, and Infrastructure](#chapter-8-the-toolbox--libraries-frameworks-and-infrastructure)
9. [Datasets: What to Train On](#chapter-9-datasets--what-to-train-on)
10. [Evaluation: How to Know If It Worked](#chapter-10-evaluation--how-to-know-if-it-worked)
11. [Putting It All Together: A Complete Post-Training Recipe](#chapter-11-putting-it-all-together--a-complete-post-training-recipe)
12. [The Reading List: Papers Every Practitioner Should Read](#chapter-12-the-reading-list--papers-every-practitioner-should-read)
---
## Chapter 1: The Big Picture – Why Post-Training Exists
### 1.1 The Gap Between Pretraining and Usefulness
You've pretrained a language model. It can predict the next token with impressive accuracy. It has absorbed vast knowledge from the internet. But try asking it a question:
```
User: What is the capital of France?
Model: What is the capital of Germany? What is the capital of Italy? What is the...
```
The model doesn't *answer*; it *continues*. That's because the pretraining objective (`P(next_token | context)`) optimizes for predicting what comes next in web text, not for being helpful. On the web, a question is often followed by more questions (quizzes, FAQs, forum threads), so another question is a perfectly plausible continuation.
This is the **alignment problem** in its simplest form. As the InstructGPT paper (Ouyang et al., 2022) put it:
> *"Large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users."*
### 1.2 The Three Stages of Post-Training
Post-training is everything that happens after pretraining to make a model useful, safe, and aligned with human intent. The modern post-training pipeline, established by OpenAI's InstructGPT (2022), has three stages:
```
┌──────────────┐      ┌───────────────────┐      ┌──────────────────────────┐
│ STAGE 1: SFT │ ──>  │ STAGE 2: Reward   │ ──>  │ STAGE 3: RL              │
│              │      │ Model Training    │      │ (PPO / DPO / GRPO)       │
│ Teach format │      │ Learn preferences │      │ Optimize for preferences │
│ & behavior   │      │ from comparisons  │      │ while staying close to   │
│              │      │                   │      │ the SFT model            │
└──────────────┘      └───────────────────┘      └──────────────────────────┘
Input: Pretrained LM                                  Output: Aligned Assistant
```
**Stage 1 – Supervised Fine-Tuning (SFT):** Teach the model the *format* of helpful responses using human-written demonstrations. Input: instructions. Output: high-quality responses.
**Stage 2 – Reward Modeling:** Train a separate model to predict which of two responses a human would prefer. This "reward model" captures human preferences as a scalar score.
**Stage 3 – Reinforcement Learning:** Use the reward model to further improve the SFT model. The model generates responses, gets scored by the reward model, and updates its parameters to produce higher-scoring responses.
> **Key insight from LIMA (Zhou et al., 2023):** *"A model's knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users."* This is called the **Superficial Alignment Hypothesis**: under it, post-training mostly doesn't teach new knowledge; it teaches the model to *surface existing knowledge in the right way*.
### 1.3 The Evolution: From RLHF to Modern Methods
The field has evolved rapidly:
| Year | Method | Key Idea | Paper |
|------|--------|----------|-------|
| 2017 | RLHF (original) | Use human preferences to train reward model, optimize with RL | Christiano et al. |
| 2020 | RLHF for LLMs | Apply RLHF to text summarization | Stiennon et al. |
| 2022 | **InstructGPT** | Full SFT → RM → PPO pipeline for general LLMs | Ouyang et al. |
| 2022 | Constitutional AI | Use AI feedback instead of human feedback (RLAIF) | Bai et al. (Anthropic) |
| 2023 | **DPO** | Eliminate the separate reward model and train directly on preferences | Rafailov et al. |
| 2024 | KTO | Train on binary feedback (good/bad) instead of pairwise | Ethayarajh et al. |
| 2024 | ORPO | Combine SFT and preference optimization in one step | Hong et al. |
| 2024 | **GRPO** | Group-based RL for mathematical reasoning (DeepSeek) | Shao et al. |
| 2025 | DeepSeek-R1 | GRPO to teach models to "think" (chain-of-thought via RL) | DeepSeek-AI |
---
## Chapter 2: Supervised Fine-Tuning (SFT) – Teaching Models to Follow Instructions
### 2.1 What SFT Does
SFT is the bridge between a pretrained language model and a useful assistant. It takes a model that predicts web text and teaches it to respond helpfully to instructions.
**Before SFT:**
```
Input: "Explain quantum computing in simple terms."
Output: "Explain quantum computing to a 5-year-old. Explain quantum computing..."
```
**After SFT:**
```
Input: "Explain quantum computing in simple terms."
Output: "Quantum computing uses the principles of quantum mechanics to process
information. Unlike classical computers that use bits (0 or 1),
quantum computers use qubits that can be both 0 and 1 simultaneously..."
```
### 2.2 The SFT Loss Function
If you understand the pretraining loss, you already understand SFT, with one crucial difference.
**Pretraining loss** (next-token prediction on everything):
```
L_pretrain = -Σ log P(token_i | token_1, ..., token_{i-1})
             for ALL tokens in the sequence
```
**SFT loss** (next-token prediction on the *response* only):
```
L_SFT = -Σ log P(c_i | prompt_tokens, c_1, ..., c_{i-1})
        for ONLY the completion/response tokens
```
The prompt tokens are fed into the model but **masked from the loss computation**. This is important: we don't want the model to learn to *generate* instructions; we want it to learn to *respond* to them.
```
Sequence: [User: What is 2+2?] [Assistant: 4]
Loss mask: [ ----IGNORED---- ] [COMPUTED HERE ]
```
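
One way to see the masking concretely is to compute the loss by hand on a toy sequence. The sketch below is a minimal pure-Python illustration, not any library's implementation: `build_labels` and `masked_nll` are hypothetical helper names, the per-token log-probabilities are made up, and the label value `-100` follows the ignore-index convention used by PyTorch and Hugging Face. It averages over the kept tokens, as training frameworks commonly do, rather than summing.

```python
import math

IGNORE_INDEX = -100  # label value marking positions excluded from the loss


def build_labels(token_ids, prompt_len):
    """Copy token ids into labels, masking the first `prompt_len` positions."""
    return [IGNORE_INDEX] * prompt_len + list(token_ids[prompt_len:])


def masked_nll(token_logprobs, labels):
    """Average negative log-likelihood over the unmasked positions only.

    token_logprobs[i] stands for log P(token_i | previous tokens).
    """
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Toy sequence: 4 prompt tokens followed by 2 response tokens (made-up values).
token_ids = [11, 12, 13, 14, 21, 22]
token_logprobs = [math.log(p) for p in (0.1, 0.2, 0.3, 0.4, 0.5, 0.25)]

labels = build_labels(token_ids, prompt_len=4)
sft_loss = masked_nll(token_logprobs, labels)          # response tokens only
pretrain_loss = masked_nll(token_logprobs, token_ids)  # nothing masked
```

Only the last two positions contribute to `sft_loss`; the prompt positions still condition the model's predictions but generate no gradient signal of their own.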
### 2.3 Data Formats for SFT
Modern SFT uses **chat-formatted data**, meaning structured conversations with roles:
```python
# The standard format: a list of messages with roles
{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
    ]
}
```
This gets converted to text by a **chat template**, a specific text format that the model learns to recognize:
```
# ChatML format (used by many models):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
# Llama-3 format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris.<|eot_id|>
```
Each model family has its own template. The `transformers` library handles this automatically:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"}
]
# For training (complete conversation):
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# For inference (prompt the model to start generating):
text = tokenizer.apply_chat_template(messages[:1], tokenize=False, add_generation_prompt=True)
```
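
To make the role of `add_generation_prompt` less abstract, here is a hand-rolled sketch of a ChatML-style renderer. It is an illustration only: `to_chatml` is a hypothetical helper, and real templates differ between model families in whitespace and special tokens, so in practice the tokenizer's own `apply_chat_template` should be used.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as ChatML-style text."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]

# Training: render the complete conversation.
training_text = to_chatml(messages)
# Inference: render only the user turn and leave an open assistant turn.
inference_prompt = to_chatml(messages[:1], add_generation_prompt=True)
```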
### 2.4 The Key SFT Papers
#### FLAN (2021) – Instruction Tuning at Scale
**Paper:** *"Finetuned Language Models Are Zero-Shot Learners"* (Wei et al., 2021), [arXiv:2109.01652](https://arxiv.org/abs/2109.01652)
FLAN showed that fine-tuning on instructions substantially improves zero-shot performance on unseen tasks. The authors took 62 NLP datasets, formatted them as instructions, and fine-tuned LaMDA-PT 137B.
**Key result:** FLAN surpassed zero-shot GPT-3 175B on 20 of the 25 datasets evaluated.
**Key insight:** The instruction format itself is critical: fine-tuning on the same tasks *without* instructions gave much weaker results.
**Recipe:** Adafactor optimizer, lr=3e-5, 30K steps, batch size 8192 tokens, input length 1024, target length 256.
#### Self-Instruct (2022) – Bootstrapping Training Data
**Paper:** *"Self-Instruct: Aligning Language Models with Self-Generated Instructions"* (Wang et al., 2022), [arXiv:2212.10560](https://arxiv.org/abs/2212.10560)
A breakthrough idea: use the language model itself to generate training data. Starting from 175 seed tasks, GPT-3 generated 52,445 instructions with responses.
**Key result:** A 33% absolute improvement over vanilla GPT-3 on Super-NaturalInstructions.
**Key insight:** This work helped launch the use of synthetic data for SFT. It led directly to Stanford Alpaca (fine-tuning LLaMA on 52K GPT-generated instructions for <$600).
#### InstructGPT (2022) – SFT as Stage 1
**Paper:** *"Training Language Models to Follow Instructions with Human Feedback"* (Ouyang et al., 2022), [arXiv:2203.02155](https://arxiv.org/abs/2203.02155)
InstructGPT established SFT as the foundation of the alignment pipeline. Their SFT model was trained on ~13K human-written demonstrations.
**Key details:** 16 epochs, cosine LR decay, residual dropout 0.2. They found that SFT models overfit on validation loss after 1 epoch, but training for more epochs improved the reward model score, so they selected checkpoints using the RM score rather than validation loss.
**Key result:** Even 1.3B InstructGPT was preferred over 175B GPT-3 by human evaluators.
#### LIMA (2023) – Less Is More
**Paper:** *"LIMA: Less Is More for Alignment"* (Zhou et al., 2023), [arXiv:2305.11206](https://arxiv.org/abs/2305.11206)
The most provocative SFT paper: fine-tuning LLaMA-65B on just **1,000 carefully curated examples** produced a model competitive with GPT-3.5 (DaVinci003) in human evaluations.
**Key result:** 1,000 high-quality examples ≈ 52,000 mediocre examples.
**Recipe:** AdamW, lr 1e-5 → 1e-6 linear decay, 15 epochs, batch size 32, max length 2048. Residual dropout linearly scaled from 0.0 (bottom layer) to 0.3 (top layer).
**The takeaway:** For SFT, **data quality >> data quantity**. A small number of consistently styled, high-quality demonstrations is better than a large, noisy dataset.
### 2.5 SFT in Practice with TRL
```python
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load a chat dataset (must have "messages" column)
dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="./sft-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    max_length=2048,              # Called `max_seq_length` in older TRL releases
    gradient_checkpointing=True,  # Save memory
    bf16=True,                    # Use bfloat16 precision
    logging_steps=10,
    push_to_hub=True,
    hub_model_id="your-username/your-sft-model",
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",  # Base model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```
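The `SFTTrainer`:
- Detects the `messages` column and applies the model's chat template automatically
- Handles tokenization and padding
- Can mask prompt tokens from the loss so that training uses only assistant responses. For conversational `messages` data this may need to be enabled explicitly (recent TRL releases expose an `assistant_only_loss` option), so check the behavior of your installed version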
---
## Chapter 3: Reinforcement Learning from Human Feedback (RLHF) – The Breakthrough
### 3.1 Why SFT Isn't Enough
SFT teaches format and basic behavior, but it has limitations:
- **It only learns from demonstrations:** the model can only be as good as its training examples
- **It can't express preferences:** every token of a demonstration is treated as equally correct, so the model never learns that one response is better than another
- **It can learn bad habits:** if the training data contains subtle errors, the model learns those too
RLHF addresses this by training the model based on **which outputs are better**, not on what specific tokens to generate.
### 3.2 The RLHF Pipeline (Step by Step)
#### Step 1: Train a Reward Model
A **reward model (RM)** takes a prompt and a response, and outputs a scalar score indicating how good the response is.
**How it's trained:**
1. Generate multiple responses to the same prompt using the SFT model
2. Have humans rank these responses (e.g., Response A > Response B)
3. Train the RM to predict these rankings
The RM uses the **Bradley-Terry model** of preferences:
```
P(response_A is preferred over response_B) = σ(r(A) - r(B))
```
where `σ` is the sigmoid function and `r(·)` is the reward model's score. The loss function is:
```
L_RM = -E[log σ(r(x, y_chosen) - r(x, y_rejected))]
```
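
The Bradley-Terry probability and the pairwise loss are short enough to evaluate by hand. The following sketch uses made-up scalar rewards in place of a real reward model's outputs; the function names are illustrative, not from any library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def preference_probability(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over response B."""
    return sigmoid(r_a - r_b)


def reward_model_loss(pairs):
    """Mean of -log sigmoid(r_chosen - r_rejected) over (r_chosen, r_rejected) pairs."""
    return sum(-math.log(sigmoid(rc - rr)) for rc, rr in pairs) / len(pairs)


p_equal = preference_probability(1.0, 1.0)   # equal rewards: no preference
loss_good = reward_model_loss([(2.0, -1.0)])  # RM ranks the pair correctly
loss_bad = reward_model_loss([(-1.0, 2.0)])   # RM ranks the pair the wrong way
```

Note that only the difference `r(A) - r(B)` matters: adding the same constant to every reward leaves both the probability and the loss unchanged.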
**Architecture:** The reward model is typically the same architecture as the language model, but with the output head replaced by a linear layer that projects to a single scalar value.
**InstructGPT details:** They trained a 6B reward model rather than a 175B one, because training the larger RM was unstable. The RM dataset contained 33K prompts with human rankings.
#### Step 2: Optimize the Policy with PPO
Now we use the reward model to improve our language model (the "policy" in RL terminology).
**The objective:**
```
maximize E[RM(prompt, response)] - β · KL(π_θ || π_ref)
```
In plain English: **generate responses that score high on the reward model, but don't deviate too far from the original SFT model.**
The KL divergence penalty (`β · KL(π_θ || π_ref)`) is crucial. Without it, the model quickly learns to exploit the reward model, for example by generating gibberish that tricks the RM into giving high scores. This phenomenon is called **reward hacking**.
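
A rough way to picture the penalty is as a deduction from the reward model's score. The sketch below works at the level of a single sampled response and estimates the KL term from that sample as the sum of per-token log-probability differences; this is one common way to realize the objective, stated here as an assumption rather than as InstructGPT's exact implementation. All numbers are made up, and `kl_penalized_reward` is a hypothetical name.

```python
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    """RM score minus beta times a single-sample KL estimate for one response.

    The estimate is the sum over tokens of log pi_theta - log pi_ref, evaluated
    on tokens that were sampled from the current policy.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# Two responses with the same RM score; the second has drifted much further
# from the reference (SFT) model and is penalized more.
close = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.1, -2.1])
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], [-3.0, -4.0])
```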
**PPO (Proximal Policy Optimization)** is the RL algorithm used to optimize this objective. Here's the intuition:
1. **Generate:** The current model generates responses to a batch of prompts
2. **Score:** The reward model scores each response
3. **Compute advantage:** Calculate how much better each response is compared to the expected value
4. **Update:** Adjust model weights to make high-advantage responses more likely
5. **Clip:** Prevent too-large updates (the "proximal" part) for stability
```
L_PPO = -E[min(r_t(θ) · A_t, clip(r_t(θ), 1-ε, 1+ε) · A_t)]
```
where `r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t)` is the probability ratio and `A_t` is the advantage.
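
The effect of clipping is easiest to see on single tokens. This is a minimal sketch of the clipped surrogate only (no value function, no advantage estimation), with made-up log-probabilities and advantages; `eps=0.2` is an illustrative choice, not a value taken from the text.

```python
import math


def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Negative clipped surrogate objective, averaged over tokens."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)                # r_t(theta)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(r_t, 1-eps, 1+eps)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


# The probability ratio is 1.5 in both cases.
# Positive advantage: the gain is capped at (1 + eps) * A, so there is no
# incentive to push the ratio beyond 1.2.
loss_pos = ppo_clip_loss([math.log(1.5)], [0.0], [1.0])
# Negative advantage: the min keeps the unclipped (worse) term, so the
# penalty for raising the probability of a bad action is not capped.
loss_neg = ppo_clip_loss([math.log(1.5)], [0.0], [-1.0])
```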
**InstructGPT training details:**
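- PPO with β = 0.02 for the KL penalty
- The RL policy was initialized from an SFT model whose fine-tuning mixed in 10% pretraining data; the PPO-ptx variant additionally mixes pretraining gradients into the PPO updates to prevent regression on general capabilities
- Learning rates scanned from 2.55e-6 to 2.55e-5 for the 1.3B and 6B models (rates > 8.05e-6 diverged)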
- 256K PPO episodes total
### 3.3 The Alignment Tax
RLHF improves alignment but can **hurt** performance on standard NLP benchmarks. This cost is the "alignment tax." InstructGPT mitigated it by mixing pretraining data into PPO training (the `PPO-ptx` variant).
### 3.4 Why RLHF is Hard
RLHF works, but it has significant practical challenges:
1. **Complexity:** Four separate models are needed (policy, reference policy, reward model, value model), all in memory simultaneously
2. **Instability:** PPO training is notoriously sensitive to hyperparameters
3. **Reward hacking:** The model can learn to exploit the RM rather than genuinely improve
4. **Cost:** Human preference data is expensive to collect
5. **Reproducibility:** Small changes in setup can lead to very different outcomes
These challenges directly motivated the development of DPO.
### 3.5 Constitutional AI (RLAIF)
**Paper:** *"Constitutional AI: Harmlessness from AI Feedback"* (Bai et al., 2022), [arXiv:2212.08073](https://arxiv.org/abs/2212.08073)
Anthropic's key insight: you can replace human feedback with **AI feedback** (RLAIF, RL from AI Feedback). Instead of humans ranking responses, an AI system evaluates responses against a set of principles (the "constitution"). In the paper this is applied to harmlessness labels.
This reduces the cost of collecting feedback and makes the feedback process easier to scale.
---
## Chapter 4: Direct Preference Optimization (DPO) – RLHF Without RL
### 4.1 The Key Insight
**Paper:** *"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"* (Rafailov et al., 2023), [arXiv:2305.18290](https://arxiv.org/abs/2305.18290)
DPO's central insight is beautiful in its simplicity: **you don't need a separate reward model or RL training loop.** The language model itself implicitly represents a reward model.
The authors showed that the optimal solution to the RLHF objective (maximize reward while staying close to the reference model) can be expressed in closed form:
```
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x,y))
```
Rearranging this to express the reward in terms of the optimal policy:
```
r(x,y) = β · log(π*(y|x) / π_ref(y|x)) + β · log Z(x)
```
Now substitute a trainable policy `π_θ` for `π*`. Since the Bradley-Terry preference model only depends on the **difference** in rewards between two responses to the same prompt, the partition function `Z(x)` cancels out. This gives us the DPO loss:
```
L_DPO = -E[log σ(β · log(π_θ(y_w|x)/π_ref(y_w|x)) - β · log(π_θ(y_l|x)/π_ref(y_l|x)))]
```
where `y_w` is the preferred ("winning") response and `y_l` is the rejected ("losing") response.
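
The cancellation of `Z(x)` can be checked numerically. The sketch below computes the DPO loss for one preference pair from sequence-level log-probabilities (made-up numbers) and adds an arbitrary per-prompt `log_z` term to both implicit rewards to show that the loss does not change. The helper names are illustrative, not a library API.

```python
import math


def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def implicit_reward(policy_logp, ref_logp, beta, log_z=0.0):
    """beta * log(pi_theta / pi_ref) + beta * log Z(x), from sequence log-probs."""
    return beta * (policy_logp - ref_logp) + beta * log_z


def dpo_loss(pi_w, ref_w, pi_l, ref_l, beta=0.1, log_z=0.0):
    """DPO loss for one (chosen, rejected) pair sharing the same prompt."""
    r_w = implicit_reward(pi_w, ref_w, beta, log_z)
    r_l = implicit_reward(pi_l, ref_l, beta, log_z)
    return neg_log_sigmoid(r_w - r_l)


# The policy has moved toward the chosen response and away from the rejected one.
base = dpo_loss(pi_w=-9.0, ref_w=-10.0, pi_l=-13.0, ref_l=-12.0)
# Any per-prompt constant cancels in the difference r_w - r_l.
shifted = dpo_loss(pi_w=-9.0, ref_w=-10.0, pi_l=-13.0, ref_l=-12.0, log_z=3.7)
# At initialization (policy == reference) the loss is log 2.
at_init = dpo_loss(pi_w=-10.0, ref_w=-10.0, pi_l=-12.0, ref_l=-12.0)
```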
### 4.2 Why DPO is a Big Deal
| Aspect | RLHF (PPO) | DPO |
|--------|-------------|-----|
| Models in memory | 4 (policy, reference, reward, value) | 2 (policy, reference) |
| Training loop | Complex RL loop with generation | Simple supervised training |
| Hyperparameters | Many (PPO-specific: clip, value coef, etc.) | Few (mainly β) |
| Stability | Often unstable | Typically more stable |
| Sampling during training | Required | Not required |
| Performance | Strong | Comparable or better in the DPO paper's experiments |
### 4.3 Understanding the DPO Gradient
The gradient of the DPO loss has a beautiful interpretation:
```
∇L_DPO ∝ -β · [σ(r̂(x,y_l) - r̂(x,y_w))] · [∇log π(y_w|x) - ∇log π(y_l|x)]
```
Here `r̂(x,y) = β · log(π_θ(y|x)/π_ref(y|x))` is the implicit reward. In English:
- **Increase** the likelihood of the preferred response `y_w`
- **Decrease** the likelihood of the rejected response `y_l`
- **Weight** these updates by how "wrong" the model currently is (if the model already prefers `y_w`, the gradient is small)
The weighting term `σ(r̂(x,y_l) - r̂(x,y_w))` is crucial: the DPO authors report that a naive "increase chosen, decrease rejected" objective without this weighting causes the model to degenerate.
### 4.4 DPO in Practice
**Data format:** DPO needs preference pairs: for each prompt, a "chosen" (preferred) and a "rejected" response.
```python
{
    "prompt": [{"role": "user", "content": "Explain gravity"}],
    "chosen": [{"role": "assistant", "content": "Gravity is a fundamental force..."}],
    "rejected": [{"role": "assistant", "content": "Gravity is when things fall down."}]
}
```
**The DPO recipe:**
1. Start with an SFT model (this becomes π_ref)
2. Prepare preference dataset (prompt + chosen + rejected)
3. Train with the DPO loss
```python
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="./dpo-output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,  # DPO uses very low learning rates
    beta=0.1,            # KL penalty strength
    logging_steps=10,
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/your-dpo-model",
)

trainer = DPOTrainer(
    model="your-sft-model",  # The SFT model to improve
    args=config,
    train_dataset=dataset,
)
trainer.train()
```
### 4.5 DPO Hyperparameters
- **β (beta):** Controls the strength of the KL constraint. Higher β = stay closer to reference model. Typical range: 0.01 to 0.5. Default in TRL: 0.1.
- **Learning rate:** Much lower than SFT, typically 1e-7 to 5e-6. DPO is sensitive to learning rate.
- **Epochs:** Usually 1-3. Overfitting is common with more epochs.
---
## Chapter 5: The Preference Optimization Zoo
After DPO, researchers developed many variants addressing different limitations. Here's a guide to the most important ones.
### 5.1 IPO – Identity Preference Optimization
**Paper:** *"A General Theoretical Paradigm to Understand Learning from Human Feedback"* (Azar et al., 2023)
**Problem with DPO:** DPO can overfit to the preference data, especially when the Bradley-Terry assumption doesn't hold perfectly.
**Solution:** IPO replaces the log-sigmoid loss with a squared loss that pulls the log-ratio margin toward a fixed target `1/(2β)`. This acts as a regularizer that prevents overfitting without assuming the Bradley-Terry model:
```
L_IPO = E[(log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x)) - 1/(2β))²]
```
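
Evaluating the IPO loss on a few margins shows how it differs from DPO: the loss is zero when the log-ratio margin hits the target `1/(2β)` and grows again if the policy overshoots. A minimal sketch with made-up sequence log-probabilities; `ipo_loss` is an illustrative helper that follows the formula above, not a library function.

```python
def ipo_loss(pi_w, ref_w, pi_l, ref_l, beta=0.1):
    """Squared distance between the log-ratio margin and the target 1/(2*beta)."""
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return (margin - 1.0 / (2.0 * beta)) ** 2


at_init = ipo_loss(-10.0, -10.0, -12.0, -12.0)    # margin 0, target 5
on_target = ipo_loss(-5.0, -10.0, -12.0, -12.0)   # margin 5: loss is zero
overshoot = ipo_loss(-2.0, -10.0, -14.0, -12.0)   # margin 10: penalized again
```

With DPO, pushing the margin from 5 to 10 would still lower the loss slightly; here it raises the loss, which is the mechanism that limits overfitting.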
**When to use:** When you suspect your preference data is noisy or when DPO is overfitting.
### 5.2 KTO – Kahneman-Tversky Optimization
**Paper:** *"KTO: Model Alignment as Prospect Theoretic Optimization"* (Ethayarajh et al., 2024), [arXiv:2402.01306](https://arxiv.org/abs/2402.01306)
**Problem with DPO:** DPO requires *paired* preferences (chosen AND rejected for the same prompt). This is expensive to collect. In reality, it's much easier to get binary feedback: "this response is good" or "this response is bad."
**Solution:** KTO works with **unpaired preferences**: you only need individual responses labeled as good or bad, not pairs. It's based on Kahneman and Tversky's prospect theory from behavioral economics: humans feel losses more strongly than equivalent gains.
**Data format:**
```python
{"prompt": "...", "completion": "...", "label": True} # Good response
{"prompt": "...", "completion": "...", "label": False} # Bad response
```
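
If you already have paired preference data, it can be flattened into this unpaired format by treating each chosen response as a positive example and each rejected response as a negative one. The helper below is a hypothetical sketch of that conversion using plain-string fields; the field names mirror the formats shown in this guide.

```python
def unpair_preferences(pairs):
    """Split each (prompt, chosen, rejected) record into two labeled examples."""
    examples = []
    for pair in pairs:
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True}
        )
        examples.append(
            {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False}
        )
    return examples


pairs = [
    {
        "prompt": "Explain gravity",
        "chosen": "Gravity is a fundamental force...",
        "rejected": "Gravity is when things fall down.",
    }
]
kto_examples = unpair_preferences(pairs)
```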
**When to use:** When you have thumbs-up/thumbs-down feedback but not pairwise comparisons.
### 5.3 ORPO – Odds Ratio Preference Optimization
**Paper:** *"ORPO: Monolithic Preference Optimization without Reference Model"* (Hong et al., 2024)
**Problem with DPO:** DPO still requires a separate SFT stage and a reference model.
**Solution:** ORPO combines SFT and preference optimization into a **single training step**. It adds a preference signal directly to the SFT loss using the odds ratio:
```
L_ORPO = L_SFT + λ · L_OR
```
where `L_OR` penalizes the model when the odds of generating the rejected response exceed those of the chosen response.
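
A small numeric sketch of the odds-ratio term, following the form given in the ORPO paper: `L_OR = -log σ(log(odds(y_w) / odds(y_l)))` with `odds(y) = p / (1 - p)` and `p` the length-normalized likelihood of the response. The inputs are made-up average log-probabilities, `lam=0.1` is an arbitrary illustrative weight, and the function names are hypothetical.

```python
import math


def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_odds(avg_logp):
    """log(p / (1 - p)) where p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(sft_nll, avg_logp_w, avg_logp_l, lam=0.1):
    """SFT loss on the chosen response plus the weighted odds-ratio penalty."""
    l_or = neg_log_sigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))
    return sft_nll + lam * l_or


no_preference = orpo_loss(1.0, -0.5, -0.5)     # equal odds: L_OR = log 2
prefers_chosen = orpo_loss(1.0, -0.2, -1.5)    # chosen more likely: small penalty
prefers_rejected = orpo_loss(1.0, -1.5, -0.2)  # rejected more likely: large penalty
```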
**When to use:** When you want a simpler pipeline without separate SFT and preference stages.
### 5.4 SimPO – Simple Preference Optimization
**Paper:** *"SimPO: Simple Preference Optimization with a Reference-Free Reward"* (Meng et al., 2024)
**Problem with DPO:** DPO needs a reference model in memory, doubling GPU requirements.
**Solution:** SimPO eliminates the reference model entirely by using the **average log probability** of a sequence as the implicit reward (instead of the total log probability). Because this reward is length-normalized, it reduces the incentive to favor longer responses.
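
The length normalization can be illustrated with two responses that have the same per-token likelihood but different lengths. The sketch follows the loss form from the SimPO paper, `-log σ(β · avg_logp(y_w) - β · avg_logp(y_l) - γ)`, where `γ` is the paper's target reward margin (not discussed above). The values `beta=2.0` and `gamma=0.0` are illustrative defaults chosen for this example, and the inputs are made up.

```python
import math


def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def simpo_loss(sum_logp_w, len_w, sum_logp_l, len_l, beta=2.0, gamma=0.0):
    """Reference-free loss using average (per-token) log-probabilities as rewards."""
    reward_w = beta * sum_logp_w / len_w
    reward_l = beta * sum_logp_l / len_l
    return neg_log_sigmoid(reward_w - reward_l - gamma)


# Same per-token log-prob (-1.0), different lengths: the rewards are equal, so
# neither response is favored merely for being shorter or longer.
same_quality = simpo_loss(-10.0, 10, -20.0, 20)
# The chosen response is genuinely more likely per token: lower loss.
better_chosen = simpo_loss(-5.0, 10, -20.0, 20)
# A positive margin gamma demands a larger gap, raising the loss.
with_margin = simpo_loss(-5.0, 10, -20.0, 20, gamma=0.5)
```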
**When to use:** When GPU memory is a constraint and you want to skip the reference model.
### 5.5 CPO – Contrastive Preference Optimization
Simplifies DPO by removing the reference model and using a contrastive loss. Similar motivation to SimPO but with a different formulation.
### 5.6 Online DPO
**Problem with standard DPO:** DPO trains on a fixed, static preference dataset (offline). But the model changes during training, so the preferences collected from the *old* model become stale.
**Solution:** Online DPO generates new completions from the *current* model during training and gets them scored by a reward model. This keeps the training data fresh and on-policy.
### 5.7 Summary Table
| Method | Needs Reference Model? | Needs Paired Data? | Needs RM? | Separate SFT? | Key Advantage |
|--------|----------------------|-------------------|-----------|---------------|---------------|
| PPO (RLHF) | Yes | No (uses RM) | **Yes** | Yes | Gold standard, online |
| DPO | Yes | **Yes** | No | Yes | Simple, stable |
| IPO | Yes | Yes | No | Yes | Robust to noise |
| KTO | Yes | **No** (binary) | No | Yes | Works with unpaired data |
| ORPO | **No** | Yes | No | **No** (combined) | Simplest pipeline |
| SimPO | **No** | Yes | No | Yes | Memory efficient |
| CPO | **No** | Yes | No | Yes | Memory efficient |
| Online DPO | Yes | Generated online | **Yes** | Yes | On-policy, fresh data |
| GRPO | Yes (soft) | No (uses rewards) | **Yes** (or functions) | Yes | Best for reasoning |
---
## Chapter 6: GRPO and the Reasoning Revolution
### 6.1 What is GRPO?
**Paper:** *"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models"* (Shao et al., 2024), [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
Group Relative Policy Optimization (GRPO) is a variant of PPO designed to be more memory-efficient and particularly effective for **reasoning tasks** (math, code, logic).
**The key idea:** Instead of training a separate value model (critic) as in PPO, GRPO estimates the "baseline" by generating **multiple completions per prompt** and using the group average reward as the baseline.
### 6.2 How GRPO Works
```
For each prompt:
  1. Generate G completions (e.g., G=16)
  2. Score each completion with a reward function
  3. Compute the advantage for each completion:
       Â_i = (r_i - mean(r)) / std(r)
  4. Update the model to increase probability of high-advantage completions
     and decrease probability of low-advantage completions
```
**The GRPO loss:**
```
L_GRPO = -E[min(ratio · Â, clip(ratio, 1-ε, 1+ε) · Â)] + β · KL(π_θ || π_ref)
```
where `ratio = π_θ(o_{i,t}) / π_old(o_{i,t})` is the importance sampling ratio.
**Why "Group Relative"?** The advantage is computed *relative to the group* of completions for the same prompt. A completion is "good" if it scores above the group average, and "bad" if below. This is why the method has that name.
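
The group-relative advantage is simple enough to compute directly. This sketch normalizes made-up rewards within one group of completions; the small `eps` guarding against a zero standard deviation and the use of the population standard deviation are assumptions of this example, and implementations may differ on both.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# If every completion gets the same reward, all advantages are zero and the
# prompt contributes no learning signal for this step.
no_signal = group_advantages([1.0, 1.0, 1.0])
```

The second case is worth keeping in mind when choosing prompts: problems the model always solves or always fails produce no gradient under this scheme.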
### 6.3 Why GRPO Matters: The DeepSeek-R1 Story
GRPO became famous when DeepSeek used it to train **DeepSeek-R1**. Its precursor, DeepSeek-R1-Zero, learned to "think" through chain-of-thought reasoning *purely through RL*, without being taught specific reasoning patterns.
**Paper:** *"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"* (DeepSeek-AI, 2025), [arXiv:2501.12948](https://arxiv.org/abs/2501.12948)
The key discovery: with the right reward function (accuracy on math/coding problems) and GRPO training, the model **spontaneously develops** chain-of-thought reasoning, self-verification, and error correction without being explicitly trained to do so.
This opened the "reasoning era" of LLM training, where RL-based methods are used to incentivize complex reasoning behaviors.
### 6.4 GRPO in Practice
GRPO requires:
- A prompt dataset (just prompts, no responses needed)
- A reward function (can be a model or a simple Python function)
```python
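from trl import GRPOTrainer, GRPOConfig
from datasets import load_dataset
import re

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

# Custom reward function: checks if the boxed answer matches the reference.
# Extra keyword arguments are filled from dataset columns of the same name;
# `solution` is assumed here, so check dataset.column_names for your data.
def accuracy_reward(completions, solution, **kwargs):
    # Conversational datasets yield lists of message dicts; plain-text ones yield strings
    texts = [c if isinstance(c, str) else c[0]["content"] for c in completions]
    matches = [re.search(r"\\boxed\{(.*?)\}", t) for t in texts]
    contents = [m.group(1) if m else "" for m in matches]
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, solution)]

config = GRPOConfig(
    output_dir="./grpo-output",
    learning_rate=1e-6,
    per_device_train_batch_size=4,
    # G: number of completions per prompt. TRL checks that the effective batch is
    # divisible by G; raise the batch size or gradient accumulation if that check fails.
    num_generations=16,
    max_completion_length=512,
    logging_steps=10,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()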
```
### 6.5 Reward Functions vs Reward Models
GRPO is flexible. The reward can come from:
1. **A Python function** (rule-based): Check if math answer is correct, if code passes tests, if format is right
2. **A reward model** (learned): A separate neural network that scores responses
3. **Multiple reward functions** combined: e.g., accuracy_reward + format_reward
For math/coding, rule-based rewards are often better because they provide an **exact signal**: the answer is either right or wrong. For open-ended tasks (chat, creative writing), where no such check exists, a learned reward model is usually needed.
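As a rough illustration of combining rule-based rewards, the sketch below adds a weighted format check to an exact-match check. All function names and the weights are hypothetical, and the functions operate on plain strings; in TRL you would more typically pass a list of reward functions to `reward_funcs` and let the trainer combine them.

```python
import re

def format_reward(completions, **kwargs):
    """Hypothetical format check: 1.0 if the text contains a boxed answer, else 0.0."""
    return [1.0 if re.search(r"\\boxed\{.*?\}", c) else 0.0 for c in completions]

def exact_answer_reward(completions, answers, **kwargs):
    """Hypothetical accuracy check: 1.0 if the boxed content equals the reference answer."""
    rewards = []
    for text, answer in zip(completions, answers):
        match = re.search(r"\\boxed\{(.*?)\}", text)
        rewards.append(1.0 if match and match.group(1) == answer else 0.0)
    return rewards

def combined_reward(completions, answers, weights=(1.0, 0.5)):
    """Weighted sum of accuracy and format rewards (the weights are an assumption)."""
    acc = exact_answer_reward(completions, answers)
    fmt = format_reward(completions)
    return [weights[0] * a + weights[1] * f for a, f in zip(acc, fmt)]

completions = ["The answer is \\boxed{42}", "I think 42", "\\boxed{7}"]
answers = ["42", "42", "42"]
print(combined_reward(completions, answers))  # [1.5, 0.0, 0.5]
```

The third completion shows why the format term is useful: a wrong but well-formatted answer still earns partial credit, which helps the model learn the output convention early in training.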
---
## Chapter 7: Parameter-Efficient Fine-Tuning (PEFT): LoRA, QLoRA, and Adapters
### 7.1 The Memory Problem
Fine-tuning a 7B parameter model requires:
- **Model weights:** 7B × 2 bytes (bf16) = 14 GB
- **Gradients:** 14 GB
- **Optimizer states (AdamW):** 28 GB (2 states × 14 GB, assuming bf16 states; fp32 states would double this to 56 GB)
- **Activations:** Variable, often 10-30 GB
**Total: roughly 65-85 GB** for a single 7B model. That fills an 80 GB A100 GPU just for SFT. For RLHF with PPO (4 models), you'd need up to 4× this.
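The arithmetic above can be reproduced with a small helper. This is a back-of-the-envelope sketch: the function name is made up for the example, it assumes weights, gradients, and both AdamW states are all stored at the same precision, and it leaves activations out because they depend on batch size and sequence length.

```python
def full_finetune_static_memory_gb(n_params, bytes_per_param=2, optimizer_states=2):
    """Rough static memory for full fine-tuning, excluding activations."""
    weights = n_params * bytes_per_param / 1e9   # model weights
    gradients = weights                          # one gradient per weight
    optimizer = optimizer_states * weights       # AdamW keeps two states per weight
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "static_total": weights + gradients + optimizer,
    }

estimate = full_finetune_static_memory_gb(7e9)
print(estimate)  # 14 GB weights + 14 GB gradients + 28 GB optimizer = 56 GB before activations
```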
### 7.2 LoRA: Low-Rank Adaptation
**Paper:** *"LoRA: Low-Rank Adaptation of Large Language Models"* (Hu et al., 2021), [arXiv:2106.09685](https://arxiv.org/abs/2106.09685)
**The insight:** When fine-tuning, the weight updates have low rank, so they can be approximated by small matrices without much loss.
Instead of updating the full weight matrix W (d × d), LoRA adds two small matrices:
```
W' = W + (α / r) · B × A
where:
  W is the original frozen weight (d × d)
  A is a small matrix (r × d): the "down projection"
  B is a small matrix (d × r): the "up projection"
  r << d (typically r = 8, 16, 32): the "rank"
  α is a scaling factor (the update is scaled by α / r)
```
Only A and B are trained; the original weights are **frozen**. This reduces trainable parameters by 100-1000×.
**Example:** For a 4096 × 4096 weight matrix:
- Full fine-tuning: 16.7M parameters
- LoRA with r=16: 2 × 4096 × 16 = 131K parameters (128× fewer!)
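The same count can be checked for any layer shape and rank. The helper name below is made up for this example; it simply counts the entries of the two low-rank factors, one of shape r × d_in and one of shape d_out × r.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one LoRA adapter: two factors of rank r."""
    return r * d_in + d_out * r

d = 4096
full = d * d                        # parameters updated by full fine-tuning
lora = lora_param_count(d, d, r=16)
print(full, lora, full // lora)     # 16777216 131072 128
```

Because the adapter size grows linearly in r while the full matrix grows with d², the savings get larger for bigger layers at a fixed rank.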
### 7.3 QLoRA: Quantized LoRA
**Paper:** *"QLoRA: Efficient Finetuning of Quantized Language Models"* (Dettmers et al., 2023), [arXiv:2305.14314](https://arxiv.org/abs/2305.14314)
QLoRA goes further: it quantizes the frozen base model to 4-bit precision, then adds LoRA adapters on top.
- **Base model:** 4-bit quantized (NF4 format), so a 7B model fits in ~4 GB
- **LoRA adapters:** Trained in bf16/fp16
- **Gradient computation:** Done in bf16/fp16
This allows fine-tuning a 7B model on a single consumer GPU (e.g., RTX 4090 with 24 GB).
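The "~4 GB" figure follows from simple arithmetic on bits per parameter. The helper below is a hypothetical sketch that ignores quantization constants, the adapters, and activation memory, so real usage will be somewhat higher.

```python
def quantized_weight_gb(n_params, bits_per_param):
    """Approximate weight storage in GB, ignoring quantization overhead."""
    return n_params * bits_per_param / 8 / 1e9

print(quantized_weight_gb(7e9, 16))  # bf16 baseline: 14.0
print(quantized_weight_gb(7e9, 4))   # 4-bit base model: 3.5
```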
### 7.4 Using LoRA with TRL
```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Any SFT-formatted dataset works; Capybara is used here as an example
dataset = load_dataset("trl-lib/Capybara", split="train")

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,                # Rank
    lora_alpha=32,       # Scaling factor (usually 2×r)
    lora_dropout=0.05,   # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

config = SFTConfig(
    output_dir="./sft-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,  # LoRA typically uses higher LR than full fine-tuning
    bf16=True,
    gradient_checkpointing=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",
    args=config,
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config here
)
trainer.train()
```
### 7.5 When to Use LoRA vs Full Fine-Tuning
| Scenario | Recommendation |
|----------|---------------|
| Limited GPU memory | LoRA / QLoRA |
| Quick experiment / prototype | LoRA |
| Maximum quality, sufficient compute | Full fine-tuning |
| Multiple task-specific models from same base | LoRA (swap adapters) |
| Very small dataset | LoRA (acts as regularizer) |
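**Key trade-off:** LoRA often comes close to full fine-tuning quality on many tasks at a fraction of the memory and compute, though the gap depends on the task, the rank, and the amount of data. When maximum quality matters and compute is available (e.g., training a production model), full fine-tuning remains the safer choice.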
---
## Chapter 8: The Toolbox: Libraries, Frameworks, and Infrastructure
### 8.1 TRL (Transformers Reinforcement Learning)
**Repository:** [github.com/huggingface/trl](https://github.com/huggingface/trl)
**Documentation:** [huggingface.co/docs/trl](https://huggingface.co/docs/trl)
TRL is the central library for post-training. It provides trainers for every major method:
| Trainer | Method | Config Class | Dataset Type |
|---------|--------|-------------|--------------|
| `SFTTrainer` | Supervised Fine-Tuning | `SFTConfig` | Language modeling or Prompt-completion |
| `DPOTrainer` | Direct Preference Optimization | `DPOConfig` | Preference (prompt + chosen + rejected) |
| `GRPOTrainer` | Group Relative Policy Optimization | `GRPOConfig` | Prompt-only |
| `RLOOTrainer` | REINFORCE Leave-One-Out | `RLOOConfig` | Prompt-only |
| `RewardTrainer` | Reward Model Training | `RewardConfig` | Preference |
| `KTOTrainer` | Kahneman-Tversky Optimization | `KTOConfig` | Unpaired preference |
| `ORPOTrainer` | Odds Ratio Preference Optimization | `ORPOConfig` | Preference |
| `CPOTrainer` | Contrastive Preference Optimization | `CPOConfig` | Preference |
| `OnlineDPOTrainer` | Online DPO | `OnlineDPOConfig` | Prompt-only |
| `PPOTrainer` | Proximal Policy Optimization | `PPOConfig` | Tokenized language modeling |
| `XPOTrainer` | Exploratory Preference Optimization | `XPOConfig` | Prompt-only |
| `NashMDTrainer` | Nash Mirror Descent | `NashMDConfig` | Prompt-only |
| `PRMTrainer` | Process Reward Model | `PRMConfig` | Stepwise supervision |
**Key features:**
- Integrates seamlessly with Hugging Face `transformers` and `datasets`
- Built-in PEFT/LoRA support via `peft_config` argument
- vLLM integration for fast generation in online methods
- DeepSpeed ZeRO for distributed training
- Supports both standard and conversational dataset formats
### 8.2 Transformers
**Repository:** [github.com/huggingface/transformers](https://github.com/huggingface/transformers)
The foundation library. You'll use it for:
- `AutoModelForCausalLM`: Loading language models
- `AutoTokenizer`: Tokenization and chat templates
- `TrainingArguments`: Base training configuration
- `Trainer`: Base trainer class (TRL trainers inherit from this)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,  # load weights in bf16
    device_map="auto",           # place layers on available devices automatically
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
```
### 8.3 PEFT (Parameter-Efficient Fine-Tuning)
**Repository:** [github.com/huggingface/peft](https://github.com/huggingface/peft)
Provides LoRA, QLoRA, and other adapter methods. Key classes:
- `LoraConfig`: Configure LoRA adapters
- `get_peft_model()`: Wrap a model with adapters
- `PeftModel.from_pretrained()`: Load saved adapters
### 8.4 Accelerate
**Repository:** [github.com/huggingface/accelerate](https://github.com/huggingface/accelerate)
Handles distributed training across multiple GPUs/nodes. You rarely interact with it directly; it works behind the scenes when you use `accelerate launch`:
```bash
# Single GPU
python train.py
# Multi-GPU
accelerate launch --num_processes 4 train.py
# Multi-GPU with DeepSpeed
accelerate launch --config_file deepspeed_zero3.yaml train.py
```
### 8.5 Datasets
**Repository:** [github.com/huggingface/datasets](https://github.com/huggingface/datasets)
Efficient dataset loading and processing:
```python
from datasets import load_dataset

# Load from Hub
dataset = load_dataset("trl-lib/Capybara", split="train")

# Streaming (for huge datasets): returns an IterableDataset, which cannot be indexed
streamed = load_dataset("trl-lib/Capybara", split="train", streaming=True)

# Inspect
print(dataset.column_names)   # ['messages']
print(dataset[0])             # First example
print(next(iter(streamed)))   # First example from the stream
```
### 8.6 vLLM
**Repository:** [github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
High-throughput inference engine. Critical for:
- **Online methods (GRPO, RLOO, Online DPO):** Speeds up generation during training by 5-10×
- **Inference serving:** Deploy models for production use
TRL integrates vLLM directly:
```python
from trl import GRPOConfig

config = GRPOConfig(
    use_vllm=True,         # Enable vLLM for generation
    vllm_mode="colocate",  # Run on same GPUs as training
)
```
### 8.7 Other Important Tools
| Tool | Purpose | Link |
|------|---------|------|
| **Unsloth** | 2-5× faster LoRA training, lower memory | [github.com/unslothai/unsloth](https://github.com/unslothai/unsloth) |
| **bitsandbytes** | 4/8-bit quantization for QLoRA | [github.com/bitsandbytes-foundation/bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) |
| **Flash Attention** | Memory-efficient attention | [github.com/Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention) |
| **DeepSpeed** | Distributed training (ZeRO) | [github.com/microsoft/DeepSpeed](https://github.com/microsoft/DeepSpeed) |
| **Weights & Biases** | Experiment tracking | [wandb.ai](https://wandb.ai) |
| **Trackio** | HF-native experiment tracking | [HF Docs](https://huggingface.co/docs/trackio) |
| **LM Eval Harness** | Standardized LLM evaluation | [github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) |
---
## Chapter 9: Datasets: What to Train On
### 9.1 SFT Datasets
| Dataset | Size | Description | Link |
|---------|------|-------------|------|
| **trl-lib/Capybara** | ~90K msgs | High-quality multi-turn conversations | [Hub](https://huggingface.co/datasets/trl-lib/Capybara) |
| **HuggingFaceH4/ultrachat_200k** | 200K | Diverse multi-turn conversations | [Hub](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) |
| **allenai/tulu-3-sft-mixture** | ~1.3M | Large-scale SFT mixture from AI2 | [Hub](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) |
| **OpenAssistant/oasst1** | 161K msgs | Crowdsourced conversation trees | [Hub](https://huggingface.co/datasets/OpenAssistant/oasst1) |
| **tatsu-lab/alpaca** | 52K | GPT-generated instruction data | [Hub](https://huggingface.co/datasets/tatsu-lab/alpaca) |
| **teknium/OpenHermes-2.5** | 1M | Large synthetic instruction dataset | [Hub](https://huggingface.co/datasets/teknium/OpenHermes-2.5) |
### 9.2 Preference Datasets (for DPO/KTO/ORPO)
| Dataset | Size | Description | Link |
|---------|------|-------------|------|
| **trl-lib/ultrafeedback_binarized** | 60K | Binarized UltraFeedback preferences | [Hub](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized) |
| **Anthropic/hh-rlhf** | 170K | Human preference data (helpful + harmless) | [Hub](https://huggingface.co/datasets/Anthropic/hh-rlhf) |
| **argilla/ultrafeedback-binarized-preferences** | 60K | Cleaned UltraFeedback | [Hub](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences) |
### 9.3 Prompt-Only Datasets (for GRPO/RLOO)
| Dataset | Size | Description | Link |
|---------|------|-------------|------|
| **trl-lib/DeepMath-103K** | 103K | Math problems with verifiable answers | [Hub](https://huggingface.co/datasets/trl-lib/DeepMath-103K) |
| **AI-MO/NuminaMath-TIR** | ~70K | Math competition problems | [Hub](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) |
### 9.4 How to Choose a Dataset
1. **For your first experiment:** Use `trl-lib/Capybara` (SFT) or `trl-lib/ultrafeedback_binarized` (DPO). They're well-formatted and TRL-compatible out of the box.
2. **Quality over quantity:** LIMA showed that 1K great examples can beat 52K mediocre ones. Invest in data curation.
3. **Match your use case:** If training a math model, use math-specific data. If training a general assistant, use diverse conversational data.
4. **Always inspect before training:**
```python
from datasets import load_dataset
ds = load_dataset("trl-lib/Capybara", split="train")
print(ds[0]) # Look at the data!
```
---
## Chapter 10: Evaluation: How to Know If It Worked
### 10.1 The Evaluation Problem
Evaluating LLMs is fundamentally hard because:
- **Open-ended outputs** can be correct in many different ways
- **Perplexity** doesn't correlate well with usefulness (LIMA found this explicitly)
- **Benchmark scores** don't always reflect real-world performance
- **Human evaluation** is expensive and subjective
### 10.2 Automated Benchmarks
| Benchmark | What It Measures | How It Works |
|-----------|-----------------|-------------|
| **MMLU** | Knowledge across 57 subjects | Multiple-choice questions |
| **HellaSwag** | Commonsense reasoning | Sentence completion |
| **ARC** | Science reasoning | Multiple-choice science questions |
| **TruthfulQA** | Truthfulness | Questions designed to elicit false claims |
| **GSM8K** | Math reasoning | Grade-school math word problems |
| **MATH** | Advanced math | Competition-level math problems |
| **HumanEval** | Code generation | Python programming problems |
| **MBPP** | Code generation | Basic Python problems |
| **IFEval** | Instruction following | Verifiable instruction constraints |
**Tool:** [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) runs all of these:
```bash
lm_eval --model hf \
--model_args pretrained=your-model \
--tasks mmlu,gsm8k,hellaswag \
--batch_size 8
```
### 10.3 LLM-as-Judge Evaluations
| Evaluation | Description | Link |
|------------|-------------|------|
| **AlpacaEval** | GPT-4 compares model outputs to reference | [github](https://github.com/tatsu-lab/alpaca_eval) |
| **MT-Bench** | Multi-turn dialogue evaluation by GPT-4 | Part of lmsys |
| **Arena Hard** | Challenging prompts, GPT-4 judged | Part of lmsys |
### 10.4 Human Evaluation
The gold standard. Key approaches:
- **Side-by-side comparison:** Show humans two responses, ask which is better
- **Likert scale:** Rate each response on helpfulness, accuracy, harmlessness (1-7)
- **Chatbot Arena:** Users chat with two anonymous models and vote for the better one
The [LMSYS Chatbot Arena](https://lmarena.ai/) provides the most widely cited human evaluation through crowdsourced blind comparisons.
### 10.5 The Open LLM Leaderboard
Hugging Face hosts the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which evaluates open-source models across standardized benchmarks. It has been one of the main ways the community tracks progress.
---
## Chapter 11: Putting It All Together: A Complete Post-Training Recipe
### 11.1 The Standard Recipe (2024-2025)
Here's a typical post-training pipeline for building a chat model:
```
Step 1: Choose Base Model
├── Qwen3 (0.6B to 235B): Currently top-performing family
├── Llama 3.1/3.2 (1B to 405B): Meta's open models
├── Gemma 3/4 (1B to 27B): Google's open models
└── Mistral/Mixtral: Strong efficiency
Step 2: SFT
├── Dataset: trl-lib/Capybara or HuggingFaceH4/ultrachat_200k
├── Method: SFTTrainer with LoRA (for efficiency) or full fine-tuning
├── Epochs: 2-3
├── LR: 2e-5 (full) or 2e-4 (LoRA)
└── Output: SFT model (becomes reference model for Step 3)
Step 3: Preference Optimization (choose one)
├── Option A: DPO (simplest, most popular)
│   ├── Dataset: trl-lib/ultrafeedback_binarized
│   ├── β: 0.1
│   ├── LR: 5e-7
│   └── Epochs: 1-2
├── Option B: GRPO (best for reasoning tasks)
│   ├── Dataset: trl-lib/DeepMath-103K (math)
│   ├── Reward: accuracy_reward + format_reward
│   ├── num_generations: 16
│   └── LR: 1e-6
└── Option C: KTO (if you only have binary feedback)
    ├── Dataset: unpaired preference data
    └── Similar to DPO hyperparameters
Step 4: Evaluation
├── Automated: lm-eval-harness (MMLU, GSM8K, etc.)
├── LLM-Judge: MT-Bench, AlpacaEval
└── Manual: Test with real prompts
```
### 11.2 Minimal Working Example: SFT + DPO
```python
# === Stage 1: SFT ===
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

sft_dataset = load_dataset("trl-lib/Capybara", split="train")

sft_config = SFTConfig(
    output_dir="./sft-model",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    max_length=2048,  # named max_seq_length in older TRL releases
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/my-sft-model",
)

sft_trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",
    args=sft_config,
    train_dataset=sft_dataset,
)
sft_trainer.train()

# === Stage 2: DPO ===
from trl import DPOTrainer, DPOConfig

dpo_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

dpo_config = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    beta=0.1,
    bf16=True,
    gradient_checkpointing=True,
    push_to_hub=True,
    hub_model_id="your-username/my-dpo-model",
)

dpo_trainer = DPOTrainer(
    model="your-username/my-sft-model",  # SFT model from stage 1
    args=dpo_config,
    train_dataset=dpo_dataset,
)
dpo_trainer.train()
```
### 11.3 Hardware Guidelines
| Model Size | Minimum GPU | Recommended | With LoRA |
|-----------|-------------|-------------|-----------|
| 0.5-3B | 1× A10G (24 GB) | 1× A100 (80 GB) | 1× T4 (16 GB) |
| 7-8B | 1× A100 (80 GB) | 2× A100 | 1× A10G (24 GB) |
| 13B | 2× A100 | 4× A100 | 1× A100 (80 GB) |
| 70B | 4× A100 | 8× A100 | 2× A100 |
---
## Chapter 12: The Reading List: Papers Every Practitioner Should Read
### Tier 1: Must-Read (The Foundations)
1. **InstructGPT**: *"Training Language Models to Follow Instructions with Human Feedback"*
   - Ouyang et al., 2022, [arXiv:2203.02155](https://arxiv.org/abs/2203.02155)
   - Why: Established the SFT → RM → PPO pipeline. Everything starts here.
2. **DPO**: *"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"*
   - Rafailov et al., 2023, [arXiv:2305.18290](https://arxiv.org/abs/2305.18290)
   - Why: Eliminated reward model + RL. The most widely used preference optimization method.
3. **LoRA**: *"LoRA: Low-Rank Adaptation of Large Language Models"*
   - Hu et al., 2021, [arXiv:2106.09685](https://arxiv.org/abs/2106.09685)
   - Why: Made fine-tuning accessible. Practically every fine-tuning workflow uses LoRA.
4. **DeepSeek-R1**: *"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"*
   - DeepSeek-AI, 2025, [arXiv:2501.12948](https://arxiv.org/abs/2501.12948)
   - Why: Showed RL can teach reasoning from scratch. Opened the "reasoning era."
### Tier 2: Important (Deepening Understanding)
5. **LIMA**: *"LIMA: Less Is More for Alignment"*
   - Zhou et al., 2023, [arXiv:2305.11206](https://arxiv.org/abs/2305.11206)
   - Why: Superficial Alignment Hypothesis. Data quality >> quantity.
6. **Constitutional AI**: *"Constitutional AI: Harmlessness from AI Feedback"*
   - Bai et al., 2022, [arXiv:2212.08073](https://arxiv.org/abs/2212.08073)
   - Why: AI feedback replacing human feedback. Foundation for RLAIF.
7. **FLAN**: *"Finetuned Language Models Are Zero-Shot Learners"*
   - Wei et al., 2021, [arXiv:2109.01652](https://arxiv.org/abs/2109.01652)
   - Why: Proved instruction tuning works. Foundation for SFT.
8. **Self-Instruct**: *"Self-Instruct: Aligning Language Models with Self-Generated Instructions"*
   - Wang et al., 2022, [arXiv:2212.10560](https://arxiv.org/abs/2212.10560)
   - Why: Synthetic data generation. Led to Alpaca and the open-source SFT revolution.
9. **DeepSeekMath**: *"DeepSeekMath: Pushing the Limits of Mathematical Reasoning"*
   - Shao et al., 2024, [arXiv:2402.03300](https://arxiv.org/abs/2402.03300)
   - Why: Introduced GRPO. The paper that started the GRPO wave.
10. **QLoRA**: *"QLoRA: Efficient Finetuning of Quantized Language Models"*
    - Dettmers et al., 2023, [arXiv:2305.14314](https://arxiv.org/abs/2305.14314)
    - Why: Made 7B fine-tuning possible on consumer GPUs.
### Tier 3: Advanced (Cutting Edge)
11. **KTO**: *"KTO: Model Alignment as Prospect Theoretic Optimization"*
    - Ethayarajh et al., 2024, [arXiv:2402.01306](https://arxiv.org/abs/2402.01306)
12. **ORPO**: *"ORPO: Monolithic Preference Optimization without Reference Model"*
    - Hong et al., 2024, [arXiv:2403.07691](https://arxiv.org/abs/2403.07691)
13. **SimPO**: *"SimPO: Simple Preference Optimization with a Reference-Free Reward"*
    - Meng et al., 2024, [arXiv:2405.14734](https://arxiv.org/abs/2405.14734)
14. **Tulu 3**: *"Tulu 3: Pushing Frontiers in Open Language Model Post-Training"*
    - AI2, 2024. A comprehensive open-source post-training recipe
15. **Zephyr**: *"Zephyr: Direct Distillation of LM Alignment"*
    - Tunstall et al., 2023, [arXiv:2310.16944](https://arxiv.org/abs/2310.16944)
    - Why: Open-source recipe for DPO that matched much larger models.
### Tier 4: Background (RL Foundations, if you want to go deeper)
16. **PPO**: *"Proximal Policy Optimization Algorithms"*
    - Schulman et al., 2017, [arXiv:1707.06347](https://arxiv.org/abs/1707.06347)
17. **Learning to Summarize from Human Feedback**
    - Stiennon et al., 2020, [arXiv:2009.01325](https://arxiv.org/abs/2009.01325)
    - Why: An early, influential application of RLHF to LLMs (summarization).
18. **Fine-Tuning Language Models from Human Preferences**
    - Ziegler et al., 2019, [arXiv:1909.08593](https://arxiv.org/abs/1909.08593)
    - Why: The original RLHF for language models paper.
---
## Glossary
| Term | Definition |
|------|-----------|
| **Alignment** | Making a model behave according to human intentions and values |
| **RLHF** | Reinforcement Learning from Human Feedback: using human preference data to train a reward model, then optimizing the LM with RL |
| **RLAIF** | RL from AI Feedback: using an AI system instead of humans to provide feedback |
| **SFT** | Supervised Fine-Tuning: training on instruction-response pairs with standard cross-entropy loss |
| **DPO** | Direct Preference Optimization: training directly on preference pairs without a separate reward model or RL |
| **GRPO** | Group Relative Policy Optimization: RL method that normalizes rewards within a group of completions |
| **PPO** | Proximal Policy Optimization: the RL algorithm used in classical RLHF |
| **Reward Model (RM)** | A model trained to score responses based on human preferences |
| **Policy** | In RL terms, the language model being trained (maps states/prompts to actions/tokens) |
| **Reference Model (π_ref)** | The SFT model used as a baseline to prevent the policy from deviating too far |
| **KL Divergence** | A measure of how different two probability distributions are; used to keep the policy close to the reference |
| **Bradley-Terry Model** | A probabilistic model for pairwise comparisons: P(A > B) = σ(score(A) - score(B)) |
| **Reward Hacking** | When the model learns to exploit the reward model rather than genuinely improve |
| **LoRA** | Low-Rank Adaptation: parameter-efficient fine-tuning using small rank-decomposed matrices |
| **QLoRA** | Quantized LoRA: combines 4-bit quantization of the base model with LoRA adapters |
| **Chat Template** | The specific text format (special tokens, roles) a model uses for conversations |
| **On-policy** | Training on data generated by the current model (e.g., GRPO, Online DPO) |
| **Off-policy** | Training on data generated by a different model (e.g., standard DPO on static datasets) |
| **Preference Data** | Pairs of responses where one is marked as preferred over the other |
| **Advantage** | How much better a specific action is compared to the expected value |
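
The Bradley-Terry entry above is easy to check numerically. The sketch below is only an illustration with made-up scores; the function name is an assumption, and σ is the logistic sigmoid.

```python
import math

def bradley_terry_prob(score_a, score_b):
    """P(A preferred over B) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Equal scores give a coin flip; a larger score gap gives a more confident preference
print(bradley_terry_prob(1.0, 1.0))
print(bradley_terry_prob(2.0, 0.0))
```

Only the difference between the two scores matters, which is why reward models trained this way have no meaningful absolute scale.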
---
## Quick Reference: TRL Commands
```bash
# Install TRL
pip install trl
# Run SFT from command line
trl sft --model_name_or_path Qwen/Qwen3-0.6B \
--dataset_name trl-lib/Capybara \
--output_dir ./sft-output
# Run DPO from command line
trl dpo --model_name_or_path your-sft-model \
--dataset_name trl-lib/ultrafeedback_binarized \
--output_dir ./dpo-output
# Run GRPO from command line
trl grpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name trl-lib/DeepMath-103K \
--output_dir ./grpo-output
# Start vLLM server for fast inference
trl vllm-serve --model Qwen/Qwen3-0.6B
# Multi-GPU training
accelerate launch --num_processes 4 train.py
# With DeepSpeed ZeRO-3
accelerate launch --config_file deepspeed_zero3.yaml train.py
```
---
## Quick Reference: Dataset Formats by Trainer
```python
# SFT (Language modeling format)
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
# SFT (Prompt-completion format)
{"prompt": "...", "completion": "..."}
# DPO / ORPO / CPO (Preference format)
{"prompt": "...", "chosen": "...", "rejected": "..."}
# Or conversational:
{"prompt": [{"role": "user", "content": "..."}],
"chosen": [{"role": "assistant", "content": "..."}],
"rejected": [{"role": "assistant", "content": "..."}]}
# GRPO / RLOO / Online DPO (Prompt-only format)
{"prompt": "..."}
# Or conversational:
{"prompt": [{"role": "user", "content": "..."}]}
# KTO (Unpaired preference format)
{"prompt": "...", "completion": "...", "label": True}
# Reward Model (Preference format, same as DPO)
{"prompt": "...", "chosen": "...", "rejected": "..."}
# PRM (Stepwise supervision format)
{"prompt": "...", "completions": ["step1", "step2"], "labels": [True, False]}
```
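
Before training, it can help to confirm which of these formats a dataset row actually matches. The helper below is a hypothetical sketch, not part of TRL: it only inspects the keys of one example and returns a label chosen for this illustration.

```python
def detect_format(example):
    """Guess the dataset format of one example from its keys (illustrative only)."""
    keys = set(example)
    if {"prompt", "chosen", "rejected"} <= keys:
        return "preference"
    if {"prompt", "completions", "labels"} <= keys:
        return "stepwise supervision"
    if {"prompt", "completion", "label"} <= keys:
        return "unpaired preference"
    if {"prompt", "completion"} <= keys:
        return "prompt-completion"
    if "messages" in keys:
        return "language modeling (messages)"
    if "prompt" in keys:
        return "prompt-only"
    return "unknown"

print(detect_format({"prompt": "2+2?", "chosen": "4", "rejected": "5"}))   # preference
print(detect_format({"prompt": "2+2?", "completion": "4", "label": True})) # unpaired preference
print(detect_format({"prompt": "2+2?"}))                                   # prompt-only
```

The more specific key sets are checked first, so a KTO row is not mistaken for a plain prompt-completion row.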
---
## Where to Go Next
1. **Hands-on:** Try the [TRL notebooks on Google Colab](https://github.com/huggingface/trl/tree/main/examples/notebooks) β they run for free
2. **Course:** The [Hugging Face smol course](https://huggingface.co/learn/smol-course) covers post-training step by step
3. **Community:** Join the [Hugging Face Discord](https://hf.co/join/discord) and the `#trl` channel
4. **Papers:** Start with InstructGPT and DPO from the reading list, then follow your interests
5. **Experiment:** Fine-tune a small model (Qwen3-0.6B) on your own data. The best way to learn is by doing
---
*This guide was compiled from primary research papers, official Hugging Face documentation, and the TRL library source code. All paper citations link to their arXiv pages. All code examples use current API patterns from TRL v1.2+.*
*Last updated: April 2026*