# TRL Training Methods Overview
TRL (Transformer Reinforcement Learning) provides multiple training methods for fine-tuning and aligning language models. This reference gives a brief overview of each.
## Supervised Fine-Tuning (SFT)
**What it is:** Standard instruction tuning with supervised learning on demonstration data.
**When to use:**
- Initial fine-tuning of base models on task-specific data
- Teaching new capabilities or domains
- Most common starting point for fine-tuning
**Dataset format:** Conversational format (a "messages" field), plain text (a "text" field), or prompt/completion pairs
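For orientation, a hedged sketch of one row in each of those formats (values are illustrative, not from a real dataset):
```python
# Illustrative rows only - see the Dataset Format Reference below for the full spec
conversational = {"messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]}
plain_text = {"text": "Question: What is 2 + 2?\nAnswer: 4"}
prompt_completion = {"prompt": "What is 2 + 2?", "completion": "4"}
```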
**Example:**
```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="my-model",
        push_to_hub=True,
        hub_model_id="username/my-model",
        eval_strategy="no",  # Disable eval for simple example
        # max_length=1024 is the default - only set if you need a different length
    ),
)
trainer.train()
```
**Note:** For production training with evaluation monitoring, see `scripts/train_sft_example.py`
**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer")`
## Direct Preference Optimization (DPO)
**What it is:** Alignment method that trains directly on preference pairs (chosen vs rejected responses) without requiring a reward model.
**When to use:**
- Aligning models to human preferences
- Improving response quality after SFT
- You have paired preference data (chosen/rejected responses)
**Dataset format:** Preference pairs with "chosen" and "rejected" fields
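A hedged sketch of a single preference row (values illustrative; some variants leave the prompt implicit - see the Dataset Format Reference below):
```python
# Illustrative row - "chosen" is the preferred response, "rejected" the dispreferred one
row = {
    "prompt": "Explain gravity in one sentence.",
    "chosen": "Gravity is the mutual attraction between masses.",
    "rejected": "Gravity is when things are heavy.",
}
```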
**Example:**
```python
from trl import DPOTrainer, DPOConfig

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use instruct model
    train_dataset=dataset,
    args=DPOConfig(
        output_dir="dpo-model",
        beta=0.1,  # KL penalty coefficient
        eval_strategy="no",  # Disable eval for simple example
        # max_length=1024 is the default - only set if you need a different length
    ),
)
trainer.train()
```
**Note:** For production training with evaluation monitoring, see `scripts/train_dpo_example.py`
**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`
## Group Relative Policy Optimization (GRPO)
**What it is:** Online RL method that optimizes relative to group performance, useful for tasks with verifiable rewards.
**When to use:**
- Tasks with automatic reward signals (code execution, math verification)
- Online learning scenarios
- When offline preference data for DPO is insufficient
**Dataset format:** Prompt-only format (model generates responses, reward computed online)
**Example:**
```python
# Use TRL maintained script
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/math_shepherd",
        "--output_dir", "grpo-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
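To call the trainer directly instead of the maintained script, a minimal sketch (assuming a standard prompt-only `dataset` so completions arrive as strings, and a toy length-based reward in place of a real verifier):
```python
from trl import GRPOTrainer, GRPOConfig

# Toy reward: GRPO calls this with the generated completions and expects
# one float per completion. Real rewards verify math answers, run code, etc.
def reward_len(completions, **kwargs):
    return [-abs(200 - len(c)) for c in completions]  # prefer ~200-char outputs

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    train_dataset=dataset,  # prompt-only rows, e.g. {"prompt": "..."}
    args=GRPOConfig(output_dir="grpo-model"),
)
trainer.train()
```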
**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`
## Reward Modeling
**What it is:** Train a reward model to score responses, used as a component in RLHF pipelines.
**When to use:**
- Building RLHF pipeline
- Need automatic quality scoring
- Creating reward signals for PPO training
**Dataset format:** Preference pairs with "chosen" and "rejected" responses
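**Example (hedged sketch, assuming a preference `dataset`; argument names may vary by TRL version):**
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

# A reward model is a sequence classifier with a single scalar output head
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = RewardTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,  # rows with "chosen" and "rejected" fields
    args=RewardConfig(output_dir="reward-model"),
)
trainer.train()
```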
**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/reward_trainer")`
## Method Selection Guide
| Method | Complexity | Data Required | Use Case |
|--------|-----------|---------------|----------|
| **SFT** | Low | Demonstrations | Initial fine-tuning |
| **DPO** | Medium | Paired preferences | Post-SFT alignment |
| **GRPO** | Medium | Prompts + reward fn | Online RL with automatic rewards |
| **Reward** | Medium | Paired preferences | Building RLHF pipeline |
## Recommended Pipeline
**For most use cases:**
1. **Start with SFT** - Fine-tune base model on task data
2. **Follow with DPO** - Align to preferences using paired data
3. **Optional: GGUF conversion** - Deploy for local inference
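Concretely, chaining steps 1 and 2 means pointing DPO at the SFT output (a hedged sketch; `sft_dataset` and `preference_dataset` are placeholders for your own data):
```python
from trl import SFTTrainer, SFTConfig, DPOTrainer, DPOConfig

# Step 1: supervised fine-tuning on demonstration data
sft_trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=sft_dataset,  # placeholder: your demonstration dataset
    args=SFTConfig(output_dir="sft-model"),
)
sft_trainer.train()
sft_trainer.save_model("sft-model")  # write the final checkpoint to disk

# Step 2: DPO starts from the SFT checkpoint, not the base model
dpo_trainer = DPOTrainer(
    model="sft-model",  # path to the step-1 output
    train_dataset=preference_dataset,  # placeholder: chosen/rejected pairs
    args=DPOConfig(output_dir="dpo-model"),
)
dpo_trainer.train()
```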
**For advanced RL scenarios:**
1. **Start with SFT** - Fine-tune base model
2. **Train reward model** - On preference data
3. **Run online RL** - Use the reward model as the scoring signal (e.g., for PPO training)
## Dataset Format Reference
For complete dataset format specifications, use:
```python
hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
```
Or validate your dataset:
```bash
uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
--dataset your/dataset --split train
```
## See Also
- `references/training_patterns.md` - Common training patterns and examples
- `scripts/train_sft_example.py` - Complete SFT template
- `scripts/train_dpo_example.py` - Complete DPO template
- [Dataset Inspector](https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py) - Dataset format validation tool