Quickstart

TRL is a comprehensive library for post-training foundation models using techniques such as Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).

Quick Examples

Get started instantly with TRL's most popular trainers. Each example uses compact models for quick experimentation.

Supervised Fine-Tuning

from trl import SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),
)
trainer.train()

Group Relative Policy Optimization

from trl import GRPOTrainer
from datasets import load_dataset
from trl.rewards import accuracy_reward

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/DeepMath-103K", split="train"),
    reward_funcs=accuracy_reward,
)
trainer.train()
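
The example above passes a built-in reward, but `reward_funcs` can also take your own Python callable. The sketch below is a hypothetical reward, not part of TRL: the name `length_reward` and the `target_words` parameter are illustrative. It assumes the convention that a reward function receives the batch of `completions` plus extra keyword arguments and returns one float per completion; check the GRPO documentation for your TRL version before relying on it.

```python
def length_reward(completions, target_words=50, **kwargs):
    """Hypothetical reward: 1.0 at the target word count, falling linearly to 0.0."""
    rewards = []
    for completion in completions:
        # Assumption: plain-text datasets yield strings, while conversational
        # datasets yield a list of message dicts with a "content" field.
        text = completion if isinstance(completion, str) else completion[-1]["content"]
        n_words = len(text.split())
        # Penalize the relative distance from the target length, clipped at 0.
        rewards.append(max(0.0, 1.0 - abs(n_words - target_words) / target_words))
    return rewards


# Quick check without any model: one reward per completion.
sample_rewards = length_reward(["one two three four five"], target_words=5)
```

Such a function could then be passed as `reward_funcs=length_reward`, alone or in a list alongside other rewards.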

Direct Preference Optimization

from trl import DPOTrainer
from datasets import load_dataset

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
)
trainer.train()
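
To build intuition for what this trainer optimizes, here is a minimal sketch of the standard sigmoid DPO objective for a single preference pair. It is an illustration under stated assumptions, not TRL's implementation: the function name, the argument names, and `beta=0.1` are chosen here for the example, and the inputs are summed log-probabilities of each completion under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair of log-probabilities."""
    # How much more the policy favors each completion than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy favors the chosen completion more than the reference: lower loss.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
```

The loss falls as the policy raises the chosen completion's likelihood relative to the rejected one, measured against the reference model.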

Reward Modeling

from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
)
trainer.train()

Command Line Interface

Skip the code entirely and train directly from your terminal:

# SFT: Fine-tune on instructions
trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
    --dataset_name trl-lib/Capybara

# DPO: Align with preferences
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized

# Reward: Train a reward model
trl reward --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name trl-lib/ultrafeedback_binarized

What's Next?

📚 Learn More

🚀 Scale Up

💡 Examples

Troubleshooting

Out of Memory?

Reduce batch size and enable optimizations:

from trl import SFTConfig, DPOConfig, GRPOConfig

# Use the config that matches your trainer; the three blocks are alternatives.

# SFT
training_args = SFTConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
)

# DPO
training_args = DPOConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
)

# GRPO
training_args = GRPOConfig(
    per_device_train_batch_size=1,  # Start small
    gradient_accumulation_steps=8,  # Maintain effective batch size
    num_generations=4,              # Reduce from default 8 (GRPO generates num_generations completions per prompt)
    max_completion_length=256,      # Tune based on task; longer sequences cost more memory
)
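
The "maintain effective batch size" comments rely on simple arithmetic, sketched below. The helper name `effective_batch_size` and the `num_devices` parameter are illustrative, not TRL APIs. The GRPO part assumes that the batch size counts completions rather than prompts, which you should verify for your TRL version.

```python
def effective_batch_size(per_device_train_batch_size,
                         gradient_accumulation_steps,
                         num_devices=1):
    """Number of samples contributing to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Settings from the snippet above on a single device: 1 * 8 * 1 = 8.
single_gpu = effective_batch_size(1, 8)

# The same effective batch from a larger per-device batch and fewer
# accumulation steps, if memory allows: 4 * 2 * 1 = 8.
larger_micro_batch = effective_batch_size(4, 2)

# GRPO (assumption: batch counts completions): with num_generations=4,
# an effective batch of 8 completions covers 8 // 4 = 2 distinct prompts,
# and the batch should divide evenly into groups.
num_generations = 4
prompts_per_step = single_gpu // num_generations
divides_evenly = single_gpu % num_generations == 0
```

Lowering `per_device_train_batch_size` while raising `gradient_accumulation_steps` by the same factor keeps the product constant, which is why it reduces memory without changing the optimization batch.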

Loss not decreasing?

Try adjusting the learning rate:

from trl import SFTConfig

training_args = SFTConfig(learning_rate=2e-5)  # Good starting point

For more help, open an issue on GitHub.
