Examples
TRL provides notebooks for quick experimentation and scripts for production training.
- Notebooks: Most run on free Google Colab. Great for learning and prototyping.
- Scripts: Run on single GPU, multi-GPU, or with DeepSpeed. Ready for production.
Getting Started
pip install --upgrade trl[quantization]
For scripts, configure 🤗 Accelerate (recommended for multi-GPU):
accelerate config
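Before opening a notebook or launching a script, it can help to confirm that the install step took effect. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical and not part of TRL, and the distribution names checked are just the two mentioned in this section.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing.

    Hypothetical helper for illustration; it is not part of TRL.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Report the two packages named in this section.
    for name, ver in installed_versions(["trl", "accelerate"]).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A `None` entry means the package is not visible to the current interpreter, which usually points to a different virtual environment than the one `pip` installed into.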
Notebooks
Interactive notebooks for quick experimentation. Find them in examples/notebooks/.
Getting started
Generic notebooks that work with any model. Start here!
| Notebook | Method | Model | Colab |
|---|---|---|---|
| SFT a 14B model with LoRA/QLoRA on Free Colab | SFT | Qwen3-14B | |
| GRPO a 7B model with LoRA/QLoRA on Free Colab | GRPO | Qwen2-7B | |
🤖 Agents
Train models for agentic tasks and tool use.
| Notebook | Method | Model | Colab |
|---|---|---|---|
| Agent Training Qwen3-1.7B with Tool Calling (BioGRID SQL) | GRPO | Qwen3-1.7B | ⚠️ Larger GPU |
🎮 OpenEnv
Train agents in interactive environments using OpenEnv.
| Notebook | Method | Model | Colab |
|---|---|---|---|
| Train Qwen3-1.7B to Play Wordle | GRPO | Qwen3-1.7B | |
| FunctionGemma for Browser Control (BrowserGym) | GRPO | FunctionGemma-270M | |
🎯 Model-specific
Notebooks for specific models, including Vision Language Models (VLM) and reasoning.
| Notebook | Method | Model | VLM | Colab |
|---|---|---|---|---|
| Add Reasoning Capabilities to rnj-1-instruct-1B with GRPO and QLoRA | GRPO | rnj-1-instruct-1B | | |
| SFT Ministral-3B VLM with QLoRA on Free Colab | SFT | Ministral-3B | ✅ | |
| GRPO Ministral-3B VLM with QLoRA on Free Colab | GRPO | Ministral-3B | ✅ | |
| SFT Qwen3-VL with QLoRA on Free Colab | SFT | Qwen3-VL | ✅ | |
| GRPO Qwen3-VL with QLoRA on Free Colab | GRPO | Qwen3-VL | ✅ | |
Scripts
Scripts are maintained in the trl/scripts and examples/scripts directories. They show how to use different trainers such as SFTTrainer, PPOTrainer, DPOTrainer, GRPOTrainer, and more.
| File | Description |
|---|---|
| examples/scripts/bco.py | This script shows how to use the experimental.kto.KTOTrainer with the BCO loss to fine-tune a model to increase instruction-following, truthfulness, honesty, and helpfulness using the openbmb/UltraFeedback dataset. |
| examples/scripts/cpo.py | This script shows how to use the experimental.cpo.CPOTrainer to fine-tune a model to increase helpfulness and harmlessness using the Anthropic/hh-rlhf dataset. |
| trl/scripts/dpo.py | This script shows how to use the DPOTrainer to fine-tune a model. |
| examples/scripts/dpo_vlm.py | This script shows how to use the DPOTrainer to fine-tune a Vision Language Model to reduce hallucinations using the openbmb/RLAIF-V-Dataset dataset. |
| examples/scripts/evals/judge_tldr.py | This script shows how to use experimental.judges.HfPairwiseJudge or experimental.judges.OpenAIPairwiseJudge to judge model generations. |
| examples/scripts/gkd.py | This script shows how to use the experimental.gkd.GKDTrainer to fine-tune a model. |
| trl/scripts/grpo.py | This script shows how to use the GRPOTrainer to fine-tune a model. |
| trl/scripts/grpo_agent.py | This script shows how to use the GRPOTrainer to fine-tune a model to enable agentic usage. |
| examples/scripts/grpo_vlm.py | This script shows how to use the GRPOTrainer to fine-tune a multimodal model for reasoning using the lmms-lab/multimodal-open-r1-8k-verified dataset. |
| examples/scripts/gspo.py | This script shows how to use GSPO via the GRPOTrainer to fine-tune a model for reasoning using the AI-MO/NuminaMath-TIR dataset. |
| examples/scripts/gspo_vlm.py | This script shows how to use GSPO via the GRPOTrainer to fine-tune a multimodal model for reasoning using the lmms-lab/multimodal-open-r1-8k-verified dataset. |
| examples/scripts/kto.py | This script shows how to use the experimental.kto.KTOTrainer to fine-tune a model. |
| examples/scripts/mpo_vlm.py | This script shows how to use MPO via the DPOTrainer to align a model based on preferences using the HuggingFaceH4/rlaif-v_formatted dataset and a weighted combination of losses. |
| examples/scripts/nash_md.py | This script shows how to use the experimental.nash_md.NashMDTrainer to fine-tune a model. |
| examples/scripts/online_dpo.py | This script shows how to use the experimental.online_dpo.OnlineDPOTrainer to fine-tune a model. |
| examples/scripts/online_dpo_vlm.py | This script shows how to use the experimental.online_dpo.OnlineDPOTrainer to fine-tune a Vision Language Model. |
| examples/scripts/openenv/browsergym.py | Simple script to run GRPO training via the GRPOTrainer with OpenEnv's BrowserGym environment and vLLM for VLMs. |
| examples/scripts/openenv/browsergym_llm.py | Simple script to run GRPO training via the GRPOTrainer with OpenEnv's BrowserGym environment and vLLM for LLMs. |
| examples/scripts/openenv/catch.py | Simple script to run GRPO training via the GRPOTrainer with OpenEnv's Catch environment (OpenSpiel) and vLLM. |
| examples/scripts/openenv/echo.py | Simple script to run GRPO training via the GRPOTrainer with OpenEnv's Echo environment and vLLM. |
| examples/scripts/openenv/wordle.py | Simple script to run GRPO training via the GRPOTrainer with OpenEnv's Wordle environment and vLLM. |
| examples/scripts/orpo.py | This script shows how to use the experimental.orpo.ORPOTrainer to fine-tune a model to increase helpfulness and harmlessness using the Anthropic/hh-rlhf dataset. |
| examples/scripts/ppo/ppo.py | This script shows how to use the experimental.ppo.PPOTrainer to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language. |
| examples/scripts/ppo/ppo_tldr.py | This script shows how to use the experimental.ppo.PPOTrainer to fine-tune a model to improve its ability to generate TL;DR summaries. |
| examples/scripts/prm.py | This script shows how to use the experimental.prm.PRMTrainer to fine-tune a Process-supervised Reward Model (PRM). |
| examples/scripts/reward_modeling.py | This script shows how to use the RewardTrainer to train an Outcome Reward Model (ORM) on your own dataset. |
| examples/scripts/rloo.py | This script shows how to use the RLOOTrainer to fine-tune a model to improve its ability to solve math questions. |
| examples/scripts/sft.py | This script shows how to use the SFTTrainer to fine-tune a model. |
| examples/scripts/sft_gemma3.py | This script shows how to use the SFTTrainer to fine-tune a Gemma 3 model. |
| examples/scripts/sft_video_llm.py | This script shows how to use the SFTTrainer to fine-tune a Video Language Model. |
| examples/scripts/sft_vlm.py | This script shows how to use the SFTTrainer to fine-tune a Vision Language Model in a chat setting. The script has only been tested with LLaVA 1.5, LLaVA 1.6, and Llama-3.2-11B-Vision-Instruct models, so users may see unexpected behaviour in other model architectures. |
| examples/scripts/sft_vlm_gemma3.py | This script shows how to use the SFTTrainer to fine-tune a Gemma 3 model on vision-to-text tasks. |
| examples/scripts/sft_vlm_smol_vlm.py | This script shows how to use the SFTTrainer to fine-tune a SmolVLM model. |
| examples/scripts/xpo.py | This script shows how to use the experimental.xpo.XPOTrainer to fine-tune a model. |
Distributed Training (for scripts)
You can run scripts on multiple GPUs with 🤗 Accelerate:
accelerate launch --config_file=examples/accelerate_configs/multi_gpu.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
For DeepSpeed ZeRO-{1,2,3}:
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero{1,2,3}.yaml --num_processes {NUM_GPUS} path_to_script.py --all_arguments_of_the_script
Adjust NUM_GPUS and --all_arguments_of_the_script as needed.
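The two launch commands above differ only in the config file they select, so the placeholder substitution can be sketched as a small Python helper. This is an illustration under stated assumptions, not part of TRL or Accelerate: the function name `build_launch_command` is hypothetical, the config paths are copied from the commands above, and the `--output_dir` value in the usage example is an illustrative script argument.

```python
import shlex

# Directory holding the config files referenced in the commands above.
CONFIG_DIR = "examples/accelerate_configs"


def build_launch_command(script, num_gpus, script_args=(), zero_stage=None):
    """Return the argv list for an `accelerate launch` invocation.

    Hypothetical helper: zero_stage=None selects multi_gpu.yaml, while
    1, 2, or 3 selects the matching deepspeed_zero{stage}.yaml file.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if zero_stage is None:
        config = f"{CONFIG_DIR}/multi_gpu.yaml"
    elif zero_stage in (1, 2, 3):
        config = f"{CONFIG_DIR}/deepspeed_zero{zero_stage}.yaml"
    else:
        raise ValueError("zero_stage must be None, 1, 2, or 3")
    return [
        "accelerate", "launch",
        f"--config_file={config}",
        "--num_processes", str(num_gpus),
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Illustrative values: 4 GPUs, ZeRO stage 2, one script argument.
    cmd = build_launch_command(
        "trl/scripts/dpo.py",
        num_gpus=4,
        script_args=["--output_dir", "dpo-output"],
        zero_stage=2,
    )
    # Print a shell-safe command line instead of executing it.
    print(shlex.join(cmd))
```

Returning an argv list rather than a single string keeps quoting out of the helper; the list can be handed to `subprocess.run` unchanged, or rendered with `shlex.join` for copy-pasting into a shell.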