Recipe: CollabLLM
Last updated: 09/22/2025.
Open-Source Algorithm Implementation & Experiment Running: Haiquan Chen, Shirley Wu
🏠 Homepage | 📝 Paper | 🤗 Datasets & Models | ⭐️ Original Implementation
verl provides a recipe for the Outstanding Paper at ICML 2025, "CollabLLM: From Passive Responders to Active Collaborators". CollabLLM is a unified fine-tuning framework that optimizes LLMs for effective and efficient multiturn collaboration with users.
Core Idea: Models are rewarded based on how well their responses enable effective future collaboration with users.
Paper Authors: Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
Quick Start
0. Environment
Make sure the required packages for verl are installed. Additionally, install litellm and export the required API keys. The API model will be used for user simulators and, optionally, LLM Judges (see the Configuration section below).
1. Prepare Your Dataset
First, process your dataset using the provided script (see example commands and usage in process_dataset.py):
python process_dataset.py --dataset <> ... --dataset_type <sft or rl>
Requirements:
- Input: A Hugging Face multiturn dataset. Existing datasets:
  `collabllm/collabllm-multiturn-$DATASET`, with `DATASET` in one of `math-hard(-large)`, `medium(-large)`, `bigcodebench(-large)` (the `*-large` variants are the datasets used in the CollabLLM paper)
- Example format: See `collabllm-multiturn-math-hard`
- To generate your own dataset: Use build_dataset.py from the original CollabLLM repository
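For orientation, a multiturn row is a sequence of role-tagged messages. The field names below (`messages`, `role`, `content`) are an assumption based on common Hugging Face chat formats, not the verified CollabLLM schema; check `collabllm-multiturn-math-hard` for the authoritative layout:

```python
# Hypothetical sketch of one multiturn dataset row; field names are
# illustrative assumptions, not the verified CollabLLM schema.
example_row = {
    "messages": [
        {"role": "user", "content": "Solve x^2 - 5x + 6 = 0."},
        {"role": "assistant", "content": "Do you want both roots, or just one?"},
        {"role": "user", "content": "Both, please."},
        {"role": "assistant", "content": "The roots are x = 2 and x = 3."},
    ],
}

# Turns should alternate between user and assistant.
roles = [m["role"] for m in example_row["messages"]]
assert roles == ["user", "assistant"] * 2
```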
2. Train Your Model
(Optional) For Supervised Fine-Tuning (SFT):
bash train_sft_collabllm.sh
For Reinforcement Learning (RL):
bash train_rl_collabllm.sh
The RL script shows an example of training CollabLLM on math-hard-large.
The config to sample future conversations is in `recipe/collabllm/config/collabllm_interaction_config.yaml`. The Multiturn-aware Reward is aggregated from these three conversation-level rewards:

```bash
+reward_model.reward_kwargs.metric_weights.accuracy=1 \
+reward_model.reward_kwargs.metric_weights.interactivity=1 \
+reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \
```

You can remove, add, or modify the weights depending on your task. The implemented metrics you can add are listed under `recipe/collabllm/metrics`. For example, on `medium-large` you can replace `accuracy` with `bleu_score` via `+reward_model.reward_kwargs.metric_weights.bleu_score=1`, which instead applies BLEU score to the sampled future conversations.
Algorithm
| Step | Name | Description |
|---|---|---|
| 1 | Model response generation | The model generates multiple responses for each prompt in a batch. |
| 2 | Collaborative simulation | A user simulator (e.g., GPT or Claude) samples num_repeat_rollouts conversations for up to max_user_turns additional turns. |
| 3 | Compute Multiturn-aware Reward | Customized conversational reward functions are applied to the sampled conversations. Rewards are aggregated, then averaged across rollouts. |
| 4 | Update model | The model weights are updated using the computed multiturn-aware rewards. |
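The reward computation in steps 2–3 can be sketched as below. The simulator and metric functions are stubbed placeholders (the real versions live in `recipe/collabllm/metrics` and call an API model via litellm), and all names here are illustrative, not the recipe's actual interfaces:

```python
# Stub standing in for the real user simulator (an API model via litellm).
def simulate_future_conversation(conversation, max_user_turns):
    return conversation + [{"role": "user", "content": "thanks"}]

# Stub metrics; real ones include task accuracy, LLM-judged interactivity,
# and a token-count cost term.
METRICS = {
    "accuracy": lambda conv: 1.0,
    "interactivity": lambda conv: 0.8,
    "token_amount": lambda conv: sum(len(m["content"]) for m in conv),
}

def multiturn_aware_reward(conversation, weights,
                           num_repeat_rollouts=3, max_user_turns=2):
    """Weighted sum of metrics per sampled rollout, averaged across rollouts."""
    totals = []
    for _ in range(num_repeat_rollouts):
        future = simulate_future_conversation(conversation, max_user_turns)
        totals.append(sum(w * METRICS[name](future)
                          for name, w in weights.items()))
    return sum(totals) / len(totals)

weights = {"accuracy": 1, "interactivity": 1, "token_amount": -0.0001}
conv = [{"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"}]
reward = multiturn_aware_reward(conv, weights)
```

With the stubs above, each rollout scores `1 + 0.8 - 0.0001 * 14`, so the averaged reward is deterministic; in the real recipe, each rollout samples a different future conversation.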
Configuration
The primary configuration is managed through the launch script train_rl_collabllm.sh and the YAML file recipe/collabllm/config/collabllm_interaction_config.yaml. Key configuration sections:
| Section | Key Parameters / Notes |
|---|---|
| `data` | Paths to training/validation files, batch sizes, sequence lengths. |
| `actor_rollout_ref` (common) | Base model path (used for actor + initial reference), FSDP settings, optimization (LR, scheduler). |
| `actor_rollout_ref` (CollabLLM-specific) | Hyperparameters under `actor_rollout_ref.rollout.multi_turn`: `max_user_turns`, `max_assistant_turns`, `num_repeat_rollouts`. |
| `interaction` | Defined in `collabllm_interaction_config.yaml`. Specifies the user simulator and its hyperparameters. Requires exported API keys. |
| `reward_model` | Manager set to `collabllm` by default. Modify `reward_model.reward_kwargs.metric_weights` for conversational rewards and weights. LLM Judge hyperparameters (e.g., model, temperature) go under `reward_model.reward_kwargs.llm_judge_kwargs`. |
| `algorithm` | GRPO-specific hyperparameters such as `actor_rollout_ref.rollout.n`. |
| `trainer` | Distributed training (nodes, GPUs per node), logging (WandB), checkpointing frequency. |
Key Files
| File Path | Purpose |
|---|---|
| `recipe/collabllm/collabllm_agent_loop.py` | Main logic to sample future conversations, using `CollabLLMInteraction` from `verl/interactions/collabllm_interaction.py`. |
| `verl/workers/reward_manager/collabllm.py` | Computes rewards for future conversations, leveraging `recipe/collabllm/reward_function.py` to apply each metric. |
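As a sketch of what a conversational metric might look like, here is a hypothetical cost-style metric; the exact interface expected by `reward_function.py` is an assumption, so see `recipe/collabllm/metrics` for the real signatures:

```python
def token_amount(conversation):
    """Hypothetical cost metric: total character count of assistant turns.
    Paired with a negative weight (as in the launch script), it penalizes
    verbose responses in the sampled future conversations."""
    return sum(len(m["content"])
               for m in conversation
               if m["role"] == "assistant")

conv = [
    {"role": "user", "content": "Summarize this."},
    {"role": "assistant", "content": "Sure: it is short."},
]
```

A metric like this is pure and cheap; metrics such as `interactivity` instead call an LLM Judge configured under `reward_model.reward_kwargs.llm_judge_kwargs`.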
Acknowledgement
We sincerely thank the verl community and advisors for their contributions and guidance!