| # Recipe: CollabLLM |
|
|
| Last updated: 09/22/2025. |
|
|
| > Open-Source Algorithm Implementation & Expriement Running: [Haiquan Chen](https://github.com/chenhaiq), [Shirley Wu](https://github.com/Wuyxin) |
|
|
| 🏠 [Homepage](https://aka.ms/CollabLLM) | 📝 [Paper](https://arxiv.org/pdf/2502.00640) | 🤗 [Datasets & Models](https://huggingface.co/collabllm) | ⭐️ [Original Implementation](https://github.com/Wuyxin/collabllm) |
|
|
| `verl` provides a recipe for the Outstanding Paper at ICML 2025, **"CollabLLM: From Passive Responders to Active Collaborators"**. [CollabLLM](https://aka.ms/CollabLLM) is a unified fine-tuning framework that optimizes LLMs for effective and efficient multiturn collaboration with users. |
|
|
| **Core Idea:** Models are rewarded based on how well their responses enable effective *future* collaboration with users. |
|
|
| Paper Authors: [Shirley Wu](https://cs.stanford.edu/~shirwu/), [Michel Galley](https://www.microsoft.com/en-us/research/people/mgalley/), Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, [James Zou](https://www.james-zou.com/), [Jure Leskovec](https://cs.stanford.edu/people/jure/), [Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/) |
|
|
|
|
| --- |
| ## Quick Start |
|
|
| ### 0. Environment |
| Make sure the required packages for `verl` are installed. Additionally, install `litellm` and export the required API keys. The API model will be used for user simulators and, optionally, LLM Judges (see the Configuration section below). |
|
|
| ### 1. Prepare Your Dataset |
|
|
| First, process your dataset using the provided script (see example commands and usage in `process_dataset.py`): |
|
|
| ```bash |
| python process_dataset.py --dataset <> ... --dataset_type <sft or rl> |
| ``` |
|
|
|
|
| **Requirements:** |
| - Input: A Hugging Face multiturn dataset. Existing datasets: `collabllm/collabllm-multiturn-$DATASET`, with `DATASET` in one of [`math-hard(-large)`, `medium(-large)`, `bigcodebench(-large)`] (*-large are the datasets used in the CollabLLM paper) |
| - Example format: See [collabllm-multiturn-math-hard](https://huggingface.co/datasets/collabllm/collabllm-multiturn-math-hard) |
| - To generate your own dataset: Use [build_dataset.py](https://github.com/Wuyxin/collabllm/blob/main/scripts/engine/build_dataset.py) from the original CollabLLM repository |
| |
| |
| ### 2. Train Your Model |
| |
| **(Optional) For Supervised Fine-Tuning (SFT):** |
| ```bash |
| bash train_sft_collabllm.sh |
| ``` |
| |
| **For Reinforcement Learning (RL):** |
| |
| ```bash |
| bash train_rl_collabllm.sh |
| ``` |
| |
| The RL script shows an example to train CollabLLM on `math-hard-large`. |
| |
| - The config to sample future conversations are in `recipe/collabllm/config/collabllm_interaction_config.yaml`. |
| - The Multiturn-aware Reward is aggregated from these three conversational-level rewards: |
| |
| ``` |
| +reward_model.reward_kwargs.metric_weights.accuracy=1 \ |
| +reward_model.reward_kwargs.metric_weights.interactivity=1 \ |
| +reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \ |
| ``` |
| |
| You can remove, add, or modify the weights depending on your task. A list of implemented metrics you can already add are under `recipe/collabllm/metrics`. For example, on `medium-large`, you can replace `accuracy` with `bleu_score` via |
| ``` |
| +reward_model.reward_kwargs.metric_weights.bleu_score=1 |
| ``` |
| which will instead apply bleu score on the sampled future conversations. |
| |
| ## Algorithm |
| |
| | Step | Name | Description | |
| |------|-------------------------------|-----------------------------------------------------------------------------| |
| | 1 | Model response generation | The model generates multiple responses for each prompt in a batch. | |
| | 2 | Collaborative simulation | A user simulator (e.g., GPT or Claude) samples `num_repeat_rollouts` conversations for up to `max_user_turns` additional turns. | |
| | 3 | Compute Multiturn-aware Reward | Customized conversational reward functions are applied to the sampled conversations. Rewards are aggregated, then averaged across rollouts. | |
| | 4 | Update model | The model weights are updated using the computed multiturn-aware rewards. | |
| |
| --- |
| |
| ## Configuration |
| |
| The primary configuration is managed through the launch script `train_rl_collabllm.sh` and the YAML file `recipe/collabllm/config/collabllm_interaction_config.yaml`. Key configuration sections: |
| |
| | Section | Key Parameters / Notes | |
| |----------------------|-----------------------------------------------------------------------------------------| |
| | `data` | Paths to training/validation files, batch sizes, sequence lengths. | |
| | `actor_rollout_ref` (common) | Base model path (used for actor + initial reference), FSDP settings, optimization (LR, scheduler). | |
| | `actor_rollout_ref` (CollabLLM-specific) | Hyperparameters under `actor_rollout_ref.rollout.multi_turn`: `max_user_turns`, `max_assistant_turns`, `num_repeat_rollouts`. | |
| | `interaction` | Defined in `collabllm_interaction_config.yaml`. Specifies user simulator and hyperparameters. Requires exported API keys. | |
| | `reward_model` | Manager set to `collabllm` by default. Modify `reward_model.reward_kwargs.metric_weights` for conversational rewards and weights. LLM Judge hyperparameters (e.g., `model`, `temperature`) go under `reward_model.reward_kwargs.llm_judge_kwargs`. | |
| | `algorithm` | GRPO-specific hyperparameters such as `actor_rollout_ref.rollout.n`. | |
| | `trainer` | Distributed training (nodes, GPUs per node), logging (WandB), checkpointing frequency. | |
| |
| --- |
| |
| ## Key Files |
| |
| | File Path | Purpose | |
| |-----------|---------| |
| | `recipe/collabllm/collabllm_agent_loop.py` | Main logic to sample future conversations, using `CollabLLMInteraction` from `verl/interactions/collabllm_interaction.py`. | |
| | `verl/workers/reward_manager/collabllm.py` | Computes rewards for future conversations, leveraging `recipe/collabllm/reward_function.py` to apply each metric. | |
| |
| --- |
| |
| ## Acknowledgement |
| |
| We sincerely thank the `verl` community and advisors for their contributions and guidance! |
| |