# Recipe: CollabLLM
Last updated: 09/22/2025.
> Open-Source Algorithm Implementation & Experiment Running: [Haiquan Chen](https://github.com/chenhaiq), [Shirley Wu](https://github.com/Wuyxin)
🏠 [Homepage](https://aka.ms/CollabLLM) | 📝 [Paper](https://arxiv.org/pdf/2502.00640) | 🤗 [Datasets & Models](https://huggingface.co/collabllm) | ⭐️ [Original Implementation](https://github.com/Wuyxin/collabllm)
`verl` provides a recipe for the Outstanding Paper at ICML 2025, **"CollabLLM: From Passive Responders to Active Collaborators"**. [CollabLLM](https://aka.ms/CollabLLM) is a unified fine-tuning framework that optimizes LLMs for effective and efficient multiturn collaboration with users.
**Core Idea:** Models are rewarded based on how well their responses enable effective *future* collaboration with users.
Paper Authors: [Shirley Wu](https://cs.stanford.edu/~shirwu/), [Michel Galley](https://www.microsoft.com/en-us/research/people/mgalley/), Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, [James Zou](https://www.james-zou.com/), [Jure Leskovec](https://cs.stanford.edu/people/jure/), [Jianfeng Gao](https://www.microsoft.com/en-us/research/people/jfgao/)
---
## Quick Start
### 0. Environment
Make sure the required packages for `verl` are installed. Additionally, install `litellm` and export the required API keys. The API model will be used for user simulators and, optionally, LLM Judges (see the Configuration section below).
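As a quick sanity check before launching, you can verify that at least one provider key is exported. The key names below are common `litellm` provider variables used for illustration, not an exhaustive or authoritative list:

```python
import os

def has_provider_key(env=None):
    """Return True if any of a few common litellm provider keys is set.

    The names below are illustrative; export whichever key your chosen
    user-simulator / judge model actually requires.
    """
    env = os.environ if env is None else env
    candidates = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "AZURE_API_KEY")
    return any(env.get(k) for k in candidates)
```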
### 1. Prepare Your Dataset
First, process your dataset using the provided script (see example commands and usage in `process_dataset.py`):
```bash
python process_dataset.py --dataset <> ... --dataset_type <sft or rl>
```
**Requirements:**
- Input: A Hugging Face multiturn dataset. Existing datasets: `collabllm/collabllm-multiturn-$DATASET`, where `DATASET` is one of `math-hard(-large)`, `medium(-large)`, `bigcodebench(-large)` (the `-large` variants are the datasets used in the CollabLLM paper)
- Example format: See [collabllm-multiturn-math-hard](https://huggingface.co/datasets/collabllm/collabllm-multiturn-math-hard)
- To generate your own dataset: Use [build_dataset.py](https://github.com/Wuyxin/collabllm/blob/main/scripts/engine/build_dataset.py) from the original CollabLLM repository
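The exact schema is documented on the dataset card linked above. As a rough sketch (the field layout here is an illustrative assumption, not the authoritative schema), a multiturn record stores OpenAI-style chat messages with alternating roles, which you can sanity-check like this:

```python
def roles_alternate(messages):
    """Check that roles strictly alternate in a chat message list.

    Each message is assumed to be a dict with 'role' and 'content'
    keys (OpenAI-style chat format); this helper is illustrative and
    not part of process_dataset.py.
    """
    roles = [m["role"] for m in messages]
    return all(a != b for a, b in zip(roles, roles[1:]))

# Hypothetical multiturn example in that format:
chat = [
    {"role": "user", "content": "Help me solve 2x + 3 = 11."},
    {"role": "assistant", "content": "Do you want just x, or the steps too?"},
    {"role": "user", "content": "The steps, please."},
]
```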
### 2. Train Your Model
**(Optional) For Supervised Fine-Tuning (SFT):**
```bash
bash train_sft_collabllm.sh
```
**For Reinforcement Learning (RL):**
```bash
bash train_rl_collabllm.sh
```
The RL script shows an example of training CollabLLM on `math-hard-large`.
- The config for sampling future conversations is in `recipe/collabllm/config/collabllm_interaction_config.yaml`.
- The Multiturn-aware Reward is aggregated from these three conversational-level rewards:
```
+reward_model.reward_kwargs.metric_weights.accuracy=1 \
+reward_model.reward_kwargs.metric_weights.interactivity=1 \
+reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \
```
You can remove, add, or modify the weights depending on your task. The implemented metrics you can add are listed under `recipe/collabllm/metrics`. For example, on `medium-large`, you can replace `accuracy` with `bleu_score` via
```
+reward_model.reward_kwargs.metric_weights.bleu_score=1
```
which applies the BLEU score to the sampled future conversations instead.
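Putting the pieces together, the reward for one model response could be computed roughly as below. This is a minimal sketch assuming each metric yields a scalar per sampled future conversation (the actual logic lives in `verl/workers/reward_manager/collabllm.py`; metric and weight names mirror the overrides above):

```python
def aggregate(metric_values, weights):
    """Weighted sum of per-conversation metric values."""
    return sum(weights[name] * metric_values[name] for name in weights)

def multiturn_aware_reward(rollout_metrics, weights):
    """Average the aggregated reward across sampled future rollouts."""
    totals = [aggregate(m, weights) for m in rollout_metrics]
    return sum(totals) / len(totals)

# Toy example: two sampled future conversations for one response.
weights = {"accuracy": 1.0, "interactivity": 1.0, "token_amount": -0.0001}
rollouts = [
    {"accuracy": 1.0, "interactivity": 0.8, "token_amount": 1200},
    {"accuracy": 0.0, "interactivity": 0.6, "token_amount": 900},
]
reward = multiturn_aware_reward(rollouts, weights)  # (1.68 + 0.51) / 2
```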
## Algorithm
| Step | Name | Description |
|------|-------------------------------|-----------------------------------------------------------------------------|
| 1 | Model response generation | The model generates multiple responses for each prompt in a batch. |
| 2 | Collaborative simulation | A user simulator (e.g., GPT or Claude) samples `num_repeat_rollouts` conversations for up to `max_user_turns` additional turns. |
| 3 | Compute Multiturn-aware Reward | Customized conversational reward functions are applied to the sampled conversations. Rewards are aggregated, then averaged across rollouts. |
| 4 | Update model | The model weights are updated using the computed multiturn-aware rewards. |
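The four steps above can be sketched schematically. The `generate`/`simulate`/`score`/`update` callables are stand-ins for illustration, not verl APIs:

```python
def collabllm_step(prompts, n_responses, num_repeat_rollouts,
                   generate, simulate, score, update):
    """Schematic of one CollabLLM training step (not the verl implementation)."""
    for prompt in prompts:
        # Step 1: sample multiple responses per prompt.
        responses = [generate(prompt) for _ in range(n_responses)]
        for resp in responses:
            # Step 2: user simulator samples future conversations.
            futures = [simulate(prompt, resp)
                       for _ in range(num_repeat_rollouts)]
            # Step 3: score each sampled conversation, average across rollouts.
            rewards = [score(conv) for conv in futures]
            resp_reward = sum(rewards) / len(rewards)
            # Step 4: update the policy with the multiturn-aware reward.
            update(prompt, resp, resp_reward)
```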
---
## Configuration
The primary configuration is managed through the launch script `train_rl_collabllm.sh` and the YAML file `recipe/collabllm/config/collabllm_interaction_config.yaml`. Key configuration sections:
| Section | Key Parameters / Notes |
|----------------------|-----------------------------------------------------------------------------------------|
| `data` | Paths to training/validation files, batch sizes, sequence lengths. |
| `actor_rollout_ref` (common) | Base model path (used for actor + initial reference), FSDP settings, optimization (LR, scheduler). |
| `actor_rollout_ref` (CollabLLM-specific) | Hyperparameters under `actor_rollout_ref.rollout.multi_turn`: `max_user_turns`, `max_assistant_turns`, `num_repeat_rollouts`. |
| `interaction` | Defined in `collabllm_interaction_config.yaml`. Specifies user simulator and hyperparameters. Requires exported API keys. |
| `reward_model` | Manager set to `collabllm` by default. Modify `reward_model.reward_kwargs.metric_weights` for conversational rewards and weights. LLM Judge hyperparameters (e.g., `model`, `temperature`) go under `reward_model.reward_kwargs.llm_judge_kwargs`. |
| `algorithm` | GRPO-specific hyperparameters such as `actor_rollout_ref.rollout.n`. |
| `trainer` | Distributed training (nodes, GPUs per node), logging (WandB), checkpointing frequency. |
---
## Key Files
| File Path | Purpose |
|-----------|---------|
| `recipe/collabllm/collabllm_agent_loop.py` | Main logic to sample future conversations, using `CollabLLMInteraction` from `verl/interactions/collabllm_interaction.py`. |
| `verl/workers/reward_manager/collabllm.py` | Computes rewards for future conversations, leveraging `recipe/collabllm/reward_function.py` to apply each metric. |
---
## Acknowledgement
We sincerely thank the `verl` community and advisors for their contributions and guidance!