
Recipe: CollabLLM

Last updated: 09/22/2025.

Open-Source Algorithm Implementation & Experiment Running: Haiquan Chen, Shirley Wu

🏠 Homepage | 📝 Paper | 🤗 Datasets & Models | ⭐️ Original Implementation

verl provides a recipe for the Outstanding Paper at ICML 2025, "CollabLLM: From Passive Responders to Active Collaborators". CollabLLM is a unified fine-tuning framework that optimizes LLMs for effective and efficient multiturn collaboration with users.

Core Idea: Models are rewarded based on how well their responses enable effective future collaboration with users.

Paper Authors: Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao


Quick Start

0. Environment

Make sure the required packages for verl are installed. Additionally, install litellm and export the required API keys. The API model will be used for user simulators and, optionally, LLM Judges (see the Configuration section below).
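For example, litellm reads provider API keys from standard environment variables. The values below are placeholders; export only the keys for the provider(s) your user simulator and judge actually use:

```shell
# Placeholder keys: litellm picks these up from the environment.
export OPENAI_API_KEY="sk-example"          # for OpenAI simulator/judge models
export ANTHROPIC_API_KEY="sk-ant-example"   # for Claude simulator/judge models
```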

1. Prepare Your Dataset

First, process your dataset using the provided script (see example commands and usage in process_dataset.py):

python process_dataset.py --dataset <> ... --dataset_type <sft or rl>

Requirements:

  • Input: A Hugging Face multiturn dataset. Existing datasets: collabllm/collabllm-multiturn-$DATASET, where DATASET is one of math-hard(-large), medium(-large), or bigcodebench(-large) (the *-large variants are the datasets used in the CollabLLM paper)
  • Example format: See collabllm-multiturn-math-hard
  • To generate your own dataset: Use build_dataset.py from the original CollabLLM repository
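For instance, to prepare one of the existing datasets listed above for RL training (only the two flags shown in the command above are used here; see process_dataset.py for the full set of options):

```shell
# Example: process the math-hard dataset for RL training
python process_dataset.py \
    --dataset collabllm/collabllm-multiturn-math-hard \
    --dataset_type rl
```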

2. Train Your Model

(Optional) For Supervised Fine-Tuning (SFT):

bash train_sft_collabllm.sh

For Reinforcement Learning (RL):

bash train_rl_collabllm.sh

The RL script shows an example of training CollabLLM on math-hard-large.

  • The configuration for sampling future conversations is in recipe/collabllm/config/collabllm_interaction_config.yaml.

  • The Multiturn-aware Reward is aggregated from these three conversational-level rewards:

    +reward_model.reward_kwargs.metric_weights.accuracy=1 \
    +reward_model.reward_kwargs.metric_weights.interactivity=1 \
    +reward_model.reward_kwargs.metric_weights.token_amount=-0.0001 \
    

    You can remove, add, or modify the weights depending on your task. The implemented metrics you can add are listed under recipe/collabllm/metrics. For example, on medium-large, you can replace accuracy with bleu_score via

    +reward_model.reward_kwargs.metric_weights.bleu_score=1 
    

    which applies the BLEU score to the sampled future conversations instead.
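Conceptually, the weighted metrics are summed per sampled conversation and then averaged across rollouts. The helper below is a hypothetical sketch of that aggregation (the actual logic lives in verl/workers/reward_manager/collabllm.py and recipe/collabllm/reward_function.py):

```python
def multiturn_aware_reward(rollout_metrics, metric_weights):
    """Weighted sum of conversation-level metrics, averaged over rollouts.

    rollout_metrics: one dict of metric values per sampled future conversation
    metric_weights:  mirrors reward_model.reward_kwargs.metric_weights
    """
    per_rollout = [
        sum(weight * metrics[name] for name, weight in metric_weights.items())
        for metrics in rollout_metrics
    ]
    return sum(per_rollout) / len(per_rollout)

# Weights as in the override example above; metric values are illustrative.
weights = {"accuracy": 1, "interactivity": 1, "token_amount": -0.0001}
rollouts = [
    {"accuracy": 1.0, "interactivity": 0.8, "token_amount": 1200},
    {"accuracy": 0.0, "interactivity": 0.6, "token_amount": 800},
]
print(multiturn_aware_reward(rollouts, weights))  # 1.1
```

Note the small negative weight on token_amount, which penalizes verbose conversations without dominating the task rewards.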

Algorithm

  1. Model response generation: The model generates multiple responses for each prompt in a batch.
  2. Collaborative simulation: A user simulator (e.g., GPT or Claude) samples num_repeat_rollouts conversations for up to max_user_turns additional turns.
  3. Compute Multiturn-aware Reward: Customized conversational reward functions are applied to the sampled conversations. Rewards are aggregated, then averaged across rollouts.
  4. Update model: The model weights are updated using the computed multiturn-aware rewards.
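Steps 1 to 3 can be sketched as follows. All callables here are hypothetical stand-ins for verl components, not the actual API; the returned rewards feed the policy update in step 4:

```python
def multiturn_rewards(prompts, generate, simulate_user, score,
                      num_repeat_rollouts, max_user_turns):
    """Return one multiturn-aware reward per prompt (steps 1-3)."""
    rewards = []
    for prompt in prompts:
        response = generate(prompt)                       # step 1
        rollouts = [simulate_user(prompt, response, max_user_turns)
                    for _ in range(num_repeat_rollouts)]  # step 2
        scores = [score(conv) for conv in rollouts]       # step 3
        rewards.append(sum(scores) / len(scores))
    return rewards  # step 4 updates the policy with these rewards

# Toy demo with trivial stand-ins for the model, simulator, and reward
demo = multiturn_rewards(
    prompts=["2+2?"],
    generate=lambda p: p + " -> 4",
    simulate_user=lambda p, r, turns: [p, r],
    score=lambda conv: float(len(conv)),
    num_repeat_rollouts=3,
    max_user_turns=2,
)
print(demo)  # [2.0]
```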

Configuration

The primary configuration is managed through the launch script train_rl_collabllm.sh and the YAML file recipe/collabllm/config/collabllm_interaction_config.yaml. Key configuration sections:

  • data: Paths to training/validation files, batch sizes, sequence lengths.
  • actor_rollout_ref (common): Base model path (used for actor + initial reference), FSDP settings, optimization (LR, scheduler).
  • actor_rollout_ref (CollabLLM-specific): Hyperparameters under actor_rollout_ref.rollout.multi_turn: max_user_turns, max_assistant_turns, num_repeat_rollouts.
  • interaction: Defined in collabllm_interaction_config.yaml. Specifies the user simulator and its hyperparameters. Requires exported API keys.
  • reward_model: Manager set to collabllm by default. Modify reward_model.reward_kwargs.metric_weights for conversational rewards and weights. LLM Judge hyperparameters (e.g., model, temperature) go under reward_model.reward_kwargs.llm_judge_kwargs.
  • algorithm: GRPO-specific hyperparameters such as actor_rollout_ref.rollout.n.
  • trainer: Distributed training (nodes, GPUs per node), logging (WandB), checkpointing frequency.
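As an illustration, the CollabLLM-specific parameters above can be overridden in the same Hydra style as the reward-weight overrides shown earlier. The values and judge model below are examples only, not recommendations:

```shell
# Illustrative overrides appended to the python command in train_rl_collabllm.sh
actor_rollout_ref.rollout.multi_turn.max_user_turns=3 \
actor_rollout_ref.rollout.multi_turn.num_repeat_rollouts=4 \
+reward_model.reward_kwargs.llm_judge_kwargs.model=gpt-4o \
+reward_model.reward_kwargs.llm_judge_kwargs.temperature=0.0 \
```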

Key Files

  • recipe/collabllm/collabllm_agent_loop.py: Main logic for sampling future conversations, using CollabLLMInteraction from verl/interactions/collabllm_interaction.py.
  • verl/workers/reward_manager/collabllm.py: Computes rewards for the sampled future conversations, leveraging recipe/collabllm/reward_function.py to apply each metric.

Acknowledgement

We sincerely thank the verl community and advisors for their contributions and guidance!