arithmetic-grpo / docs /sglang_multiturn /interaction_system.rst

initial clean commit

1faccd4 about 1 month ago

14.9 kB

	Interaction System for Multi-turn RL Training
	=============================================

	Last updated: 06/25/2025.

	Overview
	--------

	The verl interaction system enables dynamic, multi-turn conversational feedback during reinforcement learning training. This system allows models to engage in iterative problem-solving scenarios where interaction agents can provide corrective feedback, guidance, or evaluation based on the model's responses.

	New in Multi-Interaction Support: The system now supports multiple named interactions within a single training session, enabling sophisticated training scenarios where different samples can use different interaction strategies. This allows for curriculum learning, domain-specific feedback, and flexible agent switching at the sample level.

	Key features:

	- Async-based Architecture: Non-blocking interaction processing for distributed training
	- Instance Management: Stateful session handling with unique instance IDs for concurrent interactions
	- SGLang Integration: Seamless integration with SGLang rollout system for multi-turn conversations
	- Configuration-driven: Dynamic agent loading via YAML configuration files
	- Multi-Interaction Support: Registry system enabling multiple named interactions per rollout
	- Sample-Level Selection: Each sample can specify which interaction to use via configuration
	- Reward Integration: Turn-level scoring mechanism integrated with verl's reward system

	Architecture
	------------

	The interaction system follows a plugin-based architecture with clear separation of concerns:

	.. code-block::

	Interaction Registry System
	↓
	BaseInteraction (Abstract Interface)
	↓
	Multiple Named Interactions (e.g., Gsm8kInteraction, CustomInteraction)
	↓
	SGLang Rollout Integration (interaction_map)
	↓
	Sample-Level Interaction Selection
	↓
	Async Request Lifecycle Management

	Core Components
	~~~~~~~~~~~~~~~

	Interaction Registry System

	The interaction registry system allows loading and managing multiple named interactions:

	.. code-block:: python

	from verl.interactions.utils.interaction_registry import initialize_interactions_from_config

	# Load multiple interactions from config
	interaction_map = initialize_interactions_from_config("config.yaml")

	# Access specific interaction by name
	gsm8k_interaction = interaction_map["gsm8k"]
	custom_interaction = interaction_map["custom_solver"]

	BaseInteraction Interface

	All interaction agents must implement the ``BaseInteraction`` abstract class:

	.. code-block:: python

	from verl.interactions.base import BaseInteraction
	from typing import Dict, Any, List, Tuple, Optional

	class BaseInteraction:
	def __init__(self, config: Dict[str, Any]):
	self.config = config
	self.name: str = config.get("name", "interaction_agent")

	async def start_interaction(self, instance_id: Optional[str] = None, **kwargs) -> str:
	"""Initialize interaction session, return instance_id"""

	async def generate_response(self, instance_id: str, messages: List[Dict[str, Any]], **kwargs) -> Tuple[bool, str, float, Dict[str, Any]]:
	"""Generate response, return (should_terminate, response, score, metadata)"""

	async def calculate_score(self, instance_id: str, **kwargs) -> float:
	"""Calculate turn-level score for RL training"""

	async def finalize_interaction(self, instance_id: str, **kwargs) -> None:
	"""Clean up resources"""

	Request Lifecycle

	The interaction system integrates with SGLang's async rollout via state management:

	1. ``PENDING`` → Initialize interaction via ``start_interaction()``
	2. ``GENERATING`` → Model generates response
	3. ``INTERACTING`` → Process response via ``generate_response()``
	4. ``GENERATING`` → Continue if not terminated, otherwise ``COMPLETED``

	Configuration
	-------------

	Basic Setup

	Enable interaction in your rollout configuration:

	.. code-block:: yaml

	actor_rollout_ref:
	rollout:
	multi_turn:
	enable: true
	interaction_config_path: "path/to/interaction_config.yaml"
	max_user_turns: 10
	max_assistant_turns: 10

	Interaction Configuration File

	Create an interaction configuration file (e.g., ``interaction_config.yaml``):

	Single Interaction (Legacy Format)

	.. code-block:: yaml

	interaction:
	- name: "gsm8k"
	class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
	config: {}

	Multiple Interactions (New Format)

	.. code-block:: yaml

	interaction:
	- name: "gsm8k"
	class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
	config: {}
	- name: "custom_solver"
	class_name: "custom.interactions.CustomInteraction"
	config:
	solver_type: "advanced"
	timeout: 30
	- name: "code_verifier"
	class_name: "verl.interactions.base.BaseInteraction"
	config:
	verification_mode: "strict"

	Automatic Name Generation

	If no ``name`` field is provided, the system will automatically generate one from the class name:

	.. code-block:: yaml

	interaction:
	- class_name: "verl.interactions.gsm8k_interaction.Gsm8kInteraction"
	config: {}
	# Automatically generates name: "gsm8k"

	The system will dynamically load all specified interaction classes and make them available by name.

	Implementation Example: GSM8K
	-----------------------------

	The GSM8K interaction demonstrates a complete implementation for math problem-solving scenarios:

	.. code-block:: python

	from verl.interactions.base import BaseInteraction
	from verl.utils.reward_score import gsm8k
	from uuid import uuid4

	class Gsm8kInteraction(BaseInteraction):
	def __init__(self, config: dict):
	super().__init__(config)
	self._instance_dict = {}

	async def start_interaction(self, instance_id=None, ground_truth=None, **kwargs):
	if instance_id is None:
	instance_id = str(uuid4())
	self._instance_dict[instance_id] = {
	"response": "",
	"ground_truth": ground_truth,
	"reward": 0.0,
	}
	return instance_id

	async def generate_response(self, instance_id, messages, **kwargs):
	# Extract last assistant message content
	content = ""
	for item in reversed(messages):
	if item.get("role") == "assistant":
	content = item.get("content", "")
	break

	# Ensure GSM8K format (#### prefix)
	self._instance_dict[instance_id]["response"] = content

	reward = await self.calculate_score(instance_id)
	if reward == 1.0:
	return True, "Your response is correct!", 1.0, {}
	else:
	return False, "Your response is incorrect! You need to reflect on your answer and try again.", 0.0, {}

	async def calculate_score(self, instance_id, **kwargs):
	return gsm8k.compute_score(
	self._instance_dict[instance_id]["response"],
	self._instance_dict[instance_id]["ground_truth"],
	method="strict", format_score=0.0, score=1.0,
	)

	async def finalize_interaction(self, instance_id, **kwargs):
	del self._instance_dict[instance_id]

	Training Integration
	--------------------

	Training Script Configuration

	Include interaction configuration in your training command:

	.. code-block:: bash

	python3 -m verl.trainer.main_ppo \\
	--config-path="$CONFIG_PATH" \\
	--config-name='gsm8k_multiturn_grpo_w_interaction' \\
	algorithm.adv_estimator=grpo \\
	data.train_batch_size=512 \\
	data.return_raw_chat=True \\
	actor_rollout_ref.rollout.name=sglang \\
	actor_rollout_ref.rollout.multi_turn.interaction_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/interaction_config/gsm8k_interaction_config.yaml" \\
	trainer.total_epochs=15

	Data Requirements

	Ensure your dataset includes interaction parameters with the ``name`` field for interaction selection:

	.. code-block:: python

	# Dataset should include interaction_kwargs in non_tensor_batch
	interaction_kwargs = [
	{"name": "gsm8k", "query": "What is 2+2?", "ground_truth": "4"},
	{"name": "custom_solver", "query": "Solve: x^2 + 5x + 6 = 0", "ground_truth": "x = -2, -3"},
	{"name": "gsm8k", "query": "What is 3+3?", "ground_truth": "6"},
	]

	Sample-Level Interaction Selection

	Each sample can specify which interaction to use via the ``name`` field. This enables flexible training scenarios where different samples use different interaction strategies:

	.. code-block:: python

	# Example: Math problems use GSM8K interaction, code problems use code verifier
	data_samples = [
	{
	"prompt": "What is 15% of 200?",
	"interaction_kwargs": {
	"name": "gsm8k",
	"query": "What is 15% of 200?",
	"ground_truth": "30"
	}
	},
	{
	"prompt": "Write a function to check if a number is prime",
	"interaction_kwargs": {
	"name": "code_verifier",
	"code_type": "python",
	"expected_behavior": "return True for prime numbers"
	}
	}
	]

	Backward Compatibility

	If no ``name`` field is provided in ``interaction_kwargs``, the system defaults to ``"gsm8k"`` for backward compatibility.

	Best Practices
	--------------

	Resource Management

	- Always implement proper cleanup in ``finalize_interaction()``
	- Use unique instance IDs to avoid conflicts in concurrent training
	- Handle edge cases like empty messages or malformed content

	Performance Optimization

	- Keep interaction logic lightweight to avoid blocking training
	- Use async/await properly to maintain non-blocking behavior
	- Consider caching expensive computations within interaction instances

	Testing

	Comprehensive testing is essential for interaction systems:

	.. code-block:: python

	import pytest
	from unittest.mock import patch

	@pytest.mark.asyncio
	async def test_interaction_workflow():
	interaction = YourInteraction({})

	# Test complete workflow
	instance_id = await interaction.start_interaction(ground_truth="expected_answer")


	messages = [{"role": "user", "content": "user_content"}, {"role": "assistant", "content": "assistant_content"}]
	should_terminate, response, reward, metadata = await interaction.generate_response(instance_id, messages)

	assert should_terminate in [True, False]
	assert isinstance(reward, float)

	await interaction.finalize_interaction(instance_id)

	Advanced Usage
	--------------

	Multi-Interaction Training Strategies

	You can design sophisticated training scenarios using multiple interactions:

	.. code-block:: python

	# Example: Progressive difficulty with different interaction agents
	class MathTrainingPipeline:
	def create_interaction_config(self):
	return {
	"interaction": [
	{
	"name": "basic_math",
	"class_name": "verl.interactions.gsm8k_interaction.Gsm8kInteraction",
	"config": {"difficulty": "easy"}
	},
	{
	"name": "advanced_math",
	"class_name": "custom.interactions.AdvancedMathInteraction",
	"config": {"difficulty": "hard", "allow_hints": True}
	},
	{
	"name": "competition_math",
	"class_name": "custom.interactions.CompetitionMathInteraction",
	"config": {"time_limit": 300, "show_steps": False}
	}
	]
	}

	def create_curriculum_data(self, epoch):
	if epoch < 5:
	return [{"name": "basic_math", ...} for _ in samples]
	elif epoch < 10:
	return [{"name": "advanced_math", ...} for _ in samples]
	else:
	return [{"name": "competition_math", ...} for _ in samples]

	Custom Scoring Functions

	You can integrate custom reward functions:

	.. code-block:: python

	async def calculate_score(self, instance_id, **kwargs):
	response = self._instance_dict[instance_id]["response"]
	ground_truth = self._instance_dict[instance_id]["ground_truth"]

	# Custom evaluation logic
	if custom_evaluation_function(response, ground_truth):
	return 1.0
	else:
	return 0.0

	Multi-step Interactions

	For complex scenarios requiring multiple feedback rounds:

	.. code-block:: python

	async def generate_response(self, instance_id, messages, **kwargs):
	instance = self._instance_dict[instance_id]
	instance["attempts"] += 1

	# Evaluate current response
	reward = await self.calculate_score(instance_id)

	if reward > 0.8:
	return True, "Excellent work!", reward, {}
	elif instance["attempts"] < 3:
	return False, "Good attempt, but try to improve...", reward, {}
	else:
	return True, "Maximum attempts reached.", reward, {}

	Troubleshooting
	---------------

	Common Issues

	1. Instance ID Conflicts: Ensure unique instance IDs across concurrent sessions
	2. Memory Leaks: Always call ``finalize_interaction()`` to clean up resources
	3. Blocking Operations: Keep interaction logic async and non-blocking
	4. Configuration Errors: Verify interaction config path and class name are correct
	5. Interaction Name Conflicts: Ensure all interactions have unique names in the configuration
	6. Missing Interaction: Verify the ``name`` field in ``interaction_kwargs`` matches available interactions
	7. Backward Compatibility: When migrating from single to multi-interaction, add ``name`` fields to existing data

	Debugging

	Enable debug logging to trace interaction flow:

	.. code-block:: bash

	export VERL_LOGGING_LEVEL=DEBUG

	Performance Monitoring

	Monitor interaction performance impact on training throughput and adjust accordingly.

	Related Documentation
	--------------------

	- :doc:`multiturn`: Basic multi-turn rollout configuration
	- :doc:`sandbox_fusion`: Tool integration with SGLang
	- :doc:`search_tool_example`: Search tool implementation example