# Reward Model Environment

An environment that uses an external reward model hosted via vLLM to train LLMs. This environment communicates with a reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.
## Features

- **External Reward Model Integration**: Connects to reward models hosted via vLLM's `/classify` endpoint
- **Automatic Model Discovery**: Fetches the reward model name from `/v1/models`
- **Batched Requests**: Sends all rollouts in a single batch request for efficiency
- **Retry Logic**: Automatically retries failed requests with exponential backoff
- **Chat Template Support**: Formats conversations using tokenizer chat templates
- **Sanity Checks**: Logs statistics and warnings for reward values to ensure proper scaling
## Installation

```bash
uv run vf-install reward-model-env
```
## Usage

### Basic Example

```python
import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with a 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model (run inside an async context)
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)
```

See `example.py` for a complete working example.
### Environment Variables

Set `REWARD_MODEL_URL` to avoid passing it as an argument:

```bash
export REWARD_MODEL_URL="http://localhost:8002"
```
## Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

```bash
# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
    --port 8002 \
    --enable-classification
```
## API Format

The environment expects the following API endpoints:

### `/v1/models` (GET)

Returns available models:

```json
{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
```
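Model discovery amounts to reading the first `id` from that response. A minimal sketch (function names are assumptions, not the environment's actual code):

```python
import json
import urllib.request

# Pull the first model id out of a /v1/models payload.
def extract_model_name(payload: dict) -> str:
    models = payload.get("data", [])
    if not models:
        raise RuntimeError("no models in /v1/models response")
    return models[0]["id"]

# Fetch the payload from a running server and extract the model name.
def discover_model_name(base_url: str) -> str:
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return extract_model_name(json.load(resp))
```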
### `/classify` (POST)

Request:

```json
{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}
```

Response:

```json
{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}
```

The `probs[0]` value is used as the reward.
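Putting the two halves together, a batched `/classify` call and the reward extraction could be sketched as follows (illustrative only; the function names are assumptions, not the environment's actual internals):

```python
import json
import urllib.request

# Sort results by index, validate the count, and take probs[0] as the reward.
def rewards_from_response(payload: dict, expected: int) -> list[float]:
    data = sorted(payload["data"], key=lambda item: item["index"])
    if len(data) != expected:
        raise RuntimeError("response count does not match input count")
    return [item["probs"][0] for item in data]

# Send all formatted conversations in one batched /classify request.
def fetch_rewards(base_url: str, model: str, texts: list[str]) -> list[float]:
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        f"{base_url}/classify",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return rewards_from_response(json.load(resp), len(texts))
```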
## Chat Template Formatting

The environment formats multi-turn conversations for the reward model:

```python
# Input conversation
[
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"
```

If you provide a `tokenizer_path`, it will use the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
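The fallback described above can be sketched roughly like this (a simplified illustration, assuming only user/assistant turns; the real environment may handle more roles):

```python
# Wrap user turns in [INST]...[/INST] and append assistant turns verbatim,
# bracketing the whole conversation with <s>...</s>.
def format_llama_style(messages: list[dict]) -> str:
    parts = ["<s>"]
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"[INST]{msg['content']}[/INST]")
        elif msg["role"] == "assistant":
            parts.append(msg["content"])
    parts.append("</s>")
    return "".join(parts)
```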
## Configuration Options

- `dataset_name` (str): Hugging Face dataset name
- `reward_model_url` (str): Base URL for the reward model API
- `dataset_config` (str | None): Dataset config name (e.g., `"main"` for gsm8k; optional)
- `tokenizer_path` (str | None): Path to `tokenizer.json` for chat template formatting
- `system_prompt` (str): System prompt for the environment (default: `"You are a helpful assistant."`)
- `num_train_examples` (int): Number of training examples (`-1` for all)
- `num_eval_examples` (int): Number of eval examples (`-1` for all)
- `max_retries` (int): Maximum retry attempts for API calls (default: `3`)
- `retry_delay` (float): Base delay between retries in seconds (default: `1.0`)
- `timeout` (float): Request timeout in seconds (default: `120.0`)
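The `max_retries` and `retry_delay` options describe exponential backoff along these lines (a hypothetical sketch of the retry behaviour, not the environment's actual implementation):

```python
import time

# Retry a callable up to max_retries times, doubling the base delay after
# each failed attempt (retry_delay, 2*retry_delay, 4*retry_delay, ...).
def call_with_retries(fn, max_retries: int = 3, retry_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the final error
            time.sleep(retry_delay * (2 ** attempt))
```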
## Sanity Checks

The environment includes several sanity checks:

1. **Reward Range Logging**: Logs the min, max, mean, and median reward for each batch
2. **Small Value Warnings**: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
3. **Response Validation**: Ensures the API response structure is correct and matches the input
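The first two checks amount to something like the following (an illustrative sketch; the real environment logs rather than prints, and its names may differ):

```python
import statistics

# Summarize a batch of rewards and warn when every value is suspiciously
# tiny, which often signals prompt truncation upstream.
def summarize_rewards(rewards: list[float]) -> dict:
    stats = {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
    }
    if all(abs(r) < 1e-10 for r in rewards):
        print("WARNING: all rewards < 1e-10; check for truncation issues")
    return stats
```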
## Training Example

Use with `vf-rl` for reinforcement learning:

```toml
# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100
```

```bash
uv run vf-rl @ configs/rl/reward_model.toml
```
## Troubleshooting

### Connection Issues

- Ensure your reward model is running and accessible at the specified URL
- Check firewall settings if connecting to a remote server
- Verify the `/v1/models` endpoint returns valid data

### Reward Scaling

- Check the logged reward statistics to ensure values are in the expected range
- If rewards are too small, they may not provide sufficient training signal
- Consider normalizing or scaling rewards based on your use case

### Chat Template Issues

- If using a tokenizer, ensure it has a chat template defined
- The fallback simple formatting works for most Llama-style models
- Check the logged sample conversation to verify formatting is correct