# Reward Model Environment
An environment that trains LLMs against an external reward model hosted via vLLM. It communicates with the reward model's API, formats conversations using chat templates, batches requests for efficiency, and retries failed calls for robustness.
## Features
- **External Reward Model Integration**: Connects to reward models hosted via vLLM's `/classify` endpoint
- **Automatic Model Discovery**: Fetches the reward model name from `/v1/models`
- **Batched Requests**: Sends all rollouts in a single batch request for efficiency
- **Retry Logic**: Automatically retries failed requests with exponential backoff
- **Chat Template Support**: Properly formats conversations using tokenizer chat templates
- **Sanity Checks**: Logs statistics and warnings for reward values to ensure proper scaling
## Installation
```bash
uv run vf-install reward-model-env
```
## Usage
### Basic Example
```python
import verifiers as vf
# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",                       # HF dataset with 'prompt' or 'question' column
    dataset_config="main",                      # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",   # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",          # Optional: path to tokenizer for chat template
    num_train_examples=100,                     # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model (requires an async context)
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)
```
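The snippet above uses `await`, so it must run inside an async context (for example, a notebook). From a plain script, you can wrap the call with `asyncio.run`, reusing the `vf_env` created above:

```python
import asyncio
from openai import AsyncOpenAI

async def main():
    results = await vf_env.evaluate(
        client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
        model="your-model",
        num_examples=10,
        rollouts_per_example=1,
    )
    print(results)

asyncio.run(main())
```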
See `example.py` for a complete working example.
### Environment Variables
Set `REWARD_MODEL_URL` to avoid passing it as an argument:
```bash
export REWARD_MODEL_URL="http://localhost:8002"
```
## Reward Model Setup
This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:
```bash
# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
    --port 8002 \
    --enable-classification
```
## API Format
The environment expects the following API endpoints:
### `/v1/models` (GET)
Returns available models:
```json
{
"data": [
{"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
]
}
```
### `/classify` (POST)
Request:
```json
{
"model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
"input": [
"<s>[INST]question[/INST]answer</s>"
]
}
```
Response:
```json
{
"data": [
{
"index": 0,
"label": "LABEL_0",
"probs": [0.85],
"num_classes": 1
}
]
}
```
The `probs[0]` value is used as the reward.
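For a manual smoke test of both endpoints, here is a minimal client sketch (using `requests`; the URL, payload, and field names mirror the examples above):

```python
import requests

BASE_URL = "http://localhost:8002"

# Discover the reward model name from /v1/models
model_name = requests.get(f"{BASE_URL}/v1/models", timeout=30).json()["data"][0]["id"]

# Score one formatted conversation via /classify
payload = {
    "model": model_name,
    "input": ["<s>[INST]question[/INST]answer</s>"],
}
data = requests.post(f"{BASE_URL}/classify", json=payload, timeout=120).json()["data"]

# probs[0] of each item is the reward
rewards = [item["probs"][0] for item in data]
print(rewards)
```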
## Chat Template Formatting
The environment properly formats multi-turn conversations for the reward model:
```python
# Input conversation
[
{"role": "user", "content": "lets do python coding"},
{"role": "assistant", "content": "Sure! How'd you like to get started?"}
]
# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"
```
If you provide a `tokenizer_path`, the environment uses the tokenizer's native chat template; otherwise it falls back to a simple Llama-style format.
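Roughly, the formatting logic amounts to the following (an illustrative sketch, not the environment's exact internals):

```python
def format_conversation(messages: list[dict], tokenizer=None) -> str:
    """Render a chat as a single string for the reward model."""
    if tokenizer is not None and getattr(tokenizer, "chat_template", None):
        # Preferred path: the tokenizer's own chat template
        return tokenizer.apply_chat_template(messages, tokenize=False)
    # Fallback: simple Llama-style formatting, as in the example above
    text = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            text += f"[INST]{msg['content']}[/INST]"
        elif msg["role"] == "assistant":
            text += msg["content"]
    return text + "</s>"
```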
## Configuration Options
- `dataset_name` (str): Hugging Face dataset name
- `reward_model_url` (str): Base URL for the reward model API
- `dataset_config` (str | None): Dataset config name (e.g., "main" for gsm8k, optional)
- `tokenizer_path` (str | None): Path to tokenizer.json for chat template formatting
- `system_prompt` (str): System prompt for the environment (default: "You are a helpful assistant.")
- `num_train_examples` (int): Number of training examples (-1 for all)
- `num_eval_examples` (int): Number of eval examples (-1 for all)
- `max_retries` (int): Maximum retry attempts for API calls (default: 3)
- `retry_delay` (float): Base delay between retries in seconds (default: 1.0)
- `timeout` (float): Request timeout in seconds (default: 120.0)
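For example, to evaluate against a remote reward model with tighter retry behaviour (parameter names as listed above; the values and host are illustrative):

```python
import verifiers as vf

vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="your-dataset",
    reward_model_url="http://remote-host:8002",
    num_train_examples=-1,   # use all training examples
    num_eval_examples=200,
    max_retries=5,
    retry_delay=2.0,
    timeout=60.0,
)
```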
## Sanity Checks
The environment includes several sanity checks:
1. **Reward Range Logging**: Logs min, max, mean, and median rewards for each batch
2. **Small Value Warnings**: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
3. **Response Validation**: Ensures the API response structure is correct and matches the input
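For reference, the reward-range logging roughly amounts to the following (a sketch using the standard library; the environment's actual log format may differ):

```python
import logging
import statistics

logger = logging.getLogger("reward-model-env")

def log_reward_stats(rewards: list[float]) -> None:
    # 1. Log the reward range for the batch
    logger.info(
        "rewards: min=%.4g max=%.4g mean=%.4g median=%.4g",
        min(rewards), max(rewards),
        statistics.mean(rewards), statistics.median(rewards),
    )
    # 2. Warn when every reward is vanishingly small (possible truncation)
    if all(abs(r) < 1e-10 for r in rewards):
        logger.warning("All rewards are < 1e-10; inputs may be truncated.")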
## Training Example
Use with `vf-rl` for reinforcement learning:
```toml
# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"
[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"
[inference]
gpus = 1
[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100
```
```bash
uv run vf-rl @ configs/rl/reward_model.toml
```
## Troubleshooting
### Connection Issues
- Ensure your reward model is running and accessible at the specified URL
- Check firewall settings if connecting to a remote server
- Verify the `/v1/models` endpoint returns valid data
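A quick way to check the last point from the command line:

```bash
curl http://localhost:8002/v1/models
# should return a JSON object whose "data" array lists the reward model id
```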
### Reward Scaling
- Check the logged reward statistics to ensure values are in the expected range
- If rewards are too small, they might not provide sufficient training signal
- Consider normalizing or scaling rewards based on your use case
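If you do decide to rescale, per-batch standardization is one simple option (a sketch applied outside the environment; whether it is appropriate depends on your training setup):

```python
import statistics

def standardize(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Shift a batch of rewards to zero mean and unit variance."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(standardize([0.81, 0.85, 0.92, 0.40]))
```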
### Chat Template Issues
- If using a tokenizer, ensure it has a chat template defined (see the check below)
- The fallback simple formatting works for most Llama-style models
- Check the logged sample conversation to verify formatting is correct
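To confirm a tokenizer defines a chat template, a quick check with `transformers` (assuming the tokenizer lives in a standard Hugging Face directory; adjust the path for your setup):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/your/tokenizer")
print(tok.chat_template is not None)  # True when a chat template is defined
```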
|