# Reward Model Environment

An environment that trains LLMs against an external reward model hosted via vLLM. It communicates with the reward model's API, formats conversations using chat templates, batches requests for efficiency, and retries failed calls for robustness.

## Features

- **External Reward Model Integration**: Connects to reward models hosted via vLLM's `/classify` endpoint
- **Automatic Model Discovery**: Fetches the reward model name from `/v1/models`
- **Batched Requests**: Sends all rollouts in a single batch request for efficiency
- **Retry Logic**: Automatically retries failed requests with exponential backoff
- **Chat Template Support**: Properly formats conversations using tokenizer chat templates
- **Sanity Checks**: Logs statistics and warnings for reward values to ensure proper scaling

## Installation

```bash
uv run vf-install reward-model-env
```

## Usage

### Basic Example

```python
import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)
```

See `example.py` for a complete working example.

### Environment Variables

Set `REWARD_MODEL_URL` to avoid passing it as an argument:

```bash
export REWARD_MODEL_URL="http://localhost:8002"
```
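Illustratively, an explicit argument would take precedence over the environment variable. The helper below is a hypothetical sketch of that precedence, not the environment's actual code:

```python
import os
from typing import Optional


def resolve_reward_model_url(explicit: Optional[str] = None) -> str:
    """Prefer an explicit URL; fall back to the REWARD_MODEL_URL env var."""
    url = explicit or os.environ.get("REWARD_MODEL_URL")
    if not url:
        raise ValueError("pass reward_model_url or set REWARD_MODEL_URL")
    return url
```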

## Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

```bash
# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
  --port 8002 \
  --enable-classification
```

## API Format

The environment expects the following API endpoints:

### `/v1/models` (GET)
Returns available models:
```json
{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
```
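Automatic model discovery then amounts to taking the `id` of the first entry in that response. A minimal sketch, with a hypothetical helper name:

```python
def discover_model_id(models_response: dict) -> str:
    """Return the id of the first model listed by /v1/models."""
    # The endpoint returns {"data": [{"id": ...}, ...]}; take the first entry.
    data = models_response.get("data") or []
    if not data:
        raise RuntimeError("no models returned by /v1/models")
    return data[0]["id"]
```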

### `/classify` (POST)
Request:
```json
{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}
```

Response:
```json
{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}
```

The `probs[0]` value is used as the reward.
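Extracting per-rollout rewards from such a response might look like the sketch below. `extract_rewards` is an illustrative helper, not part of the environment's code; it validates the shape and re-orders results by `index` so rewards line up with the batched inputs:

```python
from typing import Any


def extract_rewards(response: dict[str, Any], num_inputs: int) -> list[float]:
    """Return probs[0] per input, ordered by index; raise on a malformed response."""
    data = response.get("data")
    if not isinstance(data, list) or len(data) != num_inputs:
        raise ValueError(f"expected {num_inputs} results, got {data!r}")
    rewards = [0.0] * num_inputs
    for item in data:
        # Results may arrive in any order; "index" maps back to the input position.
        rewards[item["index"]] = float(item["probs"][0])
    return rewards
```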

## Chat Template Formatting

The environment properly formats multi-turn conversations for the reward model:

```python
# Input conversation
[
  {"role": "user", "content": "lets do python coding"},
  {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"
```

If you provide a `tokenizer_path`, the environment uses that tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
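The fallback Llama-style formatting above could be sketched as follows (the function name `format_llama_style` is hypothetical; only `user` and `assistant` turns are handled here):

```python
def format_llama_style(messages: list[dict[str, str]]) -> str:
    """Wrap user turns in [INST]...[/INST] and close each assistant turn with </s>."""
    out = "<s>"
    for msg in messages:
        if msg["role"] == "user":
            out += f"[INST]{msg['content']}[/INST]"
        elif msg["role"] == "assistant":
            out += msg["content"] + "</s>"
    return out
```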

## Configuration Options

- `dataset_name` (str): Hugging Face dataset name
- `reward_model_url` (str): Base URL for the reward model API
- `dataset_config` (str | None): Dataset config name (e.g., "main" for gsm8k, optional)
- `tokenizer_path` (str | None): Path to tokenizer.json for chat template formatting
- `system_prompt` (str): System prompt for the environment (default: "You are a helpful assistant.")
- `num_train_examples` (int): Number of training examples (-1 for all)
- `num_eval_examples` (int): Number of eval examples (-1 for all)
- `max_retries` (int): Maximum retry attempts for API calls (default: 3)
- `retry_delay` (float): Base delay between retries in seconds (default: 1.0)
- `timeout` (float): Request timeout in seconds (default: 120.0)

## Sanity Checks

The environment includes several sanity checks:

1. **Reward Range Logging**: Logs min, max, mean, and median rewards for each batch
2. **Small Value Warnings**: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
3. **Response Validation**: Ensures the API response structure is correct and matches the input
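A minimal sketch of what the batch statistics and small-value warning might look like, assuming Python's standard `logging` and `statistics` modules (the function name is illustrative):

```python
import logging
import statistics

logger = logging.getLogger("reward-model-env")


def summarize_rewards(rewards: list[float], tiny: float = 1e-10) -> dict[str, float]:
    """Log min/max/mean/median for a batch and warn if all rewards are near zero."""
    stats = {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
    }
    logger.info("reward stats: %s", stats)
    if stats["max"] < tiny:
        logger.warning("all rewards < %g; check for truncation or formatting issues", tiny)
    return stats
```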

## Training Example

Use with `vf-rl` for reinforcement learning:

```toml
# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100
```

```bash
uv run vf-rl @ configs/rl/reward_model.toml
```

## Troubleshooting

### Connection Issues
- Ensure your reward model is running and accessible at the specified URL
- Check firewall settings if connecting to a remote server
- Verify the `/v1/models` endpoint returns valid data

### Reward Scaling
- Check the logged reward statistics to ensure values are in the expected range
- If rewards are too small, they might not provide sufficient training signal
- Consider normalizing or scaling rewards based on your use case
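As one example of such scaling, a simple min-max rescaling of a batch of rewards to [0, 1] could look like this (illustrative only, not part of the environment):

```python
def min_max_normalize(rewards: list[float]) -> list[float]:
    """Rescale a batch of rewards to [0, 1]; a constant batch maps to all zeros."""
    lo, hi = min(rewards), max(rewards)
    if hi == lo:
        return [0.0] * len(rewards)
    return [(r - lo) / (hi - lo) for r in rewards]
```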

### Chat Template Issues
- If using a tokenizer, ensure it has a chat template defined
- The fallback simple formatting works for most Llama-style models
- Check the logged sample conversation to verify formatting is correct