# Reward Model Environment

An environment that uses an external reward model hosted via vLLM to train LLMs. This environment communicates with a reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.
## Features

- **External Reward Model Integration**: Connects to reward models hosted via vLLM's `/classify` endpoint
- **Automatic Model Discovery**: Fetches the reward model name from `/v1/models`
- **Batched Requests**: Sends all rollouts in a single batch request for efficiency
- **Retry Logic**: Automatically retries failed requests with exponential backoff
- **Chat Template Support**: Formats conversations using tokenizer chat templates
- **Sanity Checks**: Logs statistics and warnings for reward values to ensure proper scaling
## Installation

```bash
uv run vf-install reward-model-env
```
## Usage

### Basic Example

```python
import verifiers as vf

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",  # HF dataset with a 'prompt' or 'question' column
    dataset_config="main",  # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",  # Optional: path to tokenizer for chat template
    num_train_examples=100,  # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model (run inside an async context)
from openai import AsyncOpenAI

results = await vf_env.evaluate(
    client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
    model="your-model",
    num_examples=10,
    rollouts_per_example=1,
)
```

See `example.py` for a complete working example.
### Environment Variables

Set `REWARD_MODEL_URL` to avoid passing it as an argument:

```bash
export REWARD_MODEL_URL="http://localhost:8002"
```
## Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

```bash
# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
    --port 8002 \
    --enable-classification
```
## API Format

The environment expects the following API endpoints:

### `/v1/models` (GET)

Returns available models:

```json
{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
```
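Model discovery amounts to reading the first `id` from that response. A minimal sketch (function names are assumptions, not the environment's actual code):

```python
import json
import urllib.request

# Pull the first model id out of a /v1/models payload.
def extract_model_name(payload: dict) -> str:
    models = payload.get("data", [])
    if not models:
        raise RuntimeError("no models in /v1/models response")
    return models[0]["id"]

# Fetch the payload from a running server and extract the model name.
def discover_model_name(base_url: str) -> str:
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return extract_model_name(json.load(resp))
```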
### `/classify` (POST)

Request:

```json
{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "<s>[INST]question[/INST]answer</s>"
  ]
}
```

Response:

```json
{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}
```

The `probs[0]` value is used as the reward.
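Putting the two halves together, a batched `/classify` call and the reward extraction could be sketched as follows (illustrative only; the function names are assumptions, not the environment's actual internals):

```python
import json
import urllib.request

# Sort results by index, validate the count, and take probs[0] as the reward.
def rewards_from_response(payload: dict, expected: int) -> list[float]:
    data = sorted(payload["data"], key=lambda item: item["index"])
    if len(data) != expected:
        raise RuntimeError("response count does not match input count")
    return [item["probs"][0] for item in data]

# Send all formatted conversations in one batched /classify request.
def fetch_rewards(base_url: str, model: str, texts: list[str]) -> list[float]:
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        f"{base_url}/classify",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return rewards_from_response(json.load(resp), len(texts))
```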
## Chat Template Formatting

The environment formats multi-turn conversations for the reward model:

```python
# Input conversation
[
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using Llama-style template)
"<s>[INST]lets do python coding[/INST]Sure! How'd you like to get started?</s>"
```

If you provide a `tokenizer_path`, it will use the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
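The fallback described above can be sketched roughly like this (a simplified illustration, assuming only user/assistant turns; the real environment may handle more roles):

```python
# Wrap user turns in [INST]...[/INST] and append assistant turns verbatim,
# bracketing the whole conversation with <s>...</s>.
def format_llama_style(messages: list[dict]) -> str:
    parts = ["<s>"]
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"[INST]{msg['content']}[/INST]")
        elif msg["role"] == "assistant":
            parts.append(msg["content"])
    parts.append("</s>")
    return "".join(parts)
```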
## Configuration Options

- `dataset_name` (str): Hugging Face dataset name
- `reward_model_url` (str): Base URL for the reward model API
- `dataset_config` (str | None): Dataset config name (e.g., `"main"` for gsm8k; optional)
- `tokenizer_path` (str | None): Path to `tokenizer.json` for chat template formatting
- `system_prompt` (str): System prompt for the environment (default: `"You are a helpful assistant."`)
- `num_train_examples` (int): Number of training examples (`-1` for all)
- `num_eval_examples` (int): Number of eval examples (`-1` for all)
- `max_retries` (int): Maximum retry attempts for API calls (default: `3`)
- `retry_delay` (float): Base delay between retries in seconds (default: `1.0`)
- `timeout` (float): Request timeout in seconds (default: `120.0`)
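The `max_retries` and `retry_delay` options describe exponential backoff along these lines (a hypothetical sketch of the retry behaviour, not the environment's actual implementation):

```python
import time

# Retry a callable up to max_retries times, doubling the base delay after
# each failed attempt (retry_delay, 2*retry_delay, 4*retry_delay, ...).
def call_with_retries(fn, max_retries: int = 3, retry_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the final error
            time.sleep(retry_delay * (2 ** attempt))
```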
## Sanity Checks

The environment includes several sanity checks:

1. **Reward Range Logging**: Logs the min, max, mean, and median reward for each batch
2. **Small Value Warnings**: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
3. **Response Validation**: Ensures the API response structure is correct and matches the input
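The first two checks amount to something like the following (an illustrative sketch; the real environment logs rather than prints, and its names may differ):

```python
import statistics

# Summarize a batch of rewards and warn when every value is suspiciously
# tiny, which often signals prompt truncation upstream.
def summarize_rewards(rewards: list[float]) -> dict:
    stats = {
        "min": min(rewards),
        "max": max(rewards),
        "mean": statistics.mean(rewards),
        "median": statistics.median(rewards),
    }
    if all(abs(r) < 1e-10 for r in rewards):
        print("WARNING: all rewards < 1e-10; check for truncation issues")
    return stats
```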
## Training Example

Use with `vf-rl` for reinforcement learning:

```toml
# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100
```

```bash
uv run vf-rl @ configs/rl/reward_model.toml
```
## Troubleshooting

### Connection Issues

- Ensure your reward model is running and accessible at the specified URL
- Check firewall settings if connecting to a remote server
- Verify the `/v1/models` endpoint returns valid data

### Reward Scaling

- Check the logged reward statistics to ensure values are in the expected range
- If rewards are too small, they may not provide sufficient training signal
- Consider normalizing or scaling rewards based on your use case

### Chat Template Issues

- If using a tokenizer, ensure it has a chat template defined
- The fallback simple formatting works for most Llama-style models
- Check the logged sample conversation to verify formatting is correct