# Reward Model Environment

An environment that uses an external reward model hosted via vLLM to train LLMs. It communicates with the reward model API, formats conversations using chat templates, batches requests for efficiency, and includes retry logic for robustness.

## Features

- **External Reward Model Integration**: Connects to reward models hosted via vLLM's `/classify` endpoint
- **Automatic Model Discovery**: Fetches the reward model name from `/v1/models`
- **Batched Requests**: Sends all rollouts in a single batch request for efficiency
- **Retry Logic**: Automatically retries failed requests with exponential backoff
- **Chat Template Support**: Properly formats conversations using tokenizer chat templates
- **Sanity Checks**: Logs statistics and warnings for reward values to ensure proper scaling

## Installation

```bash
uv run vf-install reward-model-env
```

## Usage

### Basic Example

```python
import asyncio

import verifiers as vf
from openai import AsyncOpenAI

# Load the environment
vf_env = vf.load_environment(
    "reward-model-env",
    dataset_name="gsm8k",                      # HF dataset with 'prompt' or 'question' column
    dataset_config="main",                     # Optional: dataset config name (required for some datasets)
    reward_model_url="http://localhost:8002",  # URL where your reward model is hosted
    tokenizer_path="./tokenizer.json",         # Optional: path to tokenizer for chat template
    num_train_examples=100,                    # Optional: limit training examples
)

# Evaluate with an OpenAI-compatible model
async def main():
    results = await vf_env.evaluate(
        client=AsyncOpenAI(base_url="http://localhost:8000/v1"),
        model="your-model",
        num_examples=10,
        rollouts_per_example=1,
    )

asyncio.run(main())
```

See `example.py` for a complete working example.

### Environment Variables

Set `REWARD_MODEL_URL` to avoid passing it as an argument:

```bash
export REWARD_MODEL_URL="http://localhost:8002"
```

## Reward Model Setup

This environment expects a reward model hosted via vLLM with the classification API enabled. Example setup:

```bash
# Start vLLM with a reward model
vllm serve Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 \
    --port 8002 \
    --enable-classification
```

## API Format

The environment expects the following API endpoints:

### `/v1/models` (GET)

Returns available models:

```json
{
  "data": [
    {"id": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2"}
  ]
}
```

### `/classify` (POST)

Request:

```json
{
  "model": "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2",
  "input": [
    "[INST]question[/INST]answer"
  ]
}
```

Response:

```json
{
  "data": [
    {
      "index": 0,
      "label": "LABEL_0",
      "probs": [0.85],
      "num_classes": 1
    }
  ]
}
```

The `probs[0]` value is used as the reward.

## Chat Template Formatting

The environment properly formats multi-turn conversations for the reward model:

```python
# Input conversation
[
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"}
]

# Formatted output (using Llama-style template)
"[INST]lets do python coding[/INST]Sure! How'd you like to get started?"
```

If you provide a `tokenizer_path`, the environment uses the tokenizer's native chat template. Otherwise, it falls back to a simple Llama-style format.
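As a rough illustration, the fallback formatting behaves like the sketch below. This is a minimal reimplementation for clarity only; `format_conversation` is a hypothetical name, not the environment's actual function:

```python
def format_conversation(messages: list[dict[str, str]]) -> str:
    """Flatten a chat into a single Llama-style string: user turns are
    wrapped in [INST]...[/INST], assistant turns are appended verbatim."""
    parts = []
    for message in messages:
        if message["role"] == "user":
            parts.append(f"[INST]{message['content']}[/INST]")
        elif message["role"] == "assistant":
            # Assistant turns are appended as-is; how the environment handles
            # system messages in the fallback format is not specified here.
            parts.append(message["content"])
    return "".join(parts)


formatted = format_conversation([
    {"role": "user", "content": "lets do python coding"},
    {"role": "assistant", "content": "Sure! How'd you like to get started?"},
])
print(formatted)
# [INST]lets do python coding[/INST]Sure! How'd you like to get started?
```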
## Configuration Options

- `dataset_name` (str): Hugging Face dataset name
- `reward_model_url` (str): Base URL for the reward model API
- `dataset_config` (str | None): Dataset config name (e.g., "main" for gsm8k; optional)
- `tokenizer_path` (str | None): Path to `tokenizer.json` for chat template formatting
- `system_prompt` (str): System prompt for the environment (default: "You are a helpful assistant.")
- `num_train_examples` (int): Number of training examples (-1 for all)
- `num_eval_examples` (int): Number of eval examples (-1 for all)
- `max_retries` (int): Maximum retry attempts for API calls (default: 3)
- `retry_delay` (float): Base delay between retries in seconds (default: 1.0)
- `timeout` (float): Request timeout in seconds (default: 120.0)

## Sanity Checks

The environment includes several sanity checks:

1. **Reward Range Logging**: Logs min, max, mean, and median rewards for each batch
2. **Small Value Warnings**: Warns if rewards are extremely small (< 1e-10) to help detect truncation issues
3. **Response Validation**: Ensures the API response structure is correct and matches the input

## Training Example

Use with `vf-rl` for reinforcement learning:

```toml
# configs/rl/reward_model.toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "reward-model-env"
reward_model_url = "http://localhost:8002"
dataset_name = "your-dataset"
tokenizer_path = "./tokenizer.json"

[inference]
gpus = 1

[trainer]
gpus = 1
use_lora = true
learning_rate = 1e-5
max_steps = 100
```

```bash
uv run vf-rl @ configs/rl/reward_model.toml
```

## Troubleshooting

### Connection Issues

- Ensure your reward model is running and accessible at the specified URL
- Check firewall settings if connecting to a remote server
- Verify the `/v1/models` endpoint returns valid data (see the manual check at the end of this section)

### Reward Scaling

- Check the logged reward statistics to ensure values are in the expected range
- If rewards are too small, they might not provide sufficient training signal
- Consider normalizing or scaling rewards based on your use case

### Chat Template Issues

- If using a tokenizer, ensure it has a chat template defined
- The fallback simple formatting works for most Llama-style models
- Check the logged sample conversation to verify formatting is correct
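### Manual Endpoint Check

To rule out connection and formatting issues, you can exercise both documented endpoints directly. A minimal sketch using `requests`, assuming the server from the setup section is running on port 8002:

```python
import requests

BASE_URL = "http://localhost:8002"

# 1. /v1/models should list the reward model
models = requests.get(f"{BASE_URL}/v1/models", timeout=10).json()
model_id = models["data"][0]["id"]
print(model_id)

# 2. /classify should score a single formatted conversation
response = requests.post(
    f"{BASE_URL}/classify",
    json={
        "model": model_id,
        "input": ["[INST]question[/INST]answer"],
    },
    timeout=30,
)
print(response.json()["data"][0]["probs"][0])  # this value is used as the reward
```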
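If requests still fail intermittently, keep in mind that the environment already retries with exponential backoff, controlled by `max_retries` and `retry_delay`. Conceptually, the behavior is like the sketch below (illustrative only, assuming `httpx` as the HTTP client; the environment's actual implementation may differ):

```python
import asyncio

import httpx


async def classify_with_retries(payload: dict, max_retries: int = 3, retry_delay: float = 1.0):
    """Illustrative backoff loop: wait retry_delay, 2*retry_delay, 4*retry_delay, ..."""
    async with httpx.AsyncClient(timeout=120.0) as client:
        for attempt in range(max_retries):
            try:
                response = await client.post("http://localhost:8002/classify", json=payload)
                response.raise_for_status()
                return response.json()
            except httpx.HTTPError:
                if attempt == max_retries - 1:
                    raise  # out of attempts; surface the error
                await asyncio.sleep(retry_delay * 2 ** attempt)
```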