Refusals Environment - Modified
This is a modified version of the refusals environment that includes:
- System Prompt Distribution: Loads system prompts from Delta-Vector/Tauri-RL-Styles on Hugging Face and distributes them across rollouts
- Word Count Requirements: Enforces specific word count targets with buffer zones for different response styles
Features
System Prompt Distribution
- Loads system prompts from the Hugging Face dataset Delta-Vector/Tauri-RL-Styles
- Distributes prompts evenly across rollouts (e.g., 256 rollouts with 32 prompts = 8 rollouts per prompt)
- Scales flexibly with different numbers of rollouts and prompts
- Includes fallback to default prompt if Hugging Face loading fails
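The fallback behaviour described above can be sketched roughly as follows. This is an illustration, not the environment's actual code: `load_system_prompts`, `fetch_from_hub`, `DEFAULT_SYSTEM_PROMPT`, the `train` split, and the `system_prompt` column name are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical default; the environment's real fallback prompt may differ.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def load_system_prompts(
    fetch: Callable[[], List[str]],
    default: str = DEFAULT_SYSTEM_PROMPT,
) -> List[str]:
    """Return prompts from `fetch`, or a one-item list holding the default."""
    try:
        # Drop empty or whitespace-only entries.
        prompts = [p for p in fetch() if p and p.strip()]
    except Exception:
        # Network errors, a missing dataset, or a missing package all fall back.
        return [default]
    return prompts or [default]


def fetch_from_hub() -> List[str]:
    # Imported lazily so the fallback path works without `datasets` installed.
    from datasets import load_dataset

    # Split name and column name are assumptions about the dataset layout.
    ds = load_dataset("Delta-Vector/Tauri-RL-Styles", split="train")
    return [row["system_prompt"] for row in ds]
```

Calling `load_system_prompts(fetch_from_hub)` would then return either the dataset's prompts or the single default prompt, so downstream distribution code always has at least one prompt to work with.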
Word Count Requirements
Three response styles with specific word count targets and buffer zones:
- "Be verbose": 2000 words (±100 word buffer, range: 1900-2100)
- "Respond tersely": 200 words (±50 word buffer, range: 150-250)
- "Medium-length response": 300 words (±100 word buffer, range: 200-400)
Requirements are distributed evenly across rollouts. Responses that fall outside the buffer zone receive a 0 reward.
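As a rough sketch of the rule above, the three styles and the zero-reward gate might be expressed like this. The names `WordCountRequirement`, `assign_requirements`, and `gate_reward` are hypothetical, the round-robin assignment is one possible way to distribute evenly, and treating the range endpoints as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WordCountRequirement:
    style: str
    target: int
    buffer: int

    @property
    def low(self) -> int:
        return self.target - self.buffer

    @property
    def high(self) -> int:
        return self.target + self.buffer


# Targets and buffers taken from the list above.
REQUIREMENTS = [
    WordCountRequirement("Be verbose", 2000, 100),
    WordCountRequirement("Respond tersely", 200, 50),
    WordCountRequirement("Medium-length response", 300, 100),
]


def assign_requirements(num_rollouts: int) -> List[WordCountRequirement]:
    """Cycle through the styles so per-style counts differ by at most one."""
    return [REQUIREMENTS[i % len(REQUIREMENTS)] for i in range(num_rollouts)]


def gate_reward(base_reward: float, word_count: int, req: WordCountRequirement) -> float:
    """Keep the base reward inside the buffer zone (inclusive), else return 0."""
    return base_reward if req.low <= word_count <= req.high else 0.0
```

For example, a 1,899-word response under "Be verbose" would be gated to 0, while a 1,900-word response would keep its base reward.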
Usage
# Install the environment
vf-install refusals-env-modified
# Run evaluation with a small number of rollouts for testing
vf-eval refusals-env-modified -n 5 -m gpt-4.1-mini
# Run with custom number of rollouts (system prompts will scale accordingly)
vf-eval refusals-env-modified -n 256 -m your-model
Configuration Parameters
In addition to the base refusals environment parameters:
word_count_penalty: Penalty for failing word count requirements (default: 0.0). Regardless of this value, responses outside the buffer zone automatically receive a reward of 0.
Implementation Details
System Prompt Loading
The environment attempts to load system prompts from the Hugging Face dataset. If this fails, it falls back to a default prompt. The distribution logic ensures:
- Each system prompt is used approximately the same number of times
- Any rollouts left over after the equal split are assigned to randomly chosen prompts
- The final order is randomized to avoid systematic bias
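The three properties above could be implemented along these lines. This is a minimal sketch under stated assumptions, not the environment's code: `distribute_prompts` and its `seed` parameter are hypothetical, and giving the leftover slots to distinct randomly sampled prompts is one reasonable reading of how the remainder is handled.

```python
import random
from typing import List, Optional


def distribute_prompts(
    prompts: List[str],
    num_rollouts: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Return one system prompt per rollout, spread as evenly as possible."""
    if not prompts:
        raise ValueError("need at least one system prompt")
    rng = random.Random(seed)

    # Equal share for every prompt, plus how many slots are left over.
    per_prompt, remainder = divmod(num_rollouts, len(prompts))
    assigned = [p for p in prompts for _ in range(per_prompt)]

    # Leftover slots go to distinct, randomly chosen prompts.
    assigned.extend(rng.sample(prompts, remainder))

    # Shuffle so rollout position does not correlate with prompt.
    rng.shuffle(assigned)
    return assigned
```

With 32 prompts and 256 rollouts, `divmod` yields 8 per prompt and no remainder, matching the example in the Features section. With 3 prompts and 5 rollouts, two prompts are used twice and one is used once.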
Word Count Enforcement
- Code blocks are excluded from the word count
- Requirements are checked against the actual response text
- Only responses within the buffer zone receive non-zero rewards
- Word count compliance is tracked in batch metrics for analysis
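A word counter that skips code blocks, together with a simple batch compliance metric, might look like the sketch below. The names `count_words` and `compliance_rate` are hypothetical. The sketch assumes code blocks are delimited by triple-backtick fences and that whitespace splitting is an acceptable definition of a word; an unclosed fence is counted as ordinary text.

```python
import re
from typing import List, Tuple

# Build the fence marker programmatically to keep this example readable.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)


def count_words(text: str) -> int:
    """Count whitespace-separated words, ignoring fenced code blocks."""
    prose = FENCED_BLOCK.sub(" ", text)
    return len(prose.split())


def compliance_rate(checks: List[Tuple[str, int, int]]) -> float:
    """Fraction of (response, low, high) triples whose word count is in range."""
    if not checks:
        return 0.0
    ok = sum(1 for text, low, high in checks if low <= count_words(text) <= high)
    return ok / len(checks)
```

A batch-level number like `compliance_rate` makes it easier to tell whether low rewards come from the word count gate or from the underlying refusal scoring.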
Scalability
The implementation is designed to work with:
- Any number of rollouts
- Any number of system prompts
- Different dataset sizes
The distribution logic automatically adapts to the input parameters.
Testing
The environment has been tested with various rollout counts to ensure the system prompt distribution scales correctly. Use vf-eval with a small number of rollouts first to verify the setup before running large-scale evaluations.