Refusals Environment - Modified

This is a modified version of the refusals environment that includes:

System Prompt Distribution: Loads system prompts from Delta-Vector/Tauri-RL-Styles on Hugging Face and distributes them across rollouts
Word Count Requirements: Enforces specific word count targets with buffer zones for different response styles

Features

System Prompt Distribution

Loads system prompts from the Hugging Face dataset Delta-Vector/Tauri-RL-Styles
Distributes prompts evenly across rollouts (e.g., 256 rollouts with 32 prompts = 8 rollouts per prompt)
Scales flexibly with different numbers of rollouts and prompts
Falls back to a default prompt if loading from Hugging Face fails
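
The load-with-fallback behavior listed above could look roughly like the sketch below. It is not the environment's actual code: the function names, the "train" split, the "system_prompt" column, and the default prompt text are all assumptions made for illustration. Only the dataset ID comes from this README.

```python
from typing import Callable, List

# Hypothetical fallback text; the environment's real default prompt is not
# shown in this README.
DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."


def fetch_prompts_from_hub() -> List[str]:
    """Fetch prompts with the `datasets` library (imported lazily)."""
    from datasets import load_dataset

    # The split name "train" and the column name "system_prompt" are assumptions.
    dataset = load_dataset("Delta-Vector/Tauri-RL-Styles", split="train")
    return [str(row["system_prompt"]) for row in dataset]


def load_system_prompts(
    fetch: Callable[[], List[str]] = fetch_prompts_from_hub,
) -> List[str]:
    """Return the fetched prompts, or the single default prompt on failure."""
    try:
        # Drop empty or whitespace-only entries.
        prompts = [p for p in fetch() if p.strip()]
    except Exception as exc:  # network error, missing package, unexpected schema
        print(f"Could not load system prompts ({exc!r}); using the default prompt.")
        return [DEFAULT_SYSTEM_PROMPT]
    # An empty dataset is treated like a failed load.
    return prompts or [DEFAULT_SYSTEM_PROMPT]
```

Passing the fetch function as a parameter keeps the fallback path easy to exercise offline, for example by supplying a function that raises.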

Word Count Requirements

Three response styles with specific word count targets and buffer zones:

"Be verbose": 2000 words (±100 word buffer, range: 1900-2100)
"Respond tersely": 200 words (±50 word buffer, range: 150-250)
"Medium-length response": 300 words (±100 word buffer, range: 200-400)

Requirements are distributed evenly across rollouts. Responses that fall outside the buffer zone receive a reward of 0.
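
As an illustration of the targets and buffer zones above, the following sketch derives each allowed range and zeroes the reward for out-of-range responses. The names WORD_COUNT_REQUIREMENTS, allowed_range, and gate_reward are hypothetical, and treating both range endpoints as inclusive is an assumption.

```python
from typing import Dict, Tuple

# (target, buffer) per response style, copied from the list above.
WORD_COUNT_REQUIREMENTS: Dict[str, Tuple[int, int]] = {
    "Be verbose": (2000, 100),
    "Respond tersely": (200, 50),
    "Medium-length response": (300, 100),
}


def allowed_range(style: str) -> Tuple[int, int]:
    """Return the (low, high) word count bounds for a response style."""
    target, buffer = WORD_COUNT_REQUIREMENTS[style]
    return target - buffer, target + buffer


def gate_reward(base_reward: float, word_count: int, style: str) -> float:
    """Keep the base reward inside the buffer zone, otherwise return 0.0."""
    low, high = allowed_range(style)
    # Inclusive bounds are an assumption; the README only gives the ranges.
    return base_reward if low <= word_count <= high else 0.0
```

For example, a "Be verbose" response of 2,100 words keeps its reward under this sketch, while one of 2,101 words is scored 0.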

Usage

# Install the environment
vf-install refusals-env-modified

# Run evaluation with a small number of rollouts for testing
vf-eval refusals-env-modified -n 5 -m gpt-4.1-mini

# Run with custom number of rollouts (system prompts will scale accordingly)
vf-eval refusals-env-modified -n 256 -m your-model

Configuration Parameters

In addition to the base refusals environment parameters:

word_count_penalty: Penalty for failing word count requirements (default: 0.0). Independently of this setting, responses outside the buffer zone automatically receive zero reward.

Implementation Details

System Prompt Loading

The environment attempts to load system prompts from the Hugging Face dataset. If this fails, it falls back to a default prompt. The distribution logic ensures:

Each system prompt is used approximately the same number of times
Any rollouts left over after the even split are assigned to randomly selected prompts
The final order is randomized to avoid systematic bias
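
The three properties above can be sketched as follows. This is a minimal illustration rather than the environment's implementation; distribute_prompts and its seed parameter are hypothetical names, and giving each leftover rollout to a distinct random prompt is one possible reading of how the remainder is handled.

```python
import random
from typing import List, Optional


def distribute_prompts(
    prompts: List[str], num_rollouts: int, seed: Optional[int] = None
) -> List[str]:
    """Assign one system prompt to each rollout as evenly as possible."""
    if not prompts:
        raise ValueError("at least one system prompt is required")
    rng = random.Random(seed)

    # Even share per prompt, plus the number of rollouts left over.
    per_prompt, remainder = divmod(num_rollouts, len(prompts))
    assignment = [p for p in prompts for _ in range(per_prompt)]

    # Leftover rollouts go to distinct, randomly chosen prompts, so usage
    # counts differ by at most one.
    assignment.extend(rng.sample(prompts, remainder))

    # Shuffle so no prompt is tied to a fixed position in the batch.
    rng.shuffle(assignment)
    return assignment
```

With 256 rollouts and 32 prompts this yields exactly 8 rollouts per prompt. With 10 rollouts and 4 prompts, two prompts are used 3 times and two are used twice.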

Word Count Enforcement

Word counting excludes code blocks
Requirements are checked against the actual response text
Only responses within the buffer zone receive non-zero rewards
Word count compliance is tracked in batch metrics for analysis
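
A rough sketch of the counting and tracking described above. It assumes code blocks are delimited by triple backticks, that words are whitespace-separated tokens, and that inline code still counts as prose. The helper names count_words and compliance_rate are hypothetical.

```python
import re
from typing import Iterable

# Matches a fenced code block, including its fence lines (non-greedy, so
# several blocks in one response are removed separately).
FENCED_BLOCK = re.compile(r"```.*?```", re.DOTALL)


def count_words(text: str) -> int:
    """Count whitespace-separated words, ignoring fenced code blocks."""
    prose = FENCED_BLOCK.sub(" ", text)
    return len(prose.split())


def compliance_rate(responses: Iterable[str], low: int, high: int) -> float:
    """Fraction of responses whose word count lies within [low, high]."""
    counts = [count_words(r) for r in responses]
    if not counts:
        return 0.0
    return sum(low <= c <= high for c in counts) / len(counts)
```

An unterminated fence is left in place by this pattern, so its contents would still be counted. A stricter implementation might handle that case explicitly.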

Scalability

The implementation is designed to work with:

Any number of rollouts
Any number of system prompts
Different dataset sizes

The distribution logic automatically adapts to the input parameters.

Testing

The environment has been tested with various rollout counts to ensure the system prompt distribution scales correctly. Use vf-eval with a small number of rollouts first to verify the setup before running large-scale evaluations.