# Refusals Environment - Modified

This is a modified version of the refusals environment that includes:

1. **System Prompt Distribution**: Loads system prompts from `Delta-Vector/Tauri-RL-Styles` on Hugging Face and distributes them across rollouts
2. **Word Count Requirements**: Enforces specific word count targets with buffer zones for different response styles

## Features

### System Prompt Distribution
- Loads system prompts from Hugging Face dataset `Delta-Vector/Tauri-RL-Styles`
- Distributes prompts evenly across rollouts (e.g., 256 rollouts with 32 prompts = 8 rollouts per prompt)
- Scales flexibly with different numbers of rollouts and prompts
- Includes fallback to default prompt if Hugging Face loading fails
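
The load-with-fallback behavior could look roughly like the sketch below. It is not the environment's actual code: the column name `system_prompt`, the `train` split, the default prompt text, and the injectable `loader` argument are all assumptions made for illustration.

```python
from typing import Callable, List, Optional

DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."  # hypothetical fallback text
PROMPT_COLUMN = "system_prompt"  # assumed column name; check the dataset schema


def _load_from_hub() -> List[str]:
    # Imported lazily so a missing `datasets` package also triggers the fallback.
    from datasets import load_dataset

    # The "train" split is an assumption about how the dataset is published.
    dataset = load_dataset("Delta-Vector/Tauri-RL-Styles", split="train")
    return [str(row[PROMPT_COLUMN]) for row in dataset]


def load_system_prompts(
    loader: Optional[Callable[[], List[str]]] = None,
) -> List[str]:
    """Return prompts from the loader, or one default prompt on any failure."""
    loader = loader or _load_from_hub
    try:
        # Drop empty or whitespace-only entries.
        prompts = [p for p in loader() if p and p.strip()]
    except Exception as exc:  # network error, missing package, unexpected schema
        print(f"Falling back to default system prompt: {exc}")
        return [DEFAULT_SYSTEM_PROMPT]
    return prompts or [DEFAULT_SYSTEM_PROMPT]
```

Passing a custom `loader` makes the fallback path easy to exercise offline, for example with a function that raises.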

### Word Count Requirements
Three response styles with specific word count targets and buffer zones:

- **"Be verbose"**: 2000 words (±100 word buffer, range: 1900-2100)
- **"Respond tersely"**: 200 words (±50 word buffer, range: 150-250)
- **"Medium-length response"**: 300 words (±100 word buffer, range: 200-400)

Requirements are distributed evenly across rollouts. A response whose word count falls outside its buffer zone receives a reward of 0.
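
As a minimal sketch of how the ranges above translate into a reward gate: the targets and buffers are copied from the list, while the function names and the choice of inclusive bounds are assumptions rather than the environment's actual implementation.

```python
from typing import Dict, Tuple

# (target, buffer) pairs copied from the list above.
WORD_COUNT_REQUIREMENTS: Dict[str, Tuple[int, int]] = {
    "Be verbose": (2000, 100),
    "Respond tersely": (200, 50),
    "Medium-length response": (300, 100),
}


def within_buffer(style: str, word_count: int) -> bool:
    """True if word_count lies in [target - buffer, target + buffer]."""
    target, buffer = WORD_COUNT_REQUIREMENTS[style]
    # Inclusive bounds are an assumption; the README only gives the ranges.
    return target - buffer <= word_count <= target + buffer


def gated_reward(style: str, word_count: int, base_reward: float) -> float:
    """Pass the base reward through only when the word count is in range."""
    return base_reward if within_buffer(style, word_count) else 0.0
```

For example, `gated_reward("Respond tersely", 251, 1.0)` yields `0.0` because 251 is just above the 150-250 range.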

## Usage

```bash
# Install the environment
vf-install refusals-env-modified

# Run evaluation with a small number of rollouts for testing
vf-eval refusals-env-modified -n 5 -m gpt-4.1-mini

# Run with custom number of rollouts (system prompts will scale accordingly)
vf-eval refusals-env-modified -n 256 -m your-model
```

## Configuration Parameters

In addition to the base refusals environment parameters:

- `word_count_penalty`: Penalty for failing the word count requirement (default: 0.0). Responses outside the buffer zone automatically receive zero reward, even with the default value.

## Implementation Details

### System Prompt Loading
The environment attempts to load system prompts from the Hugging Face dataset. If this fails, it falls back to a default prompt. The distribution logic ensures:

- Each system prompt is used approximately the same number of times
- Any rollouts left over after the equal split are assigned to randomly chosen prompts
- The final order is randomized to avoid systematic bias
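
The three guarantees above can be sketched as follows. This is an illustrative version under stated assumptions, not the environment's code: the function name, the `seed` parameter, and the choice to give each leftover rollout to a distinct random prompt are all hypothetical.

```python
import random
from typing import List, Optional, Sequence


def distribute_prompts(
    prompts: Sequence[str], num_rollouts: int, seed: Optional[int] = None
) -> List[str]:
    """Assign one system prompt to each rollout, as evenly as possible."""
    if not prompts:
        raise ValueError("at least one system prompt is required")
    if num_rollouts < 0:
        raise ValueError("num_rollouts must be non-negative")

    rng = random.Random(seed)
    per_prompt, remainder = divmod(num_rollouts, len(prompts))

    # Equal share: every prompt appears per_prompt times.
    assignment = [p for p in prompts for _ in range(per_prompt)]
    # Leftover rollouts go to distinct, randomly chosen prompts.
    assignment.extend(rng.sample(list(prompts), remainder))
    # Shuffle so that prompt order does not correlate with rollout index.
    rng.shuffle(assignment)
    return assignment
```

With 32 prompts and 256 rollouts, every prompt appears exactly 8 times; with 3 prompts and 10 rollouts, one prompt appears 4 times and the other two appear 3 times.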

### Word Count Enforcement
- Word counting excludes code blocks from the analysis
- Requirements are checked against the actual response text
- Only responses within the buffer zone receive non-zero rewards
- Word count compliance is tracked in batch metrics for analysis
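
One way to count words while excluding code blocks is sketched below. It assumes that "code blocks" means Markdown fenced blocks and that words are whitespace-separated tokens; the environment's actual rules may differ, and `count_words` is a hypothetical name.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built here to avoid a literal fence

# Non-greedy match from an opening fence to the next closing fence.
FENCED_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def count_words(response: str) -> int:
    """Count whitespace-separated words outside fenced code blocks."""
    prose = FENCED_BLOCK.sub(" ", response)
    return len(prose.split())
```

In this sketch an unclosed fence is not treated as a code block, and inline code spans are still counted as prose.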

### Scalability
The implementation is designed to work with:
- Any number of rollouts
- Any number of system prompts
- Different dataset sizes

The distribution logic automatically adapts to the input parameters.

## Testing

The environment has been tested with various rollout counts to ensure the system prompt distribution scales correctly. Use `vf-eval` with a small number of rollouts first to verify the setup before running large-scale evaluations.