# complex-json-output

### Overview
- **Environment ID**: `complex-json-output`
- **Short description**: Verifies model ability to generate complex JSON structures matching exact specifications
- **Tags**: json, instruction-following, verifiable-reward, train, eval

### Datasets
- **Primary dataset(s)**: Delta-Vector/Tauri-Complex-JSON-Formatting
- **Source links**: https://huggingface.co/datasets/Delta-Vector/Tauri-Complex-JSON-Formatting
- **Split sizes**: 7000 train, 1000 eval (default)

### Task
- **Type**: single-turn
- **Parser**: Custom parser that extracts JSON from code blocks or raw text
- **Rubric overview** (multiplicative, so the policy cannot settle into a local minimum by optimizing only one component):
  - **Main reward**: `key_accuracy * value_accuracy`
    * `key_accuracy = (correct_keys) / (total_keys_in_response)`
    * `value_accuracy = (correct_values) / (total_values_in_response)`
  - Penalizes both missing items AND adding extra incorrect ones
  - If JSON fails to parse: reward = 0
  - Individual metrics are tracked for debugging but do not contribute to the training reward
- **No system prompt** - dataset prompts contain all instructions
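
The parser and main reward described above can be sketched roughly as follows. This is a minimal illustration, not the environment's actual code: the names `extract_json`, `flatten`, and `multiplicative_reward` are hypothetical, and counting keys as flattened dotted paths is an assumption. The sketch follows the two accuracy formulas literally (denominators are counts from the response), so it does not capture how the real rubric penalizes missing items.

```python
import json
import re

# Hypothetical pattern: first fenced block, with or without a "json" tag.
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def extract_json(text):
    """Parse JSON from a fenced code block if present, else from the raw text."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    try:
        parsed = json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
    # Only a top-level object (dict) counts as a valid response here.
    return parsed if isinstance(parsed, dict) else None


def flatten(obj, prefix=""):
    """Map each leaf to a dotted path (assumption), e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def multiplicative_reward(completion, expected):
    """Return key_accuracy * value_accuracy, or 0.0 if the JSON fails to parse."""
    response = extract_json(completion)
    if response is None:
        return 0.0
    got, want = flatten(response), flatten(expected)
    if not got:
        return 0.0
    correct_keys = sum(1 for k in got if k in want)
    correct_values = sum(1 for k, v in got.items() if k in want and want[k] == v)
    key_accuracy = correct_keys / len(got)
    value_accuracy = correct_values / len(got)
    return key_accuracy * value_accuracy
```

Under these assumptions, a response that adds an unexpected key lowers both factors at once: for expected `{"name": "a"}` and response `{"name": "a", "extra": 1}`, each accuracy is 0.5 and the product is 0.25.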

### Quickstart
Run an evaluation with default settings:

```bash
uv run vf-eval complex-json-output
```

Configure model and sampling:

```bash
uv run vf-eval complex-json-output -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7
```

Notes:
- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.

### Environment Arguments

| Arg | Type | Default | Description |
| --- | ---- | ------- | ----------- |
| `num_train_examples` | int | `7000` | Number of training examples |
| `num_eval_examples` | int | `1000` | Number of evaluation examples |
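
As a small illustration of passing these arguments through `-a` / `--env-args`, the snippet below builds the command line from a Python dict. The argument names come from the table above; the values are arbitrary examples, and building the command in Python is only one convenient way to get the JSON quoting right.

```python
import json
import shlex

# Argument names are from the Environment Arguments table; values are illustrative.
env_args = {"num_train_examples": 500, "num_eval_examples": 50}

command = [
    "uv", "run", "vf-eval", "complex-json-output",
    "-a", json.dumps(env_args),  # env args are passed as a single JSON object
]

# shlex.join quotes the JSON so it survives the shell as one argument.
print(shlex.join(command))
```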

### Metrics

| Metric | Meaning |
| ------ | ------- |
| `reward` | Multiplicative: key_accuracy * value_accuracy (0.0 to 1.0) |
| `multiplicative_reward` | Main training reward (0.0 to 1.0) |
| `format_reward` | Metric only: whether the output parses as a valid JSON dict (0.33 or 0) |
| `keys_match_reward` | Metric only: whether all keys match (0.33 or 0) |
| `values_match_reward` | Metric only: whether all values match (0.33 or 0) |