| # complex-json-output | |
| ### Overview | |
| - **Environment ID**: `complex-json-output` | |
| - **Short description**: Verifies model ability to generate complex JSON structures matching exact specifications | |
| - **Tags**: json, instruction-following, verifiable-reward, train, eval | |
| ### Datasets | |
| - **Primary dataset(s)**: Delta-Vector/Tauri-Complex-JSON-Formatting | |
| - **Source links**: https://huggingface.co/datasets/Delta-Vector/Tauri-Complex-JSON-Formatting | |
| - **Split sizes**: 7000 train, 1000 eval (default) | |
| ### Task | |
| - **Type**: single-turn | |
| - **Parser**: Custom parser that extracts JSON from code blocks or raw text | |
| - **Rubric overview** (multiplicative to prevent local minima): | |
| - **Main reward**: `key_accuracy * value_accuracy` | |
| * `key_accuracy = (correct_keys) / (total_keys_in_response)` | |
| * `value_accuracy = (correct_values) / (total_values_in_response)` | |
| - Penalizes both missing items AND adding extra incorrect ones | |
| - If JSON fails to parse: reward = 0 | |
| - Individual metrics tracked for debugging but don't contribute to training | |
| - **No system prompt** - dataset prompts contain all instructions | |
| ### Quickstart | |
| Run an evaluation with default settings: | |
| ```bash | |
| uv run vf-eval complex-json-output | |
| ``` | |
| Configure model and sampling: | |
| ```bash | |
| uv run vf-eval complex-json-output -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7 | |
| ``` | |
| Notes: | |
| - Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object. | |
| ### Environment Arguments | |
| | Arg | Type | Default | Description | | |
| | --- | ---- | ------- | ----------- | | |
| | `num_train_examples` | int | `7000` | Number of training examples | | |
| | `num_eval_examples` | int | `1000` | Number of evaluation examples | | |
| ### Metrics | |
| | Metric | Meaning | | |
| | ------ | ------- | | |
| | `reward` | Multiplicative: key_accuracy * value_accuracy (0.0 to 1.0) | | |
| | `multiplicative_reward` | Main training reward (0.0 to 1.0) | | |
| | `format_reward` | Metric only: whether JSON is valid dict (0.33 or 0) | | |
| | `keys_match_reward` | Metric only: whether all keys match (0.33 or 0) | | |
| | `values_match_reward` | Metric only: whether all values match (0.33 or 0) | | |