	# complex-json-output

	### Overview
	- Environment ID: `complex-json-output`
	- Short description: Verifies a model's ability to generate complex JSON structures that match exact specifications
	- Tags: json, instruction-following, verifiable-reward, train, eval

	### Datasets
	- Primary dataset(s): Delta-Vector/Tauri-Complex-JSON-Formatting
	- Source links: https://huggingface.co/datasets/Delta-Vector/Tauri-Complex-JSON-Formatting
	- Split sizes: 7000 train, 1000 eval (default)

	### Task
	- Type: single-turn
	- Parser: Custom parser that extracts JSON from code blocks or raw text
	- Rubric overview (multiplicative to prevent local minima):
	  - Main reward: `key_accuracy * value_accuracy`
	    - `key_accuracy = (correct_keys) / (total_keys_in_response)`
	    - `value_accuracy = (correct_values) / (total_values_in_response)`
	  - Penalizes both missing items AND adding extra incorrect ones
	  - If JSON fails to parse: reward = 0
	  - Individual metrics tracked for debugging but don't contribute to training
	- No system prompt - dataset prompts contain all instructions
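
The parser and main reward described above can be sketched roughly as follows. This is a minimal illustration, not the environment's actual code: the names `extract_json`, `flatten`, and `multiplicative_reward` are hypothetical, and comparing nested structures by flattened path is an assumption. Because both denominators count items in the response, this literal reading penalizes extra items; any separate penalty for missing items is not reproduced here, since the README does not give its form.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence delimiter
# Assumed pattern: an optional "json" language tag after the opening fence.
FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_json(text):
    """Parse JSON from the first code block, else from the raw text; None on failure."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None


def flatten(value, path=""):
    """Map every leaf to a dotted path, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    if isinstance(value, dict) and value:
        items = value.items()
    elif isinstance(value, list) and value:
        items = enumerate(value)
    else:
        return {path: value}  # scalar or empty container
    flat = {}
    for key, child in items:
        flat.update(flatten(child, f"{path}.{key}" if path else str(key)))
    return flat


def multiplicative_reward(completion, expected):
    """key_accuracy * value_accuracy, with 0 for unparseable or non-dict output."""
    parsed = extract_json(completion)
    if not isinstance(parsed, dict):  # assumption: non-dict JSON also scores 0
        return 0.0
    got, want = flatten(parsed), flatten(expected)
    correct_keys = sum(1 for k in got if k in want)
    correct_values = sum(1 for k, v in got.items() if k in want and want[k] == v)
    key_accuracy = correct_keys / len(got)
    value_accuracy = correct_values / len(got)
    return key_accuracy * value_accuracy


expected = {"name": "Ada", "tags": ["x", "y"]}
completion = FENCE + 'json\n{"name": "Ada", "tags": ["x", "z"], "extra": 1}\n' + FENCE
print(multiplicative_reward(completion, expected))  # 3/4 keys * 2/4 values = 0.375
```

Multiplying the two accuracies means a response cannot score well by getting only the keys right: one wrong value and one extra key in the example above already cut the reward to 0.375.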

	### Quickstart
	Run an evaluation with default settings:

	```bash
	uv run vf-eval complex-json-output
	```

	Configure model and sampling:

	```bash
	uv run vf-eval complex-json-output -m gpt-4.1-mini -n 20 -r 3 -t 1024 -T 0.7
	```

	Notes:
	- Use `-a` / `--env-args` to pass environment-specific configuration as a JSON object.

	### Environment Arguments

	| Arg | Type | Default | Description |
	| --- | ---- | ------- | ----------- |
	| `num_train_examples` | int | `7000` | Number of training examples |
	| `num_eval_examples` | int | `1000` | Number of evaluation examples |
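
Since `-a` / `--env-args` expects a JSON object, quoting it by hand in a shell is easy to get wrong. One way to avoid that, shown here as a small sketch, is to build the argument with `json.dumps` and let `shlex.join` handle the quoting. The argument names come from the table above; the values `500` and `100` are only illustrative.

```python
import json
import shlex

# Argument names from the table above; the values are illustrative only.
env_args = {"num_train_examples": 500, "num_eval_examples": 100}

command = [
    "uv", "run", "vf-eval", "complex-json-output",
    "-a", json.dumps(env_args),
]

# shlex.join quotes the JSON so the shell passes it as a single argument.
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or `command` can be passed directly to `subprocess.run` without any shell quoting at all.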

	### Metrics

	| Metric | Meaning |
	| ------ | ------- |
	| `reward` | Multiplicative: `key_accuracy * value_accuracy` (0.0 to 1.0) |
	| `multiplicative_reward` | Main training reward (0.0 to 1.0) |
	| `format_reward` | Metric only: whether JSON is valid dict (0.33 or 0) |
	| `keys_match_reward` | Metric only: whether all keys match (0.33 or 0) |
	| `values_match_reward` | Metric only: whether all values match (0.33 or 0) |
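
For orientation, the three metric-only checks might look something like the sketch below. The function names mirror the metric names in the table, but the bodies are assumptions: in particular, comparing only top-level keys and using plain equality for values are guesses at what "all keys match" and "all values match" mean, and the real implementation may differ.

```python
import json

PASS_SCORE = 0.33  # value reported in the table for a passing check


def format_reward(raw_json: str) -> float:
    """0.33 if the text parses as a JSON object (a dict), else 0."""
    try:
        parsed = json.loads(raw_json)
    except json.JSONDecodeError:
        return 0.0
    return PASS_SCORE if isinstance(parsed, dict) else 0.0


def keys_match_reward(parsed: dict, expected: dict) -> float:
    """0.33 if the top-level key sets are identical (assumed definition), else 0."""
    return PASS_SCORE if parsed.keys() == expected.keys() else 0.0


def values_match_reward(parsed: dict, expected: dict) -> float:
    """0.33 if the whole object equals the expected one (assumed definition), else 0."""
    return PASS_SCORE if parsed == expected else 0.0


print(format_reward('{"a": 1}'), keys_match_reward({"a": 1}, {"a": 2}))
```

These all-or-nothing checks are useful for spotting which stage a completion fails at, which is why the table lists them as debugging metrics rather than training signal.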