# Environments

This folder contains installable example environments that showcase common usage patterns in Verifiers. Each module exposes a `load_environment(...)` function that returns a ready-to-use `vf.Environment` object.

## Quick start

- **Install an environment from this GitHub repo**: `vf-install math-python --from-repo`
- **Evaluate**: `vf-eval math-python` (defaults to `gpt-4.1-mini` with a small sample)
## Common usage patterns and examples

### SingleTurnEnv (prompt → single response)

- **gsm8k**: Classic QA with exact-match reward; toggles `ThinkParser` vs `Parser` and format reward.
- **math**: Hendrycks MATH dataset with `MathRubric` reward (using HuggingFace's `math-verify` scorer).
- **reverse_text**: XML formatting with non-binary LCS reward + format reward.
- **gpqa**: Multiple-choice; demonstrates optional judge-based secondary scoring via `RubricGroup`.
- **simpleqa**: Judge-graded A/B/C classification using `JudgeRubric` rewards.
- **summarize_text**: Multiple rewards (length/format + similarity) combined in one `Rubric`.
- **continuation_quality**: Completion-style generation (`message_type="completion"`) judged for prose quality with `JudgeRubric`.
- **mmmu**: Multimodal inputs (image + text) packed in chat content; single-turn boxed-answer check.
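At its core, the single-turn pattern reduces to a reward function that compares the parsed completion against the reference answer. The sketch below illustrates the two reward shapes mentioned above — binary exact match (gsm8k style) and a continuous similarity score in the spirit of reverse_text's LCS reward, approximated here with `difflib` — as plain functions; the names and signatures are illustrative, not the library's API.

```python
from difflib import SequenceMatcher

def exact_match_reward(completion: str, answer: str) -> float:
    # Binary reward: 1.0 only when the parsed answer matches the reference exactly.
    return 1.0 if completion.strip() == answer.strip() else 0.0

def similarity_reward(completion: str, answer: str) -> float:
    # Continuous reward in [0, 1] based on matching-subsequence ratio,
    # a stand-in for the non-binary LCS scheme reverse_text uses.
    return SequenceMatcher(None, completion, answer).ratio()
```

Partial credit from the continuous variant gives the policy a smoother learning signal than a pure 0/1 check.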
### SingleTurnEnv subclass (custom dataset/scoring wrappers)

- **reasoning_gym_env**: Wraps `reasoning_gym` procedural datasets, converts to HF datasets, uses `XMLParser` and task-specific scoring.

### MultiTurnEnv (custom interaction protocols)

- **doublecheck**: Simple follow-up turn ("Are you sure?") with math rewards; minimal `is_completed`/`env_response` implementation.
- **sentence_repeater**: Multi-turn Q/A over a paragraph; rewards compare assistant messages to expected answers.
- **wordle**: Game-style interaction via `TextArenaEnv`; multiple rewards (correctness, partial credit, few-turn bonus) and XML formatting.
### Tool use

- **ToolEnv (native function-calling)**
  - **tool_test**: Validates parallel tool calls and checks exact tool usage via `ToolRubric` + custom reward.
  - **wiki_search**: Multi-tool retrieval (search/view/read) with `ToolEnv`; final judgment combined via `RubricGroup` with a `JudgeRubric`.
- **XML tool calling (roll-your-own on MultiTurnEnv)**
  - **xml_tool_env**: Parses `<tool>{...}</tool>` commands with `XMLParser`, executes Python functions, and returns `<result>...</result>` via `env_response`.
  - **xlam_function_calling**: Single-turn XML tool-call verification (no execution) that checks called tools match the ground-truth list.
  - **smolagents_math_tools**: Integrates Smolagents `Tool` objects and a custom parser for tool/answer XML; demonstrates external tool frameworks.
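The roll-your-own XML protocol is simple to sketch: extract the JSON payload from a `<tool>...</tool>` block, dispatch to a registered function, and wrap the output in `<result>...</result>`. A self-contained sketch of that loop (the registry and error text are illustrative, not the library's implementation):

```python
import json
import re

TOOLS = {"add": lambda a, b: a + b}  # illustrative tool registry

def run_xml_tool_call(text: str) -> str:
    # Extract the JSON payload from a <tool>...</tool> block, call the named
    # function with its args, and wrap the output in <result>...</result>.
    match = re.search(r"<tool>(.*?)</tool>", text, re.DOTALL)
    if match is None:
        return "<result>error: no tool call found</result>"
    call = json.loads(match.group(1))
    output = TOOLS[call["name"]](**call["args"])
    return f"<result>{output}</result>"
```

This is the mechanism xml_tool_env wires into `env_response`; xlam_function_calling performs only the parsing half, comparing the called tool names against ground truth without executing anything.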
### Sandboxes

- **PythonEnv (ipython-style REPL)**
  - **math_python**: Solve math problems using Python in a sandbox environment.

### Composition

- **EnvGroup**
  - **math_group**: Groups two `SingleTurnEnv` tasks (GSM8K + MATH) into one environment with a shared interface.
- **RubricGroup**
  - **math_python**: `ToolRubric` (tool adherence) + `MathRubric` (answer correctness).
  - **gpqa**: Adds a `JudgeRubric` alongside the base rubric for auxiliary scoring.
  - **wiki_search**: Merges judge scoring with the tool-use rubric.
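Rubric composition is, at bottom, a weighted aggregation of reward functions. A minimal sketch of that idea — the function name and weighting scheme are illustrative, standing in for what `Rubric` and `RubricGroup` do internally:

```python
def combine_rewards(reward_fns, weights, completion: str, answer: str) -> float:
    # Score the completion with each reward function and sum the weighted results,
    # the same aggregation idea a rubric applies across its reward functions.
    return sum(w * fn(completion, answer) for fn, w in zip(reward_fns, weights))
```

In math_python, for example, one term would reflect tool adherence and another answer correctness, so a rollout can earn partial credit for using tools properly even when the final answer is wrong.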
### Judge-based evaluation (LLM-as-judge)

- **simpleqa**: Judge rubric maps graded letters to reward.
- **continuation_quality**: Judge rubric extracts `<grade>` and maps A–F to a continuous score.
- **toxicity_explanation**: Judge rubric returns a normalized 0–10 score covering both classification correctness and explanation quality.
- **self_reward**: A `SingleTurnEnv` pattern with only a `JudgeRubric` over a dataset that supplies `question`/`answer`; intended for online RL where the model acts as its own judge.
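The judge-grading step typically ends with a small deterministic mapping from the judge's structured verdict to a scalar reward. A sketch of the continuation_quality-style letter-grade mapping; the tag name matches the description above, but the exact score scale here is an assumption:

```python
import re

GRADE_SCORES = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25, "F": 0.0}

def grade_to_reward(judge_output: str) -> float:
    # Pull the letter out of the judge's <grade>...</grade> tag and map A-F
    # onto [0, 1]; a missing or malformed grade scores zero.
    match = re.search(r"<grade>\s*([A-DF])\s*</grade>", judge_output)
    return GRADE_SCORES[match.group(1)] if match else 0.0
```

Keeping the mapping deterministic means the only stochastic component is the judge call itself, which makes reward distributions easier to debug.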
### Parsers and formatting

- **ThinkParser**: Used in `gsm8k` and `wiki_search` to separate reasoning from final answers.
- **XMLParser**: Used in `reverse_text`, `wordle`, `summarize_text`, `reasoning_gym_env`, `xml_tool_env`, and `xlam_function_calling` to enforce structured outputs and enable format rewards.
- **Custom parsers**: `smolagents_math_tools` defines a bespoke parser to interoperate with external tool schemas.
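The think-style separation can be sketched in one line: strip any reasoning block and keep what remains as the answer. This is an illustrative standalone approximation of the idea, not `ThinkParser`'s actual implementation:

```python
import re

def extract_final_answer(completion: str) -> str:
    # Drop any <think>...</think> reasoning block and return the remainder,
    # mirroring how a think-style parser separates reasoning from the answer.
    return re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
```

Parsing the answer out before scoring keeps reward functions simple: they see only the final claim, never the scratchpad.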
### Multimodal inputs

- **mmmu**: Demonstrates passing images via chat `content` items with `{type: "image_url", image_url: {url: ...}}` and standard answer parsing.
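Concretely, a multimodal prompt is an ordinary chat message whose `content` is a list of typed items rather than a string. A sketch of building one in the shape described above (the helper name is illustrative):

```python
def image_message(question: str, image_url: str) -> dict:
    # A chat message mixing a text item and an image_url item, as mmmu
    # packs image + text inputs into a single user turn.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```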
## What to look at for each pattern

- **Minimal SingleTurnEnv**: `reverse_text`, `gsm8k`
- **JudgeRubric end-to-end**: `simpleqa`, `continuation_quality`, `toxicity_explanation`, `self_reward`
- **ToolEnv with real tools**: `wiki_search`, `math_python`
- **Custom MultiTurnEnv**: `doublecheck`, `sentence_repeater`, `wordle`
- **XML tools without native function-calling**: `xml_tool_env`, `xlam_function_calling`
- **Environment and rubric composition**: `math_group`, `math_python`, `gpqa`, `wiki_search`
- **Procedural datasets**: `reasoning_gym_env`
- **Multimodal**: `mmmu`
## Running examples

All environments export `load_environment(...)`.

In-line usage:

```python
import verifiers as vf
from openai import AsyncOpenAI

# Load an installed environment and run a small evaluation pass.
vf_env = vf.load_environment("reverse-text")
results = vf_env.evaluate(client=AsyncOpenAI(), model="gpt-4.1-mini", num_examples=25)
```
CLI usage:

```bash
vf-install reverse-text --from-repo
vf-eval reverse-text -n 50 -r 1
```

If you are building a new environment, prefer starting from `vf-init`, and consult the top-level README and docs for dataset format, parser/rubric design, and rollout constraints.