# Evaluating agents with Inspect AI
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/OpenEnv/blob/main/examples/evaluation_inspect.ipynb)
After training a model in an OpenEnv environment, you need to measure how it
actually performs on a held-out set of episodes. OpenEnv integrates with
[Inspect AI](https://inspect.aisi.org.uk/) — an open-source evaluation
framework by the UK AI Safety Institute — through `InspectAIHarness`.
## How the pieces fit together
Inspect AI and OpenEnv are complementary, not overlapping:
- **OpenEnv** provides the environment (reset, step, reward) and the training
infrastructure (GRPO via TRL).
- **Inspect AI** provides the evaluation infrastructure: datasets, solvers,
scorers, and structured logs.
`InspectAIHarness` is the bridge. It wraps `inspect_ai.eval()` inside
OpenEnv's `EvalHarness` interface so that eval runs are tracked with the same
structured `EvalConfig` / `EvalResult` types you use across all harnesses.
The typical workflow is:
```
Train with OpenEnv (GRPO / SFT)
Define an Inspect AI Task
  - dataset: held-out episodes or prompts
  - solver: calls your model + the OpenEnv env
  - scorer: grades correctness using env reward or exact match
Run via InspectAIHarness → EvalResult with structured scores
```
## Install dependencies
```bash
pip install "inspect-ai>=0.3.0"
pip install "openenv @ git+https://github.com/huggingface/OpenEnv.git"
```
`inspect-ai` is an optional dependency — `InspectAIHarness` is importable
without it, but raises a clear `ImportError` at call time if it is missing.
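
The deferred-import behaviour described above can be illustrated with a small sketch. This is not OpenEnv's actual implementation: `LazyHarness` and its `backend_module` parameter are hypothetical names used only to show the general pattern of importing an optional dependency at call time rather than at construction time.

```python
import importlib


class LazyHarness:
    """Hypothetical harness that defers importing its backend until run()."""

    def __init__(self, backend_module: str = "inspect_ai"):
        # Only the module name is stored; nothing is imported here,
        # so constructing the harness never fails on a missing package.
        self.backend_module = backend_module

    def run(self):
        try:
            backend = importlib.import_module(self.backend_module)
        except ImportError as exc:
            # Re-raise with an actionable message, keeping the original cause.
            raise ImportError(
                f"{self.backend_module!r} is required to run this harness; "
                "install it with pip first."
            ) from exc
        return backend


# Construction succeeds even though the (made-up) package does not exist.
harness = LazyHarness(backend_module="surely_not_installed_pkg_xyz")
try:
    harness.run()
except ImportError as exc:
    print(f"ImportError: {exc}")
```

The trade-off of this pattern is that a missing dependency surfaces only when the eval is launched, so a quick smoke run before a long job is worthwhile.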
## Set your model provider
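Leave exactly one option uncommented (Option A is active by default). All
three feed into the same task and harness, so apart from the `temperature`
setting noted under Option C, no other cells need to change.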
```python
import getpass, os
# --- Option A: OpenAI ---
os.environ.setdefault("OPENAI_API_KEY", getpass.getpass("OpenAI API key: "))
MODEL = "openai/gpt-5-mini"
# --- Option B: Anthropic ---
# os.environ.setdefault("ANTHROPIC_API_KEY", getpass.getpass("Anthropic API key: "))
# MODEL = "anthropic/claude-haiku-4-5-20251001"
# --- Option C: local transformers model (no API key needed) ---
# Requires a GPU for reasonable speed. Omit 'temperature' from eval_parameters below.
# !pip install -U transformers
# MODEL = "hf/Qwen/Qwen3.5-0.8B"
# Use a local checkpoint path to skip the download:
# MODEL = "hf/./outputs/my-trained-model"
```
The `model` string uses `provider/model-name` format for API providers.
For local models, the `hf/` prefix loads the model with `transformers` — point
it at a Hub ID to download, or a local path (`hf/./path/to/checkpoint`) to use
weights you already have on disk (e.g. from TRL training).
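
To make the string format concrete, here is a minimal sketch of how such a string separates into a provider prefix and a model name. `split_model_string` is a hypothetical helper written for illustration and is not part of Inspect AI or OpenEnv. The point it shows is that only the first slash separates the provider, so Hub IDs and local paths keep their own slashes.

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split 'provider/model-name' at the first slash only (illustrative helper)."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model-name', got {model!r}")
    return provider, name


for m in (
    "openai/gpt-5-mini",
    "hf/Qwen/Qwen3.5-0.8B",
    "hf/./outputs/my-trained-model",
):
    print(split_model_string(m))
# ('openai', 'gpt-5-mini')
# ('hf', 'Qwen/Qwen3.5-0.8B')
# ('hf', './outputs/my-trained-model')
```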
## Define an Inspect AI task for an OpenEnv environment
An Inspect AI `Task` has three parts: a **dataset** of samples to evaluate,
a **solver** that runs the model (and optionally the environment), and a
**scorer** that grades each sample.
The example below evaluates a model against `echo_env` — the reference
OpenEnv environment. The model is asked to repeat a phrase; the solver sends
the phrase to the environment and records the echoed response; the scorer
checks it matches the expected output.
The solver calls Inspect AI's `generate()` to get the model's output, then
sends it to the environment. The dataset, scorer, and harness are identical
whichever provider option you chose above.
```python
import asyncio

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import Generate, TaskState, solver
from openenv.core import MCPToolClient

ECHO_ENV_URL = "https://openenv-echo-env.hf.space"
# Limit concurrent env connections to match the server's MAX_CONCURRENT_ENVS.
_env_sem = asyncio.Semaphore(1)  # increase if your Space supports more sessions


@task
def openenv_echo_eval(base_url: str = ECHO_ENV_URL):
    return Task(
        dataset=[
            Sample(input="Repeat exactly: hello world", target="hello world"),
            Sample(input="Repeat exactly: inspect ai", target="inspect ai"),
            Sample(input="Repeat exactly: openenv eval", target="openenv eval"),
            Sample(input="Repeat exactly: reinforcement learning", target="reinforcement learning"),
            Sample(input="Repeat exactly: hugging face", target="hugging face"),
        ],
        solver=echo_env_solver(base_url=base_url),
        scorer=echo_scorer(),
    )


@solver
def echo_env_solver(base_url: str):
    """Ask the model to repeat the phrase, then echo it through the env."""

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        state = await generate(state)
        model_output = state.output.completion.strip()
        async with _env_sem:  # one env connection at a time
            env = MCPToolClient(base_url=base_url)
            try:
                await env.reset()
                echoed = await env.call_tool("echo_message", message=model_output)
                state.metadata["echoed"] = str(echoed) if echoed is not None else ""
            finally:
                await env.close()
        return state

    return solve


@scorer(metrics=[accuracy()])
def echo_scorer():
    """CORRECT if the env echoed back exactly what the target phrase was."""

    async def score(state: TaskState, target: Target) -> Score:
        echoed = state.metadata.get("echoed", "").strip()
        expected = target.text.strip()
        return Score(
            value=CORRECT if echoed == expected else INCORRECT,
            explanation=f"Env echoed {echoed!r}, expected {expected!r}",
        )

    return score
```
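The `_env_sem` semaphore in the solver above is what caps concurrent environment sessions. The sketch below shows the same pattern offline, with no network access: `FakeEnv` is a hypothetical stand-in for an env client (it is not an OpenEnv class) that records how many calls overlap, so the effect of the limit can be observed directly.

```python
import asyncio


class FakeEnv:
    """Hypothetical stand-in for an env client; tracks overlapping sessions."""

    active = 0  # sessions currently inside call_tool
    peak = 0    # highest number of simultaneous sessions seen

    async def call_tool(self, name: str, message: str) -> str:
        FakeEnv.active += 1
        FakeEnv.peak = max(FakeEnv.peak, FakeEnv.active)
        await asyncio.sleep(0.01)  # simulate a network round trip
        FakeEnv.active -= 1
        return message


async def run_samples(messages, limit: int):
    sem = asyncio.Semaphore(limit)  # at most `limit` sessions at once

    async def one(msg: str) -> str:
        async with sem:
            return await FakeEnv().call_tool("echo_message", message=msg)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(one(m) for m in messages))


FakeEnv.active = FakeEnv.peak = 0
echoed = asyncio.run(run_samples(["a", "b", "c", "d"], limit=1))
print(echoed, "peak concurrent sessions:", FakeEnv.peak)
# ['a', 'b', 'c', 'd'] peak concurrent sessions: 1
```

Raising `limit` lets more samples run in parallel, which only helps if the server accepts that many sessions.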
> [!NOTE]
> `echo_env` is a pure MCP environment. Interact with it via `MCPToolClient`
> and `call_tool("echo_message", ...)`. For non-MCP environments, use
> `GenericEnvClient` instead.
## Run the eval with `InspectAIHarness`
Pass the task to `InspectAIHarness` via `EvalConfig`. The `task` key in
`eval_parameters` takes a task object or a registered task name string.
```python
import inspect_ai
import openenv
from openenv.core.evals import EvalConfig, EvalResult, InspectAIHarness

harness = InspectAIHarness(log_dir="./eval-logs")
config = EvalConfig(
    harness_name="InspectAIHarness",
    harness_version=inspect_ai.__version__,
    library_versions={"openenv": openenv.__version__},
    dataset="openenv_echo_eval",
    eval_parameters={
        "model": MODEL,
        "task": openenv_echo_eval(base_url=ECHO_ENV_URL),
        # temperature is supported for API providers (Options A/B).
        # Omit it for local transformers models (Option C).
        "temperature": 0.0,
    },
)
result: EvalResult = harness.run_from_config(config)
print(result.scores)
# {'accuracy': 1.0}
```
The `EvalResult` carries both the config and the scores, making it easy to
log, compare across runs, or serialize to JSON:
```python
import json


class _StrFallback(json.JSONEncoder):
    def default(self, o):
        return str(o)


print(json.dumps(result.model_dump(), indent=2, cls=_StrFallback))
```
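To see what the fallback encoder above does, the following self-contained sketch applies the same idea to a hand-made payload. The payload contents (a path and a date) are invented for illustration and are not the real fields of `EvalResult`. `json.JSONEncoder.default()` is only called for objects the encoder cannot serialize natively, so ordinary values pass through unchanged.

```python
import json
from datetime import date
from pathlib import PurePosixPath


class StrFallback(json.JSONEncoder):
    """Serialize anything JSON does not know about via str()."""

    def default(self, o):
        return str(o)


# Hypothetical payload mixing JSON-native values with non-native objects.
payload = {
    "scores": {"accuracy": 1.0},
    "log_dir": PurePosixPath("eval-logs"),
    "run_date": date(2025, 1, 1),
}

text = json.dumps(payload, cls=StrFallback, sort_keys=True)
print(text)
# {"log_dir": "eval-logs", "run_date": "2025-01-01", "scores": {"accuracy": 1.0}}
```

The cost of this approach is that the conversion is one-way: values stringified like this come back as plain strings when the JSON is loaded.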
## Using a task file instead of a task object
Inspect AI tasks can also be defined in standalone `.py` files and referenced
by path. This is useful for CI pipelines where the task definition lives in
the repo and the harness is called from a script:
```python
# tasks/echo_eval.py (contains the @task definition above)
result = harness.run_from_config(EvalConfig(
    harness_name="InspectAIHarness",
    harness_version=inspect_ai.__version__,
    library_versions={"openenv": openenv.__version__},
    dataset="tasks/echo_eval.py@openenv_echo_eval",
    eval_parameters={
        "model": "openai/gpt-5-mini",
        "task": "tasks/echo_eval.py@openenv_echo_eval",
    },
))
```
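In a CI script it can help to validate the `file@task` reference before launching a run. The sketch below assumes the reference always has the `path@task_name` shape used above. `split_task_ref` is a hypothetical helper, not an Inspect AI or OpenEnv function.

```python
from pathlib import Path


def split_task_ref(ref: str) -> tuple[Path, str]:
    """Split 'path/to/file.py@task_name' at the last '@' (illustrative helper)."""
    file_part, sep, task_name = ref.rpartition("@")
    if not sep or not file_part or not task_name:
        raise ValueError(f"expected 'file.py@task_name', got {ref!r}")
    return Path(file_part), task_name


path, name = split_task_ref("tasks/echo_eval.py@openenv_echo_eval")
print(path.name, name)
# echo_eval.py openenv_echo_eval
```

A script could then check `path.exists()` and fail fast with a readable message instead of discovering a typo midway through the eval.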
## Adapting to your own environment and task
To adapt this tutorial, replace `echo_env_solver` and the other task parts with ones that use your env and model:
1. **Dataset** — collect held-out episodes from your env (or a static
benchmark); each `Sample` needs `input` and `target` fields.
2. **Solver** — call your trained model against the env via `generate()`.
If you used GRPO training with an `environment_factory`, reuse the same
factory here so the eval env matches training exactly.
3. **Scorer** — use the env's reward signal directly, or write an Inspect AI
`@scorer` that checks the final observation against a ground-truth target.
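
For the scorer step, one possible approach is to have the solver store the env's reward in the sample metadata and grade against a threshold. The sketch below shows only that grading decision as a plain function. The `"reward"` metadata key, the `threshold` parameter, and the function name are assumptions made for illustration. Inside an Inspect AI `@scorer` you would map the boolean to `CORRECT` / `INCORRECT` the same way `echo_scorer` does.

```python
def passes_reward_threshold(metadata: dict, threshold: float = 1.0) -> bool:
    """Return True when the stored reward reaches the threshold.

    Assumes the solver wrote the env reward to metadata["reward"]
    (a hypothetical key); a missing reward counts as a failure.
    """
    reward = metadata.get("reward")
    if reward is None:
        return False
    return float(reward) >= threshold


print(passes_reward_threshold({"reward": 1.0}))                  # True
print(passes_reward_threshold({"reward": 0.3}))                  # False
print(passes_reward_threshold({"reward": 0.3}, threshold=0.25))  # True
print(passes_reward_threshold({}))                               # False
```

Treating a missing reward as a failure keeps episodes that crashed or never reached a terminal step from silently inflating accuracy.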
> [!TIP]
> Run this eval **before training** on your base model to establish a baseline,
> then again after training to measure the improvement. The delta (post − pre)
> is more informative than either number alone — a model that scores 60% after
> training tells you little without knowing it started at 4%.
```python
import asyncio

from inspect_ai.solver import Generate, TaskState, solver
from openenv.core import MCPToolClient

_env_sem = asyncio.Semaphore(1)  # raise if your Space supports more sessions


@solver
def my_env_solver(base_url: str):
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        state = await generate(state)
        model_output = state.output.completion.strip()
        async with _env_sem:
            env = MCPToolClient(base_url=base_url)
            try:
                await env.reset()
                result = await env.call_tool("your_tool_name", message=model_output)
                state.metadata["env_result"] = result
            finally:
                await env.close()
        return state

    return solve
```
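The before/after comparison recommended in the tip above can be computed with a few lines of Python. This sketch assumes each run yields a flat metric-to-value dictionary like the `result.scores` output shown earlier. `score_delta` is a hypothetical helper, and the numbers are the illustrative 4% and 60% figures from the tip, not measured results.

```python
def score_delta(pre: dict, post: dict) -> dict:
    """Per-metric improvement (post - pre) for metrics present in both runs."""
    return {k: post[k] - pre[k] for k in pre.keys() & post.keys()}


baseline = {"accuracy": 0.04}  # illustrative pre-training scores
trained = {"accuracy": 0.60}   # illustrative post-training scores

delta = score_delta(baseline, trained)
print({k: round(v, 2) for k, v in delta.items()})
# {'accuracy': 0.56}
```

Metrics that appear in only one run are skipped rather than treated as zero, so a renamed or newly added metric does not show up as a spurious jump.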
## Next steps
- [End-to-end walkthrough](end-to-end-walkthrough) — full GRPO training loop that produces a model you can evaluate with this tutorial
- [SFT warm-up tutorial](sft-warmup) — collect rollouts, filter by reward, and fine-tune a student model before running GRPO
- [Rubrics tutorial](rubrics) — define reward functions inside
the environment using composable rubrics
- [Inspect AI documentation](https://inspect.aisi.org.uk/) — full reference
for tasks, solvers, scorers, and the log viewer
