	# Adding a New Benchmark

	This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.

	A benchmark in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard `gym.Env` interface. The `lerobot-eval` CLI then runs evaluation uniformly across all benchmarks.

	## Existing benchmarks at a glance

	Before diving in, here is what is already integrated:

	| Benchmark      | Env file            | Config class       | Tasks               | Action dim   | Processor                    |
	| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
	| LIBERO         | `envs/libero.py`    | `LiberoEnv`        | 130 across 5 suites | 7            | `LiberoProcessorStep`        |
	| Meta-World     | `envs/metaworld.py` | `MetaworldEnv`     | 50 (MT50)           | 4            | None                         |
	| IsaacLab Arena | Hub-hosted          | `IsaaclabArenaEnv` | Configurable        | Configurable | `IsaaclabArenaProcessorStep` |

	Use `src/lerobot/envs/libero.py` and `src/lerobot/envs/metaworld.py` as reference implementations.

	## How it all fits together

	### Data flow

	During evaluation, data moves through four stages:

	```
	1. gym.Env ──→ raw observations (numpy dicts)

	2. Preprocessing ──→ standard LeRobot keys + task description
	(preprocess_observation in envs/utils.py, env.call("task_description"))

	3. Processors ──→ env-specific then policy-specific transforms
	(env_preprocessor, policy_preprocessor)

	4. Policy ──→ select_action() ──→ action tensor
	then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()
	```

	Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed).
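
The four stages above can be read as plain function composition. The following is a minimal sketch under that assumption, not LeRobot's actual eval loop: `run_step` and all the callables passed to it are hypothetical stand-ins, and observations are plain dicts rather than tensors.

```python
from typing import Any, Callable

Obs = dict[str, Any]
Action = list[float]


def run_step(
    raw_obs: Obs,
    preprocess: Callable[[Obs], Obs],
    env_preprocessor: Callable[[Obs], Obs],
    policy_preprocessor: Callable[[Obs], Obs],
    select_action: Callable[[Obs], Action],
    policy_postprocessor: Callable[[Action], Action],
    env_postprocessor: Callable[[Action], Action],
) -> Action:
    """One observation -> action pass, in the order described above."""
    obs = preprocess(raw_obs)  # stage 2: standard keys + task description
    obs = env_preprocessor(obs)  # stage 3a: env-specific transforms
    obs = policy_preprocessor(obs)  # stage 3b: policy-specific transforms
    action = select_action(obs)  # stage 4: policy inference
    action = policy_postprocessor(action)  # reverse path: policy side first
    return env_postprocessor(action)  # then env side; result goes to env.step()


# Toy stand-ins so the sketch runs end to end.
action = run_step(
    raw_obs={"agent_pos": [0.1, 0.2]},
    preprocess=lambda o: {"observation.state": o["agent_pos"], "task": "push the block"},
    env_preprocessor=lambda o: o,
    policy_preprocessor=lambda o: o,
    select_action=lambda o: [2.0 * x for x in o["observation.state"]],
    policy_postprocessor=lambda a: a,
    env_postprocessor=lambda a: [max(-1.0, min(1.0, x)) for x in a],
)
```

Note that the postprocessors run in the opposite order from the preprocessors, which is why an env-specific action transform sees the policy's output only after the policy's own postprocessing.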

	### Environment structure

	`make_env()` returns a nested dict of vectorized environments:

	```python
	dict[str, dict[int, gym.vector.VectorEnv]]
	#    ^suite    ^task_id
	```

	A single-task env (e.g. PushT) looks like `{"pusht": {0: vec_env}}`.
	A multi-task benchmark (e.g. LIBERO) looks like `{"libero_spatial": {0: vec0, 1: vec1, ...}, ...}`.
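
As an illustration of how this nested structure is consumed, the sketch below flattens it into `(suite, task_id, vec_env)` triples. The helper `iter_tasks` is hypothetical, and strings stand in for real `VectorEnv` instances.

```python
from typing import Any, Iterator

# Strings stand in for gym.vector.VectorEnv instances.
envs: dict[str, dict[int, Any]] = {
    "libero_spatial": {0: "vec0", 1: "vec1"},
    "libero_object": {0: "vec2"},
}


def iter_tasks(envs: dict[str, dict[int, Any]]) -> Iterator[tuple[str, int, Any]]:
    """Yield (suite, task_id, vec_env) for every task, in a stable order."""
    for suite in sorted(envs):
        for task_id in sorted(envs[suite]):
            yield suite, task_id, envs[suite][task_id]


tasks = list(iter_tasks(envs))
```

A single-task env goes through the same loop and simply yields one triple.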

	### How evaluation runs

	All benchmarks are evaluated the same way by `lerobot-eval`:

	1. `make_env()` builds the nested `{suite: {task_id: VectorEnv}}` dict.
	2. `eval_policy_all()` iterates over every suite and task.
	3. For each task, it runs `n_episodes` rollouts via `rollout()`.
	4. Results are aggregated hierarchically: episode, task, suite, overall.
	5. Metrics include `pc_success` (success rate), `avg_sum_reward`, and `avg_max_reward`.

	The critical piece: your env must return `info["is_success"]` on every `step()` call. This is how the eval loop knows whether a task was completed.
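
To make the hierarchical aggregation concrete, here is a small sketch that reduces per-episode records to per-task, per-suite, and overall metrics. It assumes `pc_success` is expressed as a percentage and that the overall figure pools all episodes; the record layout and the `summarize` helper are illustrative, not the actual `eval_policy_all()` implementation.

```python
from collections import defaultdict
from statistics import mean

# Each record: (suite, task_id, sum_reward, max_reward, success).
episodes = [
    ("suite_a", 0, 1.5, 1.0, True),
    ("suite_a", 0, 0.5, 0.5, False),
    ("suite_a", 1, 2.0, 1.0, True),
    ("suite_b", 0, 0.0, 0.0, False),
]


def summarize(records):
    """Reduce a list of episode records to the three headline metrics."""
    return {
        "pc_success": 100.0 * mean(1.0 if r[4] else 0.0 for r in records),
        "avg_sum_reward": mean(r[2] for r in records),
        "avg_max_reward": mean(r[3] for r in records),
        "n_episodes": len(records),
    }


by_task = defaultdict(list)
by_suite = defaultdict(list)
for rec in episodes:
    by_task[(rec[0], rec[1])].append(rec)
    by_suite[rec[0]].append(rec)

per_task = {key: summarize(recs) for key, recs in by_task.items()}
per_suite = {key: summarize(recs) for key, recs in by_suite.items()}
overall = summarize(episodes)
```

Every success flag in this table comes from `info["is_success"]`, so an env that never sets it would report a success rate of zero regardless of what the policy did.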

	## What your environment must provide

	LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.

	### Env attributes

	Your `gym.Env` must set these attributes:

	| Attribute            | Type  | Why                                                  |
	| -------------------- | ----- | ---------------------------------------------------- |
	| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length          |
	| `task_description`   | `str` | Passed to VLA policies as a language instruction     |
	| `task`               | `str` | Fallback identifier if `task_description` is not set |
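
The fallback between `task_description` and `task` can be pictured as follows. This is a sketch of the convention in the table, using a hypothetical helper `get_task_text`; the real lookup in LeRobot's eval utilities may differ in detail.

```python
def get_task_text(env) -> str:
    """Prefer task_description, fall back to task, else an empty string."""
    description = getattr(env, "task_description", None)
    if description:
        return description
    return getattr(env, "task", "") or ""


class WithDescription:
    task = "pick_place_v1"
    task_description = "pick up the red block and place it in the bin"
    _max_episode_steps = 200


class TaskOnly:
    task = "pick_place_v1"
    _max_episode_steps = 200
```

Setting both attributes is the safer choice: language-conditioned policies then receive a readable instruction rather than an internal identifier.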

	### Success reporting

	Your `step()` and `reset()` must include `"is_success"` in the `info` dict:

	```python
	info = {"is_success": True} # or False
	return observation, reward, terminated, truncated, info
	```
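
For a complete picture, the toy environment below reports `is_success` from both `reset()` and every `step()`. It is a duck-typed stand-in that does not subclass `gym.Env`, and the class name, task, and dynamics are invented for illustration.

```python
class ReachTargetEnv:
    """Toy env: succeed once the position reaches the target within the step budget."""

    def __init__(self, target: float = 3.0, max_episode_steps: int = 10):
        self.task = "reach_target"
        self.task_description = "move right until the target is reached"
        self._max_episode_steps = max_episode_steps
        self._target = target
        self._pos = 0.0
        self._t = 0

    def reset(self, seed=None, options=None):
        self._pos, self._t = 0.0, 0
        return {"agent_pos": [self._pos]}, {"is_success": False}

    def step(self, action):
        self._pos += float(action[0])
        self._t += 1
        success = self._pos >= self._target
        terminated = success
        truncated = (not success) and self._t >= self._max_episode_steps
        reward = 1.0 if success else 0.0
        info = {"is_success": success}  # present on every step, not only the last
        return {"agent_pos": [self._pos]}, reward, terminated, truncated, info


env = ReachTargetEnv()
_, info = env.reset()
flags = [info["is_success"]]
done = False
while not done:
    _, _, terminated, truncated, info = env.step([1.0])
    flags.append(info["is_success"])
    done = terminated or truncated
```

The flag stays `False` until the step on which the task is actually solved, which is the behavior the eval loop relies on.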

	### Observations

	The simplest approach is to map your simulator's outputs to the standard keys that `preprocess_observation()` already understands. Do this inside your `gym.Env` (e.g. in a `_format_raw_obs()` helper):

	| Your env should output    | LeRobot maps it to                 | What it is                            |
	| ------------------------- | ---------------------------------- | ------------------------------------- |
	| `"pixels"` (single array) | `observation.image`                | Single camera image, HWC uint8        |
	| `"pixels"` (dict)         | `observation.images.<camera_name>` | Multiple cameras, each HWC uint8      |
	| `"agent_pos"`             | `observation.state`                | Proprioceptive state vector           |
	| `"environment_state"`     | `observation.env_state`            | Full environment state (e.g. PushT)   |
	| `"robot_state"`           | `observation.robot_state`          | Nested robot state dict (e.g. LIBERO) |

	If your simulator uses different key names, you have two options:

	1. Recommended: Rename them to the standard keys inside your `gym.Env` wrapper.
	2. Alternative: Write an env processor to transform observations after `preprocess_observation()` runs (see step 3 below).
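
The recommended option can be sketched as a small renaming helper. The simulator key names on the left of `SIM_TO_STANDARD` are hypothetical, and strings stand in for image arrays; only the target keys (`"pixels"`, `"agent_pos"`) come from the table above.

```python
from typing import Any

# Hypothetical simulator keys -> (standard raw key, camera name or None).
SIM_TO_STANDARD = {
    "front_camera_rgb": ("pixels", "front"),
    "wrist_camera_rgb": ("pixels", "wrist"),
    "joint_positions": ("agent_pos", None),
}


def format_raw_obs(sim_obs: dict[str, Any]) -> dict[str, Any]:
    """Rename simulator outputs to the keys preprocess_observation() expects."""
    obs: dict[str, Any] = {}
    for sim_key, (std_key, camera) in SIM_TO_STANDARD.items():
        if sim_key not in sim_obs:
            continue
        if camera is None:
            obs[std_key] = sim_obs[sim_key]
        else:
            # Several cameras share the "pixels" key as a dict keyed by camera name.
            obs.setdefault(std_key, {})[camera] = sim_obs[sim_key]
    return obs


formatted = format_raw_obs(
    {
        "front_camera_rgb": "img_front",
        "wrist_camera_rgb": "img_wrist",
        "joint_positions": [0.0, 0.1],
    }
)
```

Doing the rename inside the env keeps the benchmark self-contained, so no processor is needed just to fix key names.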

	### Actions

	Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their `input_features` / `output_features` config.

	### Feature declaration

	Each `EnvConfig` subclass declares two dicts that tell the policy what to expect:

	- `features` — maps feature names to `PolicyFeature(type, shape)` (e.g. action dim, image shape).
	- `features_map` — maps raw observation keys to LeRobot convention keys (e.g. `"agent_pos"` to `"observation.state"`).
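
The relationship between the two dicts can be illustrated with a stand-in for `PolicyFeature`. The `Feature` dataclass and the `policy_view` helper below are hypothetical, and the state and image shapes are invented; only the action dimension of 7 mirrors the LIBERO entry in the table at the top of this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Stand-in for PolicyFeature(type, shape); not the real LeRobot class."""

    type: str
    shape: tuple[int, ...]


features = {
    "action": Feature(type="ACTION", shape=(7,)),
    "agent_pos": Feature(type="STATE", shape=(8,)),
    "pixels": Feature(type="VISUAL", shape=(256, 256, 3)),
}
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}


def policy_view(features, features_map):
    """Re-key the declared features by their LeRobot names."""
    missing = sorted(set(features) - set(features_map))
    if missing:
        raise KeyError(f"features without a features_map entry: {missing}")
    return {features_map[name]: feat for name, feat in features.items()}


policy_features = policy_view(features, features_map)
```

A check like the one in `policy_view` is a cheap way to catch a feature that was declared but never mapped.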

	## Step by step

	At minimum, you need two pieces of code: a `gym.Env` wrapper and an `EnvConfig`
	subclass with a `create_envs()` override. The remaining checklist items are
	either optional or cover packaging and documentation. No changes to `factory.py` are needed.

	### Checklist

	| File                                     | Required | Why                                                          |
	| ---------------------------------------- | -------- | ------------------------------------------------------------ |
	| `src/lerobot/envs/<benchmark>.py`        | Yes      | Wraps the simulator as a standard gym.Env                    |
	| `src/lerobot/envs/configs.py`            | Yes      | Registers your benchmark and its `create_envs()` for the CLI |
	| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms                         |
	| `src/lerobot/envs/utils.py`              | Optional | Only if you need new raw observation keys                    |
	| `pyproject.toml`                         | Yes      | Declares benchmark-specific dependencies                     |
	| `docs/source/<benchmark>.mdx`            | Yes      | User-facing documentation page                               |
	| `docs/source/_toctree.yml`               | Yes      | Adds your page to the docs sidebar                           |

	### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)

	Create a `gym.Env` subclass that wraps the third-party simulator:

	```python
	import gymnasium as gym
	import numpy as np
	from gymnasium import spaces


	class MyBenchmarkEnv(gym.Env):
	    # Template: every `...` is a value you must fill in for your simulator.
	    metadata = {"render_modes": ["rgb_array"], "render_fps": ...}

	    def __init__(self, task_suite, task_id, **kwargs):
	        super().__init__()
	        self.task = ...  # short task identifier
	        self.task_description = ...  # language instruction for VLA policies
	        self._max_episode_steps = ...  # episode length cap used by rollout()
	        self.observation_space = spaces.Dict({...})
	        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

	    def reset(self, seed=None, **kwargs):
	        ...  # return (observation, info); info must contain {"is_success": False}

	    def step(self, action: np.ndarray):
	        ...  # return (obs, reward, terminated, truncated, info); info must contain "is_success"

	    def render(self):
	        ...  # return RGB image as numpy array

	    def close(self):
	        ...
	```

	GPU-based simulators (e.g. MuJoCo with EGL rendering): If your simulator allocates GPU/EGL contexts during `__init__`, defer that allocation to a `_ensure_env()` helper called on first `reset()`/`step()`. This avoids inheriting stale GPU handles when `AsyncVectorEnv` spawns worker processes. See `LiberoEnv._ensure_env()` for the pattern.
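
The deferred-allocation pattern can be sketched without any GPU at all. In the example below, `FakeSimulator` is a hypothetical stand-in for a simulator that grabs a GPU/EGL context on construction; this is an illustration of the idea, not a copy of `LiberoEnv._ensure_env()`.

```python
import os


class FakeSimulator:
    """Stand-in for a simulator that allocates GPU/EGL resources when constructed."""

    def __init__(self):
        self.owner_pid = os.getpid()  # records which process allocated it

    def reset(self):
        return {"agent_pos": [0.0]}


class LazyEnv:
    def __init__(self):
        # Cheap, picklable state only: nothing GPU-bound is created here.
        self._sim = None

    def _ensure_env(self):
        """Create the simulator on first use, inside the process that uses it."""
        if self._sim is None:
            self._sim = FakeSimulator()
        return self._sim

    def reset(self, seed=None, options=None):
        sim = self._ensure_env()
        return sim.reset(), {"is_success": False}


env = LazyEnv()
created_before_reset = env._sim is not None
obs, info = env.reset()
first_sim = env._sim
env.reset()  # second call reuses the same simulator
```

Because construction happens on the first `reset()`, a worker process that receives a copy of the env builds its own simulator instead of inheriting the parent's handles.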

	Also provide a factory function that returns the nested dict structure:

	```python
	from typing import Any


	def create_mybenchmark_envs(
	    task: str,
	    n_envs: int,
	    gym_kwargs: dict | None = None,
	    env_cls: type | None = None,
	) -> dict[str, dict[int, Any]]:
	    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
	    ...
	```

	See `create_libero_envs()` (multi-suite, multi-task) and `create_metaworld_envs()` (difficulty-grouped tasks) for reference.
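
One possible shape for such a factory is sketched below. It assumes `task` is a comma-separated list of suite names and takes the vector-env constructor as a parameter; the `SUITES` catalogue, `create_envs_sketch`, and `make_vec_env` are all hypothetical names, so adapt the parsing to however your benchmark identifies tasks.

```python
from typing import Any, Callable

# Hypothetical suite catalogue: suite name -> task ids.
SUITES: dict[str, list[int]] = {"suite_a": [0, 1, 2], "suite_b": [0, 1]}


def create_envs_sketch(
    task: str,
    n_envs: int,
    make_vec_env: Callable[[str, int, int], Any],
) -> dict[str, dict[int, Any]]:
    """Build {suite_name: {task_id: vec_env}} from a comma-separated suite list."""
    out: dict[str, dict[int, Any]] = {}
    for suite in (s.strip() for s in task.split(",") if s.strip()):
        if suite not in SUITES:
            raise ValueError(f"unknown suite {suite!r}; expected one of {sorted(SUITES)}")
        out[suite] = {tid: make_vec_env(suite, tid, n_envs) for tid in SUITES[suite]}
    return out


# A tuple stands in for a VectorEnv holding n_envs copies of one task.
nested = create_envs_sketch(
    "suite_a, suite_b",
    n_envs=4,
    make_vec_env=lambda suite, task_id, n: (suite, task_id, n),
)
```

Failing early on an unknown suite name gives users a clearer error than a missing key deep inside the eval loop.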

	### 2. The config (`src/lerobot/envs/configs.py`)

	Register a config dataclass so users can select your benchmark with `--env.type=<name>`. Each config owns its environment creation and processor logic via two methods:

	- `create_envs(n_envs, use_async_envs)` — Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
	- `get_env_processors()` — Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.

	```python
	# Lives in configs.py, which is assumed to already import dataclass, field, EnvConfig,
	# PolicyFeature, FeatureType, ACTION, OBS_STATE and OBS_IMAGE.
	@EnvConfig.register_subclass("mybenchmark")  # placeholder: the name users pass as --env.type
	@dataclass
	class MyBenchmarkEnvConfig(EnvConfig):
	    task: str = ""
	    fps: int = ...  # your simulator's control frequency
	    obs_type: str = "pixels_agent_pos"
	    render_mode: str = "rgb_array"  # read by gym_kwargs below

	    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
	        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(...,)),  # action dimension
	    })
	    features_map: dict[str, str] = field(default_factory=lambda: {
	        ACTION: ACTION,
	        "agent_pos": OBS_STATE,
	        "pixels": OBS_IMAGE,
	    })

	    def __post_init__(self):
	        ...  # populate features based on obs_type

	    @property
	    def gym_kwargs(self) -> dict:
	        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

	    def create_envs(self, n_envs: int, use_async_envs: bool = True):
	        """Override for multi-task benchmarks or custom env creation."""
	        # Placeholder module name; import from wherever your factory lives.
	        from lerobot.envs.mybenchmark import create_mybenchmark_envs
	        return create_mybenchmark_envs(
	            task=self.task,
	            n_envs=n_envs,
	            gym_kwargs=self.gym_kwargs,
	            # pass env_cls and any other arguments your factory needs
	        )

	    def get_env_processors(self):
	        """Override if your benchmark needs observation/action transforms."""
	        from lerobot.processor import PolicyProcessorPipeline
	        from lerobot.processor.env_processor import MyBenchmarkProcessorStep
	        return (
	            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
	            PolicyProcessorPipeline(steps=[]),
	        )
	```

	Key points:

	- The `register_subclass` name is what users pass on the CLI (`--env.type=<name>`).
	- `features` tells the policy what the environment produces.
	- `features_map` maps raw observation keys to LeRobot convention keys.
	- No changes to `factory.py` needed — the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.

	### 3. Env processor (optional — `src/lerobot/processor/env_processor.py`)

	Only needed if your benchmark requires observation transforms beyond what `preprocess_observation()` handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from `get_env_processors()` in your config (see step 2):

	```python
	@dataclass
	@ProcessorStepRegistry.register(name="mybenchmark_processor")  # placeholder name
	class MyBenchmarkProcessorStep(ObservationProcessorStep):
	    def _process_observation(self, observation):
	        processed = observation.copy()
	        # your transforms here
	        return processed

	    def transform_features(self, features):
	        return features  # update if shapes change

	    def observation(self, observation):
	        return self._process_observation(observation)
	```

	See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).
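
As a concrete example of a transform, the sketch below rotates an image by 180 degrees. It works on nested lists rather than tensors and is written as plain functions instead of a registered processor step, so treat it as an illustration of the copy-then-rebind pattern only.

```python
def rotate_180(image):
    """Rotate an H x W x C image (nested lists) by 180 degrees without mutating it."""
    return [list(reversed(row)) for row in reversed(image)]


def process_observation(observation: dict) -> dict:
    processed = observation.copy()  # shallow copy: values are still shared
    key = "observation.image"
    if key in processed:
        processed[key] = rotate_180(processed[key])  # rebind, do not edit in place
    return processed


original = {
    "observation.image": [[[1], [2]], [[3], [4]]],  # 2 x 2 image, 1 channel
    "observation.state": [0.5],
}
result = process_observation(original)
```

Since `observation.copy()` is shallow, assigning a new value to the key is safe, whereas modifying the image in place would also change the caller's observation.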

	### 4. Dependencies (`pyproject.toml`)

	Add a new optional-dependency group:

	```toml
	mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
	```

	Pinning rules:

	- Always pin benchmark packages to exact versions for reproducibility (e.g. `metaworld==3.0.0`).
	- Add platform markers when needed (e.g. `; sys_platform == 'linux'`).
	- Pin fragile transitive deps if known (e.g. `gymnasium==1.1.0` for Meta-World).
	- Document constraints in your benchmark doc page.
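
If you want to check the first rule mechanically, a rough helper like the one below can flag requirements that are not pinned with `==`. It is a deliberately simple string check under the assumption that requirements look like the examples above; it is not a full dependency-specifier parser, and `is_exact_pin` is a hypothetical name.

```python
def is_exact_pin(requirement: str) -> bool:
    """True for 'pkg==1.2.3' style requirements; markers after ';' are ignored."""
    spec = requirement.split(";", 1)[0].strip()
    if "==" not in spec:
        return False
    name, version = spec.split("==", 1)
    version = version.strip()
    # Wildcards and extra clauses mean the version is not fully fixed.
    return bool(name.strip()) and bool(version) and "*" not in version and "," not in version


checks = {
    req: is_exact_pin(req)
    for req in [
        "metaworld==3.0.0",
        "gymnasium==1.1.0 ; sys_platform == 'linux'",
        "my-benchmark-pkg>=1.2",
        "my-benchmark-pkg==1.*",
    ]
}
```

References to other extras of the same project, such as `lerobot[scipy-dep]`, carry no version of their own and would need to be exempted from a check like this.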

	Users install with:

	```bash
	pip install -e ".[mybenchmark]"
	```

	### 5. Documentation (`docs/source/<benchmark>.mdx`)

	Write a user-facing page following the template in the next section. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.

	### 6. Table of contents (`docs/source/_toctree.yml`)

	Add your benchmark to the "Benchmarks" section:

	```yaml
	- sections:
	    - local: libero
	      title: LIBERO
	    - local: metaworld
	      title: Meta-World
	    - local: envhub_isaaclab_arena
	      title: NVIDIA IsaacLab Arena Environments
	    - local: <benchmark>
	      title: <Benchmark title>
	  title: "Benchmarks"
	```

	## Verifying your integration

	After completing the steps above, confirm that everything works:

	1. Install: run `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly.
	2. Smoke test env creation: call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys.
	3. Run a full eval: run `lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --policy.path=<policy_path>` to exercise the full pipeline end-to-end. (`batch_size` defaults to auto-tuning based on CPU cores; pass `--eval.batch_size=1` to force a single environment.)
	4. Check success detection: verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates.
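
The smoke test in step 2 can be partly automated. The helper below is a hypothetical, duck-typed sketch: it only checks the nesting, the observation keys, and the presence of `is_success` after `reset()`, and it uses tiny stand-in classes instead of real vectorized environments (whose batched `info` layout may differ).

```python
def check_env_dict(envs, required_obs_keys=("pixels", "agent_pos")):
    """Return a list of problems found in a {suite: {task_id: env}} mapping."""
    if not isinstance(envs, dict) or not envs:
        return ["top level must be a non-empty dict of suites"]
    problems = []
    for suite, tasks in envs.items():
        if not isinstance(suite, str) or not isinstance(tasks, dict):
            problems.append(f"{suite!r}: expected str -> dict[int, env]")
            continue
        for task_id, env in tasks.items():
            if not isinstance(task_id, int):
                problems.append(f"{suite}/{task_id!r}: task id must be an int")
                continue
            obs, info = env.reset()
            for key in required_obs_keys:
                if key not in obs:
                    problems.append(f"{suite}/{task_id}: missing obs key {key!r}")
            if "is_success" not in info:
                problems.append(f"{suite}/{task_id}: reset() info lacks 'is_success'")
    return problems


class GoodEnv:
    def reset(self, seed=None, options=None):
        return {"pixels": [], "agent_pos": []}, {"is_success": False}


class BadEnv:
    def reset(self, seed=None, options=None):
        return {"pixels": []}, {}


good_report = check_env_dict({"suite_a": {0: GoodEnv()}})
bad_report = check_env_dict({"suite_a": {0: BadEnv()}})
```

An empty report is necessary but not sufficient: step 4 still requires watching an episode in which the task is actually solved.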

	## Writing a benchmark doc page

	Each benchmark `.mdx` page should include:

	- Title and description: 1-2 paragraphs on what the benchmark tests and why it matters.
	- Links: paper, GitHub repo, project website (if available).
	- Overview image or GIF.
	- Available tasks: table of task suites with counts and brief descriptions.
	- Installation: `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
	- Evaluation: recommended `lerobot-eval` command with `n_episodes` for reproducible results. `batch_size` defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable.
	- Policy inputs and outputs: observation keys with shapes, action space description.
	- Recommended evaluation episodes: how many episodes per task is standard.
	- Training: example `lerobot-train` command.
	- Reproducing published results: link to pretrained model, eval command, results table (if available).

	See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for complete examples.
