Adding a New Benchmark
This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.
A benchmark in LeRobot is a set of Gymnasium environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard gym.Env interface. The lerobot-eval CLI then runs evaluation uniformly across all benchmarks.
Existing benchmarks at a glance
Before diving in, here is what is already integrated:
| Benchmark | Env file | Config class | Tasks | Action dim | Processor |
|---|---|---|---|---|---|
| LIBERO | envs/libero.py | LiberoEnv | 130 across 5 suites | 7 | LiberoProcessorStep |
| Meta-World | envs/metaworld.py | MetaworldEnv | 50 (MT50) | 4 | None |
| IsaacLab Arena | Hub-hosted | IsaaclabArenaEnv | Configurable | Configurable | IsaaclabArenaProcessorStep |
Use src/lerobot/envs/libero.py and src/lerobot/envs/metaworld.py as reference implementations.
How it all fits together
Data flow
During evaluation, data moves through four stages:
1. gym.Env ──→ raw observations (numpy dicts)
2. Preprocessing ──→ standard LeRobot keys + task description
(preprocess_observation in envs/utils.py, env.call("task_description"))
3. Processors ──→ env-specific then policy-specific transforms
(env_preprocessor, policy_preprocessor)
4. Policy ──→ select_action() ──→ action tensor
then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()
Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed).
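The ordering of stages 3 and 4 can be sketched as plain function composition. This is a minimal illustration, not LeRobot's implementation: `run_policy_step` and `recording_stage` are hypothetical names, and each stage is an identity function that only records when it ran.

```python
from typing import Any, Callable

Transform = Callable[[Any], Any]


def run_policy_step(
    observation: Any,
    env_preprocessor: Transform,
    policy_preprocessor: Transform,
    select_action: Transform,
    policy_postprocessor: Transform,
    env_postprocessor: Transform,
) -> Any:
    """Push one observation through the processors and the policy, then back out."""
    # Forward: env-specific transforms first, then policy-specific ones.
    batch = policy_preprocessor(env_preprocessor(observation))
    action = select_action(batch)
    # Reverse: undo the policy-side transforms first, then the env-side ones.
    return env_postprocessor(policy_postprocessor(action))


calls: list[str] = []


def recording_stage(name: str) -> Transform:
    """Build an identity stage that records its name when it runs."""
    def apply(value: Any) -> Any:
        calls.append(name)
        return value
    return apply


action = run_policy_step(
    {"observation.state": [0.0, 0.0]},
    env_preprocessor=recording_stage("env_preprocessor"),
    policy_preprocessor=recording_stage("policy_preprocessor"),
    select_action=recording_stage("select_action"),
    policy_postprocessor=recording_stage("policy_postprocessor"),
    env_postprocessor=recording_stage("env_postprocessor"),
)
print(calls)
```

The printed list shows the env-specific step running first on the way in and last on the way out, which is why an env processor sees observations before any policy-specific normalization.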
Environment structure
make_env() returns a nested dict of vectorized environments:
dict[str, dict[int, gym.vector.VectorEnv]]
# ^suite ^task_id
A single-task env (e.g. PushT) looks like {"pusht": {0: vec_env}}.
A multi-task benchmark (e.g. LIBERO) looks like {"libero_spatial": {0: vec0, 1: vec1, ...}, ...}.
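A short sketch of walking this nested structure may help. The helper name `iter_tasks` is hypothetical, and strings stand in for the vectorized environments so the example has no dependencies.

```python
from typing import Any, Iterator


def iter_tasks(envs: dict[str, dict[int, Any]]) -> Iterator[tuple[str, int, Any]]:
    """Yield (suite, task_id, vec_env) for every task, suite by suite."""
    for suite, tasks in envs.items():
        for task_id in sorted(tasks):
            yield suite, task_id, tasks[task_id]


# Strings stand in for gym.vector.VectorEnv instances.
envs = {
    "pusht": {0: "vec_env"},
    "libero_spatial": {0: "vec0", 1: "vec1"},
}
flat = list(iter_tasks(envs))
for suite, task_id, vec_env in flat:
    print(suite, task_id, vec_env)
```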
How evaluation runs
All benchmarks are evaluated the same way by lerobot-eval:
- make_env() builds the nested {suite: {task_id: VectorEnv}} dict.
- eval_policy_all() iterates over every suite and task.
- For each task, it runs n_episodes rollouts via rollout().
- Results are aggregated hierarchically: episode, task, suite, overall.
- Metrics include pc_success (success rate), avg_sum_reward, and avg_max_reward.
The critical piece: your env must return info["is_success"] on every step() call. This is how the eval loop knows whether a task was completed.
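The hierarchical aggregation can be sketched as follows. This is an illustration under stated assumptions, not the code in lerobot-eval: the function names are hypothetical, the success rate is expressed as a percentage, and suite and overall figures are pooled over episodes rather than averaged over per-task rates.

```python
def success_rate(flags: list[bool]) -> float:
    """Percentage of episodes whose final info["is_success"] was True."""
    return 100.0 * sum(flags) / len(flags)


def aggregate_success(results: dict[str, dict[int, list[bool]]]) -> dict:
    """Roll per-episode success flags up to task, suite, and overall level."""
    per_task: dict[tuple[str, int], float] = {}
    per_suite: dict[str, float] = {}
    all_flags: list[bool] = []
    for suite, tasks in results.items():
        suite_flags: list[bool] = []
        for task_id, flags in tasks.items():
            per_task[(suite, task_id)] = success_rate(flags)
            suite_flags.extend(flags)
        per_suite[suite] = success_rate(suite_flags)
        all_flags.extend(suite_flags)
    return {
        "per_task": per_task,
        "per_suite": per_suite,
        "overall": success_rate(all_flags),
    }


# Two episodes per task, three tasks across two suites.
results = {
    "suite_a": {0: [True, True], 1: [True, False]},
    "suite_b": {0: [False, False]},
}
summary = aggregate_success(results)
print(summary)
```

If an environment never sets `info["is_success"]`, every flag stays False and all of these rates report zero, regardless of how well the policy performs.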
What your environment must provide
LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.
Env attributes
Your gym.Env must set these attributes:
| Attribute | Type | Why |
|---|---|---|
| _max_episode_steps | int | rollout() uses this to cap episode length |
| task_description | str | Passed to VLA policies as a language instruction |
| task | str | Fallback identifier if task_description is not set |
Success reporting
Your step() and reset() must include "is_success" in the info dict:
info = {"is_success": True} # or False
return observation, reward, terminated, truncated, info
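A small guard can catch a missing key early. The sketch below assumes the standard Gymnasium return shapes (a 2-tuple from reset(), a 5-tuple from step()); the helper names `check_reset_result` and `check_step_result` are hypothetical and not part of LeRobot.

```python
def check_reset_result(result: tuple) -> None:
    """Validate the (observation, info) tuple returned by reset()."""
    if len(result) != 2:
        raise ValueError(f"reset() must return 2 values, got {len(result)}")
    _observation, info = result
    if "is_success" not in info:
        raise KeyError("reset() info is missing 'is_success'")


def check_step_result(result: tuple) -> bool:
    """Validate the 5-tuple returned by step() and return the success flag."""
    if len(result) != 5:
        raise ValueError(f"step() must return 5 values, got {len(result)}")
    _observation, _reward, _terminated, _truncated, info = result
    if "is_success" not in info:
        raise KeyError("step() info is missing 'is_success'")
    return bool(info["is_success"])


# Hand-written results standing in for a real environment.
check_reset_result(({"agent_pos": [0.0]}, {"is_success": False}))
succeeded = check_step_result(({"agent_pos": [0.1]}, 1.0, True, False, {"is_success": True}))

try:
    check_step_result(({"agent_pos": [0.1]}, 0.0, False, False, {}))
    missing_key_detected = False
except KeyError:
    missing_key_detected = True
```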
Observations
The simplest approach is to map your simulator's outputs to the standard keys that preprocess_observation() already understands. Do this inside your gym.Env (e.g. in a _format_raw_obs() helper):
| Your env should output | LeRobot maps it to | What it is |
|---|---|---|
| "pixels" (single array) | observation.image | Single camera image, HWC uint8 |
| "pixels" (dict) | observation.images. | Multiple cameras, each HWC uint8 |
| "agent_pos" | observation.state | Proprioceptive state vector |
| "environment_state" | observation.env_state | Full environment state (e.g. PushT) |
| "robot_state" | observation.robot_state | Nested robot state dict (e.g. LIBERO) |
If your simulator uses different key names, you have two options:
- Recommended: Rename them to the standard keys inside your gym.Env wrapper.
- Alternative: Write an env processor to transform observations after preprocess_observation() runs (see step 3 below).
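The recommended renaming can be sketched as a small helper. The simulator key names (`joint_positions`, `object_poses`, `front_rgb`, `wrist_rgb`) are invented for illustration, and nested lists stand in for the numpy arrays a real environment would return.

```python
# Hypothetical simulator keys on the left, standard keys on the right.
KEY_RENAMES = {
    "joint_positions": "agent_pos",
    "object_poses": "environment_state",
}
# Hypothetical simulator camera keys mapped to camera names.
CAMERA_KEYS = {"front_rgb": "front", "wrist_rgb": "wrist"}


def format_raw_obs(raw: dict) -> dict:
    """Rename simulator outputs to the keys preprocess_observation() understands."""
    obs = {standard: raw[sim] for sim, standard in KEY_RENAMES.items() if sim in raw}
    cameras = {name: raw[sim] for sim, name in CAMERA_KEYS.items() if sim in raw}
    if len(cameras) == 1:
        # One camera: "pixels" is the image itself.
        obs["pixels"] = next(iter(cameras.values()))
    elif cameras:
        # Several cameras: "pixels" is a dict keyed by camera name.
        obs["pixels"] = cameras
    return obs


tiny_image = [[[0, 0, 0]]]  # 1x1 HWC placeholder; real images are uint8 arrays
single = format_raw_obs({"joint_positions": [0.1, 0.2], "front_rgb": tiny_image})
multi = format_raw_obs({"front_rgb": tiny_image, "wrist_rgb": tiny_image})
print(sorted(single), sorted(multi["pixels"]))
```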
Actions
Actions are continuous numpy arrays in a gym.spaces.Box. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their input_features / output_features config.
Feature declaration
Each EnvConfig subclass declares two dicts that tell the policy what to expect:
- features: maps feature names to PolicyFeature(type, shape) (e.g. action dim, image shape).
- features_map: maps raw observation keys to LeRobot convention keys (e.g. "agent_pos" to "observation.state").
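To make the relationship between the two dicts concrete, here is a dependency-free sketch. The `PolicyFeature` class is a stand-in dataclass, not the one shipped with LeRobot, the shapes are arbitrary examples, and `to_policy_features` is a hypothetical helper that only renames keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyFeature:
    """Stand-in for LeRobot's PolicyFeature; the real class uses a FeatureType enum."""
    type: str
    shape: tuple[int, ...]


# What the environment produces, keyed by raw names (example shapes only).
features = {
    "action": PolicyFeature("ACTION", (4,)),
    "agent_pos": PolicyFeature("STATE", (4,)),
    "pixels": PolicyFeature("VISUAL", (84, 84, 3)),
}

# Raw observation keys mapped to LeRobot convention keys.
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}


def to_policy_features(
    features: dict[str, PolicyFeature], features_map: dict[str, str]
) -> dict[str, PolicyFeature]:
    """Re-key the declared features using the convention names."""
    return {features_map[key]: feature for key, feature in features.items()}


policy_features = to_policy_features(features, features_map)
print(sorted(policy_features))
```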
Step by step
At minimum, you need two files: a gym.Env wrapper and an EnvConfig
subclass with a create_envs() override. Everything else is optional or
documentation. No changes to factory.py are needed.
Checklist
| File | Required | Why |
|---|---|---|
| src/lerobot/envs/.py | Yes | Wraps the simulator as a standard gym.Env |
| src/lerobot/envs/configs.py | Yes | Registers your benchmark and its create_envs() for the CLI |
| src/lerobot/processor/env_processor.py | Optional | Custom observation/action transforms |
| src/lerobot/envs/utils.py | Optional | Only if you need new raw observation keys |
| pyproject.toml | Yes | Declares benchmark-specific dependencies |
| docs/source/.mdx | Yes | User-facing documentation page |
| docs/source/_toctree.yml | Yes | Adds your page to the docs sidebar |
1. The gym.Env wrapper (src/lerobot/envs/.py)
Create a gym.Env subclass that wraps the third-party simulator:
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyBenchmarkEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": }

    def __init__(self, task_suite, task_id, ...):
        super().__init__()
        self.task =
        self.task_description =
        self._max_episode_steps =
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        ...  # return (observation, info); info must contain {"is_success": False}

    def step(self, action: np.ndarray):
        ...  # return (obs, reward, terminated, truncated, info); info must contain {"is_success": }

    def render(self):
        ...  # return RGB image as numpy array

    def close(self):
        ...
GPU-based simulators (e.g. MuJoCo with EGL rendering): If your simulator allocates GPU/EGL contexts during __init__, defer that allocation to a _ensure_env() helper called on first reset()/step(). This avoids inheriting stale GPU handles when AsyncVectorEnv spawns worker processes. See LiberoEnv._ensure_env() for the pattern.
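The deferred-allocation pattern can be sketched without a real simulator. `FakeSimulator` and `LazyEnv` are hypothetical stand-ins; the actual LiberoEnv._ensure_env() may differ in detail.

```python
class FakeSimulator:
    """Stand-in for a simulator that would grab a GPU/EGL context on construction."""

    instances = 0

    def __init__(self) -> None:
        FakeSimulator.instances += 1

    def reset(self) -> dict:
        return {"agent_pos": [0.0, 0.0]}


class LazyEnv:
    """Environment wrapper that builds its simulator on first use, not in __init__."""

    def __init__(self) -> None:
        self._env = None  # nothing heavy is allocated in the parent process

    def _ensure_env(self) -> FakeSimulator:
        if self._env is None:
            # First reset()/step() happens inside the worker, so the context is created there.
            self._env = FakeSimulator()
        return self._env

    def reset(self) -> tuple[dict, dict]:
        observation = self._ensure_env().reset()
        return observation, {"is_success": False}


env = LazyEnv()
created_before_reset = FakeSimulator.instances
env.reset()
env.reset()
created_after_reset = FakeSimulator.instances
print(created_before_reset, created_after_reset)
```

Constructing the wrapper stays cheap and safe to hand to a worker process, while repeated resets reuse the single simulator instance.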
Also provide a factory function that returns the nested dict structure:
from typing import Any


def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...
See create_libero_envs() (multi-suite, multi-task) and create_metaworld_envs() (difficulty-grouped tasks) for reference.
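One way to structure such a factory is sketched below. Everything here is illustrative: `FakeVectorEnv` stands in for a Gymnasium vector env class that accepts a list of env-building callables, `make_single_env` stands in for constructing your gym.Env, and the `suites` argument is a hypothetical pre-parsed form of the task string.

```python
from functools import partial
from typing import Any, Callable


class FakeVectorEnv:
    """Stand-in for a vector env: builds one env per factory callable."""

    def __init__(self, env_fns: list[Callable[[], Any]]) -> None:
        self.envs = [fn() for fn in env_fns]


def make_single_env(suite: str, task_id: int) -> dict:
    """Placeholder for constructing one MyBenchmarkEnv instance."""
    return {"suite": suite, "task_id": task_id}


def create_envs_sketch(
    suites: dict[str, list[int]],
    n_envs: int,
    vec_env_cls: type = FakeVectorEnv,
) -> dict[str, dict[int, Any]]:
    """Build {suite_name: {task_id: vector_env}} with n_envs copies per task."""
    out: dict[str, dict[int, Any]] = {}
    for suite, task_ids in suites.items():
        out[suite] = {}
        for task_id in task_ids:
            # partial binds suite and task_id now, avoiding late-binding bugs with lambdas in loops.
            env_fns = [partial(make_single_env, suite, task_id) for _ in range(n_envs)]
            out[suite][task_id] = vec_env_cls(env_fns)
    return out


nested = create_envs_sketch({"suite_a": [0, 1], "suite_b": [0]}, n_envs=2)
print({suite: sorted(tasks) for suite, tasks in nested.items()})
```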
2. The config (src/lerobot/envs/configs.py)
Register a config dataclass so users can select your benchmark with --env.type=. Each config owns its environment creation and processor logic via two methods:
- create_envs(n_envs, use_async_envs): Returns {suite: {task_id: VectorEnv}}. The base class default uses gym.make() for single-task envs. Multi-task benchmarks override this.
- get_env_processors(): Returns (preprocessor, postprocessor). The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.
@EnvConfig.register_subclass("")
@dataclass
class MyBenchmarkEnvConfig(EnvConfig):
    task: str = ""
    fps: int =
    obs_type: str = "pixels_agent_pos"
    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(,)),
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        ...  # populate features based on obs_type

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

    def create_envs(self, n_envs: int, use_async_envs: bool = True):
        """Override for multi-task benchmarks or custom env creation."""
        from lerobot.envs. import create__envs

        return create__envs(task=self.task, n_envs=n_envs, ...)

    def get_env_processors(self):
        """Override if your benchmark needs observation/action transforms."""
        from lerobot.processor import PolicyProcessorPipeline
        from lerobot.processor.env_processor import MyBenchmarkProcessorStep

        return (
            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
            PolicyProcessorPipeline(steps=[]),
        )
Key points:
- The register_subclass name is what users pass on the CLI (--env.type=).
- features tells the policy what the environment produces.
- features_map maps raw observation keys to LeRobot convention keys.
- No changes to factory.py are needed: the factory delegates to cfg.create_envs() and cfg.get_env_processors() automatically.
3. Env processor (optional — src/lerobot/processor/env_processor.py)
Only needed if your benchmark requires observation transforms beyond what preprocess_observation() handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from get_env_processors() in your config (see step 2):
@dataclass
@ProcessorStepRegistry.register(name="_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # your transforms here
        return processed

    def transform_features(self, features):
        return features  # update if shapes change

    def observation(self, observation):
        return self._process_observation(observation)
See LiberoProcessorStep for a full example (image rotation, quaternion-to-axis-angle conversion).
4. Dependencies (pyproject.toml)
Add a new optional-dependency group:
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
Pinning rules:
- Always pin benchmark packages to exact versions for reproducibility (e.g. metaworld==3.0.0).
- Add platform markers when needed (e.g. ; sys_platform == 'linux').
- Pin fragile transitive deps if known (e.g. gymnasium==1.1.0 for Meta-World).
- Document constraints in your benchmark doc page.
Users install with:
pip install -e ".[mybenchmark]"
5. Documentation (docs/source/.mdx)
Write a user-facing page following the template in the next section. See docs/source/libero.mdx and docs/source/metaworld.mdx for full examples.
6. Table of contents (docs/source/_toctree.yml)
Add your benchmark to the "Benchmarks" section:
- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local:
      title:
  title: "Benchmarks"
Verifying your integration
After completing the steps above, confirm that everything works:
- Install: run pip install -e ".[mybenchmark]" and verify the dependency group installs cleanly.
- Smoke test env creation: call make_env() with your config in Python, check that the returned dict has the expected {suite: {task_id: VectorEnv}} shape, and that reset() returns observations with the right keys.
- Run a full eval: run lerobot-eval --env.type= --env.task= --eval.n_episodes=1 --policy.path= to exercise the full pipeline end-to-end. (batch_size defaults to auto-tuning based on CPU cores; pass --eval.batch_size=1 to force a single environment.)
- Check success detection: verify that info["is_success"] flips to True when the task is actually completed. This is what the eval loop uses to compute success rates.
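The smoke test can be partly automated with a shape check over whatever make_env() returns. The helper below is a hypothetical sketch, not part of LeRobot; it relies only on each value having a reset() method that returns (observation, info), and `FakeVectorEnv` stands in for a real vectorized environment.

```python
def check_env_dict(envs: dict, expected_obs_keys: set[str]) -> list[str]:
    """Return a list of problems found in a {suite: {task_id: vec_env}} dict."""
    if not isinstance(envs, dict) or not envs:
        return ["top level must be a non-empty dict"]
    problems: list[str] = []
    for suite, tasks in envs.items():
        if not isinstance(tasks, dict) or not tasks:
            problems.append(f"{suite}: expected a non-empty dict of tasks")
            continue
        for task_id, vec_env in tasks.items():
            if not isinstance(task_id, int):
                problems.append(f"{suite}/{task_id}: task id must be an int")
            observation, _info = vec_env.reset()
            missing = set(expected_obs_keys) - set(observation)
            if missing:
                problems.append(f"{suite}/{task_id}: missing observation keys {sorted(missing)}")
    return problems


class FakeVectorEnv:
    """Stand-in for a vectorized environment with a fixed observation."""

    def __init__(self, observation: dict) -> None:
        self._observation = observation

    def reset(self) -> tuple[dict, dict]:
        return self._observation, {}


good = {"mybenchmark": {0: FakeVectorEnv({"pixels": [], "agent_pos": []})}}
bad = {"mybenchmark": {0: FakeVectorEnv({"pixels": []})}}
good_problems = check_env_dict(good, {"pixels", "agent_pos"})
bad_problems = check_env_dict(bad, {"pixels", "agent_pos"})
print(good_problems, bad_problems)
```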
Writing a benchmark doc page
Each benchmark .mdx page should include:
- Title and description: 1-2 paragraphs on what the benchmark tests and why it matters.
- Links: paper, GitHub repo, project website (if available).
- Overview image or GIF.
- Available tasks: table of task suites with counts and brief descriptions.
- Installation: pip install -e ".[]" plus any extra steps (env vars, system packages).
- Evaluation: recommended lerobot-eval command with n_episodes for reproducible results. batch_size defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable.
- Policy inputs and outputs: observation keys with shapes, action space description.
- Recommended evaluation episodes: how many episodes per task is standard.
- Training: example lerobot-train command.
- Reproducing published results: link to pretrained model, eval command, results table (if available).
See docs/source/libero.mdx and docs/source/metaworld.mdx for complete examples.