# Adding a New Benchmark

This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.

A benchmark in LeRobot is a set of [Gymnasium](https://gymnasium.farama.org/) environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard `gym.Env` interface. The `lerobot-eval` CLI then runs evaluation uniformly across all benchmarks.

## Existing benchmarks at a glance

Before diving in, here is what is already integrated:

| Benchmark      | Env file            | Config class       | Tasks               | Action dim   | Processor                    |
| -------------- | ------------------- | ------------------ | ------------------- | ------------ | ---------------------------- |
| LIBERO         | `envs/libero.py`    | `LiberoEnv`        | 130 across 5 suites | 7            | `LiberoProcessorStep`        |
| Meta-World     | `envs/metaworld.py` | `MetaworldEnv`     | 50 (MT50)           | 4            | None                         |
| IsaacLab Arena | Hub-hosted          | `IsaaclabArenaEnv` | Configurable        | Configurable | `IsaaclabArenaProcessorStep` |

Use `src/lerobot/envs/libero.py` and `src/lerobot/envs/metaworld.py` as reference implementations.
## How it all fits together

### Data flow

During evaluation, data moves through four stages:

```
1. gym.Env        ──→ raw observations (numpy dicts)
2. Preprocessing  ──→ standard LeRobot keys + task description
                      (preprocess_observation in envs/utils.py, env.call("task_description"))
3. Processors     ──→ env-specific then policy-specific transforms
                      (env_preprocessor, policy_preprocessor)
4. Policy         ──→ select_action() ──→ action tensor
                      then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()
```

Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed).
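
The ordering of the stages can be sketched with plain callables. This is only an illustration of the flow described above, not LeRobot's actual eval loop: `run_one_step`, `identity`, `toy_preprocess`, and `toy_policy` are hypothetical names, and real pipelines operate on tensors rather than Python lists.

```python
from typing import Any, Callable

Obs = dict[str, Any]
Stage = Callable[[Any], Any]


def identity(x: Any) -> Any:
    """No-op stage, used where a benchmark needs no transform."""
    return x


def run_one_step(
    raw_obs: Obs,
    preprocess: Stage,           # stage 2: raw keys -> standard keys
    env_preprocessor: Stage,     # stage 3a: env-specific transform
    policy_preprocessor: Stage,  # stage 3b: policy-specific transform
    select_action: Stage,        # stage 4: policy inference
    policy_postprocessor: Stage = identity,
    env_postprocessor: Stage = identity,
) -> Any:
    """Forward pass through stages 2-4, then the reverse path back to an env action."""
    obs = preprocess(raw_obs)
    obs = env_preprocessor(obs)
    obs = policy_preprocessor(obs)
    action = select_action(obs)
    action = policy_postprocessor(action)
    return env_postprocessor(action)


def toy_preprocess(raw: Obs) -> Obs:
    # Stand-in for stage 2: rename a raw key to a standard key.
    return {"observation.state": raw["agent_pos"]}


def toy_policy(obs: Obs) -> list[float]:
    # Stand-in policy that always returns a 4-dim zero action.
    return [0.0] * 4


action = run_one_step(
    {"agent_pos": [0.1, 0.2]},
    preprocess=toy_preprocess,
    env_preprocessor=identity,
    policy_preprocessor=identity,
    select_action=toy_policy,
)
```

A benchmark that needs no env-specific transform corresponds to passing `identity` for the env stages, which is why stage 3 is optional for most integrations.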
### Environment structure

`make_env()` returns a nested dict of vectorized environments:

```python
dict[str, dict[int, gym.vector.VectorEnv]]
#    ^suite    ^task_id
```

A single-task env (e.g. PushT) looks like `{"pusht": {0: vec_env}}`.
A multi-task benchmark (e.g. LIBERO) looks like `{"libero_spatial": {0: vec0, 1: vec1, ...}, ...}`.
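
As a small illustration of how code can walk this structure, the sketch below flattens the nested dict into `(suite, task_id, vec_env)` triples. `iter_tasks` is a hypothetical helper, and the strings stand in for real `VectorEnv` objects.

```python
from typing import Any, Iterator


def iter_tasks(envs: dict[str, dict[int, Any]]) -> Iterator[tuple[str, int, Any]]:
    """Yield (suite_name, task_id, vec_env) for every task, task ids in ascending order."""
    for suite_name, tasks in envs.items():
        for task_id, vec_env in sorted(tasks.items()):
            yield suite_name, task_id, vec_env


# Strings are placeholders for gym.vector.VectorEnv instances.
single = {"pusht": {0: "vec_env"}}
multi = {"libero_spatial": {0: "vec0", 1: "vec1"}, "libero_object": {0: "vec2"}}

flat = [(suite, task_id) for suite, task_id, _ in iter_tasks(multi)]
```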
### How evaluation runs

All benchmarks are evaluated the same way by `lerobot-eval`:

1. `make_env()` builds the nested `{suite: {task_id: VectorEnv}}` dict.
2. `eval_policy_all()` iterates over every suite and task.
3. For each task, it runs `n_episodes` rollouts via `rollout()`.
4. Results are aggregated hierarchically: episode, task, suite, overall.
5. Metrics include `pc_success` (success rate), `avg_sum_reward`, and `avg_max_reward`.

The critical piece: your env must return `info["is_success"]` on every `step()` call. This is how the eval loop knows whether a task was completed.
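
To make the metric names concrete, here is a minimal sketch of the aggregation. It assumes `pc_success` is reported as a percentage and that the overall numbers pool all episodes; the `aggregate` helper and the shape of `results` are hypothetical, not the internals of `eval_policy_all()`.

```python
from statistics import mean


def aggregate(episodes: list[dict]) -> dict[str, float]:
    """Each episode dict holds 'rewards' (per-step floats) and 'is_success' (bool)."""
    return {
        "pc_success": 100.0 * mean(1.0 if ep["is_success"] else 0.0 for ep in episodes),
        "avg_sum_reward": mean(sum(ep["rewards"]) for ep in episodes),
        "avg_max_reward": mean(max(ep["rewards"]) for ep in episodes),
    }


# Hypothetical rollout results: {suite: {task_id: [episode, ...]}}
results = {
    "suite_a": {
        0: [
            {"rewards": [0.0, 1.0], "is_success": True},
            {"rewards": [0.0, 0.0], "is_success": False},
        ],
        1: [{"rewards": [0.5, 0.5], "is_success": True}],
    }
}

# Task level: one metrics dict per (suite, task_id).
per_task = {
    (suite, task_id): aggregate(episodes)
    for suite, tasks in results.items()
    for task_id, episodes in tasks.items()
}

# Overall level: pool every episode across suites and tasks.
overall = aggregate(
    [ep for tasks in results.values() for episodes in tasks.values() for ep in episodes]
)
```

If an env never sets `is_success`, every episode counts as a failure in a scheme like this, which is why the flag matters more than the reward signal for reported success rates.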
## What your environment must provide

LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.

### Env attributes

Your `gym.Env` must set these attributes:

| Attribute            | Type  | Why                                                  |
| -------------------- | ----- | ---------------------------------------------------- |
| `_max_episode_steps` | `int` | `rollout()` uses this to cap episode length          |
| `task_description`   | `str` | Passed to VLA policies as a language instruction     |
| `task`               | `str` | Fallback identifier if `task_description` is not set |
### Success reporting

Your `step()` and `reset()` must include `"is_success"` in the `info` dict:

```python
info = {"is_success": True}  # or False
return observation, reward, terminated, truncated, info
```
### Observations

The simplest approach is to map your simulator's outputs to the standard keys that `preprocess_observation()` already understands. Do this inside your `gym.Env` (e.g. in a `_format_raw_obs()` helper):

| Your env should output    | LeRobot maps it to            | What it is                            |
| ------------------------- | ----------------------------- | ------------------------------------- |
| `"pixels"` (single array) | `observation.image`           | Single camera image, HWC uint8        |
| `"pixels"` (dict)         | `observation.images.<camera>` | Multiple cameras, each HWC uint8      |
| `"agent_pos"`             | `observation.state`           | Proprioceptive state vector           |
| `"environment_state"`     | `observation.env_state`       | Full environment state (e.g. PushT)   |
| `"robot_state"`           | `observation.robot_state`     | Nested robot state dict (e.g. LIBERO) |

If your simulator uses different key names, you have two options:

1. **Recommended:** Rename them to the standard keys inside your `gym.Env` wrapper.
2. **Alternative:** Write an env processor to transform observations after `preprocess_observation()` runs (see step 3 below).
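
The recommended option can be sketched as a small renaming helper. The raw key names below (`front_rgb`, `wrist_rgb`, `joint_positions`) are invented for illustration; substitute whatever your simulator emits. In a real env the image values would be HWC uint8 arrays rather than the placeholder strings used in the usage lines.

```python
from typing import Any

# Hypothetical simulator key names -> camera names used under "pixels".
RAW_CAMERA_KEYS = {"front_rgb": "front", "wrist_rgb": "wrist"}
# Hypothetical key holding the proprioceptive state vector.
RAW_STATE_KEY = "joint_positions"


def format_raw_obs(raw: dict[str, Any]) -> dict[str, Any]:
    """Rename simulator outputs to the standard keys listed in the table above."""
    pixels = {name: raw[key] for key, name in RAW_CAMERA_KEYS.items() if key in raw}
    obs: dict[str, Any] = {"agent_pos": raw[RAW_STATE_KEY]}
    if len(pixels) == 1:
        # Single camera: expose the image directly under "pixels".
        obs["pixels"] = next(iter(pixels.values()))
    elif pixels:
        # Multiple cameras: expose a dict keyed by camera name.
        obs["pixels"] = pixels
    return obs


two_cams = format_raw_obs(
    {"front_rgb": "IMG_F", "wrist_rgb": "IMG_W", "joint_positions": [0.0, 1.0]}
)
one_cam = format_raw_obs({"front_rgb": "IMG_F", "joint_positions": [0.0]})
```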
### Actions

Actions are continuous numpy arrays in a `gym.spaces.Box`. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their `input_features` / `output_features` config.

### Feature declaration

Each `EnvConfig` subclass declares two dicts that tell the policy what to expect:

- `features`: maps feature names to `PolicyFeature(type, shape)` (e.g. action dim, image shape).
- `features_map`: maps raw observation keys to LeRobot convention keys (e.g. `"agent_pos"` to `"observation.state"`).
## Step by step

At minimum, the code changes are two pieces: a **gym.Env wrapper** and an **EnvConfig subclass** with a `create_envs()` override. The remaining required changes are the dependency declaration and documentation; everything else is optional. No changes to `factory.py` are needed.

### Checklist

| File                                     | Required | Why                                                          |
| ---------------------------------------- | -------- | ------------------------------------------------------------ |
| `src/lerobot/envs/<benchmark>.py`        | Yes      | Wraps the simulator as a standard gym.Env                    |
| `src/lerobot/envs/configs.py`            | Yes      | Registers your benchmark and its `create_envs()` for the CLI |
| `src/lerobot/processor/env_processor.py` | Optional | Custom observation/action transforms                         |
| `src/lerobot/envs/utils.py`              | Optional | Only if you need new raw observation keys                    |
| `pyproject.toml`                         | Yes      | Declares benchmark-specific dependencies                     |
| `docs/source/<benchmark>.mdx`            | Yes      | User-facing documentation page                               |
| `docs/source/_toctree.yml`               | Yes      | Adds your page to the docs sidebar                           |
### 1. The gym.Env wrapper (`src/lerobot/envs/<benchmark>.py`)

Create a `gym.Env` subclass that wraps the third-party simulator. The `...` values are placeholders to fill in for your benchmark:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyBenchmarkEnv(gym.Env):
    metadata = {"render_modes": ["rgb_array"], "render_fps": ...}  # your simulator's frame rate

    def __init__(self, task_suite, task_id, **kwargs):
        super().__init__()
        self.task = ...  # short task identifier
        self.task_description = ...  # language instruction for VLA policies
        self._max_episode_steps = ...  # episode length cap used by rollout()
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        ...  # return (observation, info); info must contain {"is_success": False}

    def step(self, action: np.ndarray):
        ...  # return (obs, reward, terminated, truncated, info); info must contain "is_success" (bool)

    def render(self):
        ...  # return RGB image as numpy array

    def close(self):
        ...
```

**GPU-based simulators (e.g. MuJoCo with EGL rendering):** If your simulator allocates GPU/EGL contexts during `__init__`, defer that allocation to a `_ensure_env()` helper called on first `reset()`/`step()`. This avoids inheriting stale GPU handles when `AsyncVectorEnv` spawns worker processes. See `LiberoEnv._ensure_env()` for the pattern.
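
The deferred-allocation idea can be sketched without any simulator. This is a generic lazy-initialization example, not a copy of `LiberoEnv._ensure_env()`: `LazySimEnv` and `_build_simulator()` are hypothetical, and the dict stands in for an object that would hold GPU/EGL resources.

```python
class LazySimEnv:
    """Toy env that creates its (pretend) simulator on first use, not in __init__."""

    def __init__(self, task_id: int):
        self.task_id = task_id
        self._env = None  # nothing expensive is allocated in the parent process

    def _build_simulator(self) -> dict:
        # Placeholder for the expensive call that would allocate GPU/EGL contexts.
        return {"task_id": self.task_id, "steps": 0}

    def _ensure_env(self) -> dict:
        # Build the simulator once, inside whichever process first calls reset()/step().
        if self._env is None:
            self._env = self._build_simulator()
        return self._env

    def reset(self, seed=None):
        sim = self._ensure_env()
        sim["steps"] = 0
        return {"agent_pos": [0.0]}, {"is_success": False}

    def step(self, action):
        sim = self._ensure_env()
        sim["steps"] += 1
        return {"agent_pos": [0.0]}, 0.0, False, False, {"is_success": False}


env = LazySimEnv(task_id=3)
allocated_before_reset = env._env is not None  # False: construction stayed cheap
env.reset()
env.step([0.0])
```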
Also provide a factory function that returns the nested dict structure:

```python
from typing import Any


def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...
```

See `create_libero_envs()` (multi-suite, multi-task) and `create_metaworld_envs()` (difficulty-grouped tasks) for reference.
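
One possible shape for such a factory is sketched below. It assumes `task` is a comma-separated list of suite names and that each suite has a known number of tasks; both are assumptions made for the example, as are the `SUITES` registry, `create_envs_sketch`, and the `make_vec_env` callback. In a real implementation the callback would build a vectorized env holding `n_envs` copies of your `gym.Env`.

```python
from typing import Any, Callable

# Hypothetical registry: suite name -> number of tasks in that suite.
SUITES = {"suite_a": 2, "suite_b": 3}


def create_envs_sketch(
    task: str,
    n_envs: int,
    make_vec_env: Callable[[str, int, int], Any],
) -> dict[str, dict[int, Any]]:
    """Build {suite_name: {task_id: vec_env}} for every suite named in `task`."""
    suite_names = [name.strip() for name in task.split(",") if name.strip()]
    unknown = [name for name in suite_names if name not in SUITES]
    if unknown:
        raise ValueError(f"Unknown suites: {unknown}. Available: {sorted(SUITES)}")
    return {
        suite: {
            task_id: make_vec_env(suite, task_id, n_envs)
            for task_id in range(SUITES[suite])
        }
        for suite in suite_names
    }


# A tuple stands in for the vectorized env that a real callback would return.
envs = create_envs_sketch("suite_a, suite_b", 2, lambda suite, task_id, n: (suite, task_id, n))
```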
### 2. The config (`src/lerobot/envs/configs.py`)

Register a config dataclass so users can select your benchmark with `--env.type=<name>`. Each config owns its environment creation and processor logic via two methods:

- **`create_envs(n_envs, use_async_envs)`**: Returns `{suite: {task_id: VectorEnv}}`. The base class default uses `gym.make()` for single-task envs. Multi-task benchmarks override this.
- **`get_env_processors()`**: Returns `(preprocessor, postprocessor)`. The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.

```python
# Lives in configs.py, which already imports dataclass, field, EnvConfig,
# PolicyFeature, FeatureType, ACTION, OBS_STATE, and OBS_IMAGE.
@EnvConfig.register_subclass("mybenchmark")  # the name users pass to --env.type
@dataclass
class MyBenchmarkEnvConfig(EnvConfig):
    task: str = ""  # default task or suite name
    fps: int = ...  # your simulator's control frequency
    obs_type: str = "pixels_agent_pos"
    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(...,)),  # action dimension
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        ...  # populate features based on obs_type

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

    def create_envs(self, n_envs: int, use_async_envs: bool = True):
        """Override for multi-task benchmarks or custom env creation."""
        from lerobot.envs.mybenchmark import create_mybenchmark_envs

        # Also pass env_cls (sync or async vector env class) based on use_async_envs.
        return create_mybenchmark_envs(task=self.task, n_envs=n_envs, gym_kwargs=self.gym_kwargs)

    def get_env_processors(self):
        """Override if your benchmark needs observation/action transforms."""
        from lerobot.processor import PolicyProcessorPipeline
        from lerobot.processor.env_processor import MyBenchmarkProcessorStep

        return (
            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
            PolicyProcessorPipeline(steps=[]),
        )
```

Key points:

- The `register_subclass` name is what users pass on the CLI (`--env.type=<name>`).
- `features` tells the policy what the environment produces.
- `features_map` maps raw observation keys to LeRobot convention keys.
- **No changes to `factory.py` needed**: the factory delegates to `cfg.create_envs()` and `cfg.get_env_processors()` automatically.
### 3. Env processor (optional: `src/lerobot/processor/env_processor.py`)

Only needed if your benchmark requires observation transforms beyond what `preprocess_observation()` handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from `get_env_processors()` in your config (see step 2):

```python
@dataclass
@ProcessorStepRegistry.register(name="mybenchmark_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()  # shallow copy so the input dict is not mutated
        # your transforms here
        return processed

    def transform_features(self, features):
        return features  # update if shapes change

    def observation(self, observation):
        return self._process_observation(observation)
```

See `LiberoProcessorStep` for a full example (image rotation, quaternion-to-axis-angle conversion).
### 4. Dependencies (`pyproject.toml`)

Add a new optional-dependency group:

```toml
mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]
```

Pinning rules:

- **Always pin** benchmark packages to exact versions for reproducibility (e.g. `metaworld==3.0.0`).
- **Add platform markers** when needed (e.g. `; sys_platform == 'linux'`).
- **Pin fragile transitive deps** if known (e.g. `gymnasium==1.1.0` for Meta-World).
- **Document constraints** in your benchmark doc page.

Users install with:

```bash
pip install -e ".[mybenchmark]"
```

### 5. Documentation (`docs/source/<benchmark>.mdx`)

Write a user-facing page following the template in the "Writing a benchmark doc page" section. See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for full examples.
### 6. Table of contents (`docs/source/_toctree.yml`)

Add your benchmark to the "Benchmarks" section:

```yaml
- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local: <benchmark>
      title: <Benchmark Name>
  title: "Benchmarks"
```
## Verifying your integration

After completing the steps above, confirm that everything works:

1. **Install:** run `pip install -e ".[mybenchmark]"` and verify the dependency group installs cleanly.
2. **Smoke test env creation:** call `make_env()` with your config in Python, check that the returned dict has the expected `{suite: {task_id: VectorEnv}}` shape, and that `reset()` returns observations with the right keys.
3. **Run a full eval:** run `lerobot-eval --env.type=<name> --env.task=<task> --eval.n_episodes=1 --policy.path=<policy>` to exercise the full pipeline end-to-end. (`batch_size` defaults to auto-tuning based on CPU cores; pass `--eval.batch_size=1` to force a single environment.)
4. **Check success detection:** verify that `info["is_success"]` flips to `True` when the task is actually completed. This is what the eval loop uses to compute success rates.
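
For the smoke test in step 2, a shape check along these lines may be handy. It only validates the dict layout, not the envs themselves; `check_env_dict` is a hypothetical helper, and `object()` stands in for the value you would get back from `make_env()`.

```python
from typing import Any


def check_env_dict(envs: Any) -> list[str]:
    """Return a list of problems with the {suite: {task_id: vec_env}} layout (empty if none)."""
    if not isinstance(envs, dict) or not envs:
        return ["top level must be a non-empty dict keyed by suite name"]
    problems = []
    for suite, tasks in envs.items():
        if not isinstance(suite, str):
            problems.append(f"suite key {suite!r} is not a str")
        if not isinstance(tasks, dict) or not tasks:
            problems.append(f"suite {suite!r} must map to a non-empty dict of tasks")
            continue
        for task_id in tasks:
            if not isinstance(task_id, int):
                problems.append(f"task id {task_id!r} in suite {suite!r} is not an int")
    return problems


ok = check_env_dict({"pusht": {0: object()}})
bad = check_env_dict({"pusht": [object()], "other": {"0": object()}})
```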
## Writing a benchmark doc page

Each benchmark `.mdx` page should include:

- **Title and description:** 1-2 paragraphs on what the benchmark tests and why it matters.
- **Links:** paper, GitHub repo, project website (if available).
- **Overview image or GIF.**
- **Available tasks:** table of task suites with counts and brief descriptions.
- **Installation:** `pip install -e ".[<benchmark>]"` plus any extra steps (env vars, system packages).
- **Evaluation:** recommended `lerobot-eval` command with `n_episodes` for reproducible results. `batch_size` defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable.
- **Policy inputs and outputs:** observation keys with shapes, action space description.
- **Recommended evaluation episodes:** how many episodes per task is standard.
- **Training:** example `lerobot-train` command.
- **Reproducing published results:** link to pretrained model, eval command, results table (if available).

See `docs/source/libero.mdx` and `docs/source/metaworld.mdx` for complete examples.