Adding a New Benchmark

This guide walks you through adding a new simulation benchmark to LeRobot. Follow the steps in order and use the existing benchmarks as templates.

A benchmark in LeRobot is a set of Gymnasium environments that wrap a third-party simulator (like LIBERO or Meta-World) behind a standard gym.Env interface. The lerobot-eval CLI then runs evaluation uniformly across all benchmarks.

Existing benchmarks at a glance

Before diving in, here is what is already integrated:

Benchmark	Env file	Config class	Tasks	Action dim	Processor
LIBERO	`envs/libero.py`	`LiberoEnv`	130 across 5 suites	7	`LiberoProcessorStep`
Meta-World	`envs/metaworld.py`	`MetaworldEnv`	50 (MT50)	4	None
IsaacLab Arena	Hub-hosted	`IsaaclabArenaEnv`	Configurable	Configurable	`IsaaclabArenaProcessorStep`

Use src/lerobot/envs/libero.py and src/lerobot/envs/metaworld.py as reference implementations.

How it all fits together

Data flow

During evaluation, data moves through four stages:

1. gym.Env  ──→  raw observations (numpy dicts)

2. Preprocessing  ──→  standard LeRobot keys + task description
   (preprocess_observation in envs/utils.py, env.call("task_description"))

3. Processors  ──→  env-specific then policy-specific transforms
   (env_preprocessor, policy_preprocessor)

4. Policy  ──→  select_action()  ──→  action tensor
   then reverse: policy_postprocessor → env_postprocessor → numpy action → env.step()

Most benchmarks only need to care about stage 1 (producing observations in the right format) and optionally stage 3 (if env-specific transforms are needed).

Environment structure

make_env() returns a nested dict of vectorized environments:

dict[str, dict[int, gym.vector.VectorEnv]]
#    ^suite       ^task_id

A single-task env (e.g. PushT) looks like {"pusht": {0: vec_env}}. A multi-task benchmark (e.g. LIBERO) looks like {"libero_spatial": {0: vec0, 1: vec1, ...}, ...}.
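
As a minimal sketch of how a caller might walk this structure, the snippet below flattens the nested dict into (suite, task_id, vec_env) triples. `iter_tasks` is a hypothetical helper, not part of LeRobot, and plain strings stand in for real `gym.vector.VectorEnv` objects.

```python
from __future__ import annotations

from typing import Any, Iterator


def iter_tasks(envs: dict[str, dict[int, Any]]) -> Iterator[tuple[str, int, Any]]:
    """Yield (suite, task_id, vec_env) triples in a stable, sorted order."""
    for suite in sorted(envs):
        for task_id in sorted(envs[suite]):
            yield suite, task_id, envs[suite][task_id]


# Placeholder strings stand in for gym.vector.VectorEnv instances.
single_task = {"pusht": {0: "vec_env"}}
multi_task = {
    "libero_spatial": {0: "vec0", 1: "vec1"},
    "libero_object": {0: "vec0"},
}

pusht_tasks = [(suite, task_id) for suite, task_id, _ in iter_tasks(single_task)]
libero_tasks = [(suite, task_id) for suite, task_id, _ in iter_tasks(multi_task)]
```

Because both shapes share the same two-level layout, the same loop handles single-task and multi-task benchmarks.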

How evaluation runs

All benchmarks are evaluated the same way by lerobot-eval:

1. make_env() builds the nested {suite: {task_id: VectorEnv}} dict.
2. eval_policy_all() iterates over every suite and task.
3. For each task, it runs n_episodes rollouts via rollout().
4. Results are aggregated hierarchically: episode, task, suite, overall.
5. Metrics include pc_success (success rate), avg_sum_reward, and avg_max_reward.

The critical piece: your env must return info["is_success"] on every step() call. This is how the eval loop knows whether a task was completed.
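
The hierarchical aggregation can be illustrated with a small, dependency-free sketch. This is not the code in eval_policy_all(). It assumes that an episode counts as successful if any step reported is_success, and that pc_success is expressed as a percentage. The helper names `episode_record`, `summarize`, and `aggregate` are hypothetical.

```python
from __future__ import annotations

from statistics import mean


def episode_record(rewards: list[float], successes: list[bool]) -> dict:
    """Collapse one rollout into per-episode statistics."""
    return {
        "sum_reward": sum(rewards),
        "max_reward": max(rewards),
        "success": any(successes),  # assumption: success on any step counts
    }


def summarize(episodes: list[dict]) -> dict:
    """Average a list of episode records into the three metrics."""
    return {
        "pc_success": 100.0 * mean(float(ep["success"]) for ep in episodes),
        "avg_sum_reward": mean(ep["sum_reward"] for ep in episodes),
        "avg_max_reward": mean(ep["max_reward"] for ep in episodes),
    }


def aggregate(results: dict[str, dict[int, list[dict]]]) -> dict:
    """Summarize at task, suite, and overall level."""
    per_task = {
        suite: {task_id: summarize(eps) for task_id, eps in tasks.items()}
        for suite, tasks in results.items()
    }
    per_suite = {
        suite: summarize([ep for eps in tasks.values() for ep in eps])
        for suite, tasks in results.items()
    }
    overall = summarize(
        [ep for tasks in results.values() for eps in tasks.values() for ep in eps]
    )
    return {"per_task": per_task, "per_suite": per_suite, "overall": overall}


results = {
    "suite_a": {
        0: [
            episode_record([0.0, 1.0], [False, True]),
            episode_record([0.0, 0.0], [False, False]),
        ]
    },
    "suite_b": {0: [episode_record([0.5, 0.5], [False, True])]},
}
summary = aggregate(results)
```

An environment that never sets is_success would show up here as a 0% success rate even when the policy solves the task, which is why the flag matters.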

What your environment must provide

LeRobot does not enforce a strict observation schema. Instead it relies on a set of conventions that all benchmarks follow.

Env attributes

Your gym.Env must set these attributes:

Attribute	Type	Why
`_max_episode_steps`	`int`	`rollout()` uses this to cap episode length
`task_description`	`str`	Passed to VLA policies as a language instruction
`task`	`str`	Fallback identifier if `task_description` is not set
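
A quick way to catch a missing attribute early is a small check like the one below. It is a sketch only: `missing_env_attributes` is a hypothetical helper, and the two stub classes stand in for real gym.Env subclasses.

```python
# Attribute names and types taken from the table above.
REQUIRED_ATTRS = {
    "_max_episode_steps": int,
    "task_description": str,
    "task": str,
}


def missing_env_attributes(env) -> list[str]:
    """Return the names of required attributes that are absent or mistyped."""
    problems = []
    for name, expected_type in REQUIRED_ATTRS.items():
        if not isinstance(getattr(env, name, None), expected_type):
            problems.append(name)
    return problems


class CompleteEnv:
    _max_episode_steps = 200
    task_description = "pick up the red block"
    task = "pick_red_block"


class IncompleteEnv:
    _max_episode_steps = 200
    task = "pick_red_block"  # task_description is intentionally missing


complete_report = missing_env_attributes(CompleteEnv())
incomplete_report = missing_env_attributes(IncompleteEnv())
```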

Success reporting

Your step() and reset() must include "is_success" in the info dict:

info = {"is_success": True}   # or False
return observation, reward, terminated, truncated, info

Observations

The simplest approach is to map your simulator's outputs to the standard keys that preprocess_observation() already understands. Do this inside your gym.Env (e.g. in a _format_raw_obs() helper):

Your env should output	LeRobot maps it to	What it is
`"pixels"` (single array)	`observation.image`	Single camera image, HWC uint8
`"pixels"` (dict)	`observation.images.`	Multiple cameras, each HWC uint8
`"agent_pos"`	`observation.state`	Proprioceptive state vector
`"environment_state"`	`observation.env_state`	Full environment state (e.g. PushT)
`"robot_state"`	`observation.robot_state`	Nested robot state dict (e.g. LIBERO)

If your simulator uses different key names, you have two options:

Recommended: Rename them to the standard keys inside your gym.Env wrapper.
Alternative: Write an env processor to transform observations after preprocess_observation() runs (see step 3 below).
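
The recommended option might look like the sketch below. The simulator key names (`front_rgb`, `wrist_rgb`, `joint_positions`) and the camera names are invented for illustration, and nested lists stand in for the HWC uint8 arrays a real simulator would return.

```python
from __future__ import annotations


def format_raw_obs(raw: dict, camera_keys: dict[str, str], state_key: str) -> dict:
    """Rename simulator outputs to the standard keys; values pass through unchanged."""
    return {
        # Multiple cameras go in a dict under "pixels", keyed by camera name.
        "pixels": {camera: raw[sim_key] for camera, sim_key in camera_keys.items()},
        # The proprioceptive state vector goes under "agent_pos".
        "agent_pos": raw[state_key],
    }


# Hypothetical raw simulator output (tiny 1x1 "images" for brevity).
raw_obs = {
    "front_rgb": [[[0, 0, 0]]],
    "wrist_rgb": [[[255, 255, 255]]],
    "joint_positions": [0.1, 0.2, 0.3],
}

formatted = format_raw_obs(
    raw_obs,
    camera_keys={"front": "front_rgb", "wrist": "wrist_rgb"},
    state_key="joint_positions",
)
```

Doing the rename inside the wrapper keeps the rest of the pipeline unaware of the simulator's naming.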

Actions

Actions are continuous numpy arrays in a gym.spaces.Box. The dimensionality depends on your benchmark (7 for LIBERO, 4 for Meta-World, etc.). Policies adapt to different action dimensions through their input_features / output_features config.

Feature declaration

Each EnvConfig subclass declares two dicts that tell the policy what to expect:

features — maps feature names to PolicyFeature(type, shape) (e.g. action dim, image shape).
features_map — maps raw observation keys to LeRobot convention keys (e.g. "agent_pos" to "observation.state").
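
To make the relationship between the two dicts concrete, here is a simplified sketch in which plain shape tuples stand in for PolicyFeature objects. The shapes are made up, the string values of the key constants are assumed, and `to_policy_features` is a hypothetical helper rather than a LeRobot function.

```python
from __future__ import annotations

# Stand-in for `features`: raw key -> shape (the real dict holds PolicyFeature objects).
features = {
    "action": (7,),
    "agent_pos": (8,),
    "pixels": (256, 256, 3),
}

# Stand-in for `features_map`: raw key -> LeRobot convention key.
features_map = {
    "action": "action",
    "agent_pos": "observation.state",
    "pixels": "observation.image",
}


def to_policy_features(features: dict[str, tuple], features_map: dict[str, str]) -> dict[str, tuple]:
    """Re-key the declared features using the LeRobot convention names."""
    return {features_map[raw_key]: shape for raw_key, shape in features.items()}


policy_features = to_policy_features(features, features_map)
```

Every key in `features` needs an entry in `features_map`; in this sketch a missing entry raises a KeyError.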

Step by step

The core of an integration is two code files: a gym.Env wrapper and an EnvConfig subclass with a create_envs() override. The remaining checklist items are an optional processor, the dependency declaration, and documentation. No changes to factory.py are needed.

Checklist

File	Required	Why
`src/lerobot/envs/.py`	Yes	Wraps the simulator as a standard gym.Env
`src/lerobot/envs/configs.py`	Yes	Registers your benchmark and its `create_envs()` for the CLI
`src/lerobot/processor/env_processor.py`	Optional	Custom observation/action transforms
`src/lerobot/envs/utils.py`	Optional	Only if you need new raw observation keys
`pyproject.toml`	Yes	Declares benchmark-specific dependencies
`docs/source/.mdx`	Yes	User-facing documentation page
`docs/source/_toctree.yml`	Yes	Adds your page to the docs sidebar

1. The gym.Env wrapper (`src/lerobot/envs/.py`)

Create a gym.Env subclass that wraps the third-party simulator:

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyBenchmarkEnv(gym.Env):
    # Replace each `...` placeholder with values for your simulator.
    metadata = {"render_modes": ["rgb_array"], "render_fps": ...}

    def __init__(self, task_suite, task_id, **kwargs):
        super().__init__()
        self.task = ...  # short task identifier (str)
        self.task_description = ...  # language instruction for VLA policies (str)
        self._max_episode_steps = ...  # episode length cap used by rollout() (int)
        self.observation_space = spaces.Dict({...})
        self.action_space = spaces.Box(low=..., high=..., shape=(...,), dtype=np.float32)

    def reset(self, seed=None, **kwargs):
        ...  # return (observation, info); info must contain {"is_success": False}

    def step(self, action: np.ndarray):
        ...  # return (obs, reward, terminated, truncated, info); info must contain "is_success"

    def render(self):
        ...  # return RGB image as numpy array

    def close(self):
        ...

GPU-based simulators (e.g. MuJoCo with EGL rendering): If your simulator allocates GPU/EGL contexts during __init__, defer that allocation to a _ensure_env() helper called on first reset()/step(). This avoids inheriting stale GPU handles when AsyncVectorEnv spawns worker processes. See LiberoEnv._ensure_env() for the pattern.
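
The deferred-allocation idea can be sketched without any simulator at all. The class below is a toy, not LiberoEnv: `make_sim` is a hypothetical factory standing in for whatever call creates the GPU/EGL context, and the dict it returns stands in for a simulator handle.

```python
class LazySimEnv:
    """Toy env that builds its simulator on first use instead of in __init__."""

    def __init__(self, make_sim):
        self._make_sim = make_sim  # store the factory only; nothing is allocated yet
        self._sim = None

    def _ensure_env(self):
        # Allocate once, in whichever process first calls reset() or step().
        if self._sim is None:
            self._sim = self._make_sim()
        return self._sim

    def reset(self, seed=None):
        sim = self._ensure_env()
        return {"agent_pos": list(sim["initial_state"])}, {"is_success": False}


creations = []


def make_fake_sim():
    """Hypothetical stand-in for an expensive simulator constructor."""
    creations.append("created")
    return {"initial_state": [0.0, 0.0]}


env = LazySimEnv(make_fake_sim)
created_after_init = len(creations)  # still 0: construction allocated nothing
env.reset()
env.reset()
created_after_resets = len(creations)  # 1: allocated once, then reused
```

With this shape, constructing the env in the parent process stays cheap, and the expensive handle is only created inside the worker that actually uses it.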

Also provide a factory function that returns the nested dict structure:

def create_mybenchmark_envs(
    task: str,
    n_envs: int,
    gym_kwargs: dict | None = None,
    env_cls: type | None = None,
) -> dict[str, dict[int, Any]]:
    """Create {suite_name: {task_id: VectorEnv}} for MyBenchmark."""
    ...

See create_libero_envs() (multi-suite, multi-task) and create_metaworld_envs() (difficulty-grouped tasks) for reference.
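
A stripped-down sketch of such a factory is shown below. It assumes that `task` is a comma-separated list of suite names, which may not match your benchmark. `tasks_per_suite` and `make_vec_env` are hypothetical parameters introduced so the example runs without a simulator.

```python
from __future__ import annotations

from typing import Any, Callable


def create_toy_envs(
    task: str,
    n_envs: int,
    tasks_per_suite: dict[str, int],
    make_vec_env: Callable[[str, int, int], Any],
) -> dict[str, dict[int, Any]]:
    """Build {suite_name: {task_id: vec_env}} from a comma-separated suite list."""
    suites = [name.strip() for name in task.split(",") if name.strip()]
    return {
        suite: {
            task_id: make_vec_env(suite, task_id, n_envs)
            for task_id in range(tasks_per_suite[suite])
        }
        for suite in suites
    }


# A tuple stands in for the vectorized env that a real factory would build.
toy_envs = create_toy_envs(
    "suite_a, suite_b",
    n_envs=2,
    tasks_per_suite={"suite_a": 2, "suite_b": 1},
    make_vec_env=lambda suite, task_id, n: (suite, task_id, n),
)
```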

2. The config (`src/lerobot/envs/configs.py`)

Register a config dataclass so users can select your benchmark with --env.type=. Each config owns its environment creation and processor logic via two methods:

create_envs(n_envs, use_async_envs) — Returns {suite: {task_id: VectorEnv}}. The base class default uses gym.make() for single-task envs. Multi-task benchmarks override this.
get_env_processors() — Returns (preprocessor, postprocessor). The base class default returns identity (no-op) pipelines. Override if your benchmark needs observation/action transforms.

@EnvConfig.register_subclass("")
@dataclass
class MyBenchmarkEnvConfig(EnvConfig):
    task: str = ""
    fps: int = 
    obs_type: str = "pixels_agent_pos"

    features: dict[str, PolicyFeature] = field(default_factory=lambda: {
        ACTION: PolicyFeature(type=FeatureType.ACTION, shape=(,)),
    })
    features_map: dict[str, str] = field(default_factory=lambda: {
        ACTION: ACTION,
        "agent_pos": OBS_STATE,
        "pixels": OBS_IMAGE,
    })

    def __post_init__(self):
        ...  # populate features based on obs_type

    @property
    def gym_kwargs(self) -> dict:
        return {"obs_type": self.obs_type, "render_mode": self.render_mode}

    def create_envs(self, n_envs: int, use_async_envs: bool = True):
        """Override for multi-task benchmarks or custom env creation."""
        from lerobot.envs. import create__envs
        return create__envs(task=self.task, n_envs=n_envs, ...)

    def get_env_processors(self):
        """Override if your benchmark needs observation/action transforms."""
        from lerobot.processor import PolicyProcessorPipeline
        from lerobot.processor.env_processor import MyBenchmarkProcessorStep
        return (
            PolicyProcessorPipeline(steps=[MyBenchmarkProcessorStep()]),
            PolicyProcessorPipeline(steps=[]),
        )

Key points:

The register_subclass name is what users pass on the CLI (--env.type=).
features tells the policy what the environment produces.
features_map maps raw observation keys to LeRobot convention keys.
No changes to factory.py needed — the factory delegates to cfg.create_envs() and cfg.get_env_processors() automatically.
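
The delegation described in the last point can be pictured with a toy registry. Nothing here is LeRobot code: `_REGISTRY`, `register_subclass`, `ToyEnvConfig`, and this `make_env` are simplified stand-ins that only illustrate why a registered config needs no factory changes.

```python
from __future__ import annotations

from dataclasses import dataclass

_REGISTRY: dict[str, type] = {}


def register_subclass(name: str):
    """Toy decorator: remember a config class under its CLI name."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


@dataclass
class ToyEnvConfig:
    task: str = ""

    def create_envs(self, n_envs: int) -> dict:
        # Default: a single task under a single suite.
        return {self.task: {0: f"vec_env(n_envs={n_envs})"}}


@register_subclass("toybench")
@dataclass
class ToyBenchConfig(ToyEnvConfig):
    task: str = "toy_suite"

    def create_envs(self, n_envs: int) -> dict:
        # Multi-task override: two tasks in one suite.
        return {self.task: {i: f"vec_env(n_envs={n_envs})" for i in range(2)}}


def make_env(env_type: str, n_envs: int) -> dict:
    """Toy factory: look up the config by name and delegate to it."""
    cfg = _REGISTRY[env_type]()
    return cfg.create_envs(n_envs)


toybench_envs = make_env("toybench", n_envs=4)
```

The factory never mentions `ToyBenchConfig` by name, so adding another benchmark only requires another registered class.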

3. Env processor (optional — `src/lerobot/processor/env_processor.py`)

Only needed if your benchmark requires observation transforms beyond what preprocess_observation() handles (e.g. image flipping, coordinate conversion). Define the processor step here and return it from get_env_processors() in your config (see step 2):

@dataclass
@ProcessorStepRegistry.register(name="_processor")
class MyBenchmarkProcessorStep(ObservationProcessorStep):
    def _process_observation(self, observation):
        processed = observation.copy()
        # your transforms here
        return processed

    def transform_features(self, features):
        return features  # update if shapes change

    def observation(self, observation):
        return self._process_observation(observation)

See LiberoProcessorStep for a full example (image rotation, quaternion-to-axis-angle conversion).
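
For intuition, the sketch below applies a 180-degree image rotation to one observation key. It is not LiberoProcessorStep: `ToyRotateStep` is a hypothetical plain class with no registry or base class, and nested lists stand in for the image tensors a real step would receive.

```python
def rotate_180(image: list) -> list:
    """Rotate an image (rows of pixels) by 180 degrees: reverse rows and columns."""
    return [row[::-1] for row in image[::-1]]


class ToyRotateStep:
    """Toy observation transform that mirrors the copy-then-modify pattern."""

    image_key = "observation.image"

    def observation(self, observation: dict) -> dict:
        processed = observation.copy()  # shallow copy: do not mutate the caller's dict
        processed[self.image_key] = rotate_180(observation[self.image_key])
        return processed


original = {"observation.image": [[1, 2], [3, 4]], "observation.state": [0.5]}
rotated = ToyRotateStep().observation(original)
```

A rotation keeps the image shape, so a step like this would leave the declared features unchanged; a transform that changes shapes would also need to update them.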

4. Dependencies (`pyproject.toml`)

Add a new optional-dependency group:

mybenchmark = ["my-benchmark-pkg==1.2.3", "lerobot[scipy-dep]"]

Pinning rules:

Always pin benchmark packages to exact versions for reproducibility (e.g. metaworld==3.0.0).
Add platform markers when needed (e.g. ; sys_platform == 'linux').
Pin fragile transitive deps if known (e.g. gymnasium==1.1.0 for Meta-World).
Document constraints in your benchmark doc page.

Users install with:

pip install -e ".[mybenchmark]"

5. Documentation (`docs/source/.mdx`)

Write a user-facing page following the template in the next section. See docs/source/libero.mdx and docs/source/metaworld.mdx for full examples.

6. Table of contents (`docs/source/_toctree.yml`)

Add your benchmark to the "Benchmarks" section:

- sections:
    - local: libero
      title: LIBERO
    - local: metaworld
      title: Meta-World
    - local: envhub_isaaclab_arena
      title: NVIDIA IsaacLab Arena Environments
    - local: 
      title: 
  title: "Benchmarks"

Verifying your integration

After completing the steps above, confirm that everything works:

Install — pip install -e ".[mybenchmark]" and verify the dependency group installs cleanly.
Smoke test env creation — call make_env() with your config in Python, check that the returned dict has the expected {suite: {task_id: VectorEnv}} shape, and that reset() returns observations with the right keys.
Run a full eval — lerobot-eval --env.type= --env.task= --eval.n_episodes=1 --policy.path= to exercise the full pipeline end-to-end. (batch_size defaults to auto-tuning based on CPU cores; pass --eval.batch_size=1 to force a single environment.)
Check success detection — verify that info["is_success"] flips to True when the task is actually completed. This is what the eval loop uses to compute success rates.
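
The success-detection check can be rehearsed on a toy environment before pointing it at a real one. Everything below is illustrative: `ToyReachEnv` is a made-up one-dimensional task with gym-style method signatures, and `first_success_step` is a hypothetical helper.

```python
class ToyReachEnv:
    """Made-up 1-D task: succeed once the position reaches the goal."""

    def __init__(self, goal: int = 3, max_steps: int = 10):
        self._max_episode_steps = max_steps
        self.goal = goal
        self.position = 0

    def reset(self, seed=None):
        self.position = 0
        return {"agent_pos": [self.position]}, {"is_success": False}

    def step(self, action: int):
        self.position += action
        success = self.position >= self.goal
        obs = {"agent_pos": [self.position]}
        return obs, float(success), success, False, {"is_success": success}


def first_success_step(env, actions):
    """Return the 1-based step at which is_success first becomes True, else None."""
    _, info = env.reset()
    if "is_success" not in info:
        raise KeyError("reset() info is missing 'is_success'")
    for step_index, action in enumerate(actions, start=1):
        _, _, terminated, truncated, info = env.step(action)
        if info["is_success"]:
            return step_index
        if terminated or truncated:
            break
    return None


solved_at = first_success_step(ToyReachEnv(), [1, 1, 1, 1])
never_solved = first_success_step(ToyReachEnv(), [0, 0])
```

Running the same helper against a scripted or expert action sequence in your own environment shows whether the flag flips at the moment the task is actually completed.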

Writing a benchmark doc page

Each benchmark .mdx page should include:

Title and description — 1-2 paragraphs on what the benchmark tests and why it matters.
Links — paper, GitHub repo, project website (if available).
Overview image or GIF.
Available tasks — table of task suites with counts and brief descriptions.
Installation — pip install -e ".[]" plus any extra steps (env vars, system packages).
Evaluation — recommended lerobot-eval command with n_episodes for reproducible results. batch_size defaults to auto; only specify it if needed. Include single-task and multi-task examples if applicable.
Policy inputs and outputs — observation keys with shapes, action space description.
Recommended evaluation episodes — how many episodes per task is standard.
Training — example lerobot-train command.
Reproducing published results — link to pretrained model, eval command, results table (if available).

See docs/source/libero.mdx and docs/source/metaworld.mdx for complete examples.
