Environment Input/Output
In RoboMME, a key difference from traditional Gym-like environments is that every observation value is a list rather than a single item. Some RoboMME tasks take a conditioning video as input, and for discrete action types (e.g. waypoint or multi_choice) the environment also returns the intermediate observations, which video-based policy models can use.
Env Input Format
We support four ACTION_SPACE types:
- joint_angle: 7 joint angles + gripper open/close
- ee_pose: 3 position (xyz) + 3 rotation (rpy) + gripper open/close
- waypoint: Same format as ee_pose, but executed in discrete keyframe steps
- multi_choice: Command dict, e.g. {"choice": "A", "point": [y, x]}. The available choices are listed in info["available_multi_choices"], and point is a pixel location on the front image. This action is designed for Video-QA research.
Note: Gripper closed is -1, gripper open is 1.
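As a rough illustration of the two continuous formats above, the sketch below assembles joint_angle and ee_pose actions as flat Python lists. The helper names are hypothetical, and placing the gripper command as the last element is an assumption based on the order in which the fields are listed; neither is confirmed by the RoboMME API, so check the expected layout before relying on it.

```python
GRIPPER_OPEN = 1.0     # from the note above: open is 1
GRIPPER_CLOSED = -1.0  # from the note above: closed is -1


def make_joint_angle_action(joint_angles, gripper_open):
    """Hypothetical helper: 7 joint angles followed by one gripper command."""
    if len(joint_angles) != 7:
        raise ValueError("joint_angle expects exactly 7 joint angles")
    gripper = GRIPPER_OPEN if gripper_open else GRIPPER_CLOSED
    # Assumption: the gripper value is the last element of the action.
    return [float(a) for a in joint_angles] + [gripper]


def make_ee_pose_action(xyz, rpy, gripper_open):
    """Hypothetical helper: position (xyz), rotation (rpy), then gripper."""
    if len(xyz) != 3 or len(rpy) != 3:
        raise ValueError("ee_pose expects 3 position and 3 rotation values")
    gripper = GRIPPER_OPEN if gripper_open else GRIPPER_CLOSED
    # Assumption: same ordering as listed above, gripper last.
    return [float(v) for v in xyz] + [float(v) for v in rpy] + [gripper]
```

Under the same assumption, a waypoint action would be built with make_ee_pose_action, since it shares the ee_pose format.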
Env Output Format
When calling the step function:
obs, reward, terminated, truncated, info = env.step(action)
| Return | Description | Typical type |
|---|---|---|
| obs | Observation dict | dict[str, list] |
| info | Info dict | dict[str, Any] |
| reward | Reward value (not used) | scalar tensor |
| terminated | Termination flag | scalar boolean tensor |
| truncated | Truncation flag | scalar boolean tensor |
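A minimal rollout loop built on these return values might look like the sketch below. The run_episode function and the policy callable are hypothetical names introduced for illustration. The loop ignores reward, since it is not used, and converts the termination flags with bool(), which works for one-element tensors as well as plain booleans.

```python
def run_episode(env, policy, max_steps=1000):
    """Step `env` with `policy` until termination, truncation, or max_steps.

    `policy` is a hypothetical callable mapping (obs, info) to an action.
    Returns the last observation dict, the last info dict, and the step count.
    """
    obs, info = env.reset()
    steps = 0
    for _ in range(max_steps):
        action = policy(obs, info)
        obs, reward, terminated, truncated, info = env.step(action)
        steps += 1
        # The flags are scalar boolean tensors; bool() turns a one-element
        # tensor (or a plain Python bool) into a Python bool.
        if bool(terminated) or bool(truncated):
            break
    return obs, info, steps
```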
obs dict
| Key | Meaning | Typical content |
|---|---|---|
| maniskill_obs | The original raw env observation from ManiSkill | Raw observation dict |
| front_rgb_list | Front camera RGB List | Image frames, e.g. (H, W, 3) |
| wrist_rgb_list | Wrist camera RGB List | Image frames, e.g. (H, W, 3) |
| front_depth_list | Front camera depth List | Depth map, e.g. (H, W, 1) |
| wrist_depth_list | Wrist camera depth List | Depth map, e.g. (H, W, 1) |
| eef_state_list | End-effector state List | [x, y, z, roll, pitch, yaw] |
| joint_state_list | Robot joint state List | Joint vector, often 7-D |
| gripper_state_list | Robot gripper state List | 2-D |
| front_camera_extrinsic_list | Front camera extrinsic List | Camera extrinsic matrix |
| wrist_camera_extrinsic_list | Wrist camera extrinsic List | Camera extrinsic matrix |
To use only the current (latest) observation, use obs[key][-1].
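For policies that consume a single frame, it can be convenient to collapse the whole observation dict at once. The helper below is a hypothetical convenience function, not part of RoboMME. It assumes that non-list values, such as the raw maniskill_obs dict, should be passed through unchanged.

```python
def latest_observation(obs):
    """Return a dict holding only the most recent entry of each list-valued key."""
    latest = {}
    for key, value in obs.items():
        if isinstance(value, (list, tuple)) and len(value) > 0:
            latest[key] = value[-1]
        else:
            # Assumption: non-list or empty entries (e.g. the raw
            # maniskill_obs dict) are passed through unchanged.
            latest[key] = value
    return latest
```

For example, latest_observation(obs)["front_rgb_list"] gives the same frame as obs["front_rgb_list"][-1].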
Optional field switches (include_*)
BenchmarkEnvBuilder.make_env_for_episode(...) controls optional observation/info fields through include_* flags.
Default behavior:
- All include_* flags default to False.
- Without extra flags, env returns RGB + state related fields only.
Mapping:
| Flag | Added key |
|---|---|
| include_maniskill_obs | obs["maniskill_obs"] |
| include_front_depth | obs["front_depth_list"] |
| include_wrist_depth | obs["wrist_depth_list"] |
| include_front_camera_extrinsic | obs["front_camera_extrinsic_list"] |
| include_wrist_camera_extrinsic | obs["wrist_camera_extrinsic_list"] |
| include_available_multi_choices | info["available_multi_choices"] |
| include_front_camera_intrinsic | info["front_camera_intrinsic"] |
| include_wrist_camera_intrinsic | info["wrist_camera_intrinsic"] |
Special case:
- If action_space="multi_choice", front camera parameters are forced on internally, even if the corresponding include_front_camera_* flags are False:
  - front_camera_extrinsic_list
  - front_camera_intrinsic
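The flag mapping and the multi_choice special case can be summarized in a small bookkeeping helper, for example to check which optional keys a downstream pipeline should expect. This is a sketch that mirrors the table above. The expected_optional_keys function and the two dict constants are hypothetical and are not part of the RoboMME package.

```python
# Flag name -> key added to the obs dict (copied from the mapping table).
OBS_FLAGS = {
    "include_maniskill_obs": "maniskill_obs",
    "include_front_depth": "front_depth_list",
    "include_wrist_depth": "wrist_depth_list",
    "include_front_camera_extrinsic": "front_camera_extrinsic_list",
    "include_wrist_camera_extrinsic": "wrist_camera_extrinsic_list",
}

# Flag name -> key added to the info dict (copied from the mapping table).
INFO_FLAGS = {
    "include_available_multi_choices": "available_multi_choices",
    "include_front_camera_intrinsic": "front_camera_intrinsic",
    "include_wrist_camera_intrinsic": "wrist_camera_intrinsic",
}


def expected_optional_keys(action_space, **flags):
    """Hypothetical helper: return (obs_keys, info_keys) enabled by the flags."""
    obs_keys = {key for flag, key in OBS_FLAGS.items() if flags.get(flag, False)}
    info_keys = {key for flag, key in INFO_FLAGS.items() if flags.get(flag, False)}
    if action_space == "multi_choice":
        # Special case: front camera parameters are forced on internally.
        obs_keys.add("front_camera_extrinsic_list")
        info_keys.add("front_camera_intrinsic")
    return obs_keys, info_keys
```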
Example:
from robomme.env_record_wrapper import BenchmarkEnvBuilder

builder = BenchmarkEnvBuilder(
    env_id="VideoUnmaskSwap",
    dataset="test",
    action_space="joint_angle",
    gui_render=False,
)
env = builder.make_env_for_episode(
    episode_idx=0,
    max_steps=1000,
    include_maniskill_obs=False,
    include_front_depth=True,
    include_wrist_depth=False,
    include_front_camera_extrinsic=True,
    include_wrist_camera_extrinsic=False,
    include_available_multi_choices=False,
    include_front_camera_intrinsic=True,
    include_wrist_camera_intrinsic=False,
)
obs, info = env.reset()
info dict
| Key | Meaning | Typical content |
|---|---|---|
| task_goal | Task goal list | list[str] |
| simple_subgoal_online | Oracle online simple subgoal | Description of the current simple subgoal |
| grounded_subgoal_online | Oracle online grounded subgoal | Description of the current grounded subgoal |
| available_multi_choices | Currently available options for the multi-choice action | List of dicts, e.g. {"label": "a/b/...", "action": str, "need_parameter": bool}; need_parameter means the action needs grounding info such as [y, x] |
| front_camera_intrinsic | Front camera intrinsic | Camera intrinsic matrix |
| wrist_camera_intrinsic | Wrist camera intrinsic | Camera intrinsic matrix |
| status | Status flag | One of success, fail, timeout, ongoing, error |
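To show how info["available_multi_choices"] could feed a multi_choice action, the sketch below validates a chosen label against the option list and attaches a pixel location when the option needs one. The build_multi_choice_action helper is hypothetical. It assumes that the "choice" value is the option's label matched exactly (the source shows "A" in the action example and "a/b/..." in the option list, so the expected casing is unclear) and that "point" may be omitted when need_parameter is False. Neither assumption is confirmed by the source.

```python
def build_multi_choice_action(available, label, point=None):
    """Hypothetical helper: build a multi_choice command dict.

    `available` is the list from info["available_multi_choices"];
    `point` is a (y, x) pixel location on the front image.
    """
    for option in available:
        if option["label"] == label:
            break
    else:
        raise ValueError(f"label {label!r} is not among the available choices")

    # Assumption: the env expects the option label as the "choice" value.
    action = {"choice": label}
    if point is not None:
        y, x = point
        action["point"] = [int(y), int(x)]
    elif option["need_parameter"]:
        raise ValueError(f"choice {label!r} needs a [y, x] point")
    return action
```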