| # Environment Input/Output | |
| In RoboMME, a key difference from traditional Gym-like envs is that every observation value is a **list** rather than a single item. Some RoboMME tasks use conditioning video input, and for discrete action types (e.g. `waypoint` or `multi_choice`) the env also returns intermediate observations for potential use with video-based policy models. | |
| ## Env Input Format | |
| We support four `ACTION_SPACE` types: | |
| - `joint_angle`: 7 joint angles + gripper open/close | |
| - `ee_pose`: 3 position (xyz) + 3 rotation (rpy) + gripper open/close | |
| - `waypoint`: Same format as `ee_pose`, but executed in discrete keyframe steps | |
| - `multi_choice`: Command dict, e.g. `{"choice": "A", "point": [y, x]}`, where `point` is the pixel location on the front image. The available choices are listed in `info["available_multi_choices"]`. This action is designed for Video-QA research. | |
| Note: Gripper closed is -1, gripper open is 1. | |
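As a rough illustration of the formats above, the sketch below builds one action per action space from plain Python lists. The constants and helper functions are hypothetical and not part of RoboMME. It assumes the gripper command is the last element and that `point` may be omitted. The array type that `env.step` actually expects is not specified here.

```python
# Gripper convention from the note above: closed is -1, open is 1.
GRIPPER_CLOSED = -1.0
GRIPPER_OPEN = 1.0


def make_joint_angle_action(joint_angles, gripper):
    """7 joint angles followed by one gripper command (assumed ordering)."""
    if len(joint_angles) != 7:
        raise ValueError("expected 7 joint angles")
    return [float(q) for q in joint_angles] + [float(gripper)]


def make_ee_pose_action(xyz, rpy, gripper):
    """3 position values, 3 rotation values, then the gripper command."""
    if len(xyz) != 3 or len(rpy) != 3:
        raise ValueError("expected 3 position and 3 rotation values")
    return [float(v) for v in xyz] + [float(v) for v in rpy] + [float(gripper)]


def make_multi_choice_action(choice, point=None):
    """Command dict; `point` is a [y, x] pixel location on the front image."""
    action = {"choice": choice}
    if point is not None:
        y, x = point
        action["point"] = [y, x]
    return action


joint_action = make_joint_angle_action([0.0] * 7, GRIPPER_OPEN)
ee_action = make_ee_pose_action([0.4, 0.0, 0.2], [0.0, 3.14, 0.0], GRIPPER_CLOSED)
choice_action = make_multi_choice_action("A", point=[120, 200])
```

A `waypoint` action would reuse `make_ee_pose_action`, since it shares the `ee_pose` format.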
| ## Env Output Format | |
| When calling the `step` function: | |
| ```python | |
| obs, reward, terminated, truncated, info = env.step(action) | |
| ``` | |
| | Return | Description | Typical type | | |
| |--------|-------------|--------------| | |
| | `obs` | Observation dict | `dict[str, list]` | | |
| | `info` | Info dict | `dict[str, Any]` | | |
| | `reward` | Reward value (not used) | scalar tensor | | |
| | `terminated` | Termination flag | scalar boolean tensor | | |
| | `truncated` | Truncation flag | scalar boolean tensor | | |
| ### `obs` dict | |
| | Key | Meaning | Typical content | | |
| |-----|---------|-----------------| | |
| | `maniskill_obs` | The original raw env observation from ManiSkill | Raw observation dict | | |
| | `front_rgb_list` | Front camera RGB List | Image frames, e.g. `(H, W, 3)` | | |
| | `wrist_rgb_list` | Wrist camera RGB List | Image frames, e.g. `(H, W, 3)` | | |
| | `front_depth_list` | Front camera depth List | Depth map, e.g. `(H, W, 1)` | | |
| | `wrist_depth_list` | Wrist camera depth List | Depth map, e.g. `(H, W, 1)` | | |
| | `eef_state_list` | End-effector state List | `[x, y, z, roll, pitch, yaw]` | | |
| | `joint_state_list` | Robot joint state List | Joint vector, often 7-D | | |
| | `gripper_state_list` | Robot gripper state List | 2-D | | |
| | `front_camera_extrinsic_list` | Front camera extrinsic List | Camera extrinsic matrix | | |
| | `wrist_camera_extrinsic_list` | Wrist camera extrinsic List | Camera extrinsic matrix | | |
| To use only the current (latest) observation, use `obs[key][-1]`. | |
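To apply this to the whole dict at once, a small helper can keep the last entry of every `*_list` field. This is a minimal sketch: `latest_observation` is a hypothetical helper, and the mock `obs` holds placeholder values rather than real env output.

```python
def latest_observation(obs):
    """Return the most recent entry of every non-empty `*_list` field."""
    return {
        key: frames[-1]
        for key, frames in obs.items()
        if key.endswith("_list") and len(frames) > 0
    }


# Mock observation with three intermediate steps (placeholder values).
obs = {
    "eef_state_list": [
        [0.40, 0.00, 0.20, 0.0, 3.14, 0.0],
        [0.41, 0.00, 0.19, 0.0, 3.14, 0.0],
        [0.42, 0.00, 0.18, 0.0, 3.14, 0.0],
    ],
    "gripper_state_list": [[0.04, 0.04], [0.03, 0.03], [0.02, 0.02]],
    "maniskill_obs": {"raw": "not a *_list field, so it is skipped"},
}

current = latest_observation(obs)
```

The helper filters on the `_list` suffix, so `maniskill_obs` is left out. Read it from `obs` directly if needed.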
| ### Optional field switches (`include_*`) | |
| `BenchmarkEnvBuilder.make_env_for_episode(...)` controls optional observation/info fields through `include_*` flags. | |
| Default behavior: | |
| - All `include_*` flags default to `False`. | |
| - Without extra flags, the env returns only RGB and state-related fields. | |
| Mapping: | |
| | Flag | Added key | | |
| |------|-----------| | |
| | `include_maniskill_obs` | `obs["maniskill_obs"]` | | |
| | `include_front_depth` | `obs["front_depth_list"]` | | |
| | `include_wrist_depth` | `obs["wrist_depth_list"]` | | |
| | `include_front_camera_extrinsic` | `obs["front_camera_extrinsic_list"]` | | |
| | `include_wrist_camera_extrinsic` | `obs["wrist_camera_extrinsic_list"]` | | |
| | `include_available_multi_choices` | `info["available_multi_choices"]` | | |
| | `include_front_camera_intrinsic` | `info["front_camera_intrinsic"]` | | |
| | `include_wrist_camera_intrinsic` | `info["wrist_camera_intrinsic"]` | | |
| Special case: | |
| - If `action_space="multi_choice"`, the front camera parameters are forced on internally, even if the corresponding `include_front_camera_*` flags are `False`: | |
|   - `obs["front_camera_extrinsic_list"]` | |
|   - `info["front_camera_intrinsic"]` | |
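The mapping and the `multi_choice` special case can be summarized in a few lines of Python. This sketch only restates the rules documented here: `expected_optional_keys` is a hypothetical helper, not part of `BenchmarkEnvBuilder`, and it does not inspect a real env.

```python
# Flag name -> (container, key), copied from the mapping table above.
FLAG_TO_KEY = {
    "include_maniskill_obs": ("obs", "maniskill_obs"),
    "include_front_depth": ("obs", "front_depth_list"),
    "include_wrist_depth": ("obs", "wrist_depth_list"),
    "include_front_camera_extrinsic": ("obs", "front_camera_extrinsic_list"),
    "include_wrist_camera_extrinsic": ("obs", "wrist_camera_extrinsic_list"),
    "include_available_multi_choices": ("info", "available_multi_choices"),
    "include_front_camera_intrinsic": ("info", "front_camera_intrinsic"),
    "include_wrist_camera_intrinsic": ("info", "wrist_camera_intrinsic"),
}


def expected_optional_keys(action_space, **flags):
    """Predict which optional keys should appear in `obs` and `info`."""
    unknown = set(flags) - set(FLAG_TO_KEY)
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    enabled = {name for name, on in flags.items() if on}
    if action_space == "multi_choice":
        # Front camera parameters are forced on for this action space.
        enabled |= {
            "include_front_camera_extrinsic",
            "include_front_camera_intrinsic",
        }
    keys = {"obs": set(), "info": set()}
    for name in enabled:
        container, key = FLAG_TO_KEY[name]
        keys[container].add(key)
    return keys


forced = expected_optional_keys("multi_choice")
depth_only = expected_optional_keys("joint_angle", include_front_depth=True)
```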
| Example: | |
| ```python | |
| from robomme.env_record_wrapper import BenchmarkEnvBuilder | |
| builder = BenchmarkEnvBuilder( | |
|     env_id="VideoUnmaskSwap", | |
|     dataset="test", | |
|     action_space="joint_angle", | |
|     gui_render=False, | |
| ) | |
| env = builder.make_env_for_episode( | |
|     episode_idx=0, | |
|     max_steps=1000, | |
|     include_maniskill_obs=False, | |
|     include_front_depth=True, | |
|     include_wrist_depth=False, | |
|     include_front_camera_extrinsic=True, | |
|     include_wrist_camera_extrinsic=False, | |
|     include_available_multi_choices=False, | |
|     include_front_camera_intrinsic=True, | |
|     include_wrist_camera_intrinsic=False, | |
| ) | |
| obs, info = env.reset() | |
| ``` | |
| ### `info` dict | |
| | Key | Meaning | Typical content | | |
| |-----|---------|-----------------| | |
| | `task_goal` | Task goal list | `list[str]` | | |
| | `simple_subgoal_online` | Oracle online simple subgoal | Description of the current simple subgoal | | |
| | `grounded_subgoal_online` | Oracle online grounded subgoal | Description of the current grounded subgoal | | |
| `available_multi_choices` | Current available options for multi-choice action | List of dicts, e.g. `{"label": "a/b/...", "action": str, "need_parameter": bool}`; `need_parameter` means the action needs grounding info such as `[y, x]` | |
| | `front_camera_intrinsic` | Front camera intrinsic | Camera intrinsic matrix | | |
| | `wrist_camera_intrinsic` | Wrist camera intrinsic | Camera intrinsic matrix | | |
| | `status` | Status flag | One of `success`, `fail`, `timeout`, `ongoing`, `error` | | |
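The sketch below shows one way a policy might consume these fields: check `status` to decide whether to stop, and turn an entry of `available_multi_choices` into a `multi_choice` action. Both helpers are hypothetical. It assumes that every status other than `ongoing` ends the episode and that the `choice` value should match an option's `label`. The mock `info` uses placeholder action strings, not real env output.

```python
# Assumption: any status other than "ongoing" means the episode is over.
TERMINAL_STATUSES = {"success", "fail", "timeout", "error"}


def is_episode_over(info):
    return info.get("status") in TERMINAL_STATUSES


def action_from_choice(info, label, point=None):
    """Build a multi_choice action for the option whose label matches."""
    for option in info["available_multi_choices"]:
        if option["label"] != label:
            continue
        if option["need_parameter"]:
            if point is None:
                raise ValueError(f"option {label!r} needs a [y, x] point")
            y, x = point
            return {"choice": label, "point": [y, x]}
        return {"choice": label}
    raise KeyError(f"no available option with label {label!r}")


# Mock info dict (placeholder values).
info = {
    "status": "ongoing",
    "available_multi_choices": [
        {"label": "a", "action": "placeholder action 1", "need_parameter": True},
        {"label": "b", "action": "placeholder action 2", "need_parameter": False},
    ],
}

grounded_action = action_from_choice(info, "a", point=[64, 128])
plain_action = action_from_choice(info, "b")
```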