RoboMME / doc /env_format.md
HongzeFu's picture
HF Space: code-only (no binary assets)
06c11b0
# Environment Input/Output
On RoboMME, a key difference from traditional Gym-like envs is that every observation value is a **list** rather than a single item. This is because some RoboMME tasks use conditioning video input, and for discrete action types (e.g. waypoint or multi_choice) we also return intermediate observations for potential use with video-based policy models.
## Env Input Format
We support four `ACTION_SPACE` types:
- `joint_angle`: 7 joint angles + gripper open/close
- `ee_pose`: 3 position (xyz) + 3 rotation (rpy) + gripper open/close
- `waypoint`: Same format as ee_pose, but executed in discrete keyframe steps
- `multi_choice`: Command dict, e.g. `{"choice": "A", "point": [y, x]}`; the total choices can be found in `info["available_multi_choices"]`, where the `point` is the pixel location on the front image. this action is designed for Video-QA research.
Note: Gripper closed is -1, gripper open is 1.
## Env Output Format
When calling the `step` function:
```python
obs, reward, terminated, truncated, info = env.step(action)
```
| Return | Description | Typical type |
|--------|-------------|--------------|
| `obs` | Observation dict | `dict[str, list]` |
| `info` | Info dict | `dict[str, Any]` |
| `reward` | Reward value (not used) | scalar tensor |
| `terminated` | Termination flag | scalar boolean tensor |
| `truncated` | Truncation flag | scalar boolean tensor |
### `obs` dict
| Key | Meaning | Typical content |
|-----|---------|-----------------|
| `maniskill_obs` | The original raw env observation from ManiSkill | Raw observation dict |
| `front_rgb_list` | Front camera RGB List | Image frames, e.g. `(H, W, 3)` |
| `wrist_rgb_list` | Wrist camera RGB List | Image frames, e.g. `(H, W, 3)` |
| `front_depth_list` | Front camera depth List | Depth map, e.g. `(H, W, 1)` |
| `wrist_depth_list` | Wrist camera depth List | Depth map, e.g. `(H, W, 1)` |
| `eef_state_list` | End-effector state List | `[x, y, z, roll, pitch, yaw]` |
| `joint_state_list` | Robot joint state List | Joint vector, often 7-D |
| `gripper_state_list` | Robot gripper state List | 2-D |
| `front_camera_extrinsic_list` | Front camera extrinsic List | Camera extrinsic matrix |
| `wrist_camera_extrinsic_list` | Wrist camera extrinsic List | Camera extrinsic matrix |
To use only the current (latest) observation, use `obs[key][-1]`.
### Optional field switches (`include_*`)
`BenchmarkEnvBuilder.make_env_for_episode(...)` controls optional observation/info fields through `include_*` flags.
Default behavior:
- All `include_*` flags default to `False`.
- Without extra flags, env returns RGB + state related fields only.
Mapping:
| Flag | Added key |
|------|-----------|
| `include_maniskill_obs` | `obs["maniskill_obs"]` |
| `include_front_depth` | `obs["front_depth_list"]` |
| `include_wrist_depth` | `obs["wrist_depth_list"]` |
| `include_front_camera_extrinsic` | `obs["front_camera_extrinsic_list"]` |
| `include_wrist_camera_extrinsic` | `obs["wrist_camera_extrinsic_list"]` |
| `include_available_multi_choices` | `info["available_multi_choices"]` |
| `include_front_camera_intrinsic` | `info["front_camera_intrinsic"]` |
| `include_wrist_camera_intrinsic` | `info["wrist_camera_intrinsic"]` |
Special case:
- If `action_space="multi_choice"`, front camera parameters are forced on internally:
- `front_camera_extrinsic_list`
- `front_camera_intrinsic`
Even if the corresponding `include_front_camera_*` flags are `False`.
Example:
```python
from robomme.env_record_wrapper import BenchmarkEnvBuilder
builder = BenchmarkEnvBuilder(
env_id="VideoUnmaskSwap",
dataset="test",
action_space="joint_angle",
gui_render=False,
)
env = builder.make_env_for_episode(
episode_idx=0,
max_steps=1000,
include_maniskill_obs=False,
include_front_depth=True,
include_wrist_depth=False,
include_front_camera_extrinsic=True,
include_wrist_camera_extrinsic=False,
include_available_multi_choices=False,
include_front_camera_intrinsic=True,
include_wrist_camera_intrinsic=False,
)
obs, info = env.reset()
```
### `info` dict
| Key | Meaning | Typical content |
|-----|---------|-----------------|
| `task_goal` | Task goal list | `list[str]` |
| `simple_subgoal_online` | Oracle online simple subgoal | Description of the current simple subgoal |
| `grounded_subgoal_online` | Oracle online grounded subgoal | Description of the current grounded subgoal |
| `available_multi_choices` | Current available options for multi-choice action | List of e.g. `{"label: "a/b/...", "action": str, "need_parameter": bool}`, need_parameter means this action needs grounding info like `[y, x]` |
| `front_camera_intrinsic` | Front camera intrinsic | Camera intrinsic matrix |
| `wrist_camera_intrinsic` | Wrist camera intrinsic | Camera intrinsic matrix |
| `status` | Status flag | One of `success`, `fail`, `timeout`, `ongoing`, `error` |