# HDF5 Training Data Format

Structure inside each `record_dataset_<EnvID>.h5` file:
```text
episode_1/
  setup/
  timestep_1/
    obs/
    action/
    info/
  timestep_2/
    obs/
    action/
    info/
  ...
...
```
Each episode contains:

- `setup/`: episode-level configuration.
- `timestep_<K>/`: per-timestep data.
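
Because each timestep group name ends in a number, a plain string sort places `timestep_10` before `timestep_2`. The sketch below orders the timestep groups of one episode by that numeric suffix. It assumes only that the episode object behaves like a mapping from group names to groups, as an `h5py` group does. The helper name `sorted_timesteps` and the toy dictionary are illustrative and not part of the dataset tooling.

```python
import re

_TIMESTEP = re.compile(r"^timestep_(\d+)$")


def sorted_timesteps(episode):
    """Return the timestep group names of one episode in numeric order.

    `episode` is any mapping keyed by group name (hypothetical helper;
    an open h5py group such as f["episode_1"] is expected to work too).
    """
    found = []
    for name in episode.keys():
        match = _TIMESTEP.match(name)
        if match:  # skips "setup" and anything else that is not a timestep
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


# Toy stand-in for one episode group, using nested dicts instead of HDF5.
episode = {
    "setup": {},
    "timestep_10": {"obs": {}, "action": {}, "info": {}},
    "timestep_2": {"obs": {}, "action": {}, "info": {}},
    "timestep_1": {"obs": {}, "action": {}, "info": {}},
}
print(sorted_timesteps(episode))
```

With a real file, the same function could be called on `h5py.File(path, "r")["episode_1"]`, where `path` points to one of the `.h5` files described above.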
## `setup/` fields (episode configuration)

| Field | Type | Description |
|-------|------|-------------|
| `seed` | `int` | Environment seed (fixed for benchmarking) |
| `difficulty` | `str` | Difficulty level (fixed for benchmarking) |
| `task_goal` | `list[str]` | Possible language goals for the task |
| `front_camera_intrinsic` | `float32 (3, 3)` | Front camera intrinsic matrix |
| `wrist_camera_intrinsic` | `float32 (3, 3)` | Wrist camera intrinsic matrix |
| `available_multi_choices` | `str` | Available options for the multi-choice Video-QA problem |
## `obs/` fields (observations)

| Field | Type / shape | Description |
|-------|---------------|-------------|
| `front_rgb` | `uint8 (512, 512, 3)` | Front camera RGB |
| `wrist_rgb` | `uint8 (256, 256, 3)` | Wrist camera RGB |
| `front_depth` | `int16 (512, 512, 1)` | Front camera depth (mm) |
| `wrist_depth` | `int16 (256, 256, 1)` | Wrist camera depth (mm) |
| `joint_state` | `float32 (7,)` | Joint positions (7 joints) |
| `eef_state` | `float32 (6,)` | End-effector pose `[x, y, z, roll, pitch, yaw]` |
| `gripper_state` | `float32 (2,)` | Gripper opening width in [0, 0.04] |
| `is_gripper_close` | `bool` | Whether gripper is closed |
| `front_camera_extrinsic` | `float32 (3, 4)` | Front camera extrinsic matrix |
| `wrist_camera_extrinsic` | `float32 (3, 4)` | Wrist camera extrinsic matrix |
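
The depth images and the camera intrinsics from `setup/` can be combined to recover a 3D point in the camera frame. The following is a minimal sketch, assuming a standard pinhole intrinsic matrix of the form `[[fx, 0, cx], [0, fy, cy], [0, 0, 1]]` and that each depth value is the distance along the optical axis in millimetres. Neither convention is confirmed by the tables above, and `backproject_pixel` and the sample numbers are made up for illustration.

```python
def backproject_pixel(u, v, depth_mm, intrinsic):
    """Map a pixel (u = column, v = row) and its depth to camera-frame metres.

    Assumes a pinhole model; `intrinsic` is a 3x3 nested list or array.
    """
    fx, fy = intrinsic[0][0], intrinsic[1][1]
    cx, cy = intrinsic[0][2], intrinsic[1][2]
    z = depth_mm / 1000.0  # millimetres -> metres
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


# Hypothetical intrinsics for a 512x512 image; real values come from setup/.
K = [
    [500.0, 0.0, 256.0],
    [0.0, 500.0, 256.0],
    [0.0, 0.0, 1.0],
]
point = backproject_pixel(356, 256, 1000, K)
print(point)
```

Transforming the result into the world frame would additionally need the `(3, 4)` extrinsic matrix, whose direction (world-to-camera or camera-to-world) is not stated here and should be checked before use.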
## `action/` fields

| Field | Type / shape | Description |
|-------|---------------|-------------|
| `joint_action` | `float32 (8,)` | Joint-space action: 7 joint angles + gripper |
| `eef_action` | `float32 (7,)` | End-effector action `[x, y, z, roll, pitch, yaw, gripper]` |
| `waypoint_action` | `float32 (7,)` | End-effector action at discrete time steps; a subtask may contain multiple waypoint actions. Used for data generation. |
| `choice_action` | `str` | JSON string for multi-choice selection with an optional grounded pixel location on the front image, e.g., `{"choice": "A", "point": [y, x]}` |

In RoboMME, a gripper action of -1 closes the gripper and 1 opens it.
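
As an illustration of the two conventions above, the sketch below decodes a `choice_action` string and names a gripper command. The function names and the sample values are hypothetical. The sketch also assumes the string may come back from HDF5 as bytes, and that gripper values other than exactly -1 or 1 can be thresholded at zero.

```python
import json


def parse_choice_action(raw):
    """Decode a `choice_action` JSON string into (choice, point_xy).

    `point_xy` is (x, y), or None when no point is given. The stored
    order is [y, x], so the two values are swapped here.
    """
    if isinstance(raw, bytes):  # HDF5 strings are often read back as bytes
        raw = raw.decode("utf-8")
    data = json.loads(raw)
    point = data.get("point")
    point_xy = (point[1], point[0]) if point is not None else None
    return data["choice"], point_xy


def gripper_command(value):
    """Name the gripper command, following -1 = close and 1 = open."""
    # Thresholding at 0 is an assumption for values that are not exactly +/-1.
    return "open" if value > 0 else "close"


with_point = parse_choice_action('{"choice": "A", "point": [120, 340]}')
without_point = parse_choice_action(b'{"choice": "B"}')
print(with_point, without_point, gripper_command(-1.0))
```

Returning the point as `(x, y)` is only a design choice for plotting libraries that expect that order; indexing an image array still uses `[y, x]`.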
## `info/` fields (metadata)

| Field | Type | Description |
|-------|------|-------------|
| `simple_subgoal` | `bytes (UTF-8)` | Simple subgoal text (built-in planner view) |
| `simple_subgoal_online` | `bytes (UTF-8)` | Simple subgoal text (online view; may advance to the next subgoal earlier than planner view) |
| `grounded_subgoal` | `bytes (UTF-8)` | Grounded subgoal text (built-in planner view) |
| `grounded_subgoal_online` | `bytes (UTF-8)` | Grounded subgoal text (online view; may advance to the next subgoal earlier than planner view) |
| `is_video_demo` | `bool` | Whether this frame is from the conditioning video shown before execution |
| `is_subgoal_boundary` | `bool` | Whether this is a keyframe (i.e., a boundary between subtasks) |
| `is_completed` | `bool` | Whether the task is finished |
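
A common first step with these fields is to decode the subgoal text and skip frames that belong to the conditioning video. The sketch below does this on per-timestep dictionaries that stand in for `info/` groups already read from the file. The helper names and the sample subgoal strings are invented for illustration.

```python
def decode_text(value):
    """Return a str for a subgoal field stored as UTF-8 bytes."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def execution_subgoals(infos, key="simple_subgoal"):
    """List the distinct subgoals, in order, outside the video demo.

    `infos` is a sequence of per-timestep dicts holding the `info/`
    fields (hypothetical in-memory form of the data).
    """
    subgoals = []
    for info in infos:
        if bool(info["is_video_demo"]):
            continue  # frame belongs to the conditioning video
        text = decode_text(info[key])
        if not subgoals or subgoals[-1] != text:
            subgoals.append(text)  # record only when the subgoal changes
    return subgoals


# Made-up frames: one demo frame followed by three execution frames.
infos = [
    {"simple_subgoal": b"watch the demo", "is_video_demo": True},
    {"simple_subgoal": b"pick up the cube", "is_video_demo": False},
    {"simple_subgoal": b"pick up the cube", "is_video_demo": False},
    {"simple_subgoal": b"place the cube", "is_video_demo": False},
]
print(execution_subgoals(infos))
```

Passing `key="simple_subgoal_online"` or one of the grounded variants selects a different subgoal view without changing the logic.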