RoboMME / doc /h5_data_format.md
HongzeFu's picture
HF Space: code-only (no binary assets)
06c11b0

HDF5 Training Data Format

Structure inside each record_dataset_<EnvID>.h5 file:

episode_1/
  setup/
  timestep_1/
    obs/
    action/
    info/
  timestep_2/
    obs/
    action/
    info/
  ...
...

Each episode contains:

  • setup/: episode-level configuration.
  • timestep_<K>/: per-timestep data.

setup/ fields (episode configuration)

Field Type Description
seed int Environment seed (fixed for benchmarking)
difficulty str Difficulty level (fixed for benchmarking)
task_goal list[str] Possible language goals for the task
front_camera_intrinsic float32 (3, 3) Front camera intrinsic matrix
wrist_camera_intrinsic float32 (3, 3) Wrist camera intrinsic matrix
available_multi_choices str Available options for the multi-choice Video-QA problem

obs/ fields (observations)

Field Type / shape Description
front_rgb uint8 (512, 512, 3) Front camera RGB
wrist_rgb uint8 (256, 256, 3) Wrist camera RGB
front_depth int16 (512, 512, 1) Front camera depth (mm)
wrist_depth int16 (256, 256, 1) Wrist camera depth (mm)
joint_state float32 (7,) Joint positions (7 joints)
eef_state float32 (6,) End-effector pose [x, y, z, roll, pitch, yaw]
gripper_state float32 (2,) Gripper opening width in [0, 0.04]
is_gripper_close bool Whether gripper is closed
front_camera_extrinsic float32 (3, 4) Front camera extrinsic matrix
wrist_camera_extrinsic float32 (3, 4) Wrist camera extrinsic matrix

action/ fields

Field Type / shape Description
joint_action float32 (8,) Joint-space action: 7 joint angles + gripper
eef_action float32 (7,) End-effector action [x, y, z, roll, pitch, yaw, gripper]
waypoint_action float32 (7,) End-effector action at discrete time steps; a subtask may contain multiple waypoint actions. Used for data generation.
choice_action str JSON string for multi-choice selection with an optional grounded pixel location on the front image, e.g., {"choice": "A", "point": [y, x]}

In RoboMME, a gripper action of -1 means close and 1 means open.

info/ fields (metadata)

Field Type Description
simple_subgoal bytes (UTF-8) Simple subgoal text (built-in planner view)
simple_subgoal_online bytes (UTF-8) Simple subgoal text (online view; may advance to the next subgoal earlier than planner view)
grounded_subgoal bytes (UTF-8) Grounded subgoal text (built-in planner view)
grounded_subgoal_online bytes (UTF-8) Grounded subgoal text (online view; may advance to the next subgoal earlier than planner view)
is_video_demo bool Whether this frame is from the conditioning video shown before execution
is_subgoal_boundary bool Whether this is a keyframe (i.e., a boundary between subtasks)
is_completed bool Whether the task is finished