
RoboMME / doc /h5_data_format.md


	# HDF5 Training Data Format

	Structure inside each `record_dataset_<EnvID>.h5` file:

	```text
	episode_1/
	    setup/
	    timestep_1/
	        obs/
	        action/
	        info/
	    timestep_2/
	        obs/
	        action/
	        info/
	    ...
	...
	```

	Each episode contains:
	- `setup/`: episode-level configuration.
	- `timestep_<K>/`: per-timestep data.
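
Because the timestep groups are named `timestep_<K>`, a plain string sort places `timestep_10` before `timestep_2`. The sketch below orders them by the numeric suffix instead. It is a minimal illustration, not part of RoboMME: the helper names are hypothetical, and a nested `dict` stands in for the opened file. Anything that offers `keys()` and `[]` access, such as a group returned by an HDF5 reader like `h5py`, should work the same way, though that is an assumption.

```python
import re

_TIMESTEP_RE = re.compile(r"^timestep_(\d+)$")


def sorted_timestep_keys(episode):
    """Return the `timestep_<K>` keys of one episode group, ordered by K."""
    indexed = []
    for key in episode.keys():
        match = _TIMESTEP_RE.match(key)
        if match:  # skips `setup` and any other non-timestep entry
            indexed.append((int(match.group(1)), key))
    return [key for _, key in sorted(indexed)]


def iter_timesteps(episode):
    """Yield (K, obs, action, info) for each timestep in numeric order."""
    for key in sorted_timestep_keys(episode):
        step = episode[key]
        yield int(key.split("_")[1]), step["obs"], step["action"], step["info"]


# Stand-in for an opened file (hypothetical data, same layout as above).
demo_file = {
    "episode_1": {
        "setup": {},
        "timestep_10": {"obs": {}, "action": {}, "info": {}},
        "timestep_2": {"obs": {}, "action": {}, "info": {}},
        "timestep_1": {"obs": {}, "action": {}, "info": {}},
    }
}

print(sorted_timestep_keys(demo_file["episode_1"]))
```

The same function can be applied to every `episode_<N>` group in a file to replay an episode in recording order.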

	## `setup/` fields (episode configuration)

	| Field | Type / shape | Description |
	|-------|--------------|-------------|
	| `seed` | `int` | Environment seed (fixed for benchmarking) |
	| `difficulty` | `str` | Difficulty level (fixed for benchmarking) |
	| `task_goal` | `list[str]` | Possible language goals for the task |
	| `front_camera_intrinsic` | `float32 (3, 3)` | Front camera intrinsic matrix |
	| `wrist_camera_intrinsic` | `float32 (3, 3)` | Wrist camera intrinsic matrix |
	| `available_multi_choices` | `str` | Available options for the multi-choice Video-QA problem |

	## `obs/` fields (observations)

	| Field | Type / shape | Description |
	|-------|--------------|-------------|
	| `front_rgb` | `uint8 (512, 512, 3)` | Front camera RGB |
	| `wrist_rgb` | `uint8 (256, 256, 3)` | Wrist camera RGB |
	| `front_depth` | `int16 (512, 512, 1)` | Front camera depth (mm) |
	| `wrist_depth` | `int16 (256, 256, 1)` | Wrist camera depth (mm) |
	| `joint_state` | `float32 (7,)` | Joint positions (7 joints) |
	| `eef_state` | `float32 (6,)` | End-effector pose `[x, y, z, roll, pitch, yaw]` |
	| `gripper_state` | `float32 (2,)` | Gripper opening width in [0, 0.04] |
	| `is_gripper_close` | `bool` | Whether gripper is closed |
	| `front_camera_extrinsic` | `float32 (3, 4)` | Front camera extrinsic matrix |
	| `wrist_camera_extrinsic` | `float32 (3, 4)` | Wrist camera extrinsic matrix |
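
Depth is stored in millimetres, so it usually needs converting to metres before being combined with the camera intrinsics. The sketch below back-projects a single pixel into the camera frame. It rests on two assumptions the format description does not confirm: the intrinsic matrix follows the standard pinhole layout `[[fx, 0, cx], [0, fy, cy], [0, 0, 1]]`, and each depth value is the distance along the optical axis. The function name and the sample intrinsics are made up for illustration.

```python
def backproject_pixel(intrinsic, u, v, depth_mm):
    """Back-project pixel (u, v) to a camera-frame point in metres.

    Assumes a standard pinhole intrinsic matrix and depth measured
    along the optical axis (neither is confirmed by the format description).
    """
    fx, fy = intrinsic[0][0], intrinsic[1][1]
    cx, cy = intrinsic[0][2], intrinsic[1][2]
    z = depth_mm / 1000.0  # millimetres -> metres
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


# Made-up intrinsics for a 512x512 image; real values come from
# `setup/front_camera_intrinsic`.
K = [[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]]

# Pixel at column u=356, row v=256 with a depth reading of 2000 mm.
print(backproject_pixel(K, 356, 256, 2000))
```

With array data, the depth reading for this pixel would be `front_depth[v, u, 0]`, since the image is indexed row first.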

	## `action/` fields

	| Field | Type / shape | Description |
	|-------|--------------|-------------|
	| `joint_action` | `float32 (8,)` | Joint-space action: 7 joint angles + gripper |
	| `eef_action` | `float32 (7,)` | End-effector action `[x, y, z, roll, pitch, yaw, gripper]` |
	| `waypoint_action` | `float32 (7,)` | End-effector action at discrete time steps; a subtask may contain multiple waypoint actions. Used for data generation. |
	| `choice_action` | `str` | JSON string for multi-choice selection with an optional grounded pixel location on the front image, e.g., `{"choice": "A", "point": [y, x]}` |
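
Since `choice_action` is a JSON string whose `point` is optional, a reader has to handle both cases. The following is a minimal sketch using only the standard `json` module. The function name is hypothetical, and the `bytes` branch assumes the HDF5 reader may return the string undecoded.

```python
import json


def parse_choice_action(raw):
    """Parse a `choice_action` value into (choice, point).

    `point` is returned as (y, x), or None when the JSON has no point.
    """
    if isinstance(raw, bytes):  # assumption: some readers return bytes
        raw = raw.decode("utf-8")
    data = json.loads(raw)
    point = data.get("point")
    if point is not None:
        y, x = point  # documented order is [y, x], i.e. row then column
        point = (int(y), int(x))
    return data["choice"], point


print(parse_choice_action('{"choice": "A", "point": [120, 340]}'))
print(parse_choice_action(b'{"choice": "B"}'))
```

Note the `[y, x]` order: to index the front image at that location, use `front_rgb[y, x]`.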

	In RoboMME, a gripper action of -1 closes the gripper and 1 opens it. This applies to the last element of both `joint_action` and `eef_action`.
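
As an illustration of that convention, the sketch below splits a 7-element `eef_action` into its pose and a readable gripper command. The function name is hypothetical. Resolving the command by sign, so that any value other than exactly -1 or 1 still maps to a command, is an assumption rather than documented behaviour.

```python
def split_eef_action(eef_action):
    """Split `eef_action` into ([x, y, z, roll, pitch, yaw], gripper_command)."""
    values = [float(v) for v in eef_action]
    if len(values) != 7:
        raise ValueError(f"expected 7 values, got {len(values)}")
    pose, gripper = values[:6], values[6]
    # Documented values: -1 = close, 1 = open.
    # Assumption: anything else is resolved by sign, with 0 treated as close.
    command = "open" if gripper > 0 else "close"
    return pose, command


pose, command = split_eef_action([0.1, 0.2, 0.3, 0.0, 0.0, 0.0, -1.0])
print(pose, command)
```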

	## `info/` fields (metadata)

	| Field | Type | Description |
	|-------|------|-------------|
	| `simple_subgoal` | `bytes (UTF-8)` | Simple subgoal text (built-in planner view) |
	| `simple_subgoal_online` | `bytes (UTF-8)` | Simple subgoal text (online view; may advance to the next subgoal earlier than planner view) |
	| `grounded_subgoal` | `bytes (UTF-8)` | Grounded subgoal text (built-in planner view) |
	| `grounded_subgoal_online` | `bytes (UTF-8)` | Grounded subgoal text (online view; may advance to the next subgoal earlier than planner view) |
	| `is_video_demo` | `bool` | Whether this frame is from the conditioning video shown before execution |
	| `is_subgoal_boundary` | `bool` | Whether this is a keyframe (i.e., a boundary between subtasks) |
	| `is_completed` | `bool` | Whether the task is finished |
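
The subgoal fields are UTF-8 bytes, and the boolean flags make it possible to separate the conditioning video from execution and to cut execution into subtasks. The sketch below shows one way to do that over a time-ordered list of `info` records, here plain dicts with made-up values. The helper names are hypothetical. It also assumes that a timestep flagged `is_subgoal_boundary` is the last step of its subtask, which the description above does not state explicitly.

```python
def decode_text(value):
    """Return a subgoal field as str, decoding UTF-8 bytes when needed."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def split_into_subtasks(infos):
    """Group execution timesteps into (subgoal_text, [indices]) segments.

    Frames with `is_video_demo` set are skipped. Assumption: a timestep
    flagged `is_subgoal_boundary` closes the current segment.
    """
    segments, current = [], []
    for index, info in enumerate(infos):
        if info["is_video_demo"]:
            continue  # conditioning video, not part of execution
        current.append(index)
        if info["is_subgoal_boundary"]:
            label = decode_text(infos[current[0]]["simple_subgoal"])
            segments.append((label, current))
            current = []
    if current:  # trailing steps after the last boundary
        label = decode_text(infos[current[0]]["simple_subgoal"])
        segments.append((label, current))
    return segments


# Made-up records mirroring the `info/` fields above.
demo_infos = [
    {"is_video_demo": True, "is_subgoal_boundary": False, "simple_subgoal": b"watch"},
    {"is_video_demo": False, "is_subgoal_boundary": False, "simple_subgoal": b"pick up the cube"},
    {"is_video_demo": False, "is_subgoal_boundary": True, "simple_subgoal": b"pick up the cube"},
    {"is_video_demo": False, "is_subgoal_boundary": False, "simple_subgoal": b"place the cube"},
]

print(split_into_subtasks(demo_infos))
```

Swapping `simple_subgoal` for `grounded_subgoal` or one of the `_online` variants gives the corresponding labels for the same segments.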
