
RoboMME Benchmark Test Instructions

These tests cover logical assertions, action replay, and dataset recording correctness for this benchmark framework. They fall into two groups: tests/dataset (low-level environment and dataset interaction) and tests/lightweight (lightweight logic branches and unit tests).

The sections below describe what each test covers and how to run it.

1. dataset/ Directory: Environment Interaction and Dataset Alignment Tests

Tests in this directory mainly verify low-level environment calls that go through the physics engine, the observation packaging done by the reinforcement learning Wrappers, and the dimensional alignment between large-scale dataset recording and replay.

  • test_obs_config.py: Verifies the include_* control switches passed to make_env_for_episode (e.g., turning the front depth map or the wrist camera intrinsics/extrinsics on and off). Checks that reset() and step() return the corresponding fields in obs and info on demand, without errors.
  • test_obs_numpy.py: Verifies the data conversion handled by DemonstrationWrapper. Checks that, in the generated temporary dataset, the values in the native obs and info dictionaries are converted to NumPy ndarray types with the expected shapes.
  • test_record_stick.py: Verifies that when RecordWrapper records demonstrations in HDF5 format, tasks with special trajectory requirements (like PatternLock/RouteStick) and tasks with normal requirements (like PickXtimes) store their gripper state (gripper_state), robot arm joints (joint_action), and end-effector pose (eef_action) with the correct dimensions.
  • test_replay_stick.py: Counterpart to the recording test. Verifies that the special and normal datasets generated in the previous step, when parsed and replayed by EpisodeDatasetResolver, align exactly with what was recorded.
  • test_eepose_error_handling.py: Heavy environment interaction test. Verifies that when an end-effector pose target (ee_pose action space) beyond the robot arm's reach is passed in, DemonstrationWrapper can gracefully catch the underlying physics engine or IK solver errors, and report the exception information by returning info["status"] = "error" to avoid simulation program crashes.
  • test_route_stick_waypoint_boundary.py: Route-task-specific verification. Ensures that the first online waypoint is recorded with sufficient fidelity at the boundary where the generated data transitions from offline demonstration to online interaction (Demo -> Non-demo).
  • test_waypoint_phase_isolation.py: Verifies data isolation for action commands (especially Waypoints) between demonstration recording and online interaction, preventing residual demonstration actions in the buffer from polluting the recorded data during the online phase.
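
The error-reporting contract checked by test_eepose_error_handling.py can be illustrated with a minimal sketch. This is not the actual DemonstrationWrapper: the class names, the "ok" status value, the "error_message" key, and the five-element step() return shape are assumptions made for illustration. Only info["status"] = "error" comes from the description above.

```python
class SafeStepWrapper:
    """Hypothetical wrapper that reports step failures through info."""

    def __init__(self, env):
        self.env = env

    def step(self, action):
        try:
            obs, reward, terminated, truncated, info = self.env.step(action)
        except Exception as exc:
            # Report the failure instead of letting the simulation crash.
            # "error_message" is an assumed key, not taken from the project.
            info = {"status": "error", "error_message": str(exc)}
            return None, 0.0, True, False, info
        info = dict(info)
        info.setdefault("status", "ok")  # "ok" is an assumed default value
        return obs, reward, terminated, truncated, info


class UnreachableEnv:
    """Stand-in environment whose IK solver always fails."""

    def step(self, action):
        raise RuntimeError("IK failed: target pose out of reach")


result = SafeStepWrapper(UnreachableEnv()).step([10.0, 10.0, 10.0])
```

A test in this style would assert on result[-1]["status"] rather than wrapping the call in try/except.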

Dataset Generation Sharing Mechanism (Pytest Fixture + Cache)

Because rendering and calling the underlying motion planning solver to record qualified datasets is extremely time-consuming, the tests under tests/dataset/ use a comprehensive data generation caching mechanism (_shared/dataset_generation.py) to ensure that a complete demonstration trajectory for the same case is only generated once:

  1. Session-level generator (dataset_factory): dataset/conftest.py defines a single factory function, dataset_factory, whose lifetime spans the entire test session.
  2. Hash-based temporary directory cache (DatasetFactoryCache): On the first request, the dataset is written to a temporary directory provided by Pytest's tmp_path_factory. The cache_key is assembled from the properties that identify the dataset: environment name, number of steps, difficulty, and action control mode.
  3. Direct retrieval by subsequent tests: For example, if multiple tests request the video_unmaskswap_train_ep0_dataset fixture, which contains pre-recorded data, the cache detects the hit and returns the same prepared HDF5 test file. This avoids repeating physics calculations for the same task trajectory across multiple test cases.
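
The caching idea described in the three steps above can be sketched in plain Python. This is an illustration only, not the code in _shared/dataset_generation.py: the class name DatasetCache, the generate callback, the get() signature, and the dataset.h5 file name are all hypothetical. In the real suite the root directory would presumably come from tmp_path_factory inside a session-scoped fixture.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Tuple

CacheKey = Tuple[str, int, str, str]


class DatasetCache:
    """Hypothetical stand-in for a hash-keyed dataset cache."""

    def __init__(self, root: Path, generate: Callable[[Path, CacheKey], None]):
        self._root = root
        self._generate = generate  # the expensive recording step
        self._paths: Dict[CacheKey, Path] = {}

    def get(self, env_id: str, steps: int, difficulty: str, action_space: str) -> Path:
        key: CacheKey = (env_id, steps, difficulty, action_space)
        if key not in self._paths:
            # Derive a stable directory name from the identifying properties.
            digest = hashlib.sha1(repr(key).encode("utf-8")).hexdigest()[:12]
            directory = self._root / digest
            directory.mkdir(parents=True, exist_ok=True)
            path = directory / "dataset.h5"
            self._generate(path, key)  # runs once per distinct key
            self._paths[key] = path
        return self._paths[key]
```

Two requests with identical arguments return the same path and trigger generation once; changing any component of the key, such as the action control mode, produces a separate dataset.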

2. lightweight/ Directory: Lightweight Functional Unit Tests

These tests target unit-level and branch-level assertions on specific internal logic, such as label matching, data post-processing, and state handling in particular task scenarios.

  • test_ChoiceLabel.py: Tests replay matching logic during action inference (oracle_action_matcher), verifying processes such as "accurate extraction of option labels", "ignoring empty label text", and correct mapping to target dictionary options.
  • test_ChoicePositionNearest.py: Position matching logic test. Covers the 3D nearest neighbor behavior of select_target_with_position, including skipping invalid candidates, returning None for no valid input, and stably selecting according to flattened candidate order when equidistant.
  • test_choice_action_pixel_mapping.py: Pixel-level mapping and selection logic. Tests the projection (project_world_to_pixel) from 3D world coordinates to the camera's 2D pixel plane, and verifies that select_target_with_pixel picks the nearest target at screen pixel level.
  • test_StopcubeIncrement.py: Timing verification for the StopCube task. Verifies that the internal scheduler's absolute time step (absTimestep) increments as expected and eventually saturates at its upper limit when the "remain static" option is triggered. Includes cases that simulate backward time steps and check that the counter resets correctly.
  • test_TaskGoal.py / test_TaskGoalI_isList.py: Branch coverage tests for the natural language description generation (get_language_goal) logic in task_goal.py. Verifies that up to a dozen subtasks produce the correct number of bilingual goal descriptions for their specific scenarios.
  • test_choice_action_is_keyframe_flow.py: Workflow test for choice actions derived from discrete items and pixel positions. Checks that recording satisfies the configured keyframe admission conditions, and ensures position_3d is recorded only as a supplementary field.
  • test_waypoint_dense_dedup.py: Tests dense trajectory filtering and adjacent-duplicate removal logic for the waypoint (Waypoint) action space.
  • test_record_info_is_completed.py: Lightweight validation. Parses the AST to ensure RecordWrapper correctly handles task progress in the online phase (e.g., the progress marker field is_completed) during HDF5 file generation.
  • test_record_video_metadata_fields.py: Lightweight metadata field check. Scans the syntax tree to ensure RecordWrapper writes the correct command labels, action options, completion flags, etc., to the record buffer used for replay rendering and visualization checks.
  • test_record_waypoint_pending_flow.py: Syntax flow analysis. Ensures that the data flow during recording (Waypoint Pending state updates and caching) has correct lifecycle management and is cleared properly between phases.
  • test_step_error_handling.py: Lightweight structural test and syntax analysis. Uses Mock and AST checks to confirm that the core environment wrapper layer catches errors and sets info["status"] to "error", and that other callers (like run_example.py) check this error flag instead of relying on rigid try-except blocks.
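
Several of the tests above inspect source code through its syntax tree instead of running a simulation. A minimal sketch of that technique, assuming Python 3.9 or newer (where a subscript key appears directly as an ast.Constant): the SOURCE string and the helper assigns_status_error are hypothetical and are not taken from the project's tests.

```python
import ast

# Hypothetical source fragment standing in for a wrapper's step() method.
SOURCE = '''
def step(self, action):
    try:
        return self.env.step(action)
    except Exception:
        info = {}
        info["status"] = "error"
        return None, 0.0, True, False, info
'''


def assigns_status_error(source: str) -> bool:
    """Return True if the code contains <name>["status"] = "error"."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        if not (isinstance(value, ast.Constant) and value.value == "error"):
            continue
        for target in node.targets:
            if (
                isinstance(target, ast.Subscript)
                and isinstance(target.slice, ast.Constant)
                and target.slice.value == "status"
            ):
                return True
    return False
```

Because nothing is imported or executed from the inspected module, a check like this runs without the simulator installed, which is what keeps the lightweight suite fast.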

3. Public Settings and Helper Scripts (conftest.py and _shared/)

  • conftest.py & dataset/conftest.py: Define the Pytest Fixtures at each level (including how related environments are pre-registered via BenchmarkEnvBuilder and how the dedicated temporary storage factories are built).
  • _shared/: Contains utility scripts such as dataset_generation.py, which work with the temporary HDF5 mock structures and manage the benchmark project paths in one place.

How to Run Tests

This project relies on uv to manage virtual environments. All test commands must be launched with uv run from the repository root directory.

1. Run all tests

uv run python -m pytest tests/

To see print() output and environment-building messages in real time, disable output capture with -s:

uv run python -m pytest tests/ -s

2. Run by section

Run the pure logic tests, which need no physics or rendering (extremely fast):

uv run python -m pytest tests/lightweight/

Run the tests that need actual MuJoCo physics simulation and the data loading wrappers (more time-consuming):

uv run python -m pytest tests/dataset/

3. Execute a specific test script or single test method

Run a single file:

uv run python -m pytest tests/lightweight/test_TaskGoal.py

Run a single test case within a file, for example:

uv run python -m pytest tests/lightweight/test_TaskGoal.py::test_binfill_two_colors

4. Run via Decorator Marks (Pytest Mark)

Tests decorated with @pytest.mark.dataset can also be selected by marker:

uv run python -m pytest -m dataset
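
The -m dataset selector only matches tests that carry that marker. A minimal sketch of how such a marker could be registered in a conftest.py, assuming the project does not already declare it elsewhere (for example in its Pytest configuration file); the description string is illustrative:

```python
# conftest.py (sketch): register the custom "dataset" marker so that
# Pytest does not warn about an unknown mark.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "dataset: tests that need simulation and dataset generation",
    )
```

With the marker registered, a test function or class decorated with @pytest.mark.dataset is collected by -m dataset and excluded by -m "not dataset".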