RLBench Custom Subset Eval

Date: 2026-03-23 UTC

Scope:

Local 3-task RLBench subset: bimanual_lift_ball, bimanual_push_box, bimanual_dual_push_buttons
Train episodes: episode0
Validation episodes: episode1
Observation interface: front, wrist_left, wrist_right at 224x224
Policy action format: 14-D bimanual delta pose + gripper commands, executed through RLBench bimanual end-effector planning
Backbone used in these custom runs: the existing 128-d dummy frozen backbone, initialized from the previously trained reveal-proxy checkpoints
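
As an illustration of the 14-D action format listed above, the sketch below splits a flat action vector into two per-arm commands. The layout is an assumption, not something this report specifies: 7 values per arm (3 translation deltas, 3 rotation deltas, 1 gripper command), with one arm's block followed by the other's. The names ArmAction and split_bimanual_action are hypothetical, and which arm comes first is not stated here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumed layout (not confirmed by this report):
# 3 translation deltas + 3 rotation deltas + 1 gripper command per arm.
ARM_DIM = 7
ACTION_DIM = 2 * ARM_DIM  # 14-D bimanual action


@dataclass
class ArmAction:
    """Hypothetical container for one arm's share of the action vector."""
    delta_position: Tuple[float, float, float]
    delta_rotation: Tuple[float, float, float]
    gripper: float


def split_bimanual_action(action: Sequence[float]) -> Tuple[ArmAction, ArmAction]:
    """Split a flat 14-D action into (first_arm, second_arm) under the assumed layout."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    arms = []
    for start in (0, ARM_DIM):
        chunk = [float(v) for v in action[start:start + ARM_DIM]]
        arms.append(
            ArmAction(
                delta_position=(chunk[0], chunk[1], chunk[2]),
                delta_rotation=(chunk[3], chunk[4], chunk[5]),
                gripper=chunk[6],
            )
        )
    return arms[0], arms[1]
```

If the actual policy head uses a different ordering or a quaternion rotation, only the slicing inside the loop would need to change.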

Offline training results:

Live bounded rollout results:

Per-task live rollout success:

Interpretation:

The previously missing RLBench-side custom trainer/eval path is now implemented and tested.
The bounded custom subset runs do not support a go decision. They fit the tiny offline slice but do not produce any short-horizon task success in live RLBench rollouts.
On this subset, the reveal-state model is not better than the backbone-only model, and enabling planning does not recover success.
These are not paper-scale results. They are bounded diagnostic runs on a repaired local subset, and should be read neither as a credible full PerAct2 reproduction nor as a full custom-model benchmark.

Key artifacts:

Backbone-only summary: /workspace/outputs/rlbench_custom/rlbench_subset3_backbone_only_dummy/summary.json
Reveal-state summary: /workspace/outputs/rlbench_custom/rlbench_subset3_reveal_state_dummy/summary.json
Backbone-only rollout: /workspace/reports/rlbench_custom/backbone_only_rollout/rollout_eval.json
Reveal-state rollout, no plan: /workspace/reports/rlbench_custom/reveal_state_rollout_noplan/rollout_eval.json
Reveal-state rollout, plan enabled: /workspace/reports/rlbench_custom/reveal_state_rollout_plan/rollout_eval.json
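
The three rollout reports listed above can be compared side by side with a small loader such as the sketch below. The file paths are the ones listed in this report; the JSON schema is not documented here, so the key name "per_task_success" is a hypothetical placeholder and should be replaced with whatever the rollout_eval.json files actually contain. The helper names load_reports and success_table are also illustrative.

```python
import json
from pathlib import Path

# Paths copied from the artifact list in this report.
ROLLOUT_REPORTS = {
    "backbone_only": "/workspace/reports/rlbench_custom/backbone_only_rollout/rollout_eval.json",
    "reveal_state_noplan": "/workspace/reports/rlbench_custom/reveal_state_rollout_noplan/rollout_eval.json",
    "reveal_state_plan": "/workspace/reports/rlbench_custom/reveal_state_rollout_plan/rollout_eval.json",
}


def load_reports(paths):
    """Load each JSON report; a missing file maps to None instead of raising."""
    reports = {}
    for name, path in paths.items():
        file_path = Path(path)
        if file_path.is_file():
            with file_path.open() as handle:
                reports[name] = json.load(handle)
        else:
            reports[name] = None
    return reports


def success_table(reports, key="per_task_success"):
    """Collect per-task success dicts; `key` is a hypothetical schema field."""
    table = {}
    for name, report in reports.items():
        if not isinstance(report, dict):
            continue  # file missing or unexpected top-level type
        value = report.get(key)
        if isinstance(value, dict):
            table[name] = value
    return table


if __name__ == "__main__":
    for run_name, per_task in success_table(load_reports(ROLLOUT_REPORTS)).items():
        for task, success in sorted(per_task.items()):
            print(f"{run_name}\t{task}\t{success}")
```

Runs whose file is missing or whose report lacks the assumed key are skipped silently, so an empty printout means the key name needs adjusting rather than that all runs failed.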