
RLBench Custom Subset Eval

Date: 2026-03-23 UTC

Scope:

  • Local 3-task RLBench subset: bimanual_lift_ball, bimanual_push_box, bimanual_dual_push_buttons
  • Train episodes: episode0
  • Validation episodes: episode1
  • Observation interface: front, wrist_left, wrist_right at 224x224
  • Policy action format: 14-D bimanual delta pose + gripper commands, executed through RLBench bimanual end-effector planning
  • Backbone used in these custom runs: the existing 128-d dummy frozen backbone, initialized from the previously trained reveal-proxy checkpoints
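To make the 14-D action format concrete, here is a minimal sketch of how such a vector could be split per arm. The layout is an assumption, not something this report specifies: 7 values per arm (3 delta position, 3 delta rotation, 1 gripper command), left arm first. `ArmAction` and `split_bimanual_action` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ArmAction:
    delta_position: List[float]  # assumed: dx, dy, dz
    delta_rotation: List[float]  # assumed: 3 rotation components
    gripper: float               # assumed: one scalar gripper command


def split_bimanual_action(action: Sequence[float]) -> Tuple[ArmAction, ArmAction]:
    """Split a 14-D action into (left, right) under the assumed 7+7 layout."""
    if len(action) != 14:
        raise ValueError(f"expected 14 values, got {len(action)}")

    def one_arm(chunk: Sequence[float]) -> ArmAction:
        return ArmAction(
            delta_position=list(chunk[0:3]),
            delta_rotation=list(chunk[3:6]),
            gripper=float(chunk[6]),
        )

    # Assumed ordering: first 7 values are the left arm, last 7 the right arm.
    return one_arm(action[:7]), one_arm(action[7:])


left, right = split_bimanual_action([0.0] * 14)
```

If the actual policy head uses a different ordering or rotation parameterization, only the slicing inside `one_arm` would need to change.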

Offline training results:

  • Backbone-only: train total loss 0.0079247, val total loss 0.0056060
  • Reveal-state: train total loss 0.0078287, val total loss 0.0091639
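The offline numbers above can be compared directly. The sketch below computes the validation-minus-train gap for each model from the reported totals; the dictionary keys are illustrative labels, and the totals are assumed to be loss values where lower is better.

```python
# Totals copied from the offline training results above.
offline_results = {
    "backbone_only": {"train": 0.0079247, "val": 0.0056060},
    "reveal_state": {"train": 0.0078287, "val": 0.0091639},
}


def val_train_gap(result: dict) -> float:
    """Validation total minus train total; positive means val is worse."""
    return result["val"] - result["train"]


gaps = {name: val_train_gap(r) for name, r in offline_results.items()}

for name, gap in gaps.items():
    direction = "val worse than train" if gap > 0 else "val not worse than train"
    print(f"{name}: gap = {gap:+.7f} ({direction})")
```

With a single train episode and a single validation episode per task, these gaps are diagnostic only and should not be read as a reliable ranking of the two models.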

Live bounded rollout results:

  • Backbone-only, plan=false: mean success 0.000
  • Reveal-state, plan=false: mean success 0.000
  • Reveal-state, plan=true: mean success 0.000

Per-task live rollout success:

  • bimanual_lift_ball: 0.0 for all three runs
  • bimanual_push_box: 0.0 for all three runs
  • bimanual_dual_push_buttons: 0.0 for all three runs
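The per-task values relate to the mean success figures as follows, assuming the mean is an unweighted average over the three tasks (the report does not state the aggregation rule). The run labels below are hypothetical names for the three rollout configurations.

```python
from statistics import mean

TASKS = ["bimanual_lift_ball", "bimanual_push_box", "bimanual_dual_push_buttons"]

# Hypothetical labels for the three rollout configurations in this report.
RUNS = ["backbone_only_noplan", "reveal_state_noplan", "reveal_state_plan"]

# Per-task success as reported: 0.0 for every task in every run.
per_task_success = {run: {task: 0.0 for task in TASKS} for run in RUNS}


def mean_success(task_scores: dict) -> float:
    """Unweighted mean over tasks (assumed aggregation)."""
    return mean(task_scores.values())


summary = {run: mean_success(scores) for run, scores in per_task_success.items()}

for run, value in summary.items():
    print(f"{run}: mean success {value:.3f}")
```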

Interpretation:

  • The missing RLBench-side custom trainer/eval path is now implemented and tested.
  • The bounded custom subset runs do not support a go decision: the models fit the tiny offline slice but do not produce any short-horizon task success in live RLBench rollouts.
  • On this subset, the reveal-state model is not better than the backbone-only model, and enabling planning does not recover success.
  • These are not paper-scale results. They are bounded diagnostic runs on a repaired local subset, not a credible full PerAct2 reproduction or a full custom-model benchmark.

Key artifacts:

  • Backbone-only summary: /workspace/outputs/rlbench_custom/rlbench_subset3_backbone_only_dummy/summary.json
  • Reveal-state summary: /workspace/outputs/rlbench_custom/rlbench_subset3_reveal_state_dummy/summary.json
  • Backbone-only rollout: /workspace/reports/rlbench_custom/backbone_only_rollout/rollout_eval.json
  • Reveal-state rollout, no plan: /workspace/reports/rlbench_custom/reveal_state_rollout_noplan/rollout_eval.json
  • Reveal-state rollout, plan enabled: /workspace/reports/rlbench_custom/reveal_state_rollout_plan/rollout_eval.json
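A minimal sketch for inspecting these artifacts is shown below. It only checks whether each file exists and lists its top-level keys; it makes no assumption about the JSON schema, which this report does not describe. `load_artifacts` is a hypothetical helper, not part of the project code.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Paths copied from the artifact list above; the dictionary keys are illustrative.
ARTIFACTS = {
    "backbone_only_summary": "/workspace/outputs/rlbench_custom/rlbench_subset3_backbone_only_dummy/summary.json",
    "reveal_state_summary": "/workspace/outputs/rlbench_custom/rlbench_subset3_reveal_state_dummy/summary.json",
    "backbone_only_rollout": "/workspace/reports/rlbench_custom/backbone_only_rollout/rollout_eval.json",
    "reveal_state_rollout_noplan": "/workspace/reports/rlbench_custom/reveal_state_rollout_noplan/rollout_eval.json",
    "reveal_state_rollout_plan": "/workspace/reports/rlbench_custom/reveal_state_rollout_plan/rollout_eval.json",
}


def load_artifacts(paths: Dict[str, str]) -> Dict[str, Optional[Any]]:
    """Parse each JSON file that exists; map missing files to None."""
    loaded: Dict[str, Optional[Any]] = {}
    for name, path in paths.items():
        file_path = Path(path)
        loaded[name] = json.loads(file_path.read_text()) if file_path.is_file() else None
    return loaded


if __name__ == "__main__":
    for name, payload in load_artifacts(ARTIFACTS).items():
        if payload is None:
            print(f"{name}: missing")
        elif isinstance(payload, dict):
            print(f"{name}: top-level keys = {sorted(payload)}")
        else:
            print(f"{name}: JSON value of type {type(payload).__name__}")
```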