
Phase Tracking

Date closed: 2026-03-25 UTC

  • Snapshot note: this Hugging Face snapshot does not contain .git, so a git commit hash is unavailable.
  • Regression baselines: /workspace/VLAarchtests/regression/baselines.md
  • Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.

Phase 0

  • Status: completed.
  • Historical baseline artifacts are locked in /workspace/VLAarchtests/regression/baselines.md.
  • Historical dummy benchmark reference:
    • interaction 0.5278
    • backbone 0.5556
    • reveal 0.5417
  • Historical CLIP benchmark reference:
    • interaction 0.3056
    • backbone 0.3333
    • reveal 0.2083

Phase 1

  • Config: proxy_interaction_r3d_stage1_dummy.yaml
  • Seeds: 13, 14, 15
  • Artifact roots:
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15
  • Mean train time: 20.45 s
  • Mean peak GPU memory: 629.62 MB
  • 3-seed benchmark means:
    • mean success: 0.5787
    • foliage success: 0.4444
    • bag success: 0.6111
    • cloth success: 0.6806
    • reocclusion rate: 0.0000
    • persistence horizon MAE: 1.9553
    • disturbance cost: 0.3649
    • planner top-1: 0.2832
    • planner regret: 0.0143
    • planner score/utility spearman: 0.2504
    • role collapse: 0.0000
    • proposal diversity: 0.0245
    • swap equivariance error: 0.00768
  • Ablations:
    • no_planner: 0.5648 mean success, drop 0.0139
    • no_role_symmetry: 0.5833 mean success, delta +0.0046
  • Gate decisions:
    • hard success gate >= 0.58: fail by 0.0013
    • planner must matter: fail, no_planner drop is only 0.0139
    • planner top-1 >= 0.30: fail, measured 0.2832
    • role symmetry must matter: fail, no_role_symmetry is slightly better than full
    • proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
  • Takeaway: the structure refactor improved over the historical interaction baseline (0.5787 vs 0.5278) and exceeded the historical dummy backbone baseline (0.5556), but it did not clear the phase-1 acceptance gates.
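
The numeric Phase 1 gates can be re-derived from the reported means. The following is a minimal sketch, not the repository's gate code: the helper names `check_min_gate` and `ablation_drop` are hypothetical, and only the two thresholds stated above (0.58 and 0.30) are used. The "must matter" gates have no numeric threshold in this log, so the sketch only reports the ablation drops.

```python
# Phase 1 3-seed means, copied from the list above.
full = {"mean_success": 0.5787, "planner_top1": 0.2832}
ablations = {"no_planner": 0.5648, "no_role_symmetry": 0.5833}

# Numeric thresholds stated in the gate decisions.
SUCCESS_GATE = 0.58
TOP1_GATE = 0.30


def check_min_gate(value: float, threshold: float) -> tuple[bool, float]:
    """Return (passed, margin); a negative margin is the shortfall."""
    margin = round(value - threshold, 4)
    return margin >= 0, margin


def ablation_drop(full_success: float, ablated_success: float) -> float:
    """Positive drop means removing the component hurt mean success."""
    return round(full_success - ablated_success, 4)


success_pass, success_margin = check_min_gate(full["mean_success"], SUCCESS_GATE)
top1_pass, top1_margin = check_min_gate(full["planner_top1"], TOP1_GATE)
drops = {
    name: ablation_drop(full["mean_success"], success)
    for name, success in ablations.items()
}

print("success gate:", success_pass, success_margin)
print("top-1 gate:", top1_pass, top1_margin)
print("ablation drops:", drops)
```

A negative drop for no_role_symmetry corresponds to the ablated model scoring slightly higher than the full model, which is why that gate is recorded as a fail.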

Phase 2

  • Config: proxy_interaction_r3d_stage2_dummy.yaml
  • Seeds: 21, 22, 23
  • Artifact roots:
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23
  • Mean train time: 20.76 s
  • Mean peak GPU memory: 639.39 MB
  • 3-seed benchmark means:
    • mean success: 0.5463
    • foliage success: 0.4444
    • bag success: 0.5417
    • cloth success: 0.6528
    • reocclusion rate: 0.0121
    • persistence horizon MAE: 2.2358
    • disturbance cost: 0.3148
    • planner top-1: 0.3442
    • planner regret: 0.0208
    • planner score/utility spearman: 0.2397
    • belief calibration brier: 0.00842
    • reocclusion calibration brier: 0.2745
    • swap equivariance error: 0.00504
  • Ablations:
    • no_world_model: 0.5463 mean success, drop 0.0000
    • short_history: 0.5463 mean success, delta 0.0000
  • Gate decisions:
    • hard success gate >= 0.60: fail by 0.0537, measured 0.5463
    • no_world_model must hurt: fail; the 2026-03-25 post-fix null-rollout rerun remained at 0.5463, drop 0.0000
    • full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
    • state metrics should improve over phase 1: fail, reocclusion rate increased (0.0000 -> 0.0121), persistence MAE worsened (1.9553 -> 2.2358), and calibration worsened
  • Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner top-1 improved (0.2832 -> 0.3442), but the post-fix null-rollout rerun still left mean success unchanged at 0.5463.
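
The state-metric comparison against Phase 1 can be sketched as below. This is an illustration under stated assumptions: the function name `regressions` is hypothetical, both metrics are treated as lower-is-better, and the pass rule (no listed metric regresses) is an assumed reading of "should improve", not a rule taken from the repository. Calibration is omitted because this log reports no Phase 1 calibration numbers.

```python
# (phase 1, phase 2) 3-seed means copied from this document.
# Both metrics are treated as lower-is-better.
state_metrics = {
    "reocclusion_rate": (0.0000, 0.0121),
    "persistence_horizon_mae": (1.9553, 2.2358),
}


def regressions(metrics: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return {name: increase} for every metric that got worse (went up)."""
    return {
        name: round(new - old, 4)
        for name, (old, new) in metrics.items()
        if new > old
    }


worse = regressions(state_metrics)
# Assumed rule: the gate passes only if no listed metric regressed.
gate_pass = not worse

print("gate pass:", gate_pass)
print("regressions:", worse)
```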

Phase 3

  • RGB-only compatibility configs:
    • proxy_interaction_r3d_stage1_clip.yaml
    • proxy_interaction_r3d_stage2_clip.yaml
  • RGB-D config: proxy_interaction_r3d_stage3_clip_rgbd.yaml
  • Artifact roots:
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18
    • /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19
  • RGB-only CLIP means:
    • stage 1 clip mean success: 0.5324
    • stage 2 clip mean success: 0.4954
  • Stage 3 RGB-D means:
    • mean train time: 145.93 s
    • mean peak GPU memory: 1952.12 MB
    • mean success: 0.5741
    • foliage success: 0.4861
    • bag success: 0.5417
    • cloth success: 0.6944
    • reocclusion rate: 0.0151
    • persistence horizon MAE: 1.7883
    • disturbance cost: 0.2258
    • planner top-1: 0.3265
    • planner regret: 0.0157
    • proposal diversity: 0.0270
    • swap equivariance error: 0.000094
  • no_depth ablation:
    • mean success: 0.5231
    • absolute drop vs full: 0.0509
    • bag success drops 0.5417 -> 0.4722
    • foliage success drops 0.4861 -> 0.4167
  • Gate decisions:
    • CLIP hard success gate >= 0.37: pass
    • no_depth must hurt on at least one geometry-heavy proxy: pass
    • no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
  • Takeaway: the RGB-D path is the first phase that cleanly clears its acceptance gates.
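
The no_depth gate can be checked from the per-task numbers above. This is a minimal sketch that assumes foliage and bag are the geometry-heavy proxies, since those are the only per-task no_depth results listed here. The variable names are illustrative and not taken from the repository.

```python
# Stage 3 RGB-D 3-seed means (full model) and the no_depth ablation,
# restricted to the two tasks reported for the ablation.
full = {"foliage": 0.4861, "bag": 0.5417}
no_depth = {"foliage": 0.4167, "bag": 0.4722}

# Per-task drop; positive means removing depth hurt that task.
task_drops = {task: round(full[task] - no_depth[task], 4) for task in full}

# Assumed reading of the gate: at least one geometry-heavy task must drop.
hurt_tasks = [task for task, drop in task_drops.items() if drop > 0]
gate_pass = len(hurt_tasks) >= 1

# Overall drop recomputed from the rounded means. The log reports 0.0509,
# which presumably comes from the unrounded per-seed values.
mean_drop_from_rounded = round(0.5741 - 0.5231, 4)

print("per-task drops:", task_drops)
print("gate pass:", gate_pass)
print("mean drop (rounded inputs):", mean_drop_from_rounded)
```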

Phase 4

  • Unit tests:
    • command: PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests
    • result: 10 passed
  • RLBench import/config smoke:
    • artifact: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt
    • status: pass
    • imports of rlbench, pyrep, and yarr all resolved
    • camera contract preserved: front, wrist_left, wrist_right at 224x224
  • RLBench launch smoke:
    • artifact stdout: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt
    • artifact stderr: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr
    • status: pass
    • open_drawer resolves to RightOpenDrawer
    • finite 18-D action, camera shapes [224, 224, 3], no crash
  • RLBench open-drawer rollout:
    • artifact: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json
    • status: pass as integration
    • no import errors
    • no historical workspace path error string
    • rollout JSON written
    • mean success remains 0.0, so this is plumbing evidence only
  • PerAct2 13-task smoke:
    • artifact summary JSON: /workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json
    • artifact summary markdown: /workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md
    • status: pass as integration
    • tasks launched: 13/13
    • finite action check: 13/13
    • summary JSON written
  • Integration caveat:
    • full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched run_rlbench_rollout_eval attempt hit a Qt/OpenGL segfault after repeated environment recycling, and the subprocess-isolated full rollout sweep was too slow to serve as a smoke test. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and a written summary.
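
The subprocess isolation described in the integration caveat can be sketched as follows. This is a generic pattern, not the repository's sweep script: `run_isolated` is a hypothetical helper, and the child command is a harmless placeholder standing in for whatever entry point launches a single task. Each task runs in its own interpreter, so a segfault or hang in one child is recorded instead of ending the whole sweep.

```python
import subprocess
import sys


def run_isolated(task: str, command: list[str], timeout_s: float) -> dict:
    """Run one task in a fresh process and classify how it ended."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return {"task": task, "status": "timeout", "returncode": None}
    # On POSIX a negative return code means the child was killed by a
    # signal (for example -11 for SIGSEGV on Linux).
    status = "ok" if proc.returncode == 0 else "crashed"
    return {"task": task, "status": status, "returncode": proc.returncode}


# Placeholder child command; a real sweep would call the rollout entry
# point for each task here instead of printing a string.
placeholder_cmd = [sys.executable, "-c", "print('noop')"]

tasks = ["open_drawer"]  # task name taken from the launch smoke above
results = [run_isolated(task, placeholder_cmd, timeout_s=30.0) for task in tasks]
print(results)
```

The cost of this design is a process start per task, which is consistent with the note above that the isolated full sweep was too slow to use as a smoke test.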

Final Decision

  • Phase 1: not accepted
  • Phase 2: not accepted
  • Phase 3: accepted
  • Phase 4 integration: accepted

Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner/world-model claims are still not validated strongly enough on the dummy proxy benchmark to support a stronger paper claim without more work.