Phase Tracking

Date closed: 2026-03-25 UTC

Snapshot note: this Hugging Face snapshot does not contain .git, so a git commit hash is unavailable.
Regression baselines: /workspace/VLAarchtests/regression/baselines.md
Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.

Phase 0

Status: completed.
Historical baseline artifacts are locked in /workspace/VLAarchtests/regression/baselines.md.
Historical dummy benchmark reference:
- interaction 0.5278
- backbone 0.5556
- reveal 0.5417
Historical CLIP benchmark reference:
- interaction 0.3056
- backbone 0.3333
- reveal 0.2083

Phase 1

Config: proxy_interaction_r3d_stage1_dummy.yaml
Seeds: 13, 14, 15
Artifact roots:
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15
Mean train time: 20.45 s
Mean peak GPU memory: 629.62 MB
3-seed benchmark means:
- mean success: 0.5787
- foliage success: 0.4444
- bag success: 0.6111
- cloth success: 0.6806
- reocclusion rate: 0.0000
- persistence horizon MAE: 1.9553
- disturbance cost: 0.3649
- planner top-1: 0.2832
- planner regret: 0.0143
- planner score/utility spearman: 0.2504
- role collapse: 0.0000
- proposal diversity: 0.0245
- swap equivariance error: 0.00768
Ablations:
- no_planner: 0.5648 mean success, drop 0.0139
- no_role_symmetry: 0.5833 mean success, delta +0.0046
Gate decisions:
- hard success gate >= 0.58: fail by 0.0013
- planner must matter: fail, no_planner drop is only 0.0139
- planner top-1 >= 0.30: fail, measured 0.2832
- role symmetry must matter: fail, no_role_symmetry is slightly better than full
- proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
Takeaway: the structural refactor improved on the historical dummy interaction baseline (0.5787 vs 0.5278) and exceeded the historical dummy backbone baseline (0.5556), but it did not clear the phase-1 acceptance gates.
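
The Phase 1 gate arithmetic can be reproduced with a short sketch. The numbers and the two numeric thresholds are copied from the lists above; the helper names (margin, ablation_drop) are hypothetical and not part of the repo. The report states no numeric threshold for "planner must matter", so the sketch only reports the drop and does not decide that gate.

```python
# Phase 1 numbers copied from the report above.
full = {"mean_success": 0.5787, "planner_top1": 0.2832}
ablations = {"no_planner": 0.5648, "no_role_symmetry": 0.5833}

# Thresholds stated in the gate list.
SUCCESS_GATE = 0.58
TOP1_GATE = 0.30


def margin(measured: float, threshold: float) -> float:
    """Signed distance to a '>=' gate; a negative value means the gate fails."""
    return round(measured - threshold, 4)


def ablation_drop(full_success: float, ablated_success: float) -> float:
    """Positive when removing the component lowers mean success."""
    return round(full_success - ablated_success, 4)


success_margin = margin(full["mean_success"], SUCCESS_GATE)
top1_margin = margin(full["planner_top1"], TOP1_GATE)
planner_drop = ablation_drop(full["mean_success"], ablations["no_planner"])
symmetry_drop = ablation_drop(full["mean_success"], ablations["no_role_symmetry"])

print(f"success gate margin:   {success_margin:+.4f}")
print(f"planner top-1 margin:  {top1_margin:+.4f}")
print(f"no_planner drop:       {planner_drop:+.4f}")
print(f"no_role_symmetry drop: {symmetry_drop:+.4f}")
```

A negative no_role_symmetry drop means the ablated model scored higher than the full model, which is why the role-symmetry gate fails.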

Phase 2

Config: proxy_interaction_r3d_stage2_dummy.yaml
Seeds: 21, 22, 23
Artifact roots:
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23
Mean train time: 20.76 s
Mean peak GPU memory: 639.39 MB
3-seed benchmark means:
- mean success: 0.5463
- foliage success: 0.4444
- bag success: 0.5417
- cloth success: 0.6528
- reocclusion rate: 0.0121
- persistence horizon MAE: 2.2358
- disturbance cost: 0.3148
- planner top-1: 0.3442
- planner regret: 0.0208
- planner score/utility spearman: 0.2397
- belief calibration brier: 0.00842
- reocclusion calibration brier: 0.2745
- swap equivariance error: 0.00504
Ablations:
- no_world_model: 0.5463 mean success, drop 0.0000
- short_history: 0.5463 mean success, delta 0.0000
Gate decisions:
- hard success gate >= 0.60: fail, measured 0.5463 (short by 0.0537)
- no_world_model must hurt: fail; the 2026-03-25 post-fix null-rollout rerun remained at 0.5463, drop 0.0000
- full memory must stop losing to short history: the hard gate passes narrowly because full ties short-history (0.5463 each); the preferred gate fails because full does not beat short-history
- state metrics should improve over phase 1: fail, reocclusion rate increased (0.0000 -> 0.0121), persistence MAE worsened (1.9553 -> 2.2358), and calibration worsened
Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
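
The Phase 2 gate logic can be sketched as follows. Reading the hard memory gate as "full >= short-history" and the preferred gate as "full > short-history" is an interpretation of the gate wording above, not a definition taken from the repo. All numbers are copied from the Phase 1 and Phase 2 lists; the calibration metrics are left out because Phase 1 reports no values for them.

```python
# State metrics copied from the Phase 1 and Phase 2 lists (lower is better).
phase1 = {"reocclusion_rate": 0.0000, "persistence_mae": 1.9553}
phase2 = {"reocclusion_rate": 0.0121, "persistence_mae": 2.2358}

# Mean success for the full model and the two Phase 2 ablations.
full_success = 0.5463
short_history_success = 0.5463
no_world_model_success = 0.5463

# Hard gate (assumed reading): full memory must not lose to short history; ties pass.
memory_hard_gate = full_success >= short_history_success
# Preferred gate (assumed reading): full memory must strictly beat short history.
memory_preferred_gate = full_success > short_history_success
# The world model only "matters" if removing it lowers mean success.
world_model_gate = (full_success - no_world_model_success) > 0

# Collect every state metric that got worse from Phase 1 to Phase 2.
regressions = {
    name: round(phase2[name] - phase1[name], 4)
    for name in phase1
    if phase2[name] > phase1[name]
}

print("memory hard gate:", memory_hard_gate)
print("memory preferred gate:", memory_preferred_gate)
print("no_world_model must hurt:", world_model_gate)
print("state metric regressions:", regressions)
```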

Phase 3

RGB-only compatibility configs:
- proxy_interaction_r3d_stage1_clip.yaml
- proxy_interaction_r3d_stage2_clip.yaml
RGB-D config: proxy_interaction_r3d_stage3_clip_rgbd.yaml
Artifact roots:
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18
- /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19
RGB-only CLIP means:
- stage 1 clip mean success: 0.5324
- stage 2 clip mean success: 0.4954
Stage 3 RGB-D means:
- mean train time: 145.93 s
- mean peak GPU memory: 1952.12 MB
- mean success: 0.5741
- foliage success: 0.4861
- bag success: 0.5417
- cloth success: 0.6944
- reocclusion rate: 0.0151
- persistence horizon MAE: 1.7883
- disturbance cost: 0.2258
- planner top-1: 0.3265
- planner regret: 0.0157
- proposal diversity: 0.0270
- swap equivariance error: 0.000094
no_depth ablation:
- mean success: 0.5231
- absolute drop vs full: 0.0509
- bag success drops 0.5417 -> 0.4722
- foliage success drops 0.4861 -> 0.4167
Gate decisions:
- CLIP hard success gate >= 0.37: pass, stage 3 RGB-D mean success is 0.5741
- no_depth must hurt on at least one geometry-heavy proxy: pass
- no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
Takeaway: Phase 3, the RGB-D path, is the first phase to clear all of its acceptance gates.
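
The no_depth comparison can be sketched as a per-proxy drop calculation. Treating bag and foliage as the geometry-heavy proxies is an assumption based on their being the two proxies listed under the ablation; the variable names are hypothetical. The inputs are the rounded means reported above, so the overall drop comes out as 0.0510 rather than the reported 0.0509, which was presumably computed from unrounded values.

```python
# Rounded means copied from the Stage 3 RGB-D list and the no_depth ablation.
full = {"mean": 0.5741, "bag": 0.5417, "foliage": 0.4861}
no_depth = {"mean": 0.5231, "bag": 0.4722, "foliage": 0.4167}

# Absolute drop in success when depth input is removed.
drops = {name: round(full[name] - no_depth[name], 4) for name in full}

# Assumption: these are the proxies the gate calls geometry-heavy.
geometry_heavy = ("bag", "foliage")

# Gate: removing depth must hurt on at least one geometry-heavy proxy.
no_depth_gate = any(drops[name] > 0 for name in geometry_heavy)

for name, drop in drops.items():
    print(f"{name}: drop {drop:.4f}")
print("no_depth must hurt:", no_depth_gate)
```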

Phase 4

Unit tests:
- command: PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests
- result: 10 passed
RLBench import/config smoke:
- artifact: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt
- status: pass
- imports of rlbench, pyrep, and yarr all resolved
- camera contract preserved: front, wrist_left, wrist_right at 224x224
RLBench launch smoke:
- artifact stdout: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt
- artifact stderr: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr
- status: pass
- open_drawer resolves to RightOpenDrawer
- finite 18-D action, camera shapes [224, 224, 3], no crash
RLBench open-drawer rollout:
- artifact: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json
- status: pass as integration
- no import errors
- the historical workspace path error string does not appear
- rollout JSON written
- mean success remains 0.0, so this is plumbing evidence only
PerAct2 13-task smoke:
- artifact summary JSON: /workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json
- artifact summary markdown: /workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md
- status: pass as integration
- 13/13 tasks launched
- finite action check: 13/13
- summary JSON written
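
The finite-action and camera-contract checks used by the smokes above can be sketched in a few lines. The 18-D action size, the camera names, and the [224, 224, 3] shape come from the RLBench smoke results; the function names and the plain-list inputs are hypothetical, and the real smoke scripts may check these differently.

```python
import math
from typing import Mapping, Sequence

EXPECTED_ACTION_DIM = 18
EXPECTED_CAMERAS = ("front", "wrist_left", "wrist_right")
EXPECTED_SHAPE = (224, 224, 3)


def action_is_finite(action: Sequence[float], dim: int = EXPECTED_ACTION_DIM) -> bool:
    """True if the action has the expected length and contains no NaN or inf."""
    return len(action) == dim and all(math.isfinite(float(a)) for a in action)


def cameras_match_contract(shapes: Mapping[str, Sequence[int]]) -> bool:
    """True if every expected camera is present with the expected image shape."""
    return all(
        tuple(shapes.get(name, ())) == EXPECTED_SHAPE for name in EXPECTED_CAMERAS
    )


# Illustrative inputs only; a real smoke would take these from the policy and env.
good_action = [0.0] * 18
bad_action = [0.0] * 17 + [float("nan")]
shapes = {name: [224, 224, 3] for name in EXPECTED_CAMERAS}

print(action_is_finite(good_action))
print(action_is_finite(bad_action))
print(cameras_match_contract(shapes))
```
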
Integration caveat:
- full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched run_rlbench_rollout_eval attempt hit a Qt/OpenGL segfault after repeated environment recycling, and the subprocess-isolated full rollout sweep was too slow to serve as a smoke test. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and written summary.
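
Subprocess isolation of the kind mentioned in the caveat can be sketched as one child process per task, so that a crash in one simulator instance cannot take down the whole sweep. This is a generic pattern, not the repo's actual sweep script: run_isolated and demo_cmd are hypothetical names, and the child command here is a stand-in that only simulates a failing task.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Sequence


def run_isolated(
    tasks: Sequence[str],
    build_cmd: Callable[[str], List[str]],
    timeout_s: float = 600.0,
) -> Dict[str, str]:
    """Run one child process per task and record how each one ended."""
    results: Dict[str, str] = {}
    for task in tasks:
        try:
            proc = subprocess.run(
                build_cmd(task), capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            results[task] = "timeout"
            continue
        # A crash in the child shows up as a nonzero return code here.
        results[task] = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    return results


def demo_cmd(task: str) -> List[str]:
    """Stand-in child; a real sweep would launch the rollout entry point instead."""
    code = "import sys; sys.exit(1 if sys.argv[1] == 'crashy_task' else 0)"
    return [sys.executable, "-c", code, task]


demo = run_isolated(["open_drawer", "crashy_task"], demo_cmd, timeout_s=60.0)
print(demo)
```

The cost of this pattern is one interpreter and simulator startup per task, which is consistent with the sweep being too slow for a smoke test.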

Final Decision

Phase 1: not accepted
Phase 2: not accepted
Phase 3: accepted
Phase 4 integration: accepted
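
The proxy-phase decisions above follow from the gate outcomes if a phase is accepted only when every one of its gates passes. That rule is an assumption inferred from the report, and the gate keys below are shorthand labels, not identifiers from the repo.

```python
# Gate outcomes transcribed from the Gate decisions lists of Phases 1 to 3.
gates = {
    "Phase 1": {
        "hard_success": False,
        "planner_matters": False,
        "planner_top1": False,
        "role_symmetry_matters": False,
        "no_proposal_collapse": True,
    },
    "Phase 2": {
        "hard_success": False,
        "no_world_model_hurts": False,
        "memory_hard_gate": True,
        "state_metrics_improve": False,
    },
    "Phase 3": {
        "clip_hard_success": True,
        "no_depth_hurts": True,
        "no_rgb_only_regression": True,
    },
}

# Assumed rule: a phase is accepted only if all of its gates pass.
accepted = {phase: all(outcomes.values()) for phase, outcomes in gates.items()}

for phase, ok in accepted.items():
    print(f"{phase}: {'accepted' if ok else 'not accepted'}")
```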

Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner and world-model claims are not yet validated on the dummy proxy benchmark, so they cannot support a stronger paper claim without further work.