Phase Tracking
Date closed: 2026-03-25 UTC
- Snapshot note: this Hugging Face snapshot does not contain .git, so a git commit hash is unavailable.
- Regression baselines: /workspace/VLAarchtests/regression/baselines.md
- Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.
Phase 0
- Status: completed.
- Historical baseline artifacts are locked in /workspace/VLAarchtests/regression/baselines.md.
- Historical dummy benchmark reference:
  - interaction: 0.5278
  - backbone: 0.5556
  - reveal: 0.5417
- Historical CLIP benchmark reference:
  - interaction: 0.3056
  - backbone: 0.3333
  - reveal: 0.2083
Phase 1
- Config: proxy_interaction_r3d_stage1_dummy.yaml
- Seeds: 13, 14, 15
- Artifact roots:
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15
- Mean train time: 20.45 s
- Mean peak GPU memory: 629.62 MB
- 3-seed benchmark means:
  - mean success: 0.5787
  - foliage success: 0.4444
  - bag success: 0.6111
  - cloth success: 0.6806
  - reocclusion rate: 0.0000
  - persistence horizon MAE: 1.9553
  - disturbance cost: 0.3649
  - planner top-1: 0.2832
  - planner regret: 0.0143
  - planner score/utility spearman: 0.2504
  - role collapse: 0.0000
  - proposal diversity: 0.0245
  - swap equivariance error: 0.00768
- Ablations:
  - no_planner: 0.5648 mean success, drop 0.0139
  - no_role_symmetry: 0.5833 mean success, delta +0.0046
- Gate decisions:
  - hard success gate >= 0.58: fail by 0.0013
  - planner must matter: fail, no_planner drop is only 0.0139
  - planner top-1 >= 0.30: fail, measured 0.2832
  - role symmetry must matter: fail, no_role_symmetry is slightly better than full
  - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
- Takeaway: the structure refactor improved over the historical interaction baseline (0.5787 vs 0.5278) and exceeded the historical dummy backbone baseline (0.5556), but it did not clear the phase-1 acceptance gates.
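
The two numeric gates and the ablation deltas above can be rechecked from the reported means. The following is a minimal sketch, not the project's gate code: the dictionary keys and the `margin` helper are hypothetical names, and only the numbers are taken from this section.

```python
# Reported Phase 1 3-seed means and ablation results, copied from the list above.
full = {"mean_success": 0.5787, "planner_top1": 0.2832}
ablation_success = {"no_planner": 0.5648, "no_role_symmetry": 0.5833}


def margin(value, threshold):
    """Signed distance to a '>=' gate; a negative value means the gate fails."""
    return round(value - threshold, 4)


# Only gates with an explicit threshold in this section are checked here.
gates = {
    "hard_success>=0.58": margin(full["mean_success"], 0.58),
    "planner_top1>=0.30": margin(full["planner_top1"], 0.30),
}

# Positive drop: removing the component hurt. Negative: the ablation did better.
drops = {
    name: round(full["mean_success"] - success, 4)
    for name, success in ablation_success.items()
}

print(gates)
print(drops)
```

The negative margins correspond to the two numeric gate failures, and the negative no_role_symmetry drop is the "slightly better than full" case.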
Phase 2
- Config: proxy_interaction_r3d_stage2_dummy.yaml
- Seeds: 21, 22, 23
- Artifact roots:
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23
- Mean train time: 20.76 s
- Mean peak GPU memory: 639.39 MB
- 3-seed benchmark means:
  - mean success: 0.5463
  - foliage success: 0.4444
  - bag success: 0.5417
  - cloth success: 0.6528
  - reocclusion rate: 0.0121
  - persistence horizon MAE: 2.2358
  - disturbance cost: 0.3148
  - planner top-1: 0.3442
  - planner regret: 0.0208
  - planner score/utility spearman: 0.2397
  - belief calibration brier: 0.00842
  - reocclusion calibration brier: 0.2745
  - swap equivariance error: 0.00504
- Ablations:
  - no_world_model: 0.5463 mean success, drop 0.0000
  - short_history: 0.5463 mean success, delta 0.0000
- Gate decisions:
  - hard success gate >= 0.60: fail
  - no_world_model must hurt: fail; the 2026-03-25 post-fix null-rollout rerun remained at 0.5463, drop 0.0000
  - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
  - state metrics should improve over phase 1: fail, reocclusion rate increased (0.0000 -> 0.0121), persistence MAE worsened (1.9553 -> 2.2358), and calibration worsened
- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved (planner top-1 0.2832 -> 0.3442), but the post-fix null-rollout rerun still left mean success unchanged.
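
The "state metrics should improve over phase 1" gate compares metrics where lower is better. As a rough sketch of that comparison, using only the two state metrics reported for both phases (the `regressions` helper and the key names are hypothetical, not part of the repo):

```python
# State metrics reported for both phases; lower is better for each of them.
phase1 = {"reocclusion_rate": 0.0000, "persistence_mae": 1.9553}
phase2 = {"reocclusion_rate": 0.0121, "persistence_mae": 2.2358}


def regressions(before, after):
    """Return the increase for every metric that got worse (lower is better)."""
    return {
        key: round(after[key] - before[key], 4)
        for key in before
        if after[key] > before[key]
    }


worse = regressions(phase1, phase2)
print(worse)
```

Both metrics show up as regressions, which is consistent with the fail recorded for that gate.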
Phase 3
- RGB-only compatibility configs:
  - proxy_interaction_r3d_stage1_clip.yaml
  - proxy_interaction_r3d_stage2_clip.yaml
- RGB-D config: proxy_interaction_r3d_stage3_clip_rgbd.yaml
- Artifact roots:
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18
  - /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19
- RGB-only CLIP means:
  - stage 1 clip mean success: 0.5324
  - stage 2 clip mean success: 0.4954
- Stage 3 RGB-D means:
  - mean train time: 145.93 s
  - mean peak GPU memory: 1952.12 MB
  - mean success: 0.5741
  - foliage success: 0.4861
  - bag success: 0.5417
  - cloth success: 0.6944
  - reocclusion rate: 0.0151
  - persistence horizon MAE: 1.7883
  - disturbance cost: 0.2258
  - planner top-1: 0.3265
  - planner regret: 0.0157
  - proposal diversity: 0.0270
  - swap equivariance error: 0.000094
- no_depth ablation:
  - mean success: 0.5231
  - absolute drop vs full: 0.0509
  - bag success drops 0.5417 -> 0.4722
  - foliage success drops 0.4861 -> 0.4167
- Gate decisions:
  - CLIP hard success gate >= 0.37: pass
  - no_depth must hurt on at least one geometry-heavy proxy: pass
  - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
- Takeaway: the RGB-D path is the first phase that cleanly clears its acceptance gates.
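
The no_depth gate can be illustrated with the per-proxy numbers listed above. This sketch assumes that bag and foliage are the geometry-heavy proxies, which this section does not state explicitly; the variable names are hypothetical. Recomputing the mean drop from the rounded means gives 0.0510 rather than the reported 0.0509, presumably because the reported figure was taken from unrounded values.

```python
# Stage 3 RGB-D means with and without depth, copied from the lists above.
full = {"mean": 0.5741, "bag": 0.5417, "foliage": 0.4861}
no_depth = {"mean": 0.5231, "bag": 0.4722, "foliage": 0.4167}

# Assumption: these two proxies are the geometry-heavy ones for this gate.
geometry_heavy = ("bag", "foliage")

# Absolute drop in success when depth is removed.
drops = {key: round(full[key] - no_depth[key], 4) for key in full}

# The gate passes if removing depth hurts at least one geometry-heavy proxy.
depth_matters = any(drops[key] > 0 for key in geometry_heavy)

print(drops, depth_matters)
```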
Phase 4
- Unit tests:
  - command: PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests
  - result: 10 passed
- RLBench import/config smoke:
  - artifact: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt
  - status: pass
  - imports rlbench, pyrep, yarr all resolved
  - camera contract preserved: front, wrist_left, wrist_right at 224x224
- RLBench launch smoke:
  - artifact stdout: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt
  - artifact stderr: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr
  - status: pass
  - open_drawer resolves to RightOpenDrawer
  - finite 18-D action, camera shapes [224, 224, 3], no crash
- RLBench open-drawer rollout:
  - artifact: /workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json
  - status: pass as integration
  - no import errors
  - no historical workspace path error string
  - rollout JSON written
  - mean success remains 0.0, so this is plumbing evidence only
- PerAct2 13-task smoke:
  - artifact summary JSON: /workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json
  - artifact summary markdown: /workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md
  - status: pass as integration
  - all 13/13 tasks launched
  - finite action check: 13/13
  - summary JSON written
- Integration caveat:
  - full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched run_rlbench_rollout_eval attempt hit a Qt/OpenGL segfault after repeated env recycle, and the subprocess-isolated full rollout sweep was too slow to be a reasonable smoke. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and written summary.
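
The subprocess isolation mentioned in the caveat, combined with the finite-action check, might look roughly like the sketch below. It is illustrative only: the worker is a stand-in that prints a fixed 18-D action instead of launching a simulator, and `launch_isolated`, `WORKER`, and the one-entry task list are hypothetical names rather than the repo's actual smoke script.

```python
import json
import math
import subprocess
import sys

# Stand-in worker: a real one would launch one task and emit its first action.
# Here it only prints a fixed 18-D zero action so the sketch runs anywhere.
WORKER = (
    "import json, sys; "
    "print(json.dumps({'task': sys.argv[1], 'action': [0.0] * 18}))"
)


def launch_isolated(task, timeout_s=60):
    """Run one task in its own interpreter so a crash cannot end the sweep."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER, task],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"task": task, "launched": False, "finite": False}
    # On POSIX a segfault in the child appears as a negative return code.
    if proc.returncode != 0:
        return {"task": task, "launched": False, "finite": False}
    action = json.loads(proc.stdout)["action"]
    finite = len(action) == 18 and all(math.isfinite(a) for a in action)
    return {"task": task, "launched": True, "finite": finite}


tasks = ["open_drawer"]  # hypothetical list; the real smoke covers 13 tasks
summary = [launch_isolated(task) for task in tasks]
print(json.dumps(summary))
```

Starting a fresh interpreter per task trades speed for robustness, which matches the observation above that the fully isolated rollout sweep was too slow to serve as a smoke test.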
Final Decision
- Phase 1: not accepted
- Phase 2: not accepted
- Phase 3: accepted
- Phase 4 integration: accepted
Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner/world-model claims are still not validated strongly enough on the dummy proxy benchmark to support a stronger paper claim without more work.