# Phase Tracking

Date closed: `2026-03-25 UTC`

- Snapshot note: this Hugging Face snapshot does not contain `.git`, so a git commit hash is unavailable.
- Regression baselines: `/workspace/VLAarchtests/regression/baselines.md`
- Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.

## Phase 0

- Status: completed.
- Historical baseline artifacts are locked in `/workspace/VLAarchtests/regression/baselines.md`.
- Historical dummy benchmark reference:
  - interaction `0.5278`
  - backbone `0.5556`
  - reveal `0.5417`
- Historical CLIP benchmark reference:
  - interaction `0.3056`
  - backbone `0.3333`
  - reveal `0.2083`

## Phase 1

- Config: `proxy_interaction_r3d_stage1_dummy.yaml`
- Seeds: `13, 14, 15`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15`
- Mean train time: `20.45 s`
- Mean peak GPU memory: `629.62 MB`
- 3-seed benchmark means:
  - mean success: `0.5787`
  - foliage success: `0.4444`
  - bag success: `0.6111`
  - cloth success: `0.6806`
  - reocclusion rate: `0.0000`
  - persistence horizon MAE: `1.9553`
  - disturbance cost: `0.3649`
  - planner top-1: `0.2832`
  - planner regret: `0.0143`
  - planner score/utility spearman: `0.2504`
  - role collapse: `0.0000`
  - proposal diversity: `0.0245`
  - swap equivariance error: `0.00768`
- Ablations:
  - `no_planner`: `0.5648` mean success, drop `0.0139`
  - `no_role_symmetry`: `0.5833` mean success, delta `+0.0046`
- Gate decisions:
  - hard success gate `>= 0.58`: fail by `0.0013`
  - planner must matter: fail, `no_planner` drop is only `0.0139`
  - planner top-1 `>= 0.30`: fail, measured `0.2832`
  - role symmetry must matter: fail, `no_role_symmetry` is slightly better than full
  - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
- Takeaway: the structure refactor improved over the historical interaction baseline (`0.5787` vs `0.5278`) and exceeded the historical dummy backbone baseline (`0.5556`), but it did not clear the phase-1 acceptance gates.

## Phase 2

- Config: `proxy_interaction_r3d_stage2_dummy.yaml`
- Seeds: `21, 22, 23`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23`
- Mean train time: `20.76 s`
- Mean peak GPU memory: `639.39 MB`
- 3-seed benchmark means:
  - mean success: `0.5463`
  - foliage success: `0.4444`
  - bag success: `0.5417`
  - cloth success: `0.6528`
  - reocclusion rate: `0.0121`
  - persistence horizon MAE: `2.2358`
  - disturbance cost: `0.3148`
  - planner top-1: `0.3442`
  - planner regret: `0.0208`
  - planner score/utility spearman: `0.2397`
  - belief calibration brier: `0.00842`
  - reocclusion calibration brier: `0.2745`
  - swap equivariance error: `0.00504`
- Ablations:
  - `no_world_model`: `0.5463` mean success, drop `0.0000`
  - `short_history`: `0.5463` mean success, delta `0.0000`
- Gate decisions:
  - hard success gate `>= 0.60`: fail
  - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
  - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
  - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
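
The numeric gate decisions for Phases 1 and 2 can be re-derived from the reported means. The following is a minimal sketch, not the repository's own gate code: the `ThresholdGate` class and `ablation_drop` helper are hypothetical names introduced here, and the values are transcribed from the reports above. Gates without a stated numeric threshold (for example "planner must matter") are not modeled; only the raw ablation drops are shown for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdGate:
    """Hypothetical helper: a gate of the form `measured >= threshold`."""

    name: str
    measured: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.measured >= self.threshold

    @property
    def margin(self) -> float:
        # Positive when the gate clears, negative when it falls short.
        return self.measured - self.threshold


def ablation_drop(full: float, ablated: float) -> float:
    """Mean-success drop from an ablation (positive means the component helped)."""
    return full - ablated


# 3-seed mean success values transcribed from the Phase 1 and Phase 2 reports.
phase1_full = 0.5787
phase2_full = 0.5463

gates = [
    ThresholdGate("phase 1 hard success", phase1_full, 0.58),
    ThresholdGate("phase 1 planner top-1", 0.2832, 0.30),
    ThresholdGate("phase 2 hard success", phase2_full, 0.60),
]

drops = {
    "no_planner": ablation_drop(phase1_full, 0.5648),
    "no_role_symmetry": ablation_drop(phase1_full, 0.5833),
    "no_world_model": ablation_drop(phase2_full, 0.5463),
    "short_history": ablation_drop(phase2_full, 0.5463),
}

for gate in gates:
    verdict = "pass" if gate.passed else "fail"
    print(f"{gate.name}: {verdict} (margin {gate.margin:+.4f})")

for name, drop in drops.items():
    print(f"{name}: drop {drop:+.4f}")
```

The Phase 1 success margin comes out at roughly `-0.0013`, matching the "fail by `0.0013`" entry, and the negative drop for `no_role_symmetry` reflects that the ablated model scored slightly higher than the full model.
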
## Phase 3

- RGB-only compatibility configs:
  - `proxy_interaction_r3d_stage1_clip.yaml`
  - `proxy_interaction_r3d_stage2_clip.yaml`
- RGB-D config: `proxy_interaction_r3d_stage3_clip_rgbd.yaml`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19`
- RGB-only CLIP means:
  - stage 1 clip mean success: `0.5324`
  - stage 2 clip mean success: `0.4954`
- Stage 3 RGB-D means:
  - mean train time: `145.93 s`
  - mean peak GPU memory: `1952.12 MB`
  - mean success: `0.5741`
  - foliage success: `0.4861`
  - bag success: `0.5417`
  - cloth success: `0.6944`
  - reocclusion rate: `0.0151`
  - persistence horizon MAE: `1.7883`
  - disturbance cost: `0.2258`
  - planner top-1: `0.3265`
  - planner regret: `0.0157`
  - proposal diversity: `0.0270`
  - swap equivariance error: `0.000094`
- `no_depth` ablation:
  - mean success: `0.5231`
  - absolute drop vs full: `0.0509`
  - bag success drops `0.5417 -> 0.4722`
  - foliage success drops `0.4861 -> 0.4167`
- Gate decisions:
  - CLIP hard success gate `>= 0.37`: pass
  - `no_depth` must hurt on at least one geometry-heavy proxy: pass
  - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
- Takeaway: the RGB-D path is the first phase that cleanly clears its acceptance gates.

## Phase 4

- Unit tests:
  - command: `PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests`
  - result: `10 passed`
- RLBench import/config smoke:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt`
  - status: pass
  - imports `rlbench`, `pyrep`, `yarr` all resolved
  - camera contract preserved: `front`, `wrist_left`, `wrist_right` at `224x224`
- RLBench launch smoke:
  - artifact stdout: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt`
  - artifact stderr: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr`
  - status: pass
  - `open_drawer` resolves to `RightOpenDrawer`
  - finite `18`-D action, camera shapes `[224, 224, 3]`, no crash
- RLBench open-drawer rollout:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json`
  - status: pass as integration
  - no import errors
  - no historical workspace path error string
  - rollout JSON written
  - mean success remains `0.0`, so this is plumbing evidence only
- PerAct2 13-task smoke:
  - artifact summary JSON: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json`
  - artifact summary markdown: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md`
  - status: pass as integration
  - all `13/13` tasks launched
  - finite action check: `13/13`
  - summary JSON written
- Integration caveat:
  - full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched `run_rlbench_rollout_eval` attempt hit a Qt/OpenGL segfault after repeated env recycle, and the subprocess-isolated full rollout sweep was too slow to be a reasonable smoke. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and written summary.

## Final Decision

- Phase 1: not accepted
- Phase 2: not accepted
- Phase 3: accepted
- Phase 4 integration: accepted

Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner/world-model claims are still not validated strongly enough on the dummy proxy benchmark to support a stronger paper claim without more work.
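
The final decisions can be cross-checked against the per-phase gate outcomes. The sketch below is illustrative only: it assumes a phase is accepted exactly when every hard gate or integration check listed for it passes, which is consistent with the decisions above but is not stated as the formal rule in this report. The `gate_results` table and `decide` function are hypothetical names, and the Phase 2 "preferred" short-history gate is deliberately left out because only hard gates are modeled.

```python
# Gate outcomes transcribed from the phase reports (True = pass).
# Assumption: acceptance requires every listed gate or check to pass.
gate_results: dict[str, dict[str, bool]] = {
    "Phase 1": {
        "hard success >= 0.58": False,
        "planner must matter": False,
        "planner top-1 >= 0.30": False,
        "role symmetry must matter": False,
        "proposal collapse must not happen": True,
    },
    "Phase 2": {
        "hard success >= 0.60": False,
        "no_world_model must hurt": False,
        "full memory must stop losing to short history (hard gate)": True,
        "state metrics should improve over phase 1": False,
    },
    "Phase 3": {
        "CLIP hard success >= 0.37": True,
        "no_depth must hurt on a geometry-heavy proxy": True,
        "no RGB-only regression": True,
    },
    "Phase 4 integration": {
        "unit tests": True,
        "RLBench import/config smoke": True,
        "RLBench launch smoke": True,
        "RLBench open-drawer rollout (integration only)": True,
        "PerAct2 13-task launch smoke": True,
    },
}


def decide(gates: dict[str, bool]) -> str:
    """Return the acceptance label under the all-gates-must-pass assumption."""
    return "accepted" if all(gates.values()) else "not accepted"


decisions = {phase: decide(gates) for phase, gates in gate_results.items()}

for phase, decision in decisions.items():
    failed = [name for name, ok in gate_results[phase].items() if not ok]
    detail = f" (failed: {', '.join(failed)})" if failed else ""
    print(f"{phase}: {decision}{detail}")
```

Under this assumption the rollup reproduces the four decisions listed above, and the printed list of failed gates shows which evidence is still missing for Phases 1 and 2.
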