| # Phase Tracking |
|
|
| Date closed: `2026-03-25 UTC` |
|
|
| - Snapshot note: this Hugging Face snapshot does not contain `.git`, so a git commit hash is unavailable. |
| - Regression baselines: `/workspace/VLAarchtests/regression/baselines.md` |
| - Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks. |
|
|
| ## Phase 0 |
|
|
| - Status: completed. |
| - Historical baseline artifacts are locked in `/workspace/VLAarchtests/regression/baselines.md`. |
| - Historical dummy benchmark reference: |
| - interaction `0.5278` |
| - backbone `0.5556` |
| - reveal `0.5417` |
| - Historical CLIP benchmark reference: |
| - interaction `0.3056` |
| - backbone `0.3333` |
| - reveal `0.2083` |
|
|
| ## Phase 1 |
|
|
| - Config: `proxy_interaction_r3d_stage1_dummy.yaml` |
| - Seeds: `13, 14, 15` |
| - Artifact roots: |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15` |
| - Mean train time: `20.45 s` |
| - Mean peak GPU memory: `629.62 MB` |
| - 3-seed benchmark means: |
| - mean success: `0.5787` |
| - foliage success: `0.4444` |
| - bag success: `0.6111` |
| - cloth success: `0.6806` |
| - reocclusion rate: `0.0000` |
| - persistence horizon MAE: `1.9553` |
| - disturbance cost: `0.3649` |
| - planner top-1: `0.2832` |
| - planner regret: `0.0143` |
| - planner score/utility spearman: `0.2504` |
| - role collapse: `0.0000` |
| - proposal diversity: `0.0245` |
| - swap equivariance error: `0.00768` |
| - Ablations: |
| - `no_planner`: `0.5648` mean success, drop `0.0139` |
| - `no_role_symmetry`: `0.5833` mean success, delta `+0.0046` |
| - Gate decisions: |
| - hard success gate `>= 0.58`: fail by `0.0013` |
| - planner must matter: fail, `no_planner` drop is only `0.0139` |
| - planner top-1 `>= 0.30`: fail, measured `0.2832` |
| - role symmetry must matter: fail, `no_role_symmetry` is slightly better than full |
| - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds |
| - Takeaway: the structure refactor improved over the historical interaction baseline (`0.5787` vs `0.5278`) and exceeded the historical dummy backbone baseline (`0.5556`), but it did not clear the phase-1 acceptance gates. |
|
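The Phase 1 gate arithmetic can be re-derived from the 3-seed means above. The following is a minimal sketch, not the repository's evaluation code: every variable name is hypothetical, and treating mean success as the unweighted mean of the three per-task successes is an assumption (it does reproduce the reported `0.5787`). The "planner must matter" gate is left out because the report does not state its numeric threshold.

```python
# Illustrative re-derivation of the Phase 1 gates; not the repository's code.
# All names are hypothetical; values are the 3-seed means reported above.
per_task = {"foliage": 0.4444, "bag": 0.6111, "cloth": 0.6806}

# Assumption: mean success is the unweighted mean over the three proxy tasks.
mean_success = round(sum(per_task.values()) / len(per_task), 4)

full = 0.5787
ablations = {"no_planner": 0.5648, "no_role_symmetry": 0.5833}
# Positive drop: the ablation hurt. Negative drop: the ablation helped.
drops = {name: round(full - value, 4) for name, value in ablations.items()}

gates = {
    "mean success >= 0.58": full >= 0.58,
    "planner top-1 >= 0.30": 0.2832 >= 0.30,
    "no_role_symmetry hurts": drops["no_role_symmetry"] > 0,
}
shortfall = round(0.58 - full, 4)

print(mean_success, drops, shortfall)
for name, passed in gates.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Written this way, the sign convention is explicit: a positive drop means the ablation hurt, so the negative `no_role_symmetry` drop is what fails the role-symmetry gate.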
|
| ## Phase 2 |
|
|
| - Config: `proxy_interaction_r3d_stage2_dummy.yaml` |
| - Seeds: `21, 22, 23` |
| - Artifact roots: |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23` |
| - Mean train time: `20.76 s` |
| - Mean peak GPU memory: `639.39 MB` |
| - 3-seed benchmark means: |
| - mean success: `0.5463` |
| - foliage success: `0.4444` |
| - bag success: `0.5417` |
| - cloth success: `0.6528` |
| - reocclusion rate: `0.0121` |
| - persistence horizon MAE: `2.2358` |
| - disturbance cost: `0.3148` |
| - planner top-1: `0.3442` |
| - planner regret: `0.0208` |
| - planner score/utility spearman: `0.2397` |
| - belief calibration brier: `0.00842` |
| - reocclusion calibration brier: `0.2745` |
| - swap equivariance error: `0.00504` |
| - Ablations: |
| - `no_world_model`: `0.5463` mean success, drop `0.0000` |
| - `short_history`: `0.5463` mean success, delta `0.0000` |
| - Gate decisions: |
| - hard success gate `>= 0.60`: fail by `0.0537` (measured `0.5463`) |
| - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000` |
| - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history |
| - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened |
| - Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged. |
|
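The two calibration metrics above are Brier scores. As a reference point, here is a minimal sketch of the standard binary Brier score (mean squared error between a predicted probability and a 0/1 outcome). That the repository computes its metrics in exactly this form is an assumption; the function name and the toy data are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Toy data, invented for illustration only (not taken from any run).
outcomes = [1, 0, 1, 0]
confident = brier_score([0.9, 0.1, 0.95, 0.05], outcomes)
uninformed = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)
print(confident, uninformed)
```

Under this definition a constant prediction of `0.5` scores `0.25` on any binary target, which gives a rough reference point when reading the two Brier values above.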
|
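The difference between the hard and preferred memory gates, and the size of the state-metric regressions, can be made explicit with a short sketch. It is illustrative only: the names are hypothetical, and reading the hard gate as "not worse than" and the preferred gate as "strictly better than" is inferred from the wording of the gate decisions above.

```python
# Hypothetical names; values are the reported 3-seed means.
full, short_history, no_world_model = 0.5463, 0.5463, 0.5463

hard_gate = full >= short_history       # full memory must not lose
preferred_gate = full > short_history   # full memory should strictly win
world_model_drop = round(full - no_world_model, 4)

# Lower is better for both metrics: (phase 1 value, phase 2 value).
state_metrics = {
    "reocclusion rate": (0.0000, 0.0121),
    "persistence horizon MAE": (1.9553, 2.2358),
}
# Positive change means the metric got worse from phase 1 to phase 2.
regressions = {k: round(p2 - p1, 4) for k, (p1, p2) in state_metrics.items()}

print(hard_gate, preferred_gate, world_model_drop)
print(regressions)
```

A tie satisfies the non-strict comparison but not the strict one, which is why the same pair of numbers passes the hard gate and fails the preferred gate.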
| ## Phase 3 |
|
|
| - RGB-only compatibility configs: |
| - `proxy_interaction_r3d_stage1_clip.yaml` |
| - `proxy_interaction_r3d_stage2_clip.yaml` |
| - RGB-D config: `proxy_interaction_r3d_stage3_clip_rgbd.yaml` |
| - Artifact roots: |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18` |
| - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19` |
| - RGB-only CLIP means: |
| - stage 1 clip mean success: `0.5324` |
| - stage 2 clip mean success: `0.4954` |
| - Stage 3 RGB-D means: |
| - mean train time: `145.93 s` |
| - mean peak GPU memory: `1952.12 MB` |
| - mean success: `0.5741` |
| - foliage success: `0.4861` |
| - bag success: `0.5417` |
| - cloth success: `0.6944` |
| - reocclusion rate: `0.0151` |
| - persistence horizon MAE: `1.7883` |
| - disturbance cost: `0.2258` |
| - planner top-1: `0.3265` |
| - planner regret: `0.0157` |
| - proposal diversity: `0.0270` |
| - swap equivariance error: `0.000094` |
| - `no_depth` ablation: |
| - mean success: `0.5231` |
| - absolute drop vs full: `0.0509` |
| - bag success drops `0.5417 -> 0.4722` |
| - foliage success drops `0.4861 -> 0.4167` |
| - Gate decisions: |
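| - CLIP hard success gate `>= 0.37`: pass; every CLIP mean success reported above (`0.5324`, `0.4954`, `0.5741`) clears it |
| - `no_depth` must hurt on at least one geometry-heavy proxy: pass, bag and foliage success both drop |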
| - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics |
| - Takeaway: the RGB-D phase is the first to clear all of its acceptance gates. |
|
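The `no_depth` ablation deltas can be recomputed from the rounded figures above. This is a sketch with hypothetical names, not the repository's code. Cloth is omitted because no `no_depth` cloth value is reported, and treating foliage and bag as the geometry-heavy proxies is an assumption.

```python
# Values are the reported Stage 3 RGB-D means; names are hypothetical.
full = {"mean": 0.5741, "foliage": 0.4861, "bag": 0.5417}
no_depth = {"mean": 0.5231, "foliage": 0.4167, "bag": 0.4722}

# Positive drop means removing depth hurt that metric.
drops = {k: round(full[k] - no_depth[k], 4) for k in full}

# Assumption: foliage and bag are the geometry-heavy proxies for this gate.
hurt_tasks = [k for k in ("foliage", "bag") if drops[k] > 0]
gate_pass = len(hurt_tasks) >= 1

print(drops, hurt_tasks, gate_pass)
```

Recomputing from the four-decimal means gives a mean drop of `0.0510` rather than the reported `0.0509`; a difference of that size is consistent with the reported figure having been computed before rounding.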
|
| ## Phase 4 |
|
|
| - Unit tests: |
| - command: `PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests` |
| - result: `10 passed` |
| - RLBench import/config smoke: |
| - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt` |
| - status: pass |
| - imports `rlbench`, `pyrep`, `yarr` all resolved |
| - camera contract preserved: `front`, `wrist_left`, `wrist_right` at `224x224` |
| - RLBench launch smoke: |
| - artifact stdout: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt` |
| - artifact stderr: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr` |
| - status: pass |
| - `open_drawer` resolves to `RightOpenDrawer` |
| - finite `18`-D action, camera shapes `[224, 224, 3]`, no crash |
| - RLBench open-drawer rollout: |
| - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json` |
| - status: pass as integration |
| - no import errors |
| - no occurrence of the historical workspace path error string |
| - rollout JSON written |
| - mean success remains `0.0`, so this is plumbing evidence only |
| - PerAct2 13-task smoke: |
| - artifact summary JSON: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json` |
| - artifact summary markdown: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md` |
| - status: pass as integration |
| - all `13/13` tasks launched |
| - finite action check: `13/13` |
| - summary JSON written |
| - Integration caveat: |
| - full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched `run_rlbench_rollout_eval` attempt hit a Qt/OpenGL segfault after repeated environment recycling, and the subprocess-isolated full rollout sweep was too slow to serve as a smoke test. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and a written summary. |
|
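The subprocess isolation mentioned in the integration caveat can be sketched with the standard library alone. This is a generic pattern under stated assumptions, not the repository's sweep: `run_isolated` and the `CHILD` script are hypothetical, the child is a placeholder that only simulates a launch, and nothing here calls `run_rlbench_rollout_eval` or CoppeliaSim.

```python
import json
import subprocess
import sys


def run_isolated(task: str, child_code: str, timeout_s: float = 60.0) -> dict:
    """Run one task in a fresh interpreter so a native crash cannot kill the sweep."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, task],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"task": task, "status": "timeout"}
    if proc.returncode != 0:
        # On POSIX a negative return code means the child died from a signal
        # (for example -11 for SIGSEGV).
        return {"task": task, "status": "crashed", "returncode": proc.returncode}
    return {"task": task, "status": "ok", "result": json.loads(proc.stdout)}


# Placeholder child: stands in for a real per-task rollout entry point.
CHILD = (
    "import json, sys\n"
    "task = sys.argv[1]\n"
    "if task == 'crashy_task':\n"
    "    sys.exit(3)  # simulate an abnormal exit\n"
    "print(json.dumps({'task': task, 'launched': True}))\n"
)

summary = [run_isolated(t, CHILD) for t in ("open_drawer", "crashy_task")]
print(json.dumps(summary, indent=2))
```

The trade-off is the one the caveat describes: each task pays full interpreter and simulator start-up cost, which is what makes an isolated full sweep slow compared with a launch-only smoke.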
|
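A finite-action check of the kind listed in the smoke results could look like the sketch below. The helper name, the task names, and the zero-valued actions are placeholders. The `18`-D size comes from the RLBench launch smoke above; applying the same dimensionality to all 13 PerAct2 tasks is an assumption.

```python
import math
from typing import Sequence


def is_finite_action(action: Sequence[float], expected_dim: int = 18) -> bool:
    """True if the action has the expected length and no NaN/inf entries."""
    return len(action) == expected_dim and all(math.isfinite(a) for a in action)


# Hypothetical per-task actions: task names and values are placeholders.
actions = {f"task_{i:02d}": [0.0] * 18 for i in range(13)}
finite = sum(is_finite_action(a) for a in actions.values())
print(f"finite action check: {finite}/{len(actions)}")

# A single NaN, or a wrong dimensionality, fails the check.
nan_ok = is_finite_action([0.0] * 17 + [float("nan")])
short_ok = is_finite_action([0.0] * 17)
print(nan_ok, short_ok)
```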
| ## Final Decision |
|
|
| - Phase 1: not accepted |
| - Phase 2: not accepted |
| - Phase 3: accepted |
| - Phase 4 integration: accepted |
|
|
| Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner and world-model claims are not yet validated well enough on the dummy proxy benchmark to support a stronger paper claim without further work. |
|
|