# Phase Tracking

Date closed: `2026-03-25 UTC`

- Snapshot note: this Hugging Face snapshot does not contain `.git`, so a git commit hash is unavailable.
- Regression baselines: `/workspace/VLAarchtests/regression/baselines.md`
- Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.

## Phase 0

- Status: completed.
- Historical baseline artifacts are locked in `/workspace/VLAarchtests/regression/baselines.md`.
- Historical dummy benchmark reference:
  - interaction `0.5278`
  - backbone `0.5556`
  - reveal `0.5417`
- Historical CLIP benchmark reference:
  - interaction `0.3056`
  - backbone `0.3333`
  - reveal `0.2083`

## Phase 1

- Config: `proxy_interaction_r3d_stage1_dummy.yaml`
- Seeds: `13, 14, 15`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15`
- Mean train time: `20.45 s`
- Mean peak GPU memory: `629.62 MB`
- 3-seed benchmark means:
  - mean success: `0.5787`
  - foliage success: `0.4444`
  - bag success: `0.6111`
  - cloth success: `0.6806`
  - reocclusion rate: `0.0000`
  - persistence horizon MAE: `1.9553`
  - disturbance cost: `0.3649`
  - planner top-1: `0.2832`
  - planner regret: `0.0143`
  - planner score/utility spearman: `0.2504`
  - role collapse: `0.0000`
  - proposal diversity: `0.0245`
  - swap equivariance error: `0.00768`
- Ablations:
  - `no_planner`: `0.5648` mean success, drop `0.0139`
  - `no_role_symmetry`: `0.5833` mean success, delta `+0.0046`
- Gate decisions:
  - hard success gate `>= 0.58`: fail by `0.0013`
  - planner must matter: fail, `no_planner` drop is only `0.0139`
  - planner top-1 `>= 0.30`: fail, measured `0.2832`
  - role symmetry must matter: fail, `no_role_symmetry` is slightly better than full
  - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
- Takeaway: the structure refactor improved over the historical interaction baseline (`0.5787` vs `0.5278`) and exceeded the historical dummy backbone baseline (`0.5556`), but it did not clear the phase-1 acceptance gates.
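
The gate arithmetic above can be rechecked directly from the reported means. The following is a minimal sketch, not code from the repository: all names are hypothetical, and it assumes that mean success is the unweighted average of the three per-proxy success rates, which is consistent with the reported `0.5787` but is not stated explicitly.

```python
# Hypothetical recheck of the phase-1 gate arithmetic from the reported 3-seed means.
PER_PROXY_SUCCESS = {"foliage": 0.4444, "bag": 0.6111, "cloth": 0.6806}
FULL_MEAN_SUCCESS = 0.5787
ABLATION_MEAN_SUCCESS = {"no_planner": 0.5648, "no_role_symmetry": 0.5833}
HARD_SUCCESS_GATE = 0.58
PLANNER_TOP1 = 0.2832
PLANNER_TOP1_GATE = 0.30


def unweighted_mean(values):
    """Plain average; assumes each proxy carries equal weight."""
    values = list(values)
    return sum(values) / len(values)


def gate_shortfall(measured, threshold):
    """Distance below a '>= threshold' gate, or 0.0 if the gate is met."""
    return max(0.0, threshold - measured)


recomputed_mean = round(unweighted_mean(PER_PROXY_SUCCESS.values()), 4)
success_shortfall = round(gate_shortfall(FULL_MEAN_SUCCESS, HARD_SUCCESS_GATE), 4)
top1_shortfall = round(gate_shortfall(PLANNER_TOP1, PLANNER_TOP1_GATE), 4)

# Positive delta means the ablation scored higher than the full model.
ablation_deltas = {
    name: round(value - FULL_MEAN_SUCCESS, 4)
    for name, value in ABLATION_MEAN_SUCCESS.items()
}

print("recomputed mean success:", recomputed_mean)
print("hard success gate shortfall:", success_shortfall)
print("planner top-1 gate shortfall:", top1_shortfall)
print("ablation deltas vs full:", ablation_deltas)
```

A negative delta for `no_planner` and a positive delta for `no_role_symmetry` match the two "must matter" gate outcomes listed above.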

## Phase 2

- Config: `proxy_interaction_r3d_stage2_dummy.yaml`
- Seeds: `21, 22, 23`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23`
- Mean train time: `20.76 s`
- Mean peak GPU memory: `639.39 MB`
- 3-seed benchmark means:
  - mean success: `0.5463`
  - foliage success: `0.4444`
  - bag success: `0.5417`
  - cloth success: `0.6528`
  - reocclusion rate: `0.0121`
  - persistence horizon MAE: `2.2358`
  - disturbance cost: `0.3148`
  - planner top-1: `0.3442`
  - planner regret: `0.0208`
  - planner score/utility spearman: `0.2397`
  - belief calibration brier: `0.00842`
  - reocclusion calibration brier: `0.2745`
  - swap equivariance error: `0.00504`
- Ablations:
  - `no_world_model`: `0.5463` mean success, drop `0.0000`
  - `short_history`: `0.5463` mean success, delta `0.0000`
- Gate decisions:
  - hard success gate `>= 0.60`: fail by `0.0537`, measured `0.5463`
  - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
  - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
  - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
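
The phase-2 comparisons can be reproduced from the numbers reported in phases 1 and 2. This is a minimal sketch with hypothetical names, not the repository's gate code. It treats reocclusion rate and persistence horizon MAE as lower-is-better, as the gate decision above does, and it assumes "must hurt" means a strictly positive drop in mean success, since no minimum drop is stated.

```python
# Hypothetical recheck of the phase-2 state-metric and ablation gates.
PHASE1_STATE = {"reocclusion_rate": 0.0000, "persistence_horizon_mae": 1.9553}
PHASE2_STATE = {"reocclusion_rate": 0.0121, "persistence_horizon_mae": 2.2358}

PHASE2_FULL_SUCCESS = 0.5463
PHASE2_ABLATION_SUCCESS = {"no_world_model": 0.5463, "short_history": 0.5463}


def regressions(before, after):
    """Return {metric: increase} for lower-is-better metrics that got worse."""
    return {
        name: round(after[name] - before[name], 4)
        for name in before
        if after[name] > before[name]
    }


state_regressions = regressions(PHASE1_STATE, PHASE2_STATE)
state_gate_passes = not state_regressions

# Drop in mean success when a component is removed; 0.0 means no measurable effect.
ablation_drops = {
    name: round(PHASE2_FULL_SUCCESS - value, 4)
    for name, value in PHASE2_ABLATION_SUCCESS.items()
}
no_world_model_hurts = ablation_drops["no_world_model"] > 0  # assumed reading of "must hurt"
full_not_worse_than_short = ablation_drops["short_history"] >= 0  # hard gate
full_beats_short = ablation_drops["short_history"] > 0  # preferred gate

print("state regressions:", state_regressions)
print("ablation drops:", ablation_drops)
```

Under these assumptions the hard memory gate holds only because the drop is exactly zero, which is why the preferred gate is reported as a failure.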

## Phase 3

- RGB-only compatibility configs:
  - `proxy_interaction_r3d_stage1_clip.yaml`
  - `proxy_interaction_r3d_stage2_clip.yaml`
- RGB-D config: `proxy_interaction_r3d_stage3_clip_rgbd.yaml`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19`
- RGB-only CLIP means:
  - stage 1 clip mean success: `0.5324`
  - stage 2 clip mean success: `0.4954`
- Stage 3 RGB-D means:
  - mean train time: `145.93 s`
  - mean peak GPU memory: `1952.12 MB`
  - mean success: `0.5741`
  - foliage success: `0.4861`
  - bag success: `0.5417`
  - cloth success: `0.6944`
  - reocclusion rate: `0.0151`
  - persistence horizon MAE: `1.7883`
  - disturbance cost: `0.2258`
  - planner top-1: `0.3265`
  - planner regret: `0.0157`
  - proposal diversity: `0.0270`
  - swap equivariance error: `0.000094`
- `no_depth` ablation:
  - mean success: `0.5231`
  - absolute drop vs full: `0.0509`
  - bag success drops `0.5417 -> 0.4722`
  - foliage success drops `0.4861 -> 0.4167`
- Gate decisions:
  - CLIP hard success gate `>= 0.37`: pass
  - `no_depth` must hurt on at least one geometry-heavy proxy: pass
  - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
- Takeaway: phase 3, the RGB-D path, is the first phase to clear all of its acceptance gates.
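
The `no_depth` gate can be rechecked from the reported means. This is a minimal sketch with hypothetical names; it assumes that bag and foliage are the geometry-heavy proxies the gate refers to, since those are the two per-proxy drops the report lists. Recomputing the overall drop from the rounded means gives `0.0510` rather than the reported `0.0509`, presumably because the report subtracted unrounded values.

```python
# Hypothetical recheck of the phase-3 depth-ablation and CLIP success gates.
FULL_RGBD = {"mean": 0.5741, "bag": 0.5417, "foliage": 0.4861}
NO_DEPTH = {"mean": 0.5231, "bag": 0.4722, "foliage": 0.4167}
GEOMETRY_HEAVY = ("bag", "foliage")  # assumption: not named explicitly in the gate
CLIP_HARD_SUCCESS_GATE = 0.37

# Success lost when depth is removed, per reported metric.
depth_drops = {
    name: round(FULL_RGBD[name] - NO_DEPTH[name], 4) for name in FULL_RGBD
}

# The gate asks for a drop on at least one geometry-heavy proxy.
no_depth_hurts = any(depth_drops[name] > 0 for name in GEOMETRY_HEAVY)

clip_gate_passes = FULL_RGBD["mean"] >= CLIP_HARD_SUCCESS_GATE
clip_gate_margin = round(FULL_RGBD["mean"] - CLIP_HARD_SUCCESS_GATE, 4)

print("depth drops:", depth_drops)
print("no_depth hurts a geometry-heavy proxy:", no_depth_hurts)
print("CLIP gate margin:", clip_gate_margin)
```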

## Phase 4

- Unit tests:
  - command: `PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests`
  - result: `10 passed`
- RLBench import/config smoke:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt`
  - status: pass
  - imports `rlbench`, `pyrep`, `yarr` all resolved
  - camera contract preserved: `front`, `wrist_left`, `wrist_right` at `224x224`
- RLBench launch smoke:
  - artifact stdout: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt`
  - artifact stderr: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr`
  - status: pass
  - `open_drawer` resolves to `RightOpenDrawer`
  - finite `18`-D action, camera shapes `[224, 224, 3]`, no crash
- RLBench open-drawer rollout:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json`
  - status: pass as integration
  - no import errors
  - no historical workspace path error string
  - rollout JSON written
  - mean success remains `0.0`, so this is plumbing evidence only
- PerAct2 13-task smoke:
  - artifact summary JSON: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json`
  - artifact summary markdown: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md`
  - status: pass as integration
  - all `13/13` tasks launched
  - finite action check: `13/13`
  - summary JSON written
- Integration caveat:
  - full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched `run_rlbench_rollout_eval` attempt hit a Qt/OpenGL segfault after repeated environment recycling, and the subprocess-isolated full rollout sweep was too slow to serve as a smoke test. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and a written summary.
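
The subprocess isolation and finite-action check described above can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not the repository's smoke runner: `run_isolated`, `action_is_finite`, and `smoke_one` are hypothetical names, and the child commands are stand-ins that print a JSON action vector instead of launching RLBench or CoppeliaSim.

```python
import json
import math
import subprocess
import sys

EXPECTED_ACTION_DIM = 18  # action size reported by the launch smoke


def run_isolated(cmd, timeout_s=30.0):
    """Run one command in a fresh process so a crash cannot take down the sweep."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"returncode": None, "timed_out": True, "stdout": ""}
    # On POSIX, a child killed by a signal (such as SIGSEGV) has a negative returncode.
    return {"returncode": proc.returncode, "timed_out": False, "stdout": proc.stdout}


def action_is_finite(action, expected_dim=EXPECTED_ACTION_DIM):
    """True if the action has the expected length and contains no NaN or inf."""
    return len(action) == expected_dim and all(math.isfinite(x) for x in action)


def smoke_one(cmd):
    """Launch one stand-in task and record the three facts the gate asks for."""
    result = run_isolated(cmd)
    launched = result["returncode"] == 0
    finite = False
    if launched:
        try:
            finite = action_is_finite(json.loads(result["stdout"]))
        except (ValueError, TypeError):
            finite = False  # unparsable or non-numeric output counts as a failure
    return {"launched": launched, "finite_action": finite, "timed_out": result["timed_out"]}


# Stand-in children: one emits a finite 18-D action, one exits with an error.
OK_CHILD = [sys.executable, "-c", "import json; print(json.dumps([0.0] * 18))"]
CRASH_CHILD = [sys.executable, "-c", "import sys; sys.exit(1)"]

summary = {"ok_task": smoke_one(OK_CHILD), "crash_task": smoke_one(CRASH_CHILD)}
print(json.dumps(summary, indent=2))
```

Running each task in its own process trades speed for robustness, which is consistent with the report's observation that the isolated full rollout sweep was too slow while the launch smoke remained practical.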

## Final Decision

- Phase 1: not accepted
- Phase 2: not accepted
- Phase 3: accepted
- Phase 4 integration: accepted

Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner and world-model claims are not yet validated on the dummy proxy benchmark, so they cannot support a stronger paper claim without more work.