# Phase Tracking
- Date closed: `2026-03-25 UTC`
- Snapshot note: this Hugging Face snapshot does not contain `.git`, so a git commit hash is unavailable.
- Regression baselines: `/workspace/VLAarchtests/regression/baselines.md`
- Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.
## Phase 0
- Status: completed.
- Historical baseline artifacts are locked in `/workspace/VLAarchtests/regression/baselines.md`.
- Historical dummy benchmark reference:
  - interaction `0.5278`
  - backbone `0.5556`
  - reveal `0.5417`
- Historical CLIP benchmark reference:
  - interaction `0.3056`
  - backbone `0.3333`
  - reveal `0.2083`
## Phase 1
- Config: `proxy_interaction_r3d_stage1_dummy.yaml`
- Seeds: `13, 14, 15`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15`
- Mean train time: `20.45 s`
- Mean peak GPU memory: `629.62 MB`
- 3-seed benchmark means:
  - mean success: `0.5787`
  - foliage success: `0.4444`
  - bag success: `0.6111`
  - cloth success: `0.6806`
  - reocclusion rate: `0.0000`
  - persistence horizon MAE: `1.9553`
  - disturbance cost: `0.3649`
  - planner top-1: `0.2832`
  - planner regret: `0.0143`
  - planner score/utility spearman: `0.2504`
  - role collapse: `0.0000`
  - proposal diversity: `0.0245`
  - swap equivariance error: `0.00768`
- Ablations:
  - `no_planner`: `0.5648` mean success, drop `0.0139`
  - `no_role_symmetry`: `0.5833` mean success, delta `+0.0046`
- Gate decisions:
  - hard success gate `>= 0.58`: fail by `0.0013`
  - planner must matter: fail, `no_planner` drop is only `0.0139`
  - planner top-1 `>= 0.30`: fail, measured `0.2832`
  - role symmetry must matter: fail, `no_role_symmetry` is slightly better than full
  - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
- Takeaway: the structure refactor improved over the historical interaction baseline (`0.5787` vs `0.5278`) and exceeded the historical dummy backbone baseline (`0.5556`), but it did not clear the phase-1 acceptance gates.
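
The two numeric Phase 1 gates can be re-derived from the reported means. The sketch below is illustrative only: the `Gate` class and its field names are hypothetical and are not part of the repository, and only the thresholds and measured values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Hypothetical container for one 'measured >= threshold' gate."""
    name: str
    measured: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.measured >= self.threshold

    @property
    def margin(self) -> float:
        # Positive: gate cleared by this much. Negative: shortfall.
        return round(self.measured - self.threshold, 4)


# Values copied from the Phase 1 report.
phase1_gates = [
    Gate("hard success", measured=0.5787, threshold=0.58),
    Gate("planner top-1", measured=0.2832, threshold=0.30),
]

# Ablation drop: full mean success minus the no_planner mean success.
no_planner_drop = round(0.5787 - 0.5648, 4)

for gate in phase1_gates:
    status = "pass" if gate.passed else "fail"
    print(f"{gate.name}: {status} (margin {gate.margin:+.4f})")
print(f"no_planner drop: {no_planner_drop:.4f}")
```

The hard success shortfall of `0.0013` and the `no_planner` drop of `0.0139` match the figures reported above.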
## Phase 2
- Config: `proxy_interaction_r3d_stage2_dummy.yaml`
- Seeds: `21, 22, 23`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23`
- Mean train time: `20.76 s`
- Mean peak GPU memory: `639.39 MB`
- 3-seed benchmark means:
  - mean success: `0.5463`
  - foliage success: `0.4444`
  - bag success: `0.5417`
  - cloth success: `0.6528`
  - reocclusion rate: `0.0121`
  - persistence horizon MAE: `2.2358`
  - disturbance cost: `0.3148`
  - planner top-1: `0.3442`
  - planner regret: `0.0208`
  - planner score/utility spearman: `0.2397`
  - belief calibration brier: `0.00842`
  - reocclusion calibration brier: `0.2745`
  - swap equivariance error: `0.00504`
- Ablations:
  - `no_world_model`: `0.5463` mean success, drop `0.0000`
  - `short_history`: `0.5463` mean success, delta `0.0000`
- Gate decisions:
  - hard success gate `>= 0.60`: fail, measured `0.5463`
  - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
  - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
  - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
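
The Phase 2 ablation and memory gates reduce to simple comparisons on the reported means. The following is a minimal sketch under two assumptions that the report does not state explicitly: the hard memory gate is read as "full is not worse than short-history" and the preferred gate as "full is strictly better". All function names are hypothetical.

```python
# Reported 3-seed mean success values from the Phase 2 list.
FULL = 0.5463
NO_WORLD_MODEL = 0.5463
SHORT_HISTORY = 0.5463


def ablation_drop(full: float, ablated: float) -> float:
    """Positive when removing the component hurts mean success."""
    return round(full - ablated, 4)


def memory_gates(full: float, short_history: float) -> dict:
    # Assumed reading: hard gate = not worse, preferred gate = strictly better.
    return {"hard": full >= short_history, "preferred": full > short_history}


# (phase 1, phase 2) pairs; lower is better for both metrics.
STATE_METRICS = {
    "reocclusion rate": (0.0000, 0.0121),
    "persistence horizon MAE": (1.9553, 2.2358),
}


def regressions(metrics: dict) -> dict:
    """Return the increase for every metric that got worse (went up)."""
    return {
        name: round(after - before, 4)
        for name, (before, after) in metrics.items()
        if after > before
    }


print("no_world_model drop:", ablation_drop(FULL, NO_WORLD_MODEL))
print("memory gates:", memory_gates(FULL, SHORT_HISTORY))
print("state regressions:", regressions(STATE_METRICS))
```

With identical means, the hard gate holds only through equality, which is why the report describes it as a narrow pass.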
## Phase 3
- RGB-only compatibility configs:
  - `proxy_interaction_r3d_stage1_clip.yaml`
  - `proxy_interaction_r3d_stage2_clip.yaml`
- RGB-D config: `proxy_interaction_r3d_stage3_clip_rgbd.yaml`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19`
- RGB-only CLIP means:
  - stage 1 clip mean success: `0.5324`
  - stage 2 clip mean success: `0.4954`
- Stage 3 RGB-D means:
  - mean train time: `145.93 s`
  - mean peak GPU memory: `1952.12 MB`
  - mean success: `0.5741`
  - foliage success: `0.4861`
  - bag success: `0.5417`
  - cloth success: `0.6944`
  - reocclusion rate: `0.0151`
  - persistence horizon MAE: `1.7883`
  - disturbance cost: `0.2258`
  - planner top-1: `0.3265`
  - planner regret: `0.0157`
  - proposal diversity: `0.0270`
  - swap equivariance error: `0.000094`
- `no_depth` ablation:
  - mean success: `0.5231`
  - absolute drop vs full: `0.0509`
  - bag success drops `0.5417 -> 0.4722`
  - foliage success drops `0.4861 -> 0.4167`
- Gate decisions:
  - CLIP hard success gate `>= 0.37`: pass
  - `no_depth` must hurt on at least one geometry-heavy proxy: pass
  - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
- Takeaway: the RGB-D path is the first phase that cleanly clears its acceptance gates.
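
The `no_depth` gate can be checked by differencing the reported per-proxy means. This is a sketch rather than the repository's evaluation code, and it assumes that foliage and bag are the geometry-heavy proxies, since those are the two the ablation list calls out. Recomputing the mean drop from the rounded values gives `0.0510` instead of the reported `0.0509`; the difference is presumably rounding of the underlying per-seed numbers.

```python
# Reported Stage 3 RGB-D means (full model) and no_depth ablation means.
full = {"mean": 0.5741, "foliage": 0.4861, "bag": 0.5417}
no_depth = {"mean": 0.5231, "foliage": 0.4167, "bag": 0.4722}

# Assumption: these are the proxies treated as geometry-heavy.
GEOMETRY_HEAVY = ("foliage", "bag")

# Positive drop means removing depth reduced success on that proxy.
drops = {name: round(full[name] - no_depth[name], 4) for name in no_depth}

# Gate: depth removal must hurt on at least one geometry-heavy proxy.
no_depth_hurts = any(drops[name] > 0 for name in GEOMETRY_HEAVY)

for name, drop in drops.items():
    print(f"{name}: drop {drop:.4f}")
print("no_depth gate:", "pass" if no_depth_hurts else "fail")
```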
## Phase 4
- Unit tests:
  - command: `PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests`
  - result: `10 passed`
- RLBench import/config smoke:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt`
  - status: pass
  - imports `rlbench`, `pyrep`, `yarr` all resolved
  - camera contract preserved: `front`, `wrist_left`, `wrist_right` at `224x224`
- RLBench launch smoke:
  - artifact stdout: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt`
  - artifact stderr: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr`
  - status: pass
  - `open_drawer` resolves to `RightOpenDrawer`
  - finite `18`-D action, camera shapes `[224, 224, 3]`, no crash
- RLBench open-drawer rollout:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json`
  - status: pass as integration
  - no import errors
  - no historical workspace path error string
  - rollout JSON written
  - mean success remains `0.0`, so this is plumbing evidence only
- PerAct2 13-task smoke:
  - artifact summary JSON: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json`
  - artifact summary markdown: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md`
  - status: pass as integration
  - all `13/13` tasks launched
  - finite action check: `13/13`
  - summary JSON written
- Integration caveat:
  - Full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched `run_rlbench_rollout_eval` attempt hit a Qt/OpenGL segfault after repeated environment recycling, and the subprocess-isolated full rollout sweep was too slow to serve as a reasonable smoke test. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and a written summary.
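
The subprocess isolation mentioned in the caveat can be illustrated with the standard library alone. This is a generic sketch of the pattern, not the repository's sweep code: `run_isolated` and the stand-in `CHILD` snippet are hypothetical, and a real sweep would launch the rollout in the child instead. The point is that a crash or hang in one child is recorded as a status and does not take down the parent loop.

```python
import subprocess
import sys


def run_isolated(task_names, child_code, timeout_s=60.0):
    """Run one fresh Python process per task and return {task: status}."""
    results = {}
    for task in task_names:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", child_code, task],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            results[task] = "timeout"
            continue
        # On POSIX, a negative return code means the child was killed by a
        # signal (for example -11 for a segfault); the parent keeps going.
        if proc.returncode == 0:
            results[task] = "ok"
        else:
            results[task] = f"failed({proc.returncode})"
    return results


# Stand-in child: exits non-zero for one hypothetical task name.
CHILD = "import sys; sys.exit(1 if sys.argv[1] == 'crashy_task' else 0)"

if __name__ == "__main__":
    print(run_isolated(["open_drawer", "crashy_task"], CHILD))
```

Each task pays the full interpreter and simulator start-up cost in this design, which is consistent with the report's observation that the isolated full sweep was too slow for a smoke test.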
## Final Decision
- Phase 1: not accepted
- Phase 2: not accepted
- Phase 3: accepted
- Phase 4 integration: accepted
Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner and world-model claims are not yet validated strongly enough on the dummy proxy benchmark to support a stronger paper claim.