	# Phase Tracking

	Date closed: `2026-03-25 UTC`

	- Snapshot note: this Hugging Face snapshot does not contain `.git`, so a git commit hash is unavailable.
	- Regression baselines: `/workspace/VLAarchtests/regression/baselines.md`
	- Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.

	## Phase 0

	- Status: completed.
	- Historical baseline artifacts are locked in `/workspace/VLAarchtests/regression/baselines.md`.
	- Historical dummy benchmark reference:
	  - interaction `0.5278`
	  - backbone `0.5556`
	  - reveal `0.5417`
	- Historical CLIP benchmark reference:
	  - interaction `0.3056`
	  - backbone `0.3333`
	  - reveal `0.2083`

	## Phase 1

	- Config: `proxy_interaction_r3d_stage1_dummy.yaml`
	- Seeds: `13, 14, 15`
	- Artifact roots:
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15`
	- Mean train time: `20.45 s`
	- Mean peak GPU memory: `629.62 MB`
	- 3-seed benchmark means:
	  - mean success: `0.5787`
	  - foliage success: `0.4444`
	  - bag success: `0.6111`
	  - cloth success: `0.6806`
	  - reocclusion rate: `0.0000`
	  - persistence horizon MAE: `1.9553`
	  - disturbance cost: `0.3649`
	  - planner top-1: `0.2832`
	  - planner regret: `0.0143`
	  - planner score/utility spearman: `0.2504`
	  - role collapse: `0.0000`
	  - proposal diversity: `0.0245`
	  - swap equivariance error: `0.00768`
	- Ablations:
	  - `no_planner`: `0.5648` mean success, drop `0.0139`
	  - `no_role_symmetry`: `0.5833` mean success, delta `+0.0046`
	- Gate decisions:
	  - hard success gate `>= 0.58`: fail by `0.0013`
	  - planner must matter: fail, `no_planner` drop is only `0.0139`
	  - planner top-1 `>= 0.30`: fail, measured `0.2832`
	  - role symmetry must matter: fail, `no_role_symmetry` is slightly better than full
	  - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
	- Takeaway: the structure refactor improved over the historical interaction baseline (`0.5787` vs `0.5278`) and exceeded the historical dummy backbone baseline (`0.5556`), but it did not clear the phase-1 acceptance gates.
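
The Phase 1 gate arithmetic can be re-derived from the reported means. This is a minimal sketch, not the repo's evaluation code: the variable names are hypothetical, and it assumes that mean success is the unweighted mean of the three per-proxy success rates, which is consistent with the reported `0.5787`. The "planner must matter" threshold is not stated in this log, so only the drop is computed.

```python
# Reported Phase 1 three-seed means, copied from the list above.
per_proxy_success = {"foliage": 0.4444, "bag": 0.6111, "cloth": 0.6806}
planner_top1 = 0.2832
no_planner_success = 0.5648
no_role_symmetry_success = 0.5833

# Gate thresholds as stated in the gate decisions.
SUCCESS_GATE = 0.58
TOP1_GATE = 0.30

# Assumption: mean success is the unweighted mean over the three proxies.
mean_success = sum(per_proxy_success.values()) / len(per_proxy_success)

# Margins and ablation deltas relative to the full model.
success_margin = mean_success - SUCCESS_GATE
no_planner_drop = mean_success - no_planner_success
no_role_symmetry_delta = no_role_symmetry_success - mean_success

gates = {
    "hard_success": mean_success >= SUCCESS_GATE,
    "planner_top1": planner_top1 >= TOP1_GATE,
    # Role symmetry only "matters" if removing it lowers mean success.
    "role_symmetry_matters": no_role_symmetry_success < mean_success,
}

print(f"mean success {mean_success:.4f}, margin {success_margin:+.4f}")
print(f"no_planner drop {no_planner_drop:.4f}")
print(f"no_role_symmetry delta {no_role_symmetry_delta:+.4f}")
print(gates)
```

All three encoded gates evaluate to `False`, matching the recorded fail decisions.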

	## Phase 2

	- Config: `proxy_interaction_r3d_stage2_dummy.yaml`
	- Seeds: `21, 22, 23`
	- Artifact roots:
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23`
	- Mean train time: `20.76 s`
	- Mean peak GPU memory: `639.39 MB`
	- 3-seed benchmark means:
	  - mean success: `0.5463`
	  - foliage success: `0.4444`
	  - bag success: `0.5417`
	  - cloth success: `0.6528`
	  - reocclusion rate: `0.0121`
	  - persistence horizon MAE: `2.2358`
	  - disturbance cost: `0.3148`
	  - planner top-1: `0.3442`
	  - planner regret: `0.0208`
	  - planner score/utility spearman: `0.2397`
	  - belief calibration brier: `0.00842`
	  - reocclusion calibration brier: `0.2745`
	  - swap equivariance error: `0.00504`
	- Ablations:
	  - `no_world_model`: `0.5463` mean success, drop `0.0000`
	  - `short_history`: `0.5463` mean success, delta `0.0000`
	- Gate decisions:
	  - hard success gate `>= 0.60`: fail, measured `0.5463`
	  - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
	  - full memory must stop losing to short history: the hard gate passes narrowly because full equals short-history; the preferred gate fails because full does not beat short-history
	  - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
	- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
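
The difference between the hard and preferred memory gates, and the phase-over-phase state-metric regression, can be written out explicitly. This is a sketch under stated assumptions: the names are hypothetical, the hard gate is read as "full is not worse than short-history" and the preferred gate as "full is strictly better", following the wording of the gate decisions, and both state metrics are treated as lower-is-better.

```python
# Reported mean success for the full Phase 2 model and its ablations.
full_success = 0.5463
short_history_success = 0.5463
no_world_model_success = 0.5463

# Assumption: hard gate = not losing, preferred gate = strictly winning.
memory_hard_gate = full_success >= short_history_success
memory_preferred_gate = full_success > short_history_success

# The world model only "hurts when removed" if the ablation scores lower.
world_model_matters = no_world_model_success < full_success

# State metrics reported in both phases (lower is better for both).
phase1_state = {"reocclusion_rate": 0.0000, "persistence_horizon_mae": 1.9553}
phase2_state = {"reocclusion_rate": 0.0121, "persistence_horizon_mae": 2.2358}

# Collect every metric that got worse, with the size of the increase.
regressions = {
    name: round(phase2_state[name] - phase1_state[name], 4)
    for name in phase1_state
    if phase2_state[name] > phase1_state[name]
}

print(memory_hard_gate, memory_preferred_gate, world_model_matters)
print(regressions)
```

With equal scores, the hard gate is the only one of the three booleans that holds, which is why the log describes it as passing narrowly.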

	## Phase 3

	- RGB-only compatibility configs:
	  - `proxy_interaction_r3d_stage1_clip.yaml`
	  - `proxy_interaction_r3d_stage2_clip.yaml`
	- RGB-D config: `proxy_interaction_r3d_stage3_clip_rgbd.yaml`
	- Artifact roots:
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18`
	  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19`
	- RGB-only CLIP means:
	  - stage 1 clip mean success: `0.5324`
	  - stage 2 clip mean success: `0.4954`
	- Stage 3 RGB-D means:
	  - mean train time: `145.93 s`
	  - mean peak GPU memory: `1952.12 MB`
	  - mean success: `0.5741`
	  - foliage success: `0.4861`
	  - bag success: `0.5417`
	  - cloth success: `0.6944`
	  - reocclusion rate: `0.0151`
	  - persistence horizon MAE: `1.7883`
	  - disturbance cost: `0.2258`
	  - planner top-1: `0.3265`
	  - planner regret: `0.0157`
	  - proposal diversity: `0.0270`
	  - swap equivariance error: `0.000094`
	- `no_depth` ablation:
	  - mean success: `0.5231`
	  - absolute drop vs full: `0.0509`
	  - bag success drops `0.5417 -> 0.4722`
	  - foliage success drops `0.4861 -> 0.4167`
	- Gate decisions:
	  - CLIP hard success gate `>= 0.37`: pass
	  - `no_depth` must hurt on at least one geometry-heavy proxy: pass
	  - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
	- Takeaway: the RGB-D path makes Phase 3 the first phase to clear all of its acceptance gates.
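
The `no_depth` gate can be checked per proxy from the reported numbers. This is an illustrative sketch with hypothetical names. It assumes bag and foliage are the geometry-heavy proxies, since those are the two for which `no_depth` values are reported, and it checks the CLIP success gate against all three CLIP configs because the log does not say which one the gate applies to.

```python
CLIP_SUCCESS_GATE = 0.37

# Reported mean success per CLIP config.
clip_mean_success = {
    "stage1_clip": 0.5324,
    "stage2_clip": 0.4954,
    "stage3_clip_rgbd": 0.5741,
}

# Per-proxy success for the full RGB-D model and the no_depth ablation.
# Assumption: these two proxies are the geometry-heavy ones.
full_success = {"bag": 0.5417, "foliage": 0.4861}
no_depth_success = {"bag": 0.4722, "foliage": 0.4167}

# Positive values mean that removing depth lowered success on that proxy.
depth_drops = {
    proxy: round(full_success[proxy] - no_depth_success[proxy], 4)
    for proxy in full_success
}

# The gate needs a drop on at least one proxy, not on all of them.
depth_matters = any(drop > 0 for drop in depth_drops.values())
clears_success_gate = {
    name: value >= CLIP_SUCCESS_GATE for name, value in clip_mean_success.items()
}

print(depth_drops)
print(depth_matters, clears_success_gate)
```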

	## Phase 4

	- Unit tests:
	  - command: `PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests`
	  - result: `10 passed`
	- RLBench import/config smoke:
	  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt`
	  - status: pass
	  - imports `rlbench`, `pyrep`, `yarr` all resolved
	  - camera contract preserved: `front`, `wrist_left`, `wrist_right` at `224x224`
	- RLBench launch smoke:
	  - artifact stdout: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt`
	  - artifact stderr: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr`
	  - status: pass
	  - `open_drawer` resolves to `RightOpenDrawer`
	  - finite `18`-D action, camera shapes `[224, 224, 3]`, no crash
	- RLBench open-drawer rollout:
	  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json`
	  - status: pass as integration
	  - no import errors
	  - no historical workspace path error string
	  - rollout JSON written
	  - mean success remains `0.0`, so this is plumbing evidence only
	- PerAct2 13-task smoke:
	  - artifact summary JSON: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json`
	  - artifact summary markdown: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md`
	  - status: pass as integration
	  - all `13/13` tasks launched
	  - finite action check: `13/13`
	  - summary JSON written
	- Integration caveat:
	  - full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched `run_rlbench_rollout_eval` attempt hit a Qt/OpenGL segfault after repeated env recycle, and the subprocess-isolated full rollout sweep was too slow to be a reasonable smoke. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and written summary.
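
The subprocess isolation mentioned in the caveat follows a common pattern: run each task in a fresh interpreter so that a native crash ends only that task. The sketch below is a generic illustration using Python's standard `subprocess` module, not the repo's sweep code. `run_isolated` and the stand-in worker command are hypothetical, `bad_task` is an invented task name used to simulate a failure, and the real entry point would replace the worker.

```python
import subprocess
import sys


def run_isolated(task: str, worker_cmd: list[str], timeout_s: float) -> dict:
    """Run one task in a child process and report how it ended."""
    try:
        proc = subprocess.run(
            worker_cmd + [task],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"task": task, "status": "timeout", "returncode": None}
    # On POSIX, a negative return code means the child was killed by a
    # signal (for example -11 for SIGSEGV), so the parent survives it.
    status = "ok" if proc.returncode == 0 else "crashed"
    return {"task": task, "status": status, "returncode": proc.returncode}


# Stand-in worker (hypothetical): exits nonzero for the task "bad_task".
worker = [
    sys.executable,
    "-c",
    "import sys; sys.exit(3 if sys.argv[1] == 'bad_task' else 0)",
]

results = [run_isolated(task, worker, 30.0) for task in ["open_drawer", "bad_task"]]
for result in results:
    print(result)
```

The trade-off is the one the caveat describes: each task pays the full process and simulator startup cost, which is what made the isolated full sweep too slow to serve as a smoke test.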

	## Final Decision

	- Phase 1: not accepted
	- Phase 2: not accepted
	- Phase 3: accepted
	- Phase 4 integration: accepted

	Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner and world-model claims are not yet validated on the dummy proxy benchmark, so they cannot support a stronger paper claim without more work.