# Phase Tracking
Date closed: `2026-03-25 UTC`
- Snapshot note: this Hugging Face snapshot does not contain `.git`, so a git commit hash is unavailable.
- Regression baselines: `/workspace/VLAarchtests/regression/baselines.md`
- Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.
## Phase 0
- Status: completed.
- Historical baseline artifacts are locked in `/workspace/VLAarchtests/regression/baselines.md`.
- Historical dummy benchmark reference:
  - interaction `0.5278`
  - backbone `0.5556`
  - reveal `0.5417`
- Historical CLIP benchmark reference:
  - interaction `0.3056`
  - backbone `0.3333`
  - reveal `0.2083`
## Phase 1
- Config: `proxy_interaction_r3d_stage1_dummy.yaml`
- Seeds: `13, 14, 15`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15`
- Mean train time: `20.45 s`
- Mean peak GPU memory: `629.62 MB`
- 3-seed benchmark means:
  - mean success: `0.5787`
  - foliage success: `0.4444`
  - bag success: `0.6111`
  - cloth success: `0.6806`
  - reocclusion rate: `0.0000`
  - persistence horizon MAE: `1.9553`
  - disturbance cost: `0.3649`
  - planner top-1: `0.2832`
  - planner regret: `0.0143`
  - planner score/utility spearman: `0.2504`
  - role collapse: `0.0000`
  - proposal diversity: `0.0245`
  - swap equivariance error: `0.00768`
- Ablations:
  - `no_planner`: `0.5648` mean success, drop `0.0139`
  - `no_role_symmetry`: `0.5833` mean success, delta `+0.0046`
- Gate decisions:
  - hard success gate `>= 0.58`: fail by `0.0013`
  - planner must matter: fail, `no_planner` drop is only `0.0139`
  - planner top-1 `>= 0.30`: fail, measured `0.2832`
  - role symmetry must matter: fail, `no_role_symmetry` is slightly better than full
  - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
- Takeaway: the structure refactor improved over the historical interaction baseline (`0.5787` vs `0.5278`) and exceeded the historical dummy backbone baseline (`0.5556`), but it did not clear the phase-1 acceptance gates.
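
The Phase 1 gate arithmetic can be re-derived from the reported means. The following is a minimal sketch, not the repo's acceptance code: the helper names `margin` and `ablation_drop` are hypothetical, and the inputs are the rounded three-seed means listed in this section.

```python
# Reported Phase 1 three-seed means, copied from this section.
full = {"mean_success": 0.5787, "planner_top1": 0.2832}
ablations = {"no_planner": 0.5648, "no_role_symmetry": 0.5833}

# Thresholds stated in the Phase 1 gate decisions.
SUCCESS_GATE = 0.58
PLANNER_TOP1_GATE = 0.30


def margin(value, threshold):
    """Signed distance to a `>= threshold` gate; negative means the gate fails."""
    return round(value - threshold, 4)


def ablation_drop(full_success, ablated_success):
    """Positive when removing the component lowers mean success."""
    return round(full_success - ablated_success, 4)


success_margin = margin(full["mean_success"], SUCCESS_GATE)
top1_margin = margin(full["planner_top1"], PLANNER_TOP1_GATE)
drops = {name: ablation_drop(full["mean_success"], score)
         for name, score in ablations.items()}

print(success_margin)  # -0.0013
print(top1_margin)     # -0.0168
print(drops)           # {'no_planner': 0.0139, 'no_role_symmetry': -0.0046}
```

A negative drop for `no_role_symmetry` is what the "role symmetry must matter" gate reports as a failure: the ablated model scored slightly higher than the full one.
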
## Phase 2
- Config: `proxy_interaction_r3d_stage2_dummy.yaml`
- Seeds: `21, 22, 23`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23`
- Mean train time: `20.76 s`
- Mean peak GPU memory: `639.39 MB`
- 3-seed benchmark means:
  - mean success: `0.5463`
  - foliage success: `0.4444`
  - bag success: `0.5417`
  - cloth success: `0.6528`
  - reocclusion rate: `0.0121`
  - persistence horizon MAE: `2.2358`
  - disturbance cost: `0.3148`
  - planner top-1: `0.3442`
  - planner regret: `0.0208`
  - planner score/utility spearman: `0.2397`
  - belief calibration brier: `0.00842`
  - reocclusion calibration brier: `0.2745`
  - swap equivariance error: `0.00504`
- Ablations:
  - `no_world_model`: `0.5463` mean success, drop `0.0000`
  - `short_history`: `0.5463` mean success, delta `0.0000`
- Gate decisions:
  - hard success gate `>= 0.60`: fail, measured `0.5463`
  - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
  - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
  - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
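
The phase-over-phase comparison behind the "state metrics should improve over phase 1" gate can be sketched as follows. This is an illustration under assumptions, not the repo's evaluation code: the `improved` helper and the `LOWER_IS_BETTER` set are hypothetical, and only metrics reported for both phases are included (no Phase 1 calibration figures are listed, so calibration is left out).

```python
# Three-seed means reported for Phase 1 and Phase 2.
phase1 = {"reocclusion_rate": 0.0000, "persistence_mae": 1.9553, "planner_top1": 0.2832}
phase2 = {"reocclusion_rate": 0.0121, "persistence_mae": 2.2358, "planner_top1": 0.3442}

# Assumed metric directions: rates and errors should go down, accuracy should go up.
LOWER_IS_BETTER = {"reocclusion_rate", "persistence_mae"}


def improved(metric, before, after):
    """True when `after` is strictly better than `before` for this metric."""
    if metric in LOWER_IS_BETTER:
        return after < before
    return after > before


report = {
    metric: (round(phase2[metric] - phase1[metric], 4),
             improved(metric, phase1[metric], phase2[metric]))
    for metric in phase1
}

for metric, (delta, better) in report.items():
    print(f"{metric}: delta={delta:+.4f} improved={better}")
# reocclusion_rate: delta=+0.0121 improved=False
# persistence_mae: delta=+0.2805 improved=False
# planner_top1: delta=+0.0610 improved=True
```

The output matches the takeaway: planner classification improved while both state metrics moved in the wrong direction.
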
## Phase 3
- RGB-only compatibility configs:
  - `proxy_interaction_r3d_stage1_clip.yaml`
  - `proxy_interaction_r3d_stage2_clip.yaml`
- RGB-D config: `proxy_interaction_r3d_stage3_clip_rgbd.yaml`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19`
- RGB-only CLIP means:
  - stage 1 clip mean success: `0.5324`
  - stage 2 clip mean success: `0.4954`
- Stage 3 RGB-D means:
  - mean train time: `145.93 s`
  - mean peak GPU memory: `1952.12 MB`
  - mean success: `0.5741`
  - foliage success: `0.4861`
  - bag success: `0.5417`
  - cloth success: `0.6944`
  - reocclusion rate: `0.0151`
  - persistence horizon MAE: `1.7883`
  - disturbance cost: `0.2258`
  - planner top-1: `0.3265`
  - planner regret: `0.0157`
  - proposal diversity: `0.0270`
  - swap equivariance error: `0.000094`
- `no_depth` ablation:
  - mean success: `0.5231`
  - absolute drop vs full: `0.0509`
  - bag success drops `0.5417 -> 0.4722`
  - foliage success drops `0.4861 -> 0.4167`
- Gate decisions:
  - CLIP hard success gate `>= 0.37`: pass
  - `no_depth` must hurt on at least one geometry-heavy proxy: pass
  - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
- Takeaway: Phase 3, the RGB-D path, is the first phase to clear all of its acceptance gates.
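
The "`no_depth` must hurt on at least one geometry-heavy proxy" gate can be checked per proxy from the two reported drops. This is a sketch under assumptions: `per_proxy_drops` is a hypothetical helper, the gate is assumed to pass on any positive drop (no minimum margin is stated here), and only the foliage and bag proxies are included because those are the per-proxy `no_depth` figures reported.

```python
# Stage 3 RGB-D per-proxy success, full model vs. `no_depth`, as reported in this section.
full = {"foliage": 0.4861, "bag": 0.5417}
no_depth = {"foliage": 0.4167, "bag": 0.4722}


def per_proxy_drops(full_scores, ablated_scores):
    """Absolute and relative success drop for each proxy present in both dicts."""
    out = {}
    for proxy in full_scores.keys() & ablated_scores.keys():
        absolute = full_scores[proxy] - ablated_scores[proxy]
        out[proxy] = {
            "absolute": round(absolute, 4),
            "relative": round(absolute / full_scores[proxy], 4),
        }
    return out


drops = per_proxy_drops(full, no_depth)

# Assumed gate rule: at least one proxy loses success when depth is removed.
gate_passes = any(d["absolute"] > 0 for d in drops.values())

print(drops["foliage"]["absolute"])  # 0.0694
print(drops["bag"]["absolute"])      # 0.0695
print(gate_passes)                   # True
```
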
## Phase 4
- Unit tests:
  - command: `PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests`
  - result: `10 passed`
- RLBench import/config smoke:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt`
  - status: pass
  - imports `rlbench`, `pyrep`, `yarr` all resolved
  - camera contract preserved: `front`, `wrist_left`, `wrist_right` at `224x224`
- RLBench launch smoke:
  - artifact stdout: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt`
  - artifact stderr: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr`
  - status: pass
  - `open_drawer` resolves to `RightOpenDrawer`
  - finite `18`-D action, camera shapes `[224, 224, 3]`, no crash
- RLBench open-drawer rollout:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json`
  - status: pass as integration
  - no import errors
  - no historical workspace path error string
  - rollout JSON written
  - mean success remains `0.0`, so this is plumbing evidence only
- PerAct2 13-task smoke:
  - artifact summary JSON: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json`
  - artifact summary markdown: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md`
  - status: pass as integration
  - all `13/13` tasks launched
  - finite action check: `13/13`
  - summary JSON written
- Integration caveat:
  - full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched `run_rlbench_rollout_eval` attempt hit a Qt/OpenGL segfault after repeated env recycle, and the subprocess-isolated full rollout sweep was too slow to be a reasonable smoke. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and written summary.
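
The subprocess isolation mentioned in the caveat can be illustrated with a minimal sketch. It is not the repo's sweep script: `run_isolated` and the `CHILD` snippet are hypothetical stand-ins, and the child here only echoes the task name instead of launching a simulator. The idea is that each task gets a fresh interpreter, so a crash or hang in one task cannot take down the rest of the sweep.

```python
import subprocess
import sys


def run_isolated(task, child_code, timeout_s=60.0):
    """Run one task in a fresh Python interpreter and report how it ended."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, task],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"task": task, "status": "timeout", "returncode": None, "stdout": ""}
    # On POSIX, a negative return code means the child was killed by a signal
    # (for example -11 for SIGSEGV), which is how a segfault would surface here.
    status = "ok" if proc.returncode == 0 else "crashed"
    return {
        "task": task,
        "status": status,
        "returncode": proc.returncode,
        "stdout": proc.stdout.strip(),
    }


# Hypothetical stand-in for a per-task entry point: it only echoes the task name.
CHILD = "import sys; print('launched ' + sys.argv[1])"

results = [run_isolated(task, CHILD) for task in ["open_drawer"]]
print(results[0]["status"], results[0]["stdout"])  # ok launched open_drawer
```

The cost of this design is one interpreter and simulator start-up per task, which is consistent with the caveat that the isolated full sweep was too slow to serve as a smoke test.
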
## Final Decision
- Phase 1: not accepted
- Phase 2: not accepted
- Phase 3: accepted
- Phase 4 integration: accepted
Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner/world-model claims are still not validated strongly enough on the dummy proxy benchmark to support a stronger paper claim without more work.