# Phase Tracking

Date closed: `2026-03-25 UTC`

- Snapshot note: this Hugging Face snapshot does not contain `.git`, so a git commit hash is unavailable.
- Regression baselines: `/workspace/VLAarchtests/regression/baselines.md`
- Acceptance rule: only proxy metrics are used for phase acceptance. RLBench and PerAct2 remain integration-only checks.

## Phase 0

- Status: completed.
- Historical baseline artifacts are locked in `/workspace/VLAarchtests/regression/baselines.md`.
- Historical dummy benchmark reference:
  - interaction `0.5278`
  - backbone `0.5556`
  - reveal `0.5417`
- Historical CLIP benchmark reference:
  - interaction `0.3056`
  - backbone `0.3333`
  - reveal `0.2083`

## Phase 1

- Config: `proxy_interaction_r3d_stage1_dummy.yaml`
- Seeds: `13, 14, 15`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed14`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed15`
- Mean train time: `20.45 s`
- Mean peak GPU memory: `629.62 MB`
- 3-seed benchmark means:
  - mean success: `0.5787`
  - foliage success: `0.4444`
  - bag success: `0.6111`
  - cloth success: `0.6806`
  - reocclusion rate: `0.0000`
  - persistence horizon MAE: `1.9553`
  - disturbance cost: `0.3649`
  - planner top-1: `0.2832`
  - planner regret: `0.0143`
  - planner score/utility spearman: `0.2504`
  - role collapse: `0.0000`
  - proposal diversity: `0.0245`
  - swap equivariance error: `0.00768`
- Ablations:
  - `no_planner`: `0.5648` mean success, drop `0.0139`
  - `no_role_symmetry`: `0.5833` mean success, delta `+0.0046`
- Gate decisions:
  - hard success gate `>= 0.58`: fail by `0.0013`
  - planner must matter: fail, `no_planner` drop is only `0.0139`
  - planner top-1 `>= 0.30`: fail, measured `0.2832`
  - role symmetry must matter: fail, `no_role_symmetry` is slightly better than full
  - proposal collapse must not happen: pass, diversity stayed nonzero across all seeds
- Takeaway: the structure refactor improved over the historical interaction baseline (`0.5787` vs `0.5278`) and exceeded the historical dummy backbone baseline (`0.5556`), but it did not clear the phase-1 acceptance gates.
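
The margins quoted in the gate decisions can be reproduced with a few comparisons. The sketch below is illustrative only: the `Gate` helper and the dictionary layout are assumptions rather than code from the repository, and only the two gates with explicit numeric thresholds are encoded.

```python
from dataclasses import dataclass

# Phase 1 three-seed means, copied from the report above.
PHASE1 = {"mean_success": 0.5787, "planner_top1": 0.2832}


@dataclass(frozen=True)
class Gate:
    """Hypothetical helper: a metric must reach at least `threshold`."""

    metric: str
    threshold: float

    def margin(self, metrics: dict) -> float:
        # Positive margin means the gate passes; negative is the shortfall.
        return round(metrics[self.metric] - self.threshold, 4)


GATES = [Gate("mean_success", 0.58), Gate("planner_top1", 0.30)]

results = {gate.metric: gate.margin(PHASE1) for gate in GATES}
for name, margin in results.items():
    status = "pass" if margin >= 0 else "fail"
    print(f"{name}: {status} (margin {margin:+.4f})")
```

Keeping the signed margin rather than a bare pass/fail flag makes near misses, such as the success gate here, easy to spot.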

## Phase 2

- Config: `proxy_interaction_r3d_stage2_dummy.yaml`
- Seeds: `21, 22, 23`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23`
- Mean train time: `20.76 s`
- Mean peak GPU memory: `639.39 MB`
- 3-seed benchmark means:
  - mean success: `0.5463`
  - foliage success: `0.4444`
  - bag success: `0.5417`
  - cloth success: `0.6528`
  - reocclusion rate: `0.0121`
  - persistence horizon MAE: `2.2358`
  - disturbance cost: `0.3148`
  - planner top-1: `0.3442`
  - planner regret: `0.0208`
  - planner score/utility spearman: `0.2397`
  - belief calibration brier: `0.00842`
  - reocclusion calibration brier: `0.2745`
  - swap equivariance error: `0.00504`
- Ablations:
  - `no_world_model`: `0.5463` mean success, drop `0.0000`
  - `short_history`: `0.5463` mean success, delta `0.0000`
- Gate decisions:
  - hard success gate `>= 0.60`: fail, measured `0.5463`
  - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
  - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
  - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner top-1 improved (`0.2832 -> 0.3442`), but mean success was unchanged by the `no_world_model` ablation, including in the post-fix null-rollout rerun.
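
The state-metric comparison against Phase 1 amounts to checking whether lower-is-better metrics increased. A minimal sketch, assuming the two metrics that are reported in both phases; the `regressions` helper and the dictionary keys are hypothetical names, not part of the repository:

```python
# Lower is better for both metrics; values copied from the Phase 1 and Phase 2 lists.
PHASE1_STATE = {"reocclusion_rate": 0.0000, "persistence_mae": 1.9553}
PHASE2_STATE = {"reocclusion_rate": 0.0121, "persistence_mae": 2.2358}


def regressions(before: dict, after: dict) -> dict:
    """Return the metrics that got worse (increased) and by how much."""
    return {
        name: round(after[name] - before[name], 4)
        for name in before
        if after[name] > before[name]
    }


worse = regressions(PHASE1_STATE, PHASE2_STATE)
for name, increase in worse.items():
    print(f"{name}: worse by {increase:.4f}")
```

An empty result would mean the gate passes; here both metrics regress.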

## Phase 3

- RGB-only compatibility configs:
  - `proxy_interaction_r3d_stage1_clip.yaml`
  - `proxy_interaction_r3d_stage2_clip.yaml`
- RGB-D config: `proxy_interaction_r3d_stage3_clip_rgbd.yaml`
- Artifact roots:
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed7`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed8`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage1_clip_seed9`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed11`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed12`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_clip_seed13`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed18`
  - `/workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed19`
- RGB-only CLIP means:
  - stage 1 clip mean success: `0.5324`
  - stage 2 clip mean success: `0.4954`
- Stage 3 RGB-D means:
  - mean train time: `145.93 s`
  - mean peak GPU memory: `1952.12 MB`
  - mean success: `0.5741`
  - foliage success: `0.4861`
  - bag success: `0.5417`
  - cloth success: `0.6944`
  - reocclusion rate: `0.0151`
  - persistence horizon MAE: `1.7883`
  - disturbance cost: `0.2258`
  - planner top-1: `0.3265`
  - planner regret: `0.0157`
  - proposal diversity: `0.0270`
  - swap equivariance error: `0.000094`
- `no_depth` ablation:
  - mean success: `0.5231`
  - absolute drop vs full: `0.0509`
  - bag success drops `0.5417 -> 0.4722`
  - foliage success drops `0.4861 -> 0.4167`
- Gate decisions:
  - CLIP hard success gate `>= 0.37`: pass
  - `no_depth` must hurt on at least one geometry-heavy proxy: pass
  - no RGB-only regression: pass, both RGB-only CLIP configs still run and produce sane metrics
- Takeaway: Phase 3, the RGB-D path, is the first phase to clear all of its acceptance gates.
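
The `no_depth` gate can be checked per proxy from the figures above. This is a sketch under two assumptions: foliage and bag are treated as the geometry-heavy proxies because those are the two drops the report lists, and any positive drop counts as "hurt". The drops are computed from the four-decimal values shown here, so they can differ in the last digit from numbers derived from unrounded results.

```python
# Per-proxy success with and without depth, copied from the report.
FULL = {"foliage": 0.4861, "bag": 0.5417}
NO_DEPTH = {"foliage": 0.4167, "bag": 0.4722}

# Positive drop means removing depth hurt that proxy.
drops = {task: round(FULL[task] - NO_DEPTH[task], 4) for task in FULL}

# Gate: removing depth must hurt at least one geometry-heavy proxy.
gate_passes = any(drop > 0 for drop in drops.values())

for task, drop in drops.items():
    print(f"{task}: drop {drop:.4f}")
print("no_depth gate:", "pass" if gate_passes else "fail")
```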

## Phase 4

- Unit tests:
  - command: `PYTHONPATH=/workspace/venv_r3d/lib/python3.11/site-packages:/workspace/VLAarchtests/code/reveal_vla_bimanual:/usr/local/lib/python3.11/dist-packages python -m pytest -q /workspace/VLAarchtests/tests`
  - result: `10 passed`
- RLBench import/config smoke:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/smoke_test_output.txt`
  - status: pass
  - imports `rlbench`, `pyrep`, `yarr` all resolved
  - camera contract preserved: `front`, `wrist_left`, `wrist_right` at `224x224`
- RLBench launch smoke:
  - artifact stdout: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.txt`
  - artifact stderr: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_smokes/launch_smoke_open_drawer.stderr`
  - status: pass
  - `open_drawer` resolves to `RightOpenDrawer`
  - finite `18`-D action, camera shapes `[224, 224, 3]`, no crash
- RLBench open-drawer rollout:
  - artifact: `/workspace/VLAarchtests/artifacts/outputs/r3d/rlbench_open_drawer_r3d_rollout/rollout_eval.json`
  - status: pass as integration
  - no import errors
  - no historical workspace path error string
  - rollout JSON written
  - mean success remains `0.0`, so this is plumbing evidence only
- PerAct2 13-task smoke:
  - artifact summary JSON: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.json`
  - artifact summary markdown: `/workspace/VLAarchtests/artifacts/outputs/r3d/peract2_13_launch_smoke/launch_smoke_summary.md`
  - status: pass as integration
  - all `13/13` tasks launched
  - finite action check: `13/13`
  - summary JSON written
- Integration caveat:
  - A full multi-task rollout in a single process is not reliable with this CoppeliaSim build. A direct batched `run_rlbench_rollout_eval` attempt hit a Qt/OpenGL segfault after repeatedly recycling the environment, and the subprocess-isolated full rollout sweep was too slow to serve as a smoke test. The accepted PerAct2 artifact is therefore the launch/noop smoke, which matches the stated gate: launch stability, finite actions, and a written summary.
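
For readers unfamiliar with the subprocess-isolation pattern mentioned in the caveat, the general idea is sketched below. This is not the repository's launcher: the child program is a stand-in that emits a zero `18`-D action instead of starting the simulator, and `launch_isolated` and the task names are hypothetical.

```python
import json
import math
import subprocess
import sys

# Stand-in child program; a real sweep would start the simulator here instead.
CHILD = (
    "import json, sys; "
    "print(json.dumps({'task': sys.argv[1], 'action': [0.0] * 18}))"
)


def launch_isolated(task: str, timeout_s: float = 60.0) -> dict:
    """Run one task in a fresh interpreter so a crash cannot kill the sweep."""
    failed = {"task": task, "launched": False, "finite_action": False}
    try:
        proc = subprocess.run(
            [sys.executable, "-c", CHILD, task],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return failed
    if proc.returncode != 0:  # also covers negative codes from fatal signals
        return failed
    action = json.loads(proc.stdout)["action"]
    return {
        "task": task,
        "launched": True,
        "finite_action": all(math.isfinite(a) for a in action),
    }


# Hypothetical task names, used only to exercise the sketch.
summary = [launch_isolated(task) for task in ("task_a", "task_b")]
print(json.dumps(summary, indent=2))
```

Each task pays the full interpreter and simulator start-up cost, which is consistent with the report's observation that an isolated full sweep was too slow for a smoke test.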

## Final Decision

- Phase 1: not accepted
- Phase 2: not accepted
- Phase 3: accepted
- Phase 4 integration: accepted

Overall status: the repo-preserving R3D-VLA refactor is implemented, verified, and benchmarked. The strongest positive result is the RGB-D CLIP phase. The structural planner and world-model claims are not yet validated on the dummy proxy benchmark, so they cannot support a stronger paper claim without more work.