# VLAarchtests3

`VLAarchtests3` is the organized export of the elastic-occlusion bimanual VLA handoff completed on a 1x L40S RunPod machine.

It is a successor snapshot to the earlier `VLAarchtests` and `VLAarchtests2` work:

- `VLAarchtests`: earlier architecture-search and benchmark-debugging work.
- `VLAarchtests2`: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation.
- `VLAarchtests3`: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here.

## What Was Done

The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner `trunk + structured adapter + no-op fallback` stack.

The final exported code contains:

- a clean wrapped-policy interface with `trunk_only`, `adapter_noop`, and `adapter_active` modes,
- a structured elastic-occlusion adapter with:
  - reveal-state prediction,
  - task-routed reveal/retrieve proposal families,
  - retrieve-feasibility gating,
  - a lightweight reveal-state transition model,
- explicit tests that protect:
  - no-op equivalence,
  - generic-task fallback,
  - benchmark protocol identity,
  - unsafe retrieve blocking,
  - cloth-specific selection behavior.
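
The three-mode wrapped-policy interface can be sketched as follows. This is a minimal illustration, not the repo's actual API; the class and method names (`WrappedPolicy`, `act`, `propose`) are assumptions.

```python
from enum import Enum

class PolicyMode(Enum):
    TRUNK_ONLY = "trunk_only"
    ADAPTER_NOOP = "adapter_noop"
    ADAPTER_ACTIVE = "adapter_active"

class WrappedPolicy:
    """Hypothetical sketch: the trunk is always queried; the adapter only
    modifies the action in ADAPTER_ACTIVE mode."""

    def __init__(self, trunk, adapter, mode=PolicyMode.TRUNK_ONLY):
        self.trunk = trunk
        self.adapter = adapter
        self.mode = mode

    def act(self, obs):
        base_action = self.trunk(obs)
        if self.mode is PolicyMode.ADAPTER_ACTIVE:
            proposal = self.adapter.propose(obs, base_action)
            if proposal is not None:
                return proposal
        # trunk_only and adapter_noop must return the identical base
        # action: this is what the no-op equivalence test protects.
        return base_action
```

The key invariant is that `adapter_noop` is byte-identical to `trunk_only`, so switching the adapter off can never change trunk behavior.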

The most important debugging pass was in the planner/gating logic: the original active path could reveal forever or retrieve too early. The final planner fixes make it:

- summarize scene readiness at the scene level rather than worst-candidate level,
- hard-mask unsafe retrieve candidates,
- switch from reveal to retrieve once feasibility is met,
- use task-specific bag and cloth readiness criteria,
- prefer reveal macros early and retrieve later.
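
The gating order above can be sketched as a single selection function. This is an illustrative reconstruction under assumed data shapes (candidates as dicts with `kind`, `unsafe`, and `score`), not the repo's implementation.

```python
def select_macro(candidates, scene_ready, retrieve_feasible):
    """Hypothetical sketch of the final gating order."""
    # 1. hard-mask unsafe retrieve candidates
    safe = [c for c in candidates
            if not (c["kind"] == "retrieve" and c["unsafe"])]
    retrieves = [c for c in safe if c["kind"] == "retrieve"]
    reveals = [c for c in safe if c["kind"] == "reveal"]
    # 2. switch from reveal to retrieve once scene-level feasibility is met
    if scene_ready and retrieve_feasible and retrieves:
        return max(retrieves, key=lambda c: c["score"])
    # 3. otherwise keep revealing (reveal macros are preferred early)
    if reveals:
        return max(reveals, key=lambda c: c["score"])
    return None  # no safe candidate left: fall back to the base action
```

Note that readiness (`scene_ready`) is a scene-level summary here, not the worst-candidate summary that the original buggy path used.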

## What Was Actually Evaluated

Two different kinds of evidence are included.

### 1. Trusted General-Task Anchor

This was kept narrow on purpose because only `dual_push_buttons` was trusted on this setup.

Trusted anchor evidence:

- official AnyBimanual local anchor summary on `dual_push_buttons`:
  - `25` episodes
  - success `0.96`
- live rerun on this RunPod:
  - `5` episodes
  - scores `[0, 100, 100, 0, 0]`
  - mean score `40.0`

Interpretation:

- the official trunk path is real and non-trivial on the one stable anchor task,
- this does **not** mean the local custom CLIP trunk was broadly competitive,
- this does **not** validate the other unstable RLBench target-like tasks.

### 2. Reveal/Retrieve Proxy Benchmark

This benchmark is useful for mechanism debugging, but it is **not** a real robot/physics benchmark.

The final reported held-out smoke benchmark used:

- `12` foliage episodes,
- `12` bag episodes,
- `12` cloth episodes,
- `36` total episodes,
- separate held-out procedural seeds from the adapter train/val splits.

Results:

- non-intervention / matched no-op:
  - mean success `0.000`
  - foliage `0.000`
  - bag `0.000`
  - cloth `0.000`
  - visibility integral `2.275`
  - corridor availability `0.0312`
  - disturbance cost `0.7433`

- intervention / adapter active:
  - mean success `0.6667`
  - foliage `0.6667`
  - bag `0.7500`
  - cloth `0.5833`
  - visibility integral `19.9503`
  - corridor availability `0.7974`
  - disturbance cost `0.2835`
  - reocclusion rate `0.00278`
  - planner regret `0.1586`
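
As a consistency check (an inference from the rounded rates, not a recorded artifact), the per-family means correspond to integer success counts over 12 episodes each, and those counts reproduce the overall mean:

```python
# Successes per family implied by the reported rates over 12 episodes each
# (0.6667 ~ 8/12, 0.7500 = 9/12, 0.5833 ~ 7/12).
per_family = {"foliage": 8, "bag": 9, "cloth": 7}
total = sum(per_family.values())   # 24 of 36 episodes
mean_success = total / 36
assert abs(mean_success - 0.6667) < 5e-4
```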

The active policy really did intervene on these tasks; it did not silently fall back to the trunk:

- all recorded selections on the final held-out smoke run were non-base candidates,
- typical successful pattern:
  - foliage: reveal (`pin_canopy`) then `retrieve`,
  - bag: reveal (`widen_mouth`) then `retrieve`,
  - cloth: reveal (`separate_layer`) then `retrieve`.

## Important Limitation

The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator.

It has:

- synthetic RGB-D renders,
- internal latent state,
- hand-coded transition rules,
- scripted teacher/oracle supervision.

It does **not** have:

- rigid-body or deformable physics,
- actual robot kinematics,
- true contact/grasp simulation,
- a fair end-to-end manipulation distribution for a pretrained trunk.

Therefore:

- the proxy result is useful to validate adapter logic,
- the proxy result is **not** sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark.

## What Was Learned

The work supports the following conclusions:

- the structured adapter idea is still alive,
- the explicit reveal-state variables are worth keeping,
- task-routed reveal macros matter,
- retrieve-feasibility gating matters,
- the no-op fallback path for general tasks is sound,
- the old heavy memory/world-model story is not where the strongest evidence lives.

The work does **not** yet justify:

- a claim of broad general-task superiority,
- a claim that the current proxy benchmark is a fair end-to-end benchmark,
- a claim that the architecture is validated on realistic target-like sim tasks.

## Was The Adapter Trained?

Yes.

The final proxy adapter checkpoint was trained with:

- frozen trunk,
- adapter-only updates,
- trained components:
  - reveal/state head,
  - proposal prior,
  - transition model,
  - planner/reranker.

Proxy training data:

- train: `128` episodes per proxy family,
- val: `32` episodes per proxy family,
- proxy families:
  - foliage,
  - bag,
  - cloth.

The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds.
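
The split discipline can be illustrated with a seed layout like the one below. The per-family base offsets and ranges are hypothetical; only the episode counts (128 train, 32 val, 12 held-out per family) come from this report.

```python
# Hypothetical seed layout: held-out smoke seeds are disjoint from
# train/val by construction, per proxy family.
def episode_seeds(family):
    base = {"foliage": 0, "bag": 10_000, "cloth": 20_000}[family]
    return {
        "train": range(base, base + 128),              # 128 train episodes
        "val": range(base + 128, base + 160),          # 32 val episodes
        "heldout": range(base + 1_000, base + 1_012),  # 12 smoke episodes
    }

for fam in ("foliage", "bag", "cloth"):
    splits = episode_seeds(fam)
    assert set(splits["heldout"]).isdisjoint(splits["train"])
    assert set(splits["heldout"]).isdisjoint(splits["val"])
```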

## Was This A Perfect Fairness Story?

No.

What is fair in the current export:

- matched active vs no-op comparisons on the same wrapped checkpoint,
- held-out procedural seeds for the final proxy benchmark,
- exact no-op and generic-task fallback tests.

What is still missing for a stronger paper-quality comparison:

1. same-initialization `trunk_only` fine-tuned on the same proxy data,
2. same-initialization `trunk + adapter` fine-tuned on the same proxy data,
3. comparison on held-out proxy seeds,
4. comparison on stable real-sim tasks.

## What Is Left To Do

The main remaining work is on real sim benchmarks, not more abstract proxy optimization.

Priority list:

1. Train a fair control:
   - same initialization,
   - `trunk_only` fine-tuned on the same reveal/retrieve proxy data,
   - compare against `trunk + adapter`.

2. Attach the adapter directly to a strong public trunk:
   - official AnyBimanual,
   - official PerAct2 / RVT,
   - or 3D FlowMatch Actor if practical.

3. Validate on stable real-sim tasks:
   - do not trust unstable RLBench tasks with infeasible waypoints,
   - rebuild a trustworthy target-like evaluation subset,
   - keep `dual_push_buttons` as a regression anchor only.

4. Add a deformable / garment benchmark:
   - this is the most relevant public step toward the future suitcase/clothes benchmark.

5. Only after that:
   - revisit larger RLBench sweeps,
   - or collect custom teleop data.

## Repository Layout

- `code/`
  - cleaned code snapshot used for the handoff
- `artifacts/outputs/`
  - current adapter checkpoints and training outputs
- `artifacts/reports/`
  - evaluation and debugging reports
- `artifacts/data/reveal_proxy/`
  - proxy train/val datasets used by this stage
- `legacy/`
  - exact older checkpoints and summaries that the current work depends on
- `docs/`
  - audit, iteration, and completion reports from this handoff
- `setup/`
  - same-machine environment notes and helper scripts

## Recommended Use Of This Repo

Use this repo as:

- the archival handoff state,
- the codebase to continue adapter work from,
- the source of the current checkpoints and benchmark reports,
- the baseline package before moving to real sim validation.

Do **not** use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next.