---
tags:
  - robotics
  - vision-language-action
  - bimanual-manipulation
  - maniskill
  - rlbench
  - rgbd
---

# VLAarchtests4

`VLAarchtests4` is the fresh organization repo for the RunPod work staged from `/workspace` on `2026-04-01 UTC`.

It carries forward the earlier repo lineage and adds the current public-sim benchmark package work:

- `VLAarchtests`
  - early proxy + RLBench architecture search, handoff checkpoints, and environment recreation files from the `2026-03-25/26` sessions
- `VLAarchtests2`
  - larger exploratory organization repo with more baselines, overlap/anchor work, frequent model changes, mixed artifacts, and several results that required later reinterpretation
- `VLAarchtests3`
  - cleaned export focused on the elastic-occlusion `trunk + structured adapter + no-op fallback` refactor, validated tests, current checkpoints, and handoff docs
- `VLAarchtests4`
  - keeps the `VLAarchtests3` export intact and adds the full current workspace `reports/`, `outputs/`, and `data/` trees, including all public benchmark smoke runs, checkpoint directories, dataset bundles, validation sweeps, and environment snapshots from the public-sim evaluation pass

## What This Repo Adds

The main new addition in this repo is the public benchmark track work for the elastic-occlusion adapter:

- real public-sim smoke runs on:
  - ManiSkill `PickClutterYCB-v1` as the dense occluded retrieval proxy
  - ManiSkill bridge basket retrieval proxy as the bag retrieval proxy
  - ManiSkill bridge cloth retrieval proxy as the folded-cloth retrieval proxy
- the public benchmark package code and summaries
- the train/eval logs, checkpoints, cached datasets, validation sweeps, and correction logs for those runs
- full visual rerenders of the final `smoke_v5_eval_tuned_softerpref` dense-occlusion benchmark for both `trunk_only_ft` and `adapter_active_ft`
- the same-machine environment snapshot for the public benchmark stack used on this RunPod

## Top-Level Contents

- `code/`
  - the cleaned code snapshot inherited from `VLAarchtests3`
- `artifacts/`
  - prior staged checkpoints, proxy data, reports, and generated configs already bundled by `VLAarchtests3`
- `docs/`
  - prior handoff/audit docs plus the current public benchmark run logs and correction notes
- `legacy/`
  - older exact artifacts preserved by `VLAarchtests3`
- `setup/`
  - prior environment files plus a new public benchmark environment snapshot under `setup/public_benchmark/`
- `history/`
  - copied README history for `VLAarchtests`, `VLAarchtests2`, and `VLAarchtests3`
- `reports/`
  - the full current `/workspace/workspace/reports` tree from this machine
- `outputs/`
  - the full current `/workspace/workspace/outputs` tree from this machine
- `data/`
  - the full current `/workspace/workspace/data` tree from this machine
- `PUBLIC_BENCHMARK_RESULTS.md`
  - compact index of all public benchmark train/eval results from this session
- `MODEL_AND_ARTIFACT_INDEX.md`
  - practical map of the main artifact roots to start from

## Benchmark GIF Renders

The repo now also includes a full rendered replay of the final dense-occlusion benchmark:

- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
  - `50` held-out `trunk_only_ft` gifs
  - `50` held-out `adapter_active_ft` gifs
  - `index.html`, `INDEX.md`, and `manifest.json` for browsing and validation
- renderer:
  - `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/render_maniskill_pickclutter_benchmark_gifs.py`

Important caveats:

- these gifs are rerendered from the saved `smoke_v5_eval_tuned_softerpref` checkpoints and exact held-out seeds, not a different benchmark run
- the rerender kept the same `softer_pref` planner override used in the reported held-out result
- the rerender manifest records `0` success mismatches versus the saved benchmark json files
- only the dense-occlusion track has this full gif export right now
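
The "0 success mismatches" check amounts to comparing per-episode success flags from the rerender against the saved benchmark results. The sketch below shows that comparison on plain dictionaries. The function name and the `episode_id -> success` layout are assumptions for illustration; the actual `manifest.json` and benchmark json schemas are not described here and may differ.

```python
def success_mismatches(rerender, saved):
    """Return sorted episode ids whose success flag differs or is missing.

    Both arguments are hypothetical dicts mapping episode id -> bool.
    """
    all_ids = set(rerender) | set(saved)
    return sorted(
        ep for ep in all_ids
        if rerender.get(ep) is not saved.get(ep)
    )


# Toy data only, not real benchmark outcomes.
toy_saved = {172000: True, 172001: False, 172002: True}
toy_rerender_ok = {172000: True, 172001: False, 172002: True}
toy_rerender_bad = {172000: True, 172001: True}

clean = success_mismatches(toy_rerender_ok, toy_saved)
dirty = success_mismatches(toy_rerender_bad, toy_saved)
```

An empty list corresponds to the `0` mismatch count recorded in the manifest. A missing episode on either side is treated as a mismatch in this sketch.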

## Architecture State Carried Forward

The core model family inherited from `VLAarchtests3` is still:

- `trunk_only`
- `adapter_noop`
- `adapter_active`

The important architectural state carried into the public benchmark work is:

- wrapped-policy interface with exact `trunk_only`, `adapter_noop`, and `adapter_active` modes
- structured reveal/retrieve adapter with:
  - state prediction
  - task-routed proposal families
  - retrieve-feasibility gating
  - lightweight transition model
  - planner/reranker
- planner fixes that replaced hard vetoes with softer stage penalties in:
  - `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py`
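
To make the difference between a hard veto and a softer stage penalty concrete, here is a minimal sketch. It is not the code in `planner.py`; the `Candidate` class, the `premature` flag, the scores, and both scoring functions are hypothetical and only illustrate the general idea.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    base_score: float
    premature: bool  # hypothetical flag: stage preconditions not yet met


def score_hard_veto(c):
    # A hard veto removes the candidate from consideration entirely.
    return -math.inf if c.premature else c.base_score


def score_soft_penalty(c, penalty=0.5):
    # A soft penalty lowers the score but lets a strong candidate still win.
    return c.base_score - (penalty if c.premature else 0.0)


def select(candidates, scorer):
    return max(candidates, key=scorer).name


# Illustrative values only.
candidates = [
    Candidate("base", 0.20, premature=False),
    Candidate("retrieve", 0.90, premature=True),
]

hard_choice = select(candidates, score_hard_veto)
soft_choice = select(candidates, score_soft_penalty)
```

With these toy numbers the hard veto always falls back to `base`, while the soft penalty still allows `retrieve` to be selected because its penalized score (0.40) exceeds the base score (0.20).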

## Public Benchmark Summary

Detailed per-run results are in `PUBLIC_BENCHMARK_RESULTS.md`. The short version is:

### 1. Dense occluded retrieval proxy

Benchmark:

- ManiSkill `PickClutterYCB-v1`

Best current held-out result:

- directory:
  - `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/`
- summary:
  - `trunk_only_ft = 0.04`
  - `adapter_noop = 0.04`
  - `adapter_active_ft = 0.62`
  - `delta_active_vs_trunk = +0.58`
  - `95% CI = [0.44, 0.72]`
  - `intervention_rate = 1.0`
  - `non_base_selection_rate = 1.0`
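
As a rough sanity check on the scale of that interval, the sketch below computes a normal-approximation (Wald) interval for the difference of two success rates. It assumes the interval refers to `delta_active_vs_trunk`, that each mode was evaluated on the 50 held-out episodes, and that the two samples can be treated as independent. The method actually used to produce the reported interval is not stated here and may differ (for example a paired bootstrap).

```python
import math


def wald_diff_ci(p_a, p_b, n_a, n_b, z=1.96):
    """Normal-approximation CI for p_a - p_b with independent samples."""
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


# Success rates from the summary above; n = 50 held-out episodes per mode.
delta, lo, hi = wald_diff_ci(0.62, 0.04, 50, 50)
print(f"delta={delta:+.2f}, approx 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the bounds land within about 0.01 of the reported `[0.44, 0.72]`, which suggests the reported interval is consistent with a 50-episode evaluation.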

Important caveat:

- this was not a new retrain after `smoke_v5`
- it used the same `smoke_v5` checkpoints with planner hyperparameters selected on the frozen validation split and then applied once to the untouched held-out split

### 2. Bag retrieval proxy

Benchmark:

- public ManiSkill bridge basket retrieval proxy

Current fair read:

- seed `17` corrected held-out:
  - `trunk = 0.32`
  - `noop = 0.00`
  - `active = 0.48`
- seed `23` corrected held-out:
  - `trunk = 0.48`
  - `noop = 0.08`
  - `active = 0.48`
- corrected 2-seed aggregate:
  - `trunk = 0.40`
  - `noop = 0.04`
  - `active = 0.48`
  - `delta = +0.08`

Interpretation:

- bag remains modestly positive after using one consistent corrected planner across seeds
- the effect is smaller and less clean than the best occlusion result

### 3. Cloth retrieval proxy

Benchmark:

- public ManiSkill bridge cloth retrieval proxy

Current read:

- seed `17`:
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.10`
- seed `23`:
  - `trunk = 0.04`
  - `noop = 0.02`
  - `active = 0.02`
- seed `29`:
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
- 3-seed aggregate:
  - `trunk = 0.0400`
  - `noop = 0.0333`
  - `active = 0.0533`
  - `delta = +0.0133`

Interpretation:

- cloth is weak and unstable
- current evidence does not support a strong cloth-specific win
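
The bag and cloth aggregates reported above are plain unweighted means over seeds, with `delta` taken as active minus trunk. The short sketch below reproduces that arithmetic from the per-seed numbers; the dictionary layout and the `aggregate` helper are illustrative, not part of the repo.

```python
from statistics import mean

# Per-seed held-out success rates copied from the summaries above.
bag = {
    17: {"trunk": 0.32, "noop": 0.00, "active": 0.48},
    23: {"trunk": 0.48, "noop": 0.08, "active": 0.48},
}
cloth = {
    17: {"trunk": 0.04, "noop": 0.04, "active": 0.10},
    23: {"trunk": 0.04, "noop": 0.02, "active": 0.02},
    29: {"trunk": 0.04, "noop": 0.04, "active": 0.04},
}


def aggregate(per_seed):
    """Unweighted mean per mode across seeds, plus active-minus-trunk delta."""
    modes = ("trunk", "noop", "active")
    agg = {m: mean(run[m] for run in per_seed.values()) for m in modes}
    agg["delta"] = agg["active"] - agg["trunk"]
    return agg


bag_agg = aggregate(bag)
cloth_agg = aggregate(cloth)
```

Note that the cloth delta of `+0.0133` comes entirely from seed `17`; seeds `23` and `29` contribute a negative and a zero difference respectively.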

## Important Fairness Notes

The fairness story is mixed and should be stated plainly.

What is fair in the strongest public benchmark result:

- same initialization checkpoint for `trunk_only_ft` and `adapter_active_ft`
- same train/val/test split within each task
- same optimizer, LR, batch size, and unfreeze scope within each task
- `adapter_noop` is evaluated from the same adapter checkpoint as `adapter_active_ft`
- the held-out test episodes were not hand-picked after seeing outcomes

What is not fully paper-clean yet:

- most current public benchmark evidence is smoke-scale and low-seed
- the occlusion headline result depends on validation-selected planner tuning on top of a fixed checkpoint
- bag required an eval-side planner correction for one seed to avoid a collapse
- cloth remains weak even after additional seeds and val sweeps

### PickClutter Split Fairness

The important point for the dense-occlusion track is that the dataset split did not drift across the early smoke versions.

- `data/maniskill_pickclutter/smoke_v1/episode_splits.json`
- `data/maniskill_pickclutter/smoke_v2/episode_splits.json`
- `data/maniskill_pickclutter/smoke_v3/episode_splits.json`

These files contain the same episode ids:

- train: `170000..170031`
- val: `171000..171007`
- eval: `172000..172049`
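
A check along the following lines can confirm that a loaded split file matches those ranges and that the splits do not overlap. This is a sketch: it assumes each `episode_splits.json` is a JSON object with `train`, `val`, and `eval` keys mapping to lists of integer episode ids, which may not match the real schema.

```python
# Expected ids, taken from the inclusive ranges listed above.
EXPECTED_SPLITS = {
    "train": list(range(170000, 170032)),  # 32 episodes
    "val": list(range(171000, 171008)),    # 8 episodes
    "eval": list(range(172000, 172050)),   # 50 episodes
}


def splits_match(splits, expected=EXPECTED_SPLITS):
    """True if every expected split has exactly the expected ids."""
    return all(
        sorted(splits.get(name, [])) == ids
        for name, ids in expected.items()
    )


def splits_disjoint(splits):
    """True if no episode id appears in more than one split."""
    seen = set()
    for ids in splits.values():
        current = set(ids)
        if seen & current:
            return False
        seen |= current
    return True


# In practice `splits` would come from json.load() on one of the files above.
example = {name: list(ids) for name, ids in EXPECTED_SPLITS.items()}
ok = splits_match(example) and splits_disjoint(example)
```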

Also:

- there is no `data/maniskill_pickclutter/smoke_v4/`
- there is no `data/maniskill_pickclutter/smoke_v5/`

`smoke_v4` and `smoke_v5` were code/report version labels, not new held-out episode bundles.

### What Changed Across PickClutter Versions

The big changes across `smoke_v2`, `smoke_v3`, `smoke_v4`, and `smoke_v5` were:

- more benchmark-derived state supervision
- transition-model training enablement
- planner bug fixes
- fairness fixes so the adapter checkpoint did not hide a stronger shared trunk
- then a frozen-validation planner sweep for the final held-out eval

The big occlusion win was not caused by changing the eval episodes.

### Dense-Occlusion Render Artifacts

The final dense-occlusion run also has a full visual export in:

- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`

Those gifs show the robot interacting with the 3D scene and overlay the adapter state per frame. For `adapter_active_ft`, the overlay includes:

- adapter on/off state
- whether a non-base proposal was selected
- candidate index
- planner name
- planner score/confidence
- state signals such as visibility, access, gap, and damage

## Crucial Caveats

### Occlusion result was planner-tuned

The large jump in:

- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/`

came from validation-selected planner tuning on top of the same `smoke_v5` checkpoint.

The selected override values were:

- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`

That was a validation-only selection step. It was not a fresh retrain.
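
For reference, the override set can be written as a single mapping and merged onto a planner config. The values below are the ones listed above; the constant name, the `apply_overrides` helper, and the placeholder base config are hypothetical and do not reflect how the repo actually stores or applies them.

```python
# Validation-selected planner overrides, as listed above.
SOFTER_PREF_OVERRIDES = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "premature_insert_penalty": 0.25,
    "premature_maintain_penalty": 1.0,
    "occlusion_maintain_gap_min_access": 0.30,
    "occlusion_maintain_gap_min_visibility": 0.20,
    "retrieve_stage_access_threshold": 0.18,
    "retrieve_stage_reveal_threshold": 0.18,
    "retrieve_stage_support_threshold": 0.18,
}


def apply_overrides(base_config, overrides):
    """Return a new config with overrides applied; reject unknown keys."""
    unknown = set(overrides) - set(base_config)
    if unknown:
        raise KeyError(f"unknown planner keys: {sorted(unknown)}")
    return {**base_config, **overrides}


# Placeholder base config: zeros are NOT the real planner defaults.
placeholder_base = dict.fromkeys(SOFTER_PREF_OVERRIDES, 0.0)
tuned = apply_overrides(placeholder_base, SOFTER_PREF_OVERRIDES)
```

Rejecting unknown keys is a design choice in this sketch: it guards against a typo silently leaving a planner parameter at its default during a sweep.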

### Bag and cloth did not use real depth

The bridge-task runner for the bag and cloth proxies used:

- one real RGB camera
- copied into all camera slots
- zero-filled depth channels

The runner labels this stack:

- `rgb_triplicate_zero_depth`

This is a real limitation and it should not be hidden.

It happened because the bridge proxy runner used a compatibility shim to satisfy the shared multi-camera tensor interface without plumbing real bridge-scene multiview depth through the stack.
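
The effect of such a shim can be sketched in a few lines of NumPy. The function name, the `(cameras, H, W, C)` layout, the image size, and the dtypes below are assumptions for illustration; the real runner's tensor layout may differ.

```python
import numpy as np


def rgb_triplicate_zero_depth(rgb, num_cameras=3):
    """Copy one RGB frame into every camera slot and zero-fill depth.

    rgb: array of shape (H, W, 3) from the single real camera.
    Returns (rgb_stack, depth_stack) with shapes
    (num_cameras, H, W, 3) and (num_cameras, H, W, 1).
    """
    rgb_stack = np.repeat(rgb[None, ...], num_cameras, axis=0)
    depth_stack = np.zeros(rgb_stack.shape[:3] + (1,), dtype=np.float32)
    return rgb_stack, depth_stack


# Toy frame with arbitrary content and size.
frame = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
rgb_stack, depth_stack = rgb_triplicate_zero_depth(frame)
```

Every camera slot carries identical pixels and the depth channel carries no information, so a model consuming this stack sees neither multiview parallax nor geometry.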

Consequences:

- bag and cloth are not modality-matched to the PickClutter runs
- PickClutter used real `rgbd_3cam`
- bag and cloth used weaker perception input

### Bag and cloth also used a different control wrapper

PickClutter:

- observation stack: `rgbd_3cam`
- action space: `bimanual_delta_pose`

Bag and cloth:

- observation stack: `rgb_triplicate_zero_depth`
- action space: `widowx_delta_pose`

So the cross-track story is architecture-consistent but not fully input/control-identical.

### `smoke_v4_evalprobe_fromv3` is not a clean retrain result

This run:

- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/`

used corrected planner logic on top of `smoke_v3` weights. It is useful evidence that the active adapter can matter, but it is not a clean end-to-end retrain.

## What Was Actually Learned

The current repo supports the following claims:

- the structured adapter remains a viable approach
- the active branch can clearly matter on a real public dense-occlusion benchmark proxy
- `adapter_noop` remains a useful fairness control
- bag-like retrieval still shows modest positive evidence
- cloth-like retrieval is currently the weak link

It does not support the following stronger claims yet:

- broad superiority on realistic manipulation benchmarks
- stable multi-seed wins across all three target-like public proxy tracks
- a clean modality-matched comparison across occlusion, bag, and cloth

## Environment And Setup

This repo preserves two separate environments.

### Prior `VLAarchtests3` / RLBench stack

Preserved under:

- `setup/ENVIRONMENT.md`
- `setup/env_vars.sh`
- `setup/rlbench_pip_freeze.txt`

This is the older RLBench / AnyBimanual-oriented environment.

### Current public benchmark stack

Preserved under:

- `setup/public_benchmark/ENVIRONMENT.md`
- `setup/public_benchmark/env_vars.sh`
- `setup/public_benchmark/python_version.txt`
- `setup/public_benchmark/uname.txt`
- `setup/public_benchmark/nvidia_smi.txt`
- `setup/public_benchmark/gpu_short.txt`
- `setup/public_benchmark/pip_freeze_python311.txt`
- `setup/public_benchmark/rlbench_env_pip_freeze.txt`
- `setup/public_benchmark/hf_env.txt`

The public benchmark runs in this session were assembled on:

- GPU: `NVIDIA L40S`
- VRAM: `46068 MiB`
- driver: `580.126.09`
- Python: `3.11.10`
- kernel: `Linux 6.8.0-88-generic`

## Recommended Starting Points

If you want the strongest current public benchmark evidence, start here:

- `docs/maniskill_pickclutter_correction_log_2026-04-01.md`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`

If you want the bag/cloth public bridge follow-up, start here:

- `docs/public_bridge_smoke_run_log_2026-04-01.md`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`

If you want the repo lineage context, start here:

- `history/VLAarchtests_previous_README.md`
- `history/VLAarchtests2_previous_README.md`
- `history/VLAarchtests3_previous_README.md`

## Bottom Line

This repo is the complete organization package for the current workspace state.

It includes:

- the `VLAarchtests3` export base
- the full current machine `reports/`, `outputs/`, and `data/` trees
- the public benchmark code, datasets, checkpoints, and results
- the environment files needed to stand up the same stack on similar hardware

Use it as the archival handoff state for continuing the elastic-occlusion adapter work.

Do not cite it as if all three target-like public proxy tracks are already cleanly solved. The occlusion track is the strongest current evidence; bag is modest; cloth remains weak; and the bridge-task perception stack still needs a proper real-depth rewrite.