---
tags:
- robotics
- vision-language-action
- bimanual-manipulation
- maniskill
- rlbench
- rgbd
---
# VLAarchtests4
`VLAarchtests4` is the fresh organization repo for the RunPod work staged from `/workspace` on `2026-04-01 UTC`.
It carries forward the earlier repo lineage and adds the current public-sim benchmark package work:
- `VLAarchtests`
- early proxy + RLBench architecture search, handoff checkpoints, and environment recreation files from the `2026-03-25/26` sessions
- `VLAarchtests2`
- larger exploratory organization repo with more baselines, overlap/anchor work, frequent model changes, mixed artifacts, and several results that required later reinterpretation
- `VLAarchtests3`
- cleaned export focused on the elastic-occlusion `trunk + structured adapter + no-op fallback` refactor, validated tests, current checkpoints, and handoff docs
- `VLAarchtests4`
- keeps the `VLAarchtests3` export intact and adds the full current workspace `reports/`, `outputs/`, and `data/` trees, including all public benchmark smoke runs, checkpoint directories, dataset bundles, validation sweeps, and environment snapshots from the public-sim evaluation pass
## What This Repo Adds
The main new addition in this repo is the public benchmark track work for the elastic-occlusion adapter:
- real public-sim smoke runs on:
- ManiSkill `PickClutterYCB-v1` as the dense occluded retrieval proxy
- the public ManiSkill bridge basket proxy as the bag retrieval proxy
- the public ManiSkill bridge cloth proxy as the folded-cloth retrieval proxy
- the public benchmark package code and summaries
- the train/eval logs, checkpoints, cached datasets, validation sweeps, and correction logs for those runs
- full visual rerenders of the final `smoke_v5_eval_tuned_softerpref` dense-occlusion benchmark for both `trunk_only_ft` and `adapter_active_ft`
- the same-machine environment snapshot for the public benchmark stack used on this RunPod
## Top-Level Contents
- `code/`
- the cleaned code snapshot inherited from `VLAarchtests3`
- `artifacts/`
- prior staged checkpoints, proxy data, reports, and generated configs already bundled by `VLAarchtests3`
- `docs/`
- prior handoff/audit docs plus the current public benchmark run logs and correction notes
- `legacy/`
- older exact artifacts preserved by `VLAarchtests3`
- `setup/`
- prior environment files plus a new public benchmark environment snapshot under `setup/public_benchmark/`
- `history/`
- copied README history for `VLAarchtests`, `VLAarchtests2`, and `VLAarchtests3`
- `reports/`
- the full current `/workspace/workspace/reports` tree from this machine
- `outputs/`
- the full current `/workspace/workspace/outputs` tree from this machine
- `data/`
- the full current `/workspace/workspace/data` tree from this machine
- `PUBLIC_BENCHMARK_RESULTS.md`
- compact index of all public benchmark train/eval results from this session
- `MODEL_AND_ARTIFACT_INDEX.md`
- practical map of the main artifact roots to start from
## Benchmark GIF Renders
The repo now also includes a full rendered replay of the final dense-occlusion benchmark:
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
- `50` held-out `trunk_only_ft` gifs
- `50` held-out `adapter_active_ft` gifs
- `index.html`, `INDEX.md`, and `manifest.json` for browsing and validation
- renderer:
- `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/render_maniskill_pickclutter_benchmark_gifs.py`
Important caveats:
- these gifs are rerendered from the saved `smoke_v5_eval_tuned_softerpref` checkpoints and exact held-out seeds, not a different benchmark run
- the rerender kept the same `softer_pref` planner override used in the reported held-out result
- the rerender manifest records `0` success mismatches versus the saved benchmark json files
- only the dense-occlusion track has this full gif export right now
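The zero-mismatch claim can be spot-checked with a comparison along these lines. The per-episode success maps here are an assumed shape; the real `manifest.json` and benchmark json schemas may differ:

```python
def count_success_mismatches(rerendered, saved):
    """Count disagreements between rerendered and saved success flags.

    Both arguments are assumed to be {episode_seed: success_bool} maps
    extracted from the rerender manifest and the saved benchmark json
    (illustrative schema, not the actual file layout)."""
    shared = set(rerendered) & set(saved)
    return sum(1 for seed in shared if rerendered[seed] != saved[seed])
```

A rerender that truly replays the saved checkpoints on the exact held-out seeds should return `0` here for both `trunk_only_ft` and `adapter_active_ft`.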
## Architecture State Carried Forward
The core model family inherited from `VLAarchtests3` is still:
- `trunk_only`
- `adapter_noop`
- `adapter_active`
The important architectural state carried into the public benchmark work is:
- wrapped-policy interface with exact `trunk_only`, `adapter_noop`, and `adapter_active` modes
- structured reveal/retrieve adapter with:
- state prediction
- task-routed proposal families
- retrieve-feasibility gating
- lightweight transition model
- planner/reranker
- planner fixes that replaced hard vetoes with softer stage penalties in:
- `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py`
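The hard-veto to soft-penalty change can be sketched as below. The candidate fields and the penalty name are illustrative, not the actual `planner.py` interface:

```python
def score_candidate(base_score, state, premature_retrieve_penalty=0.5):
    """Soft-penalty scoring sketch: rather than rejecting a premature
    retrieve outright (a hard veto, effectively -inf), subtract a stage
    penalty so a sufficiently strong candidate can still be selected.
    The 'stage'/'proposal' fields are illustrative."""
    score = base_score
    if state["proposal"] == "retrieve" and state["stage"] != "retrieve_ready":
        score -= premature_retrieve_penalty  # soft penalty, not a veto
    return score
```

The practical effect is that stage heuristics bias the reranker instead of silently forcing the fallback branch.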
## Public Benchmark Summary
Detailed per-run results are in `PUBLIC_BENCHMARK_RESULTS.md`. The short version is:
### 1. Dense occluded retrieval proxy
Benchmark:
- ManiSkill `PickClutterYCB-v1`
Best current held-out result:
- directory:
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/`
- summary:
- `trunk_only_ft = 0.04`
- `adapter_noop = 0.04`
- `adapter_active_ft = 0.62`
- `delta_active_vs_trunk = +0.58`
- `95% CI = [0.44, 0.72]`
- `intervention_rate = 1.0`
- `non_base_selection_rate = 1.0`
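An interval like the one above can be recomputed with a percentile bootstrap over paired per-episode binary successes. This is an illustrative recomputation under assumed pairing, not the script that produced the reported `[0.44, 0.72]`:

```python
import random

def bootstrap_delta_ci(trunk, active, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(active) - mean(trunk) over
    paired per-episode binary successes (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(trunk)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample episodes
        deltas.append(sum(active[i] - trunk[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic 50-episode arrays at the reported rates (0.04 vs 0.62);
# the actual per-episode alignment is an assumption here.
trunk = [1] * 2 + [0] * 48
active = [1] * 31 + [0] * 19
```

With 50 episodes, an interval of roughly this width around `+0.58` is what a binary-outcome bootstrap produces.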
Important caveat:
- this was not a new retrain after `smoke_v5`
- it used the same `smoke_v5` checkpoints with planner hyperparameters selected on the frozen validation split and then applied once to the untouched held-out split
### 2. Bag retrieval proxy
Benchmark:
- public ManiSkill bridge basket retrieval proxy
Current fair read:
- seed `17` corrected held-out:
- `trunk = 0.32`
- `noop = 0.00`
- `active = 0.48`
- seed `23` corrected held-out:
- `trunk = 0.48`
- `noop = 0.08`
- `active = 0.48`
- corrected 2-seed aggregate:
- `trunk = 0.40`
- `noop = 0.04`
- `active = 0.48`
- `delta = +0.08`
Interpretation:
- bag remains modestly positive after using one consistent corrected planner across seeds
- the effect is smaller and less clean than the best occlusion result
### 3. Cloth retrieval proxy
Benchmark:
- public ManiSkill bridge cloth retrieval proxy
Current read:
- seed `17`:
- `trunk = 0.04`
- `noop = 0.04`
- `active = 0.10`
- seed `23`:
- `trunk = 0.04`
- `noop = 0.02`
- `active = 0.02`
- seed `29`:
- `trunk = 0.04`
- `noop = 0.04`
- `active = 0.04`
- 3-seed aggregate:
- `trunk = 0.0400`
- `noop = 0.0333`
- `active = 0.0533`
- `delta = +0.0133`
Interpretation:
- cloth is weak and unstable
- current evidence does not support a strong cloth-specific win
## Important Fairness Notes
The fairness story is mixed and should be stated plainly.
What is fair in the strongest public benchmark result:
- same initialization checkpoint for `trunk_only_ft` and `adapter_active_ft`
- same train/val/test split within each task
- same optimizer, LR, batch size, and unfreeze scope within each task
- `adapter_noop` is evaluated from the same adapter checkpoint as `adapter_active_ft`
- the held-out test episodes were not hand-picked after seeing outcomes
What is not fully paper-clean yet:
- most current public benchmark evidence is smoke-scale and low-seed
- the occlusion headline result depends on validation-selected planner tuning on top of a fixed checkpoint
- bag required eval-side planner correction for one seed to avoid a collapse
- cloth remains weak even after additional seeds and val sweeps
### PickClutter Split Fairness
The important point for the dense-occlusion track is that the dataset split did not drift across the early smoke versions.
- `data/maniskill_pickclutter/smoke_v1/episode_splits.json`
- `data/maniskill_pickclutter/smoke_v2/episode_splits.json`
- `data/maniskill_pickclutter/smoke_v3/episode_splits.json`
These files contain the same episode ids:
- train: `170000..170031`
- val: `171000..171007`
- eval: `172000..172049`
Also:
- there is no `data/maniskill_pickclutter/smoke_v4/`
- there is no `data/maniskill_pickclutter/smoke_v5/`
`smoke_v4` and `smoke_v5` were code/report version labels, not new held-out episode bundles.
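The no-drift claim is mechanically checkable: load the three `episode_splits.json` files and confirm they define identical id sets. A minimal sketch, assuming each file maps split names to lists of episode ids:

```python
import json

def splits_identical(split_dicts):
    """Return True if every split dict defines the same train/val/eval
    episode-id sets, order-insensitively. Each dict is assumed to map
    split names to lists of episode ids, as in episode_splits.json."""
    norm = [{k: sorted(v) for k, v in d.items()} for d in split_dicts]
    return all(d == norm[0] for d in norm[1:])

paths = [
    "data/maniskill_pickclutter/smoke_v1/episode_splits.json",
    "data/maniskill_pickclutter/smoke_v2/episode_splits.json",
    "data/maniskill_pickclutter/smoke_v3/episode_splits.json",
]
# splits_identical([json.load(open(p)) for p in paths]) should hold on this repo
```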
### What Changed Across PickClutter Versions
The big changes across `smoke_v2`, `smoke_v3`, `smoke_v4`, and `smoke_v5` were:
- more benchmark-derived state supervision
- transition-model training enablement
- planner bug fixes
- fairness fixes so the adapter checkpoint did not hide a stronger shared trunk
- then a frozen-validation planner sweep for the final held-out eval
The big occlusion win was not caused by changing the eval episodes.
### Dense-Occlusion Render Artifacts
The final dense-occlusion run also has a full visual export in:
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
Those gifs show the robot interacting with the 3D scene and overlay the adapter state per frame. For `adapter_active_ft`, the overlay includes:
- adapter on/off state
- whether a non-base proposal was selected
- candidate index
- planner name
- planner score/confidence
- state signals such as visibility, access, gap, and damage
## Crucial Caveats
### Occlusion result was planner-tuned
The large jump in:
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/`
came from validation-selected planner tuning on top of the same `smoke_v5` checkpoint.
The selected override values were:
- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`
That was a validation-only selection step. It was not a fresh retrain.
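For reproduction, the override set above amounts to a small config dict. Applying it via `setattr` is a sketch; the actual override mechanism in `planner.py` may differ:

```python
# The smoke_v5 softer_pref override values as reported above.
SOFTERPREF_OVERRIDES = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "premature_insert_penalty": 0.25,
    "premature_maintain_penalty": 1.0,
    "occlusion_maintain_gap_min_access": 0.30,
    "occlusion_maintain_gap_min_visibility": 0.20,
    "retrieve_stage_access_threshold": 0.18,
    "retrieve_stage_reveal_threshold": 0.18,
    "retrieve_stage_support_threshold": 0.18,
}

def apply_overrides(planner, overrides):
    """Set each override as an attribute on the planner object
    (sketch -- the real planner's config plumbing may differ)."""
    for name, value in overrides.items():
        setattr(planner, name, value)
    return planner
```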
### Bag and cloth did not use real depth
The bridge-task runner for the bag and cloth proxies used:
- one real RGB camera
- copied into all camera slots
- zero-filled depth channels
The runner labels this stack:
- `rgb_triplicate_zero_depth`
This is a real limitation and it should not be hidden.
It happened because the bridge proxy runner used a compatibility shim to satisfy the shared multi-camera tensor interface without plumbing real bridge-scene multiview depth through the stack.
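The shim can be sketched as follows; the `(n_cams, H, W, 4)` shape and RGB-then-depth channel layout are assumptions, not the runner's documented tensor layout:

```python
import numpy as np

def rgb_triplicate_zero_depth(rgb, n_cams=3):
    """Compatibility-shim sketch: copy one real RGB frame into every
    camera slot and zero-fill the depth channel so the result matches
    a shared (n_cams, H, W, 4) RGBD interface. Shapes and channel
    order are illustrative assumptions."""
    h, w, _ = rgb.shape
    rgbd = np.zeros((n_cams, h, w, 4), dtype=rgb.dtype)
    rgbd[..., :3] = rgb  # same RGB broadcast into all camera slots
    return rgbd          # depth channel stays all-zero
```

This makes concrete why the bag/cloth tracks got weaker perception than PickClutter's real `rgbd_3cam`: all three camera slots carry one view, and depth carries nothing.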
Consequences:
- bag and cloth are not modality-matched to the PickClutter runs
- PickClutter used real `rgbd_3cam`
- bag and cloth used weaker perception input
### Bag and cloth also used a different control wrapper
PickClutter:
- observation stack: `rgbd_3cam`
- action space: `bimanual_delta_pose`
Bag and cloth:
- observation stack: `rgb_triplicate_zero_depth`
- action space: `widowx_delta_pose`
So the cross-track story is architecture-consistent but not fully input/control-identical.
### `smoke_v4_evalprobe_fromv3` is not a clean retrain result
This run:
- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/`
used corrected planner logic on top of `smoke_v3` weights. It is useful evidence that the active adapter can matter, but it is not a clean end-to-end retrain.
## What Was Actually Learned
The current repo supports the following claims:
- the structured adapter is still alive
- the active branch can clearly matter on a real public dense-occlusion benchmark proxy
- `adapter_noop` remains a useful fairness control
- bag-like retrieval still shows modest positive evidence
- cloth-like retrieval is currently the weak link
It does not support the following stronger claims yet:
- broad superiority on realistic manipulation benchmarks
- stable multi-seed wins across all three target-like public proxy tracks
- a clean modality-matched comparison across occlusion, bag, and cloth
## Environment And Setup
Two environment stories exist in this repo.
### Prior `VLAarchtests3` / RLBench stack
Preserved under:
- `setup/ENVIRONMENT.md`
- `setup/env_vars.sh`
- `setup/rlbench_pip_freeze.txt`
This is the older RLBench / AnyBimanual oriented environment.
### Current public benchmark stack
Preserved under:
- `setup/public_benchmark/ENVIRONMENT.md`
- `setup/public_benchmark/env_vars.sh`
- `setup/public_benchmark/python_version.txt`
- `setup/public_benchmark/uname.txt`
- `setup/public_benchmark/nvidia_smi.txt`
- `setup/public_benchmark/gpu_short.txt`
- `setup/public_benchmark/pip_freeze_python311.txt`
- `setup/public_benchmark/rlbench_env_pip_freeze.txt`
- `setup/public_benchmark/hf_env.txt`
The public benchmark runs in this session were assembled on:
- GPU: `NVIDIA L40S`
- VRAM: `46068 MiB`
- driver: `580.126.09`
- Python: `3.11.10`
- kernel: `Linux 6.8.0-88-generic`
## Recommended Starting Points
If you want the strongest current public benchmark evidence, start here:
- `docs/maniskill_pickclutter_correction_log_2026-04-01.md`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
If you want the bag/cloth public bridge follow-up, start here:
- `docs/public_bridge_smoke_run_log_2026-04-01.md`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`
If you want the repo lineage context, start here:
- `history/VLAarchtests_previous_README.md`
- `history/VLAarchtests2_previous_README.md`
- `history/VLAarchtests3_previous_README.md`
## Bottom Line
This repo is the complete organization package for the current workspace state.
It includes:
- the `VLAarchtests3` export base
- the full current machine `reports/`, `outputs/`, and `data/` trees
- the public benchmark code, datasets, checkpoints, and results
- the environment files needed to stand up the same stack on similar hardware
Use it as the archival handoff state for continuing the elastic-occlusion adapter work.
Do not cite it as if all three target-like public proxy tracks are already cleanly solved. The occlusion track is the strongest current evidence; bag is modest; cloth remains weak; and the bridge-task perception stack still needs a proper real-depth rewrite.