| --- |
| tags: |
| - robotics |
| - vision-language-action |
| - bimanual-manipulation |
| - maniskill |
| - rlbench |
| - rgbd |
| --- |
| |
| # VLAarchtests4 |
|
|
| `VLAarchtests4` is the fresh organization repo for the RunPod work staged from `/workspace` on `2026-04-01 UTC`. |
|
|
| It carries forward the earlier repo lineage and adds the current public-sim benchmark package work: |
|
|
| - `VLAarchtests` |
|   - early proxy + RLBench architecture search, handoff checkpoints, and environment recreation files from the `2026-03-25/26` sessions |
| - `VLAarchtests2` |
|   - larger exploratory organization repo with more baselines, overlap/anchor work, frequent model changes, mixed artifacts, and several results that required later reinterpretation |
| - `VLAarchtests3` |
|   - cleaned export focused on the elastic-occlusion `trunk + structured adapter + no-op fallback` refactor, validated tests, current checkpoints, and handoff docs |
| - `VLAarchtests4` |
|   - keeps the `VLAarchtests3` export intact and adds the full current workspace `reports/`, `outputs/`, and `data/` trees, including all public benchmark smoke runs, checkpoint directories, dataset bundles, validation sweeps, and environment snapshots from the public-sim evaluation pass |
|
|
| ## What This Repo Adds |
|
|
| The main addition in this repo is the public benchmark track work for the elastic-occlusion adapter: |
|
|
| - real public-sim smoke runs on: |
|   - ManiSkill `PickClutterYCB-v1` as the dense occluded retrieval proxy |
|   - ManiSkill bridge basket retrieval proxy as the bag retrieval proxy |
|   - ManiSkill bridge cloth retrieval proxy as the folded-cloth retrieval proxy |
| - the public benchmark package code and summaries |
| - the train/eval logs, checkpoints, cached datasets, validation sweeps, and correction logs for those runs |
| - full visual rerenders of the final `smoke_v5_eval_tuned_softerpref` dense-occlusion benchmark for both `trunk_only_ft` and `adapter_active_ft` |
| - the same-machine environment snapshot for the public benchmark stack used on this RunPod |
|
|
| ## Top-Level Contents |
|
|
| - `code/` |
|   - the cleaned code snapshot inherited from `VLAarchtests3` |
| - `artifacts/` |
|   - prior staged checkpoints, proxy data, reports, and generated configs already bundled by `VLAarchtests3` |
| - `docs/` |
|   - prior handoff/audit docs plus the current public benchmark run logs and correction notes |
| - `legacy/` |
|   - older exact artifacts preserved by `VLAarchtests3` |
| - `setup/` |
|   - prior environment files plus a new public benchmark environment snapshot under `setup/public_benchmark/` |
| - `history/` |
|   - copied README history for `VLAarchtests`, `VLAarchtests2`, and `VLAarchtests3` |
| - `reports/` |
|   - the full current `/workspace/workspace/reports` tree from this machine |
| - `outputs/` |
|   - the full current `/workspace/workspace/outputs` tree from this machine |
| - `data/` |
|   - the full current `/workspace/workspace/data` tree from this machine |
| - `PUBLIC_BENCHMARK_RESULTS.md` |
|   - compact index of all public benchmark train/eval results from this session |
| - `MODEL_AND_ARTIFACT_INDEX.md` |
|   - practical map of the main artifact roots to start from |
|
|
| ## Benchmark GIF Renders |
|
|
| The repo now also includes a full rendered replay of the final dense-occlusion benchmark: |
|
|
| - `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/` |
|   - `50` held-out `trunk_only_ft` gifs |
|   - `50` held-out `adapter_active_ft` gifs |
|   - `index.html`, `INDEX.md`, and `manifest.json` for browsing and validation |
| - renderer: |
|   - `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/render_maniskill_pickclutter_benchmark_gifs.py` |
|
|
| Important caveats: |
|
|
| - these gifs are rerendered from the saved `smoke_v5_eval_tuned_softerpref` checkpoints and exact held-out seeds, not a different benchmark run |
| - the rerender kept the same `softer_pref` planner override used in the reported held-out result |
| - the rerender manifest records `0` success mismatches versus the saved benchmark json files |
| - only the dense-occlusion track has this full gif export right now |
|
|
| ## Architecture State Carried Forward |
|
|
| The core model family inherited from `VLAarchtests3` is still: |
|
|
| - `trunk_only` |
| - `adapter_noop` |
| - `adapter_active` |
|
|
| The important architectural state carried into the public benchmark work is: |
|
|
| - wrapped-policy interface with exact `trunk_only`, `adapter_noop`, and `adapter_active` modes |
| - structured reveal/retrieve adapter with: |
|   - state prediction |
|   - task-routed proposal families |
|   - retrieve-feasibility gating |
|   - lightweight transition model |
|   - planner/reranker |
| - planner fixes that replaced hard vetoes with softer stage penalties in: |
|   - `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py` |
|
|
| ## Public Benchmark Summary |
|
|
| Detailed per-run results are in `PUBLIC_BENCHMARK_RESULTS.md`. The short version is: |
|
|
| ### 1. Dense occluded retrieval proxy |
|
|
| Benchmark: |
|
|
| - ManiSkill `PickClutterYCB-v1` |
|
|
| Best current held-out result: |
|
|
| - directory: |
|   - `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/` |
| - summary: |
|   - `trunk_only_ft = 0.04` |
|   - `adapter_noop = 0.04` |
|   - `adapter_active_ft = 0.62` |
|   - `delta_active_vs_trunk = +0.58` |
|   - `95% CI = [0.44, 0.72]` |
|   - `intervention_rate = 1.0` |
|   - `non_base_selection_rate = 1.0` |
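
As a rough sanity check on the summary above, the sketch below recomputes the delta and an unpaired normal-approximation (Wald) interval for a difference of two success rates. It assumes 50 held-out episodes per mode (matching the 50 eval ids and 50 gifs per mode listed in this README) and assumes the reported interval refers to `delta_active_vs_trunk`. The README does not say how the reported CI was computed, so this is a plausibility check, not a reproduction of that method.

```python
import math

N_EPISODES = 50  # assumption: one rollout per held-out episode id, per mode


def wald_diff_interval(p_a, p_b, n_a, n_b, z=1.959964):
    """Normal-approximation interval for p_a - p_b, treating samples as independent."""
    delta = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return delta, delta - z * se, delta + z * se


# Success rates copied from the summary above.
delta, lo, hi = wald_diff_interval(0.62, 0.04, N_EPISODES, N_EPISODES)
print(f"delta={delta:+.2f}, approx 95% interval=[{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the approximate bounds land within about 0.01 of the reported `[0.44, 0.72]`. A paired or bootstrap interval over per-episode outcomes would be the more appropriate tool if the per-episode json files are available.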
|
|
| Important caveat: |
|
|
| - this was not a new retrain after `smoke_v5` |
| - it used the same `smoke_v5` checkpoints with planner hyperparameters selected on the frozen validation split and then applied once to the untouched held-out split |
|
|
| ### 2. Bag retrieval proxy |
|
|
| Benchmark: |
|
|
| - public ManiSkill bridge basket retrieval proxy |
|
|
| Current fair read: |
|
|
| - seed `17` corrected held-out: |
|   - `trunk = 0.32` |
|   - `noop = 0.00` |
|   - `active = 0.48` |
| - seed `23` corrected held-out: |
|   - `trunk = 0.48` |
|   - `noop = 0.08` |
|   - `active = 0.48` |
| - corrected 2-seed aggregate: |
|   - `trunk = 0.40` |
|   - `noop = 0.04` |
|   - `active = 0.48` |
|   - `delta = +0.08` |
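
The 2-seed aggregate above is consistent with an unweighted mean over seeds. The sketch below reproduces it from the per-seed numbers; the `aggregate` helper is illustrative and not part of the repo, and the unweighted mean is an assumption that holds only if both seeds used the same number of held-out episodes.

```python
from statistics import fmean

# Per-seed corrected held-out success rates, copied from the list above.
bag = {
    17: {"trunk": 0.32, "noop": 0.00, "active": 0.48},
    23: {"trunk": 0.48, "noop": 0.08, "active": 0.48},
}


def aggregate(per_seed):
    """Unweighted mean over seeds for each mode, plus the active-minus-trunk delta."""
    modes = next(iter(per_seed.values())).keys()
    agg = {m: fmean(run[m] for run in per_seed.values()) for m in modes}
    agg["delta"] = agg["active"] - agg["trunk"]
    return agg


bag_agg = aggregate(bag)
print({k: round(v, 4) for k, v in bag_agg.items()})
```

Note that the whole aggregate delta comes from seed `17`; seed `23` is a tie between `trunk` and `active`.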
|
|
| Interpretation: |
|
|
| - bag remains modestly positive after using one consistent corrected planner across seeds |
| - the effect is smaller and less clean than the best occlusion result |
|
|
| ### 3. Cloth retrieval proxy |
|
|
| Benchmark: |
|
|
| - public ManiSkill bridge cloth retrieval proxy |
|
|
| Current read: |
|
|
| - seed `17`: |
|   - `trunk = 0.04` |
|   - `noop = 0.04` |
|   - `active = 0.10` |
| - seed `23`: |
|   - `trunk = 0.04` |
|   - `noop = 0.02` |
|   - `active = 0.02` |
| - seed `29`: |
|   - `trunk = 0.04` |
|   - `noop = 0.04` |
|   - `active = 0.04` |
| - 3-seed aggregate: |
|   - `trunk = 0.0400` |
|   - `noop = 0.0333` |
|   - `active = 0.0533` |
|   - `delta = +0.0133` |
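
Looking at the per-seed deltas rather than only the aggregate makes the instability concrete. The short sketch below uses only the numbers listed above and assumes the aggregate is an unweighted mean over the three seeds.

```python
# Per-seed cloth success rates, copied from the list above.
cloth = {
    17: {"trunk": 0.04, "noop": 0.04, "active": 0.10},
    23: {"trunk": 0.04, "noop": 0.02, "active": 0.02},
    29: {"trunk": 0.04, "noop": 0.04, "active": 0.04},
}

# Active-minus-trunk delta for each seed, rounded to suppress float noise.
per_seed_delta = {
    seed: round(run["active"] - run["trunk"], 4) for seed, run in cloth.items()
}
mean_delta = round(sum(per_seed_delta.values()) / len(per_seed_delta), 4)

print(per_seed_delta)
print(mean_delta)
```

Only seed `17` has a positive delta; seed `23` is negative and seed `29` is a tie, so the small positive mean rests on a single seed.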
|
|
| Interpretation: |
|
|
| - cloth is weak and unstable |
| - current evidence does not support a strong cloth-specific win |
|
|
| ## Important Fairness Notes |
|
|
| The fairness story is mixed and should be stated plainly. |
|
|
| What is fair in the strongest public benchmark result: |
|
|
| - same initialization checkpoint for `trunk_only_ft` and `adapter_active_ft` |
| - same train/val/test split within each task |
| - same optimizer, LR, batch size, and unfreeze scope within each task |
| - `adapter_noop` is evaluated from the same adapter checkpoint as `adapter_active_ft` |
| - the held-out test episodes were not hand-picked after seeing outcomes |
|
|
| What is not fully paper-clean yet: |
|
|
| - most current public benchmark evidence is smoke-scale and low-seed |
| - the occlusion headline result depends on validation-selected planner tuning on top of a fixed checkpoint |
| - bag required eval-side planner correction for one seed to avoid a collapse |
| - cloth remains weak even after additional seeds and val sweeps |
|
|
| ### PickClutter Split Fairness |
|
|
| The important point for the dense-occlusion track is that the dataset split did not drift across the early smoke versions. |
|
|
| - `data/maniskill_pickclutter/smoke_v1/episode_splits.json` |
| - `data/maniskill_pickclutter/smoke_v2/episode_splits.json` |
| - `data/maniskill_pickclutter/smoke_v3/episode_splits.json` |
|
|
| These files contain the same episode ids: |
|
|
| - train: `170000..170031` |
| - val: `171000..171007` |
| - eval: `172000..172049` |
|
|
| Also: |
|
|
| - there is no `data/maniskill_pickclutter/smoke_v4/` |
| - there is no `data/maniskill_pickclutter/smoke_v5/` |
|
|
| `smoke_v4` and `smoke_v5` were code/report version labels, not new held-out episode bundles. |
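
The split claim above can be checked mechanically. The sketch below is a minimal example, assuming each `episode_splits.json` is a JSON object with top-level `train`, `val`, and `eval` lists of episode ids; that schema is an assumption, so adjust `SPLIT_KEYS` to match the actual files. The helper names are illustrative and not part of the repo.

```python
import json
from pathlib import Path

SPLIT_KEYS = ("train", "val", "eval")  # assumption: top-level keys in each JSON file


def load_splits(path):
    """Return {split_name: sorted episode ids} from one episode_splits.json."""
    raw = json.loads(Path(path).read_text())
    return {key: sorted(int(e) for e in raw[key]) for key in SPLIT_KEYS}


def splits_identical(paths):
    """True if every file holds exactly the same ids in every split."""
    loaded = [load_splits(p) for p in paths]
    return all(s == loaded[0] for s in loaded[1:])


# Ids as stated in this README (ranges are inclusive).
expected = {
    "train": list(range(170000, 170032)),
    "val": list(range(171000, 171008)),
    "eval": list(range(172000, 172050)),
}
print({k: len(v) for k, v in expected.items()})  # {'train': 32, 'val': 8, 'eval': 50}
```

From the repo root, calling `splits_identical` on the three `data/maniskill_pickclutter/smoke_v{1,2,3}/episode_splits.json` paths and comparing `load_splits(...)` against `expected` would confirm both that the files agree and that they match the ranges stated here.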
|
|
| ### What Changed Across PickClutter Versions |
|
|
| The big changes across `smoke_v2`, `smoke_v3`, `smoke_v4`, and `smoke_v5` were: |
|
|
| - more benchmark-derived state supervision |
| - transition-model training enablement |
| - planner bug fixes |
| - fairness fixes so the adapter checkpoint did not hide a stronger shared trunk |
| - then a frozen-validation planner sweep for the final held-out eval |
|
|
| The big occlusion win was not caused by changing the eval episodes. |
|
|
| ### Dense-Occlusion Render Artifacts |
|
|
| The final dense-occlusion run also has a full visual export in: |
|
|
| - `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/` |
|
|
| Those gifs show the robot interacting with the 3D scene and overlay the adapter state per frame. For `adapter_active_ft`, the overlay includes: |
|
|
| - adapter on/off state |
| - whether a non-base proposal was selected |
| - candidate index |
| - planner name |
| - planner score/confidence |
| - state signals such as visibility, access, gap, and damage |
|
|
| ## Crucial Caveats |
|
|
| ### Occlusion result was planner-tuned |
|
|
| The large jump in: |
|
|
| - `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/` |
|
|
| came from validation-selected planner tuning on top of the same `smoke_v5` checkpoint. |
|
|
| The selected override values were: |
|
|
| - `mode_preference_bonus = 0.75` |
| - `premature_retrieve_penalty = 0.5` |
| - `premature_insert_penalty = 0.25` |
| - `premature_maintain_penalty = 1.0` |
| - `occlusion_maintain_gap_min_access = 0.30` |
| - `occlusion_maintain_gap_min_visibility = 0.20` |
| - `retrieve_stage_access_threshold = 0.18` |
| - `retrieve_stage_reveal_threshold = 0.18` |
| - `retrieve_stage_support_threshold = 0.18` |
|
|
| That was a validation-only selection step. It was not a fresh retrain. |
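
For reference, the selected values can be written as a single override mapping. The sketch below is illustrative only: the values are copied from the list above, but the dict-based config and the `apply_overrides` helper are hypothetical, since the real planner configuration lives in `planner.py` and its interface is not shown in this README.

```python
# Override values copied from the list above.
SOFTER_PREF_OVERRIDES = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "premature_insert_penalty": 0.25,
    "premature_maintain_penalty": 1.0,
    "occlusion_maintain_gap_min_access": 0.30,
    "occlusion_maintain_gap_min_visibility": 0.20,
    "retrieve_stage_access_threshold": 0.18,
    "retrieve_stage_reveal_threshold": 0.18,
    "retrieve_stage_support_threshold": 0.18,
}


def apply_overrides(base_config, overrides):
    """Return a copy of base_config with overrides applied; reject unknown keys.

    Hypothetical helper: assumes the planner config can be viewed as a flat dict.
    """
    unknown = sorted(set(overrides) - set(base_config))
    if unknown:
        raise KeyError(f"override keys not in planner config: {unknown}")
    return {**base_config, **overrides}
```

Rejecting unknown keys is a deliberate choice here: a misspelled override would otherwise be silently ignored and the eval would run with default planner values while being reported as tuned.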
|
|
| ### Bag and cloth did not use real depth |
|
|
| The bridge-task runner for the bag and cloth proxies used: |
|
|
| - a single real RGB camera image |
| - that same image copied into every camera slot |
| - zero-filled depth channels |
|
|
| The runner labels this stack: |
|
|
| - `rgb_triplicate_zero_depth` |
|
|
| This is a real limitation and it should not be hidden. |
|
|
| It happened because the bridge proxy runner used a compatibility shim to satisfy the shared multi-camera tensor interface without plumbing real bridge-scene multiview depth through the stack. |
|
|
| Consequences: |
|
|
| - bag and cloth are not modality-matched to the PickClutter runs |
| - PickClutter used real `rgbd_3cam` |
| - bag and cloth used weaker perception input |
|
|
| ### Bag and cloth also used a different control wrapper |
|
|
| PickClutter: |
|
|
| - observation stack: `rgbd_3cam` |
| - action space: `bimanual_delta_pose` |
|
|
| Bag and cloth: |
|
|
| - observation stack: `rgb_triplicate_zero_depth` |
| - action space: `widowx_delta_pose` |
|
|
| So the three tracks share one architecture, but they differ in both observation stack and action space and are not directly comparable on input or control. |
|
|
| ### `smoke_v4_evalprobe_fromv3` is not a clean retrain result |
| |
| This run: |
| |
| - `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/` |
|
|
| used corrected planner logic on top of `smoke_v3` weights. It is useful evidence that the active adapter can matter, but it is not a clean end-to-end retrain. |
|
|
| ## What Was Actually Learned |
|
|
| The current repo supports the following claims: |
|
|
| - the structured adapter remains a viable design worth continuing |
| - the active branch can clearly matter on a real public dense-occlusion benchmark proxy |
| - `adapter_noop` remains a useful fairness control |
| - bag-like retrieval still shows modest positive evidence |
| - cloth-like retrieval is currently the weak link |
|
|
| It does not support the following stronger claims yet: |
|
|
| - broad superiority on realistic manipulation benchmarks |
| - stable multi-seed wins across all three target-like public proxy tracks |
| - a clean modality-matched comparison across occlusion, bag, and cloth |
|
|
| ## Environment And Setup |
|
|
| This repo preserves two separate environments. |
|
|
| ### Prior `VLAarchtests3` / RLBench stack |
|
|
| Preserved under: |
|
|
| - `setup/ENVIRONMENT.md` |
| - `setup/env_vars.sh` |
| - `setup/rlbench_pip_freeze.txt` |
|
|
| This is the older RLBench / AnyBimanual oriented environment. |
|
|
| ### Current public benchmark stack |
|
|
| Preserved under: |
|
|
| - `setup/public_benchmark/ENVIRONMENT.md` |
| - `setup/public_benchmark/env_vars.sh` |
| - `setup/public_benchmark/python_version.txt` |
| - `setup/public_benchmark/uname.txt` |
| - `setup/public_benchmark/nvidia_smi.txt` |
| - `setup/public_benchmark/gpu_short.txt` |
| - `setup/public_benchmark/pip_freeze_python311.txt` |
| - `setup/public_benchmark/rlbench_env_pip_freeze.txt` |
| - `setup/public_benchmark/hf_env.txt` |
|
|
| The public benchmark runs in this session were assembled on: |
|
|
| - GPU: `NVIDIA L40S` |
| - VRAM: `46068 MiB` |
| - driver: `580.126.09` |
| - Python: `3.11.10` |
| - kernel: `Linux 6.8.0-88-generic` |
|
|
| ## Recommended Starting Points |
|
|
| If you want the strongest current public benchmark evidence, start here: |
|
|
| - `docs/maniskill_pickclutter_correction_log_2026-04-01.md` |
| - `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json` |
|
|
| If you want the bag/cloth public bridge follow-up, start here: |
|
|
| - `docs/public_bridge_smoke_run_log_2026-04-01.md` |
| - `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json` |
| - `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json` |
|
|
| If you want the repo lineage context, start here: |
|
|
| - `history/VLAarchtests_previous_README.md` |
| - `history/VLAarchtests2_previous_README.md` |
| - `history/VLAarchtests3_previous_README.md` |
|
|
| ## Bottom Line |
|
|
| This repo is the complete organization package for the current workspace state. |
|
|
| It includes: |
|
|
| - the `VLAarchtests3` export base |
| - the full current machine `reports/`, `outputs/`, and `data/` trees |
| - the public benchmark code, datasets, checkpoints, and results |
| - the environment files needed to stand up the same stack on similar hardware |
|
|
| Use it as the archival handoff state for continuing the elastic-occlusion adapter work. |
|
|
| Do not cite it as if all three target-like public proxy tracks are already cleanly solved. The occlusion track is the strongest current evidence; bag is modest; cloth remains weak; and the bridge-task perception stack still needs a proper real-depth rewrite. |
|
|