tags:
- robotics
- vision-language-action
- bimanual-manipulation
- maniskill
- rlbench
- rgbd
VLAarchtests4
VLAarchtests4 is the fresh organization repo for the RunPod work staged from /workspace on 2026-04-01 UTC.
It carries forward the earlier repo lineage and adds the current public-sim benchmark package work:
- `VLAarchtests` - early proxy + RLBench architecture search, handoff checkpoints, and environment recreation files from the 2026-03-25/26 sessions
- `VLAarchtests2` - larger exploratory organization repo with more baselines, overlap/anchor work, frequent model changes, mixed artifacts, and several results that required later reinterpretation
- `VLAarchtests3` - cleaned export focused on the elastic-occlusion trunk + structured adapter + no-op fallback refactor, validated tests, current checkpoints, and handoff docs
- `VLAarchtests4` - keeps the `VLAarchtests3` export intact and adds the full current workspace `reports/`, `outputs/`, and `data/` trees, including all public benchmark smoke runs, checkpoint directories, dataset bundles, validation sweeps, and environment snapshots from the public-sim evaluation pass
What This Repo Adds
The main new addition in this repo is the public benchmark track work for the elastic-occlusion adapter:
- real public-sim smoke runs on:
  - ManiSkill `PickClutterYCB-v1` as the dense occluded retrieval proxy
  - ManiSkill bridge basket retrieval proxy as the bag retrieval proxy
  - ManiSkill bridge cloth retrieval proxy as the folded-cloth retrieval proxy
- the public benchmark package code and summaries
- the train/eval logs, checkpoints, cached datasets, validation sweeps, and correction logs for those runs
- full visual rerenders of the final `smoke_v5_eval_tuned_softerpref` dense-occlusion benchmark for both `trunk_only_ft` and `adapter_active_ft`
- the same-machine environment snapshot for the public benchmark stack used on this RunPod
Top-Level Contents
- `code/` - the cleaned code snapshot inherited from `VLAarchtests3`
- `artifacts/` - prior staged checkpoints, proxy data, reports, and generated configs already bundled by `VLAarchtests3`
- `docs/` - prior handoff/audit docs plus the current public benchmark run logs and correction notes
- `legacy/` - older exact artifacts preserved by `VLAarchtests3`
- `setup/` - prior environment files plus a new public benchmark environment snapshot under `setup/public_benchmark/`
- `history/` - copied README history for `VLAarchtests`, `VLAarchtests2`, and `VLAarchtests3`
- `reports/` - the full current `/workspace/workspace/reports` tree from this machine
- `outputs/` - the full current `/workspace/workspace/outputs` tree from this machine
- `data/` - the full current `/workspace/workspace/data` tree from this machine
- `PUBLIC_BENCHMARK_RESULTS.md` - compact index of all public benchmark train/eval results from this session
- `MODEL_AND_ARTIFACT_INDEX.md` - practical map of the main artifact roots to start from
Benchmark GIF Renders
The repo now also includes a full rendered replay of the final dense-occlusion benchmark:
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
  - 50 held-out `trunk_only_ft` gifs
  - 50 held-out `adapter_active_ft` gifs
  - `index.html`, `INDEX.md`, and `manifest.json` for browsing and validation
- renderer: `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/render_maniskill_pickclutter_benchmark_gifs.py`
Important caveats:
- these gifs are rerendered from the saved `smoke_v5_eval_tuned_softerpref` checkpoints and exact held-out seeds, not a different benchmark run
- the rerender kept the same `softer_pref` planner override used in the reported held-out result
- the rerender manifest records `0` success mismatches versus the saved benchmark json files
- only the dense-occlusion track has this full gif export right now
Architecture State Carried Forward
The core model family inherited from VLAarchtests3 is still:
- `trunk_only`
- `adapter_noop`
- `adapter_active`
The important architectural state carried into the public benchmark work is:
- wrapped-policy interface with exact `trunk_only`, `adapter_noop`, and `adapter_active` modes
- structured reveal/retrieve adapter with:
  - state prediction
  - task-routed proposal families
  - retrieve-feasibility gating
  - lightweight transition model
  - planner/reranker
- planner fixes that replaced hard vetoes with softer stage penalties in `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py`
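The three-mode interface above can be sketched as a thin wrapper. This is an illustrative sketch only, not the actual `reveal_vla_bimanual` interface: the class name `WrappedPolicy` and the `trunk`/`adapter` callables are assumptions.

```python
from typing import Callable

class WrappedPolicy:
    """Hypothetical sketch of a wrapped policy with exact trunk_only,
    adapter_noop, and adapter_active modes. `trunk(obs)` returns a base
    action; `adapter(obs, base)` returns a replacement proposal, or None
    to fall back to the trunk (the no-op fallback)."""

    MODES = ("trunk_only", "adapter_noop", "adapter_active")

    def __init__(self, trunk: Callable, adapter: Callable, mode: str = "trunk_only"):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode}")
        self.trunk, self.adapter, self.mode = trunk, adapter, mode

    def act(self, obs):
        base = self.trunk(obs)
        if self.mode == "trunk_only":
            return base  # adapter never runs
        proposal = self.adapter(obs, base)
        if self.mode == "adapter_noop" or proposal is None:
            # adapter runs (same checkpoint as active) but cannot alter the action
            return base
        return proposal  # adapter_active
```

The point of the `adapter_noop` mode is that it is a fairness control: the adapter is loaded and executed, but its output is discarded, so any score difference versus `adapter_active` isolates the effect of the active branch.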
Public Benchmark Summary
Detailed per-run results are in PUBLIC_BENCHMARK_RESULTS.md. The short version is:
1. Dense occluded retrieval proxy
Benchmark:
- ManiSkill
PickClutterYCB-v1
Best current held-out result:
- directory: `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/`
- summary:
  - `trunk_only_ft = 0.04`
  - `adapter_noop = 0.04`
  - `adapter_active_ft = 0.62`
  - `delta_active_vs_trunk = +0.58`
  - 95% CI = `[0.44, 0.72]`
  - `intervention_rate = 1.0`
  - `non_base_selection_rate = 1.0`
Important caveat:
- this was not a new retrain after `smoke_v5`
- it used the same `smoke_v5` checkpoints with planner hyperparameters selected on the frozen validation split and then applied once to the untouched held-out split
2. Bag retrieval proxy
Benchmark:
- public ManiSkill bridge basket retrieval proxy
Current fair read:
- seed 17 corrected held-out: `trunk = 0.32`, `noop = 0.00`, `active = 0.48`
- seed 23 corrected held-out: `trunk = 0.48`, `noop = 0.08`, `active = 0.48`
- corrected 2-seed aggregate: `trunk = 0.40`, `noop = 0.04`, `active = 0.48`, `delta = +0.08`
Interpretation:
- bag remains modestly positive after using one consistent corrected planner across seeds
- the effect is smaller and less clean than the best occlusion result
3. Cloth retrieval proxy
Benchmark:
- public ManiSkill bridge cloth retrieval proxy
Current read:
- seed 17: `trunk = 0.04`, `noop = 0.04`, `active = 0.10`
- seed 23: `trunk = 0.04`, `noop = 0.02`, `active = 0.02`
- seed 29: `trunk = 0.04`, `noop = 0.04`, `active = 0.04`
- 3-seed aggregate: `trunk = 0.0400`, `noop = 0.0333`, `active = 0.0533`, `delta = +0.0133`
Interpretation:
- cloth is weak and unstable
- current evidence does not support a strong cloth-specific win
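The 3-seed cloth aggregate reported above is just the per-seed mean of each variant's success rate, with the delta taken between `active` and `trunk`. A quick arithmetic check:

```python
# Per-seed cloth success rates as reported above (seeds 17, 23, 29).
cloth = {
    "trunk":  [0.04, 0.04, 0.04],
    "noop":   [0.04, 0.02, 0.04],
    "active": [0.10, 0.02, 0.04],
}

# 3-seed aggregate = per-variant mean; delta = active mean - trunk mean.
agg = {name: sum(rates) / len(rates) for name, rates in cloth.items()}
delta = agg["active"] - agg["trunk"]

print({name: round(v, 4) for name, v in agg.items()}, round(delta, 4))
# trunk = 0.04, noop ≈ 0.0333, active ≈ 0.0533, delta ≈ +0.0133
```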
Important Fairness Notes
The fairness story is mixed and should be stated plainly.
What is fair in the strongest public benchmark result:
- same initialization checkpoint for `trunk_only_ft` and `adapter_active_ft`
- same train/val/test split within each task
- same optimizer, LR, batch size, and unfreeze scope within each task
- `adapter_noop` is evaluated from the same adapter checkpoint as `adapter_active_ft`
- the held-out test episodes were not hand-picked after seeing outcomes
What is not fully paper-clean yet:
- most current public benchmark evidence is smoke-scale and low-seed
- the occlusion headline result depends on validation-selected planner tuning on top of a fixed checkpoint
- bag required eval-side planner correction for one seed to avoid a collapse
- cloth remains weak even after additional seeds and val sweeps
PickClutter Split Fairness
The important point for the dense-occlusion track is that the dataset split did not drift across the early smoke versions.
- `data/maniskill_pickclutter/smoke_v1/episode_splits.json`
- `data/maniskill_pickclutter/smoke_v2/episode_splits.json`
- `data/maniskill_pickclutter/smoke_v3/episode_splits.json`
These files contain the same episode ids:
- train: `170000..170031`
- val: `171000..171007`
- eval: `172000..172049`
Also:
- there is no `data/maniskill_pickclutter/smoke_v4/`
- there is no `data/maniskill_pickclutter/smoke_v5/`
smoke_v4 and smoke_v5 were code/report version labels, not new held-out episode bundles.
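The no-drift claim above is mechanically checkable: load each `episode_splits.json` and verify the contents agree exactly. A minimal sketch (it assumes each file is a single JSON document, e.g. split names mapping to episode-id lists; the exact schema is not reproduced here):

```python
import json

def assert_identical_splits(paths):
    """Load each episode_splits.json and verify they are byte-for-byte
    equivalent as parsed JSON; raise on any drift."""
    loaded = []
    for path in paths:
        with open(path) as f:
            loaded.append((path, json.load(f)))
    first_path, first = loaded[0]
    for path, splits in loaded[1:]:
        if splits != first:
            raise ValueError(f"split drift: {path} differs from {first_path}")
    return first

# Usage against the files listed above:
# assert_identical_splits([
#     f"data/maniskill_pickclutter/smoke_v{v}/episode_splits.json"
#     for v in (1, 2, 3)
# ])
```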
What Changed Across PickClutter Versions
The big changes across smoke_v2, smoke_v3, smoke_v4, and smoke_v5 were:
- more benchmark-derived state supervision
- transition-model training enablement
- planner bug fixes
- fairness fixes so the adapter checkpoint did not hide a stronger shared trunk
- then a frozen-validation planner sweep for the final held-out eval
The big occlusion win was not caused by changing the eval episodes.
Dense-Occlusion Render Artifacts
The final dense-occlusion run also has a full visual export in:
reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/
Those gifs show the robot interacting with the 3D scene and overlay the adapter state per frame. For `adapter_active_ft`, the overlay includes:
- adapter on/off state
- whether a non-base proposal was selected
- candidate index
- planner name
- planner score/confidence
- state signals such as visibility, access, gap, and damage
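The per-frame fields listed above could be captured in a record like the following. This is a hypothetical sketch of the shape only; the renderer's actual data structure and field names are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FrameOverlay:
    """Hypothetical per-frame overlay record for the adapter_active_ft gifs."""
    adapter_on: bool                 # adapter on/off state
    non_base_selected: bool          # whether a non-base proposal was selected
    candidate_index: Optional[int]   # index of the selected candidate, if any
    planner_name: str                # which planner produced the choice
    planner_score: float             # planner score/confidence
    # state signals such as visibility, access, gap, and damage
    state: Dict[str, float] = field(default_factory=dict)
```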
Crucial Caveats
Occlusion result was planner-tuned
The large jump in:
reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/
came from validation-selected planner tuning on top of the same smoke_v5 checkpoint.
The selected override values were:
- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`
That was a validation-only selection step. It was not a fresh retrain.
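The "applied once" semantics can be written down as a plain config override. The values below are the ones reported above; how the actual planner consumes them, and the `apply_override` helper, are illustrative assumptions.

```python
# Validation-selected softer_pref planner override (values as reported above).
SOFTER_PREF_OVERRIDE = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "premature_insert_penalty": 0.25,
    "premature_maintain_penalty": 1.0,
    "occlusion_maintain_gap_min_access": 0.30,
    "occlusion_maintain_gap_min_visibility": 0.20,
    "retrieve_stage_access_threshold": 0.18,
    "retrieve_stage_reveal_threshold": 0.18,
    "retrieve_stage_support_threshold": 0.18,
}

def apply_override(planner_cfg: dict, override: dict) -> dict:
    """Return a copy of the planner config with the override applied once;
    the base config is left untouched (no retraining is involved)."""
    merged = dict(planner_cfg)
    merged.update(override)
    return merged
```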
Bag and cloth did not use real depth
The bridge-task runner for the bag and cloth proxies used:
- one real RGB camera
- copied into all camera slots
- zero-filled depth channels
The runner labels this stack `rgb_triplicate_zero_depth`.
This is a real limitation and it should not be hidden.
It happened because the bridge proxy runner used a compatibility shim to satisfy the shared multi-camera tensor interface without plumbing real bridge-scene multiview depth through the stack.
Consequences:
- bag and cloth are not modality-matched to the PickClutter runs
- PickClutter used real `rgbd_3cam`
- bag and cloth used weaker perception input
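The shim described above amounts to the following. This is a sketch under stated assumptions (the function name mirrors the runner's label, but the actual shim code, tensor layout, and depth-channel count are not reproduced here):

```python
import numpy as np

def rgb_triplicate_zero_depth(rgb, num_cams=3):
    """Illustrative compatibility shim: copy one real RGB frame into every
    camera slot of a multi-camera interface and zero-fill the depth channel.

    Assumed layout: rgb is (H, W, 3); returns rgbs of shape
    (num_cams, H, W, 3) and depth of shape (num_cams, H, W, 1)."""
    rgb = np.asarray(rgb)
    rgbs = np.repeat(rgb[None], num_cams, axis=0)   # same frame in every slot
    depth = np.zeros(rgbs.shape[:3] + (1,), dtype=np.float32)  # no real depth
    return rgbs, depth
```

Written this way, the limitation is obvious: every "camera" sees the same viewpoint, and the depth channel carries no information at all, which is why bag and cloth are not modality-matched to the PickClutter `rgbd_3cam` runs.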
Bag and cloth also used a different control wrapper
PickClutter:
- observation stack: `rgbd_3cam`
- action space: `bimanual_delta_pose`
Bag and cloth:
- observation stack: `rgb_triplicate_zero_depth`
- action space: `widowx_delta_pose`
So the cross-track story is architecture-consistent but not fully input/control-identical.
smoke_v4_evalprobe_fromv3 is not a clean retrain result
This run:
reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/
used corrected planner logic on top of smoke_v3 weights. It is useful evidence that the active adapter can matter, but it is not a clean end-to-end retrain.
What Was Actually Learned
The current repo supports the following claims:
- the structured adapter is still alive
- the active branch can clearly matter on a real public dense-occlusion benchmark proxy
- `adapter_noop` remains a useful fairness control
- bag-like retrieval still shows modest positive evidence
- cloth-like retrieval is currently the weak link
It does not support the following stronger claims yet:
- broad superiority on realistic manipulation benchmarks
- stable multi-seed wins across all three target-like public proxy tracks
- a clean modality-matched comparison across occlusion, bag, and cloth
Environment And Setup
Two environment stories exist in this repo.
Prior VLAarchtests3 / RLBench stack
Preserved under:
- `setup/ENVIRONMENT.md`
- `setup/env_vars.sh`
- `setup/rlbench_pip_freeze.txt`
This is the older RLBench / AnyBimanual oriented environment.
Current public benchmark stack
Preserved under:
- `setup/public_benchmark/ENVIRONMENT.md`
- `setup/public_benchmark/env_vars.sh`
- `setup/public_benchmark/python_version.txt`
- `setup/public_benchmark/uname.txt`
- `setup/public_benchmark/nvidia_smi.txt`
- `setup/public_benchmark/gpu_short.txt`
- `setup/public_benchmark/pip_freeze_python311.txt`
- `setup/public_benchmark/rlbench_env_pip_freeze.txt`
- `setup/public_benchmark/hf_env.txt`
The public benchmark runs in this session were assembled on:
- GPU: `NVIDIA L40S`
- VRAM: `46068 MiB`
- driver: `580.126.09`
- Python: `3.11.10`
- kernel: `Linux 6.8.0-88-generic`
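A snapshot like the one under `setup/public_benchmark/` can be regenerated on a similar machine with a few standard commands. This is a sketch, not the exact commands used for the original snapshot; adjust interpreter names and paths to the target box.

```shell
# Regenerate a minimal environment snapshot (illustrative only).
out=setup/public_benchmark
mkdir -p "$out"
python3 --version       > "$out/python_version.txt" 2>&1
uname -a                > "$out/uname.txt"
python3 -m pip freeze   > "$out/pip_freeze_python311.txt" 2>/dev/null || true
# GPU details only when the NVIDIA driver is present
command -v nvidia-smi > /dev/null && nvidia-smi > "$out/nvidia_smi.txt" || true
```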
Recommended Starting Points
If you want the strongest current public benchmark evidence, start here:
- `docs/maniskill_pickclutter_correction_log_2026-04-01.md`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
If you want the bag/cloth public bridge follow-up, start here:
- `docs/public_bridge_smoke_run_log_2026-04-01.md`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`
If you want the repo lineage context, start here:
- `history/VLAarchtests_previous_README.md`
- `history/VLAarchtests2_previous_README.md`
- `history/VLAarchtests3_previous_README.md`
Bottom Line
This repo is the complete organization package for the current workspace state.
It includes:
- the `VLAarchtests3` export base
- the full current machine `reports/`, `outputs/`, and `data/` trees
- the public benchmark code, datasets, checkpoints, and results
- the environment files needed to stand up the same stack on similar hardware
Use it as the archival handoff state for continuing the elastic-occlusion adapter work.
Do not cite it as if all three target-like public proxy tracks are already cleanly solved. The occlusion track is the strongest current evidence; bag is modest; cloth remains weak; and the bridge-task perception stack still needs a proper real-depth rewrite.