---
tags:
  - robotics
  - vision-language-action
  - bimanual-manipulation
  - maniskill
  - rlbench
  - rgbd
---

VLAarchtests4

VLAarchtests4 is the new organization repo for the RunPod work staged from /workspace on 2026-04-01 UTC.

It carries forward the earlier repo lineage and adds the current public-sim benchmark package work:

  • VLAarchtests
    • early proxy + RLBench architecture search, handoff checkpoints, and environment recreation files from the 2026-03-25/26 sessions
  • VLAarchtests2
    • larger exploratory organization repo with more baselines, overlap/anchor work, frequent model changes, mixed artifacts, and several results that required later reinterpretation
  • VLAarchtests3
    • cleaned export focused on the elastic-occlusion trunk + structured adapter + no-op fallback refactor, validated tests, current checkpoints, and handoff docs
  • VLAarchtests4
    • keeps the VLAarchtests3 export intact and adds the full current workspace reports/, outputs/, and data/ trees, including all public benchmark smoke runs, checkpoint directories, dataset bundles, validation sweeps, and environment snapshots from the public-sim evaluation pass

What This Repo Adds

The main new addition in this repo is the public benchmark track work for the elastic-occlusion adapter:

  • real public-sim smoke runs on:
    • ManiSkill PickClutterYCB-v1 as the dense occluded retrieval proxy
    • the ManiSkill bridge basket retrieval task as the bag retrieval proxy
    • the ManiSkill bridge cloth retrieval task as the folded-cloth retrieval proxy
  • the public benchmark package code and summaries
  • the train/eval logs, checkpoints, cached datasets, validation sweeps, and correction logs for those runs
  • full visual rerenders of the final smoke_v5_eval_tuned_softerpref dense-occlusion benchmark for both trunk_only_ft and adapter_active_ft
  • the same-machine environment snapshot for the public benchmark stack used on this RunPod

Top-Level Contents

  • code/
    • the cleaned code snapshot inherited from VLAarchtests3
  • artifacts/
    • prior staged checkpoints, proxy data, reports, and generated configs already bundled by VLAarchtests3
  • docs/
    • prior handoff/audit docs plus the current public benchmark run logs and correction notes
  • legacy/
    • older exact artifacts preserved by VLAarchtests3
  • setup/
    • prior environment files plus a new public benchmark environment snapshot under setup/public_benchmark/
  • history/
    • copied README history for VLAarchtests, VLAarchtests2, and VLAarchtests3
  • reports/
    • the full current /workspace/workspace/reports tree from this machine
  • outputs/
    • the full current /workspace/workspace/outputs tree from this machine
  • data/
    • the full current /workspace/workspace/data tree from this machine
  • PUBLIC_BENCHMARK_RESULTS.md
    • compact index of all public benchmark train/eval results from this session
  • MODEL_AND_ARTIFACT_INDEX.md
    • practical map of the main artifact roots to start from

Benchmark GIF Renders

The repo now also includes a full rendered replay of the final dense-occlusion benchmark:

  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/
    • 50 held-out trunk_only_ft gifs
    • 50 held-out adapter_active_ft gifs
    • index.html, INDEX.md, and manifest.json for browsing and validation
  • renderer:
    • code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/render_maniskill_pickclutter_benchmark_gifs.py

Important caveats:

  • these gifs are rerendered from the saved smoke_v5_eval_tuned_softerpref checkpoints and exact held-out seeds, not a different benchmark run
  • the rerender kept the same softer_pref planner override used in the reported held-out result
  • the rerender manifest records 0 success mismatches versus the saved benchmark json files
  • only the dense-occlusion track has this full gif export right now
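
The zero-mismatch claim is exactly the kind of thing the manifest makes checkable. Below is a sketch of the comparison; the episode and field names are assumptions for illustration, not the manifest's real schema:

```python
import json

def count_success_mismatches(manifest_json, benchmark_json):
    """Count episodes whose rerendered success flag disagrees with the
    saved benchmark result, matched on episode id."""
    saved = {e["episode_id"]: e["success"]
             for e in json.loads(benchmark_json)["episodes"]}
    rerender = json.loads(manifest_json)["episodes"]
    return sum(1 for e in rerender if saved[e["episode_id"]] != e["success"])

# Tiny synthetic example: rerender agrees with the saved benchmark.
saved = json.dumps({"episodes": [{"episode_id": 172000, "success": True},
                                 {"episode_id": 172001, "success": False}]})
manifest = json.dumps({"episodes": [{"episode_id": 172000, "success": True},
                                    {"episode_id": 172001, "success": False}]})
mismatches = count_success_mismatches(manifest, saved)
# mismatches == 0
```

The real check would load manifest.json from the gif directory and the saved benchmark json files recorded for smoke_v5_eval_tuned_softerpref.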

Architecture State Carried Forward

The core model family inherited from VLAarchtests3 is still:

  • trunk_only
  • adapter_noop
  • adapter_active

The important architectural state carried into the public benchmark work is:

  • wrapped-policy interface with exact trunk_only, adapter_noop, and adapter_active modes
  • structured reveal/retrieve adapter with:
    • state prediction
    • task-routed proposal families
    • retrieve-feasibility gating
    • lightweight transition model
    • planner/reranker
  • planner fixes that replaced hard vetoes with softer stage penalties in:
    • code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py
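
The veto-to-penalty change can be pictured with a minimal scoring sketch. The function name, stage labels, and numbers below are illustrative only, not the actual planner.py API:

```python
def score_proposal(base_score, stage, access, threshold=0.18, penalty=0.5):
    """Score a candidate proposal with a soft stage penalty.

    A hard veto would return -inf whenever a retrieve proposal fired
    below the access threshold, discarding it outright. The softer
    version subtracts a fixed cost instead, so a strong proposal can
    still outrank the alternatives.
    """
    if stage == "retrieve" and access < threshold:
        return base_score - penalty  # penalized, not vetoed
    return base_score
```

Under a hard veto a low-access retrieve proposal can never win; under the soft penalty it merely has to beat the other candidates by a larger margin.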

Public Benchmark Summary

Detailed per-run results are in PUBLIC_BENCHMARK_RESULTS.md. The short version is:

1. Dense occluded retrieval proxy

Benchmark:

  • ManiSkill PickClutterYCB-v1

Best current held-out result:

  • directory:
    • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/
  • summary:
    • trunk_only_ft = 0.04
    • adapter_noop = 0.04
    • adapter_active_ft = 0.62
    • delta_active_vs_trunk = +0.58
    • 95% CI = [0.44, 0.72]
    • intervention_rate = 1.0
    • non_base_selection_rate = 1.0
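
Assuming the reported interval is a paired CI on delta_active_vs_trunk over the 50 held-out episodes (the exact CI method used in the reports is not restated here), a percentile-bootstrap sketch looks like this:

```python
import random

def bootstrap_delta_ci(trunk, active, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(active) - mean(trunk),
    resampling episodes with replacement."""
    rng = random.Random(seed)
    n = len(trunk)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(active[i] - trunk[i] for i in idx) / n)
    deltas.sort()
    return (deltas[int(n_boot * alpha / 2)],
            deltas[int(n_boot * (1 - alpha / 2)) - 1])

# Synthetic per-episode successes matching the reported rates:
# trunk 2/50 = 0.04, active 31/50 = 0.62.
trunk = [1] * 2 + [0] * 48
active = [1] * 31 + [0] * 19
lo, hi = bootstrap_delta_ci(trunk, active)
```

With independent synthetic episodes this lands near the reported [0.44, 0.72]; the real interval also depends on how successes co-occur per episode.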

Important caveat:

  • this was not a new retrain after smoke_v5
  • it used the same smoke_v5 checkpoints with planner hyperparameters selected on the frozen validation split and then applied once to the untouched held-out split

2. Bag retrieval proxy

Benchmark:

  • public ManiSkill bridge basket retrieval proxy

Current fair read:

  • seed 17 corrected held-out:
    • trunk = 0.32
    • noop = 0.00
    • active = 0.48
  • seed 23 corrected held-out:
    • trunk = 0.48
    • noop = 0.08
    • active = 0.48
  • corrected 2-seed aggregate:
    • trunk = 0.40
    • noop = 0.04
    • active = 0.48
    • delta = +0.08

Interpretation:

  • bag remains modestly positive after using one consistent corrected planner across seeds
  • the effect is smaller and less clean than the best occlusion result

3. Cloth retrieval proxy

Benchmark:

  • public ManiSkill bridge cloth retrieval proxy

Current read:

  • seed 17:
    • trunk = 0.04
    • noop = 0.04
    • active = 0.10
  • seed 23:
    • trunk = 0.04
    • noop = 0.02
    • active = 0.02
  • seed 29:
    • trunk = 0.04
    • noop = 0.04
    • active = 0.04
  • 3-seed aggregate:
    • trunk = 0.0400
    • noop = 0.0333
    • active = 0.0533
    • delta = +0.0133
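
The 3-seed aggregate above is a plain unweighted mean of the per-seed rates; a minimal sketch (the mode keys are shorthand, not the report schema):

```python
def aggregate_seeds(per_seed):
    """Unweighted mean of per-seed success rates for each mode."""
    n = len(per_seed)
    return {mode: round(sum(s[mode] for s in per_seed) / n, 4)
            for mode in per_seed[0]}

cloth = [
    {"trunk": 0.04, "noop": 0.04, "active": 0.10},  # seed 17
    {"trunk": 0.04, "noop": 0.02, "active": 0.02},  # seed 23
    {"trunk": 0.04, "noop": 0.04, "active": 0.04},  # seed 29
]
agg = aggregate_seeds(cloth)
# agg == {"trunk": 0.04, "noop": 0.0333, "active": 0.0533}
```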

Interpretation:

  • cloth is weak and unstable
  • current evidence does not support a strong cloth-specific win

Important Fairness Notes

The fairness story is mixed and should be stated plainly.

What is fair in the strongest public benchmark result:

  • same initialization checkpoint for trunk_only_ft and adapter_active_ft
  • same train/val/test split within each task
  • same optimizer, LR, batch size, and unfreeze scope within each task
  • adapter_noop is evaluated from the same adapter checkpoint as adapter_active_ft
  • the held-out test episodes were not hand-picked after seeing outcomes

What is not fully paper-clean yet:

  • most current public benchmark evidence is smoke-scale and low-seed
  • the occlusion headline result depends on validation-selected planner tuning on top of a fixed checkpoint
  • bag required eval-side planner correction for one seed to avoid a collapse
  • cloth remains weak even after additional seeds and val sweeps

PickClutter Split Fairness

The important point for the dense-occlusion track is that the dataset split did not drift across the early smoke versions.

  • data/maniskill_pickclutter/smoke_v1/episode_splits.json
  • data/maniskill_pickclutter/smoke_v2/episode_splits.json
  • data/maniskill_pickclutter/smoke_v3/episode_splits.json

These files contain the same episode ids:

  • train: 170000..170031
  • val: 171000..171007
  • eval: 172000..172049

Also:

  • there is no data/maniskill_pickclutter/smoke_v4/
  • there is no data/maniskill_pickclutter/smoke_v5/

smoke_v4 and smoke_v5 were code/report version labels, not new held-out episode bundles.
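
This no-drift claim is mechanically checkable. A self-contained sketch follows; in practice the three payloads would be read from the data/maniskill_pickclutter/*/episode_splits.json files, and the split key names are assumptions:

```python
import json

def splits_identical(payloads):
    """True if every episode_splits.json payload carries the same
    episode ids per split, ignoring ordering."""
    norm = [{k: sorted(v) for k, v in json.loads(p).items()}
            for p in payloads]
    return all(n == norm[0] for n in norm[1:])

# The episode ids documented for smoke_v1..v3:
payload = json.dumps({
    "train": list(range(170000, 170032)),  # 170000..170031
    "val":   list(range(171000, 171008)),  # 171000..171007
    "eval":  list(range(172000, 172050)),  # 172000..172049
})
assert splits_identical([payload, payload, payload])
```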

What Changed Across PickClutter Versions

The big changes across smoke_v2, smoke_v3, smoke_v4, and smoke_v5 were:

  • more benchmark-derived state supervision
  • transition-model training enablement
  • planner bug fixes
  • fairness fixes so the adapter checkpoint did not hide a stronger shared trunk
  • then a frozen-validation planner sweep for the final held-out eval

The big occlusion win was not caused by changing the eval episodes.

Dense-Occlusion Render Artifacts

The final dense-occlusion run also has a full visual export in:

  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/

Those gifs show the robot interacting with the 3D scene and overlay the adapter state per frame. For adapter_active_ft, the overlay includes:

  • adapter on/off state
  • whether a non-base proposal was selected
  • candidate index
  • planner name
  • planner score/confidence
  • state signals such as visibility, access, gap, and damage
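
A per-frame overlay line of that sort can be sketched as follows; the field names are illustrative, and the real schema lives in the render script and manifest.json:

```python
def overlay_line(state):
    """Format one frame's adapter state as the text drawn on the gif."""
    return (
        f"adapter={'on' if state['adapter_on'] else 'off'} "
        f"nonbase={'y' if state['non_base_selected'] else 'n'} "
        f"cand={state['candidate_index']} "
        f"planner={state['planner_name']} "
        f"score={state['score']:.2f} "
        f"vis={state['visibility']:.2f} acc={state['access']:.2f} "
        f"gap={state['gap']:.2f} dmg={state['damage']:.2f}"
    )

frame = {"adapter_on": True, "non_base_selected": True, "candidate_index": 3,
         "planner_name": "softer_pref", "score": 0.81, "visibility": 0.42,
         "access": 0.35, "gap": 0.12, "damage": 0.02}
text = overlay_line(frame)
```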

Crucial Caveats

Occlusion result was planner-tuned

The large jump in:

  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/

came from validation-selected planner tuning on top of the same smoke_v5 checkpoint.

The selected override values were:

  • mode_preference_bonus = 0.75
  • premature_retrieve_penalty = 0.5
  • premature_insert_penalty = 0.25
  • premature_maintain_penalty = 1.0
  • occlusion_maintain_gap_min_access = 0.30
  • occlusion_maintain_gap_min_visibility = 0.20
  • retrieve_stage_access_threshold = 0.18
  • retrieve_stage_reveal_threshold = 0.18
  • retrieve_stage_support_threshold = 0.18

That was a validation-only selection step. It was not a fresh retrain.
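
Applied-once override values like these are typically just a dict merged into the planner config for the held-out pass. A hypothetical sketch, in which the dataclass and its defaults are placeholders rather than the shipped planner config:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PlannerConfig:
    # Placeholder defaults, not the shipped ones.
    mode_preference_bonus: float = 0.5
    premature_retrieve_penalty: float = 1.0
    retrieve_stage_access_threshold: float = 0.25

# Selected on the frozen validation split, applied once at held-out eval.
SELECTED = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "retrieve_stage_access_threshold": 0.18,
}

eval_cfg = replace(PlannerConfig(), **SELECTED)
```

The point of the frozen dataclass plus `replace` is that the training-time config stays untouched and the eval override is a single, auditable step.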

Bag and cloth did not use real depth

The bridge-task runner for the bag and cloth proxies used:

  • one real RGB camera
  • copied into all camera slots
  • zero-filled depth channels

The runner labels this stack:

  • rgb_triplicate_zero_depth

This is a real limitation and it should not be hidden.

It happened because the bridge proxy runner used a compatibility shim to satisfy the shared multi-camera tensor interface without plumbing real bridge-scene multiview depth through the stack.
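
A numpy sketch of what such a shim amounts to (shapes and layout here are assumptions, not the runner's actual tensor interface):

```python
import numpy as np

def rgb_triplicate_zero_depth(rgb, num_cams=3):
    """Build a multi-camera RGB-D stack from a single real RGB frame.

    The one real image is copied into every camera slot and the depth
    channel is left all-zero, which satisfies a shared (cams, H, W, 4)
    interface without supplying real multiview or depth information.
    """
    h, w, _ = rgb.shape
    obs = np.zeros((num_cams, h, w, 4), dtype=np.float32)
    obs[..., :3] = rgb[None]  # identical RGB in every camera slot
    return obs                # obs[..., 3] stays zero: no real depth

obs = rgb_triplicate_zero_depth(np.ones((64, 64, 3), dtype=np.float32))
```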

Consequences:

  • bag and cloth are not modality-matched to the PickClutter runs
  • PickClutter used real rgbd_3cam
  • bag and cloth used weaker perception input

Bag and cloth also used a different control wrapper

PickClutter:

  • observation stack: rgbd_3cam
  • action space: bimanual_delta_pose

Bag and cloth:

  • observation stack: rgb_triplicate_zero_depth
  • action space: widowx_delta_pose

So the cross-track story is architecture-consistent but not fully input/control-identical.

smoke_v4_evalprobe_fromv3 is not a clean retrain result

This run:

  • reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/

used corrected planner logic on top of smoke_v3 weights. It is useful evidence that the active adapter can matter, but it is not a clean end-to-end retrain.

What Was Actually Learned

The current repo supports the following claims:

  • the structured adapter is still alive
  • the active branch can clearly matter on a real public dense-occlusion benchmark proxy
  • adapter_noop remains a useful fairness control
  • bag-like retrieval still shows modest positive evidence
  • cloth-like retrieval is currently the weak link

It does not support the following stronger claims yet:

  • broad superiority on realistic manipulation benchmarks
  • stable multi-seed wins across all three target-like public proxy tracks
  • a clean modality-matched comparison across occlusion, bag, and cloth

Environment And Setup

Two environment stories exist in this repo.

Prior VLAarchtests3 / RLBench stack

Preserved under:

  • setup/ENVIRONMENT.md
  • setup/env_vars.sh
  • setup/rlbench_pip_freeze.txt

This is the older RLBench / AnyBimanual oriented environment.

Current public benchmark stack

Preserved under:

  • setup/public_benchmark/ENVIRONMENT.md
  • setup/public_benchmark/env_vars.sh
  • setup/public_benchmark/python_version.txt
  • setup/public_benchmark/uname.txt
  • setup/public_benchmark/nvidia_smi.txt
  • setup/public_benchmark/gpu_short.txt
  • setup/public_benchmark/pip_freeze_python311.txt
  • setup/public_benchmark/rlbench_env_pip_freeze.txt
  • setup/public_benchmark/hf_env.txt

The public benchmark runs in this session were produced on:

  • GPU: NVIDIA L40S
  • VRAM: 46068 MiB
  • driver: 580.126.09
  • Python: 3.11.10
  • kernel: Linux 6.8.0-88-generic

Recommended Starting Points

If you want the strongest current public benchmark evidence, start here:

  • docs/maniskill_pickclutter_correction_log_2026-04-01.md
  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json

If you want the bag/cloth public bridge follow-up, start here:

  • docs/public_bridge_smoke_run_log_2026-04-01.md
  • reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json
  • reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json

If you want the repo lineage context, start here:

  • history/VLAarchtests_previous_README.md
  • history/VLAarchtests2_previous_README.md
  • history/VLAarchtests3_previous_README.md

Bottom Line

This repo is the complete organization package for the current workspace state.

It includes:

  • the VLAarchtests3 export base
  • the full current machine reports/, outputs/, and data/ trees
  • the public benchmark code, datasets, checkpoints, and results
  • the environment files needed to stand up the same stack on similar hardware

Use it as the archival handoff state for continuing the elastic-occlusion adapter work.

Do not cite it as if all three target-like public proxy tracks are already cleanly solved. The occlusion track is the strongest current evidence; bag is modest; cloth remains weak; and the bridge-task perception stack still needs a proper real-depth rewrite.