	---
	tags:
	- robotics
	- vision-language-action
	- bimanual-manipulation
	- maniskill
	- rlbench
	- rgbd
	---

	# VLAarchtests4

	`VLAarchtests4` is the fresh organization repo for the RunPod work staged from `/workspace` on `2026-04-01 UTC`.

	It carries forward the earlier repo lineage and adds the current public-sim benchmark package work:

	- `VLAarchtests`
	  - early proxy + RLBench architecture search, handoff checkpoints, and environment recreation files from the `2026-03-25/26` sessions
	- `VLAarchtests2`
	  - larger exploratory organization repo with more baselines, overlap/anchor work, frequent model changes, mixed artifacts, and several results that required later reinterpretation
	- `VLAarchtests3`
	  - cleaned export focused on the elastic-occlusion `trunk + structured adapter + no-op fallback` refactor, validated tests, current checkpoints, and handoff docs
	- `VLAarchtests4`
	  - keeps the `VLAarchtests3` export intact and adds the full current workspace `reports/`, `outputs/`, and `data/` trees, including all public benchmark smoke runs, checkpoint directories, dataset bundles, validation sweeps, and environment snapshots from the public-sim evaluation pass

	## What This Repo Adds

	The main addition in this repo is the public benchmark track work for the elastic-occlusion adapter:

	- real public-sim smoke runs on:
	  - ManiSkill `PickClutterYCB-v1` as the dense occluded retrieval proxy
	  - ManiSkill bridge basket retrieval proxy as the bag retrieval proxy
	  - ManiSkill bridge cloth retrieval proxy as the folded-cloth retrieval proxy
	- the public benchmark package code and summaries
	- the train/eval logs, checkpoints, cached datasets, validation sweeps, and correction logs for those runs
	- full visual rerenders of the final `smoke_v5_eval_tuned_softerpref` dense-occlusion benchmark for both `trunk_only_ft` and `adapter_active_ft`
	- the same-machine environment snapshot for the public benchmark stack used on this RunPod

	## Top-Level Contents

	- `code/`
	  - the cleaned code snapshot inherited from `VLAarchtests3`
	- `artifacts/`
	  - prior staged checkpoints, proxy data, reports, and generated configs already bundled by `VLAarchtests3`
	- `docs/`
	  - prior handoff/audit docs plus the current public benchmark run logs and correction notes
	- `legacy/`
	  - older exact artifacts preserved by `VLAarchtests3`
	- `setup/`
	  - prior environment files plus a new public benchmark environment snapshot under `setup/public_benchmark/`
	- `history/`
	  - copied README history for `VLAarchtests`, `VLAarchtests2`, and `VLAarchtests3`
	- `reports/`
	  - the full current `/workspace/workspace/reports` tree from this machine
	- `outputs/`
	  - the full current `/workspace/workspace/outputs` tree from this machine
	- `data/`
	  - the full current `/workspace/workspace/data` tree from this machine
	- `PUBLIC_BENCHMARK_RESULTS.md`
	  - compact index of all public benchmark train/eval results from this session
	- `MODEL_AND_ARTIFACT_INDEX.md`
	  - practical map of the main artifact roots to start from

	## Benchmark GIF Renders

	The repo now also includes a full rendered replay of the final dense-occlusion benchmark:

	- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
	  - `50` held-out `trunk_only_ft` GIFs
	  - `50` held-out `adapter_active_ft` GIFs
	  - `index.html`, `INDEX.md`, and `manifest.json` for browsing and validation
	- renderer:
	  - `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/render_maniskill_pickclutter_benchmark_gifs.py`

	Important caveats:

	- these GIFs are rerendered from the saved `smoke_v5_eval_tuned_softerpref` checkpoints with the exact held-out seeds; they do not come from a different benchmark run
	- the rerender kept the same `softer_pref` planner override used in the reported held-out result
	- the rerender manifest records `0` success mismatches versus the saved benchmark JSON files
	- only the dense-occlusion track has this full GIF export right now
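The mismatch check described above can be sketched as follows. This is a minimal illustration, not the renderer's actual validation code: the schemas of `manifest.json` and the saved benchmark JSON files are not shown in this README, so the keys `episode_id` and `success` and the helper name `count_success_mismatches` are assumptions.

```python
def count_success_mismatches(manifest_entries, benchmark_entries):
    """List episodes whose rerendered success flag differs from the saved one.

    Both arguments are lists of dicts with the hypothetical keys
    "episode_id" and "success"; the real JSON schema may differ.
    """
    saved = {e["episode_id"]: bool(e["success"]) for e in benchmark_entries}
    mismatches = []
    for entry in manifest_entries:
        episode_id = entry["episode_id"]
        if episode_id not in saved:
            raise KeyError(f"episode {episode_id} missing from benchmark results")
        if bool(entry["success"]) != saved[episode_id]:
            mismatches.append(episode_id)
    return mismatches


# Toy data standing in for manifest.json and one saved benchmark JSON file.
benchmark = [{"episode_id": 172000 + i, "success": i % 2 == 0} for i in range(50)]
manifest = [dict(e) for e in benchmark]
print(len(count_success_mismatches(manifest, benchmark)))
```

A rerender that reproduces the saved run should return an empty list for each of the two modes.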

	## Architecture State Carried Forward

	The core model family inherited from `VLAarchtests3` is still:

	- `trunk_only`
	- `adapter_noop`
	- `adapter_active`
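One way to picture how the three modes differ at action-selection time is the sketch below. It is an assumption-laden illustration rather than the repo's wrapped-policy code: the `Mode` enum, `select_action`, and the convention that candidate index `0` is the trunk's base action are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    TRUNK_ONLY = "trunk_only"
    ADAPTER_NOOP = "adapter_noop"
    ADAPTER_ACTIVE = "adapter_active"


def select_action(mode, trunk_action, proposals, score_fn):
    """Return (action, used_non_base) for one control step.

    trunk_action: base action produced by the trunk policy.
    proposals: adapter candidate actions (hypothetical structure).
    score_fn: planner score for a candidate, higher is better.
    """
    if mode is Mode.TRUNK_ONLY:
        # The adapter is bypassed entirely.
        return trunk_action, False
    if mode is Mode.ADAPTER_NOOP:
        # The adapter checkpoint is loaded, but the no-op fallback
        # always keeps the trunk's base action.
        return trunk_action, False
    # ADAPTER_ACTIVE: the planner may replace the base action.
    candidates = [trunk_action] + list(proposals)
    best_index = max(range(len(candidates)), key=lambda i: score_fn(candidates[i]))
    return candidates[best_index], best_index != 0
```

Because `max` returns the first maximal index, ties in this sketch resolve in favor of the base action.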

	The important architectural state carried into the public benchmark work is:

	- wrapped-policy interface with exact `trunk_only`, `adapter_noop`, and `adapter_active` modes
	- structured reveal/retrieve adapter with:
	  - state prediction
	  - task-routed proposal families
	  - retrieve-feasibility gating
	  - lightweight transition model
	  - planner/reranker
	- planner fixes that replaced hard vetoes with softer stage penalties in:
	  - `code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py`
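The difference between a hard veto and a softer stage penalty can be illustrated with a toy reranker. This is not the logic in `planner.py`; the function names, candidate names, and scores are hypothetical, and the penalty value `0.5` is borrowed from the `premature_retrieve_penalty` override listed later in this README purely for illustration.

```python
import math


def score_hard_veto(base_score, stage_ready):
    """Assumed old-style behavior: an unready stage removes the candidate."""
    return base_score if stage_ready else -math.inf


def score_soft_penalty(base_score, stage_ready, penalty):
    """Softer behavior: an unready stage costs a fixed penalty instead."""
    return base_score if stage_ready else base_score - penalty


# Hypothetical candidates; "retrieve_early" is flagged as premature.
candidates = {"base": 0.2, "reveal": 0.5, "retrieve_early": 1.4}
ready = {"base": True, "reveal": True, "retrieve_early": False}

hard = {k: score_hard_veto(v, ready[k]) for k, v in candidates.items()}
soft = {k: score_soft_penalty(v, ready[k], penalty=0.5) for k, v in candidates.items()}

print(max(hard, key=hard.get))  # the vetoed candidate can never win
print(max(soft, key=soft.get))  # a strong enough candidate survives the penalty
```

With a soft penalty the planner can still choose a stage-flagged candidate when its score advantage exceeds the penalty, which a hard veto rules out.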

	## Public Benchmark Summary

	Detailed per-run results are in `PUBLIC_BENCHMARK_RESULTS.md`. The short version is:

	### 1. Dense occluded retrieval proxy

	Benchmark:

	- ManiSkill `PickClutterYCB-v1`

	Best current held-out result:

	- directory:
	  - `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/`
	- summary:
	  - `trunk_only_ft = 0.04`
	  - `adapter_noop = 0.04`
	  - `adapter_active_ft = 0.62`
	  - `delta_active_vs_trunk = +0.58`
	  - `95% CI = [0.44, 0.72]`
	  - `intervention_rate = 1.0`
	  - `non_base_selection_rate = 1.0`
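As a sanity check on the summary numbers, the delta and an approximate interval can be recomputed by hand. This README does not say how the reported interval was computed, and reading it as an interval on `delta_active_vs_trunk` is an assumption here. The sketch below uses a plain normal approximation for a difference of two proportions and assumes `50` held-out episodes per mode, matching the eval split size.

```python
import math


def diff_of_proportions_ci(p_a, p_b, n_a, n_b, z=1.96):
    """Normal-approximation interval for p_a - p_b (independent samples)."""
    delta = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return delta, delta - z * se, delta + z * se


delta, lo, hi = diff_of_proportions_ci(0.62, 0.04, 50, 50)
print(f"delta={delta:.2f}, CI=[{lo:.3f}, {hi:.3f}]")
```

The result lands close to, but not exactly on, the reported interval, which is consistent with the reported value coming from a different method such as a bootstrap or a paired analysis.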

	Important caveat:

	- this was not a new retrain after `smoke_v5`
	- it used the same `smoke_v5` checkpoints with planner hyperparameters selected on the frozen validation split and then applied once to the untouched held-out split
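The selection protocol in this caveat can be sketched as below: candidate planner settings are compared on the validation episodes only, and the held-out episodes are evaluated once with the already-selected setting. Everything here is a stand-in. `select_planner_config` and `toy_evaluate` are hypothetical names, the candidate values other than `0.75` are made up, and the fake evaluator replaces real rollouts.

```python
def select_planner_config(candidate_configs, evaluate, val_episodes):
    """Return the candidate with the highest validation success rate."""
    return max(candidate_configs, key=lambda cfg: evaluate(cfg, val_episodes))


def toy_evaluate(config, episodes):
    """Stand-in for a real rollout evaluator; returns a fake success rate."""
    threshold = 1.0 - abs(config["mode_preference_bonus"] - 0.75)
    passed = sum(1 for ep in episodes if (ep % 4) / 4 < threshold)
    return passed / len(episodes)


val_episodes = list(range(171000, 171008))      # frozen validation split
heldout_episodes = list(range(172000, 172050))  # untouched until the end

candidates = [{"mode_preference_bonus": b} for b in (0.25, 0.75, 1.5)]
selected = select_planner_config(candidates, toy_evaluate, val_episodes)

# The held-out split is evaluated exactly once, with the selected config.
heldout_rate = toy_evaluate(selected, heldout_episodes)
print(selected, heldout_rate)
```

The key property is that no held-out outcome feeds back into the choice of planner settings.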

	### 2. Bag retrieval proxy

	Benchmark:

	- public ManiSkill bridge basket retrieval proxy

	Current fair read:

	- seed `17` corrected held-out:
	  - `trunk = 0.32`
	  - `noop = 0.00`
	  - `active = 0.48`
	- seed `23` corrected held-out:
	  - `trunk = 0.48`
	  - `noop = 0.08`
	  - `active = 0.48`
	- corrected 2-seed aggregate:
	  - `trunk = 0.40`
	  - `noop = 0.04`
	  - `active = 0.48`
	  - `delta = +0.08`

	Interpretation:

	- bag remains modestly positive after using one consistent corrected planner across seeds
	- the effect is smaller and less clean than the best occlusion result

	### 3. Cloth retrieval proxy

	Benchmark:

	- public ManiSkill bridge cloth retrieval proxy

	Current read:

	- seed `17`:
	  - `trunk = 0.04`
	  - `noop = 0.04`
	  - `active = 0.10`
	- seed `23`:
	  - `trunk = 0.04`
	  - `noop = 0.02`
	  - `active = 0.02`
	- seed `29`:
	  - `trunk = 0.04`
	  - `noop = 0.04`
	  - `active = 0.04`
	- 3-seed aggregate:
	  - `trunk = 0.0400`
	  - `noop = 0.0333`
	  - `active = 0.0533`
	  - `delta = +0.0133`
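The bag and cloth aggregates above are plain unweighted means over seeds, which can be reproduced from the per-seed numbers. The dictionary layout and the `aggregate` helper are illustrative assumptions; only the success rates come from this README.

```python
from statistics import mean

bag = {
    17: {"trunk": 0.32, "noop": 0.00, "active": 0.48},
    23: {"trunk": 0.48, "noop": 0.08, "active": 0.48},
}
cloth = {
    17: {"trunk": 0.04, "noop": 0.04, "active": 0.10},
    23: {"trunk": 0.04, "noop": 0.02, "active": 0.02},
    29: {"trunk": 0.04, "noop": 0.04, "active": 0.04},
}


def aggregate(per_seed):
    """Average each mode over seeds and report active minus trunk."""
    out = {m: mean(r[m] for r in per_seed.values()) for m in ("trunk", "noop", "active")}
    out["delta"] = out["active"] - out["trunk"]
    return out


for name, runs in (("bag", bag), ("cloth", cloth)):
    print(name, {k: round(v, 4) for k, v in aggregate(runs).items()})
```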

	Interpretation:

	- cloth is weak and unstable
	- current evidence does not support a strong cloth-specific win

	## Important Fairness Notes

	The fairness story is mixed and should be stated plainly.

	What is fair in the strongest public benchmark result:

	- same initialization checkpoint for `trunk_only_ft` and `adapter_active_ft`
	- same train/val/test split within each task
	- same optimizer, LR, batch size, and unfreeze scope within each task
	- `adapter_noop` is evaluated from the same adapter checkpoint as `adapter_active_ft`
	- the held-out test episodes were not hand-picked after seeing outcomes

	What is not fully paper-clean yet:

	- most current public benchmark evidence is smoke-scale and low-seed
	- the occlusion headline result depends on validation-selected planner tuning on top of a fixed checkpoint
	- bag required eval-side planner correction for one seed to avoid a collapse
	- cloth remains weak even after additional seeds and val sweeps

	### PickClutter Split Fairness

	The important point for the dense-occlusion track is that the dataset split did not drift across the early smoke versions.

	- `data/maniskill_pickclutter/smoke_v1/episode_splits.json`
	- `data/maniskill_pickclutter/smoke_v2/episode_splits.json`
	- `data/maniskill_pickclutter/smoke_v3/episode_splits.json`

	These files contain the same episode ids:

	- train: `170000..170031`
	- val: `171000..171007`
	- eval: `172000..172049`
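A quick way to confirm that the split did not drift is to load the three files and compare their id lists. The sketch below assumes each `episode_splits.json` is a JSON object mapping `train`, `val`, and `eval` to lists of episode ids; that schema and the helper names are assumptions, so adjust them to the real files.

```python
import json


def load_splits(path):
    """Read one episode_splits.json file (assumed schema: split name -> id list)."""
    with open(path) as handle:
        return json.load(handle)


def expected_splits():
    """The id ranges stated in this README, inclusive on both ends."""
    return {
        "train": list(range(170000, 170032)),
        "val": list(range(171000, 171008)),
        "eval": list(range(172000, 172050)),
    }


def splits_match(split_dicts):
    """True if every dict lists the same episode ids under each split name."""
    reference = {k: sorted(v) for k, v in split_dicts[0].items()}
    return all(
        {k: sorted(v) for k, v in other.items()} == reference
        for other in split_dicts[1:]
    )


print({name: len(ids) for name, ids in expected_splits().items()})
```

In practice, `splits_match` would be called on `[load_splits(p) for p in paths]` for the three `smoke_v1` to `smoke_v3` paths listed above, optionally with `expected_splits()` appended to the list.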

	Also:

	- there is no `data/maniskill_pickclutter/smoke_v4/`
	- there is no `data/maniskill_pickclutter/smoke_v5/`

	`smoke_v4` and `smoke_v5` were code/report version labels, not new held-out episode bundles.

	### What Changed Across PickClutter Versions

	The big changes across `smoke_v2`, `smoke_v3`, `smoke_v4`, and `smoke_v5` were:

	- more benchmark-derived state supervision
	- transition-model training enablement
	- planner bug fixes
	- fairness fixes so the adapter checkpoint did not hide a stronger shared trunk
	- then a frozen-validation planner sweep for the final held-out eval

	The big occlusion win was not caused by changing the eval episodes.

	### Dense-Occlusion Render Artifacts

	The final dense-occlusion run also has a full visual export in:

	- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`

	Those GIFs show the robot interacting with the 3D scene and overlay the adapter state per frame. For `adapter_active_ft`, the overlay includes:

	- adapter on/off state
	- whether a non-base proposal was selected
	- candidate index
	- planner name
	- planner score/confidence
	- state signals such as visibility, access, gap, and damage

	## Crucial Caveats

	### Occlusion result was planner-tuned

	The large jump in:

	- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/`

	came from validation-selected planner tuning on top of the same `smoke_v5` checkpoint.

	The selected override values were:

	- `mode_preference_bonus = 0.75`
	- `premature_retrieve_penalty = 0.5`
	- `premature_insert_penalty = 0.25`
	- `premature_maintain_penalty = 1.0`
	- `occlusion_maintain_gap_min_access = 0.30`
	- `occlusion_maintain_gap_min_visibility = 0.20`
	- `retrieve_stage_access_threshold = 0.18`
	- `retrieve_stage_reveal_threshold = 0.18`
	- `retrieve_stage_support_threshold = 0.18`

	That was a validation-only selection step. It was not a fresh retrain.
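For reference, the selected values can be written as a single override mapping applied on top of a base planner config. The values are copied from the list above; the mapping name, the `apply_overrides` helper, and the placeholder base config are hypothetical and do not reflect the real planner defaults.

```python
SOFTER_PREF_OVERRIDES = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "premature_insert_penalty": 0.25,
    "premature_maintain_penalty": 1.0,
    "occlusion_maintain_gap_min_access": 0.30,
    "occlusion_maintain_gap_min_visibility": 0.20,
    "retrieve_stage_access_threshold": 0.18,
    "retrieve_stage_reveal_threshold": 0.18,
    "retrieve_stage_support_threshold": 0.18,
}


def apply_overrides(planner_config, overrides):
    """Return a copy of planner_config with overrides applied.

    Unknown keys raise, so a typo cannot silently leave a value untuned.
    """
    unknown = sorted(set(overrides) - set(planner_config))
    if unknown:
        raise KeyError(f"unknown planner keys: {unknown}")
    merged = dict(planner_config)
    merged.update(overrides)
    return merged


# Placeholder base config: the 0.0 values are NOT the real defaults.
base_config = dict.fromkeys(SOFTER_PREF_OVERRIDES, 0.0)
tuned_config = apply_overrides(base_config, SOFTER_PREF_OVERRIDES)
print(tuned_config["mode_preference_bonus"])
```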

	### Bag and cloth did not use real depth

	The bridge-task runner for the bag and cloth proxies used:

	- one real RGB camera
	- copied into all camera slots
	- zero-filled depth channels

	The runner labels this stack:

	- `rgb_triplicate_zero_depth`

	This is a real limitation and it should not be hidden.

	It happened because the bridge proxy runner used a compatibility shim to satisfy the shared multi-camera tensor interface without plumbing real bridge-scene multiview depth through the stack.
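The effect of that shim can be shown with a small pure-Python sketch. It is not the runner's code: the function name, the nested-list image representation, and the per-slot dict layout are assumptions chosen to keep the example dependency-free, whereas the real stack presumably works on tensors.

```python
def rgb_triplicate_zero_depth(rgb_image, num_camera_slots=3):
    """Fill every camera slot with the same RGB image and an all-zero depth map.

    rgb_image: H x W nested list of (r, g, b) tuples from the single real camera.
    Returns one {"rgb": ..., "depth": ...} dict per camera slot.
    """
    height = len(rgb_image)
    width = len(rgb_image[0])
    slots = []
    for _ in range(num_camera_slots):
        rgb_copy = [list(row) for row in rgb_image]           # same pixels in every slot
        zero_depth = [[0.0] * width for _ in range(height)]   # no real depth signal
        slots.append({"rgb": rgb_copy, "depth": zero_depth})
    return slots


frame = [[(1, 2, 3), (4, 5, 6)], [(7, 8, 9), (10, 11, 12)]]
slots = rgb_triplicate_zero_depth(frame)
print(len(slots), slots[0]["depth"])
```

Every slot carries identical pixels and no geometry, so a model fed this stack gets no multiview or depth information beyond the single RGB view.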

	Consequences:

	- bag and cloth are not modality-matched to the PickClutter runs
	- PickClutter used real `rgbd_3cam`
	- bag and cloth used weaker perception input

	### Bag and cloth also used a different control wrapper

	PickClutter:

	- observation stack: `rgbd_3cam`
	- action space: `bimanual_delta_pose`

	Bag and cloth:

	- observation stack: `rgb_triplicate_zero_depth`
	- action space: `widowx_delta_pose`

	So the cross-track story is architecture-consistent but not fully input/control-identical.

	### `smoke_v4_evalprobe_fromv3` is not a clean retrain result

	This run:

	- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/`

	used corrected planner logic on top of `smoke_v3` weights. It is useful evidence that the active adapter can matter, but it is not a clean end-to-end retrain.

	## What Was Actually Learned

	The current repo supports the following claims:

	- the structured adapter remains a viable direction
	- the active branch can clearly matter on a real public dense-occlusion benchmark proxy
	- `adapter_noop` remains a useful fairness control
	- bag-like retrieval still shows modest positive evidence
	- cloth-like retrieval is currently the weak link

	It does not support the following stronger claims yet:

	- broad superiority on realistic manipulation benchmarks
	- stable multi-seed wins across all three target-like public proxy tracks
	- a clean modality-matched comparison across occlusion, bag, and cloth

	## Environment And Setup

	This repo preserves two separate environment setups.

	### Prior `VLAarchtests3` / RLBench stack

	Preserved under:

	- `setup/ENVIRONMENT.md`
	- `setup/env_vars.sh`
	- `setup/rlbench_pip_freeze.txt`

	This is the older RLBench/AnyBimanual-oriented environment.

	### Current public benchmark stack

	Preserved under:

	- `setup/public_benchmark/ENVIRONMENT.md`
	- `setup/public_benchmark/env_vars.sh`
	- `setup/public_benchmark/python_version.txt`
	- `setup/public_benchmark/uname.txt`
	- `setup/public_benchmark/nvidia_smi.txt`
	- `setup/public_benchmark/gpu_short.txt`
	- `setup/public_benchmark/pip_freeze_python311.txt`
	- `setup/public_benchmark/rlbench_env_pip_freeze.txt`
	- `setup/public_benchmark/hf_env.txt`

	The public benchmark runs in this session were assembled on:

	- GPU: `NVIDIA L40S`
	- VRAM: `46068 MiB`
	- driver: `580.126.09`
	- Python: `3.11.10`
	- kernel: `Linux 6.8.0-88-generic`

	## Recommended Starting Points

	If you want the strongest current public benchmark evidence, start here:

	- `docs/maniskill_pickclutter_correction_log_2026-04-01.md`
	- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`

	If you want the bag/cloth public bridge follow-up, start here:

	- `docs/public_bridge_smoke_run_log_2026-04-01.md`
	- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
	- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`

	If you want the repo lineage context, start here:

	- `history/VLAarchtests_previous_README.md`
	- `history/VLAarchtests2_previous_README.md`
	- `history/VLAarchtests3_previous_README.md`

	## Bottom Line

	This repo is the complete organization package for the current workspace state.

	It includes:

	- the `VLAarchtests3` export base
	- the full current machine `reports/`, `outputs/`, and `data/` trees
	- the public benchmark code, datasets, checkpoints, and results
	- the environment files needed to stand up the same stack on similar hardware

	Use it as the archival handoff state for continuing the elastic-occlusion adapter work.

	Do not cite it as if all three target-like public proxy tracks are already cleanly solved. The occlusion track is the strongest current evidence; bag is modest; cloth remains weak; and the bridge-task perception stack still needs a proper real-depth rewrite.