---
tags:
  - robotics
  - vision-language-action
  - bimanual-manipulation
  - maniskill
  - rlbench
  - rgbd
---

VLAarchtests4

VLAarchtests4 is the new organization repo for the RunPod work staged from /workspace on 2026-04-01 UTC.

It carries forward the earlier repo lineage and adds the current public-sim benchmark package work:

  • VLAarchtests
    • early proxy + RLBench architecture search, handoff checkpoints, and environment recreation files from the 2026-03-25/26 sessions
  • VLAarchtests2
    • larger exploratory organization repo with more baselines, overlap/anchor work, frequent model changes, mixed artifacts, and several results that required later reinterpretation
  • VLAarchtests3
    • cleaned export focused on the elastic-occlusion trunk + structured adapter + no-op fallback refactor, validated tests, current checkpoints, and handoff docs
  • VLAarchtests4
    • keeps the VLAarchtests3 export intact and adds the full current workspace reports/, outputs/, and data/ trees, including all public benchmark smoke runs, checkpoint directories, dataset bundles, validation sweeps, and environment snapshots from the public-sim evaluation pass

What This Repo Adds

The main new addition in this repo is the public benchmark track work for the elastic-occlusion adapter:

  • real public-sim smoke runs on:
    • ManiSkill PickClutterYCB-v1 as the dense occluded retrieval proxy
    • the ManiSkill bridge basket retrieval task as the bag retrieval proxy
    • the ManiSkill bridge cloth retrieval task as the folded-cloth retrieval proxy
  • the public benchmark package code and summaries
  • the train/eval logs, checkpoints, cached datasets, validation sweeps, and correction logs for those runs
  • full visual rerenders of the final smoke_v5_eval_tuned_softerpref dense-occlusion benchmark for both trunk_only_ft and adapter_active_ft
  • the same-machine environment snapshot for the public benchmark stack used on this RunPod

Top-Level Contents

  • code/
    • the cleaned code snapshot inherited from VLAarchtests3
  • artifacts/
    • prior staged checkpoints, proxy data, reports, and generated configs already bundled by VLAarchtests3
  • docs/
    • prior handoff/audit docs plus the current public benchmark run logs and correction notes
  • legacy/
    • older exact artifacts preserved by VLAarchtests3
  • setup/
    • prior environment files plus a new public benchmark environment snapshot under setup/public_benchmark/
  • history/
    • copied README history for VLAarchtests, VLAarchtests2, and VLAarchtests3
  • reports/
    • the full current /workspace/workspace/reports tree from this machine
  • outputs/
    • the full current /workspace/workspace/outputs tree from this machine
  • data/
    • the full current /workspace/workspace/data tree from this machine
  • PUBLIC_BENCHMARK_RESULTS.md
    • compact index of all public benchmark train/eval results from this session
  • MODEL_AND_ARTIFACT_INDEX.md
    • practical map of the main artifact roots to start from

Benchmark GIF Renders

The repo now also includes a full rendered replay of the final dense-occlusion benchmark:

  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/
    • 50 held-out trunk_only_ft gifs
    • 50 held-out adapter_active_ft gifs
    • index.html, INDEX.md, and manifest.json for browsing and validation
  • renderer:
    • code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/render_maniskill_pickclutter_benchmark_gifs.py

Important caveats:

  • these gifs are rerendered from the saved smoke_v5_eval_tuned_softerpref checkpoints and exact held-out seeds, not a different benchmark run
  • the rerender kept the same softer_pref planner override used in the reported held-out result
  • the rerender manifest records 0 success mismatches versus the saved benchmark json files
  • only the dense-occlusion track has this full gif export right now
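
The zero-mismatch claim is exactly the kind of thing the manifest makes checkable. Below is a sketch of the comparison; the episode and field names are assumptions for illustration, not the manifest's real schema:

```python
import json

def count_success_mismatches(manifest_json, benchmark_json):
    """Count episodes whose rerendered success flag disagrees with the
    saved benchmark result, matched on episode id."""
    saved = {e["episode_id"]: e["success"]
             for e in json.loads(benchmark_json)["episodes"]}
    rerender = json.loads(manifest_json)["episodes"]
    return sum(1 for e in rerender if saved[e["episode_id"]] != e["success"])

# Tiny synthetic example: rerender agrees with the saved benchmark.
saved = json.dumps({"episodes": [{"episode_id": 172000, "success": True},
                                 {"episode_id": 172001, "success": False}]})
manifest = json.dumps({"episodes": [{"episode_id": 172000, "success": True},
                                    {"episode_id": 172001, "success": False}]})
mismatches = count_success_mismatches(manifest, saved)
# mismatches == 0
```

The real check would load manifest.json from the gif directory and the saved benchmark json files recorded for smoke_v5_eval_tuned_softerpref.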

Architecture State Carried Forward

The core model family inherited from VLAarchtests3 is still:

  • trunk_only
  • adapter_noop
  • adapter_active

The important architectural state carried into the public benchmark work is:

  • wrapped-policy interface with exact trunk_only, adapter_noop, and adapter_active modes
  • structured reveal/retrieve adapter with:
    • state prediction
    • task-routed proposal families
    • retrieve-feasibility gating
    • lightweight transition model
    • planner/reranker
  • planner fixes that replaced hard vetoes with softer stage penalties in:
    • code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py
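
The veto-to-penalty change can be pictured with a minimal scoring sketch. The function name, stage labels, and numbers below are illustrative only, not the actual planner.py API:

```python
def score_proposal(base_score, stage, access, threshold=0.18, penalty=0.5):
    """Score a candidate proposal with a soft stage penalty.

    A hard veto would return -inf whenever a retrieve proposal fired
    below the access threshold, discarding it outright. The softer
    version subtracts a fixed cost instead, so a strong proposal can
    still outrank the alternatives.
    """
    if stage == "retrieve" and access < threshold:
        return base_score - penalty  # penalized, not vetoed
    return base_score
```

Under a hard veto a low-access retrieve proposal can never win; under the soft penalty it merely has to beat the other candidates by a larger margin.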

Public Benchmark Summary

Detailed per-run results are in PUBLIC_BENCHMARK_RESULTS.md. The short version is:

1. Dense occluded retrieval proxy

Benchmark:

  • ManiSkill PickClutterYCB-v1

Best current held-out result:

  • directory:
    • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/
  • summary:
    • trunk_only_ft = 0.04
    • adapter_noop = 0.04
    • adapter_active_ft = 0.62
    • delta_active_vs_trunk = +0.58
    • 95% CI = [0.44, 0.72]
    • intervention_rate = 1.0
    • non_base_selection_rate = 1.0
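
Assuming the reported interval is a paired CI on delta_active_vs_trunk over the 50 held-out episodes (the exact CI method used in the reports is not restated here), a percentile-bootstrap sketch looks like this:

```python
import random

def bootstrap_delta_ci(trunk, active, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(active) - mean(trunk),
    resampling episodes with replacement."""
    rng = random.Random(seed)
    n = len(trunk)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(active[i] - trunk[i] for i in idx) / n)
    deltas.sort()
    return (deltas[int(n_boot * alpha / 2)],
            deltas[int(n_boot * (1 - alpha / 2)) - 1])

# Synthetic per-episode successes matching the reported rates:
# trunk 2/50 = 0.04, active 31/50 = 0.62.
trunk = [1] * 2 + [0] * 48
active = [1] * 31 + [0] * 19
lo, hi = bootstrap_delta_ci(trunk, active)
```

With independent synthetic episodes this lands near the reported [0.44, 0.72]; the real interval also depends on how successes co-occur per episode.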

Important caveat:

  • this was not a new retrain after smoke_v5
  • it used the same smoke_v5 checkpoints with planner hyperparameters selected on the frozen validation split and then applied once to the untouched held-out split

2. Bag retrieval proxy

Benchmark:

  • public ManiSkill bridge basket retrieval proxy

Current fair read:

  • seed 17 corrected held-out:
    • trunk = 0.32
    • noop = 0.00
    • active = 0.48
  • seed 23 corrected held-out:
    • trunk = 0.48
    • noop = 0.08
    • active = 0.48
  • corrected 2-seed aggregate:
    • trunk = 0.40
    • noop = 0.04
    • active = 0.48
    • delta = +0.08

Interpretation:

  • bag remains modestly positive after using one consistent corrected planner across seeds
  • the effect is smaller and less clean than the best occlusion result

3. Cloth retrieval proxy

Benchmark:

  • public ManiSkill bridge cloth retrieval proxy

Current read:

  • seed 17:
    • trunk = 0.04
    • noop = 0.04
    • active = 0.10
  • seed 23:
    • trunk = 0.04
    • noop = 0.02
    • active = 0.02
  • seed 29:
    • trunk = 0.04
    • noop = 0.04
    • active = 0.04
  • 3-seed aggregate:
    • trunk = 0.0400
    • noop = 0.0333
    • active = 0.0533
    • delta = +0.0133
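
The 3-seed aggregate above is a plain unweighted mean of the per-seed rates; a minimal sketch (the mode keys are shorthand, not the report schema):

```python
def aggregate_seeds(per_seed):
    """Unweighted mean of per-seed success rates for each mode."""
    n = len(per_seed)
    return {mode: round(sum(s[mode] for s in per_seed) / n, 4)
            for mode in per_seed[0]}

cloth = [
    {"trunk": 0.04, "noop": 0.04, "active": 0.10},  # seed 17
    {"trunk": 0.04, "noop": 0.02, "active": 0.02},  # seed 23
    {"trunk": 0.04, "noop": 0.04, "active": 0.04},  # seed 29
]
agg = aggregate_seeds(cloth)
# agg == {"trunk": 0.04, "noop": 0.0333, "active": 0.0533}
```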

Interpretation:

  • cloth is weak and unstable
  • current evidence does not support a strong cloth-specific win

Important Fairness Notes

The fairness story is mixed and should be stated plainly.

What is fair in the strongest public benchmark result:

  • same initialization checkpoint for trunk_only_ft and adapter_active_ft
  • same train/val/test split within each task
  • same optimizer, LR, batch size, and unfreeze scope within each task
  • adapter_noop is evaluated from the same adapter checkpoint as adapter_active_ft
  • the held-out test episodes were not hand-picked after seeing outcomes

What is not fully paper-clean yet:

  • most current public benchmark evidence is smoke-scale and low-seed
  • the occlusion headline result depends on validation-selected planner tuning on top of a fixed checkpoint
  • bag required eval-side planner correction for one seed to avoid a collapse
  • cloth remains weak even after additional seeds and val sweeps

PickClutter Split Fairness

The important point for the dense-occlusion track is that the dataset split did not drift across the early smoke versions.

  • data/maniskill_pickclutter/smoke_v1/episode_splits.json
  • data/maniskill_pickclutter/smoke_v2/episode_splits.json
  • data/maniskill_pickclutter/smoke_v3/episode_splits.json

These files contain the same episode ids:

  • train: 170000..170031
  • val: 171000..171007
  • eval: 172000..172049

Also:

  • there is no data/maniskill_pickclutter/smoke_v4/
  • there is no data/maniskill_pickclutter/smoke_v5/

smoke_v4 and smoke_v5 were code/report version labels, not new held-out episode bundles.
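
This no-drift claim is mechanically checkable. A self-contained sketch follows; in practice the three payloads would be read from the data/maniskill_pickclutter/*/episode_splits.json files, and the split key names are assumptions:

```python
import json

def splits_identical(payloads):
    """True if every episode_splits.json payload carries the same
    episode ids per split, ignoring ordering."""
    norm = [{k: sorted(v) for k, v in json.loads(p).items()}
            for p in payloads]
    return all(n == norm[0] for n in norm[1:])

# The episode ids documented for smoke_v1..v3:
payload = json.dumps({
    "train": list(range(170000, 170032)),  # 170000..170031
    "val":   list(range(171000, 171008)),  # 171000..171007
    "eval":  list(range(172000, 172050)),  # 172000..172049
})
assert splits_identical([payload, payload, payload])
```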

What Changed Across PickClutter Versions

The big changes across smoke_v2, smoke_v3, smoke_v4, and smoke_v5 were:

  • more benchmark-derived state supervision
  • transition-model training enablement
  • planner bug fixes
  • fairness fixes so the adapter checkpoint did not hide a stronger shared trunk
  • then a frozen-validation planner sweep for the final held-out eval

The big occlusion win was not caused by changing the eval episodes.

Dense-Occlusion Render Artifacts

The final dense-occlusion run also has a full visual export in:

  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/

Those gifs show the robot interacting with the 3D scene and overlay the adapter state per frame. For adapter_active_ft, the overlay includes:

  • adapter on/off state
  • whether a non-base proposal was selected
  • candidate index
  • planner name
  • planner score/confidence
  • state signals such as visibility, access, gap, and damage
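
A per-frame overlay line of that sort can be sketched as follows; the field names are illustrative, and the real schema lives in the render script and manifest.json:

```python
def overlay_line(state):
    """Format one frame's adapter state as the text drawn on the gif."""
    return (
        f"adapter={'on' if state['adapter_on'] else 'off'} "
        f"nonbase={'y' if state['non_base_selected'] else 'n'} "
        f"cand={state['candidate_index']} "
        f"planner={state['planner_name']} "
        f"score={state['score']:.2f} "
        f"vis={state['visibility']:.2f} acc={state['access']:.2f} "
        f"gap={state['gap']:.2f} dmg={state['damage']:.2f}"
    )

frame = {"adapter_on": True, "non_base_selected": True, "candidate_index": 3,
         "planner_name": "softer_pref", "score": 0.81, "visibility": 0.42,
         "access": 0.35, "gap": 0.12, "damage": 0.02}
text = overlay_line(frame)
```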

Crucial Caveats

Occlusion result was planner-tuned

The large jump in:

  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/

came from validation-selected planner tuning on top of the same smoke_v5 checkpoint.

The selected override values were:

  • mode_preference_bonus = 0.75
  • premature_retrieve_penalty = 0.5
  • premature_insert_penalty = 0.25
  • premature_maintain_penalty = 1.0
  • occlusion_maintain_gap_min_access = 0.30
  • occlusion_maintain_gap_min_visibility = 0.20
  • retrieve_stage_access_threshold = 0.18
  • retrieve_stage_reveal_threshold = 0.18
  • retrieve_stage_support_threshold = 0.18

That was a validation-only selection step. It was not a fresh retrain.
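
Applied-once override values like these are typically just a dict merged into the planner config for the held-out pass. A hypothetical sketch, in which the dataclass and its defaults are placeholders rather than the shipped planner config:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PlannerConfig:
    # Placeholder defaults, not the shipped ones.
    mode_preference_bonus: float = 0.5
    premature_retrieve_penalty: float = 1.0
    retrieve_stage_access_threshold: float = 0.25

# Selected on the frozen validation split, applied once at held-out eval.
SELECTED = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "retrieve_stage_access_threshold": 0.18,
}

eval_cfg = replace(PlannerConfig(), **SELECTED)
```

The point of the frozen dataclass plus `replace` is that the training-time config stays untouched and the eval override is a single, auditable step.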

Bag and cloth did not use real depth

The bridge-task runner for the bag and cloth proxies used:

  • one real RGB camera
  • copied into all camera slots
  • zero-filled depth channels

The runner labels this stack:

  • rgb_triplicate_zero_depth

This is a real limitation and it should not be hidden.

It happened because the bridge proxy runner used a compatibility shim to satisfy the shared multi-camera tensor interface without plumbing real bridge-scene multiview depth through the stack.
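
A numpy sketch of what such a shim amounts to (shapes and layout here are assumptions, not the runner's actual tensor interface):

```python
import numpy as np

def rgb_triplicate_zero_depth(rgb, num_cams=3):
    """Build a multi-camera RGB-D stack from a single real RGB frame.

    The one real image is copied into every camera slot and the depth
    channel is left all-zero, which satisfies a shared (cams, H, W, 4)
    interface without supplying real multiview or depth information.
    """
    h, w, _ = rgb.shape
    obs = np.zeros((num_cams, h, w, 4), dtype=np.float32)
    obs[..., :3] = rgb[None]  # identical RGB in every camera slot
    return obs                # obs[..., 3] stays zero: no real depth

obs = rgb_triplicate_zero_depth(np.ones((64, 64, 3), dtype=np.float32))
```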

Consequences:

  • bag and cloth are not modality-matched to the PickClutter runs
  • PickClutter used real rgbd_3cam
  • bag and cloth used weaker perception input

Bag and cloth also used a different control wrapper

PickClutter:

  • observation stack: rgbd_3cam
  • action space: bimanual_delta_pose

Bag and cloth:

  • observation stack: rgb_triplicate_zero_depth
  • action space: widowx_delta_pose

So the cross-track story is architecture-consistent but not fully input/control-identical.

smoke_v4_evalprobe_fromv3 is not a clean retrain result

This run:

  • reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/

used corrected planner logic on top of smoke_v3 weights. It is useful evidence that the active adapter can matter, but it is not a clean end-to-end retrain.

What Was Actually Learned

The current repo supports the following claims:

  • the structured adapter is still alive
  • the active branch can clearly matter on a real public dense-occlusion benchmark proxy
  • adapter_noop remains a useful fairness control
  • bag-like retrieval still shows modest positive evidence
  • cloth-like retrieval is currently the weak link

It does not support the following stronger claims yet:

  • broad superiority on realistic manipulation benchmarks
  • stable multi-seed wins across all three target-like public proxy tracks
  • a clean modality-matched comparison across occlusion, bag, and cloth

Environment And Setup

Two environment stories exist in this repo.

Prior VLAarchtests3 / RLBench stack

Preserved under:

  • setup/ENVIRONMENT.md
  • setup/env_vars.sh
  • setup/rlbench_pip_freeze.txt

This is the older RLBench / AnyBimanual oriented environment.

Current public benchmark stack

Preserved under:

  • setup/public_benchmark/ENVIRONMENT.md
  • setup/public_benchmark/env_vars.sh
  • setup/public_benchmark/python_version.txt
  • setup/public_benchmark/uname.txt
  • setup/public_benchmark/nvidia_smi.txt
  • setup/public_benchmark/gpu_short.txt
  • setup/public_benchmark/pip_freeze_python311.txt
  • setup/public_benchmark/rlbench_env_pip_freeze.txt
  • setup/public_benchmark/hf_env.txt

The public benchmark runs in this session were produced on:

  • GPU: NVIDIA L40S
  • VRAM: 46068 MiB
  • driver: 580.126.09
  • Python: 3.11.10
  • kernel: Linux 6.8.0-88-generic

Recommended Starting Points

If you want the strongest current public benchmark evidence, start here:

  • docs/maniskill_pickclutter_correction_log_2026-04-01.md
  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json

If you want the bag/cloth public bridge follow-up, start here:

  • docs/public_bridge_smoke_run_log_2026-04-01.md
  • reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json
  • reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json

If you want the repo lineage context, start here:

  • history/VLAarchtests_previous_README.md
  • history/VLAarchtests2_previous_README.md
  • history/VLAarchtests3_previous_README.md

Bottom Line

This repo is the complete organization package for the current workspace state.

It includes:

  • the VLAarchtests3 export base
  • the full current machine reports/, outputs/, and data/ trees
  • the public benchmark code, datasets, checkpoints, and results
  • the environment files needed to stand up the same stack on similar hardware

Use it as the archival handoff state for continuing the elastic-occlusion adapter work.

Do not cite it as if all three target-like public proxy tracks are already cleanly solved. The occlusion track is the strongest current evidence; bag is modest; cloth remains weak; and the bridge-task perception stack still needs a proper real-depth rewrite.