| # VLAarchtests3 |
|
|
`VLAarchtests3` is the organized export of the elastic-occlusion bimanual VLA handoff, completed on a single L40S RunPod machine.
|
|
| It is a successor snapshot to the earlier `VLAarchtests` and `VLAarchtests2` work: |
|
|
| - `VLAarchtests`: earlier architecture-search and benchmark-debugging work. |
| - `VLAarchtests2`: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation. |
| - `VLAarchtests3`: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here. |
|
|
| ## What Was Done |
|
|
| The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner `trunk + structured adapter + no-op fallback` stack. |
|
|
| The final exported code contains: |
|
|
| - a clean wrapped-policy interface with `trunk_only`, `adapter_noop`, and `adapter_active` modes, |
| - a structured elastic-occlusion adapter with: |
| - reveal-state prediction, |
| - task-routed reveal/retrieve proposal families, |
| - retrieve-feasibility gating, |
| - a lightweight reveal-state transition model, |
| - explicit tests that protect: |
| - no-op equivalence, |
| - generic-task fallback, |
| - benchmark protocol identity, |
| - unsafe retrieve blocking, |
| - cloth-specific selection behavior. |
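The three-mode dispatch above can be sketched as follows. This is an illustrative sketch only: `WrappedPolicy`, `trunk_fn`, and `adapter_fn` are hypothetical names, and only the three mode strings (`trunk_only`, `adapter_noop`, `adapter_active`) come from the export.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical wrapped-policy sketch; only the mode names are from the export.
@dataclass
class WrappedPolicy:
    trunk_fn: Callable[[Any], Any]          # pretrained trunk policy
    adapter_fn: Callable[[Any, Any], Any]   # structured elastic-occlusion adapter
    mode: str = "trunk_only"

    def act(self, obs):
        base = self.trunk_fn(obs)
        if self.mode == "trunk_only":
            return base
        if self.mode == "adapter_noop":
            # adapter is attached but must pass the trunk action through unchanged
            return base
        if self.mode == "adapter_active":
            # adapter may override the base action with a reveal/retrieve proposal
            return self.adapter_fn(obs, base)
        raise ValueError(f"unknown mode: {self.mode}")
```

The key property the tests below protect is that `adapter_noop` and `trunk_only` are byte-identical in output, so attaching the adapter can never silently change trunk behavior on generic tasks.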
|
|
The most important debugging pass was in the planner/gating logic. The original active path could loop on reveal actions indefinitely or attempt retrieval too early. The final planner fixes made it:
|
|
| - summarize scene readiness at the scene level rather than worst-candidate level, |
| - hard-mask unsafe retrieve candidates, |
| - switch from reveal to retrieve once feasibility is met, |
| - use task-specific bag and cloth readiness criteria, |
| - prefer reveal macros early and retrieve later. |
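The fixed gating behavior can be sketched as a single selection function. Everything here is a hypothetical simplification (candidate dicts with `kind`, `safe`, and `score` fields, and a scalar `scene_readiness`); it illustrates the hard-masking, the scene-level feasibility gate, and the reveal-then-retrieve preference, not the actual implementation.

```python
# Hypothetical planner-gating sketch, assuming candidates are dicts with
# "kind" ("reveal" or "retrieve"), a "safe" flag, and a scalar "score".
def select_candidate(candidates, scene_readiness, readiness_threshold=0.5):
    # Hard-mask unsafe retrieve candidates before any scoring.
    allowed = [c for c in candidates
               if not (c["kind"] == "retrieve" and not c["safe"])]
    # Scene-level feasibility gate: switch to retrieval once the scene as a
    # whole is ready, rather than gating on the worst single candidate.
    if scene_readiness >= readiness_threshold:
        pool = [c for c in allowed if c["kind"] == "retrieve"] or allowed
    else:
        # Early in the episode, prefer reveal macros.
        pool = [c for c in allowed if c["kind"] == "reveal"] or allowed
    return max(pool, key=lambda c: c["score"])
```

The hard mask runs before scoring on purpose: an unsafe retrieve with a high score must never win, which was one failure mode of the original active path.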
|
|
| ## What Was Actually Evaluated |
|
|
| Two different kinds of evidence are included. |
|
|
| ### 1. Trusted General-Task Anchor |
|
|
| This was kept narrow on purpose because only `dual_push_buttons` was trusted on this setup. |
|
|
| Trusted anchor evidence: |
|
|
| - official AnyBimanual local anchor summary on `dual_push_buttons`: |
| - `25` episodes |
| - success `0.96` |
| - live rerun on this RunPod: |
| - `5` episodes |
| - scores `[0, 100, 100, 0, 0]` |
| - mean score `40.0` |
|
|
| Interpretation: |
|
|
| - the official trunk path is real and non-trivial on the one stable anchor task, |
| - this does **not** mean the local custom CLIP trunk was competitive broadly, |
| - this does **not** validate the other unstable RLBench target-like tasks. |
|
|
| ### 2. Reveal/Retrieve Proxy Benchmark |
|
|
| This benchmark is useful for mechanism debugging, but it is **not** a real robot/physics benchmark. |
|
|
| The final reported held-out smoke benchmark used: |
|
|
| - `12` foliage episodes, |
| - `12` bag episodes, |
| - `12` cloth episodes, |
| - `36` total episodes, |
| - separate held-out procedural seeds from the adapter train/val splits. |
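A disjoint-seed layout of this shape can be sketched as below. The per-family counts (128 train, 32 val, 12 held-out) come from this export; the concrete numbering scheme is hypothetical.

```python
# Hypothetical per-family seed partition; only the split sizes are from the export.
def make_seed_splits(family_index, n_train=128, n_val=32, n_heldout=12):
    base = family_index * 1_000
    train = list(range(base, base + n_train))
    val = list(range(base + n_train, base + n_train + n_val))
    # held-out smoke-benchmark seeds live in a disjoint block
    heldout = list(range(base + 500, base + 500 + n_heldout))
    assert not (set(train) | set(val)) & set(heldout), "seed leakage"
    return train, val, heldout
```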
|
|
| Results: |
|
|
| - non-intervention / matched no-op: |
| - mean success `0.000` |
| - foliage `0.000` |
| - bag `0.000` |
| - cloth `0.000` |
| - visibility integral `2.275` |
| - corridor availability `0.0312` |
| - disturbance cost `0.7433` |
|
|
| - intervention / adapter active: |
| - mean success `0.6667` |
| - foliage `0.6667` |
| - bag `0.7500` |
| - cloth `0.5833` |
| - visibility integral `19.9503` |
| - corridor availability `0.7974` |
| - disturbance cost `0.2835` |
| - reocclusion rate `0.00278` |
| - planner regret `0.1586` |
|
|
The active policy genuinely intervened on these tasks rather than silently falling back to the trunk:
|
|
| - all recorded selections on the final held-out smoke run were non-base candidates, |
| - typical successful pattern: |
| - foliage: reveal (`pin_canopy`) then `retrieve`, |
| - bag: reveal (`widen_mouth`) then `retrieve`, |
| - cloth: reveal (`separate_layer`) then `retrieve`. |
|
|
| ## Important Limitation |
|
|
| The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator. |
|
|
| It has: |
|
|
| - synthetic RGB-D renders, |
| - internal latent state, |
| - hand-coded transition rules, |
| - scripted teacher/oracle supervision. |
|
|
| It does **not** have: |
|
|
| - rigid-body or deformable physics, |
| - actual robot kinematics, |
| - true contact/grasp simulation, |
| - a fair end-to-end manipulation distribution for a pretrained trunk. |
|
|
| Therefore: |
|
|
| - the proxy result is useful to validate adapter logic, |
| - the proxy result is **not** sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark. |
|
|
| ## What Was Learned |
|
|
| The work supports the following conclusions: |
|
|
| - the structured adapter idea is still alive, |
| - the explicit reveal-state variables are worth keeping, |
| - task-routed reveal macros matter, |
| - retrieve-feasibility gating matters, |
| - the no-op fallback path for general tasks is sound, |
| - the old heavy memory/world-model story is not where the strongest evidence lives. |
|
|
| The work does **not** yet justify: |
|
|
| - a claim of broad general-task superiority, |
| - a claim that the current proxy benchmark is a fair end-to-end benchmark, |
| - a claim that the architecture is validated on realistic target-like sim tasks. |
|
|
| ## Was The Adapter Trained? |
|
|
| Yes. |
|
|
| The final proxy adapter checkpoint was trained with: |
|
|
| - frozen trunk, |
| - adapter-only updates, |
| - trained components: |
| - reveal/state head, |
| - proposal prior, |
| - transition model, |
| - planner/reranker. |
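The frozen-trunk, adapter-only update pattern can be sketched in PyTorch as below. The module shapes and names are placeholders; only the pattern (trunk parameters frozen, optimizer over adapter components only) reflects the setup described above.

```python
import torch
import torch.nn as nn

# Placeholder modules; only the freeze/optimize split mirrors the export.
trunk = nn.Linear(16, 16)  # stands in for the pretrained trunk
adapter = nn.ModuleDict({
    "reveal_state_head": nn.Linear(16, 4),
    "proposal_prior": nn.Linear(16, 8),
    "transition_model": nn.Linear(20, 4),
    "planner_reranker": nn.Linear(8, 1),
})

# Freeze every trunk parameter; optimize adapter parameters only.
for p in trunk.parameters():
    p.requires_grad_(False)
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

obs = torch.randn(2, 16)
with torch.no_grad():
    feats = trunk(obs)  # trunk runs inference-only
loss = adapter["reveal_state_head"](feats).pow(2).mean()  # placeholder loss
loss.backward()
opt.step()
```

Freezing via `requires_grad_(False)` plus an adapter-only optimizer guarantees the trunk weights are bit-identical before and after adapter training, which is what makes the matched active vs no-op comparisons meaningful.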
|
|
| Proxy training data: |
|
|
| - train: `128` episodes per proxy family, |
| - val: `32` episodes per proxy family, |
| - proxy families: |
| - foliage, |
| - bag, |
| - cloth. |
|
|
| The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds. |
|
|
| ## Was This A Perfect Fairness Story? |
|
|
| No. |
|
|
| What is fair in the current export: |
|
|
| - matched active vs no-op comparisons on the same wrapped checkpoint, |
| - held-out procedural seeds for the final proxy benchmark, |
| - exact no-op and generic-task fallback tests. |
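The exact no-op check named above can be sketched as a small harness. The function name and the `make_policy(mode=...)` factory signature are hypothetical; the property being asserted (adapter-attached no-op output equals trunk-only output) is the one the export's tests protect.

```python
# Hypothetical no-op equivalence harness, assuming a policy factory that
# accepts a mode string and returns an object exposing act(obs).
def check_noop_equivalence(make_policy, observations, tol=0.0):
    """Assert that adapter_noop reproduces trunk_only outputs exactly."""
    trunk_only = make_policy(mode="trunk_only")
    noop = make_policy(mode="adapter_noop")
    for obs in observations:
        a, b = trunk_only.act(obs), noop.act(obs)
        assert abs(a - b) <= tol, f"no-op mismatch on obs={obs}: {a} vs {b}"
    return True
```

With `tol=0.0` this demands exact equality, which matches the "exact no-op" phrasing; a small tolerance would only be appropriate if the two paths intentionally differ in floating-point evaluation order.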
|
|
| What is still missing for a stronger paper-quality comparison: |
|
|
| 1. same-initialization `trunk_only` fine-tuned on the same proxy data, |
| 2. same-initialization `trunk + adapter` fine-tuned on the same proxy data, |
| 3. comparison on held-out proxy seeds, |
| 4. comparison on stable real-sim tasks. |
|
|
| ## What Is Left To Do |
|
|
| The main remaining work is on real sim benchmarks, not more abstract proxy optimization. |
|
|
| Priority list: |
|
|
| 1. Train a fair control: |
| - same initialization, |
| - `trunk_only` fine-tuned on the same reveal/retrieve proxy data, |
| - compare against `trunk + adapter`. |
|
|
| 2. Attach the adapter directly to a strong public trunk: |
| - official AnyBimanual, |
| - official PerAct2 / RVT, |
| - or 3D FlowMatch Actor if practical. |
|
|
| 3. Validate on stable real-sim tasks: |
| - do not trust unstable RLBench tasks with infeasible waypoints, |
| - rebuild a trustworthy target-like evaluation subset, |
| - keep `dual_push_buttons` as a regression anchor only. |
|
|
| 4. Add a deformable / garment benchmark: |
| - this is the most relevant public step toward the future suitcase/clothes benchmark. |
|
|
| 5. Only after that: |
| - revisit larger RLBench sweeps, |
| - or collect custom teleop data. |
|
|
| ## Repository Layout |
|
|
| - `code/` |
| - cleaned code snapshot used for the handoff |
| - `artifacts/outputs/` |
| - current adapter checkpoints and training outputs |
| - `artifacts/reports/` |
| - evaluation and debugging reports |
| - `artifacts/data/reveal_proxy/` |
| - proxy train/val datasets used by this stage |
| - `legacy/` |
| - exact older checkpoints and summaries that the current work depends on |
| - `docs/` |
| - audit, iteration, and completion reports from this handoff |
| - `setup/` |
| - same-machine environment notes and helper scripts |
|
|
| ## Recommended Use Of This Repo |
|
|
| Use this repo as: |
|
|
| - the archival handoff state, |
| - the codebase to continue adapter work from, |
| - the source of the current checkpoints and benchmark reports, |
| - the baseline package before moving to real sim validation. |
|
|
| Do **not** use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next. |
|
|