# VLAarchtests3
`VLAarchtests3` is the organized export of the elastic-occlusion bimanual VLA handoff completed on a 1x L40S RunPod machine.
It is a successor snapshot to the earlier `VLAarchtests` and `VLAarchtests2` work:
- `VLAarchtests`: earlier architecture-search and benchmark-debugging work.
- `VLAarchtests2`: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation.
- `VLAarchtests3`: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here.
## What Was Done
The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner `trunk + structured adapter + no-op fallback` stack.
The final exported code contains (a minimal interface sketch follows this list):
- a clean wrapped-policy interface with `trunk_only`, `adapter_noop`, and `adapter_active` modes,
- a structured elastic-occlusion adapter with:
  - reveal-state prediction,
  - task-routed reveal/retrieve proposal families,
  - retrieve-feasibility gating,
  - a lightweight reveal-state transition model,
- explicit tests that protect:
  - no-op equivalence,
  - generic-task fallback,
  - benchmark protocol identity,
  - unsafe retrieve blocking,
  - cloth-specific selection behavior.
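The sketch below shows the shape of that wrapped-policy interface. All names here (`WrappedPolicy`, `Mode`, the `trunk`/`adapter` callables) are illustrative, not the exported API:

```python
from enum import Enum


class Mode(str, Enum):
    TRUNK_ONLY = "trunk_only"          # bypass the adapter entirely
    ADAPTER_NOOP = "adapter_noop"      # adapter attached but must not change the action
    ADAPTER_ACTIVE = "adapter_active"  # adapter may override with reveal/retrieve macros


class WrappedPolicy:
    """Illustrative wrapper: base trunk plus the structured occlusion adapter."""

    def __init__(self, trunk, adapter, mode=Mode.TRUNK_ONLY):
        self.trunk = trunk      # base policy, e.g. the trunk checkpoint
        self.adapter = adapter  # structured elastic-occlusion adapter
        self.mode = mode

    def act(self, obs, task):
        base_action = self.trunk(obs, task)
        if self.mode == Mode.TRUNK_ONLY:
            return base_action
        proposal = self.adapter.propose(obs, task, base_action)
        if self.mode == Mode.ADAPTER_NOOP or proposal is None:
            # no-op fallback: generic tasks reduce exactly to the trunk action
            return base_action
        return proposal
```

The `adapter_noop` branch is the path the no-op equivalence tests pin down: on generic tasks the wrapper must return exactly the trunk action.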
The most important debugging pass targeted the planner/gating logic. The original active path could reveal forever or retrieve too early. The final fixes made the planner (sketched after this list):
- summarize scene readiness at the scene level rather than worst-candidate level,
- hard-mask unsafe retrieve candidates,
- switch from reveal to retrieve once feasibility is met,
- use task-specific bag and cloth readiness criteria,
- prefer reveal macros early and retrieve later.
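A hedged sketch of that gating rule follows; the candidate dicts and the `scene.readiness` interface are assumptions for illustration, not the repo's actual data structures:

```python
def select_macro(candidates, scene, task):
    """Illustrative gating step, not the exported implementation.

    `candidates` holds dicts with "kind" ("reveal" or "retrieve"), "unsafe"
    (bool), and "score" (float); `scene.readiness(task)` is assumed to be a
    scene-level summary rather than a worst-candidate one.
    """
    # hard-mask unsafe retrieve candidates instead of merely down-weighting them
    safe = [c for c in candidates
            if not (c["kind"] == "retrieve" and c["unsafe"])]

    # task-specific readiness criteria (e.g. bag-mouth width vs cloth-layer separation)
    ready = scene.readiness(task) >= scene.readiness_threshold(task)

    retrieves = [c for c in safe if c["kind"] == "retrieve"]
    reveals = [c for c in safe if c["kind"] == "reveal"]

    if ready and retrieves:
        # feasibility met: switch from reveal to retrieve instead of revealing forever
        return max(retrieves, key=lambda c: c["score"])
    if reveals:
        # feasibility not yet met: prefer reveal macros early
        return max(reveals, key=lambda c: c["score"])
    return None  # nothing safe to propose: fall back to the base trunk action
```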
## What Was Actually Evaluated
Two different kinds of evidence are included.
### 1. Trusted General-Task Anchor
The anchor was deliberately kept narrow: `dual_push_buttons` was the only task trusted on this setup.
Trusted anchor evidence:
- official AnyBimanual local anchor summary on `dual_push_buttons`:
  - `25` episodes
  - success `0.96`
- live rerun on this RunPod:
  - `5` episodes
  - scores `[0, 100, 100, 0, 0]`
  - mean score `40.0`
Interpretation:
- the official trunk path is real and non-trivial on the one stable anchor task,
- this does **not** mean the local custom CLIP trunk was competitive broadly,
- this does **not** validate the other unstable RLBench target-like tasks.
### 2. Reveal/Retrieve Proxy Benchmark
This benchmark is useful for mechanism debugging, but it is **not** a real robot/physics benchmark.
The final reported held-out smoke benchmark used:
- `12` foliage episodes,
- `12` bag episodes,
- `12` cloth episodes,
- `36` total episodes,
- separate held-out procedural seeds from the adapter train/val splits.
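One way to keep such a split honest is to make the per-family seed ranges disjoint by construction. The sketch below is illustrative only; the concrete ranges and counts are assumptions, not the repo's actual seed assignment:

```python
# Illustrative seed partition: smoke-benchmark seeds stay disjoint from the
# adapter train/val seeds within each proxy family.
def seed_splits(family_index, n_train=128, n_val=32, n_smoke=12, stride=10_000):
    base = family_index * stride
    train = range(base, base + n_train)
    val = range(train.stop, train.stop + n_val)
    smoke = range(val.stop, val.stop + n_smoke)
    return {"train": train, "val": val, "smoke": smoke}

for i, family in enumerate(("foliage", "bag", "cloth")):
    splits = seed_splits(i)
    held_out = set(splits["smoke"])
    # the held-out smoke episodes must never reuse a train or val seed
    assert held_out.isdisjoint(splits["train"]), f"{family}: smoke overlaps train"
    assert held_out.isdisjoint(splits["val"]), f"{family}: smoke overlaps val"
```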
Results:

| Metric                | Non-intervention / matched no-op | Intervention / adapter active |
| --------------------- | -------------------------------- | ----------------------------- |
| Mean success          | `0.000`                          | `0.6667`                      |
| Foliage success       | `0.000`                          | `0.6667`                      |
| Bag success           | `0.000`                          | `0.7500`                      |
| Cloth success         | `0.000`                          | `0.5833`                      |
| Visibility integral   | `2.275`                          | `19.9503`                     |
| Corridor availability | `0.0312`                         | `0.7974`                      |
| Disturbance cost      | `0.7433`                         | `0.2835`                      |
| Reocclusion rate      | n/a                              | `0.00278`                     |
| Planner regret        | n/a                              | `0.1586`                      |
The active policy genuinely intervened on these tasks rather than silently falling back to the trunk:
- all recorded selections on the final held-out smoke run were non-base candidates (a minimal check is sketched after this list),
- typical successful pattern:
  - foliage: reveal (`pin_canopy`) then `retrieve`,
  - bag: reveal (`widen_mouth`) then `retrieve`,
  - cloth: reveal (`separate_layer`) then `retrieve`.
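A minimal sketch of that check, assuming a hypothetical selection log keyed by episode id:

```python
def assert_no_silent_fallback(episode_selections):
    """Confirm every held-out episode recorded only non-base selections.

    `episode_selections` is an assumed log structure mapping episode id to the
    ordered list of chosen candidates, e.g. ["reveal:pin_canopy", "retrieve"].
    """
    for ep, selections in episode_selections.items():
        assert selections, f"{ep}: no selections recorded"
        assert all(s != "base" for s in selections), f"{ep}: fell back to the trunk"
```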
## Important Limitation
The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator.
It has:
- synthetic RGB-D renders,
- internal latent state,
- hand-coded transition rules,
- scripted teacher/oracle supervision.
It does **not** have:
- rigid-body or deformable physics,
- actual robot kinematics,
- true contact/grasp simulation,
- a fair end-to-end manipulation distribution for a pretrained trunk.
Therefore:
- the proxy result is useful to validate adapter logic,
- the proxy result is **not** sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark.
## What Was Learned
The work supports the following conclusions:
- the structured adapter idea is still alive,
- the explicit reveal-state variables are worth keeping,
- task-routed reveal macros matter,
- retrieve-feasibility gating matters,
- the no-op fallback path for general tasks is sound,
- the old heavy memory/world-model story is not where the strongest evidence lives.
The work does **not** yet justify:
- a claim of broad general-task superiority,
- a claim that the current proxy benchmark is a fair end-to-end benchmark,
- a claim that the architecture is validated on realistic target-like sim tasks.
## Was The Adapter Trained?
Yes.
The final proxy adapter checkpoint was trained with:
- frozen trunk,
- adapter-only updates,
- trained components:
  - reveal/state head,
  - proposal prior,
  - transition model,
  - planner/reranker.
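A minimal PyTorch-style sketch of that setup; the adapter attribute names are assumptions for illustration, not the exported module names:

```python
import torch

def build_adapter_optimizer(wrapped, lr=1e-4):
    """Freeze the trunk; optimize only the adapter's four trained components."""
    for p in wrapped.trunk.parameters():
        p.requires_grad_(False)  # trunk stays frozen for all proxy training

    trained_modules = [
        wrapped.adapter.reveal_state_head,  # reveal/state head
        wrapped.adapter.proposal_prior,     # proposal prior
        wrapped.adapter.transition_model,   # transition model
        wrapped.adapter.reranker,           # planner/reranker
    ]
    params = [p for m in trained_modules for p in m.parameters()]
    return torch.optim.AdamW(params, lr=lr)
```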
Proxy training data:
- train: `128` episodes per proxy family,
- val: `32` episodes per proxy family,
- proxy families:
  - foliage,
  - bag,
  - cloth.
The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds.
## Was This A Perfect Fairness Story?
No.
What is fair in the current export:
- matched active vs no-op comparisons on the same wrapped checkpoint,
- held-out procedural seeds for the final proxy benchmark,
- exact no-op and generic-task fallback tests.
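A pytest-style sketch of what the exact no-op test asserts; the fixture names (`wrapped_policy`, `sample_obs`, `sample_task`) are hypothetical:

```python
import torch

def test_adapter_noop_matches_trunk(wrapped_policy, sample_obs, sample_task):
    """adapter_noop must reproduce trunk_only exactly, not approximately."""
    wrapped_policy.mode = "adapter_noop"
    noop_action = wrapped_policy.act(sample_obs, sample_task)

    wrapped_policy.mode = "trunk_only"
    trunk_action = wrapped_policy.act(sample_obs, sample_task)

    # exact equality: the generic-task fallback path must be a true no-op
    assert torch.equal(noop_action, trunk_action)
```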
What is still missing for a stronger paper-quality comparison:
1. same-initialization `trunk_only` fine-tuned on the same proxy data,
2. same-initialization `trunk + adapter` fine-tuned on the same proxy data,
3. comparison on held-out proxy seeds,
4. comparison on stable real-sim tasks.
## What Is Left To Do
The main remaining work is on real sim benchmarks, not more abstract proxy optimization.
Priority list:
1. Train a fair control:
   - same initialization,
   - `trunk_only` fine-tuned on the same reveal/retrieve proxy data,
   - compare against `trunk + adapter`.
2. Attach the adapter directly to a strong public trunk:
   - official AnyBimanual,
   - official PerAct2 / RVT,
   - or 3D FlowMatch Actor if practical.
3. Validate on stable real-sim tasks:
   - do not trust unstable RLBench tasks with infeasible waypoints,
   - rebuild a trustworthy target-like evaluation subset,
   - keep `dual_push_buttons` as a regression anchor only.
4. Add a deformable / garment benchmark:
   - this is the most relevant public step toward the future suitcase/clothes benchmark.
5. Only after that:
   - revisit larger RLBench sweeps,
   - or collect custom teleop data.
## Repository Layout
- `code/`
- cleaned code snapshot used for the handoff
- `artifacts/outputs/`
- current adapter checkpoints and training outputs
- `artifacts/reports/`
- evaluation and debugging reports
- `artifacts/data/reveal_proxy/`
- proxy train/val datasets used by this stage
- `legacy/`
- exact older checkpoints and summaries that the current work depends on
- `docs/`
- audit, iteration, and completion reports from this handoff
- `setup/`
- same-machine environment notes and helper scripts
## Recommended Use Of This Repo
Use this repo as:
- the archival handoff state,
- the codebase to continue adapter work from,
- the source of the current checkpoints and benchmark reports,
- the baseline package before moving to real sim validation.
Do **not** use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next.