# VLAarchtests3

`VLAarchtests3` is the organized export of the elastic-occlusion bimanual VLA handoff completed on a 1x L40S RunPod machine.

It is a successor snapshot to the earlier `VLAarchtests` and `VLAarchtests2` work:

- `VLAarchtests`: earlier architecture-search and benchmark-debugging work.
- `VLAarchtests2`: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation.
- `VLAarchtests3`: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here.

## What Was Done

The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner `trunk + structured adapter + no-op fallback` stack.

The final exported code contains:

- a clean wrapped-policy interface with `trunk_only`, `adapter_noop`, and `adapter_active` modes,
- a structured elastic-occlusion adapter with:
  - reveal-state prediction,
  - task-routed reveal/retrieve proposal families,
  - retrieve-feasibility gating,
  - a lightweight reveal-state transition model,
- explicit tests that protect:
  - no-op equivalence,
  - generic-task fallback,
  - benchmark protocol identity,
  - unsafe retrieve blocking,
  - cloth-specific selection behavior.
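
The three-mode wrapped-policy interface can be sketched as follows. This is a minimal illustration, not the repo's actual API; the class and method names (`WrappedPolicy`, `act`, `propose`) are assumptions.

```python
from enum import Enum

class PolicyMode(Enum):
    TRUNK_ONLY = "trunk_only"
    ADAPTER_NOOP = "adapter_noop"
    ADAPTER_ACTIVE = "adapter_active"

class WrappedPolicy:
    """Hypothetical sketch: the trunk is always queried; the adapter only
    modifies the action in ADAPTER_ACTIVE mode."""

    def __init__(self, trunk, adapter, mode=PolicyMode.TRUNK_ONLY):
        self.trunk = trunk
        self.adapter = adapter
        self.mode = mode

    def act(self, obs):
        base_action = self.trunk(obs)
        if self.mode is PolicyMode.ADAPTER_ACTIVE:
            proposal = self.adapter.propose(obs, base_action)
            if proposal is not None:
                return proposal
        # trunk_only and adapter_noop must return the identical base
        # action: this is what the no-op equivalence test protects.
        return base_action
```

The key invariant is that `adapter_noop` is byte-identical to `trunk_only`, so switching the adapter off can never change trunk behavior.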

The most important debugging pass was in the planner/gating logic: the original active path could reveal forever or retrieve too early. The final planner fixes make it:

- summarize scene readiness at the scene level rather than worst-candidate level,
- hard-mask unsafe retrieve candidates,
- switch from reveal to retrieve once feasibility is met,
- use task-specific bag and cloth readiness criteria,
- prefer reveal macros early and retrieve later.
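
The gating order above can be sketched as a single selection function. This is an illustrative reconstruction under assumed data shapes (candidates as dicts with `kind`, `unsafe`, and `score`), not the repo's implementation.

```python
def select_macro(candidates, scene_ready, retrieve_feasible):
    """Hypothetical sketch of the final gating order."""
    # 1. hard-mask unsafe retrieve candidates
    safe = [c for c in candidates
            if not (c["kind"] == "retrieve" and c["unsafe"])]
    retrieves = [c for c in safe if c["kind"] == "retrieve"]
    reveals = [c for c in safe if c["kind"] == "reveal"]
    # 2. switch from reveal to retrieve once scene-level feasibility is met
    if scene_ready and retrieve_feasible and retrieves:
        return max(retrieves, key=lambda c: c["score"])
    # 3. otherwise keep revealing (reveal macros are preferred early)
    if reveals:
        return max(reveals, key=lambda c: c["score"])
    return None  # no safe candidate left: fall back to the base action
```

Note that readiness (`scene_ready`) is a scene-level summary here, not the worst-candidate summary that the original buggy path used.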

## What Was Actually Evaluated

Two different kinds of evidence are included.

### 1. Trusted General-Task Anchor

This was kept narrow on purpose because only `dual_push_buttons` was trusted on this setup.

Trusted anchor evidence:

- official AnyBimanual local anchor summary on `dual_push_buttons`:
  - `25` episodes
  - success `0.96`
- live rerun on this RunPod:
  - `5` episodes
  - scores `[0, 100, 100, 0, 0]`
  - mean score `40.0`

Interpretation:

- the official trunk path is real and non-trivial on the one stable anchor task,
- this does **not** mean the local custom CLIP trunk was broadly competitive,
- this does **not** validate the other unstable RLBench target-like tasks.

### 2. Reveal/Retrieve Proxy Benchmark

This benchmark is useful for mechanism debugging, but it is **not** a real robot/physics benchmark.

The final reported held-out smoke benchmark used:

- `12` foliage episodes,
- `12` bag episodes,
- `12` cloth episodes,
- `36` total episodes,
- separate held-out procedural seeds from the adapter train/val splits.

Results:

- non-intervention / matched no-op:
  - mean success `0.000`
  - foliage `0.000`
  - bag `0.000`
  - cloth `0.000`
  - visibility integral `2.275`
  - corridor availability `0.0312`
  - disturbance cost `0.7433`

- intervention / adapter active:
  - mean success `0.6667`
  - foliage `0.6667`
  - bag `0.7500`
  - cloth `0.5833`
  - visibility integral `19.9503`
  - corridor availability `0.7974`
  - disturbance cost `0.2835`
  - reocclusion rate `0.00278`
  - planner regret `0.1586`
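
As a consistency check (an inference from the rounded rates, not a recorded artifact), the per-family means correspond to integer success counts over 12 episodes each, and those counts reproduce the overall mean:

```python
# Successes per family implied by the reported rates over 12 episodes each
# (0.6667 ~ 8/12, 0.7500 = 9/12, 0.5833 ~ 7/12).
per_family = {"foliage": 8, "bag": 9, "cloth": 7}
total = sum(per_family.values())   # 24 of 36 episodes
mean_success = total / 36
assert abs(mean_success - 0.6667) < 5e-4
```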

The active policy really did intervene on these tasks; it did not silently fall back to the trunk:

- all recorded selections on the final held-out smoke run were non-base candidates,
- typical successful pattern:
  - foliage: reveal (`pin_canopy`) then `retrieve`,
  - bag: reveal (`widen_mouth`) then `retrieve`,
  - cloth: reveal (`separate_layer`) then `retrieve`.

## Important Limitation

The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator.

It has:

- synthetic RGB-D renders,
- internal latent state,
- hand-coded transition rules,
- scripted teacher/oracle supervision.

It does **not** have:

- rigid-body or deformable physics,
- actual robot kinematics,
- true contact/grasp simulation,
- a fair end-to-end manipulation distribution for a pretrained trunk.

Therefore:

- the proxy result is useful to validate adapter logic,
- the proxy result is **not** sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark.

## What Was Learned

The work supports the following conclusions:

- the structured adapter idea is still alive,
- the explicit reveal-state variables are worth keeping,
- task-routed reveal macros matter,
- retrieve-feasibility gating matters,
- the no-op fallback path for general tasks is sound,
- the old heavy memory/world-model story is not where the strongest evidence lives.

The work does **not** yet justify:

- a claim of broad general-task superiority,
- a claim that the current proxy benchmark is a fair end-to-end benchmark,
- a claim that the architecture is validated on realistic target-like sim tasks.

## Was The Adapter Trained?

Yes.

The final proxy adapter checkpoint was trained with:

- frozen trunk,
- adapter-only updates,
- trained components:
  - reveal/state head,
  - proposal prior,
  - transition model,
  - planner/reranker.

Proxy training data:

- train: `128` episodes per proxy family,
- val: `32` episodes per proxy family,
- proxy families:
  - foliage,
  - bag,
  - cloth.

The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds.
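
The split discipline can be illustrated with a seed layout like the one below. The per-family base offsets and ranges are hypothetical; only the episode counts (128 train, 32 val, 12 held-out per family) come from this report.

```python
# Hypothetical seed layout: held-out smoke seeds are disjoint from
# train/val by construction, per proxy family.
def episode_seeds(family):
    base = {"foliage": 0, "bag": 10_000, "cloth": 20_000}[family]
    return {
        "train": range(base, base + 128),              # 128 train episodes
        "val": range(base + 128, base + 160),          # 32 val episodes
        "heldout": range(base + 1_000, base + 1_012),  # 12 smoke episodes
    }

for fam in ("foliage", "bag", "cloth"):
    splits = episode_seeds(fam)
    assert set(splits["heldout"]).isdisjoint(splits["train"])
    assert set(splits["heldout"]).isdisjoint(splits["val"])
```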

## Was This A Perfect Fairness Story?

No.

What is fair in the current export:

- matched active vs no-op comparisons on the same wrapped checkpoint,
- held-out procedural seeds for the final proxy benchmark,
- exact no-op and generic-task fallback tests.

What is still missing for a stronger paper-quality comparison:

1. same-initialization `trunk_only` fine-tuned on the same proxy data,
2. same-initialization `trunk + adapter` fine-tuned on the same proxy data,
3. comparison on held-out proxy seeds,
4. comparison on stable real-sim tasks.

## What Is Left To Do

The main remaining work is on real sim benchmarks, not more abstract proxy optimization.

Priority list:

1. Train a fair control:
   - same initialization,
   - `trunk_only` fine-tuned on the same reveal/retrieve proxy data,
   - compare against `trunk + adapter`.

2. Attach the adapter directly to a strong public trunk:
   - official AnyBimanual,
   - official PerAct2 / RVT,
   - or 3D FlowMatch Actor if practical.

3. Validate on stable real-sim tasks:
   - do not trust unstable RLBench tasks with infeasible waypoints,
   - rebuild a trustworthy target-like evaluation subset,
   - keep `dual_push_buttons` as a regression anchor only.

4. Add a deformable / garment benchmark:
   - this is the most relevant public step toward the future suitcase/clothes benchmark.

5. Only after that:
   - revisit larger RLBench sweeps,
   - or collect custom teleop data.

## Repository Layout

- `code/`
  - cleaned code snapshot used for the handoff
- `artifacts/outputs/`
  - current adapter checkpoints and training outputs
- `artifacts/reports/`
  - evaluation and debugging reports
- `artifacts/data/reveal_proxy/`
  - proxy train/val datasets used by this stage
- `legacy/`
  - exact older checkpoints and summaries that the current work depends on
- `docs/`
  - audit, iteration, and completion reports from this handoff
- `setup/`
  - same-machine environment notes and helper scripts

## Recommended Use Of This Repo

Use this repo as:

- the archival handoff state,
- the codebase to continue adapter work from,
- the source of the current checkpoints and benchmark reports,
- the baseline package before moving to real sim validation.

Do **not** use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next.