# VLAarchtests3
`VLAarchtests3` is the organized export of the elastic-occlusion bimanual VLA handoff completed on a 1x L40S RunPod machine.
It is a successor snapshot to the earlier `VLAarchtests` and `VLAarchtests2` work:
- `VLAarchtests`: earlier architecture-search and benchmark-debugging work.
- `VLAarchtests2`: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation.
- `VLAarchtests3`: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here.
## What Was Done
The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner `trunk + structured adapter + no-op fallback` stack.
The final exported code contains:
- a clean wrapped-policy interface with `trunk_only`, `adapter_noop`, and `adapter_active` modes,
- a structured elastic-occlusion adapter with:
- reveal-state prediction,
- task-routed reveal/retrieve proposal families,
- retrieve-feasibility gating,
- a lightweight reveal-state transition model,
- explicit tests that protect:
- no-op equivalence,
- generic-task fallback,
- benchmark protocol identity,
- unsafe retrieve blocking,
- cloth-specific selection behavior.
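The three-mode wrapped-policy interface can be sketched roughly as follows. This is an illustrative minimal sketch, not the repo's actual code: the `WrappedPolicy` class, the callable `trunk`/`adapter` signatures, and the `Mode` enum wrapper are assumptions; only the three mode names come from the export.

```python
from enum import Enum

class Mode(Enum):
    TRUNK_ONLY = "trunk_only"
    ADAPTER_NOOP = "adapter_noop"
    ADAPTER_ACTIVE = "adapter_active"

class WrappedPolicy:
    """Hypothetical wrapper: the trunk is always queried, and the adapter
    may override the trunk action only in ADAPTER_ACTIVE mode."""

    def __init__(self, trunk, adapter, mode=Mode.TRUNK_ONLY):
        self.trunk = trunk
        self.adapter = adapter
        self.mode = mode

    def act(self, obs):
        base_action = self.trunk(obs)
        if self.mode is Mode.ADAPTER_ACTIVE:
            # The adapter may return None to fall back to the trunk action
            # (e.g. on generic tasks it does not route).
            proposal = self.adapter(obs, base_action)
            return proposal if proposal is not None else base_action
        # TRUNK_ONLY and ADAPTER_NOOP must be exactly identical to the trunk;
        # this is what the no-op equivalence test protects.
        return base_action
```

The key design point this shape captures is that `adapter_noop` never touches the action, so no-op equivalence can be asserted exactly rather than approximately.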
The most important debugging pass was in the planner/gating logic. The original active path could reveal forever or retrieve too early. After the fixes, the planner:
- summarizes readiness at the scene level rather than at the worst-candidate level,
- hard-masks unsafe retrieve candidates,
- switches from reveal to retrieve once feasibility is met,
- uses task-specific bag and cloth readiness criteria,
- prefers reveal macros early and retrieve later.
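The gating fixes above can be condensed into a small selection routine. This is a sketch under stated assumptions, not the repo's planner: the candidate tuple layout and the `feasibility`/`ready` scalars are illustrative stand-ins for the real scene summary.

```python
def plan_step(candidates, feasibility, ready):
    """Select one macro per step (illustrative).

    candidates: list of (name, kind, score, safe) tuples,
                where kind is "reveal" or "retrieve".
    feasibility: scene-level retrieve-feasibility estimate.
    ready: task-specific readiness threshold (bag vs. cloth differ).
    """
    # Hard-mask unsafe retrieve candidates instead of merely down-weighting
    # them, so an unsafe retrieve can never win on score alone.
    usable = [c for c in candidates if c[1] != "retrieve" or c[3]]
    # Phase switch at the scene level, not the worst-candidate level:
    # reveal early, then retrieve once feasibility clears the threshold.
    phase = "retrieve" if feasibility >= ready else "reveal"
    phased = [c for c in usable if c[1] == phase] or usable
    # Highest-scoring candidate within the current phase.
    return max(phased, key=lambda c: c[2])
```

With this shape, an early step with low feasibility picks a reveal macro even if an (unsafe) retrieve has the highest raw score, which is exactly the "reveal forever / retrieve too early" failure mode the fixes addressed.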
## What Was Actually Evaluated
Two different kinds of evidence are included.
### 1. Trusted General-Task Anchor
This was kept narrow on purpose because only `dual_push_buttons` was trusted on this setup.
Trusted anchor evidence:
- official AnyBimanual local anchor summary on `dual_push_buttons`:
- `25` episodes
- success `0.96`
- live rerun on this RunPod:
- `5` episodes
- scores `[0, 100, 100, 0, 0]`
- mean score `40.0`
Interpretation:
- the official trunk path is real and non-trivial on the one stable anchor task,
- this does **not** mean the local custom CLIP trunk was competitive broadly,
- this does **not** validate the other unstable RLBench target-like tasks.
### 2. Reveal/Retrieve Proxy Benchmark
This benchmark is useful for mechanism debugging, but it is **not** a real robot/physics benchmark.
The final reported held-out smoke benchmark used:
- `12` foliage episodes,
- `12` bag episodes,
- `12` cloth episodes,
- `36` total episodes,
- separate held-out procedural seeds from the adapter train/val splits.
Results (matched no-op vs. adapter active on the same wrapped checkpoint):

| Metric | No-op (non-intervention) | Adapter active |
| --- | --- | --- |
| Mean success | `0.000` | `0.6667` |
| Success (foliage) | `0.000` | `0.6667` |
| Success (bag) | `0.000` | `0.7500` |
| Success (cloth) | `0.000` | `0.5833` |
| Visibility integral | `2.275` | `19.9503` |
| Corridor availability | `0.0312` | `0.7974` |
| Disturbance cost | `0.7433` | `0.2835` |
| Reocclusion rate | n/a | `0.00278` |
| Planner regret | n/a | `0.1586` |
The active policy genuinely intervened on these tasks rather than silently falling back to the trunk:
- all recorded selections on the final held-out smoke run were non-base candidates,
- typical successful pattern:
- foliage: reveal (`pin_canopy`) then `retrieve`,
- bag: reveal (`widen_mouth`) then `retrieve`,
- cloth: reveal (`separate_layer`) then `retrieve`.
## Important Limitation
The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator.
It has:
- synthetic RGB-D renders,
- internal latent state,
- hand-coded transition rules,
- scripted teacher/oracle supervision.
It does **not** have:
- rigid-body or deformable physics,
- actual robot kinematics,
- true contact/grasp simulation,
- a fair end-to-end manipulation distribution for a pretrained trunk.
Therefore:
- the proxy result is useful to validate adapter logic,
- the proxy result is **not** sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark.
## What Was Learned
The work supports the following conclusions:
- the structured adapter idea is still alive,
- the explicit reveal-state variables are worth keeping,
- task-routed reveal macros matter,
- retrieve-feasibility gating matters,
- the no-op fallback path for general tasks is sound,
- the old heavy memory/world-model story is not where the strongest evidence lives.
The work does **not** yet justify:
- a claim of broad general-task superiority,
- a claim that the current proxy benchmark is a fair end-to-end benchmark,
- a claim that the architecture is validated on realistic target-like sim tasks.
## Was The Adapter Trained?
Yes.
The final proxy adapter checkpoint was trained with:
- frozen trunk,
- adapter-only updates,
- trained components:
- reveal-state head,
- proposal prior,
- transition model,
- planner/reranker.
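The frozen-trunk, adapter-only update scheme amounts to partitioning the model's parameters by component before building the optimizer. The sketch below is illustrative: the parameter-name prefixes are assumptions derived from the component list above, not the repo's actual module names.

```python
# Assumed prefixes for the four trained components; the trunk gets everything else.
TRAINED_PREFIXES = (
    "reveal_state_head.",
    "proposal_prior.",
    "transition_model.",
    "planner.",
)

def split_trainable(named_params):
    """Split parameter names into frozen trunk weights and trainable
    adapter weights. In a real framework this would walk the model's
    named parameters and toggle gradient tracking accordingly."""
    frozen, trainable = [], []
    for name in named_params:
        bucket = trainable if name.startswith(TRAINED_PREFIXES) else frozen
        bucket.append(name)
    return frozen, trainable
```

The optimizer is then constructed over the `trainable` set only, so trunk weights cannot drift during proxy training.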
Proxy training data:
- train: `128` episodes per proxy family,
- val: `32` episodes per proxy family,
- proxy families:
- foliage,
- bag,
- cloth.
The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds.
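One simple way to guarantee that held-out seeds never overlap the train/val episodes is to carve each proxy family's seeds from consecutive disjoint ranges. This is a hypothetical sketch of such a split (the family-offset scheme via `zlib.crc32` is an assumption, not the repo's mechanism); the counts match the ones reported above.

```python
import zlib

def split_seeds(family, n_train=128, n_val=32, n_heldout=12):
    """Return disjoint (train, val, heldout) procedural-seed ranges for
    one proxy family. Consecutive ranges are trivially non-overlapping,
    so the smoke benchmark can never replay a training episode."""
    # Deterministic per-family offset so foliage/bag/cloth seeds also
    # never collide with each other (illustrative choice).
    base = (zlib.crc32(family.encode()) % 10_000) * 1_000
    train = range(base, base + n_train)
    val = range(base + n_train, base + n_train + n_val)
    heldout = range(base + n_train + n_val, base + n_train + n_val + n_heldout)
    return train, val, heldout
```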
## Was This A Perfect Fairness Story?
No.
What is fair in the current export:
- matched active vs no-op comparisons on the same wrapped checkpoint,
- held-out procedural seeds for the final proxy benchmark,
- exact no-op and generic-task fallback tests.
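The "exact no-op" fallback check can be phrased as a strict per-observation equality assertion, not a tolerance comparison. A minimal sketch (function names are assumptions; the real tests live in the exported test suite):

```python
def check_noop_equivalence(trunk_act, wrapped_noop_act, observations):
    """Assert that the wrapped policy in adapter_noop mode reproduces the
    trunk's action exactly on every observation. Exact equality is the
    point: the no-op path must not perturb the trunk at all."""
    for obs in observations:
        base = trunk_act(obs)
        wrapped = wrapped_noop_act(obs)
        if base != wrapped:
            raise AssertionError(f"no-op path diverged on {obs!r}")
    return True
```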
What is still missing for a stronger paper-quality comparison:
1. same-initialization `trunk_only` fine-tuned on the same proxy data,
2. same-initialization `trunk + adapter` fine-tuned on the same proxy data,
3. comparison on held-out proxy seeds,
4. comparison on stable real-sim tasks.
## What Is Left To Do
The main remaining work is on real sim benchmarks, not more abstract proxy optimization.
Priority list:
1. Train a fair control:
- same initialization,
- `trunk_only` fine-tuned on the same reveal/retrieve proxy data,
- compare against `trunk + adapter`.
2. Attach the adapter directly to a strong public trunk:
- official AnyBimanual,
- official PerAct2 / RVT,
- or 3D FlowMatch Actor if practical.
3. Validate on stable real-sim tasks:
- do not trust unstable RLBench tasks with infeasible waypoints,
- rebuild a trustworthy target-like evaluation subset,
- keep `dual_push_buttons` as a regression anchor only.
4. Add a deformable / garment benchmark:
- this is the most relevant public step toward the future suitcase/clothes benchmark.
5. Only after that:
- revisit larger RLBench sweeps,
- or collect custom teleop data.
## Repository Layout
- `code/`
- cleaned code snapshot used for the handoff
- `artifacts/outputs/`
- current adapter checkpoints and training outputs
- `artifacts/reports/`
- evaluation and debugging reports
- `artifacts/data/reveal_proxy/`
- proxy train/val datasets used by this stage
- `legacy/`
- exact older checkpoints and summaries that the current work depends on
- `docs/`
- audit, iteration, and completion reports from this handoff
- `setup/`
- same-machine environment notes and helper scripts
## Recommended Use Of This Repo
Use this repo as:
- the archival handoff state,
- the codebase to continue adapter work from,
- the source of the current checkpoints and benchmark reports,
- the baseline package before moving to real sim validation.
Do **not** use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next.