## Public Benchmark Progress
Date: 2026-04-01 UTC
### Confirmed Real Public Benchmark Result
- Public occlusion proxy: `ManiSkill PickClutterYCB-v1`
- Strongest adapter-specific result so far:
  - summary: `/workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
  - `trunk_only_ft = 0.04`
  - `adapter_noop = 0.04`
  - `adapter_active_ft = 0.62`
  - `delta_active_vs_trunk = +0.58`
  - `95% CI = [0.44, 0.72]`
  - `intervention_rate = 1.0`
  - `non_base_selection_rate = 1.0`
- Interpretation:
  - this is a real, adapter-specific sign of life on a public occlusion benchmark
  - the gain is not coming from a stronger shared trunk, because `adapter_noop` stays flat at the `trunk_only_ft` level (`0.04`)
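
The attribution argument above can be sketched as a small check. This is a minimal illustration, not the project's evaluation code: the function name `adapter_attribution` and the tolerance `noop_tol` are assumptions introduced here, and only the three success rates come from the summary quoted above.

```python
def adapter_attribution(trunk_only_ft, adapter_noop, adapter_active_ft, noop_tol=0.01):
    """Compare the three arms and flag whether the gain is adapter-specific.

    noop_tol is a hypothetical tolerance for calling adapter_noop "flat".
    """
    delta_active = adapter_active_ft - trunk_only_ft
    delta_noop = adapter_noop - trunk_only_ft
    return {
        # Rounded to hide floating-point noise in the subtraction.
        "delta_active_vs_trunk": round(delta_active, 4),
        "delta_noop_vs_trunk": round(delta_noop, 4),
        # Adapter-specific: the no-op arm matches the trunk, the active arm beats it.
        "adapter_specific": abs(delta_noop) <= noop_tol and delta_active > 0,
    }


# Success rates reported for PickClutterYCB-v1 in this document.
report = adapter_attribution(trunk_only_ft=0.04, adapter_noop=0.04, adapter_active_ft=0.62)
print(report)
```

If `adapter_noop` had risen alongside `adapter_active_ft`, the flag would be `False`, which would point to a trunk-level improvement instead.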
### BEHAVIOR Bag Proxy Investigation
Target public task family:
- official BEHAVIOR grocery-store bag/container retrieval proxy
- primary candidate: `paying_for_purchases`
- stricter but currently unusable candidate: `buy_basic_garden_tools`
Environment used:
- BEHAVIOR assets: `/workspace/workspace/BEHAVIOR-1K`
- venv used for probes: `/workspace/envs/behavior`
Findings:
- `buy_basic_garden_tools` is blocked by official scene-task geometry:
  - repeated failure on `ontop ['rake.n.03_1', 'grocery_shelf.n.01_1']`
  - even with whitelist attempts, the sampler never found a valid shelf placement
- `paying_for_purchases` is much healthier:
  - `grocery_store_convenience`, `grocery_store_cafe`, and `grocery_store_asian` all load
  - object scope binds the real task objects:
    - `shopping_basket.n.01_1`
    - `money.n.01_1`
    - `checkout.n.03_1`
    - `floor.n.01_1`
- Root sampler bug:
  - official online sampling fails on the floor / agent chain
  - without patching, the blocking warning is:
    - `Room type [grocery_store] ... floor.n.01_1: , checkout.n.03_1: grocery_store_0`
  - after removing the agent-on-floor condition from the sampler pipeline, the next blocker is:
    - `ontop ['shopping_basket.n.01_1', 'floor.n.01_1'] False`
- Critical state-probe result:
  - even when object bindings exist, the sampled movable objects remain parked at their far-away import positions
  - observed example on `grocery_store_asian`:
    - basket position near `[120, 120, -80]`
    - money position near `[115, 115, -85]`
    - apples position near `[110, 110, -90]` and `[105, 105, -95]`
    - `money inside basket = False`
    - `apple1 inside basket = False`
    - `apple2 inside basket = False`
- Conclusion:
  - as of 2026-04-01, the BEHAVIOR bag proxy is not yet usable as a fair evaluation track in this workspace
  - the public task objects bind, but the online sampler does not materialize a valid initial scene for training or evaluation
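
A state probe like the one described above could flag unplaced objects automatically. The sketch below is an assumption-laden illustration: the helper `far_from_scene`, the 50-unit threshold, and the use of the scene origin as reference point are all hypothetical, and the positions are the approximate values observed on `grocery_store_asian`, not output from any BEHAVIOR API.

```python
import math

SCENE_ORIGIN = (0.0, 0.0, 0.0)  # assumed reference point for the scene


def far_from_scene(position, max_distance=50.0):
    """Return True if a position lies farther than max_distance from the origin.

    max_distance is a hypothetical threshold, chosen only for illustration.
    """
    return math.dist(position, SCENE_ORIGIN) > max_distance


# Approximate positions reported in the state probe above.
observed = {
    "basket": (120, 120, -80),
    "money": (115, 115, -85),
    "apple1": (110, 110, -90),
    "apple2": (105, 105, -95),
}

# Objects still parked at import positions instead of being placed in the scene.
parked = sorted(name for name, pos in observed.items() if far_from_scene(pos))
print(parked)
```

A non-empty `parked` list would be a cheap precondition failure to check before spending time on training or evaluation with a sampled scene.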
### Garment / Cloth Proxy Status
- GarmentLab repo cloned:
  - `/workspace/workspace/GarmentLab`
- Immediate constraint:
  - the repo expects Isaac Sim 4.0.0 plus external Google Drive assets
- Current status:
  - code inspected only
  - no runnable public cloth benchmark execution completed yet in this workspace
### Next Public Proxy Candidates
Given the BEHAVIOR blocker, the next-lightest public candidates already available locally are:
- `OpenCabinetDrawer-v1`
  - public ManiSkill task
  - good container reveal / access proxy
- `PutEggplantInBasketScene-v1`
  - public ManiSkill bridge-dataset task
  - public basket / container interaction proxy
- `PutSpoonOnTableClothInScene-v1`
  - public ManiSkill bridge-dataset cloth interaction proxy
### Immediate Recommendation
- Keep the confirmed `PickClutterYCB-v1` result as the anchor public success case.
- Do not spend more time on BEHAVIOR online sampling until either:
  - a cached valid scene instance is created, or
  - the sampler is patched deeply enough to place container objects correctly instead of leaving them at far-away import positions.
- Pivot the next train/eval smoke to a lighter public ManiSkill proxy before returning to BEHAVIOR.