	## Public Benchmark Progress

	Date: 2026-04-01 UTC

	### Confirmed Real Public Benchmark Result

	- Public occlusion proxy: `ManiSkill PickClutterYCB-v1`
	- Strongest adapter-specific result so far:
	  - summary: `/workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
	  - `trunk_only_ft = 0.04`
	  - `adapter_noop = 0.04`
	  - `adapter_active_ft = 0.62`
	  - `delta_active_vs_trunk = +0.58`
	  - `95% CI = [0.44, 0.72]`
	  - `intervention_rate = 1.0`
	  - `non_base_selection_rate = 1.0`
	- Interpretation:
	  - this is a real, adapter-specific sign of life on a public occlusion benchmark
	  - the gain is not coming from a stronger shared trunk, because `adapter_noop` stays flat at the `trunk_only_ft` level of 0.04
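The attribution argument above can be checked mechanically. The following is a minimal sketch, assuming the three scores are measured on the same evaluation set; the function name `attribute_gain` and the tolerance `tol` are hypothetical and are not part of the benchmark package.

```python
def attribute_gain(trunk_only_ft, adapter_noop, adapter_active_ft, tol=1e-9):
    """Split the active-adapter gain into a trunk part and an adapter part."""
    # Change explained by the shared trunk alone (adapter present but inactive).
    trunk_effect = adapter_noop - trunk_only_ft
    # Change that appears only once the adapter is actually active.
    adapter_effect = adapter_active_ft - adapter_noop
    total = adapter_active_ft - trunk_only_ft
    return {
        "delta_active_vs_trunk": round(total, 6),
        "trunk_effect": round(trunk_effect, 6),
        "adapter_effect": round(adapter_effect, 6),
        # Adapter-specific means: no trunk effect, positive adapter effect.
        "adapter_specific": abs(trunk_effect) <= tol and adapter_effect > tol,
    }


# Values copied from the summary reported above.
report = attribute_gain(trunk_only_ft=0.04, adapter_noop=0.04, adapter_active_ft=0.62)
print(report)
```

With the reported numbers the whole gain lands in `adapter_effect`, which is the quantitative form of the "`adapter_noop` stays flat" observation.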

	### BEHAVIOR Bag Proxy Investigation

	Target public task family:
	- official BEHAVIOR grocery-store bag/container retrieval proxy
	- primary candidate: `paying_for_purchases`
	- stricter but currently unusable candidate: `buy_basic_garden_tools`

	Environment used:
	- BEHAVIOR assets: `/workspace/workspace/BEHAVIOR-1K`
	- venv used for probes: `/workspace/envs/behavior`

	Findings:
	- `buy_basic_garden_tools` is blocked by official scene-task geometry:
	  - repeated failure on `ontop ['rake.n.03_1', 'grocery_shelf.n.01_1']`
	  - even with whitelist attempts, the sampler never found a valid shelf placement
	- `paying_for_purchases` is much healthier:
	  - `grocery_store_convenience`, `grocery_store_cafe`, and `grocery_store_asian` all load
	  - object scope binds the real task objects:
	    - `shopping_basket.n.01_1`
	    - `money.n.01_1`
	    - `checkout.n.03_1`
	    - `floor.n.01_1`
	- Root sampler bug:
	  - official online sampling fails on the floor / agent chain
	  - without patching, the blocking warning is:
	    - `Room type [grocery_store] ... floor.n.01_1: , checkout.n.03_1: grocery_store_0`
	  - after removing the agent-on-floor condition from the sampler pipeline, the next blocker is:
	    - `ontop ['shopping_basket.n.01_1', 'floor.n.01_1'] False`
	- Critical state-probe result:
	  - even when object bindings exist, the sampled movable objects remain parked at their far-away import positions
	  - observed example on `grocery_store_asian`:
	    - basket position near `[120, 120, -80]`
	    - money position near `[115, 115, -85]`
	    - apples position near `[110, 110, -90]` and `[105, 105, -95]`
	    - `money inside basket = False`
	    - `apple1 inside basket = False`
	    - `apple2 inside basket = False`
	- Conclusion:
	  - as of 2026-04-01, the BEHAVIOR bag proxy is not yet a usable fair evaluation track in this workspace
	  - the public task objects bind, but the online sampler does not materialize a valid initial scene for training or evaluation
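The state-probe finding above can be turned into a simple guard that rejects a sampled scene whose movable objects are still parked far away. This is a sketch in plain Python, not the BEHAVIOR or OmniGibson API: the helper `parked_objects`, the `scene_center` origin, and the `max_radius` threshold of 20 units are all assumptions chosen for illustration, and the assignment of the two apple positions to `apple1` and `apple2` is arbitrary.

```python
import math


def parked_objects(positions, scene_center=(0.0, 0.0, 0.0), max_radius=20.0):
    """Return the sorted names of objects farther than max_radius from scene_center."""
    return sorted(
        name
        for name, pos in positions.items()
        if math.dist(pos, scene_center) > max_radius
    )


# Approximate positions from the `grocery_store_asian` probe reported above.
observed = {
    "basket": (120.0, 120.0, -80.0),
    "money": (115.0, 115.0, -85.0),
    "apple1": (110.0, 110.0, -90.0),
    "apple2": (105.0, 105.0, -95.0),
}

stuck = parked_objects(observed)
scene_is_valid = not stuck  # valid only if nothing is left at an import position
print(stuck, scene_is_valid)
```

A check like this could run right after sampling, so that an unmaterialized scene is rejected before any training or evaluation episode starts.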

	### Garment / Cloth Proxy Status

	- GarmentLab repo cloned:
	  - `/workspace/workspace/GarmentLab`
	- Immediate constraint:
	  - the repo expects Isaac Sim 4.0.0 plus external Google Drive assets
	- Current status:
	  - code inspected only
	  - no runnable public cloth benchmark execution completed yet in this workspace

	### Next Public Proxy Candidates

	Given the BEHAVIOR blocker, the next-lightest public candidates already available locally are:

	- `OpenCabinetDrawer-v1`
	  - public ManiSkill task
	  - good container reveal / access proxy
	- `PutEggplantInBasketScene-v1`
	  - public ManiSkill bridge-dataset task
	  - public basket / container interaction proxy
	- `PutSpoonOnTableClothInScene-v1`
	  - public ManiSkill bridge-dataset cloth interaction proxy
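Before pivoting, it may be worth confirming that the candidates above are actually registered in the local install. The sketch below assumes that importing `mani_skill.envs` registers the tasks with Gymnasium under exactly these ids, which should be verified against the installed ManiSkill version; the helper `registered_candidates` is hypothetical. It only checks the registry and does not build any environment.

```python
import importlib

CANDIDATES = [
    "OpenCabinetDrawer-v1",
    "PutEggplantInBasketScene-v1",
    "PutSpoonOnTableClothInScene-v1",
]


def registered_candidates(env_ids):
    """Map each env id to True/False, or to None if the packages cannot be imported."""
    try:
        gym = importlib.import_module("gymnasium")
        # Assumption: this import has the side effect of registering the tasks.
        importlib.import_module("mani_skill.envs")
    except (ImportError, OSError):
        return {env_id: None for env_id in env_ids}
    return {env_id: env_id in gym.registry for env_id in env_ids}


status = registered_candidates(CANDIDATES)
for env_id, ok in status.items():
    print(env_id, ok)
```

A `None` entry means the check could not run in the current interpreter, which is different from a task being absent (`False`).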

	### Immediate Recommendation

	- Keep the confirmed `PickClutterYCB-v1` result as the anchor public success case.
	- Do not spend more time on BEHAVIOR online sampling until either:
	  - a cached valid scene instance is created, or
	  - the sampler is patched deeply enough to place container objects correctly instead of leaving them at far-away import positions.
	- Pivot the next train/eval smoke to a lighter public ManiSkill proxy before returning to BEHAVIOR.