	# Public Benchmark Results

	All results below are as of `2026-04-01 UTC`.

	## Dense Occluded Retrieval Proxy

	Benchmark:

	- ManiSkill `PickClutterYCB-v1`

	### Completed runs

	- `reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json`
	  - `trunk = 0.04`
	  - `noop = 0.04`
	  - `active = 0.04`
	- `reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json`
	  - `trunk = 0.04`
	  - `noop = 0.32`
	  - `active = 0.32`
	  - not adapter-specific because `active == noop`
	- `reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json`
	  - `trunk = 0.06`
	  - `noop = 0.06`
	  - `active = 0.06`
	- `reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json`
	  - `trunk = 0.48`
	  - `noop = 0.04`
	  - `active = 0.04`
	  - active intervened but regressed badly
	- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json`
	  - `trunk = 0.06`
	  - `noop = 0.06`
	  - `active = 0.62`
	  - `delta = +0.56`
	  - eval-probe only, not a clean retrain
	- `reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json`
	  - `trunk = 0.04`
	  - `noop = 0.04`
	  - `active = 0.04`
	  - fairness-preserving retrain, but active still failed
	- `reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json`
	  - val-only planner sweep
	  - `baseline_corrected = 0.00`
	  - `soft_pref = 0.00`
	  - `softer_pref = 0.625`
	  - `retrieve_open = 0.625`
	- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
	  - `trunk = 0.04`
	  - `noop = 0.04`
	  - `active = 0.62`
	  - `delta = +0.58`
	  - `95% CI = [0.44, 0.72]`
	  - `intervention_rate = 1.0`
	  - `non_base_selection_rate = 1.0`
	- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
	  - full rerender of all `50` held-out seeds for `trunk_only_ft` and `adapter_active_ft`
	  - includes `index.html`, `INDEX.md`, and `manifest.json`
	  - rerender manifest reports `0` success mismatches against the saved benchmark json files
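The notes in the list above rest on two comparisons: the reported `delta` values match `active - trunk`, and a gain is treated as adapter-specific only when `active` differs from `noop`. A minimal sketch of those checks follows; the `RunSummary` class and its property names are hypothetical and not taken from the repository.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Success rates for the three arms of one run (hypothetical container)."""
    trunk: float
    noop: float
    active: float

    @property
    def delta(self) -> float:
        # Gain of the active adapter over the trunk-only baseline.
        return round(self.active - self.trunk, 4)

    @property
    def adapter_specific(self) -> bool:
        # If the no-op adapter scores the same as the active one,
        # the active path cannot be credited with the result.
        return not math.isclose(self.active, self.noop)


# Values copied from the run list above.
smoke_v2 = RunSummary(trunk=0.04, noop=0.32, active=0.32)
smoke_v5_tuned = RunSummary(trunk=0.04, noop=0.04, active=0.62)

print(smoke_v2.delta, smoke_v2.adapter_specific)
print(smoke_v5_tuned.delta, smoke_v5_tuned.adapter_specific)
```

Under this reading, `smoke_v2` shows a positive delta over trunk that is not attributable to the active adapter, while the tuned `smoke_v5` run is.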

	### Exact `smoke_v5` eval-time tuning carried over to the held-out evaluation

	- `mode_preference_bonus = 0.75`
	- `premature_retrieve_penalty = 0.5`
	- `premature_insert_penalty = 0.25`
	- `premature_maintain_penalty = 1.0`
	- `occlusion_maintain_gap_min_access = 0.30`
	- `occlusion_maintain_gap_min_visibility = 0.20`
	- `retrieve_stage_access_threshold = 0.18`
	- `retrieve_stage_reveal_threshold = 0.18`
	- `retrieve_stage_support_threshold = 0.18`
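One way to keep these nine overrides together so the same values are applied on validation and on held-out seeds is a frozen dataclass. This is only a sketch: the field names and values are copied from the list above, but the `PlannerOverrides` class itself is hypothetical, and how the repository actually passes these settings to the planner is not stated here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PlannerOverrides:
    """Eval-time planner settings from the smoke_v5 tuning (hypothetical wrapper)."""
    mode_preference_bonus: float = 0.75
    premature_retrieve_penalty: float = 0.5
    premature_insert_penalty: float = 0.25
    premature_maintain_penalty: float = 1.0
    occlusion_maintain_gap_min_access: float = 0.30
    occlusion_maintain_gap_min_visibility: float = 0.20
    retrieve_stage_access_threshold: float = 0.18
    retrieve_stage_reveal_threshold: float = 0.18
    retrieve_stage_support_threshold: float = 0.18


overrides = PlannerOverrides()
# A plain dict is convenient for logging the exact settings next to a report.
print(asdict(overrides))
```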

	## Bag Retrieval Proxy

	Benchmark:

	- ManiSkill public bridge basket retrieval proxy

	### Completed runs

	- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json`
	  - `0.32`
	- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json`
	  - `0.00`
	- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json`
	  - `0.48`

	- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json`
	  - `0.48`
	- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json`
	  - `0.08`
	- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json`
	  - `0.00`

	### Seed-23 validation sweep

	- `reports/maniskill_bag_bridge_val_sweep_seed23/summary.json`

	Configs:

	- `default`
	  - `trunk = 0.125`
	  - `noop = 0.125`
	  - `active = 0.00`
	- `less_bonus`
	  - `trunk = 0.125`
	  - `noop = 0.125`
	  - `active = 0.125`
	  - intervention preserved
	- `conservative`
	  - `trunk = 0.125`
	  - `noop = 0.125`
	  - `active = 0.125`
	  - intervention effectively disabled
	- `low_bonus_high_thresh`
	  - `trunk = 0.125`
	  - `noop = 0.125`
	  - `active = 0.125`
	  - intervention effectively disabled
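The sweep above can be read as a filter: keep configs whose `active` score does not fall below `trunk`, then drop those where the adapter no longer intervenes. The sketch below encodes that reading. The selection rule is an assumption made for illustration (the source does not spell out how the config was chosen), and `pick_configs` is a hypothetical helper.

```python
# Validation results for seed 23, copied from the config list above.
# "note" holds the free-text annotation where one was given.
sweep = {
    "default": {"trunk": 0.125, "noop": 0.125, "active": 0.00, "note": None},
    "less_bonus": {"trunk": 0.125, "noop": 0.125, "active": 0.125,
                   "note": "intervention preserved"},
    "conservative": {"trunk": 0.125, "noop": 0.125, "active": 0.125,
                     "note": "intervention effectively disabled"},
    "low_bonus_high_thresh": {"trunk": 0.125, "noop": 0.125, "active": 0.125,
                              "note": "intervention effectively disabled"},
}


def pick_configs(results: dict) -> list[str]:
    """Return configs that do not regress below trunk and still intervene."""
    keep = []
    for name, row in results.items():
        no_regression = row["active"] >= row["trunk"]
        still_intervenes = row["note"] != "intervention effectively disabled"
        if no_regression and still_intervenes:
            keep.append(name)
    return keep


print(pick_configs(sweep))
```

Under this assumed rule only `less_bonus` survives.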

	### Corrected held-out evals

	- `reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json`
	  - `trunk = 0.32`
	  - `noop = 0.00`
	  - `active = 0.48`
	  - `delta = +0.16`
	- `reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json`
	  - `trunk = 0.48`
	  - `noop = 0.08`
	  - `active = 0.48`
	  - `delta = +0.00`
	- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
	  - `trunk = 0.40`
	  - `noop = 0.04`
	  - `active = 0.48`
	  - `delta = +0.08`
	  - run-bootstrap CI `[0.00, 0.16]`
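The 2-seed summary can be re-derived from the two per-seed reports. With only two runs, a bootstrap over runs has four equally likely resamples, so a percentile interval collapses to the minimum and maximum of the resampled means. The sketch below enumerates those resamples exactly. It is one procedure that reproduces the reported interval, assumed here because the source does not describe the bootstrap in detail.

```python
from itertools import product
from statistics import mean

# Corrected held-out results per seed, copied from the list above.
per_seed = {
    17: {"trunk": 0.32, "noop": 0.00, "active": 0.48},
    23: {"trunk": 0.48, "noop": 0.08, "active": 0.48},
}

# Unweighted mean of each arm across the two seeds.
aggregate = {
    arm: round(mean(run[arm] for run in per_seed.values()), 4)
    for arm in ("trunk", "noop", "active")
}

# Per-run gain of active over trunk.
deltas = [round(run["active"] - run["trunk"], 4) for run in per_seed.values()]

# Every way of drawing len(deltas) runs with replacement.
resample_means = sorted(
    round(mean(sample), 4) for sample in product(deltas, repeat=len(deltas))
)

# Each extreme carries 25% of the mass, so the 2.5th and 97.5th
# percentiles coincide with the smallest and largest resampled mean.
ci = (resample_means[0], resample_means[-1])

print(aggregate, round(mean(deltas), 4), ci)
```

An interval this wide, with a lower bound at zero, mostly reflects that there are only two runs.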

	## Cloth Retrieval Proxy

	Benchmark:

	- ManiSkill public bridge cloth retrieval proxy

	### Completed held-out seeds

	- `seed17`
	  - `trunk = 0.04`
	  - `noop = 0.04`
	  - `active = 0.10`
	  - `intervention = 0.3369`
	  - `non_base = 0.2674`
	- `seed23`
	  - `trunk = 0.04`
	  - `noop = 0.02`
	  - `active = 0.02`
	  - `intervention = 0.0`
	  - `non_base = 0.0`
	- `seed29`
	  - `trunk = 0.04`
	  - `noop = 0.04`
	  - `active = 0.04`
	  - `intervention = 0.0`
	  - `non_base = 0.0`

	3-seed aggregate:

	- `trunk = 0.0400`
	- `noop = 0.0333`
	- `active = 0.0533`
	- `delta = +0.0133`
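The aggregate above is an unweighted mean over the three held-out seeds, and breaking it down per seed shows where the small positive delta comes from. A short sketch using the figures listed above; the variable names are illustrative only.

```python
from statistics import mean

# Held-out cloth results, copied from the per-seed list above.
cloth = {
    "seed17": {"trunk": 0.04, "noop": 0.04, "active": 0.10, "intervention": 0.3369},
    "seed23": {"trunk": 0.04, "noop": 0.02, "active": 0.02, "intervention": 0.0},
    "seed29": {"trunk": 0.04, "noop": 0.04, "active": 0.04, "intervention": 0.0},
}

# Unweighted mean of each arm across seeds, rounded as in the report.
aggregate = {
    arm: round(mean(run[arm] for run in cloth.values()), 4)
    for arm in ("trunk", "noop", "active")
}
aggregate_delta = round(aggregate["active"] - aggregate["trunk"], 4)

# Per-seed gain of active over trunk.
per_seed_delta = {
    seed: round(run["active"] - run["trunk"], 4) for seed, run in cloth.items()
}

# Seeds where the adapter never intervened cannot show an adapter effect.
no_intervention = [
    seed for seed, run in cloth.items() if run["intervention"] == 0.0
]

print(aggregate, aggregate_delta)
print(per_seed_delta, no_intervention)
```

The positive aggregate delta comes entirely from `seed17`, the only seed with a nonzero intervention rate.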

	### Seed-23 cloth validation sweep

	- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`

	Configs:

	- `default`
	  - `trunk = 0.25`
	  - `noop = 0.125`
	  - `active = 0.125`
	  - `intervention = 0.0`
	- `low_thresh`
	  - `active = 0.125`
	  - `intervention = 0.2`
	  - `non_base = 0.0667`
	- `low_thresh_less_bonus`
	  - `active = 0.125`
	  - `intervention = 0.2`
	  - `non_base = 0.0667`
	- `very_low_thresh_less_bonus`
	  - `active = 0.125`
	  - `intervention = 1.0`
	  - `non_base = 0.5333`

	Interpretation:

	- seed23 cloth was not recoverable by eval-side planner tuning alone: `active` stayed at `0.125` in every config, even when `intervention` reached `1.0`

	## Single-Seed Combined Proxy Suite

	- `reports/public_proxy_suite_smoke_v1/combined_summary.json`

	Single-seed summary:

	- occlusion proxy: `+0.58`
	- bag proxy: `+0.16`
	- cloth proxy: `+0.06`
	- macro delta: `+0.267`
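The macro delta is consistent with an unweighted mean of the three per-proxy deltas listed above:

$$
\text{macro delta} = \frac{0.58 + 0.16 + 0.06}{3} = \frac{0.80}{3} \approx 0.267
$$

The bag and cloth figures match the `seed17` deltas reported in their sections (`0.48 - 0.32 = 0.16` and `0.10 - 0.04 = 0.06`).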

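	This combined single-seed summary is kept for historical context. The stronger current read, from the held-out results above, is:

	- occlusion: strong (`+0.58`, 95% CI `[0.44, 0.72]`)
	- bag: modestly positive across the corrected 2-seed evaluation (`+0.08`, run-bootstrap CI `[0.00, 0.16]`)
	- cloth: weak/inconclusive across 3 seeds (`+0.0133` aggregate, with intervention on `seed17` only)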