# Public Benchmark Results
All results below are dated `2026-04-01 UTC`.
## Dense Occluded Retrieval Proxy
Benchmark:
- ManiSkill `PickClutterYCB-v1`
### Completed runs
- `reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
- `reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.32`
  - `active = 0.32`
  - not adapter-specific because `active == noop`
- `reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.06`
- `reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.04`
  - `active = 0.04`
  - active intervened but regressed badly
- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.62`
  - `delta = +0.56`
  - eval-probe only, not a clean retrain
- `reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - fairness-preserving retrain, but active still failed
- `reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json`
  - val-only planner sweep
  - `baseline_corrected = 0.00`
  - `soft_pref = 0.00`
  - `softer_pref = 0.625`
  - `retrieve_open = 0.625`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.62`
  - `delta = +0.58`
  - `95% CI = [0.44, 0.72]`
  - `intervention_rate = 1.0`
  - `non_base_selection_rate = 1.0`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
  - full rerender of all `50` held-out seeds for `trunk_only_ft` and `adapter_active_ft`
  - includes `index.html`, `INDEX.md`, and `manifest.json`
  - rerender manifest reports `0` success mismatches against the saved benchmark json files
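
The `delta` values reported above are consistent with `active - trunk`, and the `smoke_v2` note treats `active == noop` as a sign that a result is not adapter-specific. A minimal Python sketch of that bookkeeping follows. The `ArmResult` class and its properties are hypothetical names used for illustration; they are not part of the benchmark package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmResult:
    """Success rates for the three evaluation arms of one run."""

    trunk: float
    noop: float
    active: float

    @property
    def delta(self) -> float:
        # Gain of the active adapter over the trunk-only baseline.
        return round(self.active - self.trunk, 4)

    @property
    def adapter_specific(self) -> bool:
        # Assumed reading of the smoke_v2 note: a result only counts as
        # adapter-specific when the active arm differs from the no-op arm.
        return self.active != self.noop


# Values copied from the completed runs listed above.
runs = {
    "smoke_v2": ArmResult(trunk=0.04, noop=0.32, active=0.32),
    "smoke_v4_evalprobe_fromv3": ArmResult(trunk=0.06, noop=0.06, active=0.62),
    "smoke_v5_eval_tuned_softerpref": ArmResult(trunk=0.04, noop=0.04, active=0.62),
}

for name, result in runs.items():
    print(f"{name}: delta={result.delta:+.2f} "
          f"adapter_specific={result.adapter_specific}")
```
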
### Exact `smoke_v5` eval tuning carried to held-out
- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`
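
To carry these values unchanged from validation to held-out evaluation, it can help to freeze them in a single object. The sketch below is only an assumption about packaging: `PlannerEvalTuning` and `SMOKE_V5_TUNING` are hypothetical names, and how the planner consumes these parameters is not shown in this document.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PlannerEvalTuning:
    """Hypothetical container for the smoke_v5 eval-side planner tuning."""

    mode_preference_bonus: float = 0.75
    premature_retrieve_penalty: float = 0.5
    premature_insert_penalty: float = 0.25
    premature_maintain_penalty: float = 1.0
    occlusion_maintain_gap_min_access: float = 0.30
    occlusion_maintain_gap_min_visibility: float = 0.20
    retrieve_stage_access_threshold: float = 0.18
    retrieve_stage_reveal_threshold: float = 0.18
    retrieve_stage_support_threshold: float = 0.18


# frozen=True makes accidental edits between val and held-out runs an error.
SMOKE_V5_TUNING = PlannerEvalTuning()

for key, value in asdict(SMOKE_V5_TUNING).items():
    print(f"{key} = {value}")
```
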
## Bag Retrieval Proxy
Benchmark:
- ManiSkill public bridge basket retrieval proxy
### Completed runs
- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json`
  - `0.32`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json`
  - `0.00`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json`
  - `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json`
  - `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json`
  - `0.08`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json`
  - `0.00`
### Seed-23 validation sweep
- `reports/maniskill_bag_bridge_val_sweep_seed23/summary.json`
Configs:
- `default`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.00`
- `less_bonus`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention preserved
- `conservative`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled
- `low_bonus_high_thresh`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled
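
One way to read this sweep is as a filter: keep configs where `active` does not fall below `trunk` on validation and the adapter still intervenes. The sketch below encodes that reading. The selection rule itself is an assumption, as is the `intervenes` flag for `default` (the sweep gives no note for it; the regression to `0.00` is taken here as evidence that the adapter did act).

```python
# Validation results copied from the seed-23 sweep above. `intervenes`
# encodes the qualitative notes; the value for "default" is an assumption.
sweep = {
    "default": {"trunk": 0.125, "active": 0.00, "intervenes": True},
    "less_bonus": {"trunk": 0.125, "active": 0.125, "intervenes": True},
    "conservative": {"trunk": 0.125, "active": 0.125, "intervenes": False},
    "low_bonus_high_thresh": {"trunk": 0.125, "active": 0.125, "intervenes": False},
}


def acceptable(result: dict) -> bool:
    """Hypothetical rule: no regression vs. trunk, intervention still active."""
    return result["active"] >= result["trunk"] and result["intervenes"]


selected = [name for name, result in sweep.items() if acceptable(result)]
print(selected)
```

Under this rule only `less_bonus` survives, which matches the config name used in the held-out report paths.
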
### Corrected held-out evals
- `reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json`
  - `trunk = 0.32`
  - `noop = 0.00`
  - `active = 0.48`
  - `delta = +0.16`
- `reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.08`
  - `active = 0.48`
  - `delta = +0.00`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
  - `trunk = 0.40`
  - `noop = 0.04`
  - `active = 0.48`
  - `delta = +0.08`
  - run-bootstrap CI `[0.00, 0.16]`
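
The 2-seed summary can be reproduced from the two per-seed reports. The sketch below assumes that "run-bootstrap" means resampling whole runs with replacement; that interpretation is not confirmed by this document. With only two runs the resamples can be enumerated exactly, and the 2.5th and 97.5th percentiles coincide with the smallest and largest resample means.

```python
from itertools import product
from statistics import mean

# Per-seed held-out results copied from the less_bonus evals above.
seed_results = {
    17: {"trunk": 0.32, "noop": 0.00, "active": 0.48},
    23: {"trunk": 0.48, "noop": 0.08, "active": 0.48},
}

# Mean success rate per arm across seeds.
aggregate = {
    arm: round(mean(r[arm] for r in seed_results.values()), 4)
    for arm in ("trunk", "noop", "active")
}

# Per-run deltas and their mean.
deltas = [round(r["active"] - r["trunk"], 4) for r in seed_results.values()]
mean_delta = round(mean(deltas), 4)

# Enumerate every with-replacement resample of the runs (4 in total).
resample_means = sorted(
    round(mean(sample), 4) for sample in product(deltas, repeat=len(deltas))
)
ci = (resample_means[0], resample_means[-1])

print(aggregate, mean_delta, ci)
```

An interval built from two runs mostly restates the per-seed spread, so it is best read as a range rather than a precise confidence statement.
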
## Cloth Retrieval Proxy
Benchmark:
- ManiSkill public bridge cloth retrieval proxy
### Completed held-out seeds
- `seed17`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.10`
  - `intervention = 0.3369`
  - `non_base = 0.2674`
- `seed23`
  - `trunk = 0.04`
  - `noop = 0.02`
  - `active = 0.02`
  - `intervention = 0.0`
  - `non_base = 0.0`
- `seed29`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - `intervention = 0.0`
  - `non_base = 0.0`
3-seed aggregate:
- `trunk = 0.0400`
- `noop = 0.0333`
- `active = 0.0533`
- `delta = +0.0133`
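
Breaking the aggregate down by seed shows where the small positive delta comes from. The sketch below recomputes the per-seed deltas from the values above and lists the seeds with a non-zero intervention rate; the variable names are illustrative only.

```python
from statistics import mean

# Held-out cloth results copied from the per-seed list above.
cloth = {
    17: {"trunk": 0.04, "noop": 0.04, "active": 0.10, "intervention": 0.3369},
    23: {"trunk": 0.04, "noop": 0.02, "active": 0.02, "intervention": 0.0},
    29: {"trunk": 0.04, "noop": 0.04, "active": 0.04, "intervention": 0.0},
}

# Delta of the active arm over trunk for each seed.
per_seed_delta = {
    seed: round(r["active"] - r["trunk"], 4) for seed, r in cloth.items()
}
aggregate_delta = round(mean(per_seed_delta.values()), 4)

# Seeds where the adapter intervened at all.
intervening_seeds = [seed for seed, r in cloth.items() if r["intervention"] > 0]

print(per_seed_delta, aggregate_delta, intervening_seeds)
```

Only `seed17` shows any intervention, and it is also the only seed with a positive delta, so the aggregate rests on a single seed.
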
### Seed-23 cloth validation sweep
- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`
Configs:
- `default`
  - `trunk = 0.25`
  - `noop = 0.125`
  - `active = 0.125`
  - `intervention = 0.0`
- `low_thresh`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `very_low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 1.0`
  - `non_base = 0.5333`
Interpretation:
- seed23 cloth was not recoverable by eval-side planner tuning alone
## Single-Seed Combined Proxy Suite
- `reports/public_proxy_suite_smoke_v1/combined_summary.json`
Single-seed summary:
- occlusion proxy: `+0.58`
- bag proxy: `+0.16`
- cloth proxy: `+0.06`
- macro delta: `+0.267`
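
The macro delta is consistent with an unweighted mean of the three per-proxy deltas, as this short check shows (the dictionary keys are illustrative labels):

```python
from statistics import mean

# Single-seed deltas copied from the summary above.
proxy_deltas = {"occlusion": 0.58, "bag": 0.16, "cloth": 0.06}

# Unweighted (macro) average: each proxy counts equally.
macro_delta = round(mean(proxy_deltas.values()), 3)
print(f"macro delta: {macro_delta:+.3f}")
```
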
This combined single-seed picture is kept as a historical snapshot; the later multi-seed and corrected evaluations support a more qualified read:
- occlusion: strong
- bag: modestly positive across corrected 2-seed evaluation
- cloth: weak/inconclusive across 3 seeds