# Public Benchmark Results

All dates below refer to `2026-04-01 UTC`.

## Dense Occluded Retrieval Proxy

Benchmark:

- ManiSkill `PickClutterYCB-v1`

### Completed runs

- `reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
- `reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.32`
  - `active = 0.32`
  - not adapter-specific because `active == noop`
- `reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.06`
- `reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.04`
  - `active = 0.04`
  - active intervened but regressed badly
- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.62`
  - `delta = +0.56`
  - eval-probe only, not a clean retrain
- `reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - fairness-preserving retrain, but active still failed
- `reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json`
  - val-only planner sweep
  - `baseline_corrected = 0.00`
  - `soft_pref = 0.00`
  - `softer_pref = 0.625`
  - `retrieve_open = 0.625`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.62`
  - `delta = +0.58`
  - `95% CI = [0.44, 0.72]`
  - `intervention_rate = 1.0`
  - `non_base_selection_rate = 1.0`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
  - full rerender of all `50` held-out seeds for `trunk_only_ft` and `adapter_active_ft`
  - includes `index.html`, `INDEX.md`, and `manifest.json`
  - rerender manifest reports `0` success mismatches against the saved benchmark json files
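
The `delta` values listed above are consistent with `active - trunk`, and the `smoke_v2` note treats a run as not adapter-specific when `active == noop`. A minimal Python sketch of both checks follows. The `Run` class and its field names are hypothetical helpers for illustration, not part of the benchmark package, and the `delta` definition is an assumption inferred from the reported numbers.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Run:
    """Hypothetical container for the three success rates of one summary."""
    name: str
    trunk: float
    noop: float
    active: float

    @property
    def delta(self) -> float:
        # Assumed definition: active success rate minus trunk success rate.
        return round(self.active - self.trunk, 4)

    @property
    def matches_noop(self) -> bool:
        # True when the active adapter scores the same as the no-op adapter.
        return math.isclose(self.active, self.noop)


runs = [
    Run("smoke_v2", trunk=0.04, noop=0.32, active=0.32),
    Run("smoke_v4_evalprobe_fromv3", trunk=0.06, noop=0.06, active=0.62),
    Run("smoke_v5_eval_tuned_softerpref", trunk=0.04, noop=0.04, active=0.62),
]

for run in runs:
    note = "active == noop" if run.matches_noop else "active != noop"
    print(f"{run.name}: delta={run.delta:+.2f} ({note})")
```

Under this reading, `smoke_v2` is flagged because its gain over trunk is fully reproduced by the no-op adapter, while the two `0.62` runs are not.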

### Exact `smoke_v5` eval tuning carried to held-out

- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`

## Bag Retrieval Proxy

Benchmark:

- ManiSkill public bridge basket retrieval proxy

### Completed runs

- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json`
  - `0.32`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json`
  - `0.00`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json`
  - `0.48`

- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json`
  - `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json`
  - `0.08`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json`
  - `0.00`

### Seed-23 validation sweep

- `reports/maniskill_bag_bridge_val_sweep_seed23/summary.json`

Configs:

- `default`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.00`
- `less_bonus`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention preserved
- `conservative`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled
- `low_bonus_high_thresh`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled

### Corrected held-out evals

- `reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json`
  - `trunk = 0.32`
  - `noop = 0.00`
  - `active = 0.48`
  - `delta = +0.16`
- `reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.08`
  - `active = 0.48`
  - `delta = +0.00`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
  - `trunk = 0.40`
  - `noop = 0.04`
  - `active = 0.48`
  - `delta = +0.08`
  - run-bootstrap CI `[0.00, 0.16]`
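
The 2-seed manual summary is consistent with an unweighted mean over the two seeds, and the run-bootstrap interval is consistent with resampling the two per-seed deltas with replacement. The sketch below enumerates every possible resample and takes nearest-rank percentiles. This procedure is an assumption: the exact bootstrap used to produce the reported interval is not described here, and `nearest_rank` is a hypothetical helper.

```python
from itertools import product
from math import ceil
from statistics import fmean

# Per-seed held-out results from the `less_bonus` evals.
seeds = {
    17: {"trunk": 0.32, "noop": 0.00, "active": 0.48},
    23: {"trunk": 0.48, "noop": 0.08, "active": 0.48},
}

# Unweighted mean over seeds for each arm.
summary = {
    arm: round(fmean(s[arm] for s in seeds.values()), 4)
    for arm in ("trunk", "noop", "active")
}
summary["delta"] = round(summary["active"] - summary["trunk"], 4)

# Per-run deltas, then the mean of every size-n resample drawn with replacement.
deltas = [s["active"] - s["trunk"] for s in seeds.values()]
resampled = sorted(fmean(draw) for draw in product(deltas, repeat=len(deltas)))


def nearest_rank(sorted_values, q):
    """Nearest-rank percentile of an ascending list, for q in (0, 1]."""
    return sorted_values[max(ceil(q * len(sorted_values)), 1) - 1]


ci = (
    round(nearest_rank(resampled, 0.025), 4),
    round(nearest_rank(resampled, 0.975), 4),
)

print(summary)
print(ci)
```

With only two runs there are four equally likely resamples, so the interval simply spans the two per-seed deltas. That is why the lower bound sits exactly at `0.00`.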

## Cloth Retrieval Proxy

Benchmark:

- ManiSkill public bridge cloth retrieval proxy

### Completed held-out seeds

- `seed17`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.10`
  - `intervention = 0.3369`
  - `non_base = 0.2674`
- `seed23`
  - `trunk = 0.04`
  - `noop = 0.02`
  - `active = 0.02`
  - `intervention = 0.0`
  - `non_base = 0.0`
- `seed29`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - `intervention = 0.0`
  - `non_base = 0.0`

3-seed aggregate:

- `trunk = 0.0400`
- `noop = 0.0333`
- `active = 0.0533`
- `delta = +0.0133`
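
The 3-seed aggregate is consistent with an unweighted mean of the per-seed rates, with `delta` taken as `active - trunk`. A quick check under that assumption (the `cloth` dictionary layout is only for illustration):

```python
from statistics import fmean

# Held-out cloth results per seed: (trunk, noop, active).
cloth = {
    "seed17": (0.04, 0.04, 0.10),
    "seed23": (0.04, 0.02, 0.02),
    "seed29": (0.04, 0.04, 0.04),
}

# Transpose the per-seed tuples into per-arm columns and average each column.
trunk, noop, active = (fmean(col) for col in zip(*cloth.values()))
delta = active - trunk

print(f"trunk={trunk:.4f} noop={noop:.4f} active={active:.4f} delta={delta:+.4f}")
```

Only `seed17` contributes a positive difference, so the aggregate delta is one third of that seed's `+0.06`.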

### Seed-23 cloth validation sweep

- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`

Configs:

- `default`
  - `trunk = 0.25`
  - `noop = 0.125`
  - `active = 0.125`
  - `intervention = 0.0`
- `low_thresh`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `very_low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 1.0`
  - `non_base = 0.5333`

Interpretation:

- seed23 cloth was not recoverable by eval-side planner tuning alone: `active` stayed at `0.125` in every config, even when `intervention` reached `1.0`

## Single-Seed Combined Proxy Suite

- `reports/public_proxy_suite_smoke_v1/combined_summary.json`

Single-seed summary:

- occlusion proxy: `+0.58`
- bag proxy: `+0.16`
- cloth proxy: `+0.06`
- macro delta: `+0.267`
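
The macro delta is consistent with an unweighted mean of the three per-proxy deltas. A short check under that assumption:

```python
from statistics import fmean

# Single-seed deltas from the combined summary (active minus trunk).
proxy_deltas = {"occlusion": 0.58, "bag": 0.16, "cloth": 0.06}

# Macro average: every proxy counts equally, regardless of episode count.
macro_delta = fmean(proxy_deltas.values())
print(f"macro delta: {macro_delta:+.3f}")
```

Because the average is unweighted, the occlusion proxy accounts for most of the macro figure.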

The combined single-seed summary is kept for historical reference; the multi-seed results give the stronger current read:

- occlusion: strong
- bag: modestly positive across corrected 2-seed evaluation
- cloth: weak/inconclusive across 3 seeds