# Public Benchmark Results

All dates below refer to `2026-04-01 UTC`.

## Dense Occluded Retrieval Proxy

Benchmark:

- ManiSkill `PickClutterYCB-v1`

### Completed runs

- `reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
- `reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.32`
  - `active = 0.32`
  - not adapter-specific because `active == noop`
- `reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.06`
- `reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.04`
  - `active = 0.04`
  - active intervened but regressed badly
- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.62`
  - `delta = +0.56`
  - eval-probe only, not a clean retrain
- `reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - fairness-preserving retrain, but active still failed
- `reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json`
  - val-only planner sweep
  - `baseline_corrected = 0.00`
  - `soft_pref = 0.00`
  - `softer_pref = 0.625`
  - `retrieve_open = 0.625`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.62`
  - `delta = +0.58`
  - `95% CI = [0.44, 0.72]`
  - `intervention_rate = 1.0`
  - `non_base_selection_rate = 1.0`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
  - full rerender of all `50` held-out seeds for `trunk_only_ft` and `adapter_active_ft`
  - includes `index.html`, `INDEX.md`, and `manifest.json`
  - rerender manifest reports `0` success mismatches against the saved benchmark json files

### Exact `smoke_v5` eval tuning carried to held-out

- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`

## Bag Retrieval Proxy

Benchmark:

- ManiSkill public bridge basket retrieval proxy

### Completed runs

- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json`
  - `0.32`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json`
  - `0.00`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json`
  - `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json`
  - `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json`
  - `0.08`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json`
  - `0.00`

### Seed-23 validation sweep

- `reports/maniskill_bag_bridge_val_sweep_seed23/summary.json`

Configs:

- `default`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.00`
- `less_bonus`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention preserved
- `conservative`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled
- `low_bonus_high_thresh`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled

### Corrected held-out evals

- `reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json`
  - `trunk = 0.32`
  - `noop = 0.00`
  - `active = 0.48`
  - `delta = +0.16`
- `reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.08`
  - `active = 0.48`
  - `delta = +0.00`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
  - `trunk = 0.40`
  - `noop = 0.04`
  - `active = 0.48`
  - `delta = +0.08`
  - run-bootstrap CI `[0.00, 0.16]`

## Cloth Retrieval Proxy

Benchmark:

- ManiSkill public bridge cloth retrieval proxy

### Completed held-out seeds

- `seed17`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.10`
  - `intervention = 0.3369`
  - `non_base = 0.2674`
- `seed23`
  - `trunk = 0.04`
  - `noop = 0.02`
  - `active = 0.02`
  - `intervention = 0.0`
  - `non_base = 0.0`
- `seed29`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - `intervention = 0.0`
  - `non_base = 0.0`

3-seed aggregate:

- `trunk = 0.0400`
- `noop = 0.0333`
- `active = 0.0533`
- `delta = +0.0133`

### Seed-23 cloth validation sweep

- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`

Configs:

- `default`
  - `trunk = 0.25`
  - `noop = 0.125`
  - `active = 0.125`
  - `intervention = 0.0`
- `low_thresh`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `very_low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 1.0`
  - `non_base = 0.5333`

Interpretation:

- seed23 cloth was not recoverable by eval-side planner tuning alone

## Single-Seed Combined Proxy Suite

- `reports/public_proxy_suite_smoke_v1/combined_summary.json`

Single-seed summary:

- occlusion proxy: `+0.58`
- bag proxy: `+0.16`
- cloth proxy: `+0.06`
- macro delta: `+0.267`

This combined single-seed picture is useful historically, but the stronger current read is:

- occlusion: strong
- bag: modestly positive across corrected 2-seed evaluation
- cloth: weak/inconclusive across 3 seeds
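
The aggregate figures in this file (the 2-seed bag summary, the 3-seed cloth aggregate, and the single-seed macro delta) can be rechecked from the per-seed numbers with a few lines of Python. This is only a sanity-check sketch, not the script that produced the reports. It assumes that `delta` means `active - trunk` and that aggregates are unweighted means over seeds, both of which are consistent with the values listed above. The helper name `aggregate` and the dictionary layout are hypothetical.

```python
from statistics import mean

# Per-seed held-out success rates copied from this file: (trunk, noop, active).
BAG_LESS_BONUS = {
    "seed17": (0.32, 0.00, 0.48),
    "seed23": (0.48, 0.08, 0.48),
}
CLOTH = {
    "seed17": (0.04, 0.04, 0.10),
    "seed23": (0.04, 0.02, 0.02),
    "seed29": (0.04, 0.04, 0.04),
}


def aggregate(runs):
    """Average each arm over seeds (assumed unweighted); delta = active - trunk (assumed)."""
    trunk = mean(r[0] for r in runs.values())
    noop = mean(r[1] for r in runs.values())
    active = mean(r[2] for r in runs.values())
    return {"trunk": trunk, "noop": noop, "active": active, "delta": active - trunk}


bag = aggregate(BAG_LESS_BONUS)
cloth = aggregate(CLOTH)

# Single-seed macro delta: unweighted mean of the occlusion, bag, and cloth deltas.
macro_delta = mean([0.58, 0.16, 0.06])

for name, agg in (("bag", bag), ("cloth", cloth)):
    print(name, {k: round(v, 4) for k, v in agg.items()})
print("macro delta", round(macro_delta, 3))
```

Rounded to the precision used in this file, the results agree with the reported bag 2-seed summary, the cloth 3-seed aggregate, and the `+0.267` macro delta.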