# Public Benchmark Results

All results below are as of `2026-04-01 UTC`.

## Dense Occluded Retrieval Proxy

Benchmark:

- ManiSkill `PickClutterYCB-v1`

### Completed runs

- `reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
- `reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.32`
  - `active = 0.32`
  - the gain over trunk is not adapter-specific because `active == noop`
- `reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.06`
- `reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.04`
  - `active = 0.04`
  - the active adapter intervened but regressed badly against trunk (`0.04` vs `0.48`)
- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json`
  - `trunk = 0.06`
  - `noop = 0.06`
  - `active = 0.62`
  - `delta = +0.56`
  - eval-probe only, not a clean retrain
- `reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - fairness-preserving retrain, but active still failed
- `reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json`
  - val-only planner sweep
  - `baseline_corrected = 0.00`
  - `soft_pref = 0.00`
  - `softer_pref = 0.625`
  - `retrieve_open = 0.625`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.62`
  - `delta = +0.58`
  - `95% CI = [0.44, 0.72]`
  - `intervention_rate = 1.0`
  - `non_base_selection_rate = 1.0`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
  - full rerender of all `50` held-out seeds for `trunk_only_ft` and `adapter_active_ft`
  - includes `index.html`, `INDEX.md`, and `manifest.json`
  - rerender manifest reports `0` success mismatches against the saved benchmark json files
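
The `delta` values reported above match `active - trunk`. As a minimal sketch under that reading, the same arithmetic can be applied to every run in the list, together with `active - noop`, which shows whether a gain is specific to the active adapter. The dictionary layout, short run names, and helper name are illustrative assumptions, not part of the benchmark package.

```python
# Success rates copied from the run list above: (trunk, noop, active).
# Short run names are illustrative labels for the report directories.
RUNS = {
    "smoke": (0.04, 0.04, 0.04),
    "smoke_v2": (0.04, 0.32, 0.32),
    "smoke_v3": (0.06, 0.06, 0.06),
    "smoke_v4": (0.48, 0.04, 0.04),
    "smoke_v4_evalprobe_fromv3": (0.06, 0.06, 0.62),
    "smoke_v5": (0.04, 0.04, 0.04),
    "smoke_v5_eval_tuned_softerpref": (0.04, 0.04, 0.62),
}


def deltas(trunk: float, noop: float, active: float) -> dict:
    """Return active-vs-trunk and active-vs-noop gaps (hypothetical helper)."""
    return {
        "vs_trunk": round(active - trunk, 2),
        "vs_noop": round(active - noop, 2),
    }


DELTAS = {name: deltas(*rates) for name, rates in RUNS.items()}

for name, d in DELTAS.items():
    print(f"{name}: vs_trunk = {d['vs_trunk']:+.2f}, vs_noop = {d['vs_noop']:+.2f}")
```

For `smoke_v2` this gives a positive gap against trunk but a zero gap against noop, which is the situation the "not adapter-specific" note describes.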

### Exact `smoke_v5` eval tuning carried over to the held-out eval

- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`
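
For reuse in scripts, the settings above can be kept in one place. The sketch below collects them in a plain dictionary and serialises it to JSON. The name `SMOKE_V5_EVAL_TUNING` is hypothetical, and how the planner actually consumes these keys is not specified here.

```python
import json

# Values copied verbatim from the list above.
# The dict name is hypothetical; the planner's real config format is not
# described in this document.
SMOKE_V5_EVAL_TUNING = {
    "mode_preference_bonus": 0.75,
    "premature_retrieve_penalty": 0.5,
    "premature_insert_penalty": 0.25,
    "premature_maintain_penalty": 1.0,
    "occlusion_maintain_gap_min_access": 0.30,
    "occlusion_maintain_gap_min_visibility": 0.20,
    "retrieve_stage_access_threshold": 0.18,
    "retrieve_stage_reveal_threshold": 0.18,
    "retrieve_stage_support_threshold": 0.18,
}

# Serialise so the exact tuning can be stored next to a report.
TUNING_JSON = json.dumps(SMOKE_V5_EVAL_TUNING, indent=2, sort_keys=True)
print(TUNING_JSON)
```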

## Bag Retrieval Proxy

Benchmark:

- ManiSkill public bridge basket retrieval proxy

### Completed runs

- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json`
  - `0.32`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json`
  - `0.00`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json`
  - `0.48`

- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json`
  - `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json`
  - `0.08`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json`
  - `0.00`

### Seed-23 validation sweep

- `reports/maniskill_bag_bridge_val_sweep_seed23/summary.json`

Configs:

- `default`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.00`
- `less_bonus`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention preserved
- `conservative`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled
- `low_bonus_high_thresh`
  - `trunk = 0.125`
  - `noop = 0.125`
  - `active = 0.125`
  - intervention effectively disabled
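
The corrected held-out evals below use `less_bonus`. One plausible selection rule that yields this choice is sketched here: keep a config only if `active` does not regress against `trunk` and the adapter still intervenes. The rule itself, the `intervention_preserved` field, and the function name are assumptions made for illustration.

```python
# Seed-23 validation sweep results copied from the list above.
# `intervention_preserved` encodes the per-config notes; `default` carries no
# such note in the list, so it is left as None (unknown).
SWEEP = {
    "default": {
        "trunk": 0.125, "noop": 0.125, "active": 0.00,
        "intervention_preserved": None,
    },
    "less_bonus": {
        "trunk": 0.125, "noop": 0.125, "active": 0.125,
        "intervention_preserved": True,
    },
    "conservative": {
        "trunk": 0.125, "noop": 0.125, "active": 0.125,
        "intervention_preserved": False,
    },
    "low_bonus_high_thresh": {
        "trunk": 0.125, "noop": 0.125, "active": 0.125,
        "intervention_preserved": False,
    },
}


def eligible(cfg: dict) -> bool:
    """Assumed rule: no regression vs trunk, and the adapter still intervenes."""
    return cfg["active"] >= cfg["trunk"] and cfg["intervention_preserved"] is True


SELECTED = [name for name, cfg in SWEEP.items() if eligible(cfg)]
print(SELECTED)  # ['less_bonus']
```

Under this rule `default` is excluded for regressing to `0.00`, and the two remaining configs are excluded because they match trunk only by switching the intervention off.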

### Corrected held-out evals

- `reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json`
  - `trunk = 0.32`
  - `noop = 0.00`
  - `active = 0.48`
  - `delta = +0.16`
- `reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json`
  - `trunk = 0.48`
  - `noop = 0.08`
  - `active = 0.48`
  - `delta = +0.00`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
  - `trunk = 0.40`
  - `noop = 0.04`
  - `active = 0.48`
  - `delta = +0.08`
  - run-bootstrap CI `[0.00, 0.16]`
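
The 2-seed summary can be reproduced from the two per-seed deltas. The sketch below assumes the run-bootstrap resamples whole runs with replacement and takes a percentile interval of the mean delta; the report's actual procedure is not described here, so this is only one reading. With two runs, every resample can be enumerated exactly instead of sampled.

```python
from itertools import product
from math import ceil

# Per-seed held-out deltas (active - trunk) from the two corrected evals above.
RUN_DELTAS = [0.16, 0.00]


def exact_run_bootstrap_ci(deltas, alpha=0.05):
    """Percentile CI over all equally likely resamples of runs (assumed method)."""
    n = len(deltas)
    # Every resample of n runs drawn with replacement, as a sorted list of means.
    means = sorted(sum(sample) / n for sample in product(deltas, repeat=n))
    # Nearest-rank percentiles on the enumerated distribution.
    lo_rank = max(1, ceil((alpha / 2) * len(means)))
    hi_rank = ceil((1 - alpha / 2) * len(means))
    return means[lo_rank - 1], means[hi_rank - 1]


MEAN_DELTA = sum(RUN_DELTAS) / len(RUN_DELTAS)
CI = exact_run_bootstrap_ci(RUN_DELTAS)
print(f"mean delta = {MEAN_DELTA:+.2f}, CI = [{CI[0]:.2f}, {CI[1]:.2f}]")
# mean delta = +0.08, CI = [0.00, 0.16]
```

With only two runs the interval simply spans the smaller and larger per-seed delta, so it is a coarse bound on the effect.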

## Cloth Retrieval Proxy

Benchmark:

- ManiSkill public bridge cloth retrieval proxy

### Completed held-out seeds

- `seed17`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.10`
  - `intervention = 0.3369`
  - `non_base = 0.2674`
- `seed23`
  - `trunk = 0.04`
  - `noop = 0.02`
  - `active = 0.02`
  - `intervention = 0.0`
  - `non_base = 0.0`
- `seed29`
  - `trunk = 0.04`
  - `noop = 0.04`
  - `active = 0.04`
  - `intervention = 0.0`
  - `non_base = 0.0`

3-seed aggregate:

- `trunk = 0.0400`
- `noop = 0.0333`
- `active = 0.0533`
- `delta = +0.0133`
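
The aggregate is the unweighted mean of the three held-out seeds, with `delta` taken as `active - trunk`. A short sketch of that arithmetic, using illustrative variable names:

```python
from statistics import mean

# Per-seed held-out success rates copied from the list above.
SEEDS = {
    "seed17": {"trunk": 0.04, "noop": 0.04, "active": 0.10},
    "seed23": {"trunk": 0.04, "noop": 0.02, "active": 0.02},
    "seed29": {"trunk": 0.04, "noop": 0.04, "active": 0.04},
}

# Unweighted mean across seeds for each condition, rounded to 4 decimals.
AGGREGATE = {
    key: round(mean([rates[key] for rates in SEEDS.values()]), 4)
    for key in ("trunk", "noop", "active")
}
AGGREGATE["delta"] = round(AGGREGATE["active"] - AGGREGATE["trunk"], 4)

print(AGGREGATE)
# {'trunk': 0.04, 'noop': 0.0333, 'active': 0.0533, 'delta': 0.0133}
```

The whole positive delta comes from `seed17`; `seed23` is slightly negative and `seed29` is flat.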

### Seed-23 cloth validation sweep

- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`

Configs:

- `default`
  - `trunk = 0.25`
  - `noop = 0.125`
  - `active = 0.125`
  - `intervention = 0.0`
- `low_thresh`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 0.2`
  - `non_base = 0.0667`
- `very_low_thresh_less_bonus`
  - `active = 0.125`
  - `intervention = 1.0`
  - `non_base = 0.5333`

Interpretation:

- seed23 cloth was not recoverable by eval-side planner tuning alone: `active` stayed at `0.125` in every config, even with `intervention = 1.0`

## Single-Seed Combined Proxy Suite

- `reports/public_proxy_suite_smoke_v1/combined_summary.json`

Single-seed summary:

- occlusion proxy: `+0.58`
- bag proxy: `+0.16`
- cloth proxy: `+0.06`
- macro delta: `+0.267`
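
The macro delta matches an unweighted mean of the three proxy deltas, so each proxy counts equally regardless of how many episodes it contains. A quick check, with an illustrative variable layout:

```python
# Single-seed deltas copied from the summary above.
PROXY_DELTAS = {"occlusion": 0.58, "bag": 0.16, "cloth": 0.06}

# Macro average: plain mean over proxies, rounded to 3 decimals.
MACRO_DELTA = round(sum(PROXY_DELTAS.values()) / len(PROXY_DELTAS), 3)
print(f"macro delta = {MACRO_DELTA:+.3f}")  # macro delta = +0.267
```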

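This combined single-seed picture is kept for historical context. The corrected and multi-seed results above support a more reliable reading:

- occlusion: strong (`+0.58` on the tuned held-out eval)
- bag: modestly positive across the corrected 2-seed evaluation (`+0.08`)
- cloth: weak and inconclusive across 3 seeds (`+0.0133`)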