---
tags:
  - robotics
  - rlbench
  - benchmarking
  - label-validation
---

# VLAdaptorBench

This repository contains the benchmark setup, metric code, debug history, and validation artifacts for the proposed VLA + adaptor label study on `bimanual_take_tray_out_of_oven`.

This is still a label-validation repository, not a policy repository. No `pi0.5` integration is included here.

## Current Status

The latest work behind this upload produced:

- `metric_iter30_full100_single_pass_full_logging_fixed_templates_merged`
  - merged 100-episode dense/fuller-logging result tree from the single-pass fixed-template run

The current Hub upload includes:

- `artifacts/results/metric_iter31_sample10_all_metrics_verify/`
  - compact 10-episode verification subset with `all_metrics` GIFs only
- the fast `all_metrics`-only render path in:
  - `code/scripts/render_oven_metric_frame.py`
  - `code/scripts/render_oven_metric_gifs.py`

The new sample verification bundle is meant to be the quickest remote sanity-check entry point. It includes the sampled dense/keyframe tables, per-episode metrics, fuller debug sidecars, fixed templates, selection metadata, and one compact full-metrics GIF per sampled episode.

The earlier `metric_iter29_ep0_single_pass_full_logging_fixed_templates` validation pass for episode 0 remains the detailed single-episode reference for the fuller debug logging and the debug-aware GIF renderer.

That run keeps the trusted `iter24` template bundle fixed, adds the fuller dense/debug logging in a single pass, and regenerates the episode-0 visualization suite from the richer artifact. It is the current reference for:

- the `episode0.debug.jsonl` sidecar with per-frame `p_pre` and `p_ext` internals
- the single-pass dense CSV with fuller logged sub-metrics
- the updated `path_quality_focus` GIF that now exposes the `p_ext` milestone search, milestone scores, and planner outcomes directly in the visualization

The earlier `metric_iter24_*_door_contact_geom` reruns for episodes 0 and 1 remain the trusted baseline for the repaired oven metrics.

Those reruns fix the main simulator-state bugs that were still contaminating the oven metrics:

1. The reveal-to-retrieve transition used to occur too late, effectively at grasp time.
2. The visibility metric used to drop to zero around frame 232 even when the tray grasp region was clearly visible in `wrist_left`.
3. `p_pre` stayed near zero until grasp.
4. Extraction labels could flicker or drift because oracle rollouts were not restoring the simulator state exactly.
5. The old dense runner's restore-heavy path could still bias later frames after an oracle call.

The current code addresses those issues by:

- decoding RLBench mask PNGs correctly before converting them back to simulator handles
- scoring visibility directly from mask-handle agreement instead of the old depth/z heuristic
- inferring tray mask handles from grasp-region projections
- deriving a late-window pregrasp approach template instead of accidentally including frame-8 arm poses
- adding explicit `pregrasp_progress`, `pregrasp_distance`, `pregrasp_speed`, and `phase_score`
- making the repair path batch frames sequentially per worker so late-frame rows do not drift
- snapshotting and restoring exact arm joints, gripper joints, and the full grasped-object subtree
- supporting and now preferring `--independent-replay` for the authoritative dense study
- tightening `y_pre` so it stays on once the retriever is clearly inside the pregrasp corridor
- retuning `phase_score` so it tracks the reveal-to-retrieve handoff instead of generic early motion
- recomputing intervention validity from isolated per-frame env replays instead of the old live-cache path
- sampling intervention states earlier in the reveal phase so pre-ready extract checks are not contaminated by borderline near-ready states
- confirming extraction feasibility with repeated planner checks inside the extract oracle so one lucky planner sample is less likely to flip a label

The old `iter4_*`, `iter6_*`, `iter19_*`, and `iter22_*` outputs are still useful historical checkpoints, but the current authoritative outputs are:

- `artifacts/results/metric_iter24_ep0_door_contact_geom/`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/`
- `artifacts/results/metric_iter29_ep0_single_pass_full_logging_fixed_templates/`

The main new fix in `iter24` is the assisted-door contact scoring inside `p_pre`:

- the old `ignore_collisions=True` branch treated oven-door contact as name-whitelisted and only checked the final door angle change
- the new scorer traces door contacts step-by-step, estimates the local door-surface normal from simulator geometry, scores whether the retriever is sliding along the door or pushing it open, and penalizes direct head-on contact or door-closing motion
- this specifically removes the false closed-door `p_pre` spike in episode 0 around frames `43-56` without collapsing the later pregrasp rise once the door is actually opening

The current repo state should therefore be treated as the repaired benchmark snapshot with geometry-aware door assistance, not the final metric design.

Brief caveat: the current `y_ready` label still gates on low oven-door angular speed after extraction feasibility persists. In this task, the retriever arm can legitimately nudge the door while already committing to retrieval, so `y_ready` can still switch later than the true reveal-to-retrieve boundary. For the current oven benchmark, `y_ready` should therefore not be treated as a decisive validation metric or a trusted phase-switch target.

The oven task also has a highly structured reveal-to-retrieve handoff in the expert demos: both arms reposition, the revealer opens and clears the door, then the retriever commits. Because that phase pattern is so standardized, good results on this task are most useful as a task-specific smoke test or a "does the adaptor beat a base finetune here?" check, not as strong evidence of general reveal-and-retrieve reasoning.

## What Is In This Upload

- `code/rr_label_study/`
  - Core metric code.
  - Dense replay, visibility scoring, pregrasp/extraction oracles, keyframe extraction, intervention checks, and summary metric computation.
- `code/scripts/`
  - Study runners and helpers.
  - `run_oven_label_study.py`: dense/keyframe study runner.
  - `launch_parallel_oven_label_study.py`: multi-display worker launcher.
  - `recompute_oven_pregrasp_parallel.py`: targeted dense rerun for repaired `p_pre` labels.
  - `run_oven_pregrasp_batch.py`: sequential per-worker pregrasp recomputation helper.
  - `refresh_saved_oven_study.py`: recompute keyframes, per-episode metrics, intervention stats, and summary JSONs from saved dense CSVs after metric-code changes.
  - `run_oven_single_frame.py`: single-frame recomputation helper.
  - `run_oven_frame_batch.py`: new sequential batch recomputation helper used to avoid late-frame drift.
  - `repair_oven_episode_dense.py`: batched repair pass for suspicious dense rows.
  - `render_oven_metric_frame.py`: per-frame visualization renderer.
  - `render_oven_metric_gifs.py`: GIF renderer.
  - The visualization renderer now accepts either legacy `templates.pkl` files or the newer authoritative `templates.json` bundles.
- `artifacts/results/`
  - Full debug history, including stale runs and current validation outputs.
- `runtime_assets/`
  - Archived runtime assets needed to recreate this setup on another machine.
  - Includes the local oven-task dataset snapshot and the local `coppelia_sim` extraction used on this machine.
- `environment/`
  - Machine snapshot, env export, pip freeze, setup helpers, and dataset notes.
- `external/`
  - Local source snapshots of RLBench, PyRep, PerAct bimanual, and YARR used for this work.
- `MANIFEST.txt`
  - Flat file listing of the upload contents.

## Latest Metric Fixes

The latest code changes are in:

- `code/rr_label_study/oven_study.py`
- `code/scripts/recompute_oven_pregrasp_parallel.py`
- `code/scripts/run_oven_pregrasp_batch.py`
- `code/scripts/repair_oven_episode_dense.py`
- `code/scripts/run_oven_frame_batch.py`
- `code/scripts/render_oven_metric_frame.py`

The important changes are:

### 1. Visibility metric repair

- `_load_mask()` now rescales stored mask PNGs back to `[0, 1]` before calling `rgb_handles_to_mask`.
- Visibility is now computed by projecting grasp-region or whole-tray points into each camera and checking whether the decoded mask handle at the projected pixel matches the inferred tray handles.
- Template derivation now infers `mask_handle_ids` from reference frames near the actual pregrasp/grasp window.

This fixes the old failure where visibility dropped to zero even when the tray lip was visibly present in the wrist camera.
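
The mask-handle agreement check can be illustrated with a minimal sketch. This is not the code in `oven_study.py`: the function names, the pinhole camera parameters (`fx`, `fy`, `cx`, `cy`), and the assumption that points are already expressed in the camera frame are all hypothetical simplifications.

```python
def project_to_pixel(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point; None if behind the camera."""
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return round(fx * x / z + cx), round(fy * y / z + cy)  # (col, row)


def mask_handle_visibility(points_cam, handle_mask, tray_handles, fx, fy, cx, cy):
    """Fraction of points whose projected pixel carries one of the tray handles.

    handle_mask is a row-major grid (list of rows) of decoded integer handles.
    Hypothetical helper, written only to illustrate the scoring idea.
    """
    if not points_cam:
        return 0.0
    height, width = len(handle_mask), len(handle_mask[0])
    hits = 0
    for point in points_cam:
        pixel = project_to_pixel(point, fx, fy, cx, cy)
        if pixel is None:
            continue  # behind the camera: counts as not visible
        col, row = pixel
        inside = 0 <= row < height and 0 <= col < width
        if inside and handle_mask[row][col] in tray_handles:
            hits += 1
    return hits / len(points_cam)
```

In this formulation an occluded point simply projects onto a pixel owned by a different handle, so no separate depth threshold is needed.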

### 2. Pregrasp/path metric repair

- Template extraction now detects the pregrasp approach onset in a bounded late window before grasp instead of taking the first small negative slope in the entire episode.
- The template approach frames for episode 0 are now:
  - `177, 187, 197, 208, 218, 229, 232`
- `p_pre` now uses the last few approach templates plus explicit geometric progress toward the pregrasp pose instead of only brittle planner success.
- `y_pre` now treats "already inside the pregrasp corridor" as success, which is appropriate for this oracle study.
- The assisted pregrasp branch no longer treats oven-door collisions as a binary whitelist:
  - it traces per-step door contacts under `ignore_collisions=True`
  - estimates a local door-surface normal from the contacted simulator shape
  - rewards tangential or door-opening contact
  - penalizes head-on or door-closing contact
  - requires a minimum geometry-aware door-contact quality before assisted `p_pre` credit is given
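
One way to picture the geometry-aware door-contact scoring is the sketch below. It is an illustration under stated assumptions, not the scorer in `oven_study.py`: the function names, the `min_quality` threshold, the sign convention (positive `door_angle_delta` means the door opens further), and the worst-step gating rule are all hypothetical.

```python
import math


def _unit(vec):
    """Return the normalized vector, or None for a near-zero vector."""
    norm = math.sqrt(sum(c * c for c in vec))
    return None if norm < 1e-9 else tuple(c / norm for c in vec)


def contact_step_quality(tool_velocity, door_normal, door_angle_delta, angle_eps=1e-4):
    """Score one door-contact step in [0, 1] (hypothetical scoring rule).

    Assumed convention: door_angle_delta > 0 means the door opens further.
    """
    if door_angle_delta < -angle_eps:
        return 0.0  # door-closing motion is penalized
    if door_angle_delta > angle_eps:
        return 1.0  # pushing the door open is rewarded
    v, n = _unit(tool_velocity), _unit(door_normal)
    if v is None or n is None:
        return 0.0  # degenerate geometry: give no credit
    head_on = abs(sum(a * b for a, b in zip(v, n)))  # 1.0 = straight into the surface
    return 1.0 - head_on  # 1.0 = pure sliding along the door


def assisted_credit_allowed(step_qualities, min_quality=0.5):
    """Gate assisted p_pre credit on the worst contact step (assumed rule)."""
    return all(q >= min_quality for q in step_qualities)
```

With no contact steps at all, `assisted_credit_allowed([])` returns `True`, which matches the intuition that a contact-free approach needs no door-quality check.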

### 3. Replay/repair correctness

- The old isolated repair path replayed every suspicious frame from a fresh reset, which could corrupt late rows.
- The new helper `run_oven_frame_batch.py` computes frame rows sequentially inside a single env per worker.
- `repair_oven_episode_dense.py` now distributes frame batches, not individual frames, across displays.
- `SimulatorSnapshot` now restores:
  - arm joint trees and explicit joint positions
  - gripper joint trees and explicit joint positions
  - the full subtree under any grasped object
  - grasp attachments with the original release parent
- `ReplayCache` now keeps retrying stable grasp attachment while the demo gripper remains closed.

This fixes the major replay bug where post-oracle restores could leave the arm, gripper, or grasped tray in a subtly different state than the true demo frame.
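
The idea of distributing frame batches rather than individual frames can be sketched as follows. This is a hypothetical helper, not the logic in `repair_oven_episode_dense.py`; in particular, the mapping of worker `i` to X display `:{base_display + i}` and the even contiguous split are assumptions.

```python
def split_into_batches(frame_indices, num_workers, base_display):
    """Split frames into contiguous, ordered batches, one per worker display.

    Each worker then processes its batch sequentially inside a single env,
    instead of resetting the env for every individual frame.
    """
    frames = sorted(set(frame_indices))
    if not frames:
        return []
    workers = max(1, min(num_workers, len(frames)))
    size, extra = divmod(len(frames), workers)
    batches, start = [], 0
    for worker in range(workers):
        stop = start + size + (1 if worker < extra else 0)
        batches.append((f":{base_display + worker}", frames[start:stop]))
        start = stop
    return batches
```

For example, `split_into_batches(range(170, 180), 4, 170)` yields four batches of sizes 3, 3, 2, and 2 on displays `:170` through `:173`.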

### 4. Earlier phase signal

- The code now records:
  - `pregrasp_progress`
  - `pregrasp_distance`
  - `pregrasp_speed`
  - `phase_score`
- `phase_score` is now dominated by actual approach progress and `p_pre`, with a stricter threshold (`0.5`) so it no longer flips during the early reveal phase.
- `y_retrieve` is still oracle-like and monotone, but the metric side now has a cleaner approach-sensitive signal for early switching.

### 5. Independent replay

- `run_oven_label_study.py` already exposed `--independent-replay`.
- `launch_parallel_oven_label_study.py` now passes that flag through to worker runs.
- For the current oven study, independent replay is the trustworthy dense mode because it avoids cross-frame contamination from oracle rollouts.

### 6. Intervention validity repair

- The old intervention summary reused the dense-study replay cache, which could still corrupt post-ready extract checks.
- `_interventional_validity()` now evaluates each sampled intervention state from a fresh env/replay instance.
- `refresh_saved_oven_study.py` now supports `--dataset-root` so intervention metrics can be recomputed instead of copied forward from stale JSON.
- The refined intervention protocol now samples pre-ready states at `ready_onset-20` and `ready_onset-10` instead of `ready_onset-10` and `ready_onset-5`, which avoids counting borderline almost-ready states as generic reveal-phase interventions.
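
The change in pre-ready sampling offsets amounts to the following sketch. The function name and the handling of frames that would fall before the episode start are assumptions; only the offsets come from the description above.

```python
PRE_READY_OFFSETS = (20, 10)  # refined protocol; the older one used (10, 5)


def pre_ready_sample_frames(ready_onset, offsets=PRE_READY_OFFSETS, first_frame=0):
    """Return frames at ready_onset - offset, dropping any before the episode start.

    Dropping out-of-range frames is an assumption made for this sketch.
    """
    return [ready_onset - off for off in offsets if ready_onset - off >= first_frame]
```

With `ready_onset = 234`, the refined offsets sample frames 214 and 224, whereas the older offsets would have sampled 224 and 229, both much closer to the ready boundary.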

### 7. Extraction-oracle hardening

- `_extract_score_and_success()` now uses repeated planner checks before marking a milestone as feasible.
- The current configuration is intentionally modest:
  - `DEFAULT_PLAN_ATTEMPTS = 2`
  - `DEFAULT_PLAN_MIN_SUCCESSES = 2`
- This only hardens the extraction oracle, not the pregrasp score, so the dense study remains tractable while the noisy pre-ready extract successes are suppressed.
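
The repeated-check rule can be sketched as below. The two constants are taken from the configuration above; the function name `milestone_feasible` and the `try_plan` callable are hypothetical stand-ins for whatever `_extract_score_and_success()` does internally.

```python
DEFAULT_PLAN_ATTEMPTS = 2
DEFAULT_PLAN_MIN_SUCCESSES = 2


def milestone_feasible(try_plan,
                       attempts=DEFAULT_PLAN_ATTEMPTS,
                       min_successes=DEFAULT_PLAN_MIN_SUCCESSES):
    """Call try_plan() up to `attempts` times; require `min_successes` successes.

    try_plan is a hypothetical zero-argument callable returning True when a
    single planner query finds a path. Stops early once the outcome is decided.
    """
    successes = 0
    for attempt in range(attempts):
        if try_plan():
            successes += 1
        if successes >= min_successes:
            return True
        remaining = attempts - attempt - 1
        if successes + remaining < min_successes:
            return False  # cannot reach the required count anymore
    return False
```

With the defaults of 2 attempts and 2 required successes, a single failed planner sample is enough to reject the milestone, so one lucky success can no longer flip a label on its own.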

## Latest Validated Artifacts

The current trustworthy artifacts are:

- `artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.dense.csv`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.keyframes.csv`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.metrics.json`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/summary.json`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_all_metrics.gif`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_visibility_focus.gif`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_path_quality_focus.gif`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.dense.csv`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.keyframes.csv`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.metrics.json`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/summary.json`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_all_metrics.gif`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_visibility_focus.gif`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_path_quality_focus.gif`

- `artifacts/results/oven_episode0_iter4_templates/templates.json`
- `artifacts/results/oven_episode0_iter4_templates/templates.pkl`
- `artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv`
- `artifacts/results/oven_episode0_iter4_batch/frames/`
- `artifacts/results/oven_episode0_iter4_clean/iter4_targeted_comparison.csv`
- `artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv`
- `artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.keyframes.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json`
- `artifacts/results/oven_episode0_iter6_independent_full/summary.json`
- `artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif`
- `artifacts/results/manual_metric_checks/episode0_frame210_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame232_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame210_path.png`
- `artifacts/results/manual_metric_checks/episode6_frame230_path.png`
- `artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json`
- `artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_workers.json`

The `iter6_independent_full` CSVs and JSON summaries have been refreshed with the latest `phase_score` logic via `code/scripts/refresh_saved_oven_study.py`.

## Key Verified Findings

From the current independent-replay validation on episode 0:

- Visibility over the dense 170-234 window is clean:
  - min `three_view_visibility = 1.0`
  - min `full_view_visibility = 1.0`
- Pregrasp progress now rises well before grasp and stays predictive:
  - frame `210`: `pregrasp_progress ≈ 0.451`, `p_pre ≈ 0.185`, `y_pre = 0`
  - frame `215`: `pregrasp_progress ≈ 0.568`, `p_pre ≈ 0.375`, `y_pre = 1`
  - frame `220`: `pregrasp_progress ≈ 0.702`, `p_pre ≈ 0.496`, `y_pre = 1`
  - frame `225`: `pregrasp_progress ≈ 0.847`, `p_pre ≈ 0.559`, `y_pre = 1`
  - frame `230`: `pregrasp_progress ≈ 0.950`, `p_pre ≈ 0.654`, `y_pre = 1`
- Extraction feasibility is now separated from pregrasp:
  - frame `230`: `p_ext ≈ 0.0007`, `y_ext = 0`
  - frame `232`: `p_ext = 1.0`, `y_ext = 1`
  - frame `234`: `p_ext = 1.0`, `y_ext = 1`
- In the refreshed full independent episode-0 run:
  - `ppre_cross_frame = 216`
  - `pext_cross_frame = 232`
  - `phase_cross_frame = 214`
  - `retrieve_cross_frame = 215`
  - `ready_cross_frame = 234`
  - `single_switch_rate = 1.0`
  - `reversion_rate = 0.0`
  - `auroc_ppre_ypre ≈ 0.761`
  - `auprc_ppre_ypre ≈ 0.903`
  - `auroc_pext_yext = 1.0`
  - `auprc_pext_yext = 1.0`
  - `auroc_phase_yretrieve = 1.0`
  - `auprc_phase_yretrieve = 1.0`
  - `f1_phase_yretrieve ≈ 0.996`
  - `auroc_phase_yready ≈ 0.998`
  - `f1_phase_yready ≈ 0.905`
- In the refreshed isolated intervention check on episode 0:
  - pre-ready `open_more` increases `p_ext` on `2/2` sampled states
  - pre-ready `extract` succeeds on `0/2`
  - post-ready `extract` succeeds on `2/2`
  - post-ready `open_more` and `hold_open` both have low marginal gain on `2/2`
- The refreshed phase columns now place:
  - first `phase_switch` at frame `214`
  - first `y_retrieve` at frame `215`
  - first `y_ready` at frame `234`
- The refined 8-episode independent-replay smoke in `artifacts/results/iter12_parallel_smoke_8ep_refined/` shows:
  - `single_switch_rate = 1.0`
  - `reversion_rate = 0.0`
  - mean `auroc_ppre_ypre ≈ 0.809`
  - mean `auprc_ppre_ypre ≈ 0.924`
  - mean `auroc_pext_yext = 1.0`
  - mean `auprc_pext_yext = 1.0`
  - mean `f1_phase_yretrieve ≈ 0.996`
  - mean `f1_phase_yready ≈ 0.906`
  - mean dense boundary error to `y_retrieve` ≈ `0.88` frames
  - mean pre-ready extract success `= 0.0/2.0`
  - mean pre-ready wait extract success `= 0.0/2.0`
  - mean post-ready extract success `≈ 1.625/2.0`
- The main remaining limitation on this oven task is not a broken metric but task structure:
  - the grasp-region visibility metric is visually faithful but only weakly predictive because the tray lip is already visible early in many demos
  - time remains a very strong trivial baseline for `y_ext` on expert demos
  - `open_more` improves `p_ext` mainly near the reveal/retrieve boundary, not uniformly throughout the whole pre-ready window

See:

- `artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv`
- `artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv`
- `artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json`
- `artifacts/results/manual_metric_checks/episode0_frame210_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame232_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame210_path.png`
- `artifacts/results/manual_metric_checks/episode6_frame230_path.png`
- `artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json`
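
For readers who want to sanity-check summary values such as the crossing frames or the AUROC figures against a dense CSV, the sketch below shows one plausible way to compute them. The exact definitions used by the repository are not restated in this README, so the `>=` comparison, the default `0.5` threshold, and the tie handling are assumptions, and both function names are hypothetical.

```python
def first_cross_frame(frames, scores, threshold=0.5):
    """First frame whose score reaches the threshold, or None if it never does."""
    for frame, score in zip(frames, scores):
        if score >= threshold:
            return frame
    return None


def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    Ties count as half a win. Returns NaN when one class is missing.
    Quadratic in the number of rows, which is acceptable for one episode.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Feeding in a score column such as `p_pre` together with its label column `y_pre` gives a value that can be compared with the corresponding entry in `episode0.metrics.json`; small differences would point to a different threshold or tie convention rather than a bug.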

## Artifact Guide

### Current artifacts

- `oven_episode0_iter3_templates/`
  - First regenerated template bundle after the mask/approach fixes.
- `oven_episode0_iter4_templates/`
  - Current template bundle with the corrected late-window approach onset.
- `oven_episode0_iter4_clean/`
  - Isolated targeted frame checks used while diagnosing the old per-frame repair drift.
- `oven_episode0_iter4_batch/`
  - Current batched sequential repair validation.
- `oven_episode0_iter4_dense_geom_170_234.csv`
  - Dense sequential geometry and visibility sweep across the reveal-to-retrieve boundary.

### Historical artifacts

- `oven_episode0_repaired_v1/`
  - Useful historical reference, but not the current authoritative artifact.
  - It still contains the old late transition and old visibility/path issues.
- `oven_episode0_full*/`, `oven_to240_*/`, `oven_episode0_independent_v*/`
  - Debugging history from earlier iterations.
- `parallel_smoke_2x10/`
  - Xvfb/worker parallelization smoke test.
- `oven_smoke_*`
  - Early smoke runs.

## Repository Map

Relevant entry points:

- `code/rr_label_study/oven_study.py`
- `code/scripts/run_oven_label_study.py`
- `code/scripts/launch_parallel_oven_label_study.py`
- `code/scripts/run_oven_single_frame.py`
- `code/scripts/run_oven_frame_batch.py`
- `code/scripts/repair_oven_episode_dense.py`
- `code/scripts/render_oven_metric_frame.py`
- `code/scripts/render_oven_metric_gifs.py`

Relevant current artifacts:

- `artifacts/results/oven_episode0_iter4_templates/templates.json`
- `artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv`
- `artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/summary.json`
- `artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png`
- `artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif`

## Environment

This was run on:

- Ubuntu `22.04.5`
- Kernel `6.8.0-65-generic`
- `96` CPU cores visible
- `503 GiB` RAM visible
- `NVIDIA A40`

See:

- `environment/system_info.txt`
- `environment/repo_revisions.txt`
- `environment/conda_env_rlbench.yml`
- `environment/pip_freeze_rlbench.txt`

## Upstream Repos Used

Exact revisions are recorded in `environment/repo_revisions.txt`.

The local run used:

- `markusgrotz/RLBench`
- `markusgrotz/PyRep`
- `markusgrotz/peract_bimanual`
- `markusgrotz/YARR`

Those source snapshots are included under `external/`.

## Reproducing On The Same Hardware Class

1. Read `environment/dataset_notes.txt`.
2. Run `environment/setup_same_hardware.sh /workspace`.
3. Source `environment/activate_rlbench_runtime.sh /workspace`.
4. Run the dense study:

```bash
python /workspace/VLAdaptorBench_upload/code/scripts/run_oven_label_study.py \
  --dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
  --result-dir /workspace/tmp_run \
  --max-episodes 1 \
  --checkpoint-stride 16 \
  --template-episode-index 0 \
  --independent-replay
```

5. If you want to repair suspicious frames in parallel with the new batched path:

```bash
python /workspace/VLAdaptorBench_upload/code/scripts/repair_oven_episode_dense.py \
  --dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
  --episode-dir /workspace/data/bimanual_take_tray_out_of_oven_train_128/all_variations/episodes/episode0 \
  --input-dense-csv /workspace/tmp_run/episode0.dense.csv \
  --output-dir /workspace/tmp_run_repaired \
  --checkpoint-stride 16 \
  --num-workers 4 \
  --base-display 170
```

## Important Note

The full 100-episode independent-replay run is not yet the authoritative artifact in this upload. The current repository state documents the repaired metric code, the exact snapshot/restore fixes, and the episode-0 independent validation that is required before scaling to the full study.

## Dataset Note

The RLBench demonstration dataset itself is not re-uploaded here. This repository contains the study code and generated artifacts only. The expected dataset path is documented in `environment/dataset_notes.txt`.

CoppeliaSim binaries are not included. The setup helpers expect a local extraction at `/workspace/coppelia_sim`.