---
tags:
  - robotics
  - rlbench
  - benchmarking
  - label-validation
---

# VLAdaptorBench

This repository contains the benchmark setup, metric code, debug history, and validation artifacts for the proposed VLA + adaptor label study on `bimanual_take_tray_out_of_oven`.

This is still a label-validation repository, not a policy repository. No `pi0.5` integration is included here.

## Current Status

The latest work behind this upload produced:

- `metric_iter30_full100_single_pass_full_logging_fixed_templates_merged`
  - merged 100-episode dense/fuller-logging result tree from the single-pass fixed-template run

The current Hub upload includes:

- `artifacts/results/metric_iter31_sample10_all_metrics_verify/`
  - compact 10-episode verification subset with `all_metrics` GIFs only
- the fast `all_metrics`-only render path in:
  - `code/scripts/render_oven_metric_frame.py`
  - `code/scripts/render_oven_metric_gifs.py`

The new sample verification bundle is meant to be the quickest remote sanity-check entry point. It includes the sampled dense/keyframe tables, per-episode metrics, fuller debug sidecars, fixed templates, selection metadata, and one compact full-metrics GIF per sampled episode.

The earlier `metric_iter29_ep0_single_pass_full_logging_fixed_templates` validation pass for episode 0 remains the detailed single-episode reference for the fuller debug logging and the debug-aware GIF renderer.

That run keeps the trusted `iter24` template bundle fixed, adds the fuller dense/debug logging in a single pass, and regenerates the episode-0 visualization suite from the richer artifact. It is the current reference for:

- the `episode0.debug.jsonl` sidecar with per-frame `p_pre` and `p_ext` internals
- the single-pass dense CSV with fuller logged sub-metrics
- the updated `path_quality_focus` GIF that now exposes the `p_ext` milestone search, milestone scores, and planner outcomes directly in the visualization

The earlier `metric_iter24_*_door_contact_geom` reruns for episodes 0 and 1 remain the trusted baseline for the repaired oven metrics.

Those reruns fix the main simulator-state bugs that were still contaminating the oven metrics:

1. The reveal-to-retrieve transition used to occur too late, effectively at grasp time.
2. The visibility metric used to drop to zero around frame 232 even when the tray grasp region was clearly visible in `wrist_left`.
3. `p_pre` stayed near zero until grasp.
4. Extraction labels could flicker or drift because oracle rollouts were not restoring the simulator state exactly.
5. The old dense runner's restore-heavy path could still bias later frames after an oracle call.

The current code addresses those issues by:

- decoding RLBench mask PNGs correctly before converting them back to simulator handles
- scoring visibility directly from mask-handle agreement instead of the old depth/z heuristic
- inferring tray mask handles from grasp-region projections
- deriving a late-window pregrasp approach template instead of accidentally including frame-8 arm poses
- adding explicit `pregrasp_progress`, `pregrasp_distance`, `pregrasp_speed`, and `phase_score`
- making the repair path batch frames sequentially per worker so late-frame rows do not drift
- snapshotting and restoring exact arm joints, gripper joints, and the full grasped-object subtree
- supporting and now preferring `--independent-replay` for the authoritative dense study
- tightening `y_pre` so it stays on once the retriever is clearly inside the pregrasp corridor
- retuning `phase_score` so it tracks the reveal-to-retrieve handoff instead of generic early motion
- recomputing intervention validity from isolated per-frame env replays instead of the old live-cache path
- sampling intervention states earlier in the reveal phase so pre-ready extract checks are not contaminated by borderline near-ready states
- confirming extraction feasibility with repeated planner checks inside the extract oracle so one lucky planner sample is less likely to flip a label

The old `iter4_*`, `iter6_*`, `iter19_*`, and `iter22_*` outputs are still useful historical checkpoints, but the current authoritative outputs are:

- `artifacts/results/metric_iter24_ep0_door_contact_geom/`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/`
- `artifacts/results/metric_iter29_ep0_single_pass_full_logging_fixed_templates/`

The main new fix in `iter24` is the assisted-door contact scoring inside `p_pre`:

- the old `ignore_collisions=True` branch treated oven-door contact as name-whitelisted and only checked the final door angle change
- the new scorer traces door contacts step-by-step, estimates the local door-surface normal from simulator geometry, scores whether the retriever is sliding along the door or pushing it open, and penalizes direct head-on contact or door-closing motion
- this specifically removes the false closed-door `p_pre` spike in episode 0 around frames `43-56` without collapsing the later pregrasp rise once the door is actually opening

The current repo state should therefore be treated as the repaired benchmark snapshot with geometry-aware door assistance, not the final metric design.

Brief caveat: the current `y_ready` label still gates on low oven-door angular speed after extraction feasibility persists. In this task, the retriever arm can legitimately nudge the door while already committing to retrieval, so `y_ready` can still switch later than the true reveal-to-retrieve boundary. For the current oven benchmark, `y_ready` should therefore not be treated as a decisive validation metric or a trusted phase-switch target.

The oven task also has a highly structured reveal-to-retrieve handoff in the expert demos: both arms reposition, the revealer opens and clears the door, then the retriever commits. Because that phase pattern is so standardized, good results on this task are most useful as a task-specific smoke test or a "does the adaptor beat a base finetune here?" check, not as strong evidence of general reveal-and-retrieve reasoning.

## What Is In This Upload

- `code/rr_label_study/`
  - Core metric code.
  - Dense replay, visibility scoring, pregrasp/extraction oracles, keyframe extraction, intervention checks, and summary metric computation.
- `code/scripts/`
  - Study runners and helpers.
  - `run_oven_label_study.py`: dense/keyframe study runner.
  - `launch_parallel_oven_label_study.py`: multi-display worker launcher.
  - `recompute_oven_pregrasp_parallel.py`: targeted dense rerun for repaired `p_pre` labels.
  - `run_oven_pregrasp_batch.py`: sequential per-worker pregrasp recomputation helper.
  - `refresh_saved_oven_study.py`: recompute keyframes, per-episode metrics, intervention stats, and summary JSONs from saved dense CSVs after metric-code changes.
  - `run_oven_single_frame.py`: single-frame recomputation helper.
  - `run_oven_frame_batch.py`: new sequential batch recomputation helper used to avoid late-frame drift.
  - `repair_oven_episode_dense.py`: batched repair pass for suspicious dense rows.
  - `render_oven_metric_frame.py`: per-frame visualization renderer.
  - `render_oven_metric_gifs.py`: GIF renderer.
  - The visualization renderer now accepts either legacy `templates.pkl` files or the newer authoritative `templates.json` bundles.
- `artifacts/results/`
  - Full debug history, including stale runs and current validation outputs.
- `runtime_assets/`
  - Archived runtime assets needed to recreate this setup on another machine.
  - Includes the local oven-task dataset snapshot and the local `coppelia_sim` extraction used on this machine.
- `environment/`
  - Machine snapshot, env export, pip freeze, setup helpers, and dataset notes.
- `external/`
  - Local source snapshots of RLBench, PyRep, PerAct bimanual, and YARR used for this work.
- `MANIFEST.txt`
  - Flat file listing of the upload contents.

## Latest Metric Fixes

The latest code changes are in:

- `code/rr_label_study/oven_study.py`
- `code/scripts/recompute_oven_pregrasp_parallel.py`
- `code/scripts/run_oven_pregrasp_batch.py`
- `code/scripts/repair_oven_episode_dense.py`
- `code/scripts/run_oven_frame_batch.py`
- `code/scripts/render_oven_metric_frame.py`

The important changes are:

### 1. Visibility metric repair

- `_load_mask()` now rescales stored mask PNGs back to `[0, 1]` before calling `rgb_handles_to_mask`.
- Visibility is now computed by projecting grasp-region or whole-tray points into each camera and checking whether the decoded mask handle at the projected pixel matches the inferred tray handles.
- Template derivation now infers `mask_handle_ids` from reference frames near the actual pregrasp/grasp window.

This fixes the old failure where visibility dropped to zero even when the tray lip was visibly present in the wrist camera.
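
A minimal sketch of the mask-handle agreement idea, not the repository's `_load_mask()` or visibility code. It assumes the RLBench convention that a handle is packed into RGB as `R + 256*G + 256*256*B`, and it takes already-projected pixel coordinates as input. The names `decode_mask_handles` and `mask_visibility` are hypothetical.

```python
import numpy as np

def decode_mask_handles(mask_png_uint8: np.ndarray) -> np.ndarray:
    """Decode an (H, W, 3) uint8 mask image into integer object handles."""
    # Rescale to [0, 1] first, mirroring the float input an RGB-to-handle decoder expects.
    scaled = mask_png_uint8.astype(np.float64) / 255.0
    # Undo the scaling and pack the channels (assumed encoding, see lead-in).
    rgb = np.rint(scaled * 255.0).astype(np.int64)
    return rgb[..., 0] + rgb[..., 1] * 256 + rgb[..., 2] * 256 * 256

def mask_visibility(handles: np.ndarray, pixels_uv, tray_handles) -> float:
    """Fraction of projected points whose mask handle belongs to the tray.

    pixels_uv holds (u, v) = (column, row) pairs; points that fall outside
    the image count as not visible.
    """
    pixels = np.asarray(pixels_uv, dtype=np.int64).reshape(-1, 2)
    if len(pixels) == 0:
        return 0.0
    height, width = handles.shape
    u, v = pixels[:, 0], pixels[:, 1]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    hits = np.zeros(len(pixels), dtype=bool)
    hits[inside] = np.isin(handles[v[inside], u[inside]], list(tray_handles))
    return float(hits.mean())
```

Skipping the rescale step would feed raw `0..255` values into a decoder that multiplies by 255 again, which yields handles that match no simulator object. That is one plausible way a visibility score could collapse to zero while the tray is plainly in view.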

### 2. Pregrasp/path metric repair

- Template extraction now detects the pregrasp approach onset in a bounded late window before grasp instead of taking the first small negative slope in the entire episode.
- The current template approach frames for episode 0 are now:
  - `177, 187, 197, 208, 218, 229, 232`
- `p_pre` now uses the last few approach templates plus explicit geometric progress toward the pregrasp pose instead of only brittle planner success.
- `y_pre` now treats "already inside the pregrasp corridor" as success, which is appropriate for this oracle study.
- The assisted pregrasp branch no longer treats oven-door collisions as a binary whitelist:
  - it traces per-step door contacts under `ignore_collisions=True`
  - estimates a local door-surface normal from the contacted simulator shape
  - rewards tangential or door-opening contact
  - penalizes head-on or door-closing contact
  - requires a minimum geometry-aware door-contact quality before assisted `p_pre` credit is given
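
The geometry-aware contact idea can be illustrated with a per-step score. This is only a sketch under stated assumptions: the function name, the inputs, and the exact formula are hypothetical and are not taken from `oven_study.py`. It assumes each contact step provides the retriever's velocity, the local door-surface normal, and the direction in which the door surface moves when opening.

```python
import numpy as np

def door_contact_quality(velocity, door_normal, opening_direction) -> float:
    """Score one door-contact step in [0, 1] (hypothetical formula).

    High when the motion slides along the door or pushes it open,
    low when it hits the door head-on or pushes it closed.
    """
    v = np.asarray(velocity, dtype=float)
    speed = np.linalg.norm(v)
    if speed < 1e-9:
        return 0.0  # no motion, so no credit in this sketch
    v_hat = v / speed
    n_hat = np.asarray(door_normal, dtype=float)
    n_hat = n_hat / np.linalg.norm(n_hat)
    o_hat = np.asarray(opening_direction, dtype=float)
    o_hat = o_hat / np.linalg.norm(o_hat)

    head_on = abs(float(v_hat @ n_hat))            # 1.0 = straight into the surface
    tangential = float(np.sqrt(max(0.0, 1.0 - head_on ** 2)))
    opening = float(v_hat @ o_hat)                 # > 0 opens, < 0 closes

    if opening > 0.0:
        score = max(tangential, opening)           # sliding or opening both earn credit
    else:
        score = tangential * (1.0 + opening)       # closing motion shrinks the credit
    return min(1.0, max(0.0, score))
```

A path-level gate could then require, for example, that the mean or minimum of these per-step scores exceeds a threshold before assisted `p_pre` credit is granted. The actual aggregation and threshold used by the repository are not stated here.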

### 3. Replay/repair correctness

- The old isolated repair path replayed every suspicious frame from a fresh reset, which could corrupt late rows.
- The new helper `run_oven_frame_batch.py` computes frame rows sequentially inside a single env per worker.
- `repair_oven_episode_dense.py` now distributes frame batches, not individual frames, across displays.
- `SimulatorSnapshot` now restores:
  - arm joint trees and explicit joint positions
  - gripper joint trees and explicit joint positions
  - the full subtree under any grasped object
  - grasp attachments with the original release parent
- `ReplayCache` now keeps retrying stable grasp attachment while the demo gripper remains closed.

This fixed the major replay bug where post-oracle restores could leave the arm, gripper, or grasped tray in a subtly different state than the true demo frame.
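
The batching idea can be sketched as follows: suspicious frames are sorted and split into contiguous batches, one per worker, so that each worker steps forward through its frames inside a single environment. The helper names and the `:<number>` display naming are assumptions for illustration, not the interface of `repair_oven_episode_dense.py`.

```python
def contiguous_batches(frames, num_workers):
    """Split frame indices into at most num_workers sorted, contiguous batches."""
    ordered = sorted(set(frames))
    n = min(num_workers, len(ordered))
    if n <= 0:
        return []
    size, extra = divmod(len(ordered), n)
    batches, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)  # spread the remainder
        batches.append(ordered[start:end])
        start = end
    return batches

def display_assignments(frames, num_workers, base_display):
    """Map an X display name (assumed ':<number>' form) to each frame batch."""
    return {
        f":{base_display + i}": batch
        for i, batch in enumerate(contiguous_batches(frames, num_workers))
    }

print(display_assignments([232, 210, 215, 220, 230], num_workers=2, base_display=170))
```

Keeping each batch sorted means a worker only ever advances the replay, which is the property the sequential helper relies on to avoid late-frame drift.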

### 4. Earlier phase signal

- The code now records:
  - `pregrasp_progress`
  - `pregrasp_distance`
  - `pregrasp_speed`
  - `phase_score`
- `phase_score` is now dominated by actual approach progress and `p_pre`, with a stricter threshold (`0.5`) so it no longer flips during the early reveal phase.
- `y_retrieve` is still oracle-like and monotone, but the metric side now has a cleaner approach-sensitive signal for early switching.
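
To make the threshold and monotonicity language concrete, here is a small sketch of how a crossing frame and a latched label can be derived from a per-frame score. The `0.5` threshold comes from the text above. The function names and the transition counting are illustrative assumptions and may differ from how the repository computes `phase_cross_frame`, `single_switch_rate`, or `reversion_rate`.

```python
def first_cross_frame(scores, threshold=0.5):
    """Index of the first frame whose score reaches the threshold, else None."""
    for frame, score in enumerate(scores):
        if score >= threshold:
            return frame
    return None

def monotone_latch(scores, threshold=0.5):
    """Binary label that switches on at the first crossing and never reverts."""
    labels, on = [], False
    for score in scores:
        on = on or score >= threshold
        labels.append(int(on))
    return labels

def count_transitions(labels):
    """Return (total switches, number of 1 -> 0 reversions) in a binary sequence."""
    switches = sum(a != b for a, b in zip(labels, labels[1:]))
    reversions = sum(a == 1 and b == 0 for a, b in zip(labels, labels[1:]))
    return switches, reversions

scores = [0.1, 0.4, 0.6, 0.45, 0.9]          # made-up values for illustration
raw = [int(s >= 0.5) for s in scores]         # un-latched thresholding
print(first_cross_frame(scores), monotone_latch(scores), count_transitions(raw))
```

An episode with exactly one switch and no reversions in the un-latched signal is the clean case. The latched label has one switch by construction, so it cannot by itself show that the underlying score is stable.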

### 5. Independent replay

- `run_oven_label_study.py` already exposed `--independent-replay`.
- `launch_parallel_oven_label_study.py` now passes that flag through to worker runs.
- For the current oven study, independent replay is the trustworthy dense mode because it avoids cross-frame contamination from oracle rollouts.

### 6. Intervention validity repair

- The old intervention summary reused the dense-study replay cache, which could still corrupt post-ready extract checks.
- `_interventional_validity()` now evaluates each sampled intervention state from a fresh env/replay instance.
- `refresh_saved_oven_study.py` now supports `--dataset-root` so intervention metrics can be recomputed instead of copied forward from stale JSON.
- The refined intervention protocol now samples pre-ready states at `ready_onset-20` and `ready_onset-10` instead of `ready_onset-10` and `ready_onset-5`, which avoids counting borderline almost-ready states as generic reveal-phase interventions.
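
The revised sampling rule amounts to a pair of fixed offsets before the ready onset. A minimal sketch, with a hypothetical helper name and a clamp to the start of the episode added as an assumption:

```python
def pre_ready_sample_frames(ready_onset, offsets=(20, 10), first_frame=0):
    """Frames at ready_onset - offset, dropping any that precede the episode start."""
    return [ready_onset - o for o in offsets if ready_onset - o >= first_frame]

# Illustration only: a ready onset at frame 234 would give frames 214 and 224.
print(pre_ready_sample_frames(234))
```

Compared with the older `(10, 5)` offsets, the sampled states sit further from the boundary, so a failed pre-ready `extract` is less likely to reflect a state that was already almost ready.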

### 7. Extraction-oracle hardening

- `_extract_score_and_success()` now uses repeated planner checks before marking a milestone as feasible.
- The current configuration is intentionally modest:
  - `DEFAULT_PLAN_ATTEMPTS = 2`
  - `DEFAULT_PLAN_MIN_SUCCESSES = 2`
- This only hardens the extraction oracle, not the pregrasp score, so the dense study remains tractable while the noisy pre-ready extract successes are suppressed.
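
The repeated-check rule can be sketched independently of the planner. The two constants below mirror the values quoted above. `milestone_feasible` and the `plan_once` callable are hypothetical stand-ins for whatever `_extract_score_and_success()` does internally.

```python
DEFAULT_PLAN_ATTEMPTS = 2
DEFAULT_PLAN_MIN_SUCCESSES = 2

def milestone_feasible(plan_once,
                       attempts=DEFAULT_PLAN_ATTEMPTS,
                       min_successes=DEFAULT_PLAN_MIN_SUCCESSES):
    """Call plan_once() up to `attempts` times; feasible only if enough succeed.

    plan_once is any zero-argument callable returning True on planner success.
    """
    successes = 0
    for i in range(attempts):
        if plan_once():
            successes += 1
        if successes >= min_successes:
            return True
        remaining = attempts - i - 1
        if successes + remaining < min_successes:
            return False  # cannot reach the quota any more, stop early
    return False
```

With both constants set to `2`, a single lucky planner sample is not enough: both attempts must succeed before the milestone counts as feasible.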

## Latest Validated Artifacts

The current trustworthy artifacts are:

- `artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.dense.csv`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.keyframes.csv`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/episode0.metrics.json`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/summary.json`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_all_metrics.gif`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_visibility_focus.gif`
- `artifacts/results/metric_iter24_ep0_door_contact_geom/visualizations/episode0_path_quality_focus.gif`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.dense.csv`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.keyframes.csv`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/episode1.metrics.json`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/summary.json`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_all_metrics.gif`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_visibility_focus.gif`
- `artifacts/results/metric_iter24_ep1_door_contact_geom/visualizations/episode1_path_quality_focus.gif`

- `artifacts/results/oven_episode0_iter4_templates/templates.json`
- `artifacts/results/oven_episode0_iter4_templates/templates.pkl`
- `artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv`
- `artifacts/results/oven_episode0_iter4_batch/frames/`
- `artifacts/results/oven_episode0_iter4_clean/iter4_targeted_comparison.csv`
- `artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv`
- `artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.keyframes.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json`
- `artifacts/results/oven_episode0_iter6_independent_full/summary.json`
- `artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif`
- `artifacts/results/manual_metric_checks/episode0_frame210_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame232_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame210_path.png`
- `artifacts/results/manual_metric_checks/episode6_frame230_path.png`
- `artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json`
- `artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_workers.json`

The `iter6_independent_full` CSVs and JSON summaries have been refreshed with the latest `phase_score` logic via `code/scripts/refresh_saved_oven_study.py`.

## Key Verified Findings

From the current independent-replay validation on episode 0:

- Visibility over the dense 170-234 window is clean:
  - min `three_view_visibility = 1.0`
  - min `full_view_visibility = 1.0`
- Pregrasp progress now rises well before grasp and stays predictive:
  - frame `210`: `pregrasp_progress ≈ 0.451`, `p_pre ≈ 0.185`, `y_pre = 0`
  - frame `215`: `pregrasp_progress ≈ 0.568`, `p_pre ≈ 0.375`, `y_pre = 1`
  - frame `220`: `pregrasp_progress ≈ 0.702`, `p_pre ≈ 0.496`, `y_pre = 1`
  - frame `225`: `pregrasp_progress ≈ 0.847`, `p_pre ≈ 0.559`, `y_pre = 1`
  - frame `230`: `pregrasp_progress ≈ 0.950`, `p_pre ≈ 0.654`, `y_pre = 1`
- Extraction feasibility is now separated from pregrasp:
  - frame `230`: `p_ext ≈ 0.0007`, `y_ext = 0`
  - frame `232`: `p_ext = 1.0`, `y_ext = 1`
  - frame `234`: `p_ext = 1.0`, `y_ext = 1`
- In the refreshed full independent episode-0 run:
  - `ppre_cross_frame = 216`
  - `pext_cross_frame = 232`
  - `phase_cross_frame = 214`
  - `retrieve_cross_frame = 215`
  - `ready_cross_frame = 234`
  - `single_switch_rate = 1.0`
  - `reversion_rate = 0.0`
  - `auroc_ppre_ypre ≈ 0.761`
  - `auprc_ppre_ypre ≈ 0.903`
  - `auroc_pext_yext = 1.0`
  - `auprc_pext_yext = 1.0`
  - `auroc_phase_yretrieve = 1.0`
  - `auprc_phase_yretrieve = 1.0`
  - `f1_phase_yretrieve ≈ 0.996`
  - `auroc_phase_yready ≈ 0.998`
  - `f1_phase_yready ≈ 0.905`
- In the refreshed isolated intervention check on episode 0:
  - pre-ready `open_more` increases `p_ext` on `2/2` sampled states
  - pre-ready `extract` succeeds on `0/2`
  - post-ready `extract` succeeds on `2/2`
  - post-ready `open_more` and `hold_open` both have low marginal gain on `2/2`
- The refreshed phase columns now place:
  - `first phase_switch` at frame `214`
  - `first y_retrieve` at frame `215`
  - `first y_ready` at frame `234`
- The refined 8-episode independent-replay smoke in `artifacts/results/iter12_parallel_smoke_8ep_refined/` shows:
  - `single_switch_rate = 1.0`
  - `reversion_rate = 0.0`
  - mean `auroc_ppre_ypre ≈ 0.809`
  - mean `auprc_ppre_ypre ≈ 0.924`
  - mean `auroc_pext_yext = 1.0`
  - mean `auprc_pext_yext = 1.0`
  - mean `f1_phase_yretrieve ≈ 0.996`
  - mean `f1_phase_yready ≈ 0.906`
  - mean dense boundary error to `y_retrieve` ≈ `0.88` frames
  - mean pre-ready extract success `= 0.0/2.0`
  - mean pre-ready wait extract success `= 0.0/2.0`
  - mean post-ready extract success `≈ 1.625/2.0`
- The main remaining limitation on this oven task is not a broken metric but task structure:
  - the grasp-region visibility metric is visually faithful but only weakly predictive because the tray lip is already visible early in many demos
  - time remains a very strong trivial baseline for `y_ext` on expert demos
  - `open_more` improves `p_ext` mainly near the reveal/retrieve boundary, not uniformly throughout the whole pre-ready window
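
For readers who want to re-derive ranking metrics such as `auroc_ppre_ypre` by hand, here is a dependency-free sketch of AUROC as the probability that a positive frame outscores a negative one, with ties counted as half. It is not the repository's metric code. The example reuses only the five `p_pre`/`y_pre` pairs listed above for frames 210 to 230, so its result describes that small window and not the full-episode figure.

```python
def auroc(scores, labels):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties = 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Frames 210, 215, 220, 225, 230 from the list above.
p_pre = [0.185, 0.375, 0.496, 0.559, 0.654]
y_pre = [0, 1, 1, 1, 1]
print(auroc(p_pre, y_pre))  # 1.0 on these five frames
```

The quadratic loop is fine for a single episode's dense rows. A rank-based implementation would be preferable for much larger inputs.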

See:

- `artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv`
- `artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv`
- `artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.metrics.json`
- `artifacts/results/manual_metric_checks/episode0_frame210_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame232_visibility.png`
- `artifacts/results/manual_metric_checks/episode0_frame210_path.png`
- `artifacts/results/manual_metric_checks/episode6_frame230_path.png`
- `artifacts/results/iter12_parallel_smoke_8ep_refined/parallel_summary.json`

## Artifact Guide

### Current artifacts

- `oven_episode0_iter3_templates/`
  - First regenerated template bundle after the mask/approach fixes.
- `oven_episode0_iter4_templates/`
  - Current template bundle with the corrected late-window approach onset.
- `oven_episode0_iter4_clean/`
  - Isolated targeted frame checks used while diagnosing the old per-frame repair drift.
- `oven_episode0_iter4_batch/`
  - Current batched sequential repair validation.
- `oven_episode0_iter4_dense_geom_170_234.csv`
  - Dense sequential geometry and visibility sweep across the reveal-to-retrieve boundary.

### Historical artifacts

- `oven_episode0_repaired_v1/`
  - Useful historical reference, but not the current authoritative artifact.
  - It still contains the old late transition and old visibility/path issues.
- `oven_episode0_full*/`, `oven_to240_*/`, `oven_episode0_independent_v*/`
  - Debugging history from earlier iterations.
- `parallel_smoke_2x10/`
  - Xvfb/worker parallelization smoke test.
- `oven_smoke_*`
  - Early smoke runs.

## Repository Map

Relevant entry points:

- `code/rr_label_study/oven_study.py`
- `code/scripts/run_oven_label_study.py`
- `code/scripts/launch_parallel_oven_label_study.py`
- `code/scripts/run_oven_single_frame.py`
- `code/scripts/run_oven_frame_batch.py`
- `code/scripts/repair_oven_episode_dense.py`
- `code/scripts/render_oven_metric_frame.py`
- `code/scripts/render_oven_metric_gifs.py`

Relevant current artifacts:

- `artifacts/results/oven_episode0_iter4_templates/templates.json`
- `artifacts/results/oven_episode0_iter4_batch/iter4_batch_comparison.csv`
- `artifacts/results/oven_episode0_iter4_dense_geom_170_234.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter6_independent_full/summary.json`
- `artifacts/results/oven_episode0_iter6_visual_checks/boundary_rgb_contact_sheet.png`
- `artifacts/results/oven_episode0_iter6_visual_checks/early_visibility_contact_sheet.png`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.dense.csv`
- `artifacts/results/oven_episode0_iter16_gif_suite/episode0.metrics.json`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_all_metrics.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_visibility_focus.gif`
- `artifacts/results/oven_episode0_iter16_gif_suite/visualizations/episode0_path_quality_focus.gif`

## Environment

This was run on:

- Ubuntu `22.04.5`
- Kernel `6.8.0-65-generic`
- `96` CPU cores visible
- `503 GiB` RAM visible
- `NVIDIA A40`

See:

- `environment/system_info.txt`
- `environment/repo_revisions.txt`
- `environment/conda_env_rlbench.yml`
- `environment/pip_freeze_rlbench.txt`

## Upstream Repos Used

Exact revisions are recorded in `environment/repo_revisions.txt`.

The local run used:

- `markusgrotz/RLBench`
- `markusgrotz/PyRep`
- `markusgrotz/peract_bimanual`
- `markusgrotz/YARR`

Those source snapshots are included under `external/`.

## Reproducing On The Same Hardware Class

1. Read `environment/dataset_notes.txt`.
2. Run `environment/setup_same_hardware.sh /workspace`.
3. Source `environment/activate_rlbench_runtime.sh /workspace`.
4. Run the dense study:

```bash
python /workspace/VLAdaptorBench_upload/code/scripts/run_oven_label_study.py \
  --dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
  --result-dir /workspace/tmp_run \
  --max-episodes 1 \
  --checkpoint-stride 16 \
  --template-episode-index 0 \
  --independent-replay
```

5. If you want to repair suspicious frames in parallel with the new batched path:

```bash
python /workspace/VLAdaptorBench_upload/code/scripts/repair_oven_episode_dense.py \
  --dataset-root /workspace/data/bimanual_take_tray_out_of_oven_train_128 \
  --episode-dir /workspace/data/bimanual_take_tray_out_of_oven_train_128/all_variations/episodes/episode0 \
  --input-dense-csv /workspace/tmp_run/episode0.dense.csv \
  --output-dir /workspace/tmp_run_repaired \
  --checkpoint-stride 16 \
  --num-workers 4 \
  --base-display 170
```

## Important Note

The full 100-episode independent-replay run is not yet the authoritative artifact in this upload. The current repository state documents the repaired metric code, the exact snapshot/restore fixes, and the episode-0 independent validation that is required before scaling to the full study.

## Dataset Note

The RLBench demonstration dataset itself is not re-uploaded here. This repository contains the study code and generated artifacts only. The expected dataset path is documented in `environment/dataset_notes.txt`.

CoppeliaSim binaries are not included. The setup helpers expect a local extraction at `/workspace/coppelia_sim`.