lsnu committed on
Commit
de2fd70
·
verified ·
1 Parent(s): e7d8e79

Add files using upload-large-folder tool

MODEL_INDEX.md CHANGED
@@ -1,5 +1,36 @@
  # Model Index
 
+ ## 2026-03-25/26 Additions
+
+ ### Handoff Proxy Checkpoints
+
+ | Run | Checkpoint | Summary | Report |
+ | --- | --- | --- | --- |
+ | spatial handoff | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_seed17/summary.json` | `artifacts/reports/reveal_handoff_compare_serious/reveal_benchmark.json` |
+ | compact handoff | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_seed17/summary.json` | `artifacts/reports/reveal_handoff_compact_train_probe/reveal_benchmark.json` |
+ | compact-phase handoff | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase_seed17/summary.json` | `artifacts/reports/reveal_phase_compare_serious_compact/reveal_benchmark.json` |
+ | spatial-phase handoff | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_phase_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_phase_seed17/summary.json` | `artifacts/reports/reveal_phase_compare_serious_spatial_compactwm/reveal_benchmark.json` |
+
+ ### RLBench Current Checkpoints
+
+ | Run | Checkpoint | Related files |
+ | --- | --- | --- |
+ | subset3 valid9 | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_valid9/checkpoint_best.pt` | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_valid9/checkpoint_stable.pt` |
+ | subset3 common23 | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_common23/checkpoint_best.pt` | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_common23/checkpoint_stable.pt` |
+ | lift-ball wide | `artifacts/outputs/rlbench_current/rlbench_lift_ball_backbone_only_clip_current_wide/checkpoint_best.pt` | `artifacts/outputs/rlbench_current/rlbench_lift_ball_backbone_only_clip_current_wide/checkpoint_stable.pt` |
+ | push-box step1 | `artifacts/outputs/rlbench_current/rlbench_push_box_backbone_only_clip_step1/checkpoint_best.pt` | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1/rollout_eval.json`, `artifacts/reports/rlbench_push_box_knn_step1_ep5_top1_dense/rollout_eval.json` |
+
+ ### RLBench Result Files
+
+ | Artifact | File |
+ | --- | --- |
+ | lift-ball wide, one-step replanning | `artifacts/reports/rlbench_lift_ball_wide_len160_ep1_ik_c1/rollout_eval.json` |
+ | push-box step1, one-step replanning | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1/rollout_eval.json` |
+ | push-box step1, one-step replanning, `delta_scale=0.05` | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1_s005/rollout_eval.json` |
+ | push-box kNN, `episodes=1` | `artifacts/reports/rlbench_push_box_knn_step1_ep1/rollout_eval.json` |
+ | push-box kNN, `episodes=5`, `top_k=5` | `artifacts/reports/rlbench_push_box_knn_step1_ep5/rollout_eval.json` |
+ | push-box kNN, `episodes=5`, `top_k=1`, dense bank | `artifacts/reports/rlbench_push_box_knn_step1_ep5_top1_dense/rollout_eval.json` |
+
  ## R3D Proxy Runs
 
  | Run | Config | Seed | Checkpoint | Summary | Benchmark | Diagnostics |
README.md CHANGED
@@ -9,99 +9,164 @@ tags:
 
  # VLAarchtests
 
- Update uploaded from the `/workspace` runpod session dated `2026-03-25 UTC`.
 
- ## Updated Paths
 
  - `code/reveal_vla_bimanual/`
- - `tests/`
- - `environment/`
  - `artifacts/data/reveal_proxy/`
  - `artifacts/outputs/r3d_handoff/`
  - `artifacts/outputs/r3d_handoff_phase/`
- - `results/2026-03-25-runpod/`
-
- ## Primary Source Changes
-
- - Geometry path and camera-pose propagation updates:
-   - `code/reveal_vla_bimanual/models/backbones.py`
-   - `code/reveal_vla_bimanual/models/multiview_fusion.py`
-   - `code/reveal_vla_bimanual/models/policy.py`
- - Spatial memory and world-model updates:
-   - `code/reveal_vla_bimanual/models/observation_memory.py`
-   - `code/reveal_vla_bimanual/models/reveal_head.py`
-   - `code/reveal_vla_bimanual/models/world_model.py`
- - Semantic candidate and planner updates:
-   - `code/reveal_vla_bimanual/models/action_decoder.py`
-   - `code/reveal_vla_bimanual/models/planner.py`
- - Loss, dataset, and simulator updates:
-   - `code/reveal_vla_bimanual/train/losses.py`
-   - `code/reveal_vla_bimanual/sim_reveal/dataset.py`
-   - `code/reveal_vla_bimanual/sim_reveal/procedural_envs.py`
- - Evaluation and RLBench tooling updates:
-   - `code/reveal_vla_bimanual/eval/run_reveal_benchmark.py`
-   - `code/reveal_vla_bimanual/eval/run_teacher_audit.py`
-   - `code/reveal_vla_bimanual/eval/run_rlbench_rollout_eval.py`
-   - `code/reveal_vla_bimanual/eval/run_peract2_task_sweep.py`
-   - `code/reveal_vla_bimanual/eval/compare_rlbench_sweeps.py`
-   - `code/reveal_vla_bimanual/scripts/run_rlbench_handoff_eval.sh`
-
- ## Validation
-
- - Test command:
  - `PYTHONPATH=/workspace/VLAarchtests_work/code/reveal_vla_bimanual python -m pytest -q /workspace/VLAarchtests_work/tests`
- - Result:
  - `33 passed`
 
- ## Generated Datasets
 
- - `artifacts/data/reveal_proxy/proxy_train_clip224_v6_rgbd_stage3_phase.pt`
- - `artifacts/data/reveal_proxy/proxy_val_clip224_v6_rgbd_stage3_phase.pt`
 
- ## Generated Checkpoints
 
- - `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_seed17/`
- - `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_seed17/`
- - `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase_seed17/`
- - `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_phase_seed17/`
 
- ## Raw Result Summary
 
- ### Proxy Serious Comparisons
 
- | File | Reference mean success | Compared mean success |
- | --- | ---: | ---: |
- | `results/2026-03-25-runpod/reports/reveal_handoff_compare_serious/reveal_benchmark.json` | 0.583333 | 0.216667 |
- | `results/2026-03-25-runpod/reports/reveal_handoff_compare_serious_compact/reveal_benchmark.json` | 0.583333 | 0.520000 |
- | `results/2026-03-25-runpod/reports/reveal_phase_compare_serious_compact/reveal_benchmark.json` | 0.583333 | 0.513333 |
- | `results/2026-03-25-runpod/reports/reveal_phase_compare_serious_spatial_compactwm/reveal_benchmark.json` | 0.583333 | 0.493333 |
 
- ### Proxy Ablations
 
- - Full ablation matrix:
-   - `results/2026-03-25-runpod/reports/reveal_phase_ablations_compact/ablations.json`
- - Teacher audit:
-   - `results/2026-03-25-runpod/reports/reveal_teacher_audit_serious/teacher_audit.json`
 
- ### RLBench
 
- | File | Mean success |
- | --- | ---: |
- | `results/2026-03-25-runpod/reports/peract2_baseline_ep1/baseline_rgbd_seed17_plan_split/rollout_eval.json` | 0.000000 |
- | `results/2026-03-25-runpod/reports/peract2_spatial_full_ep1/spatial_phase_seed17_noplan_split/rollout_eval.json` | 0.000000 |
- | `results/2026-03-25-runpod/reports/peract2_spatial_full_ep1/spatial_phase_seed17_plan_split/rollout_eval.json` | 0.000000 |
 
- ## Detailed Raw Index
 
- - `results/2026-03-25-runpod/README.md`
 
- ## Environment Recreation
 
- - `environment/README.md`
  - `environment/setup_same_machine.sh`
  - `environment/validate_same_machine.sh`
  - `environment/runtime_env_vars.sh`
  - `environment/upstream_revisions.txt`
  - `environment/rlbench_env_export.yaml`
  - `environment/rlbench_env_explicit.txt`
  - `environment/rlbench_pip_freeze.txt`
 
  # VLAarchtests
 
+ Bundle uploaded from `/workspace` runpod sessions dated `2026-03-25 UTC` and `2026-03-26 UTC`.
 
+ ## Top-Level Contents
 
  - `code/reveal_vla_bimanual/`
+   - project code used for the proxy and RLBench runs in this bundle
  - `artifacts/data/reveal_proxy/`
+   - proxy dataset bundles used by the handoff runs
+ - `artifacts/outputs/r3d/`
+   - previously uploaded R3D proxy outputs already present in the bundle
  - `artifacts/outputs/r3d_handoff/`
+   - handoff proxy checkpoints
  - `artifacts/outputs/r3d_handoff_phase/`
+   - phase-supervised handoff proxy checkpoints
+ - `artifacts/outputs/rlbench_current/`
+   - RLBench checkpoints from the current session
+ - `artifacts/reports/`
+   - proxy and RLBench result files copied from `/workspace/reports`
+ - `environment/`
+   - same-machine setup files and validation helpers
+ - `tests/`
+   - local test suite
+ - `handoff/instructions.md`
+   - instruction file used for the handoff work
+ - `MODEL_INDEX.md`
+   - checkpoint and result index
+ - `results/session_results_20260326.md`
+   - raw result tables for the `2026-03-25/26` work
+
+ ## Code Added Or Updated
+
+ ### Core model, memory, planner, and dataset paths
+
+ - `code/reveal_vla_bimanual/models/backbones.py`
+ - `code/reveal_vla_bimanual/models/multiview_fusion.py`
+ - `code/reveal_vla_bimanual/models/observation_memory.py`
+ - `code/reveal_vla_bimanual/models/reveal_head.py`
+ - `code/reveal_vla_bimanual/models/world_model.py`
+ - `code/reveal_vla_bimanual/models/action_decoder.py`
+ - `code/reveal_vla_bimanual/models/planner.py`
+ - `code/reveal_vla_bimanual/models/policy.py`
+ - `code/reveal_vla_bimanual/train/losses.py`
+ - `code/reveal_vla_bimanual/sim_reveal/dataset.py`
+ - `code/reveal_vla_bimanual/sim_reveal/procedural_envs.py`
+ - `code/reveal_vla_bimanual/sim_rlbench/dataset.py`
+
+ ### Training and evaluation paths
+
+ - `code/reveal_vla_bimanual/train/run_rlbench_experiment.py`
+ - `code/reveal_vla_bimanual/eval/run_reveal_benchmark.py`
+ - `code/reveal_vla_bimanual/eval/run_ablations.py`
+ - `code/reveal_vla_bimanual/eval/run_teacher_audit.py`
+ - `code/reveal_vla_bimanual/eval/run_rlbench_rollout_eval.py`
+ - `code/reveal_vla_bimanual/eval/run_rlbench_knn_eval.py`
+
+ ### Added or updated training configs
+
+ - `code/reveal_vla_bimanual/train/configs/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact.yaml`
+ - `code/reveal_vla_bimanual/train/configs/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial.yaml`
+ - `code/reveal_vla_bimanual/train/configs/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase.yaml`
+ - `code/reveal_vla_bimanual/train/configs/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_phase.yaml`
+ - `code/reveal_vla_bimanual/train/configs/rlbench_subset3_backbone_only_clip_current_valid9.yaml`
+ - `code/reveal_vla_bimanual/train/configs/rlbench_subset3_backbone_only_clip_current_common23.yaml`
+ - `code/reveal_vla_bimanual/train/configs/rlbench_lift_ball_backbone_only_clip_current_wide.yaml`
+ - `code/reveal_vla_bimanual/train/configs/rlbench_lift_ball_backbone_only_clip_step1.yaml`
+ - `code/reveal_vla_bimanual/train/configs/rlbench_push_box_backbone_only_clip_step1.yaml`
+
+ ### Test files
+
+ The staged `tests/` directory contains `32` test modules plus `conftest.py`, including:
+
+ - geometry and camera rotation coverage
+ - phase-label and candidate-ranking coverage
+ - planner gradient-flow and reocclusion gating coverage
+ - world-model null-rollout, field-consistency, and task-adapter coverage
+ - proxy scripted benchmark and teacher-audit coverage
+
+ ## Verification
+
+ - local test command:
  - `PYTHONPATH=/workspace/VLAarchtests_work/code/reveal_vla_bimanual python -m pytest -q /workspace/VLAarchtests_work/tests`
+ - result:
  - `33 passed`
 
+ ## Raw Result Files
+
+ ### Proxy and handoff results
 
+ - `artifacts/reports/reveal_smoke_mod/reveal_benchmark.json`
+ - `artifacts/reports/reveal_smoke_nogeom/reveal_benchmark.json`
+ - `artifacts/reports/reveal_smoke_noplanner/reveal_benchmark.json`
+ - `artifacts/reports/reveal_handoff_compare_serious/reveal_benchmark.json`
+ - `artifacts/reports/reveal_handoff_compare_serious_compact/reveal_benchmark.json`
+ - `artifacts/reports/reveal_phase_compare_serious_compact/reveal_benchmark.json`
+ - `artifacts/reports/reveal_phase_compare_serious_spatial_compactwm/reveal_benchmark.json`
+ - `artifacts/reports/reveal_phase_ablations_compact/ablations.json`
+ - `artifacts/reports/reveal_teacher_audit_serious/teacher_audit.json`
 
+ ### RLBench result files
 
+ - `artifacts/reports/rlbench_dual_buttons_baseline_len100_ep1_ik_rescale/rollout_eval.json`
+ - `artifacts/reports/rlbench_dual_buttons_common23_len100_ep1_ik_rescale/rollout_eval.json`
+ - `artifacts/reports/rlbench_push_box_common23_len100_ep1_ik_rescale/rollout_eval.json`
+ - `artifacts/reports/rlbench_lift_ball_wide_len160_ep1_ik_c1/rollout_eval.json`
+ - `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1/rollout_eval.json`
+ - `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1_s005/rollout_eval.json`
+ - `artifacts/reports/rlbench_push_box_knn_step1_ep1/rollout_eval.json`
+ - `artifacts/reports/rlbench_push_box_knn_step1_ep5/rollout_eval.json`
+ - `artifacts/reports/rlbench_push_box_knn_step1_ep5_top1_dense/rollout_eval.json`
 
+ ## Raw Result Tables
 
+ ### Proxy serious runs
 
+ | Artifact | File | Raw values |
+ | --- | --- | --- |
+ | spatial handoff vs released baseline | `artifacts/reports/reveal_handoff_compare_serious/reveal_benchmark.json` | baseline mean success `0.5833`, handoff mean success `0.2167` |
+ | spatial-trained checkpoint with compact world model vs released baseline | `artifacts/reports/reveal_handoff_compare_serious_compact/reveal_benchmark.json` | baseline mean success `0.5833`, handoff mean success `0.5200` |
+ | compact-phase vs released baseline | `artifacts/reports/reveal_phase_compare_serious_compact/reveal_benchmark.json` | baseline mean success `0.5833`, compact-phase mean success `0.5133` |
+ | spatial-phase with compact world model vs released baseline | `artifacts/reports/reveal_phase_compare_serious_spatial_compactwm/reveal_benchmark.json` | baseline mean success `0.5833`, spatial-phase compact-world-model mean success `0.4933` |
 
+ ### Proxy ablations
 
+ | Artifact | File | Raw values |
+ | --- | --- | --- |
+ | compact-phase ablations | `artifacts/reports/reveal_phase_ablations_compact/ablations.json` | full `0.5133`, `no_geometry` `0.5133`, `no_spatial_memory` `0.4967`, `compact_world_model` `0.5133`, `no_planner` `0.4333`, `gaussian_candidates_only` `0.4667`, `no_task_head` `0.5133`, `no_support_mode_conditioning` `0.5133` |
 
+ ### RLBench direct-policy runs
 
+ | Artifact | File | Raw values |
+ | --- | --- | --- |
+ | lift-ball wide checkpoint, one-step replanning | `artifacts/reports/rlbench_lift_ball_wide_len160_ep1_ik_c1/rollout_eval.json` | mean success `0.0`, mean return `0.0`, path recoveries `[148]`, noop fallbacks `[11]` |
+ | push-box step-1 checkpoint, one-step replanning | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1/rollout_eval.json` | mean success `0.0`, mean return `0.0`, path recoveries `[177]`, noop fallbacks `[0]` |
+ | push-box step-1 checkpoint, one-step replanning, `delta_scale=0.05` | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1_s005/rollout_eval.json` | mean success `0.0`, mean return `0.0`, path recoveries `[180]`, noop fallbacks `[0]` |
 
+ ### RLBench retrieval runs
 
+ | Artifact | File | Raw values |
+ | --- | --- | --- |
+ | push-box kNN, `bank_stride=4`, `top_k=5`, `time_window=8`, `episodes=1` | `artifacts/reports/rlbench_push_box_knn_step1_ep1/rollout_eval.json` | mean success `1.0`, mean return `1.0`, bank size `2815` |
+ | push-box kNN, `bank_stride=4`, `top_k=5`, `time_window=8`, `episodes=5` | `artifacts/reports/rlbench_push_box_knn_step1_ep5/rollout_eval.json` | successes `[0.0, 1.0, 0.0, 0.0, 0.0]`, mean success `0.2`, bank size `2815` |
+ | push-box kNN, `bank_stride=1`, `top_k=1`, `time_window=4`, `episodes=5` | `artifacts/reports/rlbench_push_box_knn_step1_ep5_top1_dense/rollout_eval.json` | successes `[0.0, 0.0, 1.0, 1.0, 0.0]`, mean success `0.4`, bank size `11259` |
 
+ ## Environment Recreation Files
 
  - `environment/setup_same_machine.sh`
  - `environment/validate_same_machine.sh`
+ - `environment/run_peract2_13_rollouts.sh`
  - `environment/runtime_env_vars.sh`
+ - `environment/hardware_snapshot.txt`
+ - `environment/glxinfo_B.txt`
  - `environment/upstream_revisions.txt`
+ - `environment/system_packages_same_machine.txt`
  - `environment/rlbench_env_export.yaml`
  - `environment/rlbench_env_explicit.txt`
  - `environment/rlbench_pip_freeze.txt`
+ - `environment/reveal_env_export.yaml`
+ - `environment/reveal_env_explicit.txt`
+ - `environment/reveal_pip_freeze.txt`
+
+ Detailed raw tables for the `2026-03-25/26` work are in `results/session_results_20260326.md`.
environment/hf_cli_version.txt ADDED
@@ -0,0 +1 @@
+ 1.8.0
handoff/instructions.md ADDED
@@ -0,0 +1,717 @@
+ # Developer handoff: structured bimanual reveal-and-retrieve under elastic occlusion
+
+ Repo target: `lsnu/VLAarchtests` (current `main`, latest post-fix state). This handoff is written against the current `elastic_reveal` stack, not the older intermediate variants.
+
+ ## 1. Project introduction
+
+ This project is a structured bimanual policy stack for reveal-and-retrieve tasks under partial observability and deformable or elastic occlusion. The eventual real-world targets are three Dobot X-trainer environments. The first is dense live foliage with hidden fake snails, where one arm must create and maintain a canopy gap while the other arm retrieves the target safely. The second is bag opening and retrieval, where one arm must open and hold the bag mouth while the other arm retrieves the target item. The third is suitcase or folded-cloth retrieval, where one arm must slightly lift and stabilize clothing layers while the other arm retrieves a hidden item without destroying the fold structure.
+
+ The current repo already contains the right broad decomposition for this task family. It has a multi-view visual backbone, RGB-D support, an explicit reveal state head, observation memory, a compact world model, a coordinated bimanual action decoder, and a planner. The problem is not the structural idea. The problem is that several important pieces are only partially wired, too compact, or only validated on teacher-shaped proxy data. The current code is a good scaffold. It is not yet strong enough to justify “beats SOTA” claims on either public benchmarks or the three target task families.
+
+ The current public evidence should be read narrowly. The most credible positive result in the repo is that RGB-D helps on the proxy benchmark. The planner, world model, and role-symmetry components are not yet validated strongly enough to claim they are the source of the gains. The RLBench / PerAct2 integration is also still mostly a launch and plumbing layer, not a mature benchmark suite.
+
+ This handoff therefore has one purpose. Keep the structured reveal-and-retrieve idea, but harden the architecture and evaluation until there is a realistic chance of beating strong bimanual baselines on the three target environments.
+
+ ## 2. Current repo status (what exists, what is missing)
+
+ The current core files are:
+
+ `code/reveal_vla_bimanual/models/backbones.py`
+ `code/reveal_vla_bimanual/models/multiview_fusion.py`
+ `code/reveal_vla_bimanual/models/observation_memory.py`
+ `code/reveal_vla_bimanual/models/reveal_head.py`
+ `code/reveal_vla_bimanual/models/world_model.py`
+ `code/reveal_vla_bimanual/models/action_decoder.py`
+ `code/reveal_vla_bimanual/models/planner.py`
+ `code/reveal_vla_bimanual/models/policy.py`
+ `code/reveal_vla_bimanual/train/losses.py`
+ `code/reveal_vla_bimanual/sim_reveal/dataset.py`
+ `code/reveal_vla_bimanual/sim_reveal/procedural_envs.py`
+ `code/reveal_vla_bimanual/eval/run_reveal_benchmark.py`
+ `code/reveal_vla_bimanual/eval/run_rlbench_rollout_eval.py`
+ `code/reveal_vla_bimanual/eval/run_peract2_task_sweep.py`
+
+ The current proxy benchmark already uses the correct three abstract task types (`foliage`, `bag`, `cloth`). That is good. The current dataset code also has explicit no-leak assertions, which is also good.
+
+ The current weaknesses are specific and fixable.
+
+ First, the geometry path is only partially wired. The backbone produces `depth_tokens`, `geometry_tokens`, and `camera_tokens`, but the policy only forwards RGB, depth, and camera tokens into fusion. The explicit `geometry_tokens` are dropped before fusion. In addition, camera geometry is incomplete. The current depth adapter encodes intrinsics and camera translation, but not an equally explicit camera rotation representation. For three-camera reveal tasks this is a real omission.
+
+ Second, memory is too pooled and too global. The current memory path reduces scene history to pooled tokens before write decisions and bank updates. That is a novelty-gated summary memory. It is not a spatial occlusion memory. That is not enough for “hold the opening”, “the target is still probably behind this flap”, or “reveal progress will collapse if the revealer arm releases now”.
+
+ Third, the world model is too compact. It is useful as a scaffold, but not as the state-transition core for elastic foliage, bag apertures, or layered cloth. It currently rolls a compact hidden state rather than a spatial field state. That makes it too weak for counterfactual planning over opening persistence, reocclusion, and safe actor insertion.
+
+ Fourth, the planner is not trained on hard enough candidates. The current proxy data generation uses the teacher chunk and mostly Gaussian perturbations around it. That is enough to test ranking near a teacher, but not enough to teach the planner the actual failure modes that matter in these tasks (premature retrieval, releasing the opening, over-disturbing the scene, lifting the wrong cloth edge, etc.).
+
+ Fifth, the state head is still too generic. It predicts a useful set of reveal-related fields, but it does not yet expose the right task-specific latent variables for foliage, bag, and folded cloth. Those tasks are not the same. They share the same reveal-and-retrieve pattern, but they do not share the same dominant failure modes.
+
+ Sixth, the test suite is mostly contract-level. Those tests are useful, but they do not yet prove that the structured components work behaviorally. The RLBench side is similar. The launch smoke is only a plumbing check. The actual rollout evaluator exists, but it needs to become the main public benchmark path.
+
+ ## 3. The main design decision
+
+ Do not collapse this into a generic monolithic VLA. That is not the likely win condition for these tasks.
+
+ The highest-probability path is a stronger visual backbone plus an explicit structured reveal-and-retrieve stack. The reason is simple. Your target tasks are asymmetric, partially observable, persistence-sensitive, and reocclusion-sensitive. One arm often has to create and maintain a temporary affordance that only exists because of that arm’s continued state. Generic end-to-end BC can sometimes imitate the behavior, but these tasks strongly reward explicit representations of opening quality, hold persistence, target belief, reocclusion risk, and actor feasibility.
+
+ The structured architecture should stay. It should just become spatial, task-aware, and evaluated honestly.
+
+ ## 4. Mandatory code changes
+
+ ### 4.1 Fix and strengthen the geometry path
+
+ Files to change:
+
+ `models/backbones.py`
+ `models/multiview_fusion.py`
+ `models/policy.py`
+ `tests/test_rgbd_forward_contract.py` (extend)
+ Add new tests: `tests/test_geometry_tokens_propagate.py`, `tests/test_camera_rotation_geometry.py`
+
+ Exact changes:
+
+ In `models/policy.py`, update the image encoding path so that `geometry_tokens` are passed from `backbone.encode_images(..., return_aux=True)` into the fusion module. Right now the policy forwards `rgb_tokens`, `depth_tokens`, and `camera_tokens`, but not `geometry_tokens`. This should be corrected first because it is an actual information-drop bug.
+
+ In `models/multiview_fusion.py`, update the fusion interface to accept explicit `geometry_tokens`. The geometry attention path should fuse from a real concatenation or gated combination of `[depth_tokens, geometry_tokens, camera_tokens]`, rather than synthesizing “geometry” only from the surviving depth and camera paths. Keep the existing gated cross-attention pattern, but make the geometry path explicit and inspectable.
+
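A minimal sketch of what such an explicit, gated geometry path could look like. This is not the repo's actual fusion module; the class and argument names are assumptions chosen to mirror the token names above.

```python
import torch
import torch.nn as nn

class GeometryGatedFusion(nn.Module):
    """Sketch: RGB tokens cross-attend to a projected concatenation of
    [depth_tokens, geometry_tokens, camera_tokens], with a sigmoid gate
    on the residual update so the geometry path stays inspectable."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.geo_proj = nn.Linear(3 * dim, dim)  # fuse the three geometry streams
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, rgb_tokens, depth_tokens, geometry_tokens, camera_tokens):
        # All inputs: (B, N, D). Project the explicit geometry streams together.
        geo = self.geo_proj(torch.cat([depth_tokens, geometry_tokens, camera_tokens], dim=-1))
        attended, _ = self.attn(rgb_tokens, geo, geo)  # RGB queries geometry
        return rgb_tokens + self.gate(attended) * attended  # gated residual
```

The gate value per token can be logged to verify that `geometry_tokens` actually influence fusion, which is what `tests/test_geometry_tokens_propagate.py` would assert.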
+ In `models/backbones.py`, upgrade `DepthPatchAdapter` so that geometry features include camera orientation. Use a 6D rotation representation or a normalized quaternion plus translation. Also add per-patch viewing ray directions derived from intrinsics and camera pose. The three target environments all rely on view geometry and persistent multi-view correspondence. The current translation-only pose treatment is too weak.
+
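A sketch of the two ingredients named above: the continuous 6D rotation representation and per-patch viewing rays. The function names and the coarse-grid simplification are illustrative, not the repo's API; a real adapter would scale pixel coordinates to the actual patch centers of the image.

```python
import torch

def rotation_to_6d(R: torch.Tensor) -> torch.Tensor:
    """Continuous 6D rotation representation: the first two columns of the
    rotation matrix, flattened. Avoids quaternion double-cover issues."""
    return R[..., :, :2].reshape(*R.shape[:-2], 6)

def patch_ray_directions(K: torch.Tensor, R: torch.Tensor, hw=(4, 4)) -> torch.Tensor:
    """Unit viewing-ray directions for a coarse patch grid.
    K: 3x3 intrinsics; R: camera-to-world rotation (assumed convention)."""
    h, w = hw
    ys, xs = torch.meshgrid(torch.arange(h) + 0.5, torch.arange(w) + 0.5, indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    cam_rays = (torch.linalg.inv(K) @ pix.T).T   # back-project patch centers
    world_rays = (R @ cam_rays.T).T              # rotate into the world frame
    return world_rays / world_rays.norm(dim=-1, keepdim=True)
```

Concatenating the 6D rotation, translation, and per-patch ray directions into the geometry features gives the adapter an explicit view-geometry signal rather than translation only.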
+ Add config flags that actually do something. The current `use_camera_geometry` style config needs to gate a real path, not just exist as a dormant option. Add separate switches for `use_depth_tokens`, `use_geometry_tokens`, and `use_camera_pose_tokens` so ablations are clean.
+
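The flag layout could be as simple as the following sketch; the dataclass and helper are hypothetical, but the point is that each switch must select a real token stream, so an ablation run with a flag off provably drops that path.

```python
from dataclasses import dataclass

@dataclass
class GeometryConfig:
    """Hypothetical config block; each flag must gate a real code path."""
    use_depth_tokens: bool = True
    use_geometry_tokens: bool = True
    use_camera_pose_tokens: bool = True

def active_token_streams(cfg: GeometryConfig) -> list:
    """Which token streams the fusion module will actually receive."""
    streams = ["rgb"]  # RGB is always on
    if cfg.use_depth_tokens:
        streams.append("depth")
    if cfg.use_geometry_tokens:
        streams.append("geometry")
    if cfg.use_camera_pose_tokens:
        streams.append("camera_pose")
    return streams
```

A contract test can then assert that disabling a flag changes the fusion inputs, which is exactly the check the dormant `use_camera_geometry` option currently fails.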
+ Why this matters: the foliage and bag tasks are especially sensitive to camera geometry because small apparent gaps can be fake from one viewpoint and usable from another. The actor feasibility estimate should depend on geometry, not just appearance.
+
+ ### 4.2 Replace pooled novelty memory with spatial reveal memory
+
+ Files to change:
+
+ `models/observation_memory.py`
+ `models/policy.py`
+ `models/reveal_head.py`
+ Add new tests: `tests/test_spatial_memory_occlusion_persistence.py`, `tests/test_memory_slot_write_gating.py`, `tests/test_reocclusion_memory_regression.py`
+
+ Exact changes:
+
+ Keep the current memory modules as a fallback baseline, but add a new default path that stores low-resolution spatial memory instead of only pooled history summaries. The simplest realistic version is a two-branch memory:
+
+ 1. scene memory: a small bank of view-conditioned or canonicalized spatial tokens for persistent geometry and support structure;
+ 2. belief memory: a spatial target-belief / reveal-state memory that carries uncertainty explicitly.
+
+ The memory does not need to be large. An 8×8 or 12×12 field token grid per view (or a shared canonical field) is enough. The key requirement is that the write gate becomes spatial or slot-wise, not global only. The model must be able to update “the mouth is open here” without overwriting “the target is probably still here”.
+
+ Add explicit channels or latent heads for:
+ - newly revealed regions
+ - still-visible regions
+ - reoccluded regions
+ - persistent hold or opening quality
+ - target belief uncertainty
+
+ The world model and planner should consume this spatial memory directly. Do not average it away before planning.
+
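The per-cell write gate described above can be sketched in a few lines. This is a minimal illustration, not the repo's memory module: each field-grid cell computes its own gate from the current memory and the new observation, so a write to one region leaves the rest of the field untouched.

```python
import torch
import torch.nn as nn

class SpatialWriteGate(nn.Module):
    """Sketch of a slot-wise write gate over a low-resolution memory field.
    Each grid cell independently decides how much new observation to absorb."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, memory: torch.Tensor, observation: torch.Tensor) -> torch.Tensor:
        # memory, observation: (B, H*W, D) field tokens
        g = self.gate(torch.cat([memory, observation], dim=-1))  # per-cell gate in (0, 1)
        return (1.0 - g) * memory + g * observation
```

The regression test `tests/test_memory_slot_write_gating.py` could then check that writing to one cell leaves distant cells numerically unchanged, which a global pooled gate cannot satisfy.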
+ Why this matters: a reveal-and-retrieve policy that forgets where the useful opening was, or where the hidden object probably still is, will look competent in one-step imitation and fail in multi-step retrieval.
+
+ ### 4.3 Replace the compact world model with a spatial rollout model
+
+ Files to change:
+
+ `models/world_model.py`
+ `models/policy.py`
+ `train/losses.py`
+ Add new tests: `tests/test_world_model_null_rollout.py`, `tests/test_world_model_identity_rollout.py`, `tests/test_world_model_field_consistency.py`, `tests/test_world_model_task_adapter.py`
+
+ Exact changes:
+
+ Keep the current compact GRU world model only as an ablation. The default model should become a spatial latent rollout over field tokens or low-resolution maps. A realistic implementation is a ConvGRU or a token-wise recurrent transformer over a low-resolution field state. The world-model state should contain at least:
+
+ - target belief field
+ - visibility or reveal field
+ - actor feasibility / corridor field
+ - opening quality or hold quality field
+ - persistence field
+ - disturbance / damage risk field
+ - reocclusion risk field
+ - support stability field
+
+ Add task conditioning directly into the world model. A learned task embedding (`foliage`, `bag`, `cloth`) should modulate the transition. The dynamics are not the same and should not be forced into one unstructured transition model.
+
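A compact sketch of the two ideas together: a ConvGRU step over a spatial field state, with a learned task embedding modulating the transition inputs. Channel counts, the simplified single-cell structure, and the additive task conditioning are assumptions; a production model would stack cells and condition more carefully.

```python
import torch
import torch.nn as nn

class TaskConvGRUCell(nn.Module):
    """Sketch of a spatial world-model step: a ConvGRU cell over a
    low-resolution field state, modulated by a task embedding
    (e.g. foliage / bag / cloth)."""

    def __init__(self, channels: int, num_tasks: int = 3):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, channels)
        self.zr = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.h_tilde = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, state, action_field, task_id):
        # state, action_field: (B, C, H, W); task_id: (B,) long
        t = self.task_embed(task_id)[:, :, None, None]   # broadcast over the grid
        x = action_field + t                             # task-conditioned input
        zr = torch.sigmoid(self.zr(torch.cat([state, x], dim=1)))
        z, r = zr.chunk(2, dim=1)                        # update and reset gates
        h = torch.tanh(self.h_tilde(torch.cat([r * state, x], dim=1)))
        return (1 - z) * state + z * h                   # gated field update
```

Because the state stays spatial, the persistence, reocclusion-risk, and corridor fields listed above can be kept as named channel groups of `state` rather than being pooled away.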
+ Retain explicit ablation modes inside `models/world_model.py`:
+ - `identity_rollout`
+ - `null_rollout`
+ - `compact_rollout` (the current baseline)
+ - `spatial_rollout` (new default)
+
+ These ablations must be real and deterministic. The world-model ablation confusion in the current repo shows why this needs to be explicit and unit-tested.
+
+ Why this matters: the planner can only beat a simple decoder if its counterfactual rollouts capture persistence and collapse. Without a spatial world model, the “maintain opening while actor advances” pattern will be under-modeled.
+
145
### 4.4 Make the reveal head task-aware

Files to change:

`models/reveal_head.py`
`train/losses.py`
`sim_reveal/dataset.py`
`sim_reveal/procedural_envs.py`
Add new tests: `tests/test_task_conditioned_head_shapes.py`, `tests/test_task_metric_monotonicity.py`

Exact changes:

Add a task embedding to the reveal head. Keep the shared trunk, but use task-specific adapters or low-rank heads for the final outputs. The head should still produce the common fields, but each task must also expose the state variables that actually matter for it.

For foliage, add:
- gap width or reveal corridor width
- canopy strain / damage risk
- occluder return tendency (reocclusion after release)
- target visibility confidence under flexible occluders

For bag, add:
- mouth aperture width or area
- rim endpoint or rim grasp quality
- hold quality
- rim slip risk
- insertable actor corridor

For cloth or suitcase, add:
- layer separation quality
- fold-preservation score
- insertion corridor
- top-layer stability
- “lift too much” risk

The current generic fields (`actor_feasibility_field`, `persistence_field`, `risk_field`, `uncertainty_field`, `reocclusion`) are useful, but they are not enough. The planner needs the task-specific variables because the right action for bag opening is not the right action for layered cloth.

### 4.5 Replace Gaussian candidate noise with semantic macro candidates plus continuous refinement

Files to change:

`models/action_decoder.py`
`models/planner.py`
`sim_reveal/dataset.py`
`sim_reveal/procedural_envs.py`
Add new tests: `tests/test_candidate_macro_coverage.py`, `tests/test_planner_reocclusion_gating.py`, `tests/test_proposal_semantic_diversity.py`

Exact changes:

Keep the current proposal mechanism as a fallback. The default candidate set should become a set of semantic macro modes, each refined by continuous deltas.

The candidate vocabulary should be task-aware.

For foliage:
- `sweep_left`
- `sweep_right`
- `pin_canopy`
- `widen_gap`
- `maintain_gap`
- `insert_actor`
- `retrieve`

For bag:
- `pin_left_rim`
- `pin_right_rim`
- `widen_mouth`
- `maintain_mouth`
- `probe_inside`
- `insert_actor`
- `retrieve`

For cloth:
- `lift_edge`
- `separate_layer`
- `stabilize_fold`
- `maintain_lift`
- `insert_actor`
- `retrieve`

Represent these as discrete proposal tokens or a macro head in `action_decoder.py`, then produce continuous chunk deltas conditioned on the chosen macro. The planner should shortlist across macro families first and refine within each family second. That prevents “all candidates are tiny perturbations around the same wrong idea”.
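
One way to sketch the macro-first candidate generator. The vocabularies come from the lists above; the delta dimensionality, scale, and sampling scheme are assumptions standing in for the learned refinement head.

```python
import random

# Macro vocabularies taken from the task lists above.
MACRO_VOCAB = {
    "foliage": ["sweep_left", "sweep_right", "pin_canopy", "widen_gap",
                "maintain_gap", "insert_actor", "retrieve"],
    "bag": ["pin_left_rim", "pin_right_rim", "widen_mouth", "maintain_mouth",
            "probe_inside", "insert_actor", "retrieve"],
    "cloth": ["lift_edge", "separate_layer", "stabilize_fold", "maintain_lift",
              "insert_actor", "retrieve"],
}


def propose_candidates(task, refinements_per_macro=3, delta_dim=6,
                       delta_scale=0.05, seed=0):
    """Shortlist across macro families first, then refine within each family."""
    rng = random.Random(seed)
    candidates = []
    for macro in MACRO_VOCAB[task]:
        for _ in range(refinements_per_macro):
            # Continuous chunk delta conditioned on (here: attached to) the macro.
            delta = [rng.gauss(0.0, delta_scale) for _ in range(delta_dim)]
            candidates.append({"macro": macro, "delta": delta})
    return candidates
```

Every macro family is guaranteed to appear in the candidate set, which is exactly what `tests/test_candidate_macro_coverage.py` should assert.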

In `models/planner.py`, add hard feasibility gates before utility aggregation. Do not let the planner prefer “retrieve now” if actor feasibility, hold quality, or support stability are below threshold. Use worst-step or CVaR-style penalties for reocclusion and collapse, rather than only mean penalties. These tasks fail on bad tails, not just on averages.
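
A minimal sketch of the gate-then-rank idea. The thresholds, field names, and candidate record layout are assumptions; only the structure (hard gate before ranking, tail-focused penalty) reflects the requirement above.

```python
import math


def cvar_penalty(step_costs, alpha=0.25):
    """Mean of the worst ceil(alpha * T) per-step costs.

    Penalizes bad tails rather than only the average cost.
    """
    k = max(1, math.ceil(alpha * len(step_costs)))
    worst = sorted(step_costs, reverse=True)[:k]
    return sum(worst) / len(worst)


# Illustrative gate thresholds; the real values would be tuned.
RETRIEVE_GATES = {"actor_feasibility": 0.5, "hold_quality": 0.4,
                  "support_stability": 0.5}


def score_candidate(rollout):
    """rollout: dict with a macro, scalar fields, and per-step cost traces."""
    if rollout["macro"] == "retrieve":
        # Hard feasibility gate: an infeasible retrieve never reaches ranking.
        if any(rollout[k] < thr for k, thr in RETRIEVE_GATES.items()):
            return float("-inf")
    return (rollout["utility"]
            - cvar_penalty(rollout["reocclusion_steps"])
            - cvar_penalty(rollout["collapse_steps"]))
```

With this shape, the `tests/test_planner_reocclusion_gating.py` scenario reduces to asserting that the gated "retrieve now" candidate scores negative infinity while the maintain-first candidate scores finitely.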

Why this matters: the current planner is too dependent on easy local ranking. Real reveal-and-retrieve requires semantically different plans, not just slightly different noise vectors.

### 4.6 Change the loss stack to supervise what actually matters

Files to change:

`train/losses.py`
`train/trainer.py` (if needed for logging)
Add new tests: `tests/test_candidate_ranking_loss.py`, `tests/test_phase_labels_not_action_only.py`, `tests/test_planner_gradient_flow.py`

Exact changes:

Reduce dependence on heuristic phase labels inferred from the current action chunk. That heuristic is acceptable for early bootstrapping, but it should not remain the main source of phase supervision. Prefer simulator-side phase or subgoal labels where available. If those are not reliable, phase should be a weak auxiliary signal, not a strong driver.

Add a pairwise or listwise ranking loss over candidate action chunks using actual rollout utility labels. These labels should come from simulated outcomes, not just from “teacher is first, noise is worse”.
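
A pairwise hinge version of that ranking loss might look like the following. The margin value and the (batch, candidates) tensor layout are assumptions; the utility labels are the simulated rollout outcomes described above.

```python
import torch
import torch.nn.functional as F


def pairwise_ranking_loss(scores, utilities, margin=0.1):
    """Hinge loss over all candidate pairs ordered by rollout utility.

    scores:    (B, K) planner scores for K candidate chunks
    utilities: (B, K) simulated rollout utility labels
    """
    s_i, s_j = scores.unsqueeze(2), scores.unsqueeze(1)        # (B, K, K)
    u_i, u_j = utilities.unsqueeze(2), utilities.unsqueeze(1)
    better = (u_i > u_j).float()                               # i should outrank j
    # Penalize pairs where the better-utility candidate is not scored
    # at least `margin` above the worse one.
    loss = F.relu(margin - (s_i - s_j)) * better
    return loss.sum() / better.sum().clamp(min=1.0)
```

A score ordering that agrees with the utility ordering by at least the margin incurs zero loss, which gives `tests/test_candidate_ranking_loss.py` a crisp assertion target.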

Add consistency losses:
- predicted opening quality should correlate with rollout persistence
- predicted reocclusion should correlate with actual collapse after release
- predicted uncertainty should be calibrated against outcome uncertainty or visibility error

Lower the relative weight of pure behavior cloning once ranking and rollout supervision are reliable. This project should not remain BC-with-many-auxiliaries.

## 5. Mandatory data-generation changes

Files to change:

`sim_reveal/dataset.py`
`sim_reveal/procedural_envs.py`
Add new tests: `tests/test_dataset_hard_negative_presence.py`, `tests/test_no_leak_with_new_labels.py`, `tests/test_teacher_audit.py`

Exact changes:

The dataset generation path must stop relying on teacher-plus-Gaussian-noise as the dominant source of planner candidates. Keep the teacher as one source, but add hard negative families that reflect actual task failures.

Required negative families for all three tasks:

1. premature retrieve: the actor attempts retrieval before corridor and hold quality are sufficient;
2. reveal-with-release: the revealer creates an opening but fails to maintain it;
3. over-disturbance: the revealer opens aggressively but causes collapse or damage risk;
4. wrong-side or wrong-edge reveal: the opening is created in a useless place;
5. delayed actor entry: the revealer holds too long and wastes time or destabilizes the scene;
6. actor path through weak corridor: the actor enters where access exists visually but not safely.

Required task-specific negative families:

For foliage:
- swipe that increases visibility briefly but induces immediate reocclusion;
- push direction that hides the target from the actor side;
- gap on the wrong side of the target.

For bag:
- one-rim lift that slips instead of widening the mouth;
- opening wide enough visually but not stable enough for actor insertion;
- actor reaches through the fabric instead of through the aperture.

For cloth:
- lift too high and destroy fold structure;
- lift the wrong layer;
- retrieve path that drags clothing and unfolds the stack.

The dataset should record candidate-level rollout outcomes for every candidate chunk:
- success
- reveal achieved
- visibility AUC
- hold persistence
- reocclusion rate
- disturbance cost
- fold-preservation (cloth)
- mouth aperture / hold quality (bag)
- damage proxy / gap width (foliage)

This candidate-level outcome table should be the source of planner labels.

Also add a teacher audit report. The current teacher is a useful bootstrap, but it is not safe to assume it is good. The audit should compare the teacher against reveal-only, retrieve-only, no-hold, and random policy baselines on the current proxy suite.

## 6. Small but mandatory engineering cleanups

These changes do not change model quality directly, but they reduce evaluation ambiguity and future regressions.

In `tests/conftest.py`, remove the hardcoded `/workspace/VLAarchtests/code/reveal_vla_bimanual` path. Replace it with a path derived from `Path(__file__).resolve()` so tests run anywhere.
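
A sketch of the replacement. The `code/reveal_vla_bimanual` layout relative to `tests/` is an assumption about the repo structure; the mechanism is simply anchoring on the conftest file's own location.

```python
# tests/conftest.py (sketch): derive paths from this file's own location
# instead of a hardcoded /workspace prefix, so the suite runs anywhere.
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]          # tests/ -> repo root
CODE_ROOT = REPO_ROOT / "code" / "reveal_vla_bimanual"   # assumed layout

if str(CODE_ROOT) not in sys.path:
    sys.path.insert(0, str(CODE_ROOT))
```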

In `eval/run_rlbench_rollout_eval.py`, preserve richer episode traces. Save the chosen macro mode, planner scores, confidence, predicted reocclusion, path recoveries, noop fallbacks, and whether support-mode conditioning was enabled.

In `eval/run_reveal_benchmark.py`, stop using only the default 24 episodes for serious comparisons. Keep 24 as a smoke benchmark, but add a “serious” mode at 100 or 200 episodes per proxy.

In `eval/run_reveal_benchmark.py`, explicitly report `chunk_commit_steps` and do not leave the main reveal benchmark at a commit horizon of zero by default. These tasks are not purely one-step reactive.

In the eval reporting utilities, add bootstrap confidence intervals and paired-seed comparisons. The differences you care about are often a few percentage points, and unpaired noisy comparisons cannot resolve them.
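
For the paired-seed comparison, a percentile bootstrap on the per-seed success difference is enough. A sketch follows; the 95% level and the resample count are conventional choices, not repo requirements.

```python
import numpy as np


def paired_bootstrap_ci(success_a, success_b, n_boot=10000, seed=0):
    """95% percentile-bootstrap CI for mean(A - B) over paired per-seed successes."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(success_a, dtype=float) - np.asarray(success_b, dtype=float)
    # Resample seed indices with replacement; pairing is preserved because
    # the difference is taken per seed before resampling.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    lo, hi = np.percentile(means, [2.5, 97.5])
    return diff.mean(), (lo, hi)
```

If the interval excludes zero, the paired difference is unlikely to be evaluation noise at the chosen level.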

## 7. Exact new tests to verify the implementation

The current repo has contract tests. Keep them. Add the following behavioral tests.

### 7.1 Geometry and fusion tests

`tests/test_geometry_tokens_propagate.py`

Construct a tiny batch with fixed RGB and depth. Modify only the camera rotation. Verify that:
1. `geometry_tokens` change,
2. the fused scene representation changes when geometry is enabled,
3. the fused scene representation does not change when geometry is disabled.
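
The rotation-sensitivity assertion can be prototyped against a stub policy before wiring it to the real model. Everything below (the stub, the flattened shapes, the `use_geometry` flag) is illustrative; only the assertion pattern carries over.

```python
import torch
import torch.nn as nn


class DummyPolicy(nn.Module):
    """Stand-in with a geometry switch, so the test contract is runnable."""

    def __init__(self, use_geometry: bool):
        super().__init__()
        self.use_geometry = use_geometry
        self.rgb_proj = nn.Linear(12, 8)   # toy flattened-RGB feature
        self.geo_proj = nn.Linear(9, 8)    # toy camera-rotation feature

    def forward(self, rgb, cam_rot):
        feats = self.rgb_proj(rgb)
        if self.use_geometry:
            # Flatten the 3x3 rotation and fuse it into the representation.
            feats = feats + self.geo_proj(cam_rot.reshape(cam_rot.shape[0], -1))
        return feats


def rotation_sensitivity(policy, rgb, rot_a, rot_b):
    """Max absolute change of the representation under a camera rotation swap."""
    return (policy(rgb, rot_a) - policy(rgb, rot_b)).abs().max().item()
```

The test then asserts strictly positive sensitivity with geometry enabled and exactly zero with it disabled, mirroring items 2 and 3 above.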
329
+
330
+ `tests/test_camera_rotation_geometry.py`
331
+
332
+ Use two cameras with identical translation and different rotation. Verify that the policy representation is rotation-sensitive after the geometry fix. This should fail on the current code and pass after the change.
333
+
334
+ ### 7.2 Spatial memory tests
335
+
336
+ `tests/test_spatial_memory_occlusion_persistence.py`
337
+
338
+ Use a scripted proxy sequence where the target is briefly visible, then fully occluded, then visible again. Verify that belief memory retains a localized target belief during occlusion and sharpens it after reappearance. This should test both persistence and uncertainty.
339
+
340
+ `tests/test_memory_slot_write_gating.py`
341
+
342
+ Feed a scene where only the opening region changes. Verify that only a minority of memory slots or cells update. This prevents global overwriting.
343
+
344
+ `tests/test_reocclusion_memory_regression.py`
345
+
346
+ Create a scripted “open then release” sequence. Verify that memory tracks reocclusion and that predicted hold quality declines.
347
+
348
+ ### 7.3 World-model tests
349
+
350
+ `tests/test_world_model_null_rollout.py`
351
+
352
+ Assert that `null_rollout` returns an exact or near-exact identity state and does not apply unintended updates.
353
+
354
+ `tests/test_world_model_identity_rollout.py`
355
+
356
+ Assert that `identity_rollout` preserves state across steps while leaving logging fields consistent.
357
+
358
+ `tests/test_world_model_field_consistency.py`
359
+
360
+ Roll out one deterministic proxy step and compare predicted next-step fields against simulator privileged fields. Enforce MAE thresholds per field, not only a single scalar.
361
+
362
+ `tests/test_world_model_task_adapter.py`
363
+
364
+ Use the same initial field state with different task embeddings. Verify that transitions differ in a consistent way. This catches dead task-conditioning code paths.
365
+
366
+ ### 7.4 Candidate and planner tests
367
+
368
+ `tests/test_candidate_macro_coverage.py`
369
+
370
+ Verify that the proposal generator returns at least one candidate from each required macro family when requested.
371
+
372
+ `tests/test_planner_reocclusion_gating.py`
373
+
374
+ Create a scripted case where one candidate retrieves immediately but causes opening collapse, and another candidate maintains the opening first. Verify that the planner picks the maintain-first plan.
375
+
376
+ `tests/test_proposal_semantic_diversity.py`
377
+
378
+ Do not measure diversity only by vector distance. Also verify macro-family diversity and rollout outcome diversity.
379
+
380
+ ### 7.5 Task-head tests
381
+
382
+ `tests/test_task_conditioned_head_shapes.py`
383
+
384
+ Verify output presence and shapes for all common fields and all task-specific fields.
385
+
386
+ `tests/test_task_metric_monotonicity.py`
387
+
388
+ Use small synthetic perturbations:
389
+ - increase aperture in bag: `opening_quality` should increase;
390
+ - increase canopy gap in foliage: `actor_feasibility` should increase;
391
+ - over-lift cloth: `fold_preservation` should decrease.
392
+
393
+ These are not full scientific tests, but they catch dead or miswired heads quickly.
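
The perturbation checks reduce to a single helper. The toy heads below are purely illustrative stand-ins for the real task-specific outputs; the helper is what the monotonicity test would actually reuse.

```python
def check_monotonic(head_fn, state, key, delta, expect="increase"):
    """Perturb one input variable; check the predicted metric moves as expected."""
    base = head_fn(state)
    bumped = dict(state)
    bumped[key] = bumped[key] + delta
    new = head_fn(bumped)
    return new > base if expect == "increase" else new < base


# Toy stand-ins for task heads (illustrative functional forms only).
bag_opening_quality = lambda s: s["aperture"] - 0.5 * s["rim_slip"]
cloth_fold_preservation = lambda s: 1.0 - s["lift_height"] ** 2
```

Each bullet above becomes one `assert check_monotonic(...)` call with the appropriate direction.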

### 7.6 Dataset and leakage tests

`tests/test_dataset_hard_negative_presence.py`

Sample dataset items and verify that candidate sets contain hard negative families, not just teacher-centered noise.

`tests/test_no_leak_with_new_labels.py`

Extend the no-leak assertions to cover all new task-specific labels and maps. The proxy dataset must keep using rendered observations only on the input side.

`tests/test_teacher_audit.py`

Require the teacher to beat random, retrieve-only, and reveal-only on the proxy metrics. If the teacher itself is weak, the whole planner training signal is questionable.

### 7.7 Scripted proxy behavior suite

Add a new deterministic behavioral test suite, for example under `tests/test_proxy_scripted_bench.py`.

This suite should include 10 to 20 deterministic seeds per task with hand-designed initial states. The expected winner should be obvious.

Required scripted cases:
- bag: `maintain_mouth` should beat an immediate `retrieve` on hold persistence and success;
- foliage: `pin_canopy` should beat `random_swipe` on reocclusion and visibility AUC;
- cloth: `stabilize_fold` should beat `lift_high` on fold-preservation and success.

The full model does not need to be perfect on these, but the planner should select the intended candidate at least 80 percent of the time.

## 8. Exact benchmark plan to estimate performance

Separate the benchmarks into two layers. The first layer verifies that the implementation behaves correctly. The second estimates real performance against baselines.

### 8.1 Layer A: implementation-verification benchmarks

These are not publication benchmarks. They are gates.

Run the full unit and integration suite after every architecture milestone:

```bash
PYTHONPATH=code/reveal_vla_bimanual pytest tests -q
```

After the new behavioral tests are added, require all of the following before moving on:
- all geometry propagation tests pass;
- the scripted proxy suite passes;
- world-model null and identity ablations pass exactly;
- candidate macro coverage passes;
- no-leak assertions pass with the new task fields.

Then run a deterministic proxy smoke benchmark on fixed seeds (for example 10 per task) to catch obvious regressions:

```bash
cd code/reveal_vla_bimanual
python -m eval.run_reveal_benchmark \
  --model full=/abs/path/checkpoint.pt \
  --episodes 10 \
  --proxies foliage bag cloth \
  --chunk-commit-steps 4 \
  --output-root /abs/path/reports/reveal_smoke
```

This benchmark is only for regression detection. It is not a performance claim.

### 8.2 Layer B: strengthened proxy benchmark (main task-aligned benchmark now)

This should become the main internal benchmark until real teleop data exists.

Use the existing `foliage`, `bag`, and `cloth` proxies, but strengthen them and evaluate seriously:
- at least 100 deterministic seeds per proxy for final comparisons;
- paired-seed evaluation across all ablations;
- chunk commit horizons of at least 4, and also report a 0/2/4 sweep once;
- no teacher involvement during evaluation.

Run the base benchmark:

```bash
cd code/reveal_vla_bimanual
python -m eval.run_reveal_benchmark \
  --model full=/abs/path/checkpoint.pt \
  --episodes 100 \
  --proxies foliage bag cloth \
  --chunk-commit-steps 4 \
  --output-root /abs/path/reports/reveal_full
```

Run the required paired ablations from the same checkpoint family or retrained checkpoints:
- no geometry tokens
- no spatial memory
- compact world model instead of spatial
- no planner
- planner with Gaussian candidates only
- no task-conditioned head
- no support-mode conditioning

The proxy benchmark must report at least these metrics:
- retrieve success
- reveal success
- target visibility AUC
- actor-feasibility AUC
- hold persistence
- reocclusion rate
- disturbance cost
- planner top-1 on candidate rollouts
- world-model next-step MAE
- uncertainty calibration
- candidate ranking NDCG

Add task-specific metrics:
- foliage: gap width, damage proxy, release-collapse rate
- bag: aperture width or area, rim slip rate, insertion success
- cloth: fold-preservation score, layer separation quality, drag-induced disturbance

Acceptance gate for continuing toward public baseline comparison:
- the full model should beat the current repo’s RGB-D baseline on mean proxy success and on at least two of the three proxies;
- planner-on should beat planner-off on at least two of the three proxies and on hard-negative candidate ranking;
- the spatial world model should beat compact and null rollouts on persistence and reocclusion prediction;
- the task-conditioned head should beat the generic head on at least one task-specific metric per target task.
### 8.3 Layer C: RLBench / PerAct2 bimanual rollout benchmark

The repo already has the right hook for this. Use `run_rlbench_rollout_eval.py` and `run_peract2_task_sweep.py` as the main public benchmark entry points. Do not treat `run_peract2_launch_smoke.py` as evaluation. It is only a launch check.

Run the full existing PerAct2 13-task split from `sim_rlbench/task_splits.py::PERACT2_BIMANUAL_TASKS`:

```bash
cd code/reveal_vla_bimanual
python -m eval.run_peract2_task_sweep \
  --checkpoint /abs/path/checkpoint.pt \
  --output-root /abs/path/reports/peract2_13 \
  --episodes-per-task 25 \
  --episode-length 20 \
  --resolution 224 \
  --chunk-commit-steps 4 \
  --allow-unsupervised-planning \
  --headless
```

Also run direct single-task evaluations when debugging:

```bash
cd code/reveal_vla_bimanual
python -m eval.run_rlbench_rollout_eval \
  --checkpoint /abs/path/checkpoint.pt \
  --output-dir /abs/path/reports/rlbench_debug \
  --tasks RightOpenDrawer \
  --episodes-per-task 25 \
  --episode-length 20 \
  --resolution 224 \
  --plan \
  --chunk-commit-steps 4 \
  --allow-unsupervised-planning \
  --headless
```

This benchmark is not a direct match to the three target tasks, but it is the main public bimanual sanity check. It measures whether the structured modifications hurt or help general bimanual competence.

Required comparisons on this benchmark:
- current repo best checkpoint
- full improved model
- no-planner ablation
- compact world model ablation
- no geometry ablation
- no task-conditioning ablation

If external baseline code is available, evaluate against:
- PerAct2
- InterACT
- VoxAct-B
- AnyBimanual

If compute allows, also compare against foundation-scale baselines as a separate category:
- TwinVLA
- RDT-1B

Fairness requirements:
- same camera setup if possible (front plus both wrists);
- same resolution;
- same episode length and reset policy;
- same task list;
- same number of evaluation episodes;
- report whether baselines use extra large-scale pretraining.

This benchmark should report:
- per-task success
- mean success
- mean return
- path recoveries
- noop fallbacks
- plan-on vs plan-off
- per-episode planner traces for error analysis

### 8.4 Layer D: deformable-manipulation public benchmarks

You do not yet have custom teleop data, so the closest public matches for bag and cloth should be used now.

Recommended benchmarks:
- DeformableRavens
- SoftGym cloth tasks
- DaXBench cloth tasks

The exact subset should be chosen based on the available tasks, but the mapping is straightforward. Bag-like opening and insertion tasks are the closest public proxy for the bag environment. Cloth lifting, separation, and manipulation tasks are the closest public proxy for the suitcase environment. There is no equally good public foliage benchmark, so the strengthened foliage proxy will remain the main stand-in until custom data exists.

Required evaluation protocol:
- same observation modalities across methods;
- same action horizon where possible;
- same random seeds;
- same episode budgets;
- report both success and task-specific deformation metrics.

Add at least these extra metrics on the deformable benchmarks:
- opening quality or aperture quality
- hold persistence under actor motion
- reocclusion or collapse rate
- disturbance cost
- fold-preservation or structural-preservation score

### 8.5 Layer E: optional exploratory / active-perception benchmark

If EFM-10 or BAP code and data are actually available when implementation starts, add them. That benchmark is conceptually close to your task family because it measures exploratory plus focused manipulation under occlusion. Do not block the project on it if the code is not readily usable.

### 8.6 Layer F: optional broad generalization benchmark

If time allows, add RoboTwin 2.0 as a general bimanual breadth check. It is not a direct target-task match, but it is useful for checking whether the structured reveal-and-retrieve bias damages general bimanual transfer.

## 9. Baseline strategy

There are two baseline groups, and they should not be mixed carelessly.

The first group is matched-data or matched-setting baselines. These are the most useful for fair engineering comparison. Use PerAct2, InterACT, VoxAct-B, and AnyBimanual if code is available in a compatible evaluation setting.

The second group is foundation-scale baselines. These are useful, but they are not apples-to-apples unless you disclose the pretraining and model-scale differences clearly. Use TwinVLA and RDT-1B in this category if compute allows.

Do not declare victory because the improved model beats the current repo checkpoint. That is a necessary condition, not the target claim.

## 10. Acceptance criteria for “ready to collect real data”

Do not move into expensive teleop collection until all of the following are true.

First, the geometry and spatial memory tests pass and stay green for multiple checkpoints.

Second, the strengthened proxy benchmark shows that the full model beats the current repo baseline convincingly. The minimum bar should be improvement in overall proxy success plus improvement on at least two of the three task types.

Third, planner-on must beat planner-off on hard-negative ranking and on task success. If the planner does not beat the decoder baseline, the explicit planning stack is not yet earning its complexity.

Fourth, the spatial world model must beat the compact and null baselines on persistence and reocclusion prediction. If it does not, the planning story is still too weak.

Fifth, the improved model should at least match strong public baselines on the RLBench / PerAct2 suite, and ideally exceed them on the tasks most related to opening, holding, uncovering, and coordinated retrieval. If it is significantly behind there, the architecture is still too immature.

## 11. Recommended implementation order

Phase 1 should fix information flow and evaluation trustworthiness. Implement geometry propagation, camera orientation encoding, and the path cleanup in `tests/conftest.py`. Then add the new geometry tests and rerun the current proxy benchmark.

Phase 2 should add task-aware semantic candidates and hard-negative data generation. This is the fastest path to making the planner meaningful without yet rewriting the full memory and world-model stack.

Phase 3 should add task-conditioned reveal outputs and the strengthened proxy metrics. At this stage the proxy benchmark should start reflecting the real task failure modes.

Phase 4 should replace pooled memory and compact rollout with the new spatial memory and spatial world model. This is the biggest change and should only happen after the eval harness can tell whether it helped.

Phase 5 should run the full internal ablation suite, then RLBench / PerAct2, then the deformable public benchmarks, and only then decide whether the architecture is strong enough to justify real-data collection.

## 12. What to avoid

Do not treat launch smoke as performance evaluation.

Do not keep teacher-centered Gaussian candidates as the main planner supervision source.

Do not remove task structure in favor of a generic monolithic BC model unless the structured architecture clearly fails. Nothing in the current repo proves that it does.

Do not use only mean success. These tasks need persistence, reocclusion, and structural-preservation metrics.

Do not claim the current planner or current world model are validated. They are not, yet.

## 13. Minimal first patch set (the first pull request)

If only one implementation sprint is possible before deeper refactors, the first pull request should contain exactly this:

1. fix `geometry_tokens` propagation from backbone to fusion to policy output;
2. add camera rotation encoding in `DepthPatchAdapter`;
3. add `tests/test_geometry_tokens_propagate.py` and `tests/test_camera_rotation_geometry.py`;
4. replace the hardcoded path logic in `tests/conftest.py`;
5. extend `run_reveal_benchmark.py` reporting to save `chunk_commit_steps`, bootstrap confidence intervals, and paired-seed summaries;
6. add semantic macro candidates in `action_decoder.py` without yet deleting the Gaussian fallback;
7. add hard negative candidate generation in `sim_reveal/procedural_envs.py`;
8. add the deterministic scripted proxy benchmark suite.

This first patch set will not make the model SOTA. It will make the repo trustworthy enough to support the larger refactor.

## 14. Reference links

Repo root:
https://huggingface.co/lsnu/VLAarchtests/tree/main

Core files:
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/backbones.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/multiview_fusion.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/observation_memory.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/reveal_head.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/world_model.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/action_decoder.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/planner.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/models/policy.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/train/losses.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/sim_reveal/dataset.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/sim_reveal/procedural_envs.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/eval/run_reveal_benchmark.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/eval/run_rlbench_rollout_eval.py
https://huggingface.co/lsnu/VLAarchtests/blob/main/code/reveal_vla_bimanual/eval/run_peract2_task_sweep.py

Public benchmark / baseline references to align against:
PerAct2 / RLBench2 bimanual benchmark: https://bimanual.github.io/
InterACT: https://dannyran123.github.io/interact/
VoxAct-B: https://voxact-b.github.io/
AnyBimanual: https://anybimanual.github.io/
TwinVLA: https://twinvla.github.io/
RDT-1B: https://rdt-robotics.github.io/rdt-robotics/
DeformableRavens: https://deformableravens.github.io/
SoftGym: https://sites.google.com/view/softgym/home
DaXBench: https://daxbench.github.io/
EFM / BAP: https://efmanipulation.github.io/
RoboTwin 2.0: https://robotwin-platform.github.io/

## 15. Final recommendation

The architecture should be pursued, but only in a narrower and more explicit form: task-structured bimanual reveal-and-retrieve under elastic occlusion. The current repo is close enough to that idea to be worth continuing. The most important next step is not collecting real data yet. It is making the geometry path real, making the planner learn from hard failure cases, and making the world model spatial enough that “maintain the opening while the other arm retrieves” is something the system can actually predict rather than merely imitate.
results/session_results_20260326.md ADDED

+ # Session Results 2026-03-26
+
+ ## Verification
+
+ | Item | Value |
+ | --- | --- |
+ | command | `PYTHONPATH=/workspace/VLAarchtests_work/code/reveal_vla_bimanual python -m pytest -q /workspace/VLAarchtests_work/tests` |
+ | result | `33 passed` |
+
+ ## Proxy Checkpoints
+
+ | Run | Checkpoint | Summary |
+ | --- | --- | --- |
+ | spatial handoff | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_seed17/summary.json` |
+ | compact handoff | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_seed17/summary.json` |
+ | compact-phase handoff | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase_seed17/summary.json` |
+ | spatial-phase handoff | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_phase_seed17/checkpoint_best.pt` | `artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_spatial_phase_seed17/summary.json` |
+
+ ## Proxy Raw Result Files
+
+ | Artifact | File |
+ | --- | --- |
+ | smoke full | `artifacts/reports/reveal_smoke_mod/reveal_benchmark.json` |
+ | smoke no geometry | `artifacts/reports/reveal_smoke_nogeom/reveal_benchmark.json` |
+ | smoke no planner | `artifacts/reports/reveal_smoke_noplanner/reveal_benchmark.json` |
+ | serious spatial handoff compare | `artifacts/reports/reveal_handoff_compare_serious/reveal_benchmark.json` |
+ | serious compact-world-model compare | `artifacts/reports/reveal_handoff_compare_serious_compact/reveal_benchmark.json` |
+ | serious compact-phase compare | `artifacts/reports/reveal_phase_compare_serious_compact/reveal_benchmark.json` |
+ | serious spatial-phase compact-world-model compare | `artifacts/reports/reveal_phase_compare_serious_spatial_compactwm/reveal_benchmark.json` |
+ | compact-phase ablations | `artifacts/reports/reveal_phase_ablations_compact/ablations.json` |
+ | teacher audit | `artifacts/reports/reveal_teacher_audit_serious/teacher_audit.json` |
+
+ ## Proxy Raw Metrics
+
+ | File | Raw values |
+ | --- | --- |
+ | `artifacts/reports/reveal_smoke_mod/reveal_benchmark.json` | mean success `0.60` |
+ | `artifacts/reports/reveal_smoke_nogeom/reveal_benchmark.json` | mean success `0.60` |
+ | `artifacts/reports/reveal_smoke_noplanner/reveal_benchmark.json` | mean success `0.60` |
+ | `artifacts/reports/reveal_handoff_compare_serious/reveal_benchmark.json` | baseline mean success `0.5833`, handoff mean success `0.2167` |
+ | `artifacts/reports/reveal_handoff_compare_serious_compact/reveal_benchmark.json` | baseline mean success `0.5833`, handoff mean success `0.5200` |
+ | `artifacts/reports/reveal_phase_compare_serious_compact/reveal_benchmark.json` | baseline mean success `0.5833`, compact-phase mean success `0.5133` |
+ | `artifacts/reports/reveal_phase_compare_serious_spatial_compactwm/reveal_benchmark.json` | baseline mean success `0.5833`, spatial-phase compact-world-model mean success `0.4933` |
+
+ ## Compact-Phase Ablation Metrics
+
+ Source: `artifacts/reports/reveal_phase_ablations_compact/ablations.json`
+
+ | Setting | Mean success |
+ | --- | ---: |
+ | full compact-phase | 0.5133 |
+ | `no_geometry` | 0.5133 |
+ | `no_spatial_memory` | 0.4967 |
+ | `compact_world_model` | 0.5133 |
+ | `no_planner` | 0.4333 |
+ | `gaussian_candidates_only` | 0.4667 |
+ | `no_task_head` | 0.5133 |
+ | `no_support_mode_conditioning` | 0.5133 |
+
+ ## RLBench Checkpoints
+
+ | Run | Best checkpoint | Stable checkpoint or note |
+ | --- | --- | --- |
+ | subset3 valid9 | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_valid9/checkpoint_best.pt` | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_valid9/checkpoint_stable.pt` |
+ | subset3 common23 | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_common23/checkpoint_best.pt` | `artifacts/outputs/rlbench_current/rlbench_subset3_backbone_only_clip_current_common23/checkpoint_stable.pt` |
+ | lift-ball wide | `artifacts/outputs/rlbench_current/rlbench_lift_ball_backbone_only_clip_current_wide/checkpoint_best.pt` | `artifacts/outputs/rlbench_current/rlbench_lift_ball_backbone_only_clip_current_wide/checkpoint_stable.pt` |
+ | push-box step1 | `artifacts/outputs/rlbench_current/rlbench_push_box_backbone_only_clip_step1/checkpoint_best.pt` | epoch-0 checkpoint with training history stored in the checkpoint file |
+
+ ## RLBench Direct-Policy Result Files
+
+ | Artifact | File |
+ | --- | --- |
+ | dual-buttons baseline, IK rescale | `artifacts/reports/rlbench_dual_buttons_baseline_len100_ep1_ik_rescale/rollout_eval.json` |
+ | dual-buttons common23, IK rescale | `artifacts/reports/rlbench_dual_buttons_common23_len100_ep1_ik_rescale/rollout_eval.json` |
+ | push-box common23, IK rescale | `artifacts/reports/rlbench_push_box_common23_len100_ep1_ik_rescale/rollout_eval.json` |
+ | lift-ball wide, one-step replanning | `artifacts/reports/rlbench_lift_ball_wide_len160_ep1_ik_c1/rollout_eval.json` |
+ | push-box step1, one-step replanning | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1/rollout_eval.json` |
+ | push-box step1, one-step replanning, `delta_scale=0.05` | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1_s005/rollout_eval.json` |
+
+ ## RLBench Direct-Policy Raw Metrics
+
+ | File | Mean success | Mean return | Path recoveries | Noop fallbacks |
+ | --- | ---: | ---: | --- | --- |
+ | `artifacts/reports/rlbench_dual_buttons_baseline_len100_ep1_ik_rescale/rollout_eval.json` | 0.0 | 0.0 | `[2]` | `[98]` |
+ | `artifacts/reports/rlbench_dual_buttons_common23_len100_ep1_ik_rescale/rollout_eval.json` | 0.0 | 0.0 | `[96]` | `[4]` |
+ | `artifacts/reports/rlbench_push_box_common23_len100_ep1_ik_rescale/rollout_eval.json` | 0.0 | 0.0 | `[98]` | `[1]` |
+ | `artifacts/reports/rlbench_lift_ball_wide_len160_ep1_ik_c1/rollout_eval.json` | 0.0 | 0.0 | `[148]` | `[11]` |
+ | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1/rollout_eval.json` | 0.0 | 0.0 | `[177]` | `[0]` |
+ | `artifacts/reports/rlbench_push_box_step1_ep1_ik_c1_s005/rollout_eval.json` | 0.0 | 0.0 | `[180]` | `[0]` |
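In these rollout tables, the scalar columns are per-episode means and the bracketed columns are raw per-episode counts. A minimal sketch of that aggregation, assuming one list entry per episode (the helper name and field names are hypothetical, not the repo's actual `rollout_eval.json` schema):

```python
def summarize_rollout(successes, returns, path_recoveries, noop_fallbacks):
    """Collapse per-episode rollout records into one table row.

    successes/returns are averaged into scalars; the path-recovery and
    noop-fallback counts stay as per-episode lists, matching the
    bracketed columns above.
    """
    n = len(successes)
    return {
        "mean_success": sum(successes) / n,
        "mean_return": sum(returns) / n,
        "path_recoveries": path_recoveries,
        "noop_fallbacks": noop_fallbacks,
    }

# Single-episode run (dual-buttons baseline row): 2 path recoveries,
# 98 noop fallbacks over a 100-step episode, no success.
row = summarize_rollout([0.0], [0.0], [2], [98])
# row["mean_success"] == 0.0
```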
+
+ ## RLBench Retrieval Result Files
+
+ | Artifact | File |
+ | --- | --- |
+ | push-box kNN, `bank_stride=4`, `top_k=5`, `time_window=8`, `episodes=1` | `artifacts/reports/rlbench_push_box_knn_step1_ep1/rollout_eval.json` |
+ | push-box kNN, `bank_stride=4`, `top_k=5`, `time_window=8`, `episodes=5` | `artifacts/reports/rlbench_push_box_knn_step1_ep5/rollout_eval.json` |
+ | push-box kNN, `bank_stride=1`, `top_k=1`, `time_window=4`, `episodes=5` | `artifacts/reports/rlbench_push_box_knn_step1_ep5_top1_dense/rollout_eval.json` |
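The `bank_stride`, `top_k`, and `time_window` settings above control how the retrieval bank is subsampled and queried. A self-contained sketch of a kNN retrieval policy in that style, assuming Euclidean feature distance and top-k action averaging (the function names and exact matching rule are illustrative, not the repo's retrieval code):

```python
import math

def build_bank(episodes, bank_stride=4):
    """episodes: list of [(feature_vec, action_vec), ...] trajectories.
    Keep every bank_stride-th (feature, action, timestep) triple."""
    bank = []
    for ep in episodes:
        for t in range(0, len(ep), bank_stride):
            feat, action = ep[t]
            bank.append((feat, action, t))
    return bank

def knn_retrieve_action(bank, query_feat, step, top_k=5, time_window=8):
    """Average the actions of the top_k nearest bank entries whose
    stored timestep lies within +/- time_window of the current step."""
    candidates = [
        (math.dist(feat, query_feat), action)
        for feat, action, t in bank
        if abs(t - step) <= time_window
    ]
    if not candidates:
        return None  # caller falls back (e.g. a noop action)
    candidates.sort(key=lambda pair: pair[0])
    chosen = [action for _, action in candidates[:top_k]]
    return [sum(vals) / len(chosen) for vals in zip(*chosen)]
```

With `bank_stride=1` and `top_k=1` this degenerates to copying the single nearest demonstration action, which corresponds to the dense top-1 configuration in the table above.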
+
+ ## RLBench Retrieval Raw Metrics
+
+ | File | Bank size | Successes | Mean success | Mean return | Path recoveries | Noop fallbacks |
+ | --- | ---: | --- | ---: | ---: | --- | --- |
+ | `artifacts/reports/rlbench_push_box_knn_step1_ep1/rollout_eval.json` | 2815 | `[1.0]` | 1.0 | 1.0 | `[1]` | `[0]` |
+ | `artifacts/reports/rlbench_push_box_knn_step1_ep5/rollout_eval.json` | 2815 | `[0.0, 1.0, 0.0, 0.0, 0.0]` | 0.2 | 0.2 | `[0, 0, 45, 15, 0]` | `[0, 0, 0, 0, 0]` |
+ | `artifacts/reports/rlbench_push_box_knn_step1_ep5_top1_dense/rollout_eval.json` | 11259 | `[0.0, 0.0, 1.0, 1.0, 0.0]` | 0.4 | 0.4 | `[10, 0, 0, 0, 0]` | `[0, 0, 0, 0, 0]` |