Commit 63a70c7 (verified) by lsnu · 1 parent: a0b57b7

Rerun fixed null-rollout world-model ablation

MODEL_INDEX.md CHANGED
@@ -27,6 +27,7 @@
  | stage1 dummy `no_planner` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13/benchmark_no_planner/reveal_benchmark.json` |
  | stage1 dummy `no_role_symmetry` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13/benchmark_no_role_symmetry/reveal_benchmark.json` |
  | stage2 dummy `no_world_model` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark.json` |
+ | stage2 dummy `no_world_model` pre-fix backup | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json` |
  | stage2 dummy `short_history` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_short_history/reveal_benchmark.json` |
  | stage3 clip RGB-D `no_depth` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17/benchmark_no_depth/reveal_benchmark.json` |

README.md CHANGED
@@ -116,7 +116,7 @@ Bundle uploaded from the `/workspace` runpod session dated `2026-03-25 UTC`.

  Full artifact roots are indexed in `MODEL_INDEX.md`.

- Note: the stored `stage2 dummy no_world_model` row above was produced before the `2026-03-25` null-rollout ablation fix in `ElasticRevealBimanualPolicy`. The raw artifact is retained unchanged, but it should be rerun before using it as a fair world-model comparison.
+ Note: the `stage2 dummy no_world_model` row above reflects the `2026-03-25` post-fix null-rollout rerun. Pre-fix copies are retained as `reveal_benchmark_pre_null_rollout_fix.json` and `reveal_benchmark_pre_null_rollout_fix.md` under each `benchmark_no_world_model` seed directory.

  ## Raw Training Summaries
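The retention scheme described in the README note can be sketched as a small helper that copies the existing benchmark outputs to `_pre_null_rollout_fix` names before a rerun overwrites them. This is only an illustration: `backup_before_rerun` is a hypothetical name, and the commit does not show how the backups were actually produced.

```python
import shutil
import tempfile
from pathlib import Path


def backup_before_rerun(bench_dir, suffix="_pre_null_rollout_fix"):
    """Copy reveal_benchmark.{json,md} to suffixed names (hypothetical helper)."""
    backups = []
    for name in ("reveal_benchmark.json", "reveal_benchmark.md"):
        src = Path(bench_dir) / name
        if not src.exists():
            continue  # nothing to preserve for this format
        dst = src.with_name(f"{src.stem}{suffix}{src.suffix}")
        if dst.exists():
            # Refuse to clobber an earlier backup.
            raise FileExistsError(dst)
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        backups.append(dst)
    return backups


# Demonstration on a throwaway directory instead of the real artifact tree.
with tempfile.TemporaryDirectory() as tmp:
    bench = Path(tmp) / "benchmark_no_world_model"
    bench.mkdir()
    (bench / "reveal_benchmark.json").write_text("{}")
    created = [p.name for p in backup_before_rerun(bench)]
    print(created)
```

Refusing to overwrite an existing backup keeps the first pre-fix copy intact if the rerun is repeated.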
 
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "full": {
+     "per_task_success": {
+       "foliage_proxy": 0.4166666666666667,
+       "bag_proxy": 0.5416666666666666,
+       "cloth_proxy": 0.6666666666666666
+     },
+     "mean_success": 0.5416666666666666,
+     "visibility_integral": 34.65096331967248,
+     "corridor_availability": 0.8933400412400564,
+     "reocclusion_rate": 0.0,
+     "persistence_horizon_mae": 2.6348470987268464,
+     "disturbance_cost": 0.36164701517878306
+   }
+ }
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md ADDED
@@ -0,0 +1,13 @@
+ # Reveal Proxy Benchmark
+
+ ## full
+ - checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/checkpoint_best.pt
+ - mean_success: 0.542
+ - visibility_integral: 34.651
+ - corridor_availability: 0.893
+ - reocclusion_rate: 0.000
+ - persistence_horizon_mae: 2.635
+ - disturbance_cost: 0.362
+ - foliage_proxy_success: 0.417
+ - bag_proxy_success: 0.542
+ - cloth_proxy_success: 0.667
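The Markdown summary above appears to be the JSON file's metrics rounded to three decimals, with per-task successes listed last. A minimal sketch of that relationship follows; `render_summary` is a hypothetical function written for illustration, not the repository's renderer, and it omits the `checkpoint` line because that value is not stored in the JSON.

```python
import json

# Seed 21 pre-fix metrics, copied from the JSON file in this commit.
SEED21_JSON = """
{
  "full": {
    "per_task_success": {
      "foliage_proxy": 0.4166666666666667,
      "bag_proxy": 0.5416666666666666,
      "cloth_proxy": 0.6666666666666666
    },
    "mean_success": 0.5416666666666666,
    "visibility_integral": 34.65096331967248,
    "corridor_availability": 0.8933400412400564,
    "reocclusion_rate": 0.0,
    "persistence_horizon_mae": 2.6348470987268464,
    "disturbance_cost": 0.36164701517878306
  }
}
"""


def render_summary(section, metrics):
    """Render one benchmark section as Markdown bullets (hypothetical helper)."""
    lines = [f"## {section}"]
    # Scalar metrics first, in the order they appear in the JSON.
    for key, value in metrics.items():
        if key != "per_task_success":
            lines.append(f"- {key}: {value:.3f}")
    # Per-task successes last, with a "_success" suffix on each task name.
    for task, value in metrics["per_task_success"].items():
        lines.append(f"- {task}_success: {value:.3f}")
    return lines


report = json.loads(SEED21_JSON)
lines = render_summary("full", report["full"])
print("\n".join(lines))
```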
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "full": {
+     "per_task_success": {
+       "foliage_proxy": 0.5,
+       "bag_proxy": 0.5416666666666666,
+       "cloth_proxy": 0.6666666666666666
+     },
+     "mean_success": 0.5694444444444443,
+     "visibility_integral": 33.861522571908104,
+     "corridor_availability": 0.8863558504316542,
+     "reocclusion_rate": 0.0,
+     "persistence_horizon_mae": 1.6200438848336538,
+     "disturbance_cost": 0.2896964028477669
+   }
+ }
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md ADDED
@@ -0,0 +1,13 @@
+ # Reveal Proxy Benchmark
+
+ ## full
+ - checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/checkpoint_best.pt
+ - mean_success: 0.569
+ - visibility_integral: 33.862
+ - corridor_availability: 0.886
+ - reocclusion_rate: 0.000
+ - persistence_horizon_mae: 1.620
+ - disturbance_cost: 0.290
+ - foliage_proxy_success: 0.500
+ - bag_proxy_success: 0.542
+ - cloth_proxy_success: 0.667
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "full": {
+     "per_task_success": {
+       "foliage_proxy": 0.4166666666666667,
+       "bag_proxy": 0.5416666666666666,
+       "cloth_proxy": 0.625
+     },
+     "mean_success": 0.5277777777777778,
+     "visibility_integral": 31.244046566387016,
+     "corridor_availability": 0.8636231190628476,
+     "reocclusion_rate": 0.00798611111111111,
+     "persistence_horizon_mae": 2.825085285899754,
+     "disturbance_cost": 0.3346485110103256
+   }
+ }
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md ADDED
@@ -0,0 +1,13 @@
+ # Reveal Proxy Benchmark
+
+ ## full
+ - checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/checkpoint_best.pt
+ - mean_success: 0.528
+ - visibility_integral: 31.244
+ - corridor_availability: 0.864
+ - reocclusion_rate: 0.008
+ - persistence_horizon_mae: 2.825
+ - disturbance_cost: 0.335
+ - foliage_proxy_success: 0.417
+ - bag_proxy_success: 0.542
+ - cloth_proxy_success: 0.625
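As a cross-check on the three pre-fix backups above, the sketch below recomputes each seed's `mean_success` from its per-task values and then averages across seeds. The numbers are copied from the JSON files in this commit. The variable names, and the assumption that the aggregate is an unweighted mean over seeds 21 to 23, are illustrative and not taken from the repository's evaluation code.

```python
from statistics import mean

# Per-task success copied from the three pre-fix JSON backups in this commit.
per_task = {
    21: {"foliage_proxy": 0.4166666666666667,
         "bag_proxy": 0.5416666666666666,
         "cloth_proxy": 0.6666666666666666},
    22: {"foliage_proxy": 0.5,
         "bag_proxy": 0.5416666666666666,
         "cloth_proxy": 0.6666666666666666},
    23: {"foliage_proxy": 0.4166666666666667,
         "bag_proxy": 0.5416666666666666,
         "cloth_proxy": 0.625},
}

# mean_success as stored in each backup file.
reported = {21: 0.5416666666666666, 22: 0.5694444444444443, 23: 0.5277777777777778}

# Each seed's mean_success should equal the mean of its three task scores.
per_seed = {seed: mean(tasks.values()) for seed, tasks in per_task.items()}
for seed, value in per_seed.items():
    assert abs(value - reported[seed]) < 1e-12, seed

# Assumed aggregation: unweighted mean over the three seeds.
across_seeds = mean(per_seed.values())
print(f"pre-fix no_world_model mean success: {across_seeds:.4f}")
```

Under that assumption the pre-fix aggregate rounds to `0.5463`, the same figure that `results/phase_tracking.md` reports for the post-fix rerun.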
results/phase_tracking.md CHANGED
@@ -83,10 +83,10 @@ Date closed: `2026-03-25 UTC`
  - `short_history`: `0.5463` mean success, delta `0.0000`
  - Gate decisions:
  - hard success gate `>= 0.60`: fail
- - `no_world_model` must hurt: not interpretable from the stored artifact alone; the recorded `no_world_model` run predates the `2026-03-25` null-rollout ablation fix and should be rerun for a fair comparison
+ - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
  - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
  - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
- - Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the world-model ablation needs a post-fix rerun before it can be interpreted fairly.
+ - Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.

  ## Phase 3

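The gate decisions in the `results/phase_tracking.md` hunk above reduce to simple comparisons on mean success. The sketch below restates them with the figures quoted in the hunk. The function names and the strict inequality for "must hurt" are assumptions made for illustration; the repository's actual gate logic is not shown in this commit.

```python
FULL_MEAN = 0.5463            # full model; the hunk states it equals short-history
NO_WORLD_MODEL_MEAN = 0.5463  # post-fix null-rollout rerun
SHORT_HISTORY_MEAN = 0.5463


def ablation_must_hurt(full_mean, ablated_mean):
    """Return (drop, passed); assumes the gate needs a strictly positive drop."""
    drop = full_mean - ablated_mean
    return drop, drop > 0.0


def hard_success_gate(full_mean, threshold=0.60):
    """Hard success gate from the hunk: mean success >= 0.60."""
    return full_mean >= threshold


drop, world_model_gate = ablation_must_hurt(FULL_MEAN, NO_WORLD_MODEL_MEAN)
success_gate = hard_success_gate(FULL_MEAN)

# Memory gates as worded in the hunk: the hard gate only requires not losing
# to short history, the preferred gate requires beating it.
memory_hard_gate = FULL_MEAN >= SHORT_HISTORY_MEAN
memory_preferred_gate = FULL_MEAN > SHORT_HISTORY_MEAN

print(f"no_world_model drop: {drop:.4f}, gate passed: {world_model_gate}")
print(f"hard success gate passed: {success_gate}")
print(f"memory gates (hard, preferred): {memory_hard_gate}, {memory_preferred_gate}")
```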