Rerun fixed null-rollout world-model ablation

Files changed (9)

MODEL_INDEX.md +1 -0
README.md +1 -1
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json +15 -0
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md +13 -0
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json +15 -0
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md +13 -0
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json +15 -0
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md +13 -0
results/phase_tracking.md +2 -2

MODEL_INDEX.md CHANGED

@@ -27,6 +27,7 @@
 | stage1 dummy `no_planner` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13/benchmark_no_planner/reveal_benchmark.json` |
 | stage1 dummy `no_role_symmetry` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13/benchmark_no_role_symmetry/reveal_benchmark.json` |
 | stage2 dummy `no_world_model` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark.json` |
+| stage2 dummy `no_world_model` pre-fix backup | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json` |
 | stage2 dummy `short_history` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_short_history/reveal_benchmark.json` |
 | stage3 clip RGB-D `no_depth` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17/benchmark_no_depth/reveal_benchmark.json` |

README.md CHANGED

@@ -116,7 +116,7 @@ Bundle uploaded from the `/workspace` runpod session dated `2026-03-25 UTC`.
 Full artifact roots are indexed in `MODEL_INDEX.md`.
-Note: the stored `stage2 dummy no_world_model` row above was produced before the `2026-03-25` null-rollout ablation fix in `ElasticRevealBimanualPolicy`. The raw artifact is retained unchanged, but it should be rerun before using it as a fair world-model comparison.
+Note: the `stage2 dummy no_world_model` row above reflects the `2026-03-25` post-fix null-rollout rerun. Pre-fix copies are retained as `reveal_benchmark_pre_null_rollout_fix.json` and `reveal_benchmark_pre_null_rollout_fix.md` under each `benchmark_no_world_model` seed directory.
 ## Raw Training Summaries

artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "full": {
+    "per_task_success": {
+      "foliage_proxy": 0.4166666666666667,
+      "bag_proxy": 0.5416666666666666,
+      "cloth_proxy": 0.6666666666666666
+    },
+    "mean_success": 0.5416666666666666,
+    "visibility_integral": 34.65096331967248,
+    "corridor_availability": 0.8933400412400564,
+    "reocclusion_rate": 0.0,
+    "persistence_horizon_mae": 2.6348470987268464,
+    "disturbance_cost": 0.36164701517878306
+  }
+}

artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md ADDED

	@@ -0,0 +1,13 @@

+# Reveal Proxy Benchmark
+## full
+- checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/checkpoint_best.pt
+- mean_success: 0.542
+- visibility_integral: 34.651
+- corridor_availability: 0.893
+- reocclusion_rate: 0.000
+- persistence_horizon_mae: 2.635
+- disturbance_cost: 0.362
+- foliage_proxy_success: 0.417
+- bag_proxy_success: 0.542
+- cloth_proxy_success: 0.667
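
The `.md` backup above carries the same numbers as its `.json` sibling, rounded to three decimals. A minimal sketch of that rendering follows; `render_summary` is a hypothetical helper written for illustration, not the repository's actual report writer, and the field order is simply read off the two files shown in this commit.

```python
# Hypothetical helper: illustrates how the markdown bullets relate to the JSON
# fields. It is NOT taken from the repository.
def render_summary(results: dict, checkpoint: str) -> str:
    lines = ["# Reveal Proxy Benchmark"]
    for section, metrics in results.items():
        lines.append(f"## {section}")
        lines.append(f"- checkpoint: {checkpoint}")
        # Scalar metrics first, in the order they appear in the JSON.
        for key, value in metrics.items():
            if key == "per_task_success":
                continue
            lines.append(f"- {key}: {value:.3f}")
        # Per-task success rates last, suffixed with "_success".
        for task, rate in metrics["per_task_success"].items():
            lines.append(f"- {task}_success: {rate:.3f}")
    return "\n".join(lines)


# Values copied from the seed21 pre-fix JSON in this commit.
seed21 = {
    "full": {
        "per_task_success": {
            "foliage_proxy": 0.4166666666666667,
            "bag_proxy": 0.5416666666666666,
            "cloth_proxy": 0.6666666666666666,
        },
        "mean_success": 0.5416666666666666,
        "visibility_integral": 34.65096331967248,
        "corridor_availability": 0.8933400412400564,
        "reocclusion_rate": 0.0,
        "persistence_horizon_mae": 2.6348470987268464,
        "disturbance_cost": 0.36164701517878306,
    }
}

checkpoint = (
    "/workspace/VLAarchtests/artifacts/outputs/r3d/"
    "proxy_interaction_r3d_stage2_dummy_seed21/checkpoint_best.pt"
)
summary = render_summary(seed21, checkpoint)
print(summary)
```

The printed bullets carry the same rounded values as the seed21 `.md` file, which is a quick way to confirm that a backup pair was not edited independently.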

artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "full": {
+    "per_task_success": {
+      "foliage_proxy": 0.5,
+      "bag_proxy": 0.5416666666666666,
+      "cloth_proxy": 0.6666666666666666
+    },
+    "mean_success": 0.5694444444444443,
+    "visibility_integral": 33.861522571908104,
+    "corridor_availability": 0.8863558504316542,
+    "reocclusion_rate": 0.0,
+    "persistence_horizon_mae": 1.6200438848336538,
+    "disturbance_cost": 0.2896964028477669
+  }
+}

artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md ADDED

	@@ -0,0 +1,13 @@

+# Reveal Proxy Benchmark
+## full
+- checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/checkpoint_best.pt
+- mean_success: 0.569
+- visibility_integral: 33.862
+- corridor_availability: 0.886
+- reocclusion_rate: 0.000
+- persistence_horizon_mae: 1.620
+- disturbance_cost: 0.290
+- foliage_proxy_success: 0.500
+- bag_proxy_success: 0.542
+- cloth_proxy_success: 0.667

artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "full": {
+    "per_task_success": {
+      "foliage_proxy": 0.4166666666666667,
+      "bag_proxy": 0.5416666666666666,
+      "cloth_proxy": 0.625
+    },
+    "mean_success": 0.5277777777777778,
+    "visibility_integral": 31.244046566387016,
+    "corridor_availability": 0.8636231190628476,
+    "reocclusion_rate": 0.00798611111111111,
+    "persistence_horizon_mae": 2.825085285899754,
+    "disturbance_cost": 0.3346485110103256
+  }
+}

artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md ADDED

	@@ -0,0 +1,13 @@

+# Reveal Proxy Benchmark
+## full
+- checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/checkpoint_best.pt
+- mean_success: 0.528
+- visibility_integral: 31.244
+- corridor_availability: 0.864
+- reocclusion_rate: 0.008
+- persistence_horizon_mae: 2.825
+- disturbance_cost: 0.335
+- foliage_proxy_success: 0.417
+- bag_proxy_success: 0.542
+- cloth_proxy_success: 0.625
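
As a consistency check on the three pre-fix backups, the sketch below recomputes each seed's `mean_success` from its `per_task_success` entries and then averages across seeds. It assumes, without confirmation from the repository, that the headline figure in `results/phase_tracking.md` is an unweighted mean over seeds 21 to 23; the variable names are illustrative.

```python
from statistics import mean

# Pre-fix per-task success rates copied from the three JSON backups in this commit.
pre_fix_per_task = {
    21: {"foliage_proxy": 0.4166666666666667,
         "bag_proxy": 0.5416666666666666,
         "cloth_proxy": 0.6666666666666666},
    22: {"foliage_proxy": 0.5,
         "bag_proxy": 0.5416666666666666,
         "cloth_proxy": 0.6666666666666666},
    23: {"foliage_proxy": 0.4166666666666667,
         "bag_proxy": 0.5416666666666666,
         "cloth_proxy": 0.625},
}

# The "mean_success" values stored in the same files.
stored_mean_success = {
    21: 0.5416666666666666,
    22: 0.5694444444444443,
    23: 0.5277777777777778,
}

# Each stored mean should equal the unweighted mean of the three task rates.
per_seed_mean = {seed: mean(tasks.values()) for seed, tasks in pre_fix_per_task.items()}
for seed, value in per_seed_mean.items():
    assert abs(value - stored_mean_success[seed]) < 1e-9, seed

# Assumption: the phase-tracking figure is the unweighted mean across seeds.
across_seeds = mean(per_seed_mean.values())
print(f"pre-fix no_world_model mean success: {across_seeds:.4f}")  # 0.5463
```

Under that assumption the pre-fix backups average to `0.5463`, the same value that `results/phase_tracking.md` records for the post-fix rerun.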

results/phase_tracking.md CHANGED

@@ -83,10 +83,10 @@ Date closed: `2026-03-25 UTC`
   - `short_history`: `0.5463` mean success, delta `0.0000`
 - Gate decisions:
   - hard success gate `>= 0.60`: fail
-  - `no_world_model` must hurt: not interpretable from the stored artifact alone; the recorded `no_world_model` run predates the `2026-03-25` null-rollout ablation fix and should be rerun for a fair comparison
+  - `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
   - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
   - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
-- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the world-model ablation needs a post-fix rerun before it can be interpreted fairly.
+- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
 ## Phase 3
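
The gate decisions in the updated `results/phase_tracking.md` can be restated as simple comparisons. The sketch below is illustrative only: the full-model mean success of `0.5463` is inferred from the two `0.0000` deltas, the strict and non-strict inequalities are assumptions about how each gate is defined, and the calibration gate is omitted because no numbers for it appear in this diff.

```python
# Numbers copied from the phase-tracking hunk in this commit.
full_success = 0.5463            # inferred: both ablations report a 0.0000 delta
no_world_model_success = 0.5463  # 2026-03-25 post-fix null-rollout rerun
short_history_success = 0.5463

phase1 = {"reocclusion_rate": 0.0000, "persistence_horizon_mae": 1.9553}
phase2 = {"reocclusion_rate": 0.0121, "persistence_horizon_mae": 2.2358}

# Assumed gate definitions (hypothetical; the repository may encode them differently).
gates = {
    "hard_success": full_success >= 0.60,
    # "must hurt": removing the world model should lower mean success.
    "no_world_model_hurts": full_success - no_world_model_success > 0.0,
    # Hard gate: full memory is not worse than short history.
    "memory_hard": full_success >= short_history_success,
    # Preferred gate: full memory strictly beats short history.
    "memory_preferred": full_success > short_history_success,
    # Lower is better for both state metrics.
    "reocclusion_improved": phase2["reocclusion_rate"] < phase1["reocclusion_rate"],
    "persistence_mae_improved": (
        phase2["persistence_horizon_mae"] < phase1["persistence_horizon_mae"]
    ),
}

for name, passed in gates.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

With these inputs only `memory_hard` passes, which matches the recorded outcome that the hard memory gate passes narrowly while every other gate fails.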