Rerun fixed null-rollout world-model ablation
- MODEL_INDEX.md +1 -0
- README.md +1 -1
- artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json +15 -0
- artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md +13 -0
- artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json +15 -0
- artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md +13 -0
- artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json +15 -0
- artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md +13 -0
- results/phase_tracking.md +2 -2
MODEL_INDEX.md
CHANGED
@@ -27,6 +27,7 @@
 | stage1 dummy `no_planner` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13/benchmark_no_planner/reveal_benchmark.json` |
 | stage1 dummy `no_role_symmetry` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage1_dummy_seed13/benchmark_no_role_symmetry/reveal_benchmark.json` |
 | stage2 dummy `no_world_model` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark.json` |
+| stage2 dummy `no_world_model` pre-fix backup | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json` |
 | stage2 dummy `short_history` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_short_history/reveal_benchmark.json` |
 | stage3 clip RGB-D `no_depth` | `artifacts/outputs/r3d/proxy_interaction_r3d_stage3_clip_rgbd_seed17/benchmark_no_depth/reveal_benchmark.json` |
 
README.md
CHANGED
@@ -116,7 +116,7 @@ Bundle uploaded from the `/workspace` runpod session dated `2026-03-25 UTC`.
 
 Full artifact roots are indexed in `MODEL_INDEX.md`.
 
-Note: the
+Note: the `stage2 dummy no_world_model` row above reflects the `2026-03-25` post-fix null-rollout rerun. Pre-fix copies are retained as `reveal_benchmark_pre_null_rollout_fix.json` and `reveal_benchmark_pre_null_rollout_fix.md` under each `benchmark_no_world_model` seed directory.
 
 ## Raw Training Summaries
 
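The README note describes keeping pre-fix copies next to the rerun outputs. A minimal sketch of how such backups could be made before a rerun is shown below. The helper name `backup_pre_fix` and the refuse-to-overwrite behaviour are assumptions for illustration; the commit does not show how the backups were actually produced.

```python
import shutil
from pathlib import Path


def backup_pre_fix(bench_dir: Path, suffix: str = "_pre_null_rollout_fix") -> list[Path]:
    """Copy reveal_benchmark.{json,md} to suffixed backups (hypothetical helper).

    Existing backups are never overwritten, so a second call is a no-op.
    """
    made = []
    for ext in (".json", ".md"):
        src = bench_dir / f"reveal_benchmark{ext}"
        dst = bench_dir / f"reveal_benchmark{suffix}{ext}"
        if src.exists() and not dst.exists():
            shutil.copy2(src, dst)  # copy2 also preserves file timestamps
            made.append(dst)
    return made
```

Calling it once per `benchmark_no_world_model` seed directory before the rerun would yield the file names listed in this commit.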
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json
ADDED
@@ -0,0 +1,15 @@
+{
+  "full": {
+    "per_task_success": {
+      "foliage_proxy": 0.4166666666666667,
+      "bag_proxy": 0.5416666666666666,
+      "cloth_proxy": 0.6666666666666666
+    },
+    "mean_success": 0.5416666666666666,
+    "visibility_integral": 34.65096331967248,
+    "corridor_availability": 0.8933400412400564,
+    "reocclusion_rate": 0.0,
+    "persistence_horizon_mae": 2.6348470987268464,
+    "disturbance_cost": 0.36164701517878306
+  }
+}
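In the file above, `mean_success` appears to be the unweighted mean of the three `per_task_success` entries. The sketch below checks that relationship on the seed 21 values. The function name `check_mean_success` and the tolerance are assumptions, not part of the repository.

```python
import math

# Values copied from the seed21 pre-fix JSON above (other metrics omitted).
PRE_FIX_SEED21 = {
    "full": {
        "per_task_success": {
            "foliage_proxy": 0.4166666666666667,
            "bag_proxy": 0.5416666666666666,
            "cloth_proxy": 0.6666666666666666,
        },
        "mean_success": 0.5416666666666666,
    }
}


def check_mean_success(report: dict, tol: float = 1e-9) -> bool:
    """Return True if mean_success matches the unweighted per-task mean."""
    full = report["full"]
    per_task = full["per_task_success"]
    recomputed = sum(per_task.values()) / len(per_task)
    return math.isclose(recomputed, full["mean_success"], abs_tol=tol)


print(check_mean_success(PRE_FIX_SEED21))
```

The same check holds for the seed 22 and seed 23 files in this commit.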
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md
ADDED
@@ -0,0 +1,13 @@
+# Reveal Proxy Benchmark
+
+## full
+- checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed21/checkpoint_best.pt
+- mean_success: 0.542
+- visibility_integral: 34.651
+- corridor_availability: 0.893
+- reocclusion_rate: 0.000
+- persistence_horizon_mae: 2.635
+- disturbance_cost: 0.362
+- foliage_proxy_success: 0.417
+- bag_proxy_success: 0.542
+- cloth_proxy_success: 0.667
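The Markdown report mirrors the JSON file: scalar metrics rounded to three decimals, followed by per-task entries with a `_success` suffix. A minimal sketch of a renderer that reproduces this layout is given below. `render_report` is a hypothetical name, and the real generator in the repository may differ.

```python
def render_report(report: dict, checkpoint: str) -> str:
    """Render a benchmark dict in the layout of the .md files above (sketch)."""
    lines = ["# Reveal Proxy Benchmark", ""]
    for section, metrics in report.items():
        lines.append(f"## {section}")
        lines.append(f"- checkpoint: {checkpoint}")
        # Scalar metrics first, in JSON order, rounded to three decimals.
        for key, value in metrics.items():
            if key != "per_task_success":
                lines.append(f"- {key}: {value:.3f}")
        # Per-task success rates last, with a `_success` suffix.
        for task, value in metrics["per_task_success"].items():
            lines.append(f"- {task}_success: {value:.3f}")
    return "\n".join(lines)


# Seed 21 pre-fix values from the JSON file in this commit.
SEED21 = {
    "full": {
        "per_task_success": {
            "foliage_proxy": 0.4166666666666667,
            "bag_proxy": 0.5416666666666666,
            "cloth_proxy": 0.6666666666666666,
        },
        "mean_success": 0.5416666666666666,
        "visibility_integral": 34.65096331967248,
        "corridor_availability": 0.8933400412400564,
        "reocclusion_rate": 0.0,
        "persistence_horizon_mae": 2.6348470987268464,
        "disturbance_cost": 0.36164701517878306,
    }
}

print(render_report(SEED21, "checkpoint_best.pt"))
```

The checkpoint string here is a placeholder; the committed reports use the absolute `/workspace/...` path.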
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json
ADDED
@@ -0,0 +1,15 @@
+{
+  "full": {
+    "per_task_success": {
+      "foliage_proxy": 0.5,
+      "bag_proxy": 0.5416666666666666,
+      "cloth_proxy": 0.6666666666666666
+    },
+    "mean_success": 0.5694444444444443,
+    "visibility_integral": 33.861522571908104,
+    "corridor_availability": 0.8863558504316542,
+    "reocclusion_rate": 0.0,
+    "persistence_horizon_mae": 1.6200438848336538,
+    "disturbance_cost": 0.2896964028477669
+  }
+}
|
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md
ADDED
@@ -0,0 +1,13 @@
+# Reveal Proxy Benchmark
+
+## full
+- checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed22/checkpoint_best.pt
+- mean_success: 0.569
+- visibility_integral: 33.862
+- corridor_availability: 0.886
+- reocclusion_rate: 0.000
+- persistence_horizon_mae: 1.620
+- disturbance_cost: 0.290
+- foliage_proxy_success: 0.500
+- bag_proxy_success: 0.542
+- cloth_proxy_success: 0.667
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.json
ADDED
@@ -0,0 +1,15 @@
+{
+  "full": {
+    "per_task_success": {
+      "foliage_proxy": 0.4166666666666667,
+      "bag_proxy": 0.5416666666666666,
+      "cloth_proxy": 0.625
+    },
+    "mean_success": 0.5277777777777778,
+    "visibility_integral": 31.244046566387016,
+    "corridor_availability": 0.8636231190628476,
+    "reocclusion_rate": 0.00798611111111111,
+    "persistence_horizon_mae": 2.825085285899754,
+    "disturbance_cost": 0.3346485110103256
+  }
+}
|
artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/benchmark_no_world_model/reveal_benchmark_pre_null_rollout_fix.md
ADDED
@@ -0,0 +1,13 @@
+# Reveal Proxy Benchmark
+
+## full
+- checkpoint: /workspace/VLAarchtests/artifacts/outputs/r3d/proxy_interaction_r3d_stage2_dummy_seed23/checkpoint_best.pt
+- mean_success: 0.528
+- visibility_integral: 31.244
+- corridor_availability: 0.864
+- reocclusion_rate: 0.008
+- persistence_horizon_mae: 2.825
+- disturbance_cost: 0.335
+- foliage_proxy_success: 0.417
+- bag_proxy_success: 0.542
+- cloth_proxy_success: 0.625
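The three pre-fix `mean_success` values can be combined across seeds as a quick sanity check. The sketch below assumes an unweighted mean over seeds 21 to 23; the commit does not state how the phase-level figure in `results/phase_tracking.md` was aggregated.

```python
from statistics import mean

# mean_success from the three pre-fix JSON backups in this commit.
PRE_FIX_MEAN_SUCCESS = {
    21: 0.5416666666666666,
    22: 0.5694444444444443,
    23: 0.5277777777777778,
}

# Unweighted cross-seed mean (an assumed aggregation, see lead-in).
cross_seed = mean(list(PRE_FIX_MEAN_SUCCESS.values()))
print(f"{cross_seed:.4f}")  # 0.5463
```

Under that assumption the pre-fix backups average to `0.5463`, which coincides with the value `results/phase_tracking.md` quotes for the post-fix rerun.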
results/phase_tracking.md
CHANGED
@@ -83,10 +83,10 @@ Date closed: `2026-03-25 UTC`
 - `short_history`: `0.5463` mean success, delta `0.0000`
 - Gate decisions:
 - hard success gate `>= 0.60`: fail
-- `no_world_model` must hurt:
+- `no_world_model` must hurt: fail; the `2026-03-25` post-fix null-rollout rerun remained at `0.5463`, drop `0.0000`
 - full memory must stop losing to short history: hard gate passes narrowly because full equals short-history; preferred gate fails because full does not beat short-history
 - state metrics should improve over phase 1: fail, reocclusion rate increased (`0.0000 -> 0.0121`), persistence MAE worsened (`1.9553 -> 2.2358`), and calibration worsened
-- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the
+- Takeaway: the expanded state/memory path did not validate on the dummy proxy benchmark. Planner classification improved, but the post-fix null-rollout rerun still left mean success unchanged.
 
 ## Phase 3
 
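The success-related gate decisions listed in this hunk can be written out as simple comparisons. The sketch below is an illustration only: the function `evaluate_gates`, the `eps` tolerance, and the reading of "must hurt" as a strictly positive drop are assumptions, and the state-metric gate is left out.

```python
def evaluate_gates(full: float, no_world_model: float, short_history: float,
                   hard_threshold: float = 0.60, eps: float = 1e-4) -> dict:
    """Evaluate the success-related gates from phase_tracking.md (sketch)."""
    drop = full - no_world_model
    return {
        # hard success gate `>= 0.60`
        "hard_success": full >= hard_threshold,
        # `no_world_model` must hurt: ablation should lower mean success
        "no_world_model_hurts": drop > eps,
        # hard memory gate: full must not lose to short history
        "memory_hard": full >= short_history - eps,
        # preferred memory gate: full must beat short history
        "memory_preferred": full > short_history + eps,
    }


# Mean-success values quoted in the hunk above; full equals short_history.
gates = evaluate_gates(full=0.5463, no_world_model=0.5463, short_history=0.5463)
for name, passed in gates.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

With the quoted values this reproduces the recorded outcomes: the hard success gate fails, `no_world_model` does not hurt, the hard memory gate passes on a tie, and the preferred memory gate fails.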