# Regression Baselines
Snapshot source: `/workspace/VLAarchtests/README.md` plus the committed artifact JSONs under `/workspace/VLAarchtests/artifacts/outputs`.
## Proxy benchmarks
- Dummy action-history benchmark:
  - interaction: `0.5278` from `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_interaction_actionhist_commit4/reveal_benchmark.json`
  - backbone: `0.5556` from `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_old_no_leak_baselines_commit4/reveal_benchmark.json`
  - reveal: `0.5417` from `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_old_no_leak_baselines_commit4/reveal_benchmark.json`
- CLIP action-history benchmark:
  - interaction_clip: `0.3056` from `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_interaction_clip_commit4_compare/reveal_benchmark.json`
  - backbone_clip: `0.3333` from `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_clip_baselines_commit4/reveal_benchmark.json`
  - reveal_clip: `0.2083` from `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_clip_baselines_commit4/reveal_benchmark.json`
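
A minimal sketch of how the proxy numbers above could be used as a regression gate. The baseline values are copied from the list; the tolerance, the function name `find_regressions`, and the assumption that a higher score is better are illustrative choices that this file does not state.

```python
# Baseline values copied from the "Proxy benchmarks" list above.
PROXY_BASELINES = {
    "interaction": 0.5278,
    "backbone": 0.5556,
    "reveal": 0.5417,
    "interaction_clip": 0.3056,
    "backbone_clip": 0.3333,
    "reveal_clip": 0.2083,
}

# Hypothetical tolerance: the baselines are rounded to four decimals,
# so allow half a unit in the last place. The source does not specify one.
TOLERANCE = 5e-4


def find_regressions(current, baselines=PROXY_BASELINES, tol=TOLERANCE):
    """Return {name: (baseline, current)} for metrics that are missing
    or fell more than `tol` below their baseline.

    Assumes higher scores are better (an assumption, not stated in the source).
    """
    regressions = {}
    for name, base in baselines.items():
        value = current.get(name)
        if value is None or value < base - tol:
            regressions[name] = (base, value)
    return regressions
```

Calling `find_regressions(new_scores)` with a dict of freshly measured scores returns an empty dict when nothing dropped below its baseline.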
## Action-history ablations
- full_model: `0.5278`
- no_interaction_head: `0.3889`
- no_world_model: `0.5278`
- no_planner: `0.5278`
- no_role_tokens: `0.5278`
- short_history: `0.5417`

JSON path: `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/ablation_none_actionhist/ablations.json`
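
The ablation scores are easier to read as differences from `full_model`. The sketch below hard-codes the values listed above rather than parsing `ablations.json`, since the layout of that file is not shown here; the helper name `ablation_deltas` is hypothetical.

```python
# Scores copied from the "Action-history ablations" list above.
ABLATIONS = {
    "full_model": 0.5278,
    "no_interaction_head": 0.3889,
    "no_world_model": 0.5278,
    "no_planner": 0.5278,
    "no_role_tokens": 0.5278,
    "short_history": 0.5417,
}


def ablation_deltas(scores, reference="full_model"):
    """Return each ablation's score minus the reference score, rounded to 4 decimals."""
    ref = scores[reference]
    return {
        name: round(value - ref, 4)
        for name, value in scores.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, delta in ablation_deltas(ABLATIONS).items():
        print(f"{name}: {delta:+.4f}")
```

On these numbers, `no_interaction_head` is the only ablation that lowers the score (by 0.1389); `no_world_model`, `no_planner`, and `no_role_tokens` match the full model, and `short_history` is 0.0139 higher.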
## Diagnostics
- planner_top1_accuracy: `0.1985`
- planner_regret: `0.2120`

JSON path: `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/proxy_interaction_state_actionhist/diagnostics/proxy_diagnostics.json`
## Integration baselines
- RLBench open-drawer rollout:
  - mean_success: `0.0`
  - error: `"A path could not be found because the target is outside of workspace."`
  - JSON: `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/rlbench_open_drawer_rollout_eval_commit4_rerun/rollout_eval.json`
- PerAct2 13-task sweep:
  - no-plan mean_success: `0.0`
  - planner mean_success: `0.0`
  - JSON roots:
    - `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/peract2_13_rollout_noplan_split/rollout_eval.json`
    - `/workspace/VLAarchtests/artifacts/outputs/interaction_debug/peract2_13_rollout_plan_split/rollout_eval.json`
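
To re-check an integration baseline against its artifact, a small reader like the one below may be enough. It is a sketch that assumes `mean_success` is stored as a top-level key in `rollout_eval.json`; the actual JSON layout is not shown in this file, and `read_metric` is a hypothetical helper.

```python
import json
from pathlib import Path


def read_metric(json_path, key="mean_success"):
    """Return the top-level value stored under `key`, or None if it is absent.

    Assumes the artifact is a JSON object with the metric at the top level,
    which is an assumption about the file layout.
    """
    with Path(json_path).open(encoding="utf-8") as handle:
        payload = json.load(handle)
    return payload.get(key)
```

For example, `read_metric(path) == 0.0` would confirm that a rollout artifact still matches the baseline recorded above, provided the key assumption holds.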