
Regression Baselines

Snapshot source: /workspace/VLAarchtests/README.md plus the committed artifact JSONs under /workspace/VLAarchtests/artifacts/outputs.

Proxy benchmarks

  • Dummy action-history benchmark:
    • interaction: 0.5278 from /workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_interaction_actionhist_commit4/reveal_benchmark.json
    • backbone: 0.5556 from /workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_old_no_leak_baselines_commit4/reveal_benchmark.json
    • reveal: 0.5417 from /workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_old_no_leak_baselines_commit4/reveal_benchmark.json
  • CLIP action-history benchmark:
    • interaction_clip: 0.3056 from /workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_interaction_clip_commit4_compare/reveal_benchmark.json
    • backbone_clip: 0.3333 from /workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_clip_baselines_commit4/reveal_benchmark.json
    • reveal_clip: 0.2083 from /workspace/VLAarchtests/artifacts/outputs/interaction_debug/reveal_eval_clip_baselines_commit4/reveal_benchmark.json
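
The scores above can serve as a regression gate. The following is a minimal sketch, not part of the repository: `find_regressions` and the tolerance `tol` are hypothetical names, and it assumes that a higher score is better and that the caller has already extracted the current scores into a flat dict.

```python
# Baseline scores copied from the proxy benchmark list above.
BASELINES = {
    "interaction": 0.5278,
    "backbone": 0.5556,
    "reveal": 0.5417,
    "interaction_clip": 0.3056,
    "backbone_clip": 0.3333,
    "reveal_clip": 0.2083,
}


def find_regressions(current, baselines=BASELINES, tol=1e-4):
    """Return {name: (baseline, current)} for each score below its baseline.

    Assumption: higher is better. A score counts as regressed when it is
    missing from `current` or falls more than `tol` below the baseline.
    """
    regressions = {}
    for name, base in baselines.items():
        value = current.get(name)
        if value is None or value < base - tol:
            regressions[name] = (base, value)
    return regressions
```

An empty result means no tracked score dropped below its baseline; any entry names the model variant together with its baseline and current value.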

Action-history ablations

  • full_model: 0.5278
  • no_interaction_head: 0.3889
  • no_world_model: 0.5278
  • no_planner: 0.5278
  • no_role_tokens: 0.5278
  • short_history: 0.5417

JSON path: /workspace/VLAarchtests/artifacts/outputs/interaction_debug/ablation_none_actionhist/ablations.json
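
To read the ablation table as differences from `full_model`, the arithmetic can be sketched as below. The helper name `deltas_vs_full` is hypothetical; the numbers are the ones listed above.

```python
# Ablation scores copied from the list above.
ABLATIONS = {
    "full_model": 0.5278,
    "no_interaction_head": 0.3889,
    "no_world_model": 0.5278,
    "no_planner": 0.5278,
    "no_role_tokens": 0.5278,
    "short_history": 0.5417,
}


def deltas_vs_full(scores, reference="full_model"):
    """Return each ablation's score minus the reference score, rounded to 4 places."""
    ref = scores[reference]
    return {
        name: round(value - ref, 4)
        for name, value in scores.items()
        if name != reference
    }


deltas = deltas_vs_full(ABLATIONS)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```

With these baselines, only `no_interaction_head` lowers the score (by 0.1389). `no_world_model`, `no_planner`, and `no_role_tokens` match `full_model` exactly, and `short_history` is 0.0139 higher.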

Diagnostics

  • planner_top1_accuracy: 0.1985
  • planner_regret: 0.2120

JSON path: /workspace/VLAarchtests/artifacts/outputs/interaction_debug/proxy_interaction_state_actionhist/diagnostics/proxy_diagnostics.json

Integration baselines

  • RLBench open-drawer rollout:
    • mean_success: 0.0
    • error: "A path could not be found because the target is outside of workspace."
    • JSON: /workspace/VLAarchtests/artifacts/outputs/interaction_debug/rlbench_open_drawer_rollout_eval_commit4_rerun/rollout_eval.json
  • PerAct2 13-task sweep:
    • no-plan mean_success: 0.0
    • planner mean_success: 0.0
    • JSON files:
      • /workspace/VLAarchtests/artifacts/outputs/interaction_debug/peract2_13_rollout_noplan_split/rollout_eval.json
      • /workspace/VLAarchtests/artifacts/outputs/interaction_debug/peract2_13_rollout_plan_split/rollout_eval.json
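
When re-checking these numbers against the artifact files, a schema-agnostic lookup avoids depending on the exact JSON layout, which this document does not describe. The sketch below is an assumption-laden helper, not repository code: `find_key` and `load_metric` are hypothetical names, and the only thing assumed about `rollout_eval.json` is that `mean_success` appears as a key somewhere in it.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending so nested occurrences are found as well.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def load_metric(path, key="mean_success"):
    """Load a JSON file and return all values found under `key` (assumed key name)."""
    with open(path, encoding="utf-8") as fh:
        return list(find_key(json.load(fh), key))
```

Calling `load_metric` on either rollout file returns a list of every `mean_success` value it contains, which can then be compared against the baselines recorded here.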