X-VLA fine-tuned on RoboTwin stack_bowls_two

X-VLA policy (879M parameters) fine-tuned on the dual-arm bowl-stacking task in the RoboTwin 2.0 simulator.

Training data

  • Source: RoboTwin stack_bowls_two task, demo_clean config.
  • Episodes: 300.
  • Frames: ~94k @ effective 16.67 Hz.
  • Images: native 240x320 (no offline resize; aspect-preserving letterbox via model's resize_with_pad=[224,224]).
  • State / Action: 16-D dual-arm EEF, auto-padded to 20-D by X-VLA action_mode=auto.
  • Language instruction: fixed "stack the bowls" for all episodes.
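As a rough sanity check on the figures above, the sketch below derives the average episode length from the frame count and rate, and illustrates one way a 16-D vector could be padded to 20-D. The trailing-zero padding is an assumption made for illustration only: the list does not say how action_mode=auto fills the extra dimensions, and pad_to_dim is a hypothetical helper, not an X-VLA function.

```python
# Figures taken from the list above; the frame count is approximate (~94k).
NUM_EPISODES = 300
NUM_FRAMES = 94_000
FPS = 16.67

frames_per_episode = NUM_FRAMES / NUM_EPISODES   # roughly 313 frames
seconds_per_episode = frames_per_episode / FPS   # roughly 18.8 s


def pad_to_dim(vec, target_dim=20, fill=0.0):
    """Hypothetical helper: right-pad a vector up to target_dim.

    ASSUMPTION: trailing zeros. The real X-VLA padding may place or
    fill the extra dimensions differently.
    """
    if len(vec) > target_dim:
        raise ValueError("vector is longer than target_dim")
    return list(vec) + [fill] * (target_dim - len(vec))


state_16d = [0.1] * 16              # placeholder 16-D dual-arm EEF state
state_20d = pad_to_dim(state_16d)   # 20-D vector under the assumption above

print(round(frames_per_episode), round(seconds_per_episode, 1), len(state_20d))
```

Under these numbers an average demonstration is a little under 19 seconds long, which fits inside the 400-step evaluation budget at the same rate.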

Training config

  • Batch size 16, 20000 steps, bf16, cosine schedule (warmup 1000 steps, decay over 20000 steps).
  • Base: lerobot/xvla-base (full fine-tune, VLM + transformer + soft prompts all unfrozen).
  • chunk_size=32, n_action_steps=32, num_denoising_steps=10.
  • rename_map: dual_cam_global -> image, cam_wrist_65 -> image2, cam_wrist_75 -> image3.
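To make the schedule and chunking numbers concrete, here is a minimal sketch. It assumes linear warmup followed by cosine decay to zero, which is a common shape but is not confirmed by the config above (the warmup shape and the learning-rate floor may differ). It also assumes the 16.67 Hz data rate applies at execution time and that the policy is re-queried once every n_action_steps environment steps. lr_multiplier is a hypothetical helper, not part of the training code.

```python
import math

# Values from the training config above.
WARMUP_STEPS = 1_000
DECAY_STEPS = 20_000
CHUNK_SIZE = 32
N_ACTION_STEPS = 32

# Values from elsewhere in this card.
FPS = 16.67
MAX_EVAL_STEPS = 400


def lr_multiplier(step):
    """Hypothetical schedule: linear warmup, then cosine decay to zero.

    ASSUMPTION: the exact warmup shape and final floor are not stated
    in the config; this only illustrates the general curve.
    """
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = min((step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


# One predicted chunk covers about 1.92 s of motion at 16.67 Hz.
chunk_seconds = CHUNK_SIZE / FPS

# If every chunk is executed in full before re-planning, a 400-step
# episode needs at most ceil(400 / 32) = 13 policy calls.
max_policy_calls = math.ceil(MAX_EVAL_STEPS / N_ACTION_STEPS)

print(lr_multiplier(500), lr_multiplier(1_000), round(chunk_seconds, 2), max_policy_calls)
```

Because n_action_steps equals chunk_size, each predicted chunk is executed open-loop in full under this reading, so the policy only sees fresh observations about every two seconds.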

Evaluation (RoboTwin sim, max_steps=400, 10 episodes)

Success rate: 4/10 (40%) with task_text="stack the bowls" and --skip_resize.

Episode  Result   Steps
0        FAIL     400 (timeout)
1        FAIL     400
2        SUCCESS  320
3        FAIL     400
4        FAIL     400
5        FAIL     400
6        SUCCESS  398
7        SUCCESS  259
8        FAIL     400
9        SUCCESS  340
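The per-episode results above can be summarized with a few lines of Python. The sketch below recomputes the success rate and the mean step count of successful episodes, and adds a 95% Wilson score interval to show how much uncertainty a 10-episode evaluation carries. The interval is an illustrative addition, not something reported by the original evaluation, and wilson_interval is a helper written for this example.

```python
import math

# (episode, success, steps) copied from the table above.
RESULTS = [
    (0, False, 400), (1, False, 400), (2, True, 320), (3, False, 400),
    (4, False, 400), (5, False, 400), (6, True, 398), (7, True, 259),
    (8, False, 400), (9, True, 340),
]


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


num_success = sum(1 for _, ok, _ in RESULTS if ok)
success_rate = num_success / len(RESULTS)

success_steps = [steps for _, ok, steps in RESULTS if ok]
mean_success_steps = sum(success_steps) / len(success_steps)

ci_low, ci_high = wilson_interval(num_success, len(RESULTS))

print(f"success rate: {success_rate:.0%}")
print(f"mean steps (successes only): {mean_success_steps}")
print(f"95% Wilson interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

With only 10 episodes the interval is wide, so the 40% figure is best read as a coarse estimate rather than a precise success rate.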

Usage

from lerobot.policies.xvla.modeling_xvla import XVLAPolicy
policy = XVLAPolicy.from_pretrained("arrow-hf/xvla-robotwin-stack-bowls-two-40pct")

At inference, feed native-resolution images (e.g., 240x320 from RoboTwin D435); the model's internal resize_with_pad letterboxes them to the target shape.
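For intuition about what the letterbox does to a 240x320 frame, the following sketch computes the resized content size and the total padding for a 224x224 target. It assumes 240x320 means height x width and that the resize preserves aspect ratio by scaling to the limiting side. It does not reproduce the model's actual resize_with_pad implementation, and it does not say where the padding is placed (centered or on one side), which this card does not specify. letterbox_geometry is a hypothetical helper.

```python
def letterbox_geometry(h, w, target_h=224, target_w=224):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize.

    pad_h and pad_w are TOTAL padding amounts in pixels; how they are
    split between the two sides is implementation-specific.
    """
    # Compare aspect ratios with integer arithmetic to avoid float ties.
    if w * target_h >= h * target_w:
        # Width is the limiting side: fill the target width.
        new_w = target_w
        new_h = round(h * target_w / w)
    else:
        # Height is the limiting side: fill the target height.
        new_h = target_h
        new_w = round(w * target_h / h)
    return new_h, new_w, target_h - new_h, target_w - new_w


new_h, new_w, pad_h, pad_w = letterbox_geometry(240, 320)
print(new_h, new_w, pad_h, pad_w)
```

For the RoboTwin frames this gives a 168x224 image area with 56 rows of padding in total, so a quarter of the 224x224 input carries no scene content.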
