X-VLA fine-tuned on RoboTwin stack_bowls_two

X-VLA policy (879M parameters) fine-tuned on the dual-arm bowl-stacking task in the RoboTwin 2.0 simulator.

Training data

Source: RoboTwin stack_bowls_two task, demo_clean config.
Episodes: 300.
Frames: ~94k @ effective 16.67 Hz.
Images: native 240x320 (no offline resize; aspect-preserving letterbox via model's resize_with_pad=[224,224]).
State / Action: 16-D dual-arm EEF, auto-padded to 20-D by X-VLA action_mode=auto.
Language instruction: fixed "stack the bowls" for all episodes.
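
The figures above imply a rough per-episode size, which the following sketch works out. It uses only the numbers listed in this section; the frame count is approximate (~94k), so the results are estimates rather than exact dataset statistics.

```python
# Back-of-the-envelope dataset statistics from the figures listed above.
NUM_EPISODES = 300
NUM_FRAMES = 94_000      # approximate (~94k)
FRAME_RATE_HZ = 16.67    # effective frame rate

# Average episode length in frames and in seconds.
frames_per_episode = NUM_FRAMES / NUM_EPISODES
seconds_per_episode = frames_per_episode / FRAME_RATE_HZ

# Time between consecutive frames, in milliseconds.
frame_period_ms = 1000.0 / FRAME_RATE_HZ

print(f"frames per episode:  {frames_per_episode:.0f}")
print(f"seconds per episode: {seconds_per_episode:.1f}")
print(f"frame period (ms):   {frame_period_ms:.1f}")
```

Under these numbers an average demonstration is roughly 313 frames, or a little under 19 seconds.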

Training config

Batch size 16, 20000 steps, bf16, cosine schedule with 1000 warmup steps and 20000 decay steps.
Base: lerobot/xvla-base (full fine-tune, VLM + transformer + soft prompts all unfrozen).
chunk_size=32, n_action_steps=32, num_denoising_steps=10.
rename_map: dual_cam_global -> image, cam_wrist_65 -> image2, cam_wrist_75 -> image3.
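
As a rough sanity check on these settings, the sketch below estimates how many passes over the data the run makes and how much time one action chunk covers. It assumes, without confirmation from the training code, that each training sample corresponds to one dataset frame, that one action is executed per simulator step, and that the policy is queried again only after all n_action_steps actions have been executed.

```python
import math

# Values taken from this model card.
BATCH_SIZE = 16
TRAIN_STEPS = 20_000
NUM_FRAMES = 94_000      # approximate (~94k)
FRAME_RATE_HZ = 16.67
N_ACTION_STEPS = 32
MAX_EVAL_STEPS = 400

# Assumption: one training sample per dataset frame.
samples_seen = BATCH_SIZE * TRAIN_STEPS
approx_epochs = samples_seen / NUM_FRAMES

# Assumption: actions are spaced at the dataset frame rate.
chunk_seconds = N_ACTION_STEPS / FRAME_RATE_HZ

# Assumption: the policy re-plans only after a full chunk is executed.
queries_per_episode = math.ceil(MAX_EVAL_STEPS / N_ACTION_STEPS)

print(f"samples seen:        {samples_seen}")
print(f"approx. epochs:      {approx_epochs:.1f}")
print(f"seconds per chunk:   {chunk_seconds:.2f}")
print(f"queries per episode: {queries_per_episode}")
```

Under these assumptions the run covers roughly 3.4 epochs, each chunk spans just under 2 seconds of motion, and a 400-step evaluation episode needs at most 13 policy queries.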

Evaluation (RoboTwin sim, max_steps=400, 10 episodes)

Success rate: 4/10 (40%) with task_text="stack the bowls" and --skip_resize.

Episode	Result	Steps
0	FAIL	400 (timeout)
1	FAIL	400 (timeout)
2	SUCCESS	320
3	FAIL	400 (timeout)
4	FAIL	400 (timeout)
5	FAIL	400 (timeout)
6	SUCCESS	398
7	SUCCESS	259
8	FAIL	400 (timeout)
9	SUCCESS	340
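
With only 10 episodes, the 40% figure carries wide uncertainty. The sketch below recomputes the success rate from the table and adds a 95% Wilson score interval. The interval is an illustration added here, not part of the original evaluation, and `wilson_interval` is a hypothetical helper name.

```python
import math

# (episode, success, steps) rows copied from the table above.
results = [
    (0, False, 400), (1, False, 400), (2, True, 320), (3, False, 400),
    (4, False, 400), (5, False, 400), (6, True, 398), (7, True, 259),
    (8, False, 400), (9, True, 340),
]

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half

num_success = sum(1 for _, ok, _ in results if ok)
num_trials = len(results)
success_rate = num_success / num_trials
low, high = wilson_interval(num_success, num_trials)

# Average episode length over successful episodes only.
success_steps = [steps for _, ok, steps in results if ok]
mean_success_steps = sum(success_steps) / len(success_steps)

print(f"success rate: {num_success}/{num_trials} = {success_rate:.0%}")
print(f"95% Wilson interval: [{low:.2f}, {high:.2f}]")
print(f"mean steps on success: {mean_success_steps:.2f}")
```

The interval spans roughly 17% to 69%, so a larger evaluation would be needed to pin down the true success rate. One of the four successes (episode 6, 398 steps) finished only two steps before the 400-step limit.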

Usage

from lerobot.policies.xvla.modeling_xvla import XVLAPolicy
policy = XVLAPolicy.from_pretrained("arrow-hf/xvla-robotwin-stack-bowls-two-40pct")

At inference, feed native-resolution images (e.g., 240x320 from the RoboTwin D435 camera). The model's internal resize_with_pad letterboxes them to the target shape.
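
To make the letterboxing concrete, the sketch below computes the geometry of an aspect-preserving resize from 240x320 into a 224x224 target. `letterbox_geometry` is a hypothetical helper written for illustration; it is not the model's resize_with_pad implementation. How the padding is split between the sides (and the exact rounding) depends on that implementation, so only the total padding is reported.

```python
def letterbox_geometry(height, width, target_height, target_width):
    """Return (new_height, new_width, pad_height, pad_width) for an
    aspect-preserving resize that fits inside the target shape.

    pad_height and pad_width are TOTAL padding amounts; their split
    between top/bottom and left/right is implementation-specific.
    """
    # Cross-multiply to compare aspect ratios without float error.
    if width * target_height >= height * target_width:
        # The image is relatively wider than the target: width is limiting.
        new_width = target_width
        new_height = (height * target_width + width // 2) // width
    else:
        # The image is relatively taller than the target: height is limiting.
        new_height = target_height
        new_width = (width * target_height + height // 2) // height
    return (new_height, new_width,
            target_height - new_height, target_width - new_width)

# Native RoboTwin frame (height 240, width 320) into the 224x224 target.
new_h, new_w, pad_h, pad_w = letterbox_geometry(240, 320, 224, 224)
print(new_h, new_w, pad_h, pad_w)
```

For a 240x320 frame the scale factor is 0.7, so the image content occupies 168x224 pixels and the remaining 56 rows are padding.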
