---
language:
- en
library_name: lerobot
pipeline_tag: robotics
tags:
- vision-language-action
- imitation-learning
- lerobot
inference: false
license: apache-2.0
---
# X-VLA (LeRobot)
X-VLA is a Vision-Language-Action foundation model that uses soft prompts to handle cross-embodiment and cross-domain robot control within a unified Transformer architecture.
This checkpoint is a fine-tuned dexterous manipulation model trained on the high-quality Soft-FOLD cloth-folding dataset. It achieves a 100% success rate over 2 hours of continuous cloth folding.
**Original paper:** [X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model](https://arxiv.org/abs/2510.10274)
**Reference implementation:** https://github.com/2toinf/X-VLA
**LeRobot implementation:** Follows the original reference code for compatibility.
## Model description
- **Inputs:** images (multi-view), proprio/state, optional language instruction
- **Outputs:** continuous actions
- **Training objective:** flow matching
- **Action representation:** continuous
- **Intended use:** Base model to fine-tune on your specific use case
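To make the flow-matching objective listed above concrete, here is a minimal, dependency-free sketch of one common (rectified-flow style) formulation: a straight path from Gaussian noise to the target action, a velocity-regression loss, and Euler integration at inference time. This is an illustration only. The function names are hypothetical, and X-VLA's actual parameterization, time convention, and sampler may differ from what is shown here.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it is constant in t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, action, rng):
    """MSE between predicted and target velocity at a random time t."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = interpolate(noise, action, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)


def sample(predict_velocity, noise, num_steps=10):
    """Euler-integrate the velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        v = predict_velocity(x, step * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real policy, `predict_velocity` would be the network conditioned on images, state, and the language instruction; here it is a plain callable so the arithmetic can be checked in isolation.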
## Quick start (inference on a real batch)
### Installation
```bash
pip install "lerobot[xvla]"
```
For full installation details (including optional video dependencies such as ffmpeg for torchcodec), see the official documentation: https://huggingface.co/docs/lerobot/installation
### Load model + dataset, run `select_action`
```python
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.factory import make_pre_post_processors
# Swap this import per-policy
from lerobot.policies.xvla.modeling_xvla import XVLAPolicy
# load a policy
model_id = "lerobot/xvla-folding" # <- swap checkpoint
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = XVLAPolicy.from_pretrained(model_id).to(device).eval()
preprocess, postprocess = make_pre_post_processors(
    policy.config,
    model_id,
    preprocessor_overrides={"device_processor": {"device": str(device)}},
)
# load a LeRobotDataset (we will replace this with a simpler dataset)
dataset = LeRobotDataset("lerobot/libero")
# pick an episode
episode_index = 0
# each episode corresponds to a contiguous range of frame indices
from_idx = dataset.meta.episodes["dataset_from_index"][episode_index]
to_idx = dataset.meta.episodes["dataset_to_index"][episode_index]
# get a single frame from that episode (e.g. the first frame)
frame_index = from_idx
frame = dict(dataset[frame_index])
batch = preprocess(frame)
with torch.inference_mode():
    pred_action = policy.select_action(batch)
# Postprocess the predicted action with the policy's postprocessor,
# for instance to unnormalize or detokenize it.
pred_action = postprocess(pred_action)
```
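The snippet above computes `to_idx` but only uses the first frame. As a small sketch of how the two bounds describe a whole episode, the helper below turns them into a `range`. It assumes the `dataset_to_index` bound is exclusive, which is worth verifying against your LeRobot version. The helper name and the toy metadata dict are hypothetical and are not part of the LeRobot API.

```python
def episode_frame_indices(episodes, episode_index):
    """Return the frame indices of one episode as a range.

    `episodes` is any mapping with "dataset_from_index" and
    "dataset_to_index" columns, indexed by episode.
    Assumption: the "to" bound is exclusive.
    """
    start = episodes["dataset_from_index"][episode_index]
    stop = episodes["dataset_to_index"][episode_index]
    if stop < start:
        raise ValueError(f"episode {episode_index}: to index {stop} < from index {start}")
    return range(start, stop)


# Toy metadata standing in for dataset.meta.episodes (hypothetical values).
toy_episodes = {
    "dataset_from_index": [0, 120, 305],
    "dataset_to_index": [120, 305, 410],
}
frames = episode_frame_indices(toy_episodes, 1)
```

With the real dataset, the same helper could be called on `dataset.meta.episodes` to iterate over every frame of an episode instead of a single one.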
## Training step (loss + backward)
If you’re training or fine-tuning, you typically call `forward(...)` to get a loss and then backpropagate it:
```python
policy.train()
batch = dict(dataset[0])
batch = preprocess(batch)
loss, outputs = policy.forward(batch)
loss.backward()
```
> Notes:
>
> - Some policies expose `policy(**batch)` or return a dict; keep this snippet aligned with the policy API.
> - Use your trainer script (`lerobot-train`) for full training loops.
## How to train / fine-tune
```bash
lerobot-train \
  --dataset.repo_id=${HF_USER}/<dataset> \
  --output_dir=./outputs/[RUN_NAME] \
  --job_name=[RUN_NAME] \
  --policy.repo_id=${HF_USER}/<desired_policy_repo_id> \
  --policy.path=lerobot/[BASE_CHECKPOINT] \
  --policy.dtype=bfloat16 \
  --policy.device=cuda \
  --steps=100000 \
  --batch_size=4
```
Policy-specific flags you can append to the command above:
- `--policy.chunk_size=...`
- `--policy.n_action_steps=...`
- `--policy.max_action_tokens=...`
- `--policy.gradient_checkpointing=true`
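As a rough illustration of how `chunk_size` and `n_action_steps` usually relate in chunked-action policies, the sketch below predicts a chunk of `chunk_size` actions, executes only the first `n_action_steps` of them, and re-predicts once the queue is empty. This is a generic pattern under stated assumptions, not the X-VLA implementation; the class and its `predict_chunk` callable are hypothetical.

```python
from collections import deque


class ChunkedActionQueue:
    """Serve actions one at a time from periodically predicted chunks."""

    def __init__(self, predict_chunk, chunk_size, n_action_steps):
        if not 1 <= n_action_steps <= chunk_size:
            raise ValueError("n_action_steps must be in [1, chunk_size]")
        self.predict_chunk = predict_chunk  # (observation, chunk_size) -> list of actions
        self.chunk_size = chunk_size
        self.n_action_steps = n_action_steps
        self.queue = deque()
        self.num_predictions = 0

    def select_action(self, observation):
        # Only run the (expensive) model when no queued actions remain.
        if not self.queue:
            chunk = self.predict_chunk(observation, self.chunk_size)
            self.num_predictions += 1
            # Keep the first n_action_steps actions; drop the rest of the chunk.
            self.queue.extend(chunk[: self.n_action_steps])
        return self.queue.popleft()
```

A smaller `n_action_steps` re-plans more often at the cost of more model calls; setting it equal to `chunk_size` executes every predicted action before re-planning.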
## Real-World Inference & Evaluation
You can use the [**`lerobot-record`**](https://github.com/huggingface/lerobot/blob/main/src/lerobot/scripts/lerobot_record.py) script with a policy checkpoint as input to run inference and evaluate your policy.
For instance, run this command to run inference and record evaluation episodes:
```bash
lerobot-record \
  --robot.type=so100_follower \
  --robot.port=/dev/ttyACM1 \
  --robot.cameras="{ up: {type: opencv, index_or_path: /dev/video10, width: 640, height: 480, fps: 30}, side: {type: intelrealsense, serial_number_or_name: 233522074606, width: 640, height: 480, fps: 30}}" \
  --robot.id=my_awesome_follower_arm \
  --display_data=false \
  --dataset.repo_id=${HF_USER}/eval_so100 \
  --dataset.single_task="Put lego brick into the transparent box" \
  --policy.path=${HF_USER}/my_policy
# Teleop is optional. To teleoperate in between episodes, append these flags
# to the command above (a comment line cannot sit inside a backslash-continued command):
#   --teleop.type=so100_leader \
#   --teleop.port=/dev/ttyACM0 \
#   --teleop.id=my_awesome_leader_arm
```