---
license: apache-2.0
base_model:
- microsoft/Florence-2-large
tags:
- robotics
- vla
pipeline_tag: robotics
---

# X-VLA 0.9B (Google-Robot Edition)

**Repository:** [2toINF/X-VLA](https://github.com/2toinf/X-VLA)

**Authors:** [2toINF](https://github.com/2toINF) | **License:** Apache 2.0

**Paper:** *Zheng et al., 2025, “X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”* ([arXiv:2510.10274](https://arxiv.org/pdf/2510.10274))

## 🚀 Overview

Successful generalist **Vision-Language-Action (VLA)** models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets.
To facilitate and leverage this heterogeneity in rich robotic data sources, **X-VLA** introduces a **Soft Prompt** approach with a minimal number of added parameters: we infuse prompt-learning concepts into cross-embodiment robot learning, introducing a **separate set of learnable embeddings** for each distinct embodiment.

These embodiment-specific prompts enable the VLA model to exploit cross-embodiment features effectively.
Our architecture, a clean, flow-matching-based VLA design relying exclusively on soft-prompted standard Transformers, combines strong scalability with simplicity.

Trained on **Bridge Data** and evaluated across **six simulation benchmarks** and **three real-world robots**, the 0.9B-parameter X-VLA achieves **state-of-the-art performance** on these diverse benchmarks simultaneously, demonstrating flexible dexterity and fast adaptation to new embodiments, environments, and tasks.

🌐 **Project Website:** [https://thu-air-dream.github.io/X-VLA/](https://thu-air-dream.github.io/X-VLA/)

<video controls autoplay loop muted playsinline width="720">
  <source src="https://huggingface.co/2toINF/X-VLA-0.9B-WidowX/resolve/main/demo.mp4" type="video/mp4">
</video>

## ⚙️ Usage

### 🔹 Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True
)
```

### 🔹 Start FastAPI server

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
model.run(processor, host="0.0.0.0", port=8000)
```

### 🔹 Client-server evaluation

You can run the provided evaluation client from our GitHub:
👉 [2toINF/X-VLA – Client & Server Code](https://github.com/2toINF/X-VLA)
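
For a quick smoke test without the full evaluation stack, a minimal HTTP client along these lines can be used. Note that the endpoint path (`/act`) and the payload keys below are illustrative assumptions, not the documented interface; refer to the client/server code in the GitHub repository for the actual request format.

```python
# Illustrative client sketch only. The endpoint path and payload schema are
# assumptions for illustration; the real interface is defined by the
# client/server code in the X-VLA GitHub repository.
import base64

import requests


def query_server(image_path: str, instruction: str, url: str = "http://localhost:8000/act"):
    # Encode the camera frame as base64 so it can travel inside a JSON body (assumed format).
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {"image": image_b64, "instruction": instruction}  # hypothetical keys
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()  # expected to contain the predicted action chunk


# Hypothetical call:
# actions = query_server("frame.png", "put the carrot on the plate")
```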

## 🧩 Architecture

| Component                   | Role                                                                        |
| :-------------------------- | :-------------------------------------------------------------------------- |
| **Florence-2 Encoder**      | Vision-language representation backbone (encoder only).                     |
| **SoftPromptedTransformer** | Flow-matching action denoiser using learnable soft prompts per embodiment.  |
| **Action Hub**              | Defines action spaces, masking rules, pre/post-processing, and losses.      |
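
For intuition, here is a minimal, hypothetical PyTorch sketch of the two ideas the table above combines: per-embodiment learnable soft prompts prepended to the token sequence of a standard Transformer, and a flow-matching velocity prediction over an action chunk. All module names, dimensions, and the wiring are assumptions for illustration, not the actual X-VLA implementation.

```python
# Sketch: embodiment-specific soft prompts + flow-matching action denoising.
# Names and sizes are illustrative assumptions, not the X-VLA code.
import torch
import torch.nn as nn


class SoftPromptedActionDenoiser(nn.Module):
    def __init__(self, num_embodiments=8, num_prompt_tokens=16, d_model=512, action_dim=7, horizon=8):
        super().__init__()
        # One learnable prompt table per embodiment (the "soft prompt" idea).
        self.prompts = nn.Parameter(torch.randn(num_embodiments, num_prompt_tokens, d_model) * 0.02)
        self.action_in = nn.Linear(action_dim, d_model)
        self.time_in = nn.Linear(1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.velocity_out = nn.Linear(d_model, action_dim)
        self.horizon = horizon

    def forward(self, vl_tokens, noisy_actions, t, embodiment_id):
        # vl_tokens: (B, N, d_model) vision-language features from the encoder backbone.
        # noisy_actions: (B, horizon, action_dim) partially noised action chunk.
        # t: (B, 1) flow-matching time in [0, 1]; embodiment_id: (B,) integer ids.
        prompt = self.prompts[embodiment_id]                      # (B, P, d_model)
        act_tok = self.action_in(noisy_actions)                   # (B, H, d_model)
        time_tok = self.time_in(t).unsqueeze(1)                   # (B, 1, d_model)
        x = torch.cat([prompt, time_tok, vl_tokens, act_tok], dim=1)
        h = self.transformer(x)
        # Predict the velocity field only on the action-token positions.
        return self.velocity_out(h[:, -noisy_actions.shape[1]:])  # (B, H, action_dim)


# Toy usage: one Euler step of flow-matching denoising starting from noise.
model_sketch = SoftPromptedActionDenoiser()
vl = torch.randn(2, 32, 512)                       # stand-in for encoder features
actions = torch.randn(2, 8, 7)                     # noise-initialized action chunk
t = torch.zeros(2, 1)
velocity = model_sketch(vl, actions, t, torch.tensor([0, 1]))
actions = actions + velocity * 0.1                 # illustrative step size
```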

## 🧠 Training Summary

| Setting       | Value                  |
| :------------ | :--------------------- |
| Training Data | Bridge Data V2         |
| Parameters    | ≈ 0.9 B                |
| Action Mode   | `ee6d`                 |
| Precision     | BF16                   |
| Framework     | PyTorch + Transformers |
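
The precision listed above refers to bfloat16. If you want to load the checkpoint in the same dtype, the standard `torch_dtype` argument of `from_pretrained` can be used; the snippet below is a sketch and assumes a CUDA GPU with bfloat16 support.

```python
# Sketch: load the checkpoint in bfloat16 to match the listed training precision.
# Assumes a CUDA GPU with bfloat16 support; fall back to float32 on CPU if needed.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
```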

---

## 🪪 License

```
Copyright 2025 2toINF (https://github.com/2toINF)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
```

---

## 📚 Citation

```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```

---

## 🌐 Links

- 📄 **Paper:** [arXiv 2510.10274](https://arxiv.org/abs/2510.10274)
- 💻 **Code & Client/Server:** [GitHub – 2toINF/X-VLA](https://github.com/2toINF/X-VLA)
- 🤖 **Model Hub:** [Hugging Face – 2toINF/X-VLA](https://huggingface.co/collections/2toINF/x-vla)