---
license: apache-2.0
base_model:
- microsoft/Florence-2-large
tags:
- robotics
- vla
pipeline_tag: robotics
datasets:
- Facebear/XVLA-Soft-Fold
---

# X-VLA 0.9B (Soft Fold Edition)

**Repository:** [2toINF/X-VLA](https://github.com/2toinf/X-VLA)

**Authors:** [2toINF](https://github.com/2toINF) | **License:** Apache 2.0

**Paper:** *Zheng et al., 2025, “X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model”* ([arXiv:2510.10274](https://arxiv.org/pdf/2510.10274))

## 🚀 Overview

Successful generalist **Vision-Language-Action (VLA)** models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To exploit the heterogeneity of rich robotic data sources, **X-VLA** introduces a **soft-prompt approach** with minimal added parameters: it brings prompt-learning ideas into cross-embodiment robot learning by adding a **separate set of learnable embeddings** for each distinct embodiment.

These embodiment-specific prompts enable the model to exploit cross-embodiment features effectively. The architecture, a clean flow-matching-based VLA design built exclusively on soft-prompted standard Transformers, achieves superior scalability and simplicity.

Trained on **Bridge Data** and evaluated across **six simulation benchmarks** and **three real-world robots**, the 0.9B-parameter X-VLA achieves **state-of-the-art performance** across diverse benchmarks, demonstrating flexible dexterity and fast adaptation across embodiments, environments, and tasks.

🌐 **Project Website:** [https://thu-air-dream.github.io/X-VLA/](https://thu-air-dream.github.io/X-VLA/)

<video controls autoplay loop muted playsinline width="720">
  <source src="https://huggingface.co/2toINF/X-VLA-0.9B-WidowX/resolve/main/demo.mp4" type="video/mp4">
</video>
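The per-embodiment soft-prompt idea described above can be sketched as follows. This is a minimal illustrative sketch, not the actual X-VLA implementation: the class name, prompt length, and embedding dimension are all assumptions chosen for demonstration.

```python
import torch
import torch.nn as nn

class SoftPromptBank(nn.Module):
    """Illustrative per-embodiment learnable prompt embeddings (hypothetical)."""

    def __init__(self, num_embodiments: int, prompt_len: int, dim: int):
        super().__init__()
        # One independent, trainable set of prompt tokens per embodiment.
        self.prompts = nn.Parameter(torch.randn(num_embodiments, prompt_len, dim) * 0.02)

    def forward(self, tokens: torch.Tensor, embodiment_id: int) -> torch.Tensor:
        # Prepend the selected embodiment's prompts to the token sequence,
        # so the shared Transformer sees an embodiment-specific prefix.
        batch = tokens.shape[0]
        prompt = self.prompts[embodiment_id].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, tokens], dim=1)

bank = SoftPromptBank(num_embodiments=9, prompt_len=16, dim=64)
x = torch.zeros(2, 10, 64)          # (batch, seq, dim) dummy tokens
out = bank(x, embodiment_id=3)
print(out.shape)                    # prepended prompts grow seq from 10 to 26
```

Only the small prompt bank differs per embodiment; every other parameter is shared across platforms, which is what keeps the added parameter count minimal.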
## ⚙️ Usage

### 🔹 Load the model

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "2toINF/X-VLA-WidowX",
    trust_remote_code=True
)
```
### 🔹 Start FastAPI server

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("2toINF/X-VLA-WidowX", trust_remote_code=True)
model.run(processor, host="0.0.0.0", port=8000)
```

### 🔹 Client-server evaluation

You can run the provided evaluation client from our GitHub:
👉 [2toINF/X-VLA – Client & Server Code](https://github.com/2toINF/X-VLA)
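For a quick sense of the client side, a request against the server started above could look like the sketch below. The endpoint path (`/act`) and payload field names are assumptions for illustration only; the real schema is defined by the client code in the X-VLA GitHub repository.

```python
import json
import urllib.request

def build_request(instruction: str, image_b64: str) -> dict:
    """Assemble a hypothetical observation payload for the inference server.
    Field names are illustrative; consult the official client for the schema."""
    return {"instruction": instruction, "image": image_b64}

def query_server(payload: dict, url: str = "http://localhost:8000/act") -> dict:
    # POST the observation as JSON and read back the predicted action.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_request("fold the towel", "<base64-encoded RGB frame>")
# query_server(payload)  # requires the FastAPI server from the step above
```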

## 🧩 Architecture

| Component | Role |
| :-------------------------- | :------------------------------------------------------------------------ |
| **Florence-2 Encoder** | Vision-language representation backbone (encoder-only). |
| **SoftPromptedTransformer** | Flow-matching action denoiser using learnable soft prompts per embodiment. |
| **Action Hub** | Defines action spaces, masking rules, pre/post-processing, and losses. |
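To illustrate what a flow-matching action denoiser does at inference time, here is a generic Euler-integration sketch, not X-VLA's exact sampler: a learned velocity field is integrated from Gaussian noise toward an action. The `denoiser` signature, step count, and action dimension are assumptions.

```python
import torch

def flow_matching_sample(denoiser, cond, action_dim: int = 7, steps: int = 10):
    """Generic flow-matching sampler sketch (hypothetical, not X-VLA's).
    `denoiser(x_t, t, cond)` is assumed to predict the velocity at time t."""
    x = torch.randn(1, action_dim)            # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        x = x + dt * denoiser(x, t, cond)     # Euler step: x += v(x, t) * dt
    return x

# Dummy velocity field for demonstration: pulls samples toward zero.
dummy = lambda x, t, cond: -x
action = flow_matching_sample(dummy, cond=None)
```

In the real model, the velocity field is the SoftPromptedTransformer conditioned on the Florence-2 vision-language features and the embodiment's soft prompts.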

## 🧠 Training Summary

| Setting | Value |
| :------------ | :--------------------- |
| Training Data | Bridge Data V2 |
| Parameters | ≈ 0.9 B |
| Action Mode | `ee6d` |
| Precision | BF16 |
| Framework | PyTorch + Transformers |

---

## 🪪 License

```
Copyright 2025 2toINF (https://github.com/2toINF)
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
http://www.apache.org/licenses/LICENSE-2.0
```

---

## 📚 Citation

```bibtex
@article{zheng2025x,
  title   = {X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model},
  author  = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui
             and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and others},
  journal = {arXiv preprint arXiv:2510.10274},
  year    = {2025}
}
```

---

## 🌐 Links

- 📄 **Paper:** [arXiv 2510.10274](https://arxiv.org/abs/2510.10274)
- 💻 **Code & Client/Server:** [GitHub – 2toINF/X-VLA](https://github.com/2toINF/X-VLA)
- 🤖 **Model Hub:** [Hugging Face – 2toINF/X-VLA-0.9B-WidowX](https://huggingface.co/2toINF/X-VLA-0.9B-WidowX)