VLA-Adapter committed 6f1cc10 (verified) · 1 Parent(s): f76275d

Update README.md

Files changed (1): README.md (+233, -3)

---
license: mit
tags:
- Vision-Language-Action
- OpenHelix Team
base_model:
- Qwen/Qwen2.5-0.5B
language:
- en
pipeline_tag: robotics
---

<p align="center">
  <img src="https://huggingface.co/datasets/VLA-Adapter/Figures/resolve/main/Logo.png" width="1000"/>
</p>

# Model Card for VLA-Adapter Libero-Long
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model, trained on LIBERO-Long.
- 💬 Project page: [https://vla-adapter.github.io/](https://vla-adapter.github.io/)
- 🖥️ Dataset: [https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main)
- 🤗 HuggingFace: [https://huggingface.co/VLA-Adapter](https://huggingface.co/VLA-Adapter)

## Model Details
We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative action models. The VLA-Adapter VLM follows the Prismatic-VLM architecture, using only a very small backbone (Qwen2.5-0.5B) for the LLM. On common robotics benchmarks, it surpasses open-source VLA models with 8.5B, 7B, 4B, 3B, and 2B backbones.

**Input:** The model takes an image and a text instruction as input.

**Output:** The model generates actions only.

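As a rough illustration of that interface, here is a minimal, hypothetical inference sketch. It assumes an OpenVLA-style Hugging Face entry point (an `AutoProcessor` plus a `predict_action` helper with an `unnorm_key` for action de-normalization); the repo id, method names, and key below are placeholders and may not match the released code.

```python
# Hypothetical usage sketch -- the repo id, the `predict_action` helper, and the
# `unnorm_key` value are assumptions modeled on OpenVLA-style VLAs, not a
# documented API for this checkpoint.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "VLA-Adapter/<this-repo>"  # placeholder: use the repo id shown on this model page

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("third_person_view.png")      # current camera observation
instruction = "put the bowl on the stove"        # free-form language command

inputs = processor(text=instruction, images=image, return_tensors="pt")
action = model.predict_action(**inputs, unnorm_key="libero_10")  # assumed helper; returns an action chunk
print(action)  # e.g. end-effector deltas plus a gripper command
```
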
**Model Architecture:** The VLA-Adapter consists of a VLM that receives and processes image and text information, and a policy that generates actions. We systematically analyzed the benefits the VLM provides to different types of policy conditions and derived a unified framework from this analysis. Our Bridge Attention module then fuses the conditions generated by the VLM with the initial action information in the policy, bridging the gap between VL and A as far as possible. The result is a high-performance VLA model built on a tiny-scale backbone.

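To make the bridging idea concrete, below is an illustrative PyTorch sketch of a cross-attention "bridge" block in which initial action tokens attend to condition tokens coming from the VLM. The layer layout, dimensions (896 is the Qwen2.5-0.5B hidden size), gating, and names are placeholders chosen for exposition, not the released Bridge Attention implementation.

```python
# Illustrative sketch only: action tokens in the policy attend to VLM "conditions".
# All module names, sizes, and the gating scheme are assumptions for exposition.
import torch
import torch.nn as nn


class BridgeAttentionBlock(nn.Module):
    """One fusion step: action tokens cross-attend to VLM condition tokens."""

    def __init__(self, action_dim=256, cond_dim=896, num_heads=8):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, action_dim)  # map VLM features to the policy width
        self.cross_attn = nn.MultiheadAttention(action_dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(action_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(action_dim, 4 * action_dim), nn.GELU(),
                                 nn.Linear(4 * action_dim, action_dim))
        self.norm1 = nn.LayerNorm(action_dim)
        self.norm2 = nn.LayerNorm(action_dim)
        self.norm3 = nn.LayerNorm(action_dim)
        self.gate = nn.Parameter(torch.zeros(1))  # learned gate on the injected VL condition

    def forward(self, action_tokens, vlm_condition):
        # action_tokens: (B, chunk_len, action_dim); vlm_condition: (B, num_cond, cond_dim)
        cond = self.cond_proj(vlm_condition)
        x = action_tokens
        attn_out, _ = self.cross_attn(self.norm1(x), cond, cond)
        x = x + torch.tanh(self.gate) * attn_out          # gated injection of VL information
        attn_out, _ = self.self_attn(self.norm2(x), self.norm2(x), self.norm2(x))
        x = x + attn_out
        return x + self.ffn(self.norm3(x))


# Toy usage: 8 initial action tokens fused with 300 condition tokens from the VLM.
block = BridgeAttentionBlock()
actions = torch.randn(2, 8, 256)
conditions = torch.randn(2, 300, 896)
print(block(actions, conditions).shape)  # torch.Size([2, 8, 256])
```
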
### Success Rate Comparison
Bold marks the best result in each column; starred, underlined italics mark the second best.

<table>
  <tr>
    <td><strong>Category</strong></td>
    <td><strong>Methods</strong></td>
    <td><strong>Scale</strong></td>
    <td><strong>LIBERO-Spatial</strong></td>
    <td><strong>LIBERO-Object</strong></td>
    <td><strong>LIBERO-Goal</strong></td>
    <td><strong>LIBERO-Long</strong></td>
    <td><strong>Avg.</strong></td>
  </tr>
  <tr>
    <td rowspan="10">Large-scale</td>
    <td>FlowVLA (Zhong et al., 2025)</td>
    <td>8.5B</td><td>93.2</td><td>95.0</td><td>91.6</td><td>72.6</td><td>88.1</td>
  </tr>
  <tr>
    <td>OpenVLA (Kim et al., 2024)</td>
    <td>7B</td><td>84.7</td><td>88.4</td><td>79.2</td><td>53.7</td><td>76.5</td>
  </tr>
  <tr>
    <td>OpenVLA-OFT (Kim et al., 2025)</td>
    <td>7B</td><td><i><u>97.6*</u></i></td><td>98.4</td><td><b>97.9</b></td><td><b>94.5</b></td><td><b>97.1</b></td>
  </tr>
  <tr>
    <td>UniVLA (Bu et al., 2025)</td>
    <td>7B</td><td>96.5</td><td>96.8</td><td>95.6</td><td>92.0</td><td>95.2</td>
  </tr>
  <tr>
    <td>CoT-VLA (Zhao et al., 2025)</td>
    <td>7B</td><td>87.5</td><td>91.6</td><td>87.6</td><td>69.0</td><td>81.1</td>
  </tr>
  <tr>
    <td>WorldVLA (Cen et al., 2025)</td>
    <td>7B</td><td>87.6</td><td>96.2</td><td>83.4</td><td>60.0</td><td>81.8</td>
  </tr>
  <tr>
    <td>TraceVLA (Zheng et al., 2025)</td>
    <td>7B</td><td>84.6</td><td>85.2</td><td>75.1</td><td>54.1</td><td>74.8</td>
  </tr>
  <tr>
    <td>MolmoAct (Lee et al., 2025)</td>
    <td>7B</td><td>87.0</td><td>95.4</td><td>87.6</td><td>77.2</td><td>86.6</td>
  </tr>
  <tr>
    <td>ThinkAct (Huang et al., 2025)</td>
    <td>7B</td><td>88.3</td><td>91.4</td><td>87.1</td><td>70.9</td><td>84.4</td>
  </tr>
  <tr>
    <td>PD-VLA (Song et al., 2025b)</td>
    <td>7B</td><td>95.5</td><td>96.7</td><td>94.9</td><td>91.7</td><td>94.7</td>
  </tr>
  <tr>
    <td rowspan="8">Small-scale</td>
    <td>4D-VLA (Zhang et al., 2025)</td>
    <td>4B</td><td>88.9</td><td>95.2</td><td>90.9</td><td>79.1</td><td>88.6</td>
  </tr>
  <tr>
    <td>SpatialVLA (Qu et al., 2025)</td>
    <td>4B</td><td>88.2</td><td>89.9</td><td>78.6</td><td>55.5</td><td>78.1</td>
  </tr>
  <tr>
    <td>π0 (Black et al., 2025)</td>
    <td>3B</td><td>96.8</td><td><i><u>98.8*</u></i></td><td>95.8</td><td>85.2</td><td>94.2</td>
  </tr>
  <tr>
    <td>π0-FAST (Pertsch et al., 2025)</td>
    <td>3B</td><td>96.4</td><td>96.8</td><td>88.6</td><td>60.2</td><td>85.5</td>
  </tr>
  <tr>
    <td>NORA (Hung et al., 2025)</td>
    <td>3B</td><td>92.2</td><td>95.4</td><td>89.4</td><td>74.6</td><td>87.9</td>
  </tr>
  <tr>
    <td>SmolVLA (Shukor et al., 2025)</td>
    <td>2.2B</td><td>93.0</td><td>94.0</td><td>91.0</td><td>77.0</td><td>88.8</td>
  </tr>
  <tr>
    <td>GR00T N1 (NVIDIA et al., 2025)</td>
    <td>2B</td><td>94.4</td><td>97.6</td><td>93.0</td><td>90.6</td><td>93.9</td>
  </tr>
  <tr>
    <td>GraspVLA (Deng et al., 2025)</td>
    <td>1.8B</td><td>-</td><td>94.1</td><td>91.2</td><td>82.0</td><td>89.1</td>
  </tr>
  <tr>
    <td rowspan="4">Tiny-scale</td>
    <td>Seer (Tian et al., 2025)</td>
    <td>0.57B</td><td>-</td><td>-</td><td>-</td><td>78.7</td><td>78.7</td>
  </tr>
  <tr>
    <td>VLA-OS (Gao et al., 2025)</td>
    <td>0.5B</td><td>87.0</td><td>96.5</td><td>92.7</td><td>66.0</td><td>85.6</td>
  </tr>
  <tr>
    <td>Diffusion Policy (Chi et al., 2023)</td>
    <td>-</td><td>78.3</td><td>92.5</td><td>68.3</td><td>50.5</td><td>72.4</td>
  </tr>
  <tr>
    <td><b>VLA-Adapter (Ours)</b></td>
    <td><b>0.5B</b></td><td><b>97.8</b></td><td><b>99.2</b></td><td><i><u>97.2*</u></i></td><td><i><u>94.0*</u></i></td><td><b>97.1</b></td>
  </tr>
</table>

### Effectiveness Comparison

The ratio column gives VLA-Adapter relative to OpenVLA-OFT (e.g., 8 GPU·h / 304 GPU·h ≈ 1/38; 219.2 Hz / 109.7 Hz ≈ 2×).

<table>
  <tr>
    <td></td>
    <td><strong>OpenVLA-OFT</strong></td>
    <td><strong>VLA-Adapter</strong></td>
    <td><strong>Ratio</strong></td>
  </tr>
  <tr>
    <td>Backbone</td>
    <td>7B</td>
    <td><strong>0.5B</strong></td>
    <td>1/14×</td>
  </tr>
  <tr>
    <td>Fine-tuning cost</td>
    <td>304 GPU·h</td>
    <td><strong>8 GPU·h</strong></td>
    <td>1/38×</td>
  </tr>
  <tr>
    <td>Training VRAM (batch size 8)</td>
    <td>62 GB</td>
    <td><strong>24.7 GB</strong></td>
    <td>0.4×</td>
  </tr>
  <tr>
    <td>Throughput (chunk size 8)</td>
    <td>109.7 Hz</td>
    <td><strong>219.2 Hz</strong></td>
    <td>2×</td>
  </tr>
  <tr>
    <td>Performance (LIBERO avg.)</td>
    <td>97.1%</td>
    <td><strong>97.1%</strong></td>
    <td>Maintained</td>
  </tr>
</table>

## Citation instructions

```BibTeX
@article{Wang2025VLAAdapter,
  author  = {Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  title   = {VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  journal = {ArXiv},
  year    = {2025}
}
```