VLA-Adapter committed · commit 9e21d60 (verified) · parent c6d6785

Update README.md

Files changed (1): README.md (+91, −157)
# VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
 
- 💬 Project page: [https://vla-adapter.github.io/](https://vla-adapter.github.io/)
- 🖥️ Dataset: [https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main](https://huggingface.co/datasets/openvla/modified_libero_rlds/tree/main)
- 🤗 HuggingFace: [https://huggingface.co/VLA-Adapter](https://huggingface.co/VLA-Adapter)
- GitHub: [https://github.com/OpenHelix-Team/VLA-Adapter](https://github.com/OpenHelix-Team/VLA-Adapter)

## Model Details
We have developed and released the VLA-Adapter family of VLA models, a series of fine-tuned generative
 
### Success Rate Comparison

<table>
<tr>
<td><strong>LIBERO</strong></td> <td><strong>Methods</strong></td> <td><strong>Scale</strong></td> <td><strong>Spatial</strong></td> <td><strong>Object</strong></td> <td><strong>Goal</strong></td> <td><strong>Long</strong></td> <td><strong>Avg.</strong></td>
</tr>
<tr>
<td rowspan="10">Large-scale</td>
<td>FlowVLA (Zhong et al., 2025)</td>
<td>8.5B</td><td>93.2</td><td>95.0</td><td>91.6</td><td>72.6</td><td>88.1</td>
</tr>
<tr>
<td>UnifiedVLA (Wang et al., 2025)</td>
<td>8.5B</td><td>95.4</td><td><i><u>98.8*</u></i></td><td>93.6</td><td>94.0</td><td>95.5</td>
</tr>
<tr>
<td>OpenVLA (Kim et al., 2024)</td>
<td>7B</td><td>84.7</td><td>88.4</td><td>79.2</td><td>53.7</td><td>76.5</td>
</tr>
<tr>
<td>OpenVLA-OFT (Kim et al., 2025)</td>
<td>7B</td><td><i><u>97.6*</u></i></td><td>98.4</td><td><b>97.9</b></td><td><i><u>94.5*</u></i></td><td><i><u>97.1*</u></i></td>
</tr>
<tr>
<td>UniVLA (Bu et al., 2025)</td>
<td>7B</td><td>96.5</td><td>96.8</td><td>95.6</td><td>92.0</td><td>95.2</td>
</tr>
<tr>
<td>CoT-VLA (Zhao et al., 2025)</td>
<td>7B</td><td>87.5</td><td>91.6</td><td>87.6</td><td>69.0</td><td>81.1</td>
</tr>
<tr>
<td>WorldVLA (Cen et al., 2025)</td>
<td>7B</td><td>87.6</td><td>96.2</td><td>83.4</td><td>60.0</td><td>81.8</td>
</tr>
<tr>
<td>TraceVLA (Zheng et al., 2025)</td>
<td>7B</td><td>84.6</td><td>85.2</td><td>75.1</td><td>54.1</td><td>74.8</td>
</tr>
<tr>
<td>MolmoAct (Lee et al., 2025)</td>
<td>7B</td><td>87.0</td><td>95.4</td><td>87.6</td><td>77.2</td><td>86.6</td>
</tr>
<tr>
<td>ThinkAct (Huang et al., 2025)</td>
<td>7B</td><td>88.3</td><td>91.4</td><td>87.1</td><td>70.9</td><td>84.4</td>
</tr>
<tr>
<td rowspan="7">Small-scale</td>
<td>4D-VLA (Zhang et al., 2025)</td>
<td>4B</td><td>88.9</td><td>95.2</td><td>90.9</td><td>79.1</td><td>88.6</td>
</tr>
<tr>
<td>SpatialVLA (Qu et al., 2025)</td>
<td>4B</td><td>88.2</td><td>89.9</td><td>78.6</td><td>55.5</td><td>78.1</td>
</tr>
<tr>
<td>π0 (Black et al., 2024)</td>
<td>3B</td><td>96.8</td><td><i><u>98.8*</u></i></td><td>95.8</td><td>85.2</td><td>94.2</td>
</tr>
<tr>
<td>π0-FAST (Pertsch et al., 2025)</td>
<td>3B</td><td>96.4</td><td>96.8</td><td>88.6</td><td>60.2</td><td>85.5</td>
</tr>
<tr>
<td>NORA (Hung et al., 2025)</td>
<td>3B</td><td>92.2</td><td>95.4</td><td>89.4</td><td>74.6</td><td>87.9</td>
</tr>
<tr>
<td>SmolVLA (Shukor et al., 2025)</td>
<td>2.2B</td><td>93.0</td><td>94.0</td><td>91.0</td><td>77.0</td><td>88.8</td>
</tr>
<tr>
<td>GR00T N1 (NVIDIA et al., 2025)</td>
<td>2B</td><td>94.4</td><td>97.6</td><td>93.0</td><td>90.6</td><td>93.9</td>
</tr>
<tr>
<td rowspan="5">Tiny-scale</td>
<td>Seer (Tian et al., 2025)</td>
<td>0.57B</td><td>-</td><td>-</td><td>-</td><td>78.7</td><td>78.7</td>
</tr>
<tr>
<td>VLA-OS (Gao et al., 2025)</td>
<td>0.5B</td><td>87.0</td><td>96.5</td><td>92.7</td><td>66.0</td><td>85.6</td>
</tr>
<tr>
<td>Diffusion Policy (Chi et al., 2023)</td>
<td>-</td><td>78.3</td><td>92.5</td><td>68.3</td><td>50.5</td><td>72.4</td>
</tr>
<tr>
<td><b>VLA-Adapter (Ours)</b></td>
<td><b>0.5B</b></td><td><b>97.8</b></td><td><b>99.2</b></td><td><i><u>97.2*</u></i></td><td><b>95.0</b></td><td><b>97.3</b></td>
</tr>
<tr>
<td><b>VLA-Adapter-Pro (Ours)</b></td>
<td><b>0.5B</b></td><td><b><i>99.6</i></b></td><td><b><i>99.6</i></b></td><td><b><i>98.2</i></b></td><td><b><i>96.4</i></b></td><td><b><i>98.5</i></b></td>
</tr>
</table>
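For reference, the Avg. column is simply the arithmetic mean of the four LIBERO suite success rates. A minimal sanity-check sketch using the VLA-Adapter (Ours) row from the table:

```python
# LIBERO suite success rates for VLA-Adapter (Ours), from the table above:
# Spatial, Object, Goal, Long
suite_scores = [97.8, 99.2, 97.2, 95.0]

# Avg. is the arithmetic mean over the four suites.
avg = sum(suite_scores) / len(suite_scores)
print(round(avg, 1))  # 97.3, matching the Avg. column
```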

<table>
<tr>
<td><strong>CALVIN</strong></td> <td><strong>Methods</strong></td> <td><strong>Scale</strong></td> <td><strong>1</strong></td> <td><strong>2</strong></td> <td><strong>3</strong></td> <td><strong>4</strong></td> <td><strong>5</strong></td> <td><strong>Avg. len</strong></td>
</tr>
<tr>
<td rowspan="8">Large-scale</td>
<td>UniVLA (Bu et al., 2025)</td>
<td>7B</td><td>95.5</td><td>85.8</td><td>75.4</td><td>66.9</td><td>56.5</td><td>3.80</td>
</tr>
<tr>
<td>OpenVLA (Kim et al., 2024)</td>
<td>7B</td><td>91.3</td><td>77.8</td><td>62.0</td><td>52.1</td><td>43.5</td><td>3.27</td>
</tr>
<tr>
<td>OpenVLA-OFT (Kim et al., 2025)</td>
<td>7B</td><td>96.3</td><td>89.1</td><td>82.4</td><td>75.8</td><td>66.5</td><td>4.10</td>
</tr>
<tr>
<td>VLAS (Zhao et al., 2025b)</td>
<td>7B</td><td>87.2</td><td>64.2</td><td>40.9</td><td>28.1</td><td>19.6</td><td>2.40</td>
</tr>
<tr>
<td>LCB (Shentu et al., 2024)</td>
<td>7B</td><td>73.6</td><td>50.2</td><td>28.5</td><td>16.0</td><td>9.9</td><td>1.78</td>
</tr>
<tr>
<td>RoboDual (Bu et al., 2024a)</td>
<td>7B</td><td>94.4</td><td>82.7</td><td>72.1</td><td>62.4</td><td>54.4</td><td>3.66</td>
</tr>
<tr>
<td>OpenHelix (Cui et al., 2025)</td>
<td>7B</td><td><i><u>97.1*</u></i></td><td>91.4</td><td>82.8</td><td>72.6</td><td>64.1</td><td>4.08</td>
</tr>
<tr>
<td>ReconVLA (Song et al., 2025c)</td>
<td>7B</td><td>95.6</td><td>87.6</td><td>76.9</td><td>69.3</td><td>64.1</td><td>3.95</td>
</tr>
<tr>
<td rowspan="4">Small-scale</td>
<td>DeeR (Yue et al., 2024)</td>
<td>3B</td><td>86.2</td><td>70.1</td><td>51.8</td><td>41.5</td><td>30.4</td><td>2.82</td>
</tr>
<tr>
<td>RoboFlamingo (Li et al., 2024b)</td>
<td>3B</td><td>82.4</td><td>61.9</td><td>46.6</td><td>33.1</td><td>23.5</td><td>2.48</td>
</tr>
<tr>
<td>VPP (Hu et al., 2025)</td>
<td>1.5B</td><td>95.7</td><td>91.2</td><td><i><u>86.3*</u></i></td><td><i><u>81.0*</u></i></td><td><i><u>75.0*</u></i></td><td><i><u>4.33*</u></i></td>
</tr>
<tr>
<td>SuSIE (Black et al., 2024)</td>
<td>1.3B</td><td>87.0</td><td>69.0</td><td>49.0</td><td>38.0</td><td>26.0</td><td>2.69</td>
</tr>
<tr>
<td rowspan="5">Tiny-scale</td>
<td>Seer-Large (Tian et al., 2025)</td>
<td>0.57B</td><td>96.3</td><td><i><u>91.6*</u></i></td><td>86.1</td><td>80.3</td><td>74.0</td><td>4.28</td>
</tr>
<tr>
<td>MoDE (Reuss et al., 2025)</td>
<td>0.44B</td><td>96.2</td><td>88.9</td><td>81.1</td><td>71.8</td><td>63.5</td><td>4.01</td>
</tr>
<tr>
<td>Seer (Tian et al., 2025)</td>
<td>0.32B</td><td>94.4</td><td>87.2</td><td>79.9</td><td>72.2</td><td>64.3</td><td>3.98</td>
</tr>
<tr>
<td><b>VLA-Adapter (Ours)</b></td>
<td><b>0.5B</b></td><td><b><i>99.1</i></b></td><td><b>94.6</b></td><td><b>88.8</b></td><td><b>82.8</b></td><td><b>76.5</b></td><td><b>4.42</b></td>
</tr>
<tr>
<td><b>VLA-Adapter-Pro (Ours)</b></td>
<td><b>0.5B</b></td><td><b>98.5</b></td><td><b><i>95.0</i></b></td><td><b><i>90.5</i></b></td><td><b><i>85.3</i></b></td><td><b><i>80.0</i></b></td><td><b><i>4.50</i></b></td>
</tr>
</table>
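In the CALVIN benchmark, columns 1–5 report the success rate (%) of completing at least that many tasks in a row within a 5-task chain, and Avg. len is the expected number of consecutively completed tasks. Assuming that standard convention (it is not restated in this README), Avg. len is approximately the sum of the five rates expressed as fractions; a minimal sketch using the VLA-Adapter (Ours) row:

```python
# Success rates (%) of completing >= k consecutive tasks (k = 1..5),
# VLA-Adapter (Ours) row from the table above.
chain_rates = [99.1, 94.6, 88.8, 82.8, 76.5]

# Expected chain length = sum over k of P(at least k tasks completed).
avg_len = sum(chain_rates) / 100.0
print(round(avg_len, 2))  # ≈ 4.42, matching the Avg. len column
```

Small discrepancies against reported values for other rows are expected, since papers compute Avg. len from exact per-chain counts before rounding the percentages.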

## Citation instructions

```BibTeX
@article{wang2025vlaadapter,
  author  = {Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},
  title   = {VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},
  journal = {arXiv preprint arXiv:2509.09372},
  year    = {2025}
}
```