Tavish9 committed 01f5ea2 (verified · parent: 01b609f): Update README.md

Files changed (1): README.md (+536 −9)

---
license: mit
base_model:
- google/paligemma2-3b-pt-224
tags:
- VLA
- Foundation Vision-language-action Model
- Generalist Robot Policy
- robotics
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

# SpatialVLA

SpatialVLA is a spatial-enhanced vision-language-action model trained on 1.1 million real robot episodes. The code is built purely on Hugging Face Transformers, and it is concise and efficient to run.

All SpatialVLA checkpoints, as well as our [training codebase](https://github.com/SpatialVLA/SpatialVLA), are released under the MIT License.

For full details, please read [our paper](https://arxiv.org/abs/2501.15830) and see [our project page](https://spatialvla.github.io/).

## Model Details

### Model Description

- **Developed by:** The SpatialVLA team consisting of researchers from Shanghai AI Laboratory, ShanghaiTech and TeleAI.
- **Model type:** Vision-language-action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** MIT
- **Finetuned from model:** [paligemma2-3b-pt-224](https://huggingface.co/google/paligemma2-3b-pt-224)
- **Pretraining Dataset:** [Open X-Embodiment](https://robotics-transformer-x.github.io/) and [RH20T](https://rh20t.github.io/)
- **Repository:** [https://github.com/SpatialVLA/SpatialVLA](https://github.com/SpatialVLA/SpatialVLA)
- **Paper:** [SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model](https://arxiv.org/abs/2501.15830)
- **Project Page & Videos:** [https://spatialvla.github.io/](https://spatialvla.github.io/)

## Uses

SpatialVLA relies solely on Hugging Face Transformers 🤗, which makes deployment straightforward. If your environment supports `transformers >= 4.47.0`, you can directly use the following code to load the model and run inference (inference requires 8.5 GB of GPU memory).

### Direct Use

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)

model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

image = Image.open("example.png").convert("RGB")
prompt = "What action should the robot take to pick the cpu?"
inputs = processor(images=[image], text=prompt, return_tensors="pt")
generation_outputs = model.predict_action(inputs)

actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
print(actions)
```

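The `unnorm_key` argument selects the dataset statistics used to un-normalize the predicted actions (here, the BridgeData V2 statistics); pick the key that matches your robot setup. If you want to run the model in a loop rather than on a single frame, the sketch below shows one way to wire the same calls into a simple control loop. `DummyEnv` is a hypothetical stand-in for your own camera and robot interface, not part of this repository.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor


class DummyEnv:
    """Hypothetical stand-in for a real camera + robot interface."""

    def get_image(self):
        return Image.open("example.png").convert("RGB")

    def step(self, actions):
        print(actions)  # a real environment would execute the decoded action(s) here


model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()

env = DummyEnv()
prompt = "What action should the robot take to pick the cpu?"

for _ in range(10):  # fixed-length rollout, just for illustration
    image = env.get_image()
    inputs = processor(images=[image], text=prompt, return_tensors="pt")
    generation_outputs = model.predict_action(inputs)
    actions = processor.decode_actions(generation_outputs, unnorm_key="bridge_orig/1.0.0")
    env.step(actions)
```
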
### Out-of-Scope Use

SpatialVLA models do not zero-shot generalize to new (unseen) robot embodiments or to setups that are not represented in the pretraining mix. In these cases, we suggest collecting a dataset of demonstrations on the desired setup and fine-tuning SpatialVLA on it instead.

## How to Get Hands Dirty with the Model

If you want to use the model for fine-tuning or pre-training, first clone the [official repository](https://github.com/SpatialVLA/SpatialVLA):

```bash
git clone https://github.com/SpatialVLA/SpatialVLA.git
```

Then install the required packages and download the model from the Hugging Face model hub. The VLM backbone of SpatialVLA is PaliGemma 2, which requires `transformers >= 4.47.0`; hence, create a Python environment with Python >= 3.10.

```bash
conda create -n spatialvla python=3.10
conda activate spatialvla
```

Install the packages from the `requirements.txt` file. Note that we use a customised `dlimp` to support seed setting for reproducibility. If you run into any problems, please manually install `dlimp` from [dlimp_custom](https://github.com/SpatialVLA/dlimp_custom).

```bash
pip install -r requirements.txt
```
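Before launching any training, it can help to sanity-check the environment. The snippet below is only an illustrative check, not part of the repository's scripts; it verifies the `transformers` version requirement mentioned above.

```python
# Illustrative environment check (not part of the repo's scripts).
from packaging.version import Version

import transformers

required = Version("4.47.0")
installed = Version(transformers.__version__)
assert installed >= required, f"transformers {installed} found, but >= {required} is required for PaliGemma 2"
print(f"transformers {installed} OK")
```
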
### Train from Scratch

SpatialVLA is pre-trained with 1.1 million real-robot demonstrations from the OXE and RH20T datasets on a cluster of 64 A100 GPUs for about 10 days, using a batch size of 2048. You can pre-train the model from scratch using the following commands.

```bash
# torchrun
bash scripts/spatialvla_4b_pretrain/torchrun_pretrain.sh

# or in a slurm cluster
bash scripts/spatialvla_4b_pretrain/slurm_pretrain.sh
```
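As a back-of-the-envelope check of these numbers (an illustrative calculation, not taken from the training scripts), a global batch size of 2048 spread over 64 GPUs corresponds to 32 samples per GPU per optimizer step, assuming no gradient accumulation:

```python
# Back-of-the-envelope arithmetic for the pre-training setup described above.
# Assumes no gradient accumulation; the actual scripts may configure this differently.
global_batch_size = 2048
num_gpus = 64
grad_accum_steps = 1  # assumption
per_gpu_batch_size = global_batch_size // (num_gpus * grad_accum_steps)
print(per_gpu_batch_size)  # -> 32
```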

### Fine-tuning

Most of our fine-tuning experiments are conducted with LoRA on 4 or 8 A100 GPUs. You can use the following scripts for full-parameter or LoRA fine-tuning. For real-world experiments with small datasets, we recommend LoRA fine-tuning.

```bash
# full fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_full.sh

# LoRA fine-tuning
bash scripts/spatialvla_4b_finetune/finetune_lora.sh
```
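The LoRA script above is the supported path. Purely for orientation, here is a minimal sketch of what attaching LoRA adapters to the Hugging Face model looks like with the `peft` library; the rank, alpha, and `target_modules` below are illustrative assumptions, not the settings used by `finetune_lora.sh`.

```python
# Minimal LoRA sketch with the peft library (illustrative settings, not the repo's).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model_name_or_path = "IPEC-COMMUNITY/spatialvla-4b-224-pt"
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=32,                 # assumed rank
    lora_alpha=32,        # assumed scaling
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```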

## Evaluation

<details>
<summary>
SimplerEnv evaluation on Google Robot tasks.
</summary>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th rowspan="2">Model</th>
<th colspan="4">Visual Matching</th>
<th colspan="4">Variant Aggregation</th>
</tr>
<tr style="text-align: center;">
<th>Pick Coke Can</th>
<th>Move Near</th>
<th>Open/Close Drawer</th>
<th>#Average</th>
<th>Pick Coke Can</th>
<th>Move Near</th>
<th>Open/Close Drawer</th>
<th>#Average</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT-1 (Begin)</td>
<td>2.7%</td>
<td>5.0%</td>
<td>13.9%</td>
<td>6.8%</td>
<td>2.2%</td>
<td>4.0%</td>
<td>6.9%</td>
<td>4.2%</td>
</tr>
<tr>
<td>RT-1 (15%)</td>
<td>71.0%</td>
<td>35.4%</td>
<td>56.5%</td>
<td>60.2%</td>
<td>81.3%</td>
<td>44.6%</td>
<td>26.7%</td>
<td>56.2%</td>
</tr>
<tr>
<td>RT-1 (Converged)</td>
<td>85.7%</td>
<td>44.2%</td>
<td>73.0%</td>
<td>74.6%</td>
<td>89.8%</td>
<td>50.0%</td>
<td>32.3%</td>
<td>63.3%</td>
</tr>
<tr>
<td>HPT</td>
<td>56.0%</td>
<td>60.0%</td>
<td>24.0%</td>
<td>46.0%</td>
<td>--</td>
<td>--</td>
<td>31.0%</td>
<td>45.0%</td>
</tr>
<tr>
<td>TraceVLA</td>
<td>28.0%</td>
<td>53.7%</td>
<td>57.0%</td>
<td>42.0%</td>
<td>60.0%</td>
<td>56.4%</td>
<td>29.4%</td>
<td>39.6%</td>
</tr>
<tr>
<td>RT-1-X</td>
<td>56.7%</td>
<td>31.7%</td>
<td>59.7%</td>
<td>53.4%</td>
<td>49.0%</td>
<td>32.3%</td>
<td>35.3%</td>
<td>64.3%</td>
</tr>
<tr>
<td>RT-2-X</td>
<td>78.7%</td>
<td>77.9%</td>
<td>25.0%</td>
<td>60.7%</td>
<td>82.3%</td>
<td>79.2%</td>
<td>--</td>
<td>--</td>
</tr>
<tr>
<td>Octo-Base</td>
<td>17.0%</td>
<td>4.2%</td>
<td>22.7%</td>
<td>16.8%</td>
<td>0.6%</td>
<td>3.1%</td>
<td>1.1%</td>
<td>1.1%</td>
</tr>
<tr>
<td>OpenVLA</td>
<td>16.3%</td>
<td>46.2%</td>
<td>35.6%</td>
<td>27.7%</td>
<td>54.5%</td>
<td>47.7%</td>
<td>17.7%</td>
<td>39.8%</td>
</tr>
<tr>
<td>RoboVLM (zero-shot)</td>
<td>72.7%</td>
<td>66.3%</td>
<td>26.8%</td>
<td>56.3%</td>
<td>68.3%</td>
<td>56.0%</td>
<td>8.5%</td>
<td>46.3%</td>
</tr>
<tr>
<td>RoboVLM (fine-tuning)</td>
<td>77.3%</td>
<td>61.7%</td>
<td>43.5%</td>
<td>63.4%</td>
<td>75.6%</td>
<td>60.0%</td>
<td>10.6%</td>
<td>51.3%</td>
</tr>
<tr>
<td>SpatialVLA (zero-shot)</td>
<td><b>81.0%</b></td>
<td><b>69.6%</b></td>
<td><b>59.3%</b></td>
<td><b>71.9%</b></td>
<td><b>89.5%</b></td>
<td><b>71.7%</b></td>
<td>36.2%</td>
<td><b>68.8%</b></td>
</tr>
<tr>
<td>SpatialVLA (fine-tuning)</td>
<td><b>86.0%</b></td>
<td><b>77.9%</b></td>
<td>57.4%</td>
<td><b>75.1%</b></td>
<td>88.0%</td>
<td>72.7%</td>
<td>41.8%</td>
<td><b>70.7%</b></td>
</tr>
</tbody>
</table>

</details>


<details>
<summary>
SimplerEnv evaluation on WidowX Robot tasks.
</summary>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th rowspan="2">Model</th>
<th colspan="2">Put Spoon on Towel</th>
<th colspan="2">Put Carrot on Plate</th>
<th colspan="2">Stack Green Block on Yellow Block</th>
<th colspan="2">Put Eggplant in Yellow Basket</th>
<th rowspan="2">#Overall Average</th>
</tr>
<tr style="text-align: center;">
<th>Grasp Spoon</th>
<th>Success</th>
<th>Grasp Carrot</th>
<th>Success</th>
<th>Grasp Green Block</th>
<th>Success</th>
<th>Grasp Eggplant</th>
<th>Success</th>
</tr>
</thead>
<tbody>
<tr>
<td>RT-1-X</td>
<td>16.7%</td>
<td>0.0%</td>
<td>20.8%</td>
<td>4.2%</td>
<td>8.3%</td>
<td>0.0%</td>
<td>0.0%</td>
<td>0.0%</td>
<td>1.1%</td>
</tr>
<tr>
<td>Octo-Base</td>
<td>34.7%</td>
<td>12.5%</td>
<td>52.8%</td>
<td>8.3%</td>
<td>31.9%</td>
<td>0.0%</td>
<td>66.7%</td>
<td>43.1%</td>
<td>16.0%</td>
</tr>
<tr>
<td>Octo-Small</td>
<td>77.8%</td>
<td>47.2%</td>
<td>27.8%</td>
<td>9.7%</td>
<td>40.3%</td>
<td>4.2%</td>
<td>87.5%</td>
<td>56.9%</td>
<td>30.0%</td>
</tr>
<tr>
<td>OpenVLA</td>
<td>4.1%</td>
<td>0.0%</td>
<td>33.3%</td>
<td>0.0%</td>
<td>12.5%</td>
<td>0.0%</td>
<td>8.3%</td>
<td>4.1%</td>
<td>1.0%</td>
</tr>
<tr>
<td>RoboVLM (zero-shot)</td>
<td>37.5%</td>
<td>20.8%</td>
<td>33.3%</td>
<td>25.0%</td>
<td>8.3%</td>
<td>8.3%</td>
<td>0.0%</td>
<td>0.0%</td>
<td>13.5%</td>
</tr>
<tr>
<td>RoboVLM (fine-tuning)</td>
<td>54.2%</td>
<td>29.2%</td>
<td>25.0%</td>
<td>25.0%</td>
<td>45.8%</td>
<td>12.5%</td>
<td>58.3%</td>
<td>58.3%</td>
<td>31.3%</td>
</tr>
<tr>
<td>SpatialVLA (zero-shot)</td>
<td><b>25.0%</b></td>
<td><b>20.8%</b></td>
<td><b>41.7%</b></td>
<td>20.8%</td>
<td><b>58.3%</b></td>
<td>25.0%</td>
<td><b>79.2%</b></td>
<td>70.8%</td>
<td><b>34.4%</b></td>
</tr>
<tr>
<td>SpatialVLA (fine-tuning)</td>
<td><b>20.8%</b></td>
<td>16.7%</td>
<td>29.2%</td>
<td>25.0%</td>
<td><b>62.5%</b></td>
<td>29.2%</td>
<td><b>100.0%</b></td>
<td><b>100.0%</b></td>
<td><b>42.7%</b></td>
</tr>
</tbody>
</table>
</details>

<details>
<summary>LIBERO Simulation Benchmark Results.</summary>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th rowspan="2">Model</th>
<th colspan="2">LIBERO-Spatial</th>
<th colspan="2">LIBERO-Object</th>
<th colspan="2">LIBERO-Goal</th>
<th colspan="2">LIBERO-Long</th>
<th colspan="2">Average</th>
</tr>
<tr style="text-align: center;">
<th>SR (↑)</th>
<th>Rank (↓)</th>
<th>SR (↑)</th>
<th>Rank (↓)</th>
<th>SR (↑)</th>
<th>Rank (↓)</th>
<th>SR (↑)</th>
<th>Rank (↓)</th>
<th>SR (↑)</th>
<th>Rank (↓)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Diffusion Policy from scratch</td>
<td>78.3 ± 1.1%</td>
<td>5</td>
<td><b>92.5 ± 0.7%</b></td>
<td>1</td>
<td>68.3 ± 1.2%</td>
<td>5</td>
<td>50.5 ± 1.3%</td>
<td>5</td>
<td>72.4 ± 0.7%</td>
<td>5</td>
</tr>
<tr>
<td>Octo fine-tuned</td>
<td>78.9 ± 1.0%</td>
<td>4</td>
<td>85.7 ± 0.9%</td>
<td>4</td>
<td><b>84.6 ± 0.9%</b></td>
<td>1</td>
<td>51.1 ± 1.3%</td>
<td>4</td>
<td>75.1 ± 0.6%</td>
<td>3</td>
</tr>
<tr>
<td>OpenVLA fine-tuned</td>
<td>84.7 ± 0.9%</td>
<td>2</td>
<td>88.4 ± 0.8%</td>
<td>3</td>
<td>79.2 ± 1.0%</td>
<td>2</td>
<td>53.7 ± 1.3%</td>
<td>3</td>
<td>76.5 ± 0.6%</td>
<td>2</td>
</tr>
<tr>
<td>TraceVLA fine-tuned</td>
<td>84.6 ± 0.2%</td>
<td>3</td>
<td>85.2 ± 0.4%</td>
<td>5</td>
<td>75.1 ± 0.3%</td>
<td>4</td>
<td>54.1 ± 1.0%</td>
<td>2</td>
<td>74.8 ± 0.5%</td>
<td>4</td>
</tr>
<tr>
<td>SpatialVLA fine-tuned</td>
<td><b>88.2 ± 0.5%</b></td>
<td>1</td>
<td>89.9 ± 0.7%</td>
<td>2</td>
<td>78.6 ± 0.6%</td>
<td>3</td>
<td><b>55.5 ± 1.0%</b></td>
<td>1</td>
<td><b>78.1 ± 0.7%</b></td>
<td>1</td>
</tr>
</tbody>
</table>

</details>

<details>
<summary>Zero-shot Robot Control Evaluation on WidowX Robot.</summary>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6535045a910b844786a6642f/SUPyXwcdfnWranO04tulL.png" alt="perform">
</details>

<details>
<summary>Spatial Understanding Capability Evaluation.</summary>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6535045a910b844786a6642f/g-EfM-6M7iM9IYryUTwLA.png" alt="perform">
</details>

<details>
<summary>Adapting to New Robot Setups on Franka Robot.</summary>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6535045a910b844786a6642f/4Z_vjQvsDGUcHCwmBCtRa.png" alt="perform">
</details>

## Citation

**BibTeX:**

```BibTeX
@misc{qu2025spatialvlaexploringspatialrepresentations,
      title={SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model},
      author={Delin Qu and Haoming Song and Qizhi Chen and Yuanqi Yao and Xinyi Ye and Yan Ding and Zhigang Wang and JiaYuan Gu and Bin Zhao and Dong Wang and Xuelong Li},
      year={2025},
      eprint={2501.15830},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2501.15830},
}
```