---
license: apache-2.0
language:
- en
- zh
library_name: transformers
tags:
- robotics
- vision-language-action
- reinforcement-learning
- embodied-ai
- openpi
- rlinf
pipeline_tag: reinforcement-learning
---

# SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching VLA Models

SA-VLA is a spatially-aware reinforcement learning approach for flow-matching Vision-Language-Action (VLA) models.  
It is developed on top of the RLinf framework and targets robust embodied manipulation with stronger spatial generalization.

- 📄 Paper: https://arxiv.org/abs/2602.00743  
- 🌐 Project Page: https://xupan.top/Projects/savla  
- 🧩 Codebase: https://github.com/TwSphinx54/SA-VLA  
- 🏗️ RL Framework: https://github.com/RLinf/RLinf

---

## Model Summary

SA-VLA fuses visual tokens and spatial tokens into geometry-aware embeddings, then optimizes the policy via:
1. **Step-level dense rewards**
2. **Spatially-conditioned exploration (SCAN)**
3. **RL fine-tuning on embodied benchmarks**

This repository provides model weights used in SA-VLA experiments.
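As a rough illustration of the fusion step (this is our own sketch, not the actual SA-VLA implementation; the token counts, feature dimensions, and the projection matrix are all hypothetical), visual and spatial tokens can be concatenated per token along the feature axis and projected into a shared geometry-aware embedding space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 256 visual tokens and 256 spatial tokens,
# each carrying a 64-dim feature vector (illustrative only).
visual_tokens = rng.standard_normal((256, 64))
spatial_tokens = rng.standard_normal((256, 64))

# Concatenate per-token features, then project with a random,
# untrained matrix standing in for a learned fusion layer.
fused = np.concatenate([visual_tokens, spatial_tokens], axis=-1)  # (256, 128)
proj = rng.standard_normal((128, 64)) / np.sqrt(128)
geometry_aware = fused @ proj                                     # (256, 64)
```

In the real model the projection is learned jointly with the policy; the sketch only shows the tensor-shape bookkeeping.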

---

## Intended Use

- RL fine-tuning and evaluation for embodied manipulation tasks
- Experiments on LIBERO / LIBERO-PLUS style benchmarks
- Research on spatial reasoning in VLA post-training

> For complete environment setup, training scripts, and benchmark integration, use the full code repository:
> https://github.com/TwSphinx54/SA-VLA

---

## Quick Start (with SA-VLA codebase)

### 1) Clone project
```bash
git clone https://github.com/TwSphinx54/SA-VLA.git
cd SA-VLA
```

### 2) Setup environment
Follow the RLinf setup in:
- `README.RLinf.md` (framework/environment)
- `scripts/setup_container.sh` (extra container setup)

### 3) Place weights
Put downloaded checkpoints under:
```text
weights/
```

### 4) Run training / evaluation
```bash
# RL training
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_openpi_pi05

# Evaluation
bash examples/embodiment/eval_embodiment.sh libero_spatial_ppo_openpi_pi05_eval
```

---

## Recommended Weight Layout

```text
weights
|-- Pi05-LIBERO
|-- Pi05-VGGT-LIBERO-FUSER-SFT_BF16
`-- RLinf-Pi05-SFT
```
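After downloading, a small script like the one below can confirm the layout matches the tree above (the helper is our own illustration, not part of the SA-VLA codebase):

```python
from pathlib import Path

# Directory names taken from the recommended layout above.
EXPECTED = [
    "Pi05-LIBERO",
    "Pi05-VGGT-LIBERO-FUSER-SFT_BF16",
    "RLinf-Pi05-SFT",
]

def missing_checkpoints(root="weights"):
    """Return the expected checkpoint directories missing under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_dir()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint dirs:", ", ".join(missing))
    else:
        print("All expected checkpoints found.")
```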

---

## Dataset Notes

The SA-VLA experiments rely on LIBERO-family data and benchmark configurations.  
To switch between subset and full-set benchmarks, modify the benchmark mapping in your OpenPi LIBERO installation as documented in the main repository.

---

## Limitations

- Requires non-trivial robotics simulation setup
- Performance depends on environment/version consistency
- Not intended for safety-critical real-world deployment without additional validation

---

## Citation

```bibtex
@misc{pan2026savlaspatiallyawareflowmatchingvisionlanguageaction,
  title={SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning},
  author={Xu Pan and Zhenglin Wan and Xingrui Yu and Xianwei Zheng and Youkai Ke and Ming Sun and Rui Wang and Ziwei Wang and Ivor Tsang},
  year={2026},
  eprint={2602.00743},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2602.00743}
}
```

---

## License

Apache-2.0

---

## Acknowledgments

Built upon:
- RLinf: https://github.com/RLinf/RLinf
- OpenPi: https://github.com/Physical-Intelligence/openpi
- LIBERO: https://github.com/Lifelong-Robot-Learning/LIBERO
- LIBERO-PLUS: https://github.com/sylvestf/LIBERO-plus
- VGGT: https://github.com/facebookresearch/vggt