---
license: apache-2.0
language:
- en
- zh
library_name: transformers
tags:
- robotics
- vision-language-action
- reinforcement-learning
- embodied-ai
- openpi
- rlinf
pipeline_tag: reinforcement-learning
---
# SA-VLA: Spatially-Aware Reinforcement Learning for Flow-Matching VLA Models
SA-VLA is a spatially-aware reinforcement learning approach for flow-matching Vision-Language-Action (VLA) models.
It is developed on top of the RLinf framework and targets robust embodied manipulation with stronger spatial generalization.
- 📄 Paper: https://arxiv.org/abs/2602.00743
- 🌐 Project Page: https://xupan.top/Projects/savla
- 🧩 Codebase: https://github.com/TwSphinx54/SA-VLA
- 🏗️ RL Framework: https://github.com/RLinf/RLinf
---
## Model Summary
SA-VLA fuses visual tokens and spatial tokens into geometry-aware embeddings, then optimizes the policy via:
1. **Step-level dense rewards**
2. **Spatially-conditioned exploration (SCAN)**
3. **RL fine-tuning on embodied benchmarks**
This repository provides model weights used in SA-VLA experiments.
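The fusion step above can be sketched in miniature. This is an illustrative example only, assuming per-token alignment between visual and spatial features; the function name `fuse_tokens` and the feature shapes are hypothetical, and the actual fuser module lives in the SA-VLA codebase.

```python
def fuse_tokens(visual_tokens, spatial_tokens):
    """Concatenate each visual token with its aligned spatial token.

    visual_tokens:  list of per-token feature vectors (lists of floats)
    spatial_tokens: list of per-token feature vectors, one per visual token
    Returns geometry-aware embeddings whose dimension is the sum of both.
    """
    assert len(visual_tokens) == len(spatial_tokens), "tokens must be aligned"
    return [v + s for v, s in zip(visual_tokens, spatial_tokens)]

# Two tokens with 2-dim visual features and 1-dim spatial features.
visual = [[0.1, 0.2], [0.3, 0.4]]
spatial = [[1.0], [2.0]]
print(fuse_tokens(visual, spatial))  # [[0.1, 0.2, 1.0], [0.3, 0.4, 2.0]]
```

In the real model the fused embeddings would then be consumed by the flow-matching policy; this sketch only shows the alignment-and-concatenate idea.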
---
## Intended Use
- RL fine-tuning and evaluation for embodied manipulation tasks
- Experiments on LIBERO / LIBERO-PLUS style benchmarks
- Research on spatial reasoning in VLA post-training
> For complete environment setup, training scripts, and benchmark integration, use the full code repository:
> https://github.com/TwSphinx54/SA-VLA
---
## Quick Start (with SA-VLA codebase)
### 1) Clone project
```bash
git clone https://github.com/TwSphinx54/SA-VLA.git
cd SA-VLA
```
### 2) Setup environment
Follow the RLinf setup in:
- `README.RLinf.md` (framework/environment)
- `scripts/setup_container.sh` (extra container setup)
### 3) Place weights
Put downloaded checkpoints under:
```text
weights/
```
### 4) Run training / evaluation
```bash
# RL training
bash examples/embodiment/run_embodiment.sh libero_spatial_ppo_openpi_pi05
# Evaluation
bash examples/embodiment/eval_embodiment.sh libero_spatial_ppo_openpi_pi05_eval
```
---
## Recommended Weight Layout
```text
weights
|-- Pi05-LIBERO
|-- Pi05-VGGT-LIBERO-FUSER-SFT_BF16
`-- RLinf-Pi05-SFT
```
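A quick sanity check before launching training can catch a misplaced checkpoint early. This is a hypothetical helper script, not part of the SA-VLA codebase; the directory names are taken from the layout above.

```python
from pathlib import Path

# Checkpoint directories listed in the recommended layout above.
EXPECTED = [
    "Pi05-LIBERO",
    "Pi05-VGGT-LIBERO-FUSER-SFT_BF16",
    "RLinf-Pi05-SFT",
]


def missing_checkpoints(weights_dir="weights"):
    """Return the expected checkpoint directories absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint dirs:", ", ".join(missing))
    else:
        print("Weight layout OK")
```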
---
## Dataset Notes
The SA-VLA experiments rely on LIBERO-family data and benchmark configs.
To switch between subset and full-set evaluation, edit the benchmark mapping in your OpenPi LIBERO installation, as documented in the main repository.
---
## Limitations
- Requires non-trivial robotics simulation setup
- Performance depends on environment/version consistency
- Not intended for safety-critical real-world deployment without additional validation
---
## Citation
```bibtex
@misc{pan2026savlaspatiallyawareflowmatchingvisionlanguageaction,
title={SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning},
author={Xu Pan and Zhenglin Wan and Xingrui Yu and Xianwei Zheng and Youkai Ke and Ming Sun and Rui Wang and Ziwei Wang and Ivor Tsang},
year={2026},
eprint={2602.00743},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.00743}
}
```
---
## License
Apache-2.0
---
## Acknowledgments
Built upon:
- RLinf: https://github.com/RLinf/RLinf
- OpenPi: https://github.com/Physical-Intelligence/openpi
- LIBERO: https://github.com/Lifelong-Robot-Learning/LIBERO
- LIBERO-PLUS: https://github.com/sylvestf/LIBERO-plus
- VGGT: https://github.com/facebookresearch/vggt