# NS-VLA: Neuro-Symbolic Vision-Language-Action Model

## Model Description
NS-VLA is a neuro-symbolic Vision-Language-Action framework that combines symbolic reasoning with neural control for robotic manipulation. The model introduces:
- **Symbolic Encoder**: extracts structured manipulation primitives from vision-language inputs
- **Symbolic Solver**: lightweight action generator with visual token sparsification
- **Online RL**: GRPO-based optimization with primitive-segmented rewards
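The dataflow through the first two components can be sketched as follows. Everything here is illustrative: the function names, the verb-keyed primitive extraction, and the 7-DoF action placeholders are assumptions for exposition, not the released API (the real encoder and solver are trained networks conditioned on vision as well as language).

```python
# Toy sketch of the encoder -> solver dataflow; all names are hypothetical.

def symbolic_encoder(instruction: str) -> list[str]:
    # Extract a sequence of manipulation primitives from the instruction.
    # A trained model would also condition on the image; this toy version
    # keys off verbs purely for illustration.
    primitives = []
    if "pick" in instruction:
        primitives += ["reach", "grasp", "lift"]
    if "place" in instruction or "put" in instruction:
        primitives += ["move", "release"]
    return primitives

def symbolic_solver(primitive: str, visual_tokens: list[float]) -> list[list[float]]:
    # Map one primitive plus (sparsified) visual tokens to a chunk of
    # H=8 continuous end-effector actions (zero placeholders here;
    # a 7-DoF action per step is an assumption).
    H = 8
    return [[0.0] * 7 for _ in range(H)]

primitives = symbolic_encoder("pick up the red mug and place it on the shelf")
plan = [symbolic_solver(p, visual_tokens=[0.0] * 16) for p in primitives]
print(primitives)  # ['reach', 'grasp', 'lift', 'move', 'release']
```

The key structural point is that the instruction is first lifted into a discrete primitive sequence, and the continuous action generator is invoked once per primitive rather than once per timestep.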
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3-VL-2B + Symbolic Classifier + Action Generator |
| Parameters | ~2B (VLM backbone frozen) |
| Training | Stage I: Supervised Pretraining → Stage II: Online RL (GRPO) |
| Input | RGB image (224×224) + natural language instruction |
| Output | Continuous end-effector actions (chunked, H=8) |
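Chunked output (H=8) means each policy query returns eight consecutive actions. Controllers typically execute some prefix of the chunk before re-querying (receding-horizon execution); whether NS-VLA executes the full chunk or replans early is not specified here, so the `EXECUTE_K` value below is an assumption and the snippet only shows the bookkeeping.

```python
# Receding-horizon execution of chunked actions (H=8 from the model card).
from collections import deque

H = 8            # chunk horizon
EXECUTE_K = 4    # steps executed before re-querying the policy (assumed)
EPISODE_LEN = 12

def predict_chunk(step: int) -> list[int]:
    # Stand-in for the policy: returns H action IDs starting at `step`.
    return list(range(step, step + H))

executed, step = [], 0
while step < EPISODE_LEN:
    chunk = deque(predict_chunk(step))   # fresh chunk from the policy
    for _ in range(EXECUTE_K):           # execute only the first K steps
        if step >= EPISODE_LEN:
            break
        executed.append(chunk.popleft())
        step += 1
print(executed)  # [0, 1, 2, ..., 11] -- every timestep covered, no gaps
```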
## Performance
| Benchmark | Setting | Success Rate (%) |
|---|---|---|
| LIBERO | Full demonstrations | 98.6 |
| LIBERO | 1-shot (one demo per task) | 69.1 |
| LIBERO-Plus | Zero-shot generalization | 79.4 |
| CALVIN ABC→D | Zero-shot 5-task chain | 91.2 |
## Usage

> ⚠️ **Note:** Model weights will be released upon paper acceptance. Please check back soon.
```python
# Example usage (coming soon)
from nsvla import NSVLAAgent

agent = NSVLAAgent.from_pretrained("Zuzuzzy/NS-VLA")
action = agent.predict(image=obs, instruction="pick up the red mug")
```
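Since the weights are not yet available, a closed-loop rollout using the `predict` interface above can only be sketched with stand-ins. The `MockAgent` and `MockEnv` classes below are placeholders for the real agent and simulator; the chunk size of 8 follows the model card, everything else is assumed.

```python
# Closed-loop rollout against the predict(image, instruction) interface.
# MockAgent stands in for the unreleased weights; the real agent would be
# loaded with NSVLAAgent.from_pretrained("Zuzuzzy/NS-VLA").

class MockAgent:
    def predict(self, image, instruction):
        # Return one chunk of H=8 actions (7-DoF placeholders, assumed).
        return [[0.0] * 7 for _ in range(8)]

class MockEnv:
    def reset(self):
        return "obs-0"             # stand-in for a 224x224 RGB frame

    def step(self, action):
        return "obs-next", False   # (next observation, done flag)

agent, env = MockAgent(), MockEnv()
obs, steps = env.reset(), 0
for _ in range(3):                 # query the policy three times
    chunk = agent.predict(image=obs, instruction="pick up the red mug")
    for action in chunk:           # execute the whole chunk open-loop
        obs, done = env.step(action)
        steps += 1
print(steps)  # 24 environment steps from 3 policy queries
```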
## Citation

```bibtex
@article{zhu2026nsvla,
  title={NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models},
  author={Zhu, Ziyue and Wu, Shangyang and Zhao, Shuai and Zhao, Zhiqiu and Li, Shengjie and Wang, Yi and Li, Fang and Luo, Haoran},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```
## License
This model is released under the Apache 2.0 License.