# NS-VLA: Neuro-Symbolic Vision-Language-Action Model


## Model Description

NS-VLA is a neuro-symbolic Vision-Language-Action framework that combines symbolic reasoning with neural control for robotic manipulation. The model introduces:

  • Symbolic Encoder: Extracts structured manipulation primitives from vision-language inputs
  • Symbolic Solver: Lightweight action generator with visual token sparsification
  • Online RL: GRPO-based optimization with primitive-segmented rewards
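The three components above form a pipeline: structured primitives are extracted from the instruction, then consumed by the action generator. A toy sketch of the symbolic-encoding step is shown below; `Primitive` and `symbolic_encode` are illustrative names invented here (the actual model extracts primitives with the VLM backbone, not string matching):

```python
from dataclasses import dataclass

@dataclass
class Primitive:
    name: str    # manipulation primitive, e.g. "approach", "grasp"
    target: str  # object referenced in the instruction

def symbolic_encode(instruction: str) -> list[Primitive]:
    """Toy stand-in for the Symbolic Encoder: map an instruction to a
    sequence of structured manipulation primitives. The real encoder is
    learned; this rule-based version only illustrates the interface."""
    if "pick up the " in instruction:
        obj = instruction.split("pick up the ")[-1].rstrip(".")
        return [Primitive("approach", obj), Primitive("grasp", obj)]
    return []
```

Each primitive then conditions the Symbolic Solver, which emits the continuous action chunk.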

## Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL-2B + Symbolic Classifier + Action Generator |
| Parameters | ~2B (VLM backbone frozen) |
| Training | Stage I: Supervised Pretraining → Stage II: Online RL (GRPO) |
| Input | RGB image (224×224) + natural language instruction |
| Output | Continuous end-effector actions (chunked, H=8) |
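Stage II uses GRPO, whose core idea is to compute advantages relative to a group of rollouts rather than a learned value baseline. A minimal sketch of that group-relative normalization is below (the paper's primitive-segmented reward shaping is specific to NS-VLA and not reproduced here; `grpo_advantages` is an illustrative name):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: normalize each
    rollout's reward by the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

The normalized advantages then weight the policy-gradient update for each rollout in the group.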

## Performance

| Benchmark | Setting | Success Rate (%) |
|---|---|---|
| LIBERO | Full demonstrations | 98.6 |
| LIBERO | 1-shot (one demo per task) | 69.1 |
| LIBERO-Plus | Zero-shot generalization | 79.4 |
| CALVIN ABC→D | Zero-shot 5-task chain | 91.2 |

## Usage

> ⚠️ **Note:** Model weights will be released upon paper acceptance. Please check back soon.

```python
# Example usage (coming soon)
from nsvla import NSVLAAgent

agent = NSVLAAgent.from_pretrained("Zuzuzzy/NS-VLA")
action = agent.predict(image=obs, instruction="pick up the red mug")
```
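Since the model outputs chunked actions (H=8 per the table above), a control loop typically executes part or all of each chunk before replanning. The sketch below uses a dummy agent because the real weights are not yet released; `DummyAgent` and `rollout` are hypothetical names, and the 7-DoF action dimension is an assumption:

```python
import numpy as np

H = 8  # action chunk horizon, per the model card

class DummyAgent:
    """Stand-in for NSVLAAgent while weights are unreleased."""
    def predict(self, image, instruction):
        # Returns a chunk of H end-effector actions (7-DoF assumed).
        return np.zeros((H, 7))

def rollout(agent, env_steps=20, replan_every=H):
    """Receding-horizon execution: query a chunk, execute up to
    `replan_every` actions from it, then replan."""
    executed, t = [], 0
    while t < env_steps:
        chunk = agent.predict(image=None, instruction="pick up the red mug")
        for a in chunk[:replan_every]:
            executed.append(a)
            t += 1
            if t >= env_steps:
                break
    return np.stack(executed)
```

Replanning more often than every H steps trades compute for reactivity; executing the full chunk is the cheapest option.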

## Citation

```bibtex
@article{zhu2026nsvla,
  title={NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models},
  author={Zhu, Ziyue and Wu, Shangyang and Zhao, Shuai and Zhao, Zhiqiu and Li, Shengjie and Wang, Yi and Li, Fang and Luo, Haoran},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## License

This model is released under the Apache 2.0 License.
