# NS-VLA: Neuro-Symbolic Vision-Language-Action Model


## Model Description

NS-VLA is a neuro-symbolic Vision-Language-Action framework that combines symbolic reasoning with neural control for robotic manipulation. The model introduces:

  • Symbolic Encoder: Extracts structured manipulation primitives from vision-language inputs
  • Symbolic Solver: Lightweight action generator with visual token sparsification
  • Online RL: GRPO-based optimization with primitive-segmented rewards
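The three components above form a pipeline: structured primitives are extracted from the instruction, then consumed by the action generator. A toy sketch of the symbolic-encoding step is shown below; `Primitive` and `symbolic_encode` are illustrative names invented here (the actual model extracts primitives with the VLM backbone, not string matching):

```python
from dataclasses import dataclass

@dataclass
class Primitive:
    name: str    # manipulation primitive, e.g. "approach", "grasp"
    target: str  # object referenced in the instruction

def symbolic_encode(instruction: str) -> list[Primitive]:
    """Toy stand-in for the Symbolic Encoder: map an instruction to a
    sequence of structured manipulation primitives. The real encoder is
    learned; this rule-based version only illustrates the interface."""
    if "pick up the " in instruction:
        obj = instruction.split("pick up the ")[-1].rstrip(".")
        return [Primitive("approach", obj), Primitive("grasp", obj)]
    return []
```

Each primitive then conditions the Symbolic Solver, which emits the continuous action chunk.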

## Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3-VL-2B + Symbolic Classifier + Action Generator |
| Parameters | ~2B (VLM backbone frozen) |
| Training | Stage I: Supervised Pretraining → Stage II: Online RL (GRPO) |
| Input | RGB image (224×224) + natural language instruction |
| Output | Continuous end-effector actions (chunked, H=8) |
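Stage II uses GRPO, whose core idea is to compute advantages relative to a group of rollouts rather than a learned value baseline. A minimal sketch of that group-relative normalization is below (the paper's primitive-segmented reward shaping is specific to NS-VLA and not reproduced here; `grpo_advantages` is an illustrative name):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: normalize each
    rollout's reward by the mean and std of its sampling group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

The normalized advantages then weight the policy-gradient update for each rollout in the group.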

## Performance

| Benchmark | Setting | Success Rate (%) |
|---|---|---|
| LIBERO | Full demonstrations | 98.6 |
| LIBERO | 1-shot (one demo per task) | 69.1 |
| LIBERO-Plus | Zero-shot generalization | 79.4 |
| CALVIN ABC→D | Zero-shot 5-task chain | 91.2 |

## Usage

> ⚠️ **Note:** Model weights will be released upon paper acceptance. Please check back soon.

```python
# Example usage (coming soon)
from nsvla import NSVLAAgent

agent = NSVLAAgent.from_pretrained("Zuzuzzy/NS-VLA")
action = agent.predict(image=obs, instruction="pick up the red mug")
```
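Since the model outputs chunked actions (H=8 per the table above), a control loop typically executes part or all of each chunk before replanning. The sketch below uses a dummy agent because the real weights are not yet released; `DummyAgent` and `rollout` are hypothetical names, and the 7-DoF action dimension is an assumption:

```python
import numpy as np

H = 8  # action chunk horizon, per the model card

class DummyAgent:
    """Stand-in for NSVLAAgent while weights are unreleased."""
    def predict(self, image, instruction):
        # Returns a chunk of H end-effector actions (7-DoF assumed).
        return np.zeros((H, 7))

def rollout(agent, env_steps=20, replan_every=H):
    """Receding-horizon execution: query a chunk, execute up to
    `replan_every` actions from it, then replan."""
    executed, t = [], 0
    while t < env_steps:
        chunk = agent.predict(image=None, instruction="pick up the red mug")
        for a in chunk[:replan_every]:
            executed.append(a)
            t += 1
            if t >= env_steps:
                break
    return np.stack(executed)
```

Replanning more often than every H steps trades compute for reactivity; executing the full chunk is the cheapest option.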

## Citation

```bibtex
@article{zhu2026nsvla,
  title={NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models},
  author={Zhu, Ziyue and Wu, Shangyang and Zhao, Shuai and Zhao, Zhiqiu and Li, Shengjie and Wang, Yi and Li, Fang and Luo, Haoran},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

## License

This model is released under the Apache 2.0 License.
