---
license: mit
pipeline_tag: robotics
---

# Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
This repository contains the Vlaser model, introduced in the paper Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning.
- Project Page: https://internvl.github.io/blog/2025-10-11-Vlaser/
- Code: https://github.com/OpenGVLab/Vlaser/
## Introduction
While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser, a Vision-Language-Action model with synergistic embodied reasoning: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Trained on the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodiment-specific policy-learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
## News
- **2025-10-13**: 🤖 We release the Vlaser VLM models (Vlaser-2B and Vlaser-8B) as well as the VLA model (Vlaser-2B-VLA) on 🤗 Vlaser.
- **2025-10-13**: 🤖 We release the training and inference code of the Vlaser VLM, based on InternVL3.
## Quick Start
For details on Vlaser VLM, please refer to the Vlaser VLM Quick Start Guide in the GitHub repository.
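The sketch below shows one way inference could look, assuming Vlaser exposes the InternVL3-style `model.chat()` interface through `trust_remote_code`. The repo id `OpenGVLab/Vlaser-2B`, the example image path, and the simplified single-tile preprocessing are assumptions; the Quick Start Guide in the GitHub repository remains the authoritative reference (it uses dynamic multi-tile preprocessing).

```python
# Minimal inference sketch, assuming an InternVL3-style chat interface.
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "OpenGVLab/Vlaser-2B"  # assumed repo id; check the official collection

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# Single 448x448 tile with ImageNet normalization (simplified preprocessing;
# the official guide tiles the image dynamically).
transform = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

# Embodied QA-style prompt; the <image> placeholder marks where the image is inserted.
question = "<image>\nWhich object should the robot pick up first, and why?"
generation_config = dict(max_new_tokens=512, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```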
## Citation
If you find this work helpful in your research, please consider citing our paper:
```bibtex
@article{luo2025visual,
  title={Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces},
  author={Luo, Gen and Yang, Ganlin and Gong, Ziyang and Chen, Guanzhou and Duan, Haonan and Cui, Erfei and Tong, Ronglei and Hou, Zhi and Zhang, Tianyi and Chen, Zhe and others},
  journal={arXiv preprint arXiv:2506.00123},
  year={2025}
}
```