---
license: mit
pipeline_tag: robotics
---
# Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
This repository contains the Vlaser model, introduced in the paper [Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning](https://huggingface.co/papers/2510.11027).
**Project Page**: [https://internvl.github.io/blog/2025-10-11-Vlaser/](https://internvl.github.io/blog/2025-10-11-Vlaser/)
**Code**: [https://github.com/OpenGVLab/Vlaser/](https://github.com/OpenGVLab/Vlaser/)
## Introduction
While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing **Vlaser** -- a **V**ision-**L**anguage-**A**ction model with **s**ynergistic **e**mbodied **r**easoning capability: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality **Vlaser-6M** dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning.
Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
## News
- **`2025-10-13`**: 🤖 We release the Vlaser VLM models (Vlaser-2B and Vlaser-8B) as well as the VLA model (Vlaser-2B-VLA) on [🤗Vlaser](https://huggingface.co/collections/OpenGVLab/vlaser-68e9fd4178da453c348997f8).
- **`2025-10-13`**: 🤖 We release the training and inference code of Vlaser VLM based on [InternVL3](https://github.com/OpenGVLab/InternVL).
## Quick Start
For details on Vlaser VLM, please refer to the [Vlaser VLM Quick Start Guide](https://github.com/OpenGVLab/Vlaser/tree/main/Vlaser_VLM) in the GitHub repository.
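Since Vlaser's VLM checkpoints are built on InternVL3, they can typically be loaded through the standard 🤗 Transformers remote-code path. The snippet below is a minimal, text-only sketch, assuming the released weights follow the InternVL3 `model.chat` interface and are published under an ID such as `OpenGVLab/Vlaser-2B` (illustrative; check the collection linked above for the exact repo name). For image inputs and the authoritative usage, please follow the Quick Start Guide.

```python
# Minimal sketch: load a Vlaser VLM checkpoint and run a text-only query.
# Assumptions: the checkpoint follows the InternVL3 remote-code chat API,
# and "OpenGVLab/Vlaser-2B" is the (hypothetical here) repo ID.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Vlaser-2B"  # assumed repo ID; see the Vlaser collection
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Text-only embodied-reasoning query; image inputs require the preprocessing
# utilities shipped with the Quick Start Guide / InternVL3 model card.
generation_config = dict(max_new_tokens=512, do_sample=False)
question = "Plan the steps to pick up the red block and place it in the bowl."
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```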
## Citation
If you find this work helpful in your research, please consider citing our paper:
```bibtex
@article{luo2025visual,
  title   = {Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces},
  author  = {Luo, Gen and Yang, Ganlin and Gong, Ziyang and Chen, Guanzhou and Duan, Haonan and Cui, Erfei and Tong, Ronglei and Hou, Zhi and Zhang, Tianyi and Chen, Zhe and others},
  journal = {arXiv preprint arXiv:2506.00123},
  year    = {2025}
}
```