---
license: cc-by-nc-sa-4.0
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- robotics
- vision-language-action-model
datasets:
- InternRobotics/InternData-A1
---

# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
[![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/pdf/2601.02456) [![Code](https://img.shields.io/badge/GitHub-Code-800820?logo=github)](https://github.com/InternRobotics/InternVLA-A1) [![Data](https://img.shields.io/badge/Data-HuggingFace-blue?logo=huggingface)](https://huggingface.co/datasets/InternRobotics/InternData-A1) [![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://internrobotics.github.io/internvla-a1.github.io/)

InternVLA-A1 integrates understanding, generation, and action experts into a unified model that synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution. Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at the 2B and 3B parameter scales.

Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:

- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation dataset [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., Agibot-World)
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only

## 🔑 Key Features

Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design that unifies scene understanding, visual foresight, and action execution in a single framework. It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions, as sketched below.
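To make the MoT idea concrete, here is a minimal, purely illustrative sketch of one layer with three experts (understanding, generation, action) that keep private projection/FFN weights but share a single global attention over the concatenated token sequence. All names, dimensions, and the expert split are assumptions for exposition, not the released architecture.

```python
# Illustrative Mixture-of-Transformers layer: per-expert parameters,
# shared global attention. Not the released InternVLA-A1 architecture.
import torch
import torch.nn as nn


class ExpertBlock(nn.Module):
    """Private weights for one expert: QKV/output projections and FFN."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)


class MoTLayer(nn.Module):
    """One MoT layer: expert-specific weights, shared attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.dim, self.heads = dim, heads
        # One expert per token stream: scene understanding, visual
        # foresight (future-frame generation), and action decoding.
        self.experts = nn.ModuleDict(
            {name: ExpertBlock(dim) for name in ("understand", "generate", "act")}
        )

    def forward(self, streams):
        # Project each stream to q/k/v with that expert's private weights.
        qs, ks, vs, lengths = [], [], [], []
        for name, x in streams.items():
            q, k, v = self.experts[name].qkv(self.experts[name].norm1(x)).chunk(3, -1)
            qs.append(q)
            ks.append(k)
            vs.append(v)
            lengths.append((name, x.shape[1]))
        # Shared attention over the concatenated sequence lets action tokens
        # attend to both the semantic tokens and the "imagined" future.
        q, k, v = (torch.cat(t, dim=1) for t in (qs, ks, vs))
        B, L, D = q.shape
        split = lambda t: t.view(B, L, self.heads, D // self.heads).transpose(1, 2)
        attn = torch.nn.functional.scaled_dot_product_attention(
            split(q), split(k), split(v)
        ).transpose(1, 2).reshape(B, L, D)
        # Route the attended tokens back through each expert's output path.
        out, offset = {}, 0
        for name, n in lengths:
            e, h = self.experts[name], attn[:, offset : offset + n]
            x = streams[name] + e.proj(h)
            out[name] = x + e.ffn(e.norm2(x))
            offset += n
        return out


# Toy forward pass: 32 scene tokens, 16 future-frame tokens, 8 action tokens.
layer = MoTLayer()
tokens = {name: torch.randn(2, n, 256)
          for name, n in [("understand", 32), ("generate", 16), ("act", 8)]}
outputs = layer(tokens)  # same keys and shapes back
```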
Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., Agibot-World). This hybrid synthetic-real pre-training strategy combines the scene diversity of simulation with the physical fidelity of real-world data; a toy sketch of the mixing idea follows.
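The sketch below illustrates the general idea of mixing simulation and real trajectories at a fixed ratio when forming pre-training batches. The 70/30 ratio, function name, and dataset handles are placeholders, not the released training recipe.

```python
# Toy sketch of hybrid synthetic-real batch sampling; the mixture ratio
# and dataset handles are placeholders for exposition only.
import random


def sample_batch(sim_trajs, real_trajs, batch_size=32, sim_ratio=0.7):
    """Draw a batch whose expected composition is `sim_ratio` simulation data."""
    batch = []
    for _ in range(batch_size):
        pool = sim_trajs if random.random() < sim_ratio else real_trajs
        batch.append(random.choice(pool))
    return batch


# Placeholder trajectory IDs standing in for InternData-A1 and Agibot-World.
sim = [f"interndata_a1/traj_{i}" for i in range(1000)]
real = [f"agibot_world/traj_{i}" for i in range(200)]
batch = sample_batch(sim, real)
```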
## Usage

Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1); a hypothetical loading sketch is shown below.
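For orientation only, this sketch assumes a Hugging Face remote-code interface. The actual model class, processor, and inference entry points are defined in the official repo, so treat every call here as an assumption rather than a confirmed API.

```python
# Hypothetical loading sketch; the supported entry points live in the
# official InternVLA-A1 repo, and this interface is an assumption.
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "InternRobotics/InternVLA-A1-3B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # assumes model code ships with the checkpoint
).eval()
processor = AutoProcessor.from_pretrained(
    "InternRobotics/InternVLA-A1-3B", trust_remote_code=True
)
```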

## Demonstrations

### ⚡ Dynamic Manipulation

InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.

### 🤖 Daily Tasks

InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.

## License and Citation

All code in this repo is released under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.

```BibTeX
@article{contributors2026internvla_a1,
    title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
    author={InternVLA-A1 contributors},
    journal={arXiv preprint arXiv:2601.02456},
    year={2026}
}
```

## Acknowledgments

- [LeRobot](https://github.com/huggingface/lerobot)
- [openpi](https://github.com/Physical-Intelligence/openpi)
- [InternVL](https://github.com/OpenGVLab/InternVL)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- [Cosmos](https://github.com/nvidia-cosmos)