[Paper](https://arxiv.org/pdf/2601.02456) | [Code](https://github.com/InternRobotics/InternVLA-A1) | [Dataset](https://huggingface.co/datasets/InternRobotics/InternData-A1) | [Project Page](https://internrobotics.github.io/internvla-a1.github.io/)
InternVLA-A1 integrates understanding, generation, and action experts via a Mixture-of-Transformers (MoT) framework, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.
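To make the expert-routing idea concrete, here is a minimal, illustrative sketch (not the actual InternVLA-A1 implementation): each token in a shared stream carries a modality tag and is processed by the matching expert. In the real MoT framework the experts are full transformer stacks attending over a joint sequence; here each expert is reduced to a single linear projection, and all names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size of the shared token stream

# One toy "expert" per modality. In the real MoT architecture each expert
# is a full transformer stack; here it is a single projection matrix.
experts = {m: rng.standard_normal((D, D)) / np.sqrt(D)
           for m in ("understanding", "generation", "action")}

def mot_forward(tokens, modalities):
    """Route each token to the expert matching its modality tag.

    tokens:     (N, D) array of token embeddings
    modalities: length-N list of tags drawn from experts.keys()
    """
    out = np.empty_like(tokens)
    for name, W in experts.items():
        mask = np.array([m == name for m in modalities])
        if mask.any():
            out[mask] = tokens[mask] @ W  # modality-specific transform
    return out

tokens = rng.standard_normal((6, D))
tags = ["understanding", "understanding", "generation",
        "generation", "action", "action"]
y = mot_forward(tokens, tags)
```

The key design point this sketch captures is that all modalities share one token sequence while parameters stay modality-specific, which is what lets semantic reasoning and dynamics prediction inform the action expert.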
Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:
- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation data [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., AgiBot-World)
- [x] [InternVLA-A1-3B-RoboTwin](https://huggingface.co/InternRobotics/InternVLA-A1-3B-RoboTwin): finetuned on RoboTwin 2.0 benchmark
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only
## **Evaluation on RoboTwin 2.0 Simulation Benchmark**
**Setting:** All models are jointly fine-tuned across 50 tasks (50 clean + 500 randomized demos per task).
**Performance Summary:** InternVLA-A1-3B achieves the highest success rates across both Easy and Hard settings on the RoboTwin 2.0 Benchmark (averaged over 50 tasks).
| Metric | pi0 | pi0.5 | **InternVLA-A1-3B** |
| :--- | :---: | :---: | :---: |
| Avg. Success (Easy) | 79.98% | 86.76% | **88.30%** |
| Avg. Success (Hard) | 79.50% | 86.96% | **88.48%** |
## 🔑 Key Features
- 🔮 *The Core: Synergizes the MLLM's semantic understanding with world-model-style dynamics prediction, enabling the model to "imagine" the future and guide adaptive actions.*
- 🚀 *The Fuel: Enables joint training on heterogeneous data sources spanning real-world robot data, synthetic simulation data, and egocentric human videos.*
- ⚡ *The Output: Tackles highly dynamic scenarios with effortless mastery.*
## Usage
Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1).
## Demonstrations
**InternVLA-A1** exhibits consistent robustness across static manipulation, dynamic manipulation, and simulation benchmarks, especially demonstrating remarkable superiority in dynamic scenarios.
### ⚡ Dynamic Manipulation Tasks
InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.
### 🤖 Static Manipulation Tasks
InternVLA-A1 demonstrates superior proficiency in dexterous and fine-grained manipulation.
## License and Citation
All the code within this repo is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.
```BibTeX
@article{internvla_a1,
title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
author={Cai, Junhao and Cai, Zetao and Cao, Jiafei and Chen, Yilun and He, Zeyu and Jiang, Lei and Li, Hang and Li, Hengjie and Li, Yang and Liu, Yufei and others},
journal={arXiv preprint arXiv:2601.02456},
year={2026}
}
```
## Acknowledgments
- [LeRobot](https://github.com/huggingface/lerobot)
- [openpi](https://github.com/Physical-Intelligence/openpi)
- [InternVL](https://github.com/OpenGVLab/InternVL)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- [COSMOS](https://github.com/nvidia-cosmos)