InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
InternVLA-A1 integrates understanding, generation, and action experts into a unified model, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.
Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:
- InternVLA-A1-3B: pretrained on the large-scale, high-fidelity simulation data InternData-A1, together with open-source robot data (e.g., Agibot-World)
- InternVLA-A1-3B-RoboTwin: fine-tuned on the RoboTwin 2.0 benchmark
- InternVLA-A1-3B-Pretrain-InternData-A1: pretrained on InternData-A1 only
- InternVLA-A1-2B-Pretrain-InternData-A1: pretrained on InternData-A1 only
Evaluation on RoboTwin 2.0 Simulation Benchmark
Setting: All models are jointly fine-tuned across 50 tasks (50 clean + 500 randomized demos each).
Performance Summary: InternVLA-A1-3B achieves the highest success rates across both Easy and Hard settings on the RoboTwin 2.0 Benchmark (averaged over 50 tasks).
| Metric | $\pi_0$ | $\pi_{0.5}$ | InternVLA-A1-3B |
|---|---|---|---|
| Avg. Success (Easy) | 79.98% | 84.70% | **88.30%** |
| Avg. Success (Hard) | 79.50% | 85.02% | **88.48%** |
Detailed per-task results are shown below, with success rates formatted as Easy / Hard.
| Task Name | $\pi_0$ | $\pi_{0.5}$ | InternVLA-A1-3B |
|---|---|---|---|
| Click Bell | 70.0% / 69.0% | 97.0% / 93.0% | 97.0% / 94.0% |
| Move Pillbottle Pad | 83.0% / 82.0% | 92.0% / 89.0% | 95.0% / 99.0% |
| Open Laptop | 90.0% / 97.0% | 92.0% / 97.0% | 99.0% / 99.0% |
| Handover Block | 70.0% / 53.0% | 60.0% / 59.0% | 87.0% / 81.0% |
| Blocks Ranking Size | 59.0% / 57.0% | 73.0% / 77.0% | 82.0% / 92.0% |
| Place Dual Shoes | 69.0% / 76.0% | 57.0% / 65.0% | 93.0% / 85.0% |
| Stamp Seal | 62.0% / 65.0% | 66.0% / 73.0% | 71.0% / 71.0% |
| Stack Bowls Three | 81.0% / 75.0% | 88.0% / 85.0% | 86.0% / 95.0% |
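The summary numbers above are plain means of the per-task success rates over all 50 tasks. A minimal sketch of that computation, seeded with a few rows from this table (the remaining tasks are omitted for brevity):

```python
# Average Easy / Hard success rates over per-task results, as in the
# summary table above. Only a few of the 50 tasks are listed here.
per_task = {
    "Click Bell":          (97.0, 94.0),
    "Move Pillbottle Pad": (95.0, 99.0),
    "Open Laptop":         (99.0, 99.0),
    # ... remaining tasks omitted for brevity
}

easy_avg = sum(e for e, _ in per_task.values()) / len(per_task)
hard_avg = sum(h for _, h in per_task.values()) / len(per_task)
print(f"Avg. Success (Easy): {easy_avg:.2f}%")
print(f"Avg. Success (Hard): {hard_avg:.2f}%")
```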
Key Features
Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design that unifies scene understanding, visual foresight, and action execution in a single framework. It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
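As a rough illustration of the MoT idea (a conceptual sketch, not the released implementation; the expert names, dimensions, and token routing below are assumptions), each expert keeps its own feed-forward weights while all token streams interact through shared self-attention:

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Sketch of one Mixture-of-Transformers block: shared self-attention
    over all tokens, with separate feed-forward weights per expert."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One feed-forward expert per stream: understanding / generation / action.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for name in ("understanding", "generation", "action")
        })

    def forward(self, tokens: torch.Tensor, routing: list[str]) -> torch.Tensor:
        # Shared attention: action tokens can read the generation expert's
        # "imagined" future and the understanding expert's semantic context.
        x = self.norm1(tokens)
        h = tokens + self.attn(x, x, x, need_weights=False)[0]
        # Route each token position through the FFN of its assigned expert.
        out = h.clone()
        for i, name in enumerate(routing):
            out[:, i] = h[:, i] + self.experts[name](self.norm2(h[:, i]))
        return out

block = MoTBlock()
tokens = torch.randn(2, 6, 1024)            # [batch, seq, dim]
routing = ["understanding", "understanding",
           "generation", "generation",
           "action", "action"]              # one expert per token position
print(block(tokens, routing).shape)         # torch.Size([2, 6, 1024])
```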
Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., Agibot-World). Our hybrid synthetic-real pre-training strategy combines the scene diversity of simulation with the physical fidelity of real-world data.
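A minimal sketch of the hybrid sampling idea; the mixing weights below are assumptions for illustration, not the released training configuration:

```python
import random

# Assumed mixing weights between synthetic and real data sources; the
# actual pre-training ratio is not specified in this card.
SOURCES = {
    "InternData-A1 (simulation)": 0.7,
    "Agibot-World (real)": 0.3,
}

def sample_sources(n: int) -> list[str]:
    """Draw n data sources according to the mixing weights."""
    names, weights = zip(*SOURCES.items())
    return random.choices(names, weights=weights, k=n)

# Each pre-training batch mixes episodes from both sources, pairing
# simulation's scene diversity with real-world physical fidelity.
print(sample_sources(8))
```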
Usage
Please refer to our official repo InternVLA-A1.
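The repo documents the full loading and inference pipeline. For fetching the checkpoint itself, a standard huggingface_hub download works (a sketch only; the model-loading code lives in the official repo):

```python
from huggingface_hub import snapshot_download

# Download the released checkpoint; see the official InternVLA-A1 repo
# for how to load the model and run inference afterwards.
local_dir = snapshot_download(repo_id="InternRobotics/InternVLA-A1-3B-RoboTwin")
print(f"Checkpoint downloaded to: {local_dir}")
```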
Demonstrations
Dynamic Manipulation
InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.
Daily Tasks
InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.
License and Citation
All code within this repo is released under CC BY-NC-SA 4.0. Please consider citing our project if it helps your research.
```bibtex
@article{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}
```
Acknowledgments