InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
InternVLA-A1 integrates understanding, generation, and action experts into a unified model, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.
Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:
- InternVLA-A1-3B: pretrained on the large-scale, high-fidelity simulation data InternData-A1, together with open-source robot data (e.g., Agibot-World)
- InternVLA-A1-3B-RoboTwin: fine-tuned on the RoboTwin 2.0 benchmark
- InternVLA-A1-3B-Pretrain-InternData-A1: pretrained on InternData-A1 only
- InternVLA-A1-2B-Pretrain-InternData-A1: pretrained on InternData-A1 only
Evaluation on RoboTwin 2.0 Simulation Benchmark
Setting: All models are jointly fine-tuned across 50 tasks (50 clean + 500 randomized demos each).
Performance Summary: InternVLA-A1-3B achieves the highest success rates across both Easy and Hard settings on the RoboTwin 2.0 Benchmark (averaged over 50 tasks).
| Metric | $\pi_0$ | $\pi_{0.5}$ | InternVLA-A1-3B |
|---|---|---|---|
| Avg. Success (Easy) | 79.98% | 84.70% | **88.30%** |
| Avg. Success (Hard) | 79.50% | 85.02% | **88.48%** |
Detailed per-task results are shown below, with success rates formatted as Easy / Hard.
| Task Name | $\pi_0$ | $\pi_{0.5}$ | InternVLA-A1-3B |
|---|---|---|---|
| Click Bell | 70.0% / 69.0% | 97.0% / 93.0% | 97.0% / 94.0% |
| Move Pillbottle Pad | 83.0% / 82.0% | 92.0% / 89.0% | 95.0% / 99.0% |
| Open Laptop | 90.0% / 97.0% | 92.0% / 97.0% | 99.0% / 99.0% |
| Handover Block | 70.0% / 53.0% | 60.0% / 59.0% | 87.0% / 81.0% |
| Blocks Ranking Size | 59.0% / 57.0% | 73.0% / 77.0% | 82.0% / 92.0% |
| Place Dual Shoes | 69.0% / 76.0% | 57.0% / 65.0% | 93.0% / 85.0% |
| Stamp Seal | 62.0% / 65.0% | 66.0% / 73.0% | 71.0% / 71.0% |
| Stack Bowls Three | 81.0% / 75.0% | 88.0% / 85.0% | 86.0% / 95.0% |
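The summary numbers above are plain means of the per-task success rates over all 50 tasks. A minimal sketch of that computation, seeded with a few rows from this table (the remaining tasks are omitted for brevity):

```python
# Average Easy / Hard success rates over per-task results, as in the
# summary table above. Only a few of the 50 tasks are listed here.
per_task = {
    "Click Bell":          (97.0, 94.0),
    "Move Pillbottle Pad": (95.0, 99.0),
    "Open Laptop":         (99.0, 99.0),
    # ... remaining tasks omitted for brevity
}

easy_avg = sum(e for e, _ in per_task.values()) / len(per_task)
hard_avg = sum(h for _, h in per_task.values()) / len(per_task)
print(f"Avg. Success (Easy): {easy_avg:.2f}%")
print(f"Avg. Success (Hard): {hard_avg:.2f}%")
```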
Key Features
Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design that unifies scene understanding, visual foresight, and action execution in a single framework. It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
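As a rough illustration of the MoT idea (a conceptual sketch, not the released implementation; the expert names, dimensions, and token routing below are assumptions), each expert keeps its own feed-forward weights while all token streams interact through shared self-attention:

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Sketch of one Mixture-of-Transformers block: shared self-attention
    over all tokens, with separate feed-forward weights per expert."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One feed-forward expert per stream: understanding / generation / action.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for name in ("understanding", "generation", "action")
        })

    def forward(self, tokens: torch.Tensor, routing: list[str]) -> torch.Tensor:
        # Shared attention: action tokens can read the generation expert's
        # "imagined" future and the understanding expert's semantic context.
        x = self.norm1(tokens)
        h = tokens + self.attn(x, x, x, need_weights=False)[0]
        # Route each token position through the FFN of its assigned expert.
        out = h.clone()
        for i, name in enumerate(routing):
            out[:, i] = h[:, i] + self.experts[name](self.norm2(h[:, i]))
        return out

block = MoTBlock()
tokens = torch.randn(2, 6, 1024)            # [batch, seq, dim]
routing = ["understanding", "understanding",
           "generation", "generation",
           "action", "action"]              # one expert per token position
print(block(tokens, routing).shape)         # torch.Size([2, 6, 1024])
```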
Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., Agibot-World). Our hybrid synthetic-real pre-training strategy combines the scene diversity of simulation with the physical fidelity of real-world data.
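A minimal sketch of the hybrid sampling idea; the mixing weights below are assumptions for illustration, not the released training configuration:

```python
import random

# Assumed mixing weights between synthetic and real data sources; the
# actual pre-training ratio is not specified in this card.
SOURCES = {
    "InternData-A1 (simulation)": 0.7,
    "Agibot-World (real)": 0.3,
}

def sample_sources(n: int) -> list[str]:
    """Draw n data sources according to the mixing weights."""
    names, weights = zip(*SOURCES.items())
    return random.choices(names, weights=weights, k=n)

# Each pre-training batch mixes episodes from both sources, pairing
# simulation's scene diversity with real-world physical fidelity.
print(sample_sources(8))
```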
Usage
Please refer to our official repo InternVLA-A1.
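The repo documents the full loading and inference pipeline. For fetching the checkpoint itself, a standard huggingface_hub download works (a sketch only; the model-loading code lives in the official repo):

```python
from huggingface_hub import snapshot_download

# Download the released checkpoint; see the official InternVLA-A1 repo
# for how to load the model and run inference afterwards.
local_dir = snapshot_download(repo_id="InternRobotics/InternVLA-A1-3B-RoboTwin")
print(f"Checkpoint downloaded to: {local_dir}")
```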
Demonstrations
Dynamic Manipulation
InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.
Daily Tasks
InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.
License and Citation
All code within this repo is released under CC BY-NC-SA 4.0. Please consider citing our project if it helps your research.
```bibtex
@article{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}
```
Acknowledgments