
InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

Teaser Image

Paper Code Data Website

InternVLA-A1 integrates understanding, generation, and action experts into a unified model, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.

Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the full InternVLA-A1 series.

Evaluation on RoboTwin 2.0 Simulation Benchmark

Setting: All models are jointly fine-tuned across 50 tasks (50 clean + 500 randomized demos each).

Performance Summary: InternVLA-A1-3B achieves the highest success rates across both Easy and Hard settings on the RoboTwin 2.0 Benchmark (averaged over 50 tasks).

| Metric | $\pi_0$ | $\pi_{0.5}$ | InternVLA-A1-3B |
| :-- | :-: | :-: | :-: |
| Avg. Success (Easy) | 79.98% | 84.70% | 88.30% 🥇 |
| Avg. Success (Hard) | 79.50% | 85.02% | 88.48% 🥇 |
🔻 Click to view detailed results for specific tasks

The table below shows success rates formatted as Easy / Hard.

| Task Name | $\pi_0$ | $\pi_{0.5}$ | InternVLA-A1-3B |
| :-- | :-: | :-: | :-: |
| Click Bell | 70.0% / 69.0% | 97.0% / 93.0% | 97.0% / 94.0% |
| Move Pillbottle Pad | 83.0% / 82.0% | 92.0% / 89.0% | 95.0% / 99.0% |
| Open Laptop | 90.0% / 97.0% | 92.0% / 97.0% | 99.0% / 99.0% |
| Handover Block | 70.0% / 53.0% | 60.0% / 59.0% | 87.0% / 81.0% |
| Blocks Ranking Size | 59.0% / 57.0% | 73.0% / 77.0% | 82.0% / 92.0% |
| Place Dual Shoes | 69.0% / 76.0% | 57.0% / 65.0% | 93.0% / 85.0% |
| Stamp Seal | 62.0% / 65.0% | 66.0% / 73.0% | 71.0% / 71.0% |
| Stack Bowls Three | 81.0% / 75.0% | 88.0% / 85.0% | 86.0% / 95.0% |
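As a quick sanity check, the per-task numbers above can be averaged in a few lines. Note this covers only the 8 showcased tasks, not the full 50-task benchmark, so the resulting figures differ from the summary averages; the script below is our illustration, not part of the release.

```python
# Average the Easy/Hard success rates of InternVLA-A1-3B over the
# 8 showcased RoboTwin 2.0 tasks (not the full 50-task benchmark).
rates = {
    "Click Bell":          (97.0, 94.0),
    "Move Pillbottle Pad": (95.0, 99.0),
    "Open Laptop":         (99.0, 99.0),
    "Handover Block":      (87.0, 81.0),
    "Blocks Ranking Size": (82.0, 92.0),
    "Place Dual Shoes":    (93.0, 85.0),
    "Stamp Seal":          (71.0, 71.0),
    "Stack Bowls Three":   (86.0, 95.0),
}

easy = [e for e, _ in rates.values()]
hard = [h for _, h in rates.values()]
avg_easy = sum(easy) / len(easy)  # 88.75 over these 8 tasks
avg_hard = sum(hard) / len(hard)  # 89.50 over these 8 tasks
print(f"Easy: {avg_easy:.2f}%  Hard: {avg_hard:.2f}%")
```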

🔑 Key Features

Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design that unifies scene understanding, visual foresight, and action execution in a single framework. It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.

Teaser Image
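The core MoT routing idea of giving each modality stream its own expert parameters behind one shared forward interface can be sketched in a few lines. The experts below are toy functions standing in for full transformer towers; the names and transforms are illustrative, not the released architecture.

```python
# Toy Mixture-of-Transformers routing: each modality stream is processed
# by its own expert parameters (here a stand-in bias per expert), while
# all streams share a single forward interface.
EXPERTS = {
    "understanding": lambda x: [v + 1.0 for v in x],  # toy expert weights
    "generation":    lambda x: [v + 2.0 for v in x],
    "action":        lambda x: [v + 3.0 for v in x],
}

def mot_forward(tokens):
    """Route each (modality, vector) token through that modality's expert."""
    return [(m, EXPERTS[m](x)) for m, x in tokens]

tokens = [("understanding", [0.0, 0.0]),
          ("generation",    [0.0, 0.0]),
          ("action",        [0.0, 0.0])]
out = mot_forward(tokens)
print(out)  # each stream transformed by its own expert
```

In the real model the experts are full transformer blocks and the streams still attend to one another, which is what lets semantic reasoning and dynamics prediction inform the action expert.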

Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., Agibot-World). This hybrid synthetic-real pre-training strategy combines the scene diversity of simulation with the physical fidelity of real-world data.

Teaser Image
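One simple way to realize such a hybrid mix is weighted sampling between synthetic and real sources when constructing batches. The 60/40 ratio below is a made-up illustration, not the actual pre-training recipe.

```python
import random

random.seed(0)

# Hypothetical sampling weights between synthetic and real corpora.
SOURCES = {
    "InternData-A1 (synthetic)": 0.6,
    "real-world (e.g. Agibot-World)": 0.4,
}

def sample_batch_sources(n):
    """Draw n source assignments according to the mixing weights."""
    names = list(SOURCES)
    weights = [SOURCES[name] for name in names]
    return random.choices(names, weights=weights, k=n)

batch = sample_batch_sources(1000)
frac_syn = batch.count("InternData-A1 (synthetic)") / len(batch)
print(f"synthetic fraction over 1000 draws: {frac_syn:.2f}")
```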

Usage

Please refer to our official repo InternVLA-A1.

Demonstrations

⚡ Dynamic Manipulation

InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.

🤖 Daily Tasks

InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.

License and Citation

All code in this repo is licensed under CC BY-NC-SA 4.0. Please consider citing our project if it helps your research.

```bibtex
@article{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}
```

Acknowledgments
