---
license: cc-by-nc-sa-4.0
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- robotics
- vision-language-action-model
datasets:
- InternRobotics/InternData-A1
---
# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_InternVLA-A1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>
[📄 Paper](https://arxiv.org/pdf/2601.02456) |
[💻 Code](https://github.com/InternRobotics/InternVLA-A1) |
[🤗 Dataset](https://huggingface.co/datasets/InternRobotics/InternData-A1) |
[🌐 Project Page](https://internrobotics.github.io/internvla-a1.github.io/)
<strong>InternVLA-A1</strong> integrates understanding, generation, and action experts into a unified
model, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.
Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:
- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation data [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g. Agibot-World)
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only
## 🔑 Key Features
Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design that unifies scene understanding, visual foresight, and action execution in a single framework.
It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Method Overview" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>
Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g. Agibot-World). Our hybrid synthetic-real pre-training strategy combines
the scene diversity of simulation with the physical fidelity of real-world data.
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Data Pyramid" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>
## Usage
Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1).
## Demonstrations
### ⚡ Dynamic Manipulation
<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
<!-- First Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4">
</video>
</div>
<!-- Third Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p>
</div>
### 🤖 Daily Tasks
<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
<!-- First Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p>
</div>
## License and Citation
All code in this repo is released under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.
```BibTeX
@article{contributors2026internvla_a1,
title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
author={InternVLA-A1 contributors},
journal={arXiv preprint arXiv:2601.02456},
year={2026}
}
```
## Acknowledgments
- [LeRobot](https://github.com/huggingface/lerobot)
- [openpi](https://github.com/Physical-Intelligence/openpi)
- [InternVL](https://github.com/OpenGVLab/InternVL)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- [COSMOS](https://github.com/nvidia-cosmos)