|
|
--- |
|
|
license: cc-by-nc-sa-4.0 |
|
|
base_model: |
|
|
- Qwen/Qwen3-VL-2B-Instruct |
|
|
tags: |
|
|
- robotics |
|
|
- vision-language-action-model |
|
|
datasets: |
|
|
- InternRobotics/InternData-A1 |
|
|
--- |
|
|
|
|
|
# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation |
|
|
|
|
|
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;"> |
|
|
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_InternVLA-A1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
</div> |
|
|
|
|
|
[Paper](https://arxiv.org/pdf/2601.02456) | [Code](https://github.com/InternRobotics/InternVLA-A1) | [Dataset](https://huggingface.co/datasets/InternRobotics/InternData-A1) | [Project Page](https://internrobotics.github.io/internvla-a1.github.io/)
|
|
|
|
|
|
|
|
<strong>InternVLA-A1</strong> integrates understanding, generation, and action experts into a unified model, synergizing MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.
|
|
|
|
|
Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at the 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:
|
|
|
|
|
- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation dataset [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., AgiBot-World)
|
|
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only |
|
|
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only |
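
To fetch any of the checkpoints above locally, the standard `huggingface_hub` client works; here is a minimal sketch (the local directory name is an arbitrary choice, not part of this release):

```python
# Download a checkpoint with the official huggingface_hub client. The repo
# IDs come from the list above; the local directory name is arbitrary.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="InternRobotics/InternVLA-A1-3B",  # or a Pretrain-InternData-A1 variant
    local_dir="./InternVLA-A1-3B",
)
print(f"Checkpoint files in: {path}")
```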
|
|
|
|
|
## 🔑 Key Features |
|
|
|
|
|
Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design that unifies scene understanding, visual foresight, and action execution in a single framework.
It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
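
For intuition, the sketch below shows the general MoT pattern: each token is routed through modality-specific (understanding / generation / action) weights, while attention is computed jointly over all tokens. The expert count, dimensions, and naming are illustrative assumptions, not the released architecture.

```python
# Minimal MoT block sketch: per-expert QKV/projection/FFN weights, joint
# attention over the full token sequence. Assumes every token is assigned
# exactly one of the three experts. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    def __init__(self, dim=1024, num_heads=16, num_experts=3):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.ModuleList(nn.Linear(dim, 3 * dim) for _ in range(num_experts))
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, expert_ids):
        # x: (batch, seq, dim); expert_ids: (seq,) with values in
        # {0: understanding, 1: generation, 2: action}.
        h = self.norm1(x)
        qkv = torch.empty(*x.shape[:2], 3 * x.shape[-1], device=x.device, dtype=x.dtype)
        for i, layer in enumerate(self.qkv):
            mask = expert_ids == i            # tokens owned by expert i
            qkv[:, mask] = layer(h[:, mask])  # modality-specific QKV
        q, k, v = qkv.chunk(3, dim=-1)

        def heads(t):  # (B, S, D) -> (B, H, S, D/H)
            return t.reshape(t.shape[0], t.shape[1], self.num_heads, -1).transpose(1, 2)

        # Joint attention: every token attends across all modalities.
        a = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        a = a.transpose(1, 2).reshape_as(x)

        out = torch.empty_like(x)
        for i in range(len(self.proj)):
            mask = expert_ids == i
            y = x[:, mask] + self.proj[i](a[:, mask])      # expert-specific projection
            out[:, mask] = y + self.ffn[i](self.norm2(y))  # expert-specific FFN
        return out

# Usage with dummy tokens: two understanding, two generation, two action tokens.
block = MoTBlock()
tokens = torch.randn(2, 6, 1024)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
out = block(tokens, ids)  # (2, 6, 1024)
```

The property this illustrates is that action tokens can attend to language and predicted-future tokens through the shared attention, while each expert keeps its own parameters.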
|
|
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;"> |
|
|
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
</div> |
|
|
|
|
|
Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., AgiBot-World). This hybrid pre-training strategy combines
the scene diversity of simulation with the physical fidelity of real-world data.
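
As a toy illustration of such weighted synthetic-real mixing (the 70/30 ratio and the loader objects are assumptions for illustration, not the actual training recipe):

```python
# Toy weighted mixing of simulation and real-robot batches. The 0.7
# simulation weight is illustrative; actual ratios come from the recipe.
import itertools
import random

def mixed_batches(sim_loader, real_loader, sim_weight=0.7, steps=1000):
    sim_it = itertools.cycle(sim_loader)    # e.g., InternData-A1 batches
    real_it = itertools.cycle(real_loader)  # e.g., AgiBot-World batches
    for _ in range(steps):
        yield next(sim_it if random.random() < sim_weight else real_it)
```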
|
|
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;"> |
|
|
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
</div> |
|
|
|
|
|
## Usage |
|
|
Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1). |
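
The repo is the source of truth for installation and inference entry points. Purely as a sketch of the expected observation-to-action loop, with every name below (module, class, method, and observation keys) hypothetical rather than the repo's real API:

```python
# HYPOTHETICAL sketch only: `internvla_a1`, `InternVLAA1Policy`, and
# `select_action` are placeholder names; consult the official repo for
# the actual interface.
import numpy as np

from internvla_a1 import InternVLAA1Policy  # hypothetical import

policy = InternVLAA1Policy.from_pretrained("InternRobotics/InternVLA-A1-3B")

observation = {
    "images": {"head": np.zeros((480, 640, 3), dtype=np.uint8)},  # camera frame(s)
    "state": np.zeros(14, dtype=np.float32),                      # proprioception
    "instruction": "sort the parcels by destination",
}
# A VLA policy typically returns a short chunk of future actions.
action_chunk = policy.select_action(observation)
```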
|
|
|
|
|
## Demonstrations |
|
|
### ⚡ Dynamic Manipulation |
|
|
<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;"> |
|
|
<!-- First Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<!-- Second Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<!-- Third Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p> |
|
|
</div> |
|
|
|
|
|
|
|
|
### 🤖 Daily Tasks
|
|
|
|
|
<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;"> |
|
|
<!-- First Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<!-- Second Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<p><em>InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p> |
|
|
</div> |
|
|
|
|
|
## License and Citation |
|
|
All the code within this repo is under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.
|
|
|
|
|
```BibTeX
@article{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- [LeRobot](https://github.com/huggingface/lerobot)
|
|
- [openpi](https://github.com/Physical-Intelligence/openpi) |
|
|
- [InternVL](https://github.com/OpenGVLab/InternVL) |
|
|
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) |
|
|
- [Cosmos](https://github.com/nvidia-cosmos)