---
license: cc-by-nc-sa-4.0
base_model:
- InternRobotics/InternVLA-A1-3B
tags:
- robotics
- vision-language-action-model
datasets:
- hxma/RoboTwin-LeRobot-v3.0
---

# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

<div style="display: flex; justify-content: center; align-items: center; margin: 10px 0;">
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_internvla-a1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

[Paper](https://arxiv.org/pdf/2601.02456)
[Code](https://github.com/InternRobotics/InternVLA-A1)
[Dataset](https://huggingface.co/datasets/InternRobotics/InternData-A1)
[Project Page](https://internrobotics.github.io/internvla-a1.github.io/)
<strong>InternVLA-A1</strong> integrates understanding, generation, and action experts via a Mixture-of-Transformers (MoT) framework, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.

Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at the 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:

- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation dataset [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., AgiBot World)
- [x] [InternVLA-A1-3B-RoboTwin](https://huggingface.co/InternRobotics/InternVLA-A1-3B-RoboTwin): fine-tuned on the RoboTwin 2.0 benchmark
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only

## Evaluation on RoboTwin 2.0 Simulation Benchmark

**Setting:** All models are jointly fine-tuned across 50 tasks (50 clean + 500 randomized demos per task).

**Performance Summary:** InternVLA-A1-3B achieves the highest success rates under both the Easy and Hard settings of the RoboTwin 2.0 benchmark (averaged over 50 tasks).

| Metric | pi0 | pi0.5 | **InternVLA-A1-3B** |
| :--- | :---: | :---: | :---: |
| Avg. Success (Easy) | 79.98% | 86.76% | **88.30%** |
| Avg. Success (Hard) | 79.50% | 86.96% | **88.48%** |

## 🔑 Key Features

<div style="display: flex; justify-content: center; align-items: center; margin: 10px 0;">
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Method Overview" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

- 🔮 *The Core: Synergizes the MLLM's semantic understanding with world-model-style dynamics prediction, enabling the model to "imagine" the future and guide adaptive actions.*
- 🚀 *The Fuel: Enables joint training on heterogeneous data sources spanning real-world robot data, synthetic simulation data, and egocentric human videos.*
- ⚡ *The Output: Tackles highly dynamic scenarios with ease.*

## Usage
Please refer to our official repo, [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1).
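
Training and inference entry points live in that repo; the checkpoint files themselves can also be pulled straight from the Hub. A minimal sketch using `huggingface_hub` (the local directory path is an arbitrary example; loading and inference code comes from the repo above, not from this snippet):

```python
from huggingface_hub import snapshot_download

# Hub repo id of the released checkpoint (from this model card).
REPO_ID = "InternRobotics/InternVLA-A1-3B"

def fetch_checkpoint(local_dir: str = "./checkpoints/InternVLA-A1-3B") -> str:
    """Download the checkpoint files from the Hugging Face Hub and
    return the local directory containing them."""
    # local_dir is an arbitrary example path; drop it to use the default HF cache.
    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

After `path = fetch_checkpoint()`, point the repo's inference or fine-tuning scripts at `path`.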

## Demonstrations
**InternVLA-A1** performs robustly across static manipulation, dynamic manipulation, and simulation benchmarks, and is particularly strong in dynamic scenarios.

<div align="center">

### ⚡ Dynamic Manipulation Tasks

<div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
  <!-- First Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4">
    </video>
  </div>
  <!-- Second Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4">
    </video>
  </div>
  <p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p>
</div>

### 🤖 Static Manipulation Tasks

<div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
  <!-- First Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4">
    </video>
  </div>
  <!-- Second Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4">
    </video>
  </div>
  <p><em>InternVLA-A1 demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p>
</div>

</div>

## License and Citation
All code in this repository is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.

```bibtex
@article{internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={Cai, Junhao and Cai, Zetao and Cao, Jiafei and Chen, Yilun and He, Zeyu and Jiang, Lei and Li, Hang and Li, Hengjie and Li, Yang and Liu, Yufei and others},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}
```

## Acknowledgments

- [LeRobot](https://github.com/huggingface/lerobot)
- [openpi](https://github.com/Physical-Intelligence/openpi)
- [InternVL](https://github.com/OpenGVLab/InternVL)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- [Cosmos](https://github.com/nvidia-cosmos)