|
|
--- |
|
|
license: cc-by-nc-sa-4.0 |
|
|
base_model: |
|
|
- Qwen/Qwen3-VL-2B-Instruct |
|
|
tags: |
|
|
- robotics |
|
|
- vision-language-action-model |
|
|
datasets: |
|
|
- InternRobotics/InternData-A1 |
|
|
--- |
|
|
|
|
|
# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation |
|
|
|
|
|
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;"> |
|
|
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_InternVLA-A1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
</div> |
|
|
|
|
|
[Paper](https://arxiv.org/pdf/2601.02456) | [Code](https://github.com/InternRobotics/InternVLA-A1) | [Dataset](https://huggingface.co/datasets/InternRobotics/InternData-A1) | [Project Page](https://internrobotics.github.io/internvla-a1.github.io/)
|
|
|
|
|
|
|
|
<strong>InternVLA-A1</strong> integrates understanding, generation, and action experts into a unified model, synergizing MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.
|
|
|
|
|
Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at the 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:
|
|
|
|
|
- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation dataset [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., AgiBot-World)
|
|
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only |
|
|
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only |
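
To fetch any of the checkpoints above locally, the standard `huggingface_hub` client works; here is a minimal sketch (the local directory name is an arbitrary choice, not part of this release):

```python
# Download a checkpoint with the official huggingface_hub client. The repo
# IDs come from the list above; the local directory name is arbitrary.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="InternRobotics/InternVLA-A1-3B",  # or a Pretrain-InternData-A1 variant
    local_dir="./InternVLA-A1-3B",
)
print(f"Checkpoint files in: {path}")
```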
|
|
|
|
|
## 🔑 Key Features |
|
|
|
|
|
Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design that unifies scene understanding, visual foresight, and action execution in a single framework.
It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
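
For intuition, the sketch below shows the general MoT pattern: each token is routed through modality-specific (understanding / generation / action) weights, while attention is computed jointly over all tokens. The expert count, dimensions, and naming are illustrative assumptions, not the released architecture.

```python
# Minimal MoT block sketch: per-expert QKV/projection/FFN weights, joint
# attention over the full token sequence. Assumes every token is assigned
# exactly one of the three experts. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlock(nn.Module):
    def __init__(self, dim=1024, num_heads=16, num_experts=3):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.ModuleList(nn.Linear(dim, 3 * dim) for _ in range(num_experts))
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, expert_ids):
        # x: (batch, seq, dim); expert_ids: (seq,) with values in
        # {0: understanding, 1: generation, 2: action}.
        h = self.norm1(x)
        qkv = torch.empty(*x.shape[:2], 3 * x.shape[-1], device=x.device, dtype=x.dtype)
        for i, layer in enumerate(self.qkv):
            mask = expert_ids == i            # tokens owned by expert i
            qkv[:, mask] = layer(h[:, mask])  # modality-specific QKV
        q, k, v = qkv.chunk(3, dim=-1)

        def heads(t):  # (B, S, D) -> (B, H, S, D/H)
            return t.reshape(t.shape[0], t.shape[1], self.num_heads, -1).transpose(1, 2)

        # Joint attention: every token attends across all modalities.
        a = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        a = a.transpose(1, 2).reshape_as(x)

        out = torch.empty_like(x)
        for i in range(len(self.proj)):
            mask = expert_ids == i
            y = x[:, mask] + self.proj[i](a[:, mask])      # expert-specific projection
            out[:, mask] = y + self.ffn[i](self.norm2(y))  # expert-specific FFN
        return out

# Usage with dummy tokens: two understanding, two generation, two action tokens.
block = MoTBlock()
tokens = torch.randn(2, 6, 1024)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
out = block(tokens, ids)  # (2, 6, 1024)
```

The property this illustrates is that action tokens can attend to language and predicted-future tokens through the shared attention, while each expert keeps its own parameters.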
|
|
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;"> |
|
|
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
</div> |
|
|
|
|
|
Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., AgiBot-World). This hybrid pre-training strategy combines
the scene diversity of simulation with the physical fidelity of real-world data.
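
As a toy illustration of such weighted synthetic-real mixing (the 70/30 ratio and the loader objects are assumptions for illustration, not the actual training recipe):

```python
# Toy weighted mixing of simulation and real-robot batches. The 0.7
# simulation weight is illustrative; actual ratios come from the recipe.
import itertools
import random

def mixed_batches(sim_loader, real_loader, sim_weight=0.7, steps=1000):
    sim_it = itertools.cycle(sim_loader)    # e.g., InternData-A1 batches
    real_it = itertools.cycle(real_loader)  # e.g., AgiBot-World batches
    for _ in range(steps):
        yield next(sim_it if random.random() < sim_weight else real_it)
```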
|
|
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;"> |
|
|
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
</div> |
|
|
|
|
|
## Usage |
|
|
Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1). |
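
The repo is the source of truth for installation and inference entry points. Purely as a sketch of the expected observation-to-action loop, with every name below (module, class, method, and observation keys) hypothetical rather than the repo's real API:

```python
# HYPOTHETICAL sketch only: `internvla_a1`, `InternVLAA1Policy`, and
# `select_action` are placeholder names; consult the official repo for
# the actual interface.
import numpy as np

from internvla_a1 import InternVLAA1Policy  # hypothetical import

policy = InternVLAA1Policy.from_pretrained("InternRobotics/InternVLA-A1-3B")

observation = {
    "images": {"head": np.zeros((480, 640, 3), dtype=np.uint8)},  # camera frame(s)
    "state": np.zeros(14, dtype=np.float32),                      # proprioception
    "instruction": "sort the parcels by destination",
}
# A VLA policy typically returns a short chunk of future actions.
action_chunk = policy.select_action(observation)
```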
|
|
|
|
|
## Demonstrations |
|
|
### ⚡ Dynamic Manipulation |
|
|
<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;"> |
|
|
<!-- First Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<!-- Second Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<!-- Third Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p> |
|
|
</div> |
|
|
|
|
|
|
|
|
### 🤖 Daily Tasks
|
|
|
|
|
<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;"> |
|
|
<!-- First Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<!-- Second Row --> |
|
|
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;"> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);"> |
|
|
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4"> |
|
|
</video> |
|
|
</div> |
|
|
<p><em>InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p> |
|
|
</div> |
|
|
|
|
|
## License and Citation |
|
|
All the code within this repo is under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.
|
|
|
|
|
```BibTeX
@article{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  journal={arXiv preprint arXiv:2601.02456},
  year={2026}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- [LeRobot](https://github.com/huggingface/lerobot)
|
|
- [openpi](https://github.com/Physical-Intelligence/openpi) |
|
|
- [InternVL](https://github.com/OpenGVLab/InternVL) |
|
|
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL) |
|
|
- [Cosmos](https://github.com/nvidia-cosmos)