---
license: cc-by-nc-sa-4.0
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- robotics
- vision-language-action-model
library_name: transformers
---

# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_InternVLA-A1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

[![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://internrobotics.github.io/internvla-a1.github.io/paper/InternVLA_A1.pdf)
[![Code](https://img.shields.io/badge/GitHub-Code-800820?logo=github)](https://github.com/InternRobotics/InternVLA-A1)
[![Data](https://img.shields.io/badge/Data-HuggingFace-blue?logo=huggingface)](https://huggingface.co/datasets/InternRobotics/InternData-A1)
[![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://internrobotics.github.io/internvla-a1.github.io/)


<strong>InternVLA-A1</strong> integrates understanding, generation, and action experts into a unified model, synergizing MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.

Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at the 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:

- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation dataset [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., Agibot-World)
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only

## 🔑 Key Features

Architecturally, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design to unify semantic understanding, visual foresight, and action prediction, effectively synergizing high-level reasoning with low-level dynamics.

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Method Overview" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

Our hybrid synthetic-real pre-training strategy combines the scene diversity of simulation with the physical fidelity of real-world data.

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Data Pyramid" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

## Demonstrations

### ⚡ Dynamic Manipulation

<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
  <!-- First Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
    <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4">
    </video>
  </div>
  <!-- Second Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
    <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4">
    </video>
  </div>
  <!-- Third Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
    <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4">
    </video>
  </div>
  <p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p>
</div>


### 🤖 Daily Tasks

<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
  <!-- First Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4">
    </video>
  </div>
  <!-- Second Row -->
  <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4">
    </video>
    <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
      <source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4">
    </video>
  </div>
  <p><em>InternVLA-A1 also demonstrates strong proficiency in dexterous and fine-grained manipulation.</em></p>
</div>

## Usage

Please refer to our official repo, [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1), for setup and usage instructions.

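As a quick orientation, the snippet below is a minimal, hypothetical sketch of how a checkpoint like this is commonly loaded via the `transformers` library (this card declares `library_name: transformers`). The `AutoModel`/`AutoProcessor` entry points and the `trust_remote_code` flag are assumptions, not the project's documented API; follow the official repo above for the supported workflow.

```python
# Hypothetical loading sketch -- the official repo is the authoritative reference.
# Assumption: the checkpoint exposes custom model code loadable with
# trust_remote_code, which is common for VLA checkpoints on the Hub.

MODEL_ID = "InternRobotics/InternVLA-A1-3B"  # checkpoint from the series listed above

def load_model(model_id: str = MODEL_ID):
    """Load processor and model; custom architectures typically need trust_remote_code."""
    # Imported lazily so the module can be inspected without transformers installed.
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    return processor, model
```

Downloading the 3B checkpoint requires several gigabytes of disk space; for the exact inference interface (observation formatting, action decoding), defer to the official repo.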
## License and Citation

All code in this repo is released under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.

```BibTeX
@misc{contributors2026internvla_a1,
  title={InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation},
  author={InternVLA-A1 contributors},
  year={2026}
}
```