Robotics · Safetensors · vision-language-action-model

Jia-Zeng committed (verified) · commit 7db4dd8 · 1 parent: 04430c8

update teaser figure and experimental results

Files changed (1): README.md (+48, −37)

README.md CHANGED
@@ -12,7 +12,7 @@ datasets:
# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
- <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_InternVLA-A1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

[![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/pdf/2601.02456)
@@ -21,96 +21,107 @@ datasets:
[![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://internrobotics.github.io/internvla-a1.github.io/)

- <strong>InternVLA-A1</strong> integrates understanding, generation, and action experts into a unified
- model, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.

Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:

- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation data [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g. Agibot-World)
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only

## 🔑 Key Features

- Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design to unifies scene understanding, visual foresight, and action execution into a single framework.
- It synergizes MLLM's semantic understanding with world-model-style dynamic prediction, to "imagine" the future and guide adaptive actions.
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

- Regarding training data, We pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g. Agibot-World). Our hybrid synthetic-real pre-training strategy combines
- the scene diversity of simulation with the physical fidelity of real-world data.
- <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
- <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
- </div>

## Usage
Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1).

## Demonstrations
- ### ⚡ Dynamic Manipulation
- <div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
<!-- First Row -->
- <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
- <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4">
</video>
- <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4">
</video>
- </div>
- <!-- Second Row -->
- <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
- <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4">
</video>
- <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4">
</video>
- </div>
- <!-- Third Row -->
- <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
- <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4">
</video>
- <video controls autoplay loop muted width="250" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p>
</div>

- ### 🤖 Daily tasks
-
- <div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
<!-- First Row -->
- <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4">
</video>
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4">
</video>
- <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
- <div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4">
</video>
- <video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4">
</video>
- <video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4">
</video>
</div>
- <p><em>InternVLA-A1 also demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p>
</div>

## License and Citation
All the code within this repo is under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.
 
# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
+ <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_internvla-a1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

[![Paper](https://img.shields.io/badge/Paper-arXiv-red.svg)](https://arxiv.org/pdf/2601.02456)

[![Website](https://img.shields.io/badge/Website-Pages-blue.svg)](https://internrobotics.github.io/internvla-a1.github.io/)

+ <strong>InternVLA-A1</strong> integrates understanding, generation, and action experts via a Mixture-of-Transformers (MoT) framework, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.

Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:

- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation data [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g. Agibot-World)
+ - [x] [InternVLA-A1-3B-RoboTwin](https://huggingface.co/InternRobotics/InternVLA-A1-3B-RoboTwin): fine-tuned on the RoboTwin 2.0 benchmark
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only

## 🔑 Key Features

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

+ - 🔮 *The Core: Synergizes MLLM's semantic understanding with world-model-style dynamic prediction, enabling it to "imagine" the future and guide adaptive actions.*
+ - 🚀 *The Fuel: Enables joint training on heterogeneous data sources spanning real-world robot data, synthetic simulation data, and egocentric human videos.*
+ - ⚡ *The Output: Tackles highly dynamic scenarios with effortless mastery.*

  ## Usage
  Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1).
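Before following the repo's instructions, the released weights can be staged locally with `huggingface_hub` (the library behind the Hub's download API). This is an illustrative sketch, not part of the InternVLA-A1 codebase: the helper name `fetch_checkpoint` is ours, and actual inference/fine-tuning entry points live in the official repo linked above.

```python
# Illustrative sketch (not from the InternVLA-A1 codebase): stage the released
# checkpoint locally with huggingface_hub. Only the repo id comes from this
# model card; inference itself is defined in the official InternVLA-A1 repo.
from huggingface_hub import snapshot_download


def fetch_checkpoint(repo_id: str = "InternRobotics/InternVLA-A1-3B") -> str:
    """Download (or reuse the cached copy of) the model repo; returns its local path."""
    return snapshot_download(repo_id=repo_id)


# Usage (downloads several GB on first run, so not executed here):
#   local_dir = fetch_checkpoint()
```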
  ## Demonstrations
+ **InternVLA-A1** exhibits consistent robustness across static manipulation, dynamic manipulation, and simulation benchmarks, and is especially strong in dynamic scenarios.
+
+ <div align="center">
+
+ ### ⚡ Dynamic Manipulation Tasks
+
+ <div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
<!-- First Row -->
+ <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4">
</video>
+ </div>
+ <!-- Second Row -->
+ <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p>
</div>

+ ### 🤖 Static Manipulation Tasks

+ <div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
<!-- First Row -->
+ <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
+ <div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4">
</video>
+ <video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4">
</video>
</div>
+ <p><em>InternVLA-A1 demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p>
</div>

+ ### 📊 Simulation Benchmark
+
+ | Metric | pi0 | pi0.5 | **InternVLA-A1-3B** |
+ | :--- | :---: | :---: | :---: |
+ | Avg. Success (Easy) | 79.98% | 86.76% | **89.40%** |
+ | Avg. Success (Hard) | 79.50% | 86.96% | **89.64%** |
+
+ <em>InternVLA-A1 achieves state-of-the-art results on the RoboTwin 2.0 benchmark (averaged over 50 tasks).</em>
+
+ </div>

## License and Citation
All the code within this repo is under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.