# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_internvla-a1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

[](https://arxiv.org/pdf/2601.02456)
[](https://internrobotics.github.io/internvla-a1.github.io/)

<strong>InternVLA-A1</strong> integrates understanding, generation, and action experts via a Mixture-of-Transformers (MoT) framework, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.

Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:

- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation dataset [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., Agibot-World)
- [x] [InternVLA-A1-3B-RoboTwin](https://huggingface.co/InternRobotics/InternVLA-A1-3B-RoboTwin): finetuned on the RoboTwin 2.0 benchmark
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only

## 🔑 Key Features

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Method Overview" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

- 🔮 *The Core: Synergizes the MLLM's semantic understanding with world-model-style dynamics prediction, enabling it to "imagine" the future and guide adaptive actions.*
- 🚀 *The Fuel: Enables joint training on heterogeneous data sources spanning real-world robot data, synthetic simulation data, and egocentric human videos.*
- ⚡ *The Output: Tackles highly dynamic scenarios with effortless mastery.*
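
The Mixture-of-Transformers idea behind these features can be sketched in a few lines. This is a toy illustration under stated assumptions, not InternVLA-A1's actual implementation: `MoTLayer`, the dimensions, and the modality ids (0 = understanding, 1 = generation, 2 = action) are all hypothetical. The key property shown is that each modality routes through its own projection and FFN weights (the "experts"), while self-attention runs over the joint token sequence so the experts exchange information:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTLayer:
    """One toy Mixture-of-Transformers layer: each modality owns its
    projection/FFN weights, but attention spans the joint sequence."""
    def __init__(self, d, n_modalities, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d)
        # Modality-specific QKV projections and FFNs ("experts").
        self.qkv = [rng.normal(0, s, (d, 3 * d)) for _ in range(n_modalities)]
        self.ffn = [(rng.normal(0, s, (d, 4 * d)), rng.normal(0, s, (4 * d, d)))
                    for _ in range(n_modalities)]
        self.d = d

    def __call__(self, x, modality):
        # x: (T, d) token states; modality: (T,) int ids routing each token.
        qkv = np.empty((x.shape[0], 3 * self.d))
        for m in np.unique(modality):          # route tokens to their expert's QKV
            idx = modality == m
            qkv[idx] = x[idx] @ self.qkv[m]
        q, k, v = np.split(qkv, 3, axis=-1)
        att = softmax(q @ k.T / np.sqrt(self.d))   # global attention across modalities
        h = x + att @ v                            # residual connection
        out = np.empty_like(h)
        for m in np.unique(modality):          # modality-specific FFN expert
            w1, w2 = self.ffn[m]
            idx = modality == m
            out[idx] = h[idx] + np.maximum(h[idx] @ w1, 0) @ w2
        return out

# Joint sequence: understanding (0), generation (1), and action (2) tokens.
layer = MoTLayer(d=16, n_modalities=3)
x = np.random.default_rng(1).normal(size=(9, 16))
modality = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y = layer(x, modality)   # (9, 16): every token attended over all modalities
```

Per-modality weights keep the experts specialized, while the shared attention is what lets semantic reasoning and dynamics prediction condition action tokens in one forward pass.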

## Usage
Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1).

## Demonstrations
**InternVLA-A1** exhibits consistent robustness across static manipulation, dynamic manipulation, and simulation benchmarks, with especially marked superiority in dynamic scenarios.

<div align="center">

### ⚡ Dynamic Manipulation Tasks

<div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
<!-- First Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p>
</div>

### 🤖 Static Manipulation Tasks

<div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
<!-- First Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p>
</div>

### 📊 Simulation Benchmark

| Metric | pi0 | pi0.5 | **InternVLA-A1-3B** |
| :--- | :---: | :---: | :---: |
| Avg. Success (Easy) | 79.98% | 86.76% | **89.40%** |
| Avg. Success (Hard) | 79.50% | 86.96% | **89.64%** |

<em>InternVLA-A1 achieves state-of-the-art results on the RoboTwin 2.0 benchmark (averaged over 50 tasks).</em>

</div>

## License and Citation
All the code within this repo is under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.