# InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/teaser_internvla-a1.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

[](https://arxiv.org/pdf/2601.02456)
[](https://internrobotics.github.io/internvla-a1.github.io/)

<strong>InternVLA-A1</strong> integrates understanding, generation, and action experts via a Mixture-of-Transformers (MoT) framework, which synergizes MLLMs' semantic reasoning with world-model-style dynamics prediction to guide action execution.

Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B parameter scales. Covering different model scales and pre-training data configurations, we release the InternVLA-A1 series:

- [x] [InternVLA-A1-3B](https://huggingface.co/InternRobotics/InternVLA-A1-3B): pretrained on the large-scale, high-fidelity simulation dataset [InternData-A1](https://huggingface.co/datasets/InternRobotics/InternData-A1), together with open-source robot data (e.g., Agibot-World)
- [x] [InternVLA-A1-3B-RoboTwin](https://huggingface.co/InternRobotics/InternVLA-A1-3B-RoboTwin): finetuned on the RoboTwin 2.0 benchmark
- [ ] [InternVLA-A1-3B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-3B-Pretrain-InternData-A1): pretrained on InternData-A1 only
- [ ] [InternVLA-A1-2B-Pretrain-InternData-A1](https://huggingface.co/InternRobotics/InternVLA-A1-2B-Pretrain-InternData-A1): pretrained on InternData-A1 only

## 🔑 Key Features

<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Method Overview" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
</div>

- 🔮 *The Core: Synergizes the MLLM's semantic understanding with world-model-style dynamics prediction, enabling it to "imagine" the future and guide adaptive actions.*
- 🚀 *The Fuel: Enables joint training on heterogeneous data sources spanning real-world robot data, synthetic simulation data, and egocentric human videos.*
- ⚡ *The Output: Tackles highly dynamic scenarios with effortless mastery.*
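
The Mixture-of-Transformers idea behind these features can be sketched in a few lines. This is a toy illustration under stated assumptions, not InternVLA-A1's actual implementation: `MoTLayer`, the dimensions, and the modality ids (0 = understanding, 1 = generation, 2 = action) are all hypothetical. The key property shown is that each modality routes through its own projection and FFN weights (the "experts"), while self-attention runs over the joint token sequence so the experts exchange information:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoTLayer:
    """One toy Mixture-of-Transformers layer: each modality owns its
    projection/FFN weights, but attention spans the joint sequence."""
    def __init__(self, d, n_modalities, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d)
        # Modality-specific QKV projections and FFNs ("experts").
        self.qkv = [rng.normal(0, s, (d, 3 * d)) for _ in range(n_modalities)]
        self.ffn = [(rng.normal(0, s, (d, 4 * d)), rng.normal(0, s, (4 * d, d)))
                    for _ in range(n_modalities)]
        self.d = d

    def __call__(self, x, modality):
        # x: (T, d) token states; modality: (T,) int ids routing each token.
        qkv = np.empty((x.shape[0], 3 * self.d))
        for m in np.unique(modality):          # route tokens to their expert's QKV
            idx = modality == m
            qkv[idx] = x[idx] @ self.qkv[m]
        q, k, v = np.split(qkv, 3, axis=-1)
        att = softmax(q @ k.T / np.sqrt(self.d))   # global attention across modalities
        h = x + att @ v                            # residual connection
        out = np.empty_like(h)
        for m in np.unique(modality):          # modality-specific FFN expert
            w1, w2 = self.ffn[m]
            idx = modality == m
            out[idx] = h[idx] + np.maximum(h[idx] @ w1, 0) @ w2
        return out

# Joint sequence: understanding (0), generation (1), and action (2) tokens.
layer = MoTLayer(d=16, n_modalities=3)
x = np.random.default_rng(1).normal(size=(9, 16))
modality = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y = layer(x, modality)   # (9, 16): every token attended over all modalities
```

Per-modality weights keep the experts specialized, while the shared attention is what lets semantic reasoning and dynamics prediction condition action tokens in one forward pass.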

## Usage
Please refer to our official repo [InternVLA-A1](https://github.com/InternRobotics/InternVLA-A1).

## Demonstrations
**InternVLA-A1** exhibits consistent robustness across static manipulation, dynamic manipulation, and simulation benchmarks, with especially marked superiority in dynamic scenarios.

<div align="center">

### ⚡ Dynamic Manipulation Tasks

<div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
<!-- First Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_complete.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/In-motion_Ingredient_Picking_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_3.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_inverted_2.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_2.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/express_sorting_upright_1.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 exhibits exceptional robustness in highly dynamic scenarios.</em></p>
</div>

### 🤖 Static Manipulation Tasks

<div style="display: flex; flex-direction: column; align-items: center; gap: 5px;">
<!-- First Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/zig_bag_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sort_parts_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/unscrew_cap_4x.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 5px;">
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/wipe_stain_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/place_flower_4x.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="250" style="border-radius: 5px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/sweep_trash_4x.mp4" type="video/mp4">
</video>
</div>
<p><em>InternVLA-A1 demonstrates superior proficiency in dexterous and fine-grained manipulation.</em></p>
</div>

### 📊 Simulation Benchmark

| Metric | pi0 | pi0.5 | **InternVLA-A1-3B** |
| :--- | :---: | :---: | :---: |
| Avg. Success (Easy) | 79.98% | 86.76% | **89.40%** |
| Avg. Success (Hard) | 79.50% | 86.96% | **89.64%** |

<em>InternVLA-A1 achieves state-of-the-art results on the RoboTwin 2.0 benchmark (averaged over 50 tasks).</em>

</div>

## License and Citation
All the code within this repo is under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Please consider citing our project if it helps your research.