update the description of InternVLA-A1's key features
README.md
CHANGED
|
@@ -31,15 +31,14 @@ Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B p
|
|
| 31 |
|
| 32 |
## 🔑 Key Features
|
| 33 |
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
low-level dynamics.
|
| 37 |
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
|
| 38 |
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| 39 |
</div>
|
| 40 |
|
| 41 |
-
Our hybrid synthetic-real pre-training strategy combines
|
| 42 |
-
the scene diversity of simulation with the physical fidelity of real-world data.
|
| 43 |
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
|
| 44 |
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| 45 |
</div>
|
|
|
|
| 31 |
|
| 32 |
## 🔑 Key Features
|
| 33 |
|
| 34 |
+
Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design to unify scene understanding, visual foresight, and action execution in a single framework.
|
| 35 |
+
It combines the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
|
|
|
|
| 36 |
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
|
| 37 |
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| 38 |
</div>
|
| 39 |
|
| 40 |
+
Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., Agibot-World). This strategy combines
|
| 41 |
+
the scene diversity of simulation with the physical fidelity of real-world data.
|
| 42 |
<div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
|
| 43 |
<img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
|
| 44 |
</div>
|