InternRobotics/InternVLA-A1-3B

Jia-Zeng committed on Jan 8

Commit 7d3d296 · verified · 1 Parent(s): 1ee5b85

update the description of InternVLA-A1's key features

Files changed (1): README.md

@@ -31,15 +31,14 @@ Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B p
 ## 🔑 Key Features
-Architecturally, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design to unify semantic
-understanding, visual foresight, and action prediction, effectively synergizing high-level reasoning with
-low-level dynamics.
 <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
     <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
 </div>
-Our hybrid synthetic-real pre-training strategy combines
-the scene diversity of simulation with the physical fidelity of real-world data.
 <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
     <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
 </div>

 ## 🔑 Key Features
+Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design to unify scene understanding, visual foresight, and action execution in a single framework.
+It synergizes the MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
 <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
     <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
 </div>
+Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., Agibot-World). This pre-training strategy combines
+the scene diversity of simulation with the physical fidelity of real-world data.
 <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
     <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
 </div>