Commit 7d3d296 (verified) by Jia-Zeng · 1 Parent(s): 1ee5b85

update the description of InternVLA-A1's key features

Files changed (1)
  1. README.md +4 -5
README.md CHANGED
@@ -31,15 +31,14 @@ Building upon InternVL3 and Qwen3-VL, we instantiate InternVLA-A1 at 2B and 3B p
31
 
32
  ## 🔑 Key Features
33
 
34
- Architecturally, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design to unify semantic un-
35
- derstanding, visual foresight, and action prediction, effectively synergizing high-level reasoning with
36
- low-level dynamics.
37
  <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
38
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
39
  </div>
40
 
41
- Our hybrid synthetic-real pre-training strategy combines
42
- the scene diversity of simulation with the physical fidelity of real-world data.
43
  <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
44
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
45
  </div>
 
31
 
32
  ## 🔑 Key Features
33
 
34
+ Regarding model architecture, InternVLA-A1 employs a Mixture-of-Transformers (MoT) design to unify scene understanding, visual foresight, and action execution in a single framework.
35
+ It combines an MLLM's semantic understanding with world-model-style dynamics prediction to "imagine" the future and guide adaptive actions.
 
36
  <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
37
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/method_InternVLA-A1.png" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
38
  </div>
39
 
40
+ Regarding training data, we pre-train InternVLA-A1 on hybrid synthetic-real datasets spanning InternData-A1 and open-source real-world data (e.g., Agibot-World). This strategy combines
41
+ the scene diversity of simulation with the physical fidelity of real-world data.
42
  <div style="display: flex; justify-content: center; align-items: center; margin: 20px 0;">
43
  <img src="https://huggingface.co/spaces/Jia-Zeng/InternVLA_A1_Media/resolve/main/data_paramid.jpg" alt="Teaser Image" style="max-width: 100%; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
44
  </div>
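
The hybrid synthetic-real pre-training idea in the added lines can be illustrated with a minimal sampling sketch. This is not the InternVLA-A1 training code: the function name `mix_sources`, the `synthetic_ratio` parameter, and the placeholder episode lists are all hypothetical, and the README does not state the actual mixing ratio or sampling scheme.

```python
import random
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def mix_sources(
    synthetic: Sequence[T],
    real: Sequence[T],
    synthetic_ratio: float,
    num_samples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, T]]:
    """Yield (source, item) pairs drawn from two pools.

    Hypothetical helper: each draw picks the synthetic pool with
    probability `synthetic_ratio`, otherwise the real pool, then
    samples one item uniformly from the chosen pool.
    """
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible mixing
    for _ in range(num_samples):
        if rng.random() < synthetic_ratio:
            yield "synthetic", rng.choice(synthetic)
        else:
            yield "real", rng.choice(real)


# Placeholder episode identifiers, not real dataset entries.
sim_episodes = ["sim_ep_0", "sim_ep_1", "sim_ep_2"]
real_episodes = ["real_ep_0", "real_ep_1"]

# The 0.5 ratio is an arbitrary illustration value.
batch = list(mix_sources(sim_episodes, real_episodes, 0.5, num_samples=8))
```

Sampling the source first and the item second keeps the ratio between simulation and real data independent of how large each pool is, which is one common way to balance a large synthetic corpus against a smaller real-world one.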