---
license: cc-by-nc-sa-4.0
tags:
- robotics
- vision-language-action-model
- vision-language-model
pipeline_tag: robotics
library_name: transformers
---
# Model Card for InternVLA-M1_spatial
InternVLA-M1 is an open-source, end-to-end vision–language–action (VLA) framework for building and researching generalist robot policies, as presented in the paper [InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy](https://huggingface.co/papers/2510.13778).
- 🌐 Homepage: [InternVLA-M1 Project Page](https://internrobotics.github.io/internvla-m1.github.io/)
- 💻 Codebase: [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1)
<div align="center">
<img src="https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/main/assets/teaser.png" width="100%" height="100%"/>
</div>
## Abstract
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning data to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots.
## 🔥 Key Features
1. **Modular & Extensible**
All core components (model architecture, training data, training strategies, evaluation pipeline) are fully decoupled, enabling independent development, debugging, and extension of each module.
2. **Dual-System and Dual-Supervision**
InternVLA-M1 integrates both a language head and an action head under a unified framework, enabling collaborative training with dual supervision (see the loss sketch after this list).
3. **Efficient Training & Fast Convergence**
Learns spatial and visual priors from large-scale multimodal pretraining and transfers them via spatial prompt fine-tuning, achieving strong performance (e.g., SOTA-level convergence in ~2.5 epochs without separate action pretraining).
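As a rough illustration of the dual-supervision design in feature 2, the overall objective can be viewed as a weighted combination of a language-head loss and an action-head loss. The snippet below is a minimal sketch under that assumption; the function name, loss types, and weight are illustrative and not the InternVLA-M1 training code.

```python
import torch

def dual_supervision_loss(lm_loss: torch.Tensor,
                          action_loss: torch.Tensor,
                          lm_weight: float = 0.1) -> torch.Tensor:
    """Combine language-head and action-head supervision into one objective.

    `lm_loss` could be a next-token cross-entropy from the VLM head and
    `action_loss` a diffusion/regression loss from the action head; the
    0.1 weight is purely illustrative.
    """
    return lm_weight * lm_loss + action_loss
```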
## 🎯 Target Audience
1. Users who want to leverage open-source VLMs (e.g., Qwen2.5-VL) for robot control.
2. Teams co-training action datasets jointly with multimodal (vision–language) data.
3. Researchers exploring alternative VLA architectures and training strategies.
## 📊 Experimental Results
Success rates (%) on simulation benchmarks:

| Model | WidowX | Google Robot (Variant Agg.) | Google Robot (Visual Matching) | LIBERO |
|--------------|----------|-----------------------------|--------------------------------|----------|
| $\pi_0$ | 27.1 | 54.8 | 58.8 | 94.2 |
| GR00T | 61.9 | 44.5 | 35.2 | 93.9 |
| InternVLA-M1 | **71.7** | **76.0** | **80.7** | **95.9** |
## 🚀 Quick Start
### 🛠 Environment Setup
```bash
# Clone the repo
git clone https://github.com/InternRobotics/InternVLA-M1
# Create conda environment
conda create -n internvla-m1 python=3.10 -y
conda activate internvla-m1
# Install requirements
pip install -r requirements.txt
# Install FlashAttention2
pip install flash-attn --no-build-isolation
# Install InternVLA-M1
pip install -e .
```
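After installation, a quick import check can confirm the environment is usable (a minimal sketch; `flash_attn` only imports if FlashAttention2 was installed):

```python
# Sanity-check the environment inside the `internvla-m1` conda env.
import torch
import flash_attn  # present only if FlashAttention2 was installed
from InternVLA.model.framework.M1 import InternVLA_M1  # import path used in the demos below

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```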
### ⚡ Quick Interactive M1 Demo
Below are two collapsible examples: InternVLA-M1 chat and action prediction.
<details open>
<summary><b>InternVLA-M1 Chat Demo (image Q&A / Spatial Grounding)</b></summary>

```python
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch
def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img
saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)
# Use the raw image link for direct download
image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
image = load_image_from_url(image_url)
question = "Give the bounding box for the apple."
response = internVLA_M1.chat_with_M1(image, question)
print(response)
```
</details>
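The chat demo above does not explicitly move the model to GPU; if one is available, you can do so before calling `chat_with_M1`, the same way the action-prediction demo below does:

```python
import torch

# Optional: run the chat demo on GPU (mirrors the action-prediction demo).
if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")
```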
<details>
<summary><b>InternVLA-M1 Action Prediction Demo (two views)</b></summary>

```python
from InternVLA.model.framework.M1 import InternVLA_M1
from PIL import Image
import requests
from io import BytesIO
import torch
def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    img = Image.open(BytesIO(resp.content)).convert("RGB")
    return img
saved_model_path = "/PATH/checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)
image_url = "https://raw.githubusercontent.com/InternRobotics/InternVLA-M1/InternVLA-M1/assets/table.jpeg"
view1 = load_image_from_url(image_url)
view2 = view1.copy()
# Construct input: batch size = 1, two views
batch_images = [[view1, view2]] # List[List[PIL.Image]]
instructions = ["Pick up the apple and place it on the plate."]
if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")

pred = internVLA_M1.predict_action(
    batch_images=batch_images,
    instructions=instructions,
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)
normalized_actions = pred["normalized_actions"] # [B, T, action_dim]
print(normalized_actions.shape, type(normalized_actions))
```
</details>
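The returned `normalized_actions` live in the model's normalized action space. Below is a minimal sketch of mapping them back to robot-specific ranges, assuming a [-1, 1] normalization and per-dimension bounds taken from your own training-data statistics; the bounds and the 7-DoF layout are placeholders, not released values:

```python
import numpy as np
import torch

def unnormalize_actions(normalized_actions, action_low, action_high):
    """Map actions from an assumed [-1, 1] range back to per-dimension bounds."""
    if isinstance(normalized_actions, torch.Tensor):
        normalized_actions = normalized_actions.detach().cpu().numpy()
    return 0.5 * (normalized_actions + 1.0) * (action_high - action_low) + action_low

# Placeholder 7-DoF bounds (delta xyz, delta rpy, gripper); replace with the
# statistics used when your training data was normalized.
action_low = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
action_high = np.array([0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 1.0])

actions = unnormalize_actions(pred["normalized_actions"], action_low, action_high)
print(actions.shape)  # [B, T, action_dim] in the robot's native units
```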
### 📘 Examples
We provide several end-to-end examples for reference:
* **Reproduce InternVLA-M1 in SimplerEnv**
[Example](/examples/SimplerEnv)
* **Reproduce InternVLA-M1 in LIBERO**
[Example](/examples/LIBERO)
* **Training/Deployment on real robots**
[Example](/examples/real_robot)
## 📈 Model Zoo
We release a series of pretrained models and checkpoints to facilitate reproduction and downstream use.
### ✅ Available Checkpoints
| Model | Description | Link |
|-------|-------------|------|
| **InternVLA-M1** | Main pretrained model | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1) |
| **InternVLA-M1-Pretrain-RT-1-Bridge** | Pretraining on RT-1 Bridge data | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-Pretrain-RT-1-Bridge) |
| **InternVLA-M1-LIBERO-Long** | Fine-tuned on LIBERO Long-horizon tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Long) |
| **InternVLA-M1-LIBERO-Goal** | Fine-tuned on LIBERO Goal-conditioned tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Goal) |
| **InternVLA-M1-LIBERO-Spatial** | Fine-tuned on LIBERO Spatial reasoning tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Spatial) |
| **InternVLA-M1-LIBERO-Object** | Fine-tuned on LIBERO Object-centric tasks | [🤗 Hugging Face](https://huggingface.co/InternRobotics/InternVLA-M1-LIBERO-Object) |
## Training Details
```yaml
action_chunk: 8
batch_size: 128
training_steps: 30k
```
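Here, `action_chunk: 8` presumably corresponds to the temporal dimension `T` of `normalized_actions` in the demos above, i.e., eight future actions predicted per forward pass.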
## 🗺️ Roadmap
* [ ] Add co-training multimodal multitask README (co-training code is already available)
* [x] 0930: Unified Inference Server for Simpler and LIBERO
* [x] 0918: Release model weights
## 🤝 Contributing
We welcome contributions via Pull Requests or Issues.
Please include detailed logs and reproduction steps when reporting bugs.
## 📜 Citation
If you find this useful in your research, please consider citing:
```bibtex
@article{internvlam1,
title = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
author = {InternVLA-M1 Contributors},
journal = {arXiv preprint arXiv:2510.13778},
year = {2025}
}
```
## 📬 Contact
* Issues: Submit via GitHub Issues with detailed logs and steps
## 🙏 Acknowledgements
We thank the open-source community for their inspiring work. This project builds upon and is inspired by the following projects (alphabetical order):
- [CogACT](https://github.com/microsoft/CogACT/tree/main/action_model): Reference for a DiT-style action head design.
- [GenManip Simulation Platform](https://github.com/InternRobotics/GenManip): Simulation platform for generalizable pick-and-place based on Isaac Sim.
- [IPEC-COMMUNITY](https://huggingface.co/IPEC-COMMUNITY): Curated OXE / LIBERO style multi-task datasets and formatting examples.
- [Isaac-GR00T](https://github.com/NVIDIA/Isaac-GR00T): Standardized action data loader (GR00T-LeRobot).
- [Llavavla](https://github.com/JinhuiYE/llavavla): Baseline code structure and engineering design references.
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL/blob/main/qwen-vl-finetune/README.md): Multimodal input/output format, data loader, and pretrained VLM backbone.
Notes:
- If any required attribution or license header is missing, please open an issue and we will correct it promptly.
- All third-party resources remain under their original licenses; users should comply with respective terms.
---
Thanks for using **InternVLA-M1**! 🌟
If you find it useful, please consider giving us a ⭐ on GitHub.