---
license: cc-by-nc-sa-4.0
tags:
- robotics
- vision-language-action-model
- vision-language-model
---
# Model Card for InternVLA-M1-Pretrain-RT-1-Bridge
## Description
**InternVLA-M1** is an open-source, end-to-end **vision–language–action (VLA) framework** for building and researching generalist robot policies. The checkpoints in this repository were trained on the RT-1 and Bridge datasets.
- 🌐 Homepage: [InternVLA-M1 Project Page](https://internrobotics.github.io/internvla-m1.github.io/)
- 💻 Codebase: [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1)

## Quick Start
```python
# ===== system2 demo =====
import requests
import torch
from io import BytesIO
from PIL import Image

from InternVLA.model.framework.M1 import InternVLA_M1

def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")

saved_model_path = "/PATH//checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)
if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")

# Use ?raw=true so GitHub serves the image file rather than an HTML page.
image_url = "https://github.com/InternRobotics/InternVLA-M1/blob/InternVLA-M1/assets/table.jpeg?raw=true"
image = load_image_from_url(image_url)
question = "give the bbox for the apple."
response = internVLA_M1.chat_with_M1(image, question)

# ===== predict_action demo =====
# Construct the input: batch size = 1, two camera views.
view1 = load_image_from_url(image_url)
view2 = view1.copy()
batch_images = [[view1, view2]]  # List[List[PIL.Image]]
instructions = ["pick up the apple and place it on the plate."]

# Predict an action chunk.
pred = internVLA_M1.predict_action(
    batch_images=batch_images,
    instructions=instructions,
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)
normalized_actions = pred["normalized_actions"]  # [B, T, action_dim]
```
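The returned `normalized_actions` are in the model's normalized action space and must be mapped back to real robot commands before execution. A minimal sketch of that step, assuming actions were normalized to `[-1, 1]` with per-dimension bounds (the `action_low`/`action_high` values below are hypothetical placeholders; use the statistics from your training data):

```python
import numpy as np

# Hypothetical per-dimension action bounds; replace with the normalizer
# statistics actually used during training (e.g. from RT-1/Bridge).
action_low = np.array([-0.05, -0.05, -0.05, -0.5, -0.5, -0.5, 0.0])
action_high = np.array([0.05, 0.05, 0.05, 0.5, 0.5, 0.5, 1.0])

def denormalize(normalized: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map actions from [-1, 1] back to per-dimension [low, high] ranges."""
    return 0.5 * (normalized + 1.0) * (high - low) + low

# normalized_actions: [B, T, action_dim] in [-1, 1]
normalized_actions = np.zeros((1, 10, 7))  # placeholder batch for illustration
actions = denormalize(normalized_actions, action_low, action_high)
print(actions.shape)  # (1, 10, 7)
```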
## Citation
```
@misc{internvla2025,
      title  = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
      author = {InternVLA-M1 Contributors},
      year   = {2025},
      note   = {arXiv preprint},
}
```