---
license: cc-by-nc-sa-4.0
tags:
- robotics
- vision-language-action-model
- vision-language-model
---
# Model Card for InternVLA-M1-Pretrain-RT-1-Bridge

## Description
**InternVLA-M1** is an open-source, end-to-end **vision–language–action (VLA) framework** for building and researching generalist robot policies. The checkpoints in this repository were trained on the RT-1 and Bridge datasets.
- 🌐 Homepage: [InternVLA-M1 Project Page](https://internrobotics.github.io/internvla-m1.github.io/)
- 💻 Codebase: [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1)
## Quick Start
```python
# ===== system2 demo =====
import requests
import torch
from io import BytesIO
from PIL import Image

from InternVLA.model.framework.M1 import InternVLA_M1


def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")


saved_model_path = "/PATH//checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

# Use the raw file URL so the image bytes (not the GitHub HTML page) are fetched.
image_url = "https://github.com/InternRobotics/InternVLA-M1/blob/InternVLA-M1/assets/table.jpeg?raw=true"
image = load_image_from_url(image_url)
question = "give the bbox for the apple."
response = internVLA_M1.chat_with_M1(image, question)

# ===== predict_action demo =====
# Construct the input: batch size = 1, two camera views per sample.
view1 = load_image_from_url(image_url)
view2 = view1.copy()
batch_images = [[view1, view2]]  # List[List[PIL.Image]]
instructions = ["pick up the apple and place it on the plate."]

if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")

# Predict a chunk of future actions.
pred = internVLA_M1.predict_action(
    batch_images=batch_images,
    instructions=instructions,
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)
normalized_actions = pred["normalized_actions"]  # [B, T, action_dim]
```
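The policy returns *normalized* actions, which typically need to be mapped back to the robot's native action range before execution. The sketch below shows one common convention (linear rescaling from `[-1, 1]` using per-dimension dataset statistics); the function name, the bounds, and the `[-1, 1]` convention are assumptions for illustration, not the checkpoint's documented post-processing — use the statistics of the dataset the checkpoint was trained on.

```python
import numpy as np

def denormalize_actions(normalized: np.ndarray,
                        low: np.ndarray,
                        high: np.ndarray) -> np.ndarray:
    """Map actions from [-1, 1] back to [low, high], per dimension.

    normalized: array of shape [..., action_dim] with values in [-1, 1]
    low, high:  per-dimension bounds of shape [action_dim]
    """
    return 0.5 * (normalized + 1.0) * (high - low) + low

# Example with a 7-DoF action space and placeholder bounds.
norm = np.zeros((1, 8, 7))        # [B, T, action_dim]; zeros map to the midpoint
low = np.full(7, -0.05)
high = np.full(7, 0.05)
actions = denormalize_actions(norm, low, high)
```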

## Citation
```
@misc{internvla2025,
  title  = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
  author = {InternVLA-M1 Contributors},
  year   = {2025},
  note   = {arXiv preprint},
}
```