---
license: cc-by-nc-sa-4.0
tags:
- robotics
- vision-language-action-model
- vision-language-model
---
# Model Card for InternVLA-M1-Pretrain-RT-1-Bridge

## Description
**InternVLA-M1** is an open-source, end-to-end **vision–language–action (VLA) framework** for building and researching generalist robot policies. The checkpoints in this repository were trained on the RT-1 and Bridge datasets.
- 🌐 Homepage: [InternVLA-M1 Project Page](https://internrobotics.github.io/internvla-m1.github.io/)
- 💻 Codebase: [InternVLA-M1 GitHub Repo](https://github.com/InternRobotics/InternVLA-M1)
## Quick Start
```python
# ===== system2 demo =====
import requests
import torch
from io import BytesIO
from PIL import Image

from InternVLA.model.framework.M1 import InternVLA_M1


def load_image_from_url(url: str) -> Image.Image:
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    return Image.open(BytesIO(resp.content)).convert("RGB")


saved_model_path = "/PATH//checkpoints/steps_50000_pytorch_model.pt"
internVLA_M1 = InternVLA_M1.from_pretrained(saved_model_path)

# Use the raw file URL so the image bytes (not the GitHub HTML page) are fetched.
image_url = "https://github.com/InternRobotics/InternVLA-M1/blob/InternVLA-M1/assets/table.jpeg?raw=true"
image = load_image_from_url(image_url)
question = "give the bbox for the apple."
response = internVLA_M1.chat_with_M1(image, question)

# ===== predict_action demo =====
# Construct the input: batch size = 1, two camera views per sample.
view1 = load_image_from_url(image_url)
view2 = view1.copy()
batch_images = [[view1, view2]]  # List[List[PIL.Image]]
instructions = ["pick up the apple and place it on the plate."]

if torch.cuda.is_available():
    internVLA_M1 = internVLA_M1.to("cuda")

# Predict a chunk of future actions.
pred = internVLA_M1.predict_action(
    batch_images=batch_images,
    instructions=instructions,
    cfg_scale=1.5,
    use_ddim=True,
    num_ddim_steps=10,
)
normalized_actions = pred["normalized_actions"]  # [B, T, action_dim]
```
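The policy returns *normalized* actions, which typically need to be mapped back to the robot's native action range before execution. The sketch below shows one common convention (linear rescaling from `[-1, 1]` using per-dimension dataset statistics); the function name, the bounds, and the `[-1, 1]` convention are assumptions for illustration, not the checkpoint's documented post-processing — use the statistics of the dataset the checkpoint was trained on.

```python
import numpy as np

def denormalize_actions(normalized: np.ndarray,
                        low: np.ndarray,
                        high: np.ndarray) -> np.ndarray:
    """Map actions from [-1, 1] back to [low, high], per dimension.

    normalized: array of shape [..., action_dim] with values in [-1, 1]
    low, high:  per-dimension bounds of shape [action_dim]
    """
    return 0.5 * (normalized + 1.0) * (high - low) + low

# Example with a 7-DoF action space and placeholder bounds.
norm = np.zeros((1, 8, 7))        # [B, T, action_dim]; zeros map to the midpoint
low = np.full(7, -0.05)
high = np.full(7, 0.05)
actions = denormalize_actions(norm, low, high)
```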

## Citation
```
@misc{internvla2025,
  title  = {InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy},
  author = {InternVLA-M1 Contributors},
  year   = {2025},
  note   = {arXiv preprint},
}
```