# BLM<sub>0</sub>: A Boundless Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
<p align="center">
⭐️ <a href="https://boundless-large-model.github.io">Project</a>     🤗 <a href="https://huggingface.co/BLM-Lab/BLM-0">Hugging Face</a>     📑 <a href="http://arxiv.org/abs/2502.21257">Paper</a>
</p>
## 🔥 Overview
We present **Boundless Large Model** (BLM<sub>0</sub>), a multimodal spatial foundation model that preserves the native instruction-following and reasoning ability of MLLMs while acquiring effective robotic control. We formalize three requirements for generalist agents—cross-space transfer (digital→physical), cross-task learning, and cross-embodiment generalization—and instantiate them with a two-stage training pipeline. Stage I performs supervised fine-tuning on large-scale digital-space understanding and reasoning corpora to inject embodied perception and spatial knowledge without degrading the underlying language capabilities. Stage II freezes the MLLM backbone and trains a diffusion-based policy head on a self-collected cross-embodiment demonstration suite spanning Franka Emika Panda, xArm-6, xArm-7, and WidowX AI over six increasingly challenging tasks; demonstrations are generated in ManiSkill to ensure collision-free, time-parameterized trajectories. A simple intent-bridging interface exposes embodiment-agnostic high-level intents from the MLLM to the policy, decoupling reasoning from low-level control. On our benchmarks, the single set of BLM<sub>0</sub> weights outperforms representative MLLMs, ELLMs, VLA models, and general multimodal large models, improving digital-space reasoning by $\sim\!\textbf{6\%}$ and physical control by $\sim\!\textbf{3\%}$ without model switching. To our knowledge, our evaluation suite is the first to fix task semantics while systematically varying embodiments to assess cross-embodiment generalization.
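The intent-bridging interface described above can be pictured as a thin layer between the frozen MLLM and a per-embodiment policy head. The sketch below is purely illustrative, not the actual BLM<sub>0</sub> API: the names `Intent`, `bridge`, and `PolicyHead`, and the intent schema, are all assumptions made for exposition.

```python
# Hypothetical sketch of the intent-bridging interface: the MLLM emits an
# embodiment-agnostic high-level intent, and a per-embodiment policy head
# turns it into low-level actions. Names and schema are illustrative only.
from dataclasses import dataclass


@dataclass
class Intent:
    verb: str          # e.g. "pick", "place"
    target: str        # object referred to in the instruction
    goal_pose: tuple   # embodiment-agnostic goal in the workspace frame


def bridge(mllm_output: dict) -> Intent:
    """Convert structured MLLM output into an embodiment-agnostic intent."""
    return Intent(
        verb=mllm_output["verb"],
        target=mllm_output["target"],
        goal_pose=tuple(mllm_output["goal_pose"]),
    )


class PolicyHead:
    """Stand-in for the diffusion policy head; one instance per embodiment."""

    def __init__(self, dof: int):
        self.dof = dof  # e.g. 7 for Franka Emika Panda, 6 for xArm-6

    def act(self, intent: Intent) -> list:
        # A real head would denoise an action chunk conditioned on the
        # intent; here we just return a zero action of matching size.
        return [0.0] * self.dof


intent = bridge({"verb": "pick", "target": "red cube", "goal_pose": [0.4, 0.0, 0.2]})
action = PolicyHead(dof=7).act(intent)
```

Because the intent carries no joint-space information, the same MLLM output can drive heads with different degrees of freedom, which is the decoupling the paper attributes to this interface.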
## 🚀 Features
- Achieves cross-space transfer, cross-task learning, and cross-embodiment generalization within a unified model.
- Migrates seamlessly to cross-embodiment robot control while retaining native instruction-following capability.
- A single model covers multiple embodiments, enabling cross-embodiment knowledge sharing and consistent control.
- BLM-0 surpasses same-scale state-of-the-art methods across spatial understanding, spatial reasoning, and spatial execution benchmarks.
## 🗞️ News
- **`2025-09-25`**: 🤗 The [BLM-0 7B](https://huggingface.co/BLM-Lab/BLM-0) model checkpoint has been released on Hugging Face.
## 🛠️ Setup
```bash
# Create and activate the conda environment
conda create -n BLM python=3.10
conda activate BLM
pip install -r requirements.txt
```
## ⭐️ Inference
Install and launch vLLM:
```bash
# Install the vLLM package
pip install vllm
# Serve BLM-0 with vLLM
vllm serve ./model \
  --port 8000 \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 128000 \
  --served-model-name BLM-0
```
Example Python client:
```python
from openai import OpenAI
import base64

openai_api_base = "http://127.0.0.1:8000/v1"
openai_api_key = "EMPTY"  # vLLM does not validate the API key by default

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is in the picture?"
image = "./test.png"

# Encode the image as a base64 data URL
with open(image, "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")
base64_img = f"data:image/png;base64,{encoded_image}"

response = client.chat.completions.create(
    model="BLM-0",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": base64_img}},
                {"type": "text", "text": prompt},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```
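The chat endpoint returns plain text, so downstream robot control needs the reply parsed into a structured intent. The helper below is a hedged sketch of one way to do that, assuming the model is prompted to embed a JSON object in its reply; the schema and the `parse_intent` name are illustrative, not the actual BLM-0 output format.

```python
# Hypothetical helper: extract the first JSON object embedded in a model
# reply. Assumes the prompt asked the model to answer with a JSON intent;
# the schema shown here is illustrative only.
import json


def parse_intent(reply: str) -> dict:
    """Return the JSON object spanning the outermost braces in `reply`."""
    start = reply.index("{")
    end = reply.rindex("}") + 1
    return json.loads(reply[start:end])


reply = 'Sure. {"verb": "pick", "target": "red cube"}'
intent = parse_intent(reply)
```

A production system would validate the parsed object against a schema and fall back gracefully when the reply contains no well-formed JSON.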
## 🤖 Evaluation
### Comparison with existing MLLMs and GMLMs on digital-space benchmarks
<div align="center">
<img src="images/digital-space.png" />
</div>

### Comparison with existing VLAs on robot benchmarks
<div align="center">
<img src="images/VLA.png" />
</div>
**†** denotes training one independent model per robot (four models), each evaluated across the six tasks.
**★** denotes training one independent model per robot-task pair (24 models in total), each evaluated on its corresponding task.
## 📑 Citation
If you find this project useful, please consider citing our paper.
```bib
@article{,
  title={},
  author={},
  journal={},
  year={2025}
}
```