BLM0: A Boundless Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
⭐️ Project 🤗 Hugging Face 📑 Paper
🔥 Overview
We present Boundless Large Model (BLM-0), a multimodal spatial foundation model that preserves the native instruction-following and reasoning ability of MLLMs while acquiring effective robotic control. We formalize three requirements for generalist agents—cross-space transfer (digital→physical), cross-task learning, and cross-embodiment generalization—and instantiate them with a two-stage training pipeline. Stage I performs supervised fine-tuning on large-scale digital-space understanding and reasoning corpora to inject embodied perception and spatial knowledge without degrading the underlying language capabilities. Stage II freezes the MLLM backbone and trains a diffusion-based policy head on a self-collected cross-embodiment demonstration suite spanning Franka Emika Panda, xArm-6, xArm-7, and WidowX AI over six increasingly challenging tasks; demonstrations are generated in ManiSkill to ensure collision-free, time-parameterized trajectories. A simple intent-bridging interface exposes embodiment-agnostic high-level intents from the MLLM to the policy, decoupling reasoning from low-level control. On our benchmarks, the single set of BLM-0 weights outperforms representative MLLMs, ELLMs, VLA models, and general multimodal large models, improving digital-space reasoning by ~6% and physical control by ~3% without model switching. To our knowledge, our evaluation suite is the first to fix task semantics while systematically varying embodiments to assess cross-embodiment generalization.
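The decoupling above can be illustrated with a minimal sketch of the intent-bridging idea: the frozen MLLM emits an embodiment-agnostic high-level intent, and a per-embodiment policy head maps it to low-level commands. Every class, field, and function name below is hypothetical (the paper does not publish this API); in the real system the policy head is a trained diffusion model, not the padding stub shown here.

```python
# Hypothetical sketch of the intent-bridging interface; names are illustrative,
# not the BLM-0 API.
from dataclasses import dataclass
from typing import List


@dataclass
class Intent:
    """Embodiment-agnostic output of the MLLM (e.g. 'grasp the red cube')."""
    action: str            # high-level verb, e.g. "grasp", "place"
    target: str            # object referenced by the instruction
    waypoint: List[float]  # coarse 3D goal in a shared task frame


class PolicyHead:
    """Stand-in for the diffusion policy head; one instance per embodiment."""

    def __init__(self, dof: int):
        self.dof = dof  # degrees of freedom differ across embodiments

    def act(self, intent: Intent) -> List[float]:
        # A real head would denoise an action chunk conditioned on the intent;
        # here we just pad/truncate the waypoint to the embodiment's DoF.
        return (intent.waypoint + [0.0] * self.dof)[: self.dof]


# The same intent drives two different embodiments; the MLLM never changes.
intent = Intent(action="grasp", target="red cube", waypoint=[0.3, 0.1, 0.2])
panda = PolicyHead(dof=7)   # e.g. Franka Emika Panda
widowx = PolicyHead(dof=6)  # e.g. WidowX
print(len(panda.act(intent)), len(widowx.act(intent)))
```

The point of the interface is that the intent carries no embodiment-specific details, so the reasoning side and the control side can be trained and swapped independently.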
🚀 Features
- Achieve cross-space transfer, cross-task learning, and cross-embodiment generalization within a unified model.
- Seamlessly migrate to cross-embodiment robot control while retaining native instruction-following capability.
- A single model covers multiple embodiments, enabling cross-embodiment knowledge sharing and consistent control.
- BLM-0 surpasses same-scale SOTA methods in comprehensive performance across spatial understanding, spatial reasoning, and spatial execution benchmarks.
🗞️ News
2025-09-25: 🤗 The BLM-0 7B model checkpoint has been released on Hugging Face.
🛠️ Setup
```bash
# Build the conda env.
conda create -n BLM python=3.10
conda activate BLM
pip install -r requirements.txt
```
⭐️ Inference
Install and launch vLLM:
```bash
# Install the vllm package
pip install vllm

# Launch BLM-0 with vLLM
vllm serve ./model \
  --port 8000 \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 128000 \
  --served-model-name BLM-0
```
Then query the served model with the following Python script:
```python
from openai import OpenAI
import base64

openai_api_base = "http://127.0.0.1:8000/v1"
openai_api_key = "empty"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is in the picture?"
image = "./test.png"

# Encode the image as a base64 data URL.
with open(image, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image = encoded_image.decode("utf-8")
base64_img = f"data:image/png;base64,{encoded_image}"

response = client.chat.completions.create(
    model="BLM-0",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": base64_img}},
                {"type": "text", "text": prompt},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```
🤖 Evaluation
Comparison with existing MLLMs and GMLMs on digital-space benchmarks
Comparison with existing VLAs on robot benchmarks
† denotes training one model per robot (four models in total), with each model evaluated across all six tasks. ★ denotes training one model per robot–task pair (24 models in total), with each model evaluated on its corresponding task.
📑 Citation
If you find this project useful, please consider citing our paper.
```bibtex
@article{,
  title={},
  author={},
  journal={},
  year={2025}
}
```