BLM0: A Boundless Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
⭐️ Project 🤗 Hugging Face 📑 Paper
🔥 Overview
We present Boundless Large Model (BLM-0), a multimodal spatial foundation model that preserves the native instruction-following and reasoning ability of MLLMs while acquiring effective robotic control. We formalize three requirements for generalist agents—cross-space transfer (digital→physical), cross-task learning, and cross-embodiment generalization—and instantiate them with a two-stage training pipeline. Stage I performs supervised fine-tuning on large-scale digital-space understanding and reasoning corpora to inject embodied perception and spatial knowledge without degrading the underlying language capabilities. Stage II freezes the MLLM backbone and trains a diffusion-based policy head on a self-collected cross-embodiment demonstration suite spanning Franka Emika Panda, xArm-6, xArm-7, and WidowX AI over six increasingly challenging tasks; demonstrations are generated in ManiSkill to ensure collision-free, time-parameterized trajectories. A simple intent-bridging interface exposes embodiment-agnostic high-level intents from the MLLM to the policy, decoupling reasoning from low-level control. On our benchmarks, the single set of BLM-0 weights outperforms representative MLLMs, ELLMs, VLA models, and general multimodal large models, improving digital-space reasoning by ~6% and physical control by ~3% without model switching. To our knowledge, our evaluation suite is the first to fix task semantics while systematically varying embodiments to assess cross-embodiment generalization.
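The decoupling above can be illustrated with a minimal sketch of the intent-bridging idea: the frozen MLLM emits an embodiment-agnostic high-level intent, and a per-embodiment policy head maps it to low-level commands. Every class, field, and function name below is hypothetical (the paper does not publish this API); in the real system the policy head is a trained diffusion model, not the padding stub shown here.

```python
# Hypothetical sketch of the intent-bridging interface; names are illustrative,
# not the BLM-0 API.
from dataclasses import dataclass
from typing import List


@dataclass
class Intent:
    """Embodiment-agnostic output of the MLLM (e.g. 'grasp the red cube')."""
    action: str            # high-level verb, e.g. "grasp", "place"
    target: str            # object referenced by the instruction
    waypoint: List[float]  # coarse 3D goal in a shared task frame


class PolicyHead:
    """Stand-in for the diffusion policy head; one instance per embodiment."""

    def __init__(self, dof: int):
        self.dof = dof  # degrees of freedom differ across embodiments

    def act(self, intent: Intent) -> List[float]:
        # A real head would denoise an action chunk conditioned on the intent;
        # here we just pad/truncate the waypoint to the embodiment's DoF.
        return (intent.waypoint + [0.0] * self.dof)[: self.dof]


# The same intent drives two different embodiments; the MLLM never changes.
intent = Intent(action="grasp", target="red cube", waypoint=[0.3, 0.1, 0.2])
panda = PolicyHead(dof=7)   # e.g. Franka Emika Panda
widowx = PolicyHead(dof=6)  # e.g. WidowX
print(len(panda.act(intent)), len(widowx.act(intent)))
```

The point of the interface is that the intent carries no embodiment-specific details, so the reasoning side and the control side can be trained and swapped independently.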
🚀 Features
- Achieve cross-space transfer, cross-task learning, and cross-embodiment generalization within a unified model.
- Seamlessly migrate to cross-embodiment robot control while retaining native instruction-following capability.
- A single model covers multiple embodiments, enabling cross-embodiment knowledge sharing and consistent control.
- BLM-0 surpasses same-scale SOTA methods in comprehensive performance across spatial understanding, spatial reasoning, and spatial execution benchmarks.
🗞️ News
2025-09-25: 🤗 The BLM-0 7B model checkpoint has been released on Hugging Face.
🛠️ Setup
```bash
# Build the conda env.
conda create -n BLM python=3.10
conda activate BLM
pip install -r requirements.txt
```
⭐️ Inference
Install and launch vLLM:
```bash
# Install the vllm package
pip install vllm

# Launch BLM-0 with vLLM
vllm serve ./model \
  --port 8000 \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 128000 \
  --served-model-name BLM-0
```
Then query the served model with the following Python script:
```python
from openai import OpenAI
import base64

openai_api_base = "http://127.0.0.1:8000/v1"
openai_api_key = "empty"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is in the picture?"
image = "./test.png"

# Encode the image as a base64 data URL.
with open(image, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image = encoded_image.decode("utf-8")
base64_img = f"data:image/png;base64,{encoded_image}"

response = client.chat.completions.create(
    model="BLM-0",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": base64_img}},
                {"type": "text", "text": prompt},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```
🤖 Evaluation
Comparison with existing MLLMs and GMLMs on digital-space benchmarks
Comparison with existing VLAs on robot benchmarks
† denotes training one model per robot (four models in total), with each model evaluated across all six tasks. ★ denotes training one model per robot–task pair (24 models in total), with each model evaluated on its corresponding task.
📑 Citation
If you find this project useful, please consider citing our paper.
```bibtex
@article{,
  title={},
  author={},
  journal={},
  year={2025}
}
```