# BLM<sub>0</sub>: A Boundless Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
<p align="center">
⭐️ <a href="https://boundless-large-model.github.io">Project</a>     🤗 <a href="https://huggingface.co/BLM-Lab/BLM-0">Hugging Face</a>     📑 <a href="http://arxiv.org/abs/2502.21257">Paper</a>
</p>
## 🔥 Overview
We present **Boundless Large Model** (BLM<sub>0</sub>), a multimodal spatial foundation model that preserves the native instruction-following and reasoning ability of MLLMs while acquiring effective robotic control. We formalize three requirements for generalist agents—cross-space transfer (digital→physical), cross-task learning, and cross-embodiment generalization—and instantiate them with a two-stage training pipeline. Stage I performs supervised fine-tuning on large-scale digital-space understanding and reasoning corpora to inject embodied perception and spatial knowledge without degrading the underlying language capabilities. Stage II freezes the MLLM backbone and trains a diffusion-based policy head on a self-collected cross-embodiment demonstration suite spanning Franka Emika Panda, xArm-6, xArm-7, and WidowX AI over six increasingly challenging tasks; demonstrations are generated in ManiSkill to ensure collision-free, time-parameterized trajectories. A simple intent-bridging interface exposes embodiment-agnostic high-level intents from the MLLM to the policy, decoupling reasoning from low-level control. On our benchmarks, the single set of BLM<sub>0</sub> weights outperforms representative MLLMs, ELLMs, VLA models, and general multimodal large models, improving digital-space reasoning by $\sim\!\textbf{6\%}$ and physical control by $\sim\!\textbf{3\%}$ without model switching. To our knowledge, our evaluation suite is the first to fix task semantics while systematically varying embodiments to assess cross-embodiment generalization.
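The intent-bridging interface above can be sketched in miniature as follows. This is a hedged illustration of the decoupling, not the released implementation: the names `HighLevelIntent`, `PolicyHead`, and all fields are assumptions for exposition only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HighLevelIntent:
    """Embodiment-agnostic intent emitted by the MLLM (illustrative only)."""
    action: str            # e.g. "pick" or "place"
    target_object: str     # object referenced by the instruction
    waypoint: List[float]  # goal position in the workspace frame


class PolicyHead:
    """Stand-in for the diffusion-based policy head. The real head maps
    intents plus observations to embodiment-specific low-level commands;
    here we only echo the intent to show the decoupled interface."""

    def __init__(self, embodiment: str):
        self.embodiment = embodiment

    def act(self, intent: HighLevelIntent) -> dict:
        return {
            "embodiment": self.embodiment,
            "action": intent.action,
            "goal": intent.waypoint,
        }


intent = HighLevelIntent(action="pick", target_object="red cube",
                         waypoint=[0.4, 0.1, 0.2])
command = PolicyHead("xArm-6").act(intent)
```

Because the intent carries no embodiment-specific detail, the same MLLM output can be consumed by policy heads for any of the supported robots.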
## 🚀 Features
- Achieve cross-space transfer, cross-task learning, and cross-embodiment generalization within a unified model.
- Seamlessly migrate to cross-embodiment robot control while retaining native instruction-following capability.
- A single model covers multiple embodiments, enabling cross-embodiment knowledge sharing and consistent control.
- BLM-0 surpasses same-scale SOTA methods in comprehensive performance across spatial understanding, spatial reasoning, and spatial execution benchmarks.
## 🗞️ News
- **`2025-09-25`**: 🤗 The [BLM-0 7B](https://huggingface.co/BLM-Lab/BLM-0) model checkpoint has been released on Hugging Face.
## 🛠️ Setup
```bash
# Create and activate the conda environment
conda create -n BLM python=3.10
conda activate BLM
pip install -r requirements.txt
```
## ⭐️ Inference
Install and launch vLLM:
```bash
# Install the vllm package
pip install vllm
# Serve BLM-0 with vLLM
vllm serve ./model \
--port 8000 \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 128000 \
--served-model-name BLM-0
```
Then run the following Python script as an example:
```python
from openai import OpenAI
import base64

openai_api_base = "http://127.0.0.1:8000/v1"
openai_api_key = "empty"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is in the picture?"
image = "./test.png"
with open(image, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image = encoded_image.decode("utf-8")
base64_img = f"data:image/png;base64,{encoded_image}"

response = client.chat.completions.create(
    model="BLM-0",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": base64_img}},
                {"type": "text", "text": prompt},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```
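If you send images frequently, the encoding step can be factored into a small helper. This is a minimal sketch: the function name `image_to_data_url` and the default `image/png` MIME subtype are our assumptions, so adjust the MIME type to match your file format.

```python
import base64


def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Read an image file and return a base64 data URL suitable for the
    `image_url` content field of the chat completions request."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed directly as `{"type": "image_url", "image_url": {"url": ...}}` in the message content.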
## 🤖 Evaluation
### Comparison with existing MLLMs and GMLMs on digital-space benchmarks
<div align="center">
<img src="images/digital-space.png" />
</div>
### Comparison with existing VLAs on robot benchmarks
<div align="center">
<img src="images/VLA.png" />
</div>
**†** denotes training an independent model for each of the four robots, with each model evaluated across six tasks.
**★** denotes training an independent model for each of the six tasks on each of the four robots (24 models in total), with each model evaluated on its corresponding task.
## 📑 Citation
If you find this project useful, please consider citing our paper.
```bibtex
@article{,
title={},
author={},
journal={},
year={2025}
}
```