# BLM<sub>0</sub>: A Boundless Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning



<p align="center">
        ⭐️ <a href="https://boundless-large-model.github.io">Project</a>&nbsp;&nbsp;&nbsp;&nbsp;🤗 <a href="https://huggingface.co/BLM-Lab/BLM-0">Hugging Face</a>&nbsp;&nbsp;&nbsp;&nbsp;📑 <a href="http://arxiv.org/abs/2502.21257">Paper</a>
</p>

## 🔥 Overview
We present **Boundless Large Model** (BLM<sub>0</sub>), a multimodal spatial foundation model that preserves the native instruction-following and reasoning ability of MLLMs while acquiring effective robotic control. We formalize three requirements for generalist agents—cross-space transfer (digital→physical), cross-task learning, and cross-embodiment generalization—and instantiate them with a two-stage training pipeline. Stage I performs supervised fine-tuning on large-scale digital-space understanding and reasoning corpora to inject embodied perception and spatial knowledge without degrading the underlying language capabilities. Stage II freezes the MLLM backbone and trains a diffusion-based policy head on a self-collected cross-embodiment demonstration suite spanning Franka Emika Panda, xArm-6, xArm-7, and WidowX AI over six increasingly challenging tasks; demonstrations are generated in ManiSkill to ensure collision-free, time-parameterized trajectories. A simple intent-bridging interface exposes embodiment-agnostic high-level intents from the MLLM to the policy, decoupling reasoning from low-level control. On our benchmarks, the single set of BLM<sub>0</sub> weights outperforms representative MLLMs, ELLMs, VLA models, and general multimodal large models, improving digital-space reasoning by $\sim\!\textbf{6\%}$ and physical control by $\sim\!\textbf{3\%}$ without model switching. To our knowledge, our evaluation suite is the first to fix task semantics while systematically varying embodiments to assess cross-embodiment generalization.
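The intent-bridging interface described above can be sketched as follows. This is a minimal illustrative mock-up, not the released API: the names (`Intent`, `bridge_intent`, `policy_head`) and the intent wire format are assumptions; in the actual model the frozen MLLM produces the intent and a diffusion policy consumes it.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Intent:
    # Embodiment-agnostic high-level intent emitted by the (frozen) MLLM.
    action: str            # e.g. "pick", "place"
    target: str            # object named in the instruction
    waypoint: List[float]  # coarse 3D goal in the scene frame

def bridge_intent(mllm_output: str) -> Intent:
    # Hypothetical parser: assumes the MLLM emits a compact
    # "action|target|x,y,z" string as its high-level intent.
    action, target, xyz = mllm_output.split("|")
    return Intent(action, target, [float(v) for v in xyz.split(",")])

def policy_head(intent: Intent, embodiment: str) -> dict:
    # Stand-in for the diffusion policy head: only this side sees the
    # embodiment tag, so reasoning stays decoupled from low-level control.
    return {"embodiment": embodiment, "skill": intent.action, "goal": intent.waypoint}

cmd = policy_head(bridge_intent("pick|red_cube|0.4,0.1,0.2"), "xArm-7")
print(cmd["skill"], cmd["goal"])
```

The same `Intent` can be routed to any of the supported embodiments by changing only the tag passed to the policy, which is the decoupling the interface is meant to provide.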




## 🚀 Features
- Achieves cross-space transfer, cross-task learning, and cross-embodiment generalization within a single unified model.  
- Migrates seamlessly to cross-embodiment robot control while retaining native instruction-following capability.  
- One set of weights covers multiple embodiments, enabling cross-embodiment knowledge sharing and consistent control.  
- Surpasses same-scale SOTA methods in comprehensive performance across spatial understanding, spatial reasoning, and spatial execution benchmarks.  


## 🗞️ News
- **`2025-09-25`**: 🤗 The [BLM-0 7B](https://huggingface.co/BLM-Lab/BLM-0) model checkpoint has been released on Hugging Face.


## 🛠️ Setup 

```bash
# Create and activate the conda environment
conda create -n BLM python=3.10
conda activate BLM
pip install -r requirements.txt
```

## ⭐️ Inference


Install and launch vLLM:
```bash
# Install the vllm package
pip install vllm

# Serve BLM-0 with vLLM
vllm serve ./model  \
--port 8000 \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 128000 \
--served-model-name BLM-0
```

Then query the server with the OpenAI-compatible Python client:
```python
from openai import OpenAI
import base64

openai_api_base = "http://127.0.0.1:8000/v1"
openai_api_key = "empty" 

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is in the picture?"
image = "./test.png"

# Encode the image as a base64 data URI (the example uses a PNG)
with open(image, "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")
    base64_img = f"data:image/png;base64,{encoded_image}"

response = client.chat.completions.create(
    model="BLM-0",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": base64_img}},
                {"type": "text", "text": prompt},
            ],
        },
    ]
)

print(response.choices[0].message.content)
```


## 🤖 Evaluation

### Comparison with existing MLLMs and GMLMs on digital-space benchmarks
<div align="center">
<img src="images/digital-space.png" />
</div>

### Comparison with existing VLAs on robot benchmarks

<div align="center">
<img src="images/VLA.png" />
</div>

**†** denotes training an independent model per robot (four models), each evaluated across six tasks.
**★** denotes training an independent model for each task-robot pair (six tasks × four robots, 24 models in total), each evaluated on its corresponding task and robot.

## 📑 Citation
If you find this project useful, please consider citing our paper.
```bib
@article{,
  title={},
  author={},
  journal={},
  year={2025}
}
```