# BLM<sub>0</sub>: A Boundless Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
<p align="center">
⭐️ <a href="https://boundless-large-model.github.io">Project</a>     🤗 <a href="https://huggingface.co/BLM-Lab/BLM-0">Hugging Face</a>     📑 <a href="http://arxiv.org/abs/2502.21257">Paper</a>
</p>
## 🔥 Overview
We present **Boundless Large Model** (BLM<sub>0</sub>), a multimodal spatial foundation model that preserves the native instruction-following and reasoning ability of MLLMs while acquiring effective robotic control. We formalize three requirements for generalist agents—cross-space transfer (digital→physical), cross-task learning, and cross-embodiment generalization—and instantiate them with a two-stage training pipeline. Stage I performs supervised fine-tuning on large-scale digital-space understanding and reasoning corpora to inject embodied perception and spatial knowledge without degrading the underlying language capabilities. Stage II freezes the MLLM backbone and trains a diffusion-based policy head on a self-collected cross-embodiment demonstration suite spanning Franka Emika Panda, xArm-6, xArm-7, and WidowX AI over six increasingly challenging tasks; demonstrations are generated in ManiSkill to ensure collision-free, time-parameterized trajectories. A simple intent-bridging interface exposes embodiment-agnostic high-level intents from the MLLM to the policy, decoupling reasoning from low-level control. On our benchmarks, the single set of BLM<sub>0</sub> weights outperforms representative MLLMs, ELLMs, VLA models, and general multimodal large models, improving digital-space reasoning by $\sim\!\textbf{6\%}$ and physical control by $\sim\!\textbf{3\%}$ without model switching. To our knowledge, our evaluation suite is the first to fix task semantics while systematically varying embodiments to assess cross-embodiment generalization.
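The intent-bridging interface described above can be pictured as a thin layer between the frozen MLLM and a per-embodiment policy head. The sketch below is purely illustrative, not the actual BLM<sub>0</sub> API: the names `Intent`, `bridge`, and `PolicyHead`, and the intent schema, are all assumptions made for exposition.

```python
# Hypothetical sketch of the intent-bridging interface: the MLLM emits an
# embodiment-agnostic high-level intent, and a per-embodiment policy head
# turns it into low-level actions. Names and schema are illustrative only.
from dataclasses import dataclass


@dataclass
class Intent:
    verb: str          # e.g. "pick", "place"
    target: str        # object referred to in the instruction
    goal_pose: tuple   # embodiment-agnostic goal in the workspace frame


def bridge(mllm_output: dict) -> Intent:
    """Convert structured MLLM output into an embodiment-agnostic intent."""
    return Intent(
        verb=mllm_output["verb"],
        target=mllm_output["target"],
        goal_pose=tuple(mllm_output["goal_pose"]),
    )


class PolicyHead:
    """Stand-in for the diffusion policy head; one instance per embodiment."""

    def __init__(self, dof: int):
        self.dof = dof  # e.g. 7 for Franka Emika Panda, 6 for xArm-6

    def act(self, intent: Intent) -> list:
        # A real head would denoise an action chunk conditioned on the
        # intent; here we just return a zero action of matching size.
        return [0.0] * self.dof


intent = bridge({"verb": "pick", "target": "red cube", "goal_pose": [0.4, 0.0, 0.2]})
action = PolicyHead(dof=7).act(intent)
```

Because the intent carries no joint-space information, the same MLLM output can drive heads with different degrees of freedom, which is the decoupling the paper attributes to this interface.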
## 🚀 Features
- Achieves cross-space transfer, cross-task learning, and cross-embodiment generalization within a unified model.
- Migrates seamlessly to cross-embodiment robot control while retaining native instruction-following capability.
- A single model covers multiple embodiments, enabling cross-embodiment knowledge sharing and consistent control.
- BLM-0 surpasses same-scale state-of-the-art methods across spatial understanding, spatial reasoning, and spatial execution benchmarks.
## 🗞️ News
- **`2025-09-25`**: 🤗 The [BLM-0 7B](https://huggingface.co/BLM-Lab/BLM-0) model checkpoint has been released on Hugging Face.
## 🛠️ Setup
```bash
# Create and activate the conda environment
conda create -n BLM python=3.10
conda activate BLM
pip install -r requirements.txt
```
## ⭐️ Inference
Install and launch vLLM:
```bash
# Install the vLLM package
pip install vllm
# Serve BLM-0 with vLLM
vllm serve ./model \
  --port 8000 \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 128000 \
  --served-model-name BLM-0
```
Example Python client:
```python
from openai import OpenAI
import base64

openai_api_base = "http://127.0.0.1:8000/v1"
openai_api_key = "EMPTY"  # vLLM does not validate the API key by default

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is in the picture?"
image = "./test.png"

# Encode the image as a base64 data URL
with open(image, "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")
base64_img = f"data:image/png;base64,{encoded_image}"

response = client.chat.completions.create(
    model="BLM-0",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": base64_img}},
                {"type": "text", "text": prompt},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```
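The chat endpoint returns plain text, so downstream robot control needs the reply parsed into a structured intent. The helper below is a hedged sketch of one way to do that, assuming the model is prompted to embed a JSON object in its reply; the schema and the `parse_intent` name are illustrative, not the actual BLM-0 output format.

```python
# Hypothetical helper: extract the first JSON object embedded in a model
# reply. Assumes the prompt asked the model to answer with a JSON intent;
# the schema shown here is illustrative only.
import json


def parse_intent(reply: str) -> dict:
    """Return the JSON object spanning the outermost braces in `reply`."""
    start = reply.index("{")
    end = reply.rindex("}") + 1
    return json.loads(reply[start:end])


reply = 'Sure. {"verb": "pick", "target": "red cube"}'
intent = parse_intent(reply)
```

A production system would validate the parsed object against a schema and fall back gracefully when the reply contains no well-formed JSON.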
## 🤖 Evaluation
### Comparison with existing MLLMs and GMLMs on digital-space benchmarks
<div align="center">
<img src="images/digital-space.png" />
</div>

### Comparison with existing VLAs on robot benchmarks
<div align="center">
<img src="images/VLA.png" />
</div>
**†** denotes training one independent model per robot (four models), each evaluated across the six tasks.
**★** denotes training one independent model per robot-task pair (24 models in total), each evaluated on its corresponding task.
## 📑 Citation
If you find this project useful, please consider citing our paper.
```bib
@article{,
  title={},
  author={},
  journal={},
  year={2025}
}
```