---
base_model:
- ACE-Brain/ACE-Brain-0-8B
library_name: transformers
license: mit
---
<div align="center">
<img src="./assets/title.png" width=600>
</div>
<br/>
<div align="center" style="line-height: 1;">
|
<a href="https://huggingface.co/ACE-Brain/ACE-Brain-8B" target="_blank">🤗 HuggingFace</a>
|
<a href="https://ACE-Brain-Team.github.io/ACE-Brain-0/" target="_blank"> 📁 Project Page</a>
|
<a href="https://arxiv.org/abs/2603.03198v1" target="_blank">📔 Technical Report</a>
|
<a href="https://github.com/ACE-BRAIN/ACE-Brain" target="_blank"> 🤖 Github</a>
|
<br/>
</div>
## Overview
**ACE-Brain-0** is a generalist multimodal foundation model designed to unify perception, reasoning, and decision-making across diverse embodied domains, including **spatial cognition**, **autonomous driving**, **low-altitude sensing**, and **embodied interaction**. Built upon a unified multimodal large language model (MLLM) architecture, ACE-Brain-0 learns a shared spatial reasoning substrate that enables generalization across heterogeneous physical environments and agent embodiments.
Extensive evaluation across **24** benchmarks demonstrates that ACE-Brain achieves state-of-the-art or competitive performance across multiple domains, validating its effectiveness as a unified embodied intelligence model.
<div align="center">
<img src="./assets/teaser.png" width=800>
</div>
## Key Features
- Unified multimodal foundation model for embodied intelligence
- Strong spatial reasoning as a universal intelligence scaffold
- Supports diverse embodiment platforms:
- Spatial Cognition
- Autonomous Driving
- Low-Altitude Sensing
- Embodied Interaction
- Cross-domain generalization across perception, reasoning, and planning
## Performance Highlights
ACE-Brain achieves strong performance across **24 benchmarks covering Spatial Cognition, Autonomous Driving, Low-Altitude Sensing, and Embodied Interaction**, consistently outperforming existing open-source embodied VLMs and remaining competitive with closed-source models.
The model shows robust capability in **spatial reasoning, physical interaction understanding, task-oriented decision-making, and dynamic scene interpretation**, enabling reliable performance across diverse real-world embodiment scenarios.
In driving and aerial domains, ACE-Brain demonstrates excellent performance in **environment understanding, motion reasoning, and planning-aware prediction**, highlighting its effectiveness in complex, large-scale, and safety-critical environments.
Despite its embodied-domain specialization, ACE-Brain maintains strong general multimodal reasoning ability, confirming that spatial-intelligence-based training enhances overall vision-language capability rather than limiting generalization.
### Spatial Benchmarks
<div align="center">
<img src="./assets/table1.png" width=800>
</div>
### Autonomous Driving Benchmarks
<div align="center">
<img src="./assets/table2.png" width=800>
</div>
### Low-Altitude Benchmarks
<div align="center">
<img src="./assets/table3.png" width=800>
</div>
### Embodied Benchmarks
<div align="center">
<img src="./assets/table4.png" width=800>
</div>
> **Bold** numbers indicate the best results, <u>underlined</u> numbers indicate the second-best results, and results marked with \* are obtained using our evaluation framework.
## Inference Example
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "ACE-Brain/ACE-Brain-0-8B", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("ACE-Brain/ACE-Brain-0-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
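The trimming step above is needed because `model.generate` returns each sequence as the prompt tokens followed by the newly generated tokens; slicing off the first `len(input_ids)` entries leaves only the model's answer. A minimal sketch of that step with hypothetical toy token ids (no model download required):

```python
# Hypothetical prompt token ids for a batch of two sequences
input_ids = [[101, 7, 8], [101, 9]]
# model.generate echoes the prompt, then appends new tokens
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 55]]

# Same trimming logic as in the inference example above:
# drop the prompt prefix from every generated sequence
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[42, 43], [55]]
```

Decoding `trimmed` with `processor.batch_decode` then yields only the generated text, without the user prompt repeated at the front.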
## Citation
```bibtex
@misc{gong2026acebrain0spatialintelligenceshared,
title={ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments},
author={Ziyang Gong and Zehang Luo and Anke Tang and Zhe Liu and Shi Fu and Zhi Hou and Ganlin Yang and Weiyun Wang and Xiaofeng Wang and Jianbo Liu and Gen Luo and Haolan Kang and Shuang Luo and Yue Zhou and Yong Luo and Li Shen and Xiaosong Jia and Yao Mu and Xue Yang and Chunxiao Liu and Junchi Yan and Hengshuang Zhao and Dacheng Tao and Xiaogang Wang},
year={2026},
eprint={2603.03198},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2603.03198},
}
``` |