---
base_model:
- ACE-Brain/ACE-Brain-0-8B
library_name: transformers
license: mit
---
<div align="center">
<img src="./assets/title.png" width=600>
</div>
<br/>
<div align="center" style="line-height: 1;">
|
<a href="https://huggingface.co/ACE-Brain/ACE-Brain-8B" target="_blank">🤗 HuggingFace</a>
|
<a href="https://ACE-Brain-Team.github.io/ACE-Brain-0/" target="_blank"> 📁 Project Page</a>
|
<a href="https://arxiv.org/abs/2603.03198v1" target="_blank">📔 Technical Report</a>
|
<a href="https://github.com/ACE-BRAIN/ACE-Brain" target="_blank"> 🤖 Github</a>
|
<br/>
</div>
## Overview
**ACE-Brain-0** is a generalist multimodal foundation model designed to unify perception, reasoning, and decision-making across diverse embodied domains, including **spatial cognition**, **autonomous driving**, **low-altitude sensing**, and **embodied interaction**. Built upon a unified multimodal large language model (MLLM) architecture, ACE-Brain-0 learns a shared spatial reasoning substrate that enables generalization across heterogeneous physical environments and agent embodiments.
Extensive evaluation across **24** benchmarks demonstrates that ACE-Brain achieves state-of-the-art or competitive performance across multiple domains, validating its effectiveness as a unified embodied intelligence model.
<div align="center">
<img src="./assets/teaser.png" width=800>
</div>
## Key Features
- Unified multimodal foundation model for embodied intelligence
- Strong spatial reasoning as a universal intelligence scaffold
- Supports diverse embodiment platforms:
- Spatial Cognition
- Autonomous Driving
- Low-Altitude Sensing
- Embodied Interaction
- Cross-domain generalization across perception, reasoning, and planning
## Performance Highlights
ACE-Brain achieves strong performance across **24 benchmarks covering Spatial Cognition, Autonomous Driving, Low-Altitude Sensing, and Embodied Interaction**, consistently outperforming existing open-source embodied VLMs and remaining competitive with closed-source models.
The model shows robust capability in **spatial reasoning, physical interaction understanding, task-oriented decision-making, and dynamic scene interpretation**, enabling reliable performance across diverse real-world embodiment scenarios.
In driving and aerial domains, ACE-Brain demonstrates excellent performance in **environment understanding, motion reasoning, and planning-aware prediction**, highlighting its effectiveness in complex, large-scale, and safety-critical environments.
Despite its embodied-domain specialization, ACE-Brain maintains strong general multimodal reasoning ability, confirming that spatial-intelligence-based training enhances overall vision-language capability rather than limiting generalization.
### Spatial Benchmarks
<div align="center">
<img src="./assets/table1.png" width=800>
</div>
### Autonomous Driving Benchmarks
<div align="center">
<img src="./assets/table2.png" width=800>
</div>
### Low-Altitude Benchmarks
<div align="center">
<img src="./assets/table3.png" width=800>
</div>
### Embodied Benchmarks
<div align="center">
<img src="./assets/table4.png" width=800>
</div>
> **Bold** numbers indicate the best results, <u>underlined</u> numbers indicate the second-best results, and results marked with \* are obtained using our evaluation framework.
## Inference Example
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "ACE-Brain/ACE-Brain-0-8B", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("ACE-Brain/ACE-Brain-0-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
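The trimming step above is needed because `model.generate` returns each sequence as the prompt tokens followed by the newly generated tokens; slicing off the first `len(input_ids)` entries leaves only the model's answer. A minimal sketch of that step with hypothetical toy token ids (no model download required):

```python
# Hypothetical prompt token ids for a batch of two sequences
input_ids = [[101, 7, 8], [101, 9]]
# model.generate echoes the prompt, then appends new tokens
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 55]]

# Same trimming logic as in the inference example above:
# drop the prompt prefix from every generated sequence
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[42, 43], [55]]
```

Decoding `trimmed` with `processor.batch_decode` then yields only the generated text, without the user prompt repeated at the front.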
## Citation
```bibtex
@misc{gong2026acebrain0spatialintelligenceshared,
title={ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments},
author={Ziyang Gong and Zehang Luo and Anke Tang and Zhe Liu and Shi Fu and Zhi Hou and Ganlin Yang and Weiyun Wang and Xiaofeng Wang and Jianbo Liu and Gen Luo and Haolan Kang and Shuang Luo and Yue Zhou and Yong Luo and Li Shen and Xiaosong Jia and Yao Mu and Xue Yang and Chunxiao Liu and Junchi Yan and Hengshuang Zhao and Dacheng Tao and Xiaogang Wang},
year={2026},
eprint={2603.03198},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2603.03198},
}
``` |