## Key Features

- Unified multimodal foundation model for embodied intelligence
- Strong spatial reasoning as a universal intelligence scaffold
- Supports diverse embodiment platforms:
- Cross-domain generalization across perception, reasoning, and planning
- Evaluated on 24 real-world embodied intelligence benchmarks

## Performance Highlights

ACE-Brain achieves strong performance across **24 benchmarks covering Spatial Intelligence, Embodied Interaction, Autonomous Driving, and Low-Altitude Sensing**, consistently outperforming existing open-source embodied VLMs and remaining competitive with closed-source models.

The model shows robust capability in **spatial reasoning, physical interaction understanding, task-oriented decision-making, and dynamic scene interpretation**, enabling reliable performance across diverse real-world embodiment scenarios.

Despite its domain specialization, ACE-Brain maintains strong general multimodal capabilities.

<div align="center">
<img src="./assets/table4.png" width=800>
</div>

> **Bold** numbers indicate the best results, <u>underlined</u> numbers indicate the second-best results, and results marked with \* are obtained using our evaluation framework.
|
|
|
|
| 38 |
|
| 39 |
## Key Features
|
| 40 |
|
|
|
|
| 41 |
- Unified multimodal foundation model for embodied intelligence
|
| 42 |
- Strong spatial reasoning as a universal intelligence scaffold
|
| 43 |
- Supports diverse embodiment platforms:
|
|
|
|
| 48 |
- Cross-domain generalization across perception, reasoning, and planning
|
| 49 |
- Evaluated on 24 real-world embodied intelligence benchmarks
|
| 50 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 51 |
## Performance Highlights
|
| 52 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
ACE-Brain achieves strong performance across **24 benchmarks covering Spatial Intelligence, Embodied Interaction, Autonomous Driving, and Low-Altitude Sensing**, consistently outperforming existing open-source embodied VLMs and remaining competitive with closed-source models.
|
| 54 |
|
| 55 |
The model shows robust capability in **spatial reasoning, physical interaction understanding, task-oriented decision-making, and dynamic scene interpretation**, enabling reliable performance across diverse real-world embodiment scenarios.
|
|
|
|
| 85 |
<img src="./assets/table4.png" width=800>
|
| 86 |
</div>
|
| 87 |
|
| 88 |
+
> **Bold** numbers indicate the best results, <u>underlined</u> numbers indicate the second-best results, and results marked with \* are obtained using our evaluation framework.
|
| 89 |
|
| 90 |
+
## Inference Example

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "ACE-Brain/ACE-Brain-8B", dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("ACE-Brain/ACE-Brain-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate the output, then strip the echoed prompt tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
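
The prompt-trimming idiom in the inference example (`out_ids[len(in_ids):]`) works because `generate` returns each prompt followed by its newly generated tokens. A minimal, model-free sketch with made-up token ids (the id values below are hypothetical, chosen only for illustration):

```python
# Hypothetical token-id sequences standing in for real tokenizer/model output.
input_ids = [[101, 7592, 102], [101, 2088, 3899, 102]]                        # two prompts
generated_ids = [[101, 7592, 102, 9001, 9002], [101, 2088, 3899, 102, 9003]]  # prompt + new tokens

# Same trimming idiom as above: slice off the echoed prompt, keep only new tokens.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[9001, 9002], [9003]]
```

Decoding only the trimmed ids is what keeps the user's prompt out of `output_text`.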

## Citation