Add model card for Embodied-R1.5
#1
by nielsr HF Staff - opened
README.md
ADDED
|
@@ -0,0 +1,55 @@
| 1 |
+
---
|
| 2 |
+
pipeline_tag: robotics
|
| 3 |
+
library_name: transformers
|
| 4 |
+
license: apache-2.0
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
|
| 8 |
+
|
| 9 |
+
Embodied-R1.5 is a unified Embodied Foundation Model (EFM), built on Qwen3-VL-8B-Instruct, that integrates comprehensive embodied reasoning—including spatial cognition, task planning, error correction, and pointing—within a single architecture.
|
| 10 |
+
|
| 11 |
+
- **Paper:** [Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models](https://huggingface.co/papers/2606.11324)
|
| 12 |
+
- **Project Page:** [https://embodied-r.github.io/](https://embodied-r.github.io/)
|
| 13 |
+
- **Repository:** [https://github.com/pickxiguapi/Embodied-R1.5](https://github.com/pickxiguapi/Embodied-R1.5)
|
| 14 |
+
|
| 15 |
+
## Model Description
|
| 16 |
+
Embodied-R1.5 achieves state-of-the-art performance on 16 out of 24 embodied VLM benchmarks. It introduces a **Planner-Grounder-Corrector (PGC)** closed-loop framework, allowing the model to autonomously execute and self-correct during long-horizon tasks. Despite having only 8B parameters, it demonstrates strong zero-shot generalization to real-world robots for instruction following, affordance grounding, and articulated object manipulation.
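
The closed-loop idea described above can be illustrated with a small, generic sketch. It is not the authors' implementation: `run_pgc_loop` and the four callables (`plan`, `ground`, `execute`, `correct`) are hypothetical names introduced here, and the retry policy is an assumption. The stubs in `demo_log` stand in for model and robot calls.

```python
from typing import Callable, List


def run_pgc_loop(
    plan: Callable[[str], List[str]],
    ground: Callable[[str], dict],
    execute: Callable[[dict], bool],
    correct: Callable[[str], str],
    instruction: str,
    max_retries: int = 2,
) -> List[str]:
    """Hypothetical loop: plan subtasks, ground each one, retry with a correction on failure."""
    log = []
    for subtask in plan(instruction):
        attempt = subtask
        for _ in range(max_retries + 1):
            target = ground(attempt)        # e.g. a point or region to act on
            if execute(target):             # True means the step succeeded
                log.append(f"done: {attempt}")
                break
            attempt = correct(attempt)      # revise the subtask after a failure
        else:
            log.append(f"failed: {subtask}")
    return log


def demo_log() -> List[str]:
    """Run the loop with stub components; the first execution attempt fails."""
    attempts = {"count": 0}

    def execute(target: dict) -> bool:
        attempts["count"] += 1
        return attempts["count"] != 1

    return run_pgc_loop(
        plan=lambda instr: ["open drawer", "pick cup"],
        ground=lambda sub: {"subtask": sub, "point": (0, 0)},
        execute=execute,
        correct=lambda sub: sub + " (retry)",
        instruction="put the cup in the drawer",
    )
```

In a real deployment each callable would presumably wrap a model query or a robot controller; the sketch only shows how planning, grounding, and correction could be chained.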
|
| 17 |
+
|
| 18 |
+
## Installation
|
| 19 |
+
|
| 20 |
+
```bash
|
| 21 |
+
pip install "transformers>=4.57.0" qwen-vl-utils vllm openai pillow
|
| 22 |
+
```
|
| 23 |
+
|
| 24 |
+
## Sample Usage
|
| 25 |
+
|
| 26 |
+
The following example runs local inference. It imports `HuggingFaceClient` from the inference utilities in the [official repository](https://github.com/pickxiguapi/Embodied-R1.5), so the repository must be cloned and importable first.
|
| 27 |
+
|
| 28 |
+
```python
|
| 29 |
+
from inference.hf_example import HuggingFaceClient
|
| 30 |
+
|
| 31 |
+
client = HuggingFaceClient(model_path="IffYuan/Embodied-R1.5", device_map="auto", dtype="auto")
|
| 32 |
+
|
| 33 |
+
case = {
|
| 34 |
+
"prompt": "How many table lamps are in the image? Select from the following choices.\n"
|
| 35 |
+
"(A) 0\n"
|
| 36 |
+
"(B) 2\n"
|
| 37 |
+
"(C) 1\n"
|
| 38 |
+
"(D) 3",
|
| 39 |
+
"image": "test_assets/sample_2_image.png",
|
| 40 |
+
"type": "single_image",
|
| 41 |
+
}
|
| 42 |
+
result = client.inference(case, max_new_tokens=512)
|
| 43 |
+
print(result["generated_text"])
|
| 44 |
+
```
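
The `case` dictionary in the example above can also be assembled programmatically, which avoids hand-writing the multiple-choice prompt. The following is a minimal sketch: `build_choice_case` is a hypothetical helper that is not part of the official repository, and it assumes the same `prompt`, `image`, and `type` keys used above.

```python
import string


def build_choice_case(question: str, choices: list, image: str) -> dict:
    """Build a single-image multiple-choice case (hypothetical helper)."""
    letters = string.ascii_uppercase
    if len(choices) > len(letters):
        raise ValueError("too many choices to label with A-Z")
    # First line is the question, then one "(X) choice" line per option.
    lines = [f"{question} Select from the following choices."]
    lines += [f"({letters[i]}) {choice}" for i, choice in enumerate(choices)]
    return {
        "prompt": "\n".join(lines),
        "image": image,
        "type": "single_image",
    }


case = build_choice_case(
    "How many table lamps are in the image?",
    [0, 2, 1, 3],
    "test_assets/sample_2_image.png",
)
```

The resulting dictionary can be passed to `client.inference` in the same way as the hand-written one.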
|
| 45 |
+
|
| 46 |
+
## Citation
|
| 47 |
+
|
| 48 |
+
```bibtex
|
| 49 |
+
@article{yuan2026embodiedr15,
|
| 50 |
+
title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
|
| 51 |
+
author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Li, Yutong and Zhang, Shuoheng and Han, Linqi and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Zhang, Zhao and Liu, Yuhao and Liao, Ruihao and Hu, Yucheng and Wu, Qiyu and Li, Yuxiao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang metal, Yi and Tang, Hongyao and Han, Han and Hao, Jianye},
|
| 52 |
+
journal={arXiv preprint},
|
| 53 |
+
year={2026}
|
| 54 |
+
}
|
| 55 |
+
```
|