Add model card for Embodied-R1.5

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +55 -0
README.md ADDED
@@ -0,0 +1,55 @@
+ ---
+ pipeline_tag: robotics
+ library_name: transformers
+ license: apache-2.0
+ ---
+
+ # Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
+
+ Embodied-R1.5 is a unified Embodied Foundation Model (EFM), built on Qwen3-VL-8B-Instruct, that integrates comprehensive embodied reasoning (spatial cognition, task planning, error correction, and pointing) within a single architecture.
+
+ - **Paper:** [Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models](https://huggingface.co/papers/2606.11324)
+ - **Project Page:** [https://embodied-r.github.io/](https://embodied-r.github.io/)
+ - **Repository:** [https://github.com/pickxiguapi/Embodied-R1.5](https://github.com/pickxiguapi/Embodied-R1.5)
+
+ ## Model Description
+ Embodied-R1.5 achieves state-of-the-art performance on 16 out of 24 embodied VLM benchmarks. It introduces a **Planner-Grounder-Corrector (PGC)** closed-loop framework, allowing the model to autonomously execute and self-correct during long-horizon tasks. Despite having only 8B parameters, it demonstrates strong zero-shot generalization to real-world robots for instruction following, affordance grounding, and articulated object manipulation.
+
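The model card names a Planner-Grounder-Corrector closed loop but does not spell out its control flow. The following is a minimal sketch of one way such a loop could be wired, not the authors' implementation. Every name here (`run_pgc_loop`, the four callables, the retry budget, the 2D point output) is an assumption introduced for illustration, and the toy functions stand in for model and robot calls.

```python
from typing import Callable

Point = tuple[int, int]


def run_pgc_loop(
    instruction: str,
    plan: Callable[[str], list[str]],
    ground: Callable[[str], Point],
    execute: Callable[[str, Point], bool],
    correct: Callable[[str], str],
    max_retries: int = 2,
) -> list[str]:
    """Run each planned step; on failure, ask the corrector for a revised step.

    All four callables are hypothetical stand-ins, not the model's real API.
    """
    log = []
    for step in plan(instruction):            # Planner: instruction -> steps
        attempt = step
        for _ in range(max_retries + 1):
            target = ground(attempt)          # Grounder: step -> 2D point
            if execute(attempt, target):      # executor reports success/failure
                log.append(f"ok: {attempt} @ {target}")
                break
            log.append(f"failed: {attempt}")
            attempt = correct(attempt)        # Corrector: propose a revised step
        else:
            log.append(f"gave up: {step}")
            break                             # abandon the remaining steps
    return log


# Toy stand-ins so the sketch runs without a model or a robot.
def toy_plan(instruction: str) -> list[str]:
    return ["open drawer", "pick cup"]


def toy_ground(step: str) -> Point:
    return (10, 20) if "drawer" in step else (30, 40)


def toy_execute(step: str, target: Point) -> bool:
    # Fail only on the first, uncorrected attempt at picking the cup.
    return step != "pick cup"


def toy_correct(step: str) -> str:
    return step + " (retry from above)"


log = run_pgc_loop(
    "put the cup in the drawer", toy_plan, toy_ground, toy_execute, toy_correct
)
for line in log:
    print(line)
```

In this sketch the corrector only rewrites the failed step; a real system might instead replan the remaining steps, which the card does not specify.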
+ ## Installation
+
+ ```bash
+ pip install "transformers>=4.57.0" qwen-vl-utils vllm openai pillow
+ ```
+
+ ## Sample Usage
+
+ The following example demonstrates how to run local inference with the model. It relies on the inference utilities provided in the [official repository](https://github.com/pickxiguapi/Embodied-R1.5), so the repository's `inference` package must be importable.
+
+ ```python
+ # HuggingFaceClient comes from the official repository, not from transformers.
+ from inference.hf_example import HuggingFaceClient
+
+ client = HuggingFaceClient(model_path="IffYuan/Embodied-R1.5", device_map="auto", dtype="auto")
+
+ case = {
+     # Multiple-choice question; "\n" keeps each option on its own line.
+     "prompt": "How many table lamps are in the image? Select from the following choices.\n(A) 0\n(B) 2\n(C) 1\n(D) 3",
+     "image": "test_assets/sample_2_image.png",
+     "type": "single_image",
+ }
+ result = client.inference(case, max_new_tokens=512)
+ print(result["generated_text"])
+ ```
+
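For more than one question, the `case` dict from the sample can be assembled programmatically. The sketch below reproduces the prompt layout and the three keys used in the sample; `build_choice_prompt` and `build_case` are hypothetical helpers, not part of the official repository, and it is an assumption that the client accepts any prompt formatted this way.

```python
from string import ascii_uppercase


def build_choice_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options as '(A) ...' lines (hypothetical helper)."""
    if len(options) > len(ascii_uppercase):
        raise ValueError("too many options to label with single letters")
    lines = [f"{question} Select from the following choices."]
    lines += [
        f"({letter}) {option}" for letter, option in zip(ascii_uppercase, options)
    ]
    return "\n".join(lines)


def build_case(question: str, options: list[str], image_path: str) -> dict:
    """Assemble a case dict with the same keys as the sample (hypothetical helper)."""
    return {
        "prompt": build_choice_prompt(question, options),
        "image": image_path,
        "type": "single_image",
    }


case = build_case(
    "How many table lamps are in the image?",
    ["0", "2", "1", "3"],
    "test_assets/sample_2_image.png",
)
print(case["prompt"])
```

The resulting dict could then be passed to `client.inference(case, max_new_tokens=512)` as in the sample.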
+ ## Citation
+
+ ```bibtex
+ @article{yuan2026embodiedr15,
+   title={Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
+   author={Yuan, Yifu and Huang, Yaoting and Yao, Xianze and Li, Yutong and Zhang, Shuoheng and Han, Linqi and Li, Pengyi and Sun, Jiangeng and Jia, Wenting and Zhang, Zhao and Liu, Yuhao and Liao, Ruihao and Hu, Yucheng and Wu, Qiyu and Li, Yuxiao and Dong, Zibin and Ni, Fei and Zheng, Yan and Gu, Shuyang metal, Yi and Tang, Hongyao and Han, Han and Hao, Jianye},
+   journal={arXiv preprint},
+   year={2026}
+ }
+ ```