---
language:
- en
license: other
pipeline_tag: image-text-to-text
tags:
- robotics
- vision-language-model
- embodied-ai
- manipulation
- qwen2-vl
library_name: transformers
---

# Embodied-R1-3B-v1

**Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation (ICLR 2026)**

[[🌐 Project Website](https://embodied-r1.github.io)] [[📄 Paper](http://arxiv.org/abs/2508.13998)] [[🏆 ICLR2026 Version](https://openreview.net/forum?id=i5wlozMFsQ)] [[🎯 Dataset](https://huggingface.co/datasets/IffYuan/Embodied-R1-Dataset)] [[📦 Code](https://github.com/pickxiguapi/Embodied-R1)]

---

## Model Details

### Model Description

**Embodied-R1** is a 3B vision-language model (VLM) for general robotic manipulation. It introduces a **Pointing** mechanism and uses **Reinforced Fine-tuning (RFT)** to bridge perception and action, with strong zero-shot generalization in embodied tasks.

![Embodied-R1 Framework](https://raw.githubusercontent.com/pickxiguapi/Embodied-R1/main/assets/r1_framework_readme.jpg)
*Figure: Embodied-R1 framework, performance overview, and zero-shot manipulation demos.*

### Model Sources

- **Repository:** https://github.com/pickxiguapi/Embodied-R1
- **Paper:** http://arxiv.org/abs/2508.13998
- **OpenReview:** https://openreview.net/forum?id=i5wlozMFsQ

### Updates

- **[2026-03]** VABench-P / VABench-V released: [VABench-P](https://huggingface.co/datasets/IffYuan/VABench-P), [VABench-V](https://huggingface.co/datasets/IffYuan/vabench-v)
- **[2026-03-03]** Embodied-R1 dataset released: https://huggingface.co/datasets/IffYuan/Embodied-R1-Dataset
- **[2026-01-27]** Accepted by ICLR 2026
- **[2025-08-22]** Embodied-R1-3B-v1 checkpoint released

---

## Intended Uses

### Direct Use

This model is intended for **research and benchmarking** in embodied reasoning and robotic manipulation tasks, including:

- Visual target grounding (VTG)
- Referring region grounding (RRG/REG-style tasks)
- Open-form grounding (OFG)

### Out-of-Scope Use

- Safety-critical real-world
deployment without additional safeguards and validation
- Decision-making in high-risk domains
- Any use requiring guaranteed robustness under distribution shift

---

## How to Use

### Setup

```bash
git clone https://github.com/pickxiguapi/Embodied-R1.git
cd Embodied-R1

conda create -n embodied_r1 python=3.11 -y
conda activate embodied_r1

pip install transformers==4.51.3 accelerate
pip install qwen-vl-utils[decord]
```

### Inference

```bash
python inference_example.py
```

### Example Tasks

- VTG: *put the red block on top of the yellow block*
- RRG: *put pepper in pan*
- REG: *bring me the camel model*
- OFG: *loosening stuck bolts*

(Visualization examples are available in the project repo: `assets/`)

---

## Evaluation

```bash
cd eval
python hf_inference_where2place.py
python hf_inference_vabench_point.py
...
```

Related benchmarks:

- [Embodied-R1-Dataset](https://huggingface.co/datasets/IffYuan/Embodied-R1-Dataset)
- [VABench-P](https://huggingface.co/datasets/IffYuan/VABench-P)
- [VABench-V](https://huggingface.co/datasets/IffYuan/vabench-v)

---

## Training

Training scripts are available at: https://github.com/pickxiguapi/Embodied-R1/tree/main/scripts

```bash
# Stage 1 training
bash scripts/stage_1_embodied_r1.sh

# Stage 2 training
bash scripts/stage_2_embodied_r1.sh
```

Key files:

- `scripts/config_stage1.yaml`
- `scripts/config_stage2.yaml`
- `scripts/stage_1_embodied_r1.sh`
- `scripts/stage_2_embodied_r1.sh`
- `scripts/model_merger.py` (checkpoint merging + HF export)

---

## Limitations

- Performance may vary across environments, camera viewpoints, and unseen object domains.
- Outputs are generated from visual-language reasoning and may include localization/action errors.
- Additional system-level constraints (calibration, motion planning, safety checks) are required for real robot deployment.

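
As one illustration of the kind of safety check mentioned above, the sketch below filters predicted points that fall outside the camera image before anything downstream consumes them. It is a minimal sketch under stated assumptions, not part of the released code: it assumes the model's answer text contains integer pixel pairs such as `[[120, 340], [200, 355]]`, and the helper names `extract_points` and `split_by_bounds` are hypothetical. Check the actual output format produced by `inference_example.py` before adapting it.

```python
import re
from typing import List, Tuple

Point = Tuple[int, int]

# ASSUMPTION: points appear in the answer text as integer pixel pairs "[x, y]".
# The real output format may differ; adjust the pattern accordingly.
POINT_PATTERN = re.compile(r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]")


def extract_points(answer_text: str) -> List[Point]:
    """Return every "[x, y]" integer pair found in the text, in order."""
    return [(int(x), int(y)) for x, y in POINT_PATTERN.findall(answer_text)]


def split_by_bounds(
    points: List[Point], width: int, height: int
) -> Tuple[List[Point], List[Point]]:
    """Split points into those inside and outside a width x height image."""
    inside: List[Point] = []
    outside: List[Point] = []
    for x, y in points:
        if 0 <= x < width and 0 <= y < height:
            inside.append((x, y))
        else:
            outside.append((x, y))
    return inside, outside


if __name__ == "__main__":
    # Hypothetical answer string, used only to exercise the helpers.
    text = "<point>[[120, 340], [700, -5], [200, 355]]</point>"
    kept, rejected = split_by_bounds(extract_points(text), width=640, height=480)
    print("kept:", kept)
    print("rejected:", rejected)
```

A bounds check like this only catches grossly invalid predictions. It does not replace camera calibration, motion planning, or collision checking on a real robot.
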

---

## Citation

```bibtex
@article{yuan2026embodied,
  title={Embodied-r1: Reinforced embodied reasoning for general robotic manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Tang, Hongyao and Hao, Jianye},
  journal={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

@article{yuan2026seeing,
  title={From seeing to doing: Bridging reasoning and decision for robotic manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

---

## Acknowledgements

If this model or resources are useful for your research, please consider citing our work and starring the repository.