File size: 4,811 Bytes

4dfe26d
 
 
 
 
 
 
 
 
 
 
 
 
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
 
65ef268
4dfe26d
 
65ef268
4dfe26d
65ef268
4dfe26d
 
 
65ef268
4dfe26d
65ef268
4dfe26d
 
 
 
 
 
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
 
 
 
65ef268
4dfe26d
65ef268
4dfe26d
 
 
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
 
 
65ef268
4dfe26d
 
65ef268
4dfe26d
 
 
65ef268
4dfe26d
65ef268
4dfe26d
 
 
65ef268
4dfe26d
65ef268
4dfe26d
 
 
 
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
 
 
 
 
 
65ef268
4dfe26d
 
 
 
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
 
65ef268
4dfe26d
 
 
65ef268
4dfe26d
 
 
65ef268
4dfe26d
 
 
 
 
 
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d
 
 
65ef268
4dfe26d
65ef268
4dfe26d
65ef268
4dfe26d

---
language:
- en
license: other
pipeline_tag: image-text-to-text
tags:
- robotics
- vision-language-model
- embodied-ai
- manipulation
- qwen2-vl
library_name: transformers
---

# Embodied-R1-3B-v1

**Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation (ICLR 2026)**

[[🌐 Project Website](https://embodied-r1.github.io)] [[📄 Paper](http://arxiv.org/abs/2508.13998)] [[🏆 ICLR2026 Version](https://openreview.net/forum?id=i5wlozMFsQ)] [[🎯 Dataset](https://huggingface.co/datasets/IffYuan/Embodied-R1-Dataset)] [[📦 Code](https://github.com/pickxiguapi/Embodied-R1)]

---

## Model Details

### Model Description

**Embodied-R1** is a 3B vision-language model (VLM) for general robotic manipulation.
It introduces a **Pointing** mechanism and uses **Reinforced Fine-tuning (RFT)** to bridge perception and action, with strong zero-shot generalization in embodied tasks.

![Embodied-R1 Framework](https://raw.githubusercontent.com/pickxiguapi/Embodied-R1/main/assets/r1_framework_readme.jpg)
*Figure: Embodied-R1 framework, performance overview, and zero-shot manipulation demos.*

### Model Sources

- **Repository:** https://github.com/pickxiguapi/Embodied-R1
- **Paper:** http://arxiv.org/abs/2508.13998
- **OpenReview:** https://openreview.net/forum?id=i5wlozMFsQ

### Updates

- **[2026-03]** VABench-P / VABench-V released:
  [VABench-P](https://huggingface.co/datasets/IffYuan/VABench-P), [VABench-V](https://huggingface.co/datasets/IffYuan/vabench-v)
- **[2026-03-03]** Embodied-R1 dataset released:
  https://huggingface.co/datasets/IffYuan/Embodied-R1-Dataset
- **[2026-01-27]** Accepted by ICLR 2026
- **[2025-08-22]** Embodied-R1-3B-v1 checkpoint released

---

## Intended Uses

### Direct Use

This model is intended for **research and benchmarking** in embodied reasoning and robotic manipulation tasks, including:
- Visual target grounding (VTG)
- Referring region grounding (RRG/REG-style tasks)
- Open-form grounding (OFG)

### Out-of-Scope Use

- Safety-critical real-world deployment without additional safeguards and validation
- Decision-making in high-risk domains
- Any use requiring guaranteed robustness under distribution shift

---

## How to Use

### Setup

```bash
git clone https://github.com/pickxiguapi/Embodied-R1.git
cd Embodied-R1

conda create -n embodied_r1 python=3.11 -y
conda activate embodied_r1

pip install transformers==4.51.3 accelerate
pip install qwen-vl-utils[decord]
```

### Inference

```bash
python inference_example.py
```

### Example Tasks

- VTG: *put the red block on top of the yellow block*
- RRG: *put pepper in pan*
- REG: *bring me the camel model*
- OFG: *loosening stuck bolts*

(Visualization examples are available in the project repo: `assets/`)

---

## Evaluation

```bash
cd eval
python hf_inference_where2place.py
python hf_inference_vabench_point.py
...
```

Related benchmarks:
- [Embodied-R1-Dataset](https://huggingface.co/datasets/IffYuan/Embodied-R1-Dataset)
- [VABench-P](https://huggingface.co/datasets/IffYuan/VABench-P)
- [VABench-V](https://huggingface.co/datasets/IffYuan/vabench-v)

---

## Training

Training scripts are available at:
https://github.com/pickxiguapi/Embodied-R1/tree/main/scripts

```bash
# Stage 1 training
bash scripts/stage_1_embodied_r1.sh

# Stage 2 training
bash scripts/stage_2_embodied_r1.sh
```

Key files:
- `scripts/config_stage1.yaml`
- `scripts/config_stage2.yaml`
- `scripts/stage_1_embodied_r1.sh`
- `scripts/stage_2_embodied_r1.sh`
- `scripts/model_merger.py` (checkpoint merging + HF export)

---

## Limitations

- Performance may vary across environments, camera viewpoints, and unseen object domains.
- Outputs are generated from visual-language reasoning and may include localization/action errors.
- Additional system-level constraints (calibration, motion planning, safety checks) are required for real robot deployment.

---

## Citation

```bibtex
@article{yuan2026embodied,
  title={Embodied-r1: Reinforced embodied reasoning for general robotic manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Huang, Yaoting and Chen, Yibin and Ni, Fei and Dong, Zibin and Li, Pengyi and Zheng, Yan and Tang, Hongyao and Hao, Jianye},
  journal={The Fourteenth International Conference on Learning Representations},
  year={2026}
}

@article{yuan2026seeing,
  title={From seeing to doing: Bridging reasoning and decision for robotic manipulation},
  author={Yuan, Yifu and Cui, Haiqin and Chen, Yibin and Dong, Zibin and Ni, Fei and Kou, Longxin and Liu, Jinyi and Li, Pengyi and Zheng, Yan and Hao, Jianye},
  journal={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

---

## Acknowledgements

If this model or resources are useful for your research, please consider citing our work and starring the repository.