---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- agibot-world/AgiBotWorld-Beta
- IPEC-COMMUNITY/fractal20220817_data_lerobot
- youliangtan/bridge_dataset
- IPEC-COMMUNITY/droid_lerobot
- liuhaotian/LLaVA-Instruct-150K
- lmms-lab/LLaVA-Video-178K
- lmms-lab/RefCOCO
- allenai/pixmo-points
- IPEC-COMMUNITY/EO-Data1.5M
- lmms-lab/RoboVQA
- x-humanoid-robomind/RoboMIND
language:
- en
license: mit
metrics:
- accuracy
- bleu
tags:
- Robot Control
- Generalist robot policies
- VLA
- Embodied AI
- Unified Model
- multimodal
- large embodied model
pipeline_tag: robotics
library_name: transformers
---
<p align="center">
<img src="assets/logo.png" width="100%">
</p>
<p align="left">
<a href="https://eo-robotics.ai/eo-1">
<img
src="https://img.shields.io/badge/EO--Robotics-Website-5865F2?logo=googleplay&logoColor=white"
alt="EO-Robotics Website"
/>
</a>
<a href="https://arxiv.org/abs/2508.21112">
<img
src="https://img.shields.io/badge/EO--1-Paper-red?logo=arxiv&logoColor=red"
alt="EO-Robotics Paper on arXiv"
/>
</a>
<a href="https://github.com/EO-Robotics/EO1">
<img
src="https://img.shields.io/badge/GitHub-Code-blue.svg?logo=github&"
alt="GitHub Code"
/>
</a>
<a href="https://huggingface.co/collections/IPEC-COMMUNITY/eo-robotics-68ac4ff30e1f746cac28ca14">
<img
src="https://img.shields.io/badge/EO--1--3B-Model-FFCC11?logo=huggingface&logoColor=brightyellow"
alt="EO-1 Model"
/>
</a>
<a href="https://huggingface.co/spaces/IPEC-COMMUNITY/EO-Robotics">
<img
src="https://img.shields.io/badge/EO--Robotics-Space-orange?logo=huggingface&logoColor=brightyellow"
alt="EO-Robotics Model"
/>
</a>
<a href="https://discord.gg/JqfDs6va">
<img
src="https://img.shields.io/badge/EO--Robotics-Discord-155dfc?logo=discord&logoColor=lightblue"
alt="EO-Robotics Discord"
/>
</a>
<a href="mailto:wangdong@pjlab.org.cn">
<img
src="https://img.shields.io/badge/EO--Robotics-Email-D14836?logo=gmail&logoColor=red"
alt="EO-Robotics Email"
/>
</a>
<a href="https://huggingface.co/datasets/IPEC-COMMUNITY/EO-Data1.5M">
<img
src="https://img.shields.io/badge/Dataset-EO--Data1.5M-brightgreen?logo=huggingface&logoColor=brightyellow"
alt="EO-1.5M"
/>
</a>
</p>
## Interleaved Vision-Text-Action Pretraining for General Robot Control
We introduce the **EO-1** model, an open-source unified embodied foundation model with 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M together with web multimodal data and robot control data (AgiBotWorld, Open X-Embodiment, RoboMIND, SO100-Community, etc.). The **EO-1** model adopts a single unified decoder-only transformer that integrates discrete autoregressive decoding with continuous flow-matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model. This work highlights the following features:
- **Unified Architecture**: a single decoder-only transformer integrating text, image, video, and actions.
- **EO-Data1.5M Dataset**: 1.5M high-quality interleaved samples (physical, reasoning, spatial, control).
- **Interleaved Pretraining**: seamless synergy between language and action via autoregressive decoding plus flow matching.
- **Reasoning-Enhanced Generalization**: superior generalization through multimodal embodied reasoning and real-robot control.
<p align="left">
<img src="assets/embodiments.png" width="100%">
</p>
## 0. Model Architecture
<p align="left">
<img src="assets/arch.png" width="100%">
</p>
The **EO-1** model is a Vision-Language-Action (VLA) model that adopts a single unified decoder-only transformer, equipped with a discrete language-modeling head for multimodal embodied reasoning and a continuous flow-matching head for robot action generation. The language instruction, image observations, robot state, and noisy actions are encoded into an interleaved token sequence processed by the shared transformer backbone, whose weights are initialized from Qwen2.5-VL. The model is trained on interleaved vision-text-action data with a combination of a flow-matching objective and a next-token-prediction objective, and is capable of seamless embodied reasoning and acting.
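To make the hybrid objective concrete, the toy sketch below combines a next-token-prediction loss over text tokens with a rectified-flow-style flow-matching loss over an action chunk. All shapes and tensors are made up for illustration; this is not the actual EO-1 training code.

```python
# Toy illustration of the two objectives (hypothetical shapes, not EO-1 code).
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    # Discrete autoregressive objective: predict token t+1 from tokens <= t.
    return F.cross_entropy(logits[:, :-1].flatten(0, 1), target_ids[:, 1:].flatten())

def flow_matching_loss(predicted_velocity, noise, clean_actions):
    # Continuous flow-matching objective: the action head regresses the
    # velocity field pointing from noise toward the clean action chunk.
    target_velocity = clean_actions - noise
    return F.mse_loss(predicted_velocity, target_velocity)

# Hypothetical shapes: batch 2, 16 text tokens, vocab 1000, 8-step x 7-dim actions.
logits = torch.randn(2, 16, 1000)          # stand-in for language-head output
target_ids = torch.randint(0, 1000, (2, 16))
noise = torch.randn(2, 8, 7)
clean_actions = torch.randn(2, 8, 7)
predicted_velocity = torch.randn(2, 8, 7)  # stand-in for flow-matching head output

loss = next_token_loss(logits, target_ids) \
     + flow_matching_loss(predicted_velocity, noise, clean_actions)
```

At inference time, the flow-matching head is applied iteratively, denoising a noisy action sequence into an executable action chunk over several integration steps.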
### Input:
Input Type(s):
- Vision: Image Frames, Video
- State: Robot Proprioception
- Language Instruction: Text, Pointing, Bounding Box, etc.

Input Format(s):
- Vision: Variable number of uint8 image frames or long video sequence
- State: Floating-point vector
- Language Instruction: String

### Output:
Output Type(s): Actions, Language

Output Format(s): Continuous-value vectors, Discrete text
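As a concrete illustration of this input format, the snippet below prepares placeholder inputs matching the variables used in the inference example of the next section. The camera file names and the 7-dimensional state are assumptions; actual state (and action) dimensions depend on the embodiment and checkpoint.

```python
# Illustrative input preparation (placeholder files and dimensions).
import numpy as np
from PIL import Image

img = Image.open("frame.png").convert("RGB")        # third-person camera frame
wrist_img = Image.open("wrist.png").convert("RGB")  # wrist camera frame
state = np.zeros(7, dtype=np.float32)               # robot proprioception vector
```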
## 1. Inference with the Pre-trained Model
**EO-1** is built entirely on 🤗 HuggingFace Transformers and LeRobot, making deployment straightforward and accessible. If your environment supports `transformers` and `lerobot`, you can load the model and run inference directly with just a few lines of code (requiring ~6.5 GB of GPU memory). **EO-1** unifies high-level embodied reasoning with low-level robot control, producing either natural-language outputs or actionable robot commands.
```python
import torch
from transformers import AutoModel, AutoProcessor

# load the model and processor
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/EO-1-3B", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/EO-1-3B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# prepare the model input
batch = {
    "observation.images.image": [img],              # PIL.Image, third-person camera
    "observation.images.wrist_image": [wrist_img],  # PIL.Image, wrist camera
    "observation.state": [state],                   # robot proprioceptive state
    "task": [
        "You are a helpful physical agent equipped with both reasoning and robotic control. "
        "You see the Tic-Tac-Toe board, think strategically, act logically, and block threats."
    ],
}

# generate multimodal outputs: text for reasoning, actions for control
output = processor.generate(model, batch)
text = output.text
actions = output.action.numpy()
```
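Beyond a single forward pass, a typical deployment runs the model in a closed loop. The sketch below is hypothetical: `robot`, `capture`, `get_state`, and `step` are placeholder names for your own hardware interface, not APIs shipped with this repository.

```python
# Hypothetical closed-loop control sketch using placeholder robot APIs.
for _ in range(50):  # number of predicted action chunks to execute
    batch = {
        "observation.images.image": [robot.capture("front")],
        "observation.images.wrist_image": [robot.capture("wrist")],
        "observation.state": [robot.get_state()],
        "task": ["Pick up the red block and place it in the bin."],
    }
    chunk = processor.generate(model, batch).action.numpy()
    for action in chunk:  # execute the predicted action chunk step by step
        robot.step(action)
```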
## 2. Benchmark
**Mastering Diverse Manipulations on Multiple Embodiments**

| Model | Franka Pick-and-Place (7 Tasks) | AgiBot Long-horizon Dexterity (4 Tasks) | WidowX Out-of-Box (13 Tasks) | Reasoning Control (4 Tasks) |
|--------------|---------------------------------|-----------------------------------------|------------------------------|-----------------------------|
| $\pi_0$-fast | 0.610 | 0.449 | 0.227 | – |
| $\pi_0$ | 0.831 | 0.672 | 0.693 | 0.525 |
| GR00T-N1.5 | 0.857 | 0.681 | 0.705 | 0.617 |
| **EO-1** | **0.935** | **0.807** | **0.852** | **0.831** |

**Multi-modal Benchmark Results**

| Model | RoboVQA | ERQA | EO-Bench @ Spatial | EO-Bench @ Temporal | Overall |
|---------------------|----------|----------|--------------------|---------------------|----------|
| Claude 3.5 | 26.7 | 35.5 | 24.0 | 34.8 | 30.3 |
| GPT-4o (2024-11-20) | 47.2 | 40.0 | 35.6 | 39.3 | 40.5 |
| Qwen2.5 VL 3B | 55.9 | 35.3 | 20.0 | 22.6 | 33.5 |
| Magma 8B | 30.3 | 29.3 | 29.4 | 36.7 | 31.4 |
| **EO-1 (3B)** | **58.5** | **45.5** | **36.4** | **38.9** | **44.8** |

**Robot Control Benchmark Results**

| Model | LIBERO | SimplerEnv @ Google (Visual Matching) | SimplerEnv @ Google (Variant Aggregation) | SimplerEnv @ WidowX (Visual Matching) |
|--------------|-----------|---------------------|---------------------|---------------------|
| $\pi_0$ | 0.942 | 0.714 | 0.714 | 0.692 |
| $\pi_0$-fast | 0.855 | 0.464 | 0.464 | 0.321 |
| GR00T-N1 | 0.939 | – | – | – |
| Magma | – | 0.488 | 0.488 | 0.448 |
| **EO-1** | **0.982** | **0.765** | **0.765** | **0.727** |
## 3. Citation
If you find this project useful, please consider citing:
```bibtex
@article{eo-1,
title={EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control},
author={Delin Qu and Haoming Song and Qizhi Chen and Zhaoqing Chen and Xianqiang Gao and Xinyi Ye and Qi Lv and Modi Shi and Guanghui Ren and Cheng Ruan and Maoqing Yao and Haoran Yang and Jiacheng Bao and Bin Zhao and Dong Wang},
journal={arXiv preprint},
year={2025},
url={https://arxiv.org/abs/2508.21112}
}
``` |