---
license: mit
datasets:
  - agibot-world/AgiBotWorld-Beta
  - IPEC-COMMUNITY/fractal20220817_data_lerobot
  - youliangtan/bridge_dataset
  - IPEC-COMMUNITY/droid_lerobot
  - liuhaotian/LLaVA-Instruct-150K
  - lmms-lab/LLaVA-Video-178K
  - lmms-lab/RefCOCO
  - allenai/pixmo-points
  - IPEC-COMMUNITY/EO-Data1.5M
  - lmms-lab/RoboVQA
  - x-humanoid-robomind/RoboMIND
language:
  - en
metrics:
  - accuracy
  - bleu
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
tags:
  - Robot Control
  - Generalist robot policies
  - VLA
  - Embodied AI
  - Unified Model
  - multimodal
  - large embodied model
---

EO-Robotics Website · EO-Robotics Paper on arXiv · EO-1 Model · EO-Robotics Model · EO-Robotics Discord · EO-Robotics Email · EO-1.5M Dataset

Interleaved Vision-Text-Action Pretraining for General Robot Control

We introduce EO-1, an open-source unified embodied foundation model with 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M together with web multimodal data and robot control data (AgiBotWorld, Open X-Embodiment, RoboMIND, SO100-Community, etc.). EO-1 adopts a single unified decoder-only transformer that integrates discrete autoregressive decoding with continuous flow-matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model. This work highlights the following features:

  • ⚡ Unified Architecture: A single decoder-only transformer integrating text, image, video, and actions.
  • 📚 EO-1.5M Dataset: 1.5M high-quality interleaved samples (Physical, Reasoning, Spatial, Control).
  • 🌀 Interleaved Pretraining: Seamless synergy between language and action with autoregressive decoding + flow matching.
  • 🤖 Reasoning-Enhanced Generalization: Superior generalization with multimodal embodied reasoning and real robot control.

0. Model Architecture

EO-1 is a Vision-Language-Action (VLA) model that adopts a single unified decoder-only transformer, equipped with a discrete language-modeling head for multimodal embodied reasoning and a continuous flow-matching head for robot action generation. The language instruction, image observations, robot state, and noisy actions are encoded into an interleaved token sequence processed by the shared transformer backbone, whose weights are initialized from Qwen2.5-VL. The model is trained on interleaved vision-text-action data with a combination of the flow-matching objective and the next-token-prediction objective, and is capable of seamless embodied reasoning and acting.
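To make the two training objectives concrete, the sketch below shows in plain PyTorch how a shared backbone can be supervised with next-token prediction on the language positions and flow matching on the action positions. This is a minimal illustration only; the module names, call signatures, and shapes (backbone, lm_head, action_head, action horizon H) are assumptions, not the released EO-1 training code.

import torch
import torch.nn.functional as F

def eo1_style_losses(backbone, lm_head, action_head, text_tokens, text_targets, actions):
    """
    backbone     : shared decoder-only transformer (assumed callable; Qwen2.5-VL-style)
    lm_head      : projects hidden states to vocabulary logits
    action_head  : flow-matching head predicting the denoising velocity
    text_tokens  : (B, T) discrete token ids of the interleaved sequence
    text_targets : (B, T) next-token targets (-100 where no text loss applies)
    actions      : (B, H, D) ground-truth action chunk (horizon H, action dim D)
    """
    # 1) Flow-matching target: interpolate between Gaussian noise and the clean actions.
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)   # per-sample time
    noisy_actions = (1 - t) * noise + t * actions                   # linear interpolation
    velocity_target = actions - noise                               # rectified-flow velocity

    # 2) One forward pass of the shared backbone over the interleaved sequence.
    hidden = backbone(text_tokens, noisy_actions, t)                # (B, T + H, C), assumed signature
    text_hidden = hidden[:, :text_tokens.shape[1]]
    action_hidden = hidden[:, text_tokens.shape[1]:]

    # 3) Discrete next-token-prediction loss on the language positions.
    logits = lm_head(text_hidden)
    ntp_loss = F.cross_entropy(logits.flatten(0, 1), text_targets.flatten(), ignore_index=-100)

    # 4) Continuous flow-matching loss on the action positions.
    fm_loss = F.mse_loss(action_head(action_hidden), velocity_target)

    return ntp_loss + fm_loss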

Input:

  • Input Type(s):
    • Vision: Image Frames, Video
    • State: Robot Proprioception
    • Language Instruction: Text, Pointing, Bounding Box, etc.
  • Input Format:
    • Vision: Variable number of 224x224 uint8 image frames or a long video sequence
    • State: Floating-point vector
    • Language Instruction: String

Output:

  • Output Type(s): Actions, Language
  • Output Format: Continuous-value action vectors, Discrete text
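As a concrete illustration of these formats, the snippet below builds one dummy observation with the stated shapes and dtypes. The variable names here are illustrative only; the exact batch keys expected by the released processor are shown in the inference example in the next section.

import numpy as np
from PIL import Image

image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))  # one 224x224 uint8 frame
state = np.zeros(7, dtype=np.float32)                             # robot proprioception (e.g. 7-DoF)
instruction = "pick up the red block and place it in the bowl"    # free-form language instruction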

1. Inference with the Pre-trained Model

EO-1 is built entirely on 🤗 HuggingFace Transformers and LeRobot, making deployment straightforward and accessible. If your environment supports transformers and lerobot, you can load the model and run inference directly with just a few lines of code (requiring ~6.5 GB of GPU memory). EO-1 unifies high-level embodied reasoning with low-level robot control, producing either natural language outputs or actionable robot commands.

import torch
from transformers import AutoModel, AutoProcessor

# load the model and processor
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/EO-1-3B", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/EO-1-3B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# prepare the model input
batch = {
    "observation.images.image": [img],              # PIL.Image, main camera frame
    "observation.images.wrist_image": [wrist_img],  # PIL.Image, wrist camera frame
    "observation.state": [state],                   # robot proprioception (floats)
    "task": ["You are a helpful physical agent equipped with both reasoning and robotic control. "
             "You see the Tic-Tac-Toe board, think strategically, act logically, and block threats."],
}

# generate multimodal outputs
output = processor.generate(model, batch)
text = output.text               # natural-language reasoning output
actions = output.action.numpy()  # continuous action chunk for the robot
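Below is a hedged sketch of how the placeholders above (img, wrist_img, state) might be prepared from a robot and how the returned action chunk could be stepped through in a simple control loop. The robot interface used here (robot.read_cameras, robot.get_joint_positions, robot.apply_action) is a hypothetical stand-in for your own drivers, not part of the EO-1 or LeRobot API.

import numpy as np
from PIL import Image

def get_observation(robot):
    # `robot` is a hypothetical interface; replace with your own camera/robot drivers.
    rgb, wrist_rgb = robot.read_cameras()                      # two HxWx3 uint8 arrays
    img = Image.fromarray(rgb).resize((224, 224))
    wrist_img = Image.fromarray(wrist_rgb).resize((224, 224))
    state = np.asarray(robot.get_joint_positions(), dtype=np.float32)
    return img, wrist_img, state

# closed-loop control: query the model, then step through the returned action chunk
img, wrist_img, state = get_observation(robot)
output = processor.generate(model, {
    "observation.images.image": [img],
    "observation.images.wrist_image": [wrist_img],
    "observation.state": [state],
    "task": ["pick up the red block and place it in the bowl"],
})
for action in output.action.numpy():  # one low-level action per control step
    robot.apply_action(action)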

2. Benchmark

Mastering Diverse Manipulations on Multiple Embodiments

| Model | Franka Pick-and-Place (7 Tasks) | AgiBot Long-horizon Dexterity (4 Tasks) | WidowX Out-of-Box (13 Tasks) | Reasoning Control (4 Tasks) |
|---|---|---|---|---|
| $\pi_0$-fast | 0.610 | 0.449 | 0.227 | — |
| $\pi_0$ | 0.831 | 0.672 | 0.693 | 0.525 |
| GR00T-N1.5 | 0.857 | 0.681 | 0.705 | 0.617 |
| EO-1 | 0.935 | 0.807 | 0.852 | 0.831 |

Multi-modal Benchmark Results

| Model | RoboVQA | ERQA | EO-Bench @ Spatial | EO-Bench @ Temporal | Overall |
|---|---|---|---|---|---|
| Claude 3.5 | 26.7 | 35.5 | 24.0 | 34.8 | 30.3 |
| GPT-4o (2024-11-20) | 47.2 | 40.0 | 35.6 | 39.3 | 40.5 |
| Qwen2.5 VL 3B | 55.9 | 35.3 | 20.0 | 22.6 | 33.5 |
| Magma 8B | 30.3 | 29.3 | 29.4 | 36.7 | 31.4 |
| EO-1 (3B) | 58.5 | 45.5 | 36.4 | 38.9 | 44.8 |

Robot Control Benchmark Results

| Model | LIBERO | Simpler @ Google VM | Simpler @ Google VA | Simpler @ WidowX VM |
|---|---|---|---|---|
| $\pi_0$ | 0.942 | 0.714 | 0.714 | 0.692 |
| $\pi_0$-fast | 0.855 | 0.464 | 0.464 | 0.321 |
| GR00T-N1 | 0.939 | — | — | — |
| Magma | — | 0.488 | 0.488 | 0.448 |
| EO-1 | 0.982 | 0.765 | 0.765 | 0.727 |

📚 3. Citation

If you find this project useful, please consider citing:

@article{eo-robotics,
  title={EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control},
  author={Qu, Delin and Song, Haoming and Chen, Qizhi and Chen, Zhaoqing and Gao, Xianqiang and Ye, Xinyi and Shi, Modi and Ren, Guanghui and Yao, Maoqing and Zhao, Bin and Wang, Dong},
  journal={arXiv preprint},
  year={2025}
}