arxiv:2606.17200

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Published on Jun 15

Submitted by Siyuan on Jun 17

#3 Paper of the day

CUHK

Abstract

A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
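The abstract does not give the form of the reliability-aware objective, so the following is only a minimal sketch of one way such weighting could look, not the authors' implementation. It assumes each human sample carries a scalar reliability score in [0, 1]. The names `reliability_weighted_loss`, `reliability`, `is_human`, and `human_weight` are hypothetical and do not come from the paper.

```python
import numpy as np


def reliability_weighted_loss(pred, target, reliability, is_human, human_weight=0.5):
    """Weighted mean-squared error over a mixed batch of action chunks.

    pred, target: arrays of shape (batch, chunk_len, action_dim).
    reliability:  array of shape (batch,), assumed scores in [0, 1].
    is_human:     boolean array of shape (batch,), True for pseudo-labeled
                  human samples, False for robot demonstrations.
    human_weight: assumed scale of the human auxiliary term.
    """
    # Per-sample squared error, averaged over time steps and action dimensions.
    per_sample = np.mean((pred - target) ** 2, axis=(1, 2))

    # Robot samples keep weight 1; human samples are scaled by their
    # reliability score, so unreliable pseudo-labels contribute little.
    weights = np.where(is_human, human_weight * reliability, 1.0)

    # Normalize by the total weight so the loss scale stays comparable
    # across batches with different human/robot mixes.
    return float(np.sum(weights * per_sample) / max(np.sum(weights), 1e-8))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.normal(size=(4, 8, 7))
    target = rng.normal(size=(4, 8, 7))
    reliability = np.array([1.0, 1.0, 0.9, 0.1])
    is_human = np.array([False, False, True, True])
    print(reliability_weighted_loss(pred, target, reliability, is_human))
```

With this weighting, a human sample with reliability 0 is ignored entirely, while a fully reliable one counts for `human_weight` times a robot sample. How the scores are actually obtained and applied in ACE-EGO-0 is described in the paper itself.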

Community

SiyuanH

Paper submitter about 18 hours ago

avahal

about 6 hours ago

the unified camera-space action space is slick, and the reliability-aware loss is a clean guardrail for noisy signals. i’m curious how this plays out when egocentric reconstructions have systematic noise like occlusions or timing jitter—does the weighting auto-correct or still need tuning? it’d be great to see an ablation varying the noise profile or dropping some channels to stress-test the auto-weighting during pretraining. btw the arxivlens breakdown helped me parse the method details: https://arxivlens.com/PaperView/Details/ace-ego-0-unifying-egocentric-human-and-robotic-data-for-vla-pretraining-8825-7e203c20

noahml

about 2 hours ago

Neat paper. The bridge between human video and robot actions has always been a pain point, so I'm interested to see how that reliability-aware training objective actually performs in practice. It makes a lot of sense to filter out the noise from those pseudo-labels rather than just throwing everything into the mix.

How does the model handle the inherent differences in temporal dynamics between human movement and robot trajectories when aligning these action chunks?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4526ec0a-584b-4dea-9efd-d759ba040fd8