arxiv:2606.17200

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Published on Jun 15 · Submitted by Siyuan on Jun 17 · #3 Paper of the day

Abstract

A unified Vision-Language-Action pretraining framework leverages heterogeneous data sources including human egocentric videos and robot trajectories through a reliability-aware training approach that improves performance on embodied AI tasks.

Vision-Language-Action (VLA) models benefit from large-scale and diverse embodied data, yet scaling robot trajectory collection is costly and labor-intensive. Recent advances show that large-scale egocentric human videos provide complementary real-world supervision in pretraining. However, joint training on human and robot data remains challenging due to divergences in action spaces, embodiment structures, temporal dynamics, and supervision quality. We introduce ACE-EGO-0, a unified VLA pretraining framework jointly leveraging heterogeneous data sources. To extract large-scale pretraining supervision from egocentric human videos, we build a scalable egocentric video-to-action pipeline that converts raw human videos into robot-format pseudo-action trajectories. To make these labels comparable with robot demonstrations, ACE-EGO-0 uses a unified action representation based on camera-space actions, morphology conditioning, and time-aligned action chunking. To robustly leverage noisy pseudo-action supervision from egocentric human videos, we formulate a reliability-aware training objective with a human auxiliary loss that concentrates supervision on reliable signals. We instantiate ACE-EGO-0 on 4.53K hours of robot and simulation data, together with 1.48K hours of pseudo-action-labeled egocentric human data. Experiments show that incorporating large-scale human supervision under reliability-aware weighting consistently improves both unified joint pretraining and supervised fine-tuning. ACE-EGO-0 achieves state-of-the-art performance on RoboCasa GR1 TableTop and RoboTwin 2.0, while demonstrating strong transfer to real-world bimanual manipulation.
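The abstract does not give the form of the reliability-aware objective, so the following is only a minimal sketch of one way such a loss could be assembled, assuming each human pseudo-action sample carries a reliability score in [0, 1]. The function name, the `aux_weight` parameter, and the normalized weighted-mean form are all illustrative assumptions, not the paper's formulation.

```python
from typing import Sequence


def reliability_weighted_loss(
    robot_errors: Sequence[float],
    human_errors: Sequence[float],
    human_reliability: Sequence[float],
    aux_weight: float = 0.5,
) -> float:
    """Robot loss plus a reliability-weighted human auxiliary loss.

    Hypothetical sketch: per-sample errors are assumed to be precomputed
    (e.g. action regression errors), and reliabilities lie in [0, 1].
    """
    if len(human_errors) != len(human_reliability):
        raise ValueError("human_errors and human_reliability must align")
    if any(w < 0.0 or w > 1.0 for w in human_reliability):
        raise ValueError("reliability scores must lie in [0, 1]")

    # Robot demonstrations are treated as fully trusted: plain mean.
    robot_loss = sum(robot_errors) / len(robot_errors) if robot_errors else 0.0

    # Human pseudo-labels contribute in proportion to their reliability,
    # so low-confidence samples are down-weighted rather than discarded.
    total_weight = sum(human_reliability)
    if total_weight > 0.0:
        human_loss = (
            sum(w * e for w, e in zip(human_reliability, human_errors))
            / total_weight
        )
    else:
        human_loss = 0.0

    return robot_loss + aux_weight * human_loss
```

In this sketch a human sample with reliability 0 has no effect on the result, however large its error, which is the sense in which supervision is concentrated on reliable signals.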

Community

Paper submitter

the unified camera-space action space is slick, and the reliability-aware loss is a clean guardrail for noisy signals. i’m curious how this plays out when egocentric reconstructions have systematic noise like occlusions or timing jitter—does the weighting auto-correct or still need tuning? it’d be great to see an ablation varying the noise profile or dropping some channels to stress-test the auto-weighting during pretraining. btw the arxivlens breakdown helped me parse the method details: https://arxivlens.com/PaperView/Details/ace-ego-0-unifying-egocentric-human-and-robotic-data-for-vla-pretraining-8825-7e203c20

Neat paper. The bridge between human video and robot actions has always been a pain point, so I'm interested to see how that reliability-aware training objective actually performs in practice. It makes a lot of sense to filter out the noise from those pseudo-labels rather than just throwing everything into the mix.

How does the model handle the inherent differences in temporal dynamics between human movement and robot trajectories when aligning these action chunks?

I made a podcast on it with ResearchPod, it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4526ec0a-584b-4dea-9efd-d759ba040fd8
