---
tags:
- robot manipulation
- multi-modal perception
- vision-language-action
---
# UniLACT
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models.
## Abstract
Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
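The inverse/forward dynamics idea behind UniLARN can be sketched roughly as follows. This is a toy illustration only: the dimensions, the linear maps `W_inv`/`W_fwd`, and the function names are assumptions made for this sketch, standing in for the transformer encoders the paper actually trains over RGB and depth features. An inverse-dynamics head infers a latent action from two consecutive observations, and a forward-dynamics head checks that this latent action suffices to predict the next observation.

```python
import random

random.seed(0)

# Hypothetical toy dimensions (illustrative, not from the paper)
D_OBS, D_LAT = 6, 3

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def rand_matrix(rows, cols, scale=0.1):
    """Small random matrix standing in for a learned network."""
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

# Toy linear "networks"; the real model uses transformer encoders.
W_inv = rand_matrix(D_LAT, 2 * D_OBS)      # inverse dynamics: (o_t, o_next) -> z
W_fwd = rand_matrix(D_OBS, D_OBS + D_LAT)  # forward dynamics: (o_t, z) -> o_next

def inverse_dynamics(o_t, o_next):
    """Infer a latent action z from two consecutive observations."""
    return matvec(W_inv, o_t + o_next)

def forward_dynamics(o_t, z):
    """Predict the next observation from the current one plus a latent action."""
    return matvec(W_fwd, o_t + z)

def latent_action_loss(o_t, o_next):
    """Forward-dynamics reconstruction error of the inferred latent action."""
    z = inverse_dynamics(o_t, o_next)
    o_pred = forward_dynamics(o_t, z)
    return sum((p - t) ** 2 for p, t in zip(o_pred, o_next)) / D_OBS

# Both RGB and depth feature streams would pass through this pipeline,
# with a shared embedding space tying their latents together.
o_t = [random.gauss(0, 1) for _ in range(D_OBS)]
o_next = [random.gauss(0, 1) for _ in range(D_OBS)]
loss = latent_action_loss(o_t, o_next)
```

Minimizing this loss over many observation pairs forces the latent `z` to capture exactly the information that explains the transition, which is why the resulting latents can serve as pseudo-action labels for pretraining.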
## Citation
```bibtex
@misc{govind2026unilactdepthawarergblatent,
title={UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models},
author={Manish Kumar Govind and Dominick Reilly and Pu Wang and Srijan Das},
year={2026},
eprint={2602.20231},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.20231}
}
```