---
license: mit
tags:
- robot manipulation
- multi-modal perception
- vision-language-action
---

# UniLACT

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models.

## Abstract
Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
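
## Latent Action Learning (Illustrative Sketch)

The abstract describes UniLARN as a latent action learner built on inverse and forward dynamics objectives over a shared RGB-depth embedding space. The minimal PyTorch sketch below only illustrates that general idea; it is not the released implementation, and every module name, dimension, and loss choice here is an assumption made for illustration.

```python
# Illustrative sketch only -- not the UniLARN code release. All module names,
# dimensions, and the MSE loss below are hypothetical stand-ins that mirror the
# abstract: an inverse-dynamics head infers a latent action from consecutive
# RGB + depth frames, and a forward-dynamics head must predict the next
# observation embedding from the current embedding plus that latent action.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentActionModel(nn.Module):
    def __init__(self, embed_dim: int = 256, latent_action_dim: int = 32):
        super().__init__()
        # Per-modality frame encoders (toy MLPs standing in for real backbones).
        self.rgb_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, embed_dim))
        self.depth_encoder = nn.Sequential(nn.Flatten(), nn.Linear(1 * 64 * 64, embed_dim))
        # Cross-modal fusion into a shared RGB-depth embedding space.
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)
        # Inverse dynamics: (obs_t, obs_{t+1}) -> latent action.
        self.inverse_head = nn.Linear(2 * embed_dim, latent_action_dim)
        # Forward dynamics: (obs_t, latent action) -> predicted obs_{t+1} embedding.
        self.forward_head = nn.Linear(embed_dim + latent_action_dim, embed_dim)

    def embed(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.rgb_encoder(rgb), self.depth_encoder(depth)], dim=-1))

    def forward(self, rgb_t, depth_t, rgb_t1, depth_t1):
        z_t, z_t1 = self.embed(rgb_t, depth_t), self.embed(rgb_t1, depth_t1)
        latent_action = self.inverse_head(torch.cat([z_t, z_t1], dim=-1))
        z_t1_pred = self.forward_head(torch.cat([z_t, latent_action], dim=-1))
        # Forward-dynamics objective: predicted next embedding should match the real one.
        loss = F.mse_loss(z_t1_pred, z_t1.detach())
        return loss, latent_action


if __name__ == "__main__":
    model = LatentActionModel()
    rgb_t, rgb_t1 = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
    depth_t, depth_t1 = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
    loss, latent_action = model(rgb_t, depth_t, rgb_t1, depth_t1)
    print(loss.item(), latent_action.shape)
```

In this toy setup the inverse-dynamics head plays the role described in the abstract: once trained, its latent actions can serve as pseudo-labels for the depth-aware pretraining of the downstream VLA policy.
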
## Citation

```bibtex
@misc{govind2026unilactdepthawarergblatent,
      title={UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models},
      author={Manish Kumar Govind and Dominick Reilly and Pu Wang and Srijan Das},
      year={2026},
      eprint={2602.20231},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.20231}
}
```