---
tags:
- robot manipulation
- multi-modal perception
- vision-language-action
---
# UniLACT
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models.
## Abstract
Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.
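The inverse/forward dynamics idea behind UniLARN can be sketched roughly as follows. This is a toy illustration only: the dimensions, the linear maps `W_inv`/`W_fwd`, and the function names are assumptions made for this sketch, standing in for the transformer encoders the paper actually trains over RGB and depth features. An inverse-dynamics head infers a latent action from two consecutive observations, and a forward-dynamics head checks that this latent action suffices to predict the next observation.

```python
import random

random.seed(0)

# Hypothetical toy dimensions (illustrative, not from the paper)
D_OBS, D_LAT = 6, 3

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def rand_matrix(rows, cols, scale=0.1):
    """Small random matrix standing in for a learned network."""
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

# Toy linear "networks"; the real model uses transformer encoders.
W_inv = rand_matrix(D_LAT, 2 * D_OBS)      # inverse dynamics: (o_t, o_next) -> z
W_fwd = rand_matrix(D_OBS, D_OBS + D_LAT)  # forward dynamics: (o_t, z) -> o_next

def inverse_dynamics(o_t, o_next):
    """Infer a latent action z from two consecutive observations."""
    return matvec(W_inv, o_t + o_next)

def forward_dynamics(o_t, z):
    """Predict the next observation from the current one plus a latent action."""
    return matvec(W_fwd, o_t + z)

def latent_action_loss(o_t, o_next):
    """Forward-dynamics reconstruction error of the inferred latent action."""
    z = inverse_dynamics(o_t, o_next)
    o_pred = forward_dynamics(o_t, z)
    return sum((p - t) ** 2 for p, t in zip(o_pred, o_next)) / D_OBS

# Both RGB and depth feature streams would pass through this pipeline,
# with a shared embedding space tying their latents together.
o_t = [random.gauss(0, 1) for _ in range(D_OBS)]
o_next = [random.gauss(0, 1) for _ in range(D_OBS)]
loss = latent_action_loss(o_t, o_next)
```

Minimizing this loss over many observation pairs forces the latent `z` to capture exactly the information that explains the transition, which is why the resulting latents can serve as pseudo-action labels for pretraining.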
## Citation
```bibtex
@misc{govind2026unilactdepthawarergblatent,
title={UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models},
author={Manish Kumar Govind and Dominick Reilly and Pu Wang and Srijan Das},
year={2026},
eprint={2602.20231},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.20231}
}
```