mgovind7 committed · Commit b56f51c · verified · Parent(s): e792dae

Create README.md

Files changed (1): README.md (+44 −0)
---
license: mit
tags:
- robot manipulation
- multi-modal perception
- vision-language-action
---

# UniLACT

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models.

## Abstract

Latent action representations learned from unlabeled videos have recently emerged as a promising paradigm for pretraining vision-language-action (VLA) models without explicit robot action supervision. However, latent actions derived solely from RGB observations primarily encode appearance-driven dynamics and lack explicit 3D geometric structure, which is essential for precise and contact-rich manipulation. To address this limitation, we introduce UniLACT, a transformer-based VLA model that incorporates geometric structure through depth-aware latent pretraining, enabling downstream policies to inherit stronger spatial priors. To facilitate this process, we propose UniLARN, a unified latent action learning framework based on inverse and forward dynamics objectives that learns a shared embedding space for RGB and depth while explicitly modeling their cross-modal interactions. This formulation produces modality-specific and unified latent action representations that serve as pseudo-labels for the depth-aware pretraining of UniLACT. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness of depth-aware unified latent action representations. UniLACT consistently outperforms RGB-based latent action baselines under in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks.

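As a loose illustration of the inverse/forward dynamics objectives mentioned in the abstract (this is not the authors' implementation; the linear "encoders", dimensions, and all function names are hypothetical placeholders), a minimal NumPy sketch of how a latent action can be inferred from a frame pair and supervised by a forward-prediction loss:

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_EMB, D_ACT = 32, 16, 4  # toy observation/embedding/latent-action sizes

# Hypothetical linear "encoders" for RGB and depth observations,
# fused into one shared embedding space by simple averaging.
W_rgb = rng.normal(scale=0.1, size=(D_OBS, D_EMB))
W_depth = rng.normal(scale=0.1, size=(D_OBS, D_EMB))

def encode(rgb, depth):
    return 0.5 * (rgb @ W_rgb + depth @ W_depth)

# Inverse dynamics: infer a latent action from consecutive embeddings.
W_inv = rng.normal(scale=0.1, size=(2 * D_EMB, D_ACT))
def inverse_dynamics(e_t, e_next):
    return np.concatenate([e_t, e_next]) @ W_inv

# Forward dynamics: predict the next embedding from embedding + latent action.
W_fwd = rng.normal(scale=0.1, size=(D_EMB + D_ACT, D_EMB))
def forward_dynamics(e_t, z):
    return np.concatenate([e_t, z]) @ W_fwd

# One unsupervised step on a random frame pair (no robot action labels used).
rgb_t, depth_t = rng.normal(size=D_OBS), rng.normal(size=D_OBS)
rgb_n, depth_n = rng.normal(size=D_OBS), rng.normal(size=D_OBS)
e_t, e_n = encode(rgb_t, depth_t), encode(rgb_n, depth_n)

z = inverse_dynamics(e_t, e_n)  # latent action: usable as a pseudo-label
loss = np.mean((forward_dynamics(e_t, z) - e_n) ** 2)  # forward-prediction loss
```

In a real system the linear maps would be learned networks and `loss` would be minimized over a video corpus; the resulting `z` vectors are what a depth-aware pretraining stage could consume as pseudo-labels.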
## Citation

```bibtex
@misc{govind2026unilactdepthawarergblatent,
      title={UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models},
      author={Manish Kumar Govind and Dominick Reilly and Pu Wang and Srijan Das},
      year={2026},
      eprint={2602.20231},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2602.20231}
}
```