Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation
This repository contains the weights for Multi-view-VLA, a Vision-Language-Action framework designed for robust and precise robotic manipulation.
Project Page | Code | arXiv
Multi-view-VLA addresses the challenges of spatial perception and precise manipulation in Vision-Language-Action (VLA) models.
The model demonstrates superior success rates and robustness on benchmarks such as LIBERO and RoboTwin 2.0, as well as in real-world robotic tasks.
For detailed instructions on installation, training, and evaluation, please refer to the official GitHub repository.
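As a quick start, the released weights can be fetched with the `huggingface_hub` client. The snippet below is a minimal sketch only; the repository ID is a placeholder assumption and should be replaced with the actual name of this model repository.

```python
# Minimal sketch: download the Multi-view-VLA checkpoint from the Hugging Face Hub.
# NOTE: the repo_id below is a placeholder assumption, not confirmed by this card;
# substitute the actual repository name before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<org>/Multi-view-VLA")
print(f"Weights downloaded to: {local_dir}")
```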
If you find this work useful, please consider citing:
@article{xiao2026learning,
title={Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation},
author={Junjin Xiao and Dongyang Li and Yandan Yang and Shuang Zeng and Tong Lin and Xinyuan Chang and Feng Xiong and Mu Xu and Xing Wei and Zhiheng Ma and Qing Zhang and Wei-Shi Zheng},
year={2026},
journal={arXiv preprint arXiv:2605.11832},
}
This project builds upon starVLA, Qwen3-VL, vggt, JiT, LeRobot, Isaac-GR00T, and any4lerobot.