VLANeXt: Recipes for Building Strong VLA Models

VLANeXt is a Vision-Language-Action (VLA) model designed for general-purpose robotic policy learning. By systematically reexamining the VLA design space, the authors distill 12 practical findings that significantly improve performance and generalization on benchmarks such as LIBERO and LIBERO-plus.

πŸ“– Abstract

Following the rise of large foundation models, Vision-Language-Action models (VLAs) have emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet the current VLA landscape remains fragmented and exploratory. VLANeXt reexamines the VLA design space under a unified framework and evaluation setup, dissecting design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. The resulting model outperforms prior state-of-the-art methods and demonstrates strong generalization in real-world experiments.

πŸ› οΈ Usage

This repository hosts the checkpoints for evaluation on the LIBERO and LIBERO-plus benchmark suites. For environment setup, training, and evaluation instructions, please refer to the official VLANeXt GitHub repository.
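
As a quick-start sketch, the snippet below shows one way to fetch the checkpoints locally with the Hugging Face Hub client. The repo id `DravenALG/VLANeXt` is taken from this page; the download pattern itself is only an assumption, and the official GitHub repository remains the authoritative reference for loading and evaluating the model.

```python
# Minimal sketch: pull the VLANeXt checkpoints from the Hugging Face Hub.
# Assumption: the checkpoints live in the "DravenALG/VLANeXt" repo on the Hub;
# how they plug into the evaluation pipeline is documented in the official
# VLANeXt GitHub repository.
from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) every file in the repository.
local_dir = snapshot_download(repo_id="DravenALG/VLANeXt")
print(f"Checkpoints available under: {local_dir}")
```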

πŸ“š Citation

If you find VLANeXt useful for your research or applications, please cite the paper:

@article{wu2026vlanext,
    title={VLANeXt: Recipes for Building Strong VLA Models}, 
    author={Xiao-Ming Wu and Bin Fan and Kang Liao and Jian-Jian Jiang and Runze Yang and Yihang Luo and Zhonghua Wu and Wei-Shi Zheng and Chen Change Loy},
    journal={arXiv preprint arXiv:2602.18532},
    year={2026}
}

πŸ—žοΈ License

This project is licensed under the NTU S-Lab License 1.0.
