VLANeXt: Recipes for Building Strong VLA Models
VLANeXt is a Vision-Language-Action (VLA) model designed for general-purpose robotic policy learning. By systematically reexamining the VLA design space, the authors distill a set of 12 practical findings that significantly improve model performance and generalization across benchmarks like LIBERO and LIBERO-plus.
Abstract
Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. VLANeXt reexamines the VLA design space under a unified framework and evaluation setup, dissecting design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. The resulting model outperforms prior state-of-the-art methods and demonstrates strong generalization in real-world experiments.
Usage
This repository hosts the checkpoints for evaluation on the LIBERO and LIBERO-plus benchmark suites. For environment setup, training, and evaluation instructions, please refer to the official VLANeXt GitHub repository.
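As a starting point, the checkpoints can be fetched programmatically with the `huggingface_hub` library. This is a minimal sketch: the repository id `VLANeXt/VLANeXt` and the local directory are assumptions, not confirmed names; substitute the actual model repository before use.

```python
# Hypothetical sketch for fetching the VLANeXt checkpoints from the Hugging Face Hub.
# The repo_id below is an assumed placeholder; replace it with the real repository name.

def download_checkpoints(repo_id: str = "VLANeXt/VLANeXt",
                         local_dir: str = "./checkpoints") -> str:
    """Download all checkpoint files for local evaluation and return the download path."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

if __name__ == "__main__":
    path = download_checkpoints()
    print(f"Checkpoints saved to {path}")
```

After downloading, point the evaluation scripts from the official GitHub repository at the local checkpoint directory.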
Citation
If you find VLANeXt useful for your research or applications, please cite the paper:
@article{wu2026vlanext,
title={VLANeXt: Recipes for Building Strong VLA Models},
author={Xiao-Ming Wu and Bin Fan and Kang Liao and Jian-Jian Jiang and Runze Yang and Yihang Luo and Zhonghua Wu and Wei-Shi Zheng and Chen Change Loy},
journal={arXiv preprint arXiv:2602.18532},
year={2026}
}
License
This project is licensed under the NTU S-Lab License 1.0.