---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

# Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
This repository contains the weights for the Vision-Language-Action (VLA) models presented in the paper *Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization*.
Project Website | GitHub Repository
## Summary
This work presents a systematic, controlled study of Vision-Language-Action (VLA) model scaling, asking whether standard data-scaling recipes carry over to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces.
The analysis targets three key dimensions of VLA scaling:
- **Physical alignment:** A unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer.
- **Embodiment mixture:** Naively pooling heterogeneous robot datasets often leads to negative transfer, highlighting the risk of indiscriminate data scaling.
- **Training regularization:** Intuitive strategies such as sensory dropout and multi-stage fine-tuning do not consistently improve performance at scale.
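To make the first point concrete, the sketch below shows one simplified way to convert absolute end-effector trajectories into EEF-relative delta actions. This is an illustrative, yaw-only toy (function name, NumPy, and the 4-DoF action layout are all assumptions, not the paper's implementation), but it captures the idea that translation deltas are expressed in the current end-effector frame rather than the world frame:

```python
import numpy as np

def eef_relative_actions(positions, yaws):
    """Hypothetical helper: turn absolute EEF poses into EEF-relative actions.

    positions: (T, 3) absolute EEF positions in the world frame.
    yaws:      (T,)   absolute EEF yaw angles in radians (yaw-only toy model).

    Returns a (T-1, 4) array of [dx, dy, dz, dyaw] actions, with the
    translation delta rotated into the EEF frame at the current step.
    """
    actions = []
    for t in range(len(positions) - 1):
        world_delta = positions[t + 1] - positions[t]
        # Rotate the world-frame translation into the current EEF frame
        # by applying R(-yaw) to the x/y components.
        c, s = np.cos(-yaws[t]), np.sin(-yaws[t])
        local_delta = np.array([
            c * world_delta[0] - s * world_delta[1],
            s * world_delta[0] + c * world_delta[1],
            world_delta[2],
        ])
        # Wrap the yaw delta into (-pi, pi] so actions stay continuous.
        raw = yaws[t + 1] - yaws[t]
        dyaw = np.arctan2(np.sin(raw), np.cos(raw))
        actions.append(np.concatenate([local_delta, [dyaw]]))
    return np.stack(actions)
```

Because the deltas are frame-local, two robots with different base placements that perform the same motion relative to their grippers produce the same action sequence, which is what makes this representation attractive for cross-embodiment transfer.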
## Usage
Please refer to the GitHub Repository for detailed instructions on pre-training, post-training, and evaluation using benchmarks such as LIBERO and RoboCasa.
## Citation
If you find this work useful, please cite it as:
```bibtex
@article{rethinkvla2025,
  title={Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2602.09722},
  year={2025}
}
```
## Acknowledgments
We thank the authors of the following projects for their contributions to the robotics and machine learning communities: