---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

# Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
This repository contains the weights for the Vision-Language-Action (VLA) models presented in the paper *Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization*.
Project Website | GitHub Repository
## Summary
This work presents a systematic, controlled study of Vision-Language-Action (VLA) model scaling, asking whether standard data-scaling recipes carry over to robotics, where training data is inherently heterogeneous across embodiments, sensors, and action spaces.
The analysis targets three key dimensions of VLA scaling:
- **Physical alignment:** A unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer.
- **Embodiment mixture:** Naively pooling heterogeneous robot datasets often leads to negative transfer, highlighting the risk of indiscriminate data scaling.
- **Training regularization:** Intuitive strategies such as sensory dropout and multi-stage fine-tuning do not consistently improve performance at scale.
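To make the first point concrete, the sketch below shows one simplified way to convert absolute end-effector trajectories into EEF-relative delta actions. This is an illustrative, yaw-only toy (function name, NumPy, and the 4-DoF action layout are all assumptions, not the paper's implementation), but it captures the idea that translation deltas are expressed in the current end-effector frame rather than the world frame:

```python
import numpy as np

def eef_relative_actions(positions, yaws):
    """Hypothetical helper: turn absolute EEF poses into EEF-relative actions.

    positions: (T, 3) absolute EEF positions in the world frame.
    yaws:      (T,)   absolute EEF yaw angles in radians (yaw-only toy model).

    Returns a (T-1, 4) array of [dx, dy, dz, dyaw] actions, with the
    translation delta rotated into the EEF frame at the current step.
    """
    actions = []
    for t in range(len(positions) - 1):
        world_delta = positions[t + 1] - positions[t]
        # Rotate the world-frame translation into the current EEF frame
        # by applying R(-yaw) to the x/y components.
        c, s = np.cos(-yaws[t]), np.sin(-yaws[t])
        local_delta = np.array([
            c * world_delta[0] - s * world_delta[1],
            s * world_delta[0] + c * world_delta[1],
            world_delta[2],
        ])
        # Wrap the yaw delta into (-pi, pi] so actions stay continuous.
        raw = yaws[t + 1] - yaws[t]
        dyaw = np.arctan2(np.sin(raw), np.cos(raw))
        actions.append(np.concatenate([local_delta, [dyaw]]))
    return np.stack(actions)
```

Because the deltas are frame-local, two robots with different base placements that perform the same motion relative to their grippers produce the same action sequence, which is what makes this representation attractive for cross-embodiment transfer.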
## Usage
Please refer to the GitHub Repository for detailed instructions on pre-training, post-training, and evaluation using benchmarks such as LIBERO and RoboCasa.
## Citation
If you find this work useful, please cite it as:
```bibtex
@article{rethinkvla2025,
  title={Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2602.09722},
  year={2025}
}
```
## Acknowledgments
We thank the authors of the following projects for their contributions to the robotics and machine learning communities: