Improve model card: add metadata, paper, project, and code links
#1 by nielsr HF Staff - opened

README.md CHANGED
@@ -1,3 +1,48 @@
---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

# Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

This repository contains the weights for the Vision-Language-Action (VLA) models presented in the paper [Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization](https://huggingface.co/papers/2602.09722).

[**Project Website**](https://research.beingbeyond.com/rethink_vla) | [**GitHub Repository**](https://github.com/BeingBeyond/Rethink_VLA)

## Summary

This work presents a systematic, controlled study of Vision-Language-Action (VLA) model scaling, aiming to clarify whether standard data-scaling recipes carry over to robotics given the inherent heterogeneity of training data across embodiments, sensors, and action spaces.

The analysis targets three key dimensions of VLA scaling:

1. **Physical alignment**: A unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer.
2. **Embodiment mixture**: Naively pooling heterogeneous robot datasets often leads to negative transfer, highlighting the challenges of indiscriminate data scaling.
3. **Training regularization**: Intuitive strategies such as sensory dropout and multi-stage fine-tuning do not consistently improve performance at scale.

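To make the first point concrete, here is a minimal sketch of what an EEF-relative action representation can look like; the paper's exact formulation may differ. The idea is to express each action as a pose delta in the current end-effector frame rather than as an absolute pose, so actions no longer depend on robot-specific base frames:

```python
import numpy as np


def to_eef_relative(pose_t: np.ndarray, pose_t1: np.ndarray) -> np.ndarray:
    """Express the next EEF pose as a delta in the current EEF frame.

    Each pose is a 4x4 homogeneous transform in the world frame.
    Returns T_rel such that pose_t1 = pose_t @ T_rel.
    """
    return np.linalg.inv(pose_t) @ pose_t1


def apply_relative(pose_t: np.ndarray, t_rel: np.ndarray) -> np.ndarray:
    """Recover the absolute next pose by composing the relative action."""
    return pose_t @ t_rel
```

Because the delta is expressed in the EEF frame, the same action encodes the same local motion regardless of where each robot's base frame sits, which is what makes the representation a candidate for cross-embodiment transfer.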
## Usage

Please refer to the [GitHub Repository](https://github.com/BeingBeyond/Rethink_VLA) for detailed instructions on pre-training, post-training, and evaluation using benchmarks such as LIBERO and RoboCasa.

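The embodiment-mixture finding above is about how heterogeneous robot datasets are combined during pre-training. As a generic illustration of one common way to control such a mixture (not the recipe used in the paper), temperature-scaled sampling assigns each dataset a weight proportional to its size raised to `1/temperature`:

```python
# Illustrative temperature-scaled mixture weights for heterogeneous robot
# datasets. This is a generic scheme, not the paper's method: temperature=1.0
# samples in proportion to dataset size, while higher temperatures flatten
# the mixture toward uniform, limiting the dominance of the largest
# embodiment's data.
def mixture_weights(sizes: dict, temperature: float = 2.0) -> dict:
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


# Hypothetical per-embodiment dataset sizes, for illustration only.
weights = mixture_weights({"franka": 90_000, "ur5": 9_000, "widowx": 1_000})
```

Tuning the temperature trades off fidelity to the natural data distribution against giving small embodiments enough sampling probability to matter.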
## Citation

If you find this work useful, please cite it as:

```bibtex
@article{rethinkvla2025,
  title={Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2602.09722},
  year={2025}
}
```

## Acknowledgments

We thank the authors of the following projects for their contributions to the robotics and machine learning communities:

* [BeingH0.5](https://github.com/BeingBeyond/Being-H): VLA framework
* [InternVL](https://github.com/OpenGVLab/InternVL): vision-language model backbone
* [Bagel](https://github.com/ByteDance-Seed/Bagel): training framework
* [Qwen](https://github.com/QwenLM/Qwen): language model
* [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO): benchmark for lifelong robot learning
* [RoboCasa](https://github.com/robocasa/robocasa): large-scale simulation benchmark for everyday tasks