Improve model card: add metadata, paper, project, and code links
#1 by nielsr HF Staff - opened

README.md CHANGED
@@ -1,3 +1,48 @@
---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

# Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization

This repository contains the weights for the Vision-Language-Action (VLA) models presented in the paper [Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization](https://huggingface.co/papers/2602.09722).

[**Project Website**](https://research.beingbeyond.com/rethink_vla) | [**GitHub Repository**](https://github.com/BeingBeyond/Rethink_VLA)

## Summary

This work presents a systematic, controlled study of Vision-Language-Action (VLA) model scaling, aiming to clarify whether standard data-scaling recipes carry over to robotics given the inherent heterogeneity of training data across embodiments, sensors, and action spaces.

The analysis targets three key dimensions of VLA scaling:

1. **Physical alignment**: A unified end-effector (EEF)-relative action representation is critical for robust cross-embodiment transfer.
2. **Embodiment mixture**: Naively pooling heterogeneous robot datasets often leads to negative transfer, highlighting the challenges of indiscriminate data scaling.
3. **Training regularization**: Intuitive strategies such as sensory dropout and multi-stage fine-tuning do not consistently improve performance at scale.

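To make the first point concrete, here is a minimal sketch of what an EEF-relative action representation can look like; the paper's exact formulation may differ. The idea is to express each action as a pose delta in the current end-effector frame rather than as an absolute pose, so actions no longer depend on robot-specific base frames:

```python
import numpy as np


def to_eef_relative(pose_t: np.ndarray, pose_t1: np.ndarray) -> np.ndarray:
    """Express the next EEF pose as a delta in the current EEF frame.

    Each pose is a 4x4 homogeneous transform in the world frame.
    Returns T_rel such that pose_t1 = pose_t @ T_rel.
    """
    return np.linalg.inv(pose_t) @ pose_t1


def apply_relative(pose_t: np.ndarray, t_rel: np.ndarray) -> np.ndarray:
    """Recover the absolute next pose by composing the relative action."""
    return pose_t @ t_rel
```

Because the delta is expressed in the EEF frame, the same action encodes the same local motion regardless of where each robot's base frame sits, which is what makes the representation a candidate for cross-embodiment transfer.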
## Usage

Please refer to the [GitHub Repository](https://github.com/BeingBeyond/Rethink_VLA) for detailed instructions on pre-training, post-training, and evaluation using benchmarks such as LIBERO and RoboCasa.

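The embodiment-mixture finding above is about how heterogeneous robot datasets are combined during pre-training. As a generic illustration of one common way to control such a mixture (not the recipe used in the paper), temperature-scaled sampling assigns each dataset a weight proportional to its size raised to `1/temperature`:

```python
# Illustrative temperature-scaled mixture weights for heterogeneous robot
# datasets. This is a generic scheme, not the paper's method: temperature=1.0
# samples in proportion to dataset size, while higher temperatures flatten
# the mixture toward uniform, limiting the dominance of the largest
# embodiment's data.
def mixture_weights(sizes: dict, temperature: float = 2.0) -> dict:
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


# Hypothetical per-embodiment dataset sizes, for illustration only.
weights = mixture_weights({"franka": 90_000, "ur5": 9_000, "widowx": 1_000})
```

Tuning the temperature trades off fidelity to the natural data distribution against giving small embodiments enough sampling probability to matter.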
## Citation

If you find this work useful, please cite it as:

```bibtex
@article{rethinkvla2025,
  title={Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2602.09722},
  year={2025}
}
```

## Acknowledgments

We thank the authors of the following projects for their contributions to the robotics and machine learning communities:

* [BeingH0.5](https://github.com/BeingBeyond/Being-H): VLA framework
* [InternVL](https://github.com/OpenGVLab/InternVL): vision-language model backbone
* [Bagel](https://github.com/ByteDance-Seed/Bagel): training framework
* [Qwen](https://github.com/QwenLM/Qwen): language model
* [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO): benchmark for lifelong robot learning
* [RoboCasa](https://github.com/robocasa/robocasa): large-scale simulation benchmark for everyday tasks