| --- |
| license: mit |
| language: |
| - en |
| base_model: |
| - Qwen/Qwen2.5-0.5B |
| --- |
| # Model Card for MergeVLA-LIBERO |
| MergeVLA: single-skill experts for the Spatial, Object, Goal, and Long-10 suites of the LIBERO benchmark. These models serve as the base expert checkpoints for MergeVLA. |
|
|
| ## Model Details |
| Each uploaded model is a 0.68B-parameter VLA model *(excluding the vision backbone)* composed of: |
| - Qwen2.5-0.5B as the Vision-Language Model (VLM) |
| - A lightweight 0.18B Action Expert |
| - A two-layer Proprioceptive Projector MLP |
|
|
| ### **Performance (Success Rates on LIBERO)** |
| | Task Family | Success Rate (%) | |
| | ----------- | ---------------- | |
| | **Spatial** | **98.0** | |
| | **Object** | **98.6** | |
| | **Goal** | **95.0** | |
| | **Long-10** | **95.0** | |
|
|
| ### **Training Details** |
| Each expert is fine-tuned independently on modified LIBERO demonstrations in RLDS format. |
| | Category | Value | |
| | ----------------------- | ------------------------ | |
| | LoRA | Enabled (rank = 64) | |
| | Optimizer | AdamW | |
| | Learning Rate | 2e-4 | |
| Batch Size | 8 (×2 gradient accumulation) | |
| | num_images_in_input | 2 | |
| |
| ### **Training Steps** |
| * **Spatial**: 30,000 |
| * **Object**: 20,000 |
| * **Goal**: 30,000 |
| * **Long-10**: 50,000 |
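
To put the step counts in context, the sketch below combines them with the batch settings from the training table to estimate the effective batch size and the number of training samples each expert sees. It is illustrative only, not the authors' training code. The `ExpertRun` class and its field names are hypothetical, and it assumes that each listed step is one optimizer update over `batch_size × grad_accum` samples; the card does not state how steps are counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertRun:
    """Hypothetical container for one expert's training budget."""
    suite: str
    steps: int            # training steps listed in this card
    batch_size: int = 8   # per-step batch size from the training table
    grad_accum: int = 2   # gradient accumulation factor from the training table

    @property
    def effective_batch(self) -> int:
        # Samples contributing to a single optimizer update.
        return self.batch_size * self.grad_accum

    @property
    def samples_seen(self) -> int:
        # Assumption: one listed step equals one optimizer update.
        return self.steps * self.effective_batch


RUNS = [
    ExpertRun("Spatial", 30_000),
    ExpertRun("Object", 20_000),
    ExpertRun("Goal", 30_000),
    ExpertRun("Long-10", 50_000),
]

if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.suite:8s} effective batch {run.effective_batch}, "
              f"samples seen {run.samples_seen:,}")
```

Under that assumption the effective batch size is 16 for every expert. If the steps instead count micro-batches, the sample totals would be half of what this sketch reports.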
| |
| |
| ## Citation instructions |
| ```BibTeX |
| @misc{fu2025mergevla, |
| title={MergeVLA: Cross-Skill Model Merging Toward a Generalist Vision-Language-Action Agent}, |
| author={Yuxia Fu and Zhizhen Zhang and Yuqi Zhang and Zijian Wang and Zi Huang and Yadan Luo}, |
| year={2025}, |
| eprint={2511.18810}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.RO}, |
| url={https://arxiv.org/abs/2511.18810}, |
| } |
| ``` |