---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

<h1 align="center">Causal World Modeling for Robot Control</h1>

<p align="center">
  <img src="assets/teaser.png" width="100%">
</p>

**LingBot-VA** is an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously, introduced in the paper [Causal World Modeling for Robot Control](https://huggingface.co/papers/2601.21998).

It focuses on:

- **Autoregressive Video-Action World Modeling**: Unifies visual dynamics prediction and action inference within a single interleaved sequence while keeping the two conceptually distinct.
- **High-Efficiency Execution**: A dual-stream mixture-of-transformers (MoT) architecture with asynchronous execution and KV caching.
- **Long-Horizon Performance and Generalization**: Substantial gains in sample efficiency, long-horizon success rates, and generalization to novel scenes.
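
The interleaved rollout described above can be sketched schematically. This is illustrative only, not the actual LingBot-VA implementation: `predict_frame` and `predict_action` stand in for the two MoT streams, and the growing `history` list plays the role the KV cache plays in the real model.

```python
# Schematic sketch of an interleaved video-action rollout (illustrative only).
# The model alternates between inferring the next action chunk and predicting
# the next observation, conditioning each step on the full interleaved history.

def rollout(predict_frame, predict_action, first_frame, horizon):
    """Alternate action inference and frame prediction for `horizon` steps."""
    history = [("frame", first_frame)]  # interleaved (kind, payload) sequence
    actions = []
    for _ in range(horizon):
        action = predict_action(history)   # policy stream reads the history
        history.append(("action", action))
        actions.append(action)
        frame = predict_frame(history)     # world-model stream predicts ahead
        history.append(("frame", frame))
    return actions, history
```

With dummy predictors, a `horizon`-step rollout yields `horizon` actions and an interleaved history of `2 * horizon + 1` entries.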

---

# Model Sources

- **Repository:** [https://github.com/Robbyant/lingbot-va](https://github.com/Robbyant/lingbot-va)
- **Paper:** [Causal World Modeling for Robot Control](https://huggingface.co/papers/2601.21998)
- **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)

---

# 📦 Model Download

- **Pretrained Checkpoints for Post-Training**

| Model Name | Hugging Face Repository | Description |
| :--- | :---: | :---: |
| lingbot-va-base | [🤗 robbyant/lingbot-va-base](https://huggingface.co/robbyant/lingbot-va-base) | LingBot-VA w/ shared backbone |
| lingbot-va-posttrain-robotwin | [🤗 robbyant/lingbot-va-posttrain-robotwin](https://huggingface.co/robbyant/lingbot-va-posttrain-robotwin) | LingBot-VA post-trained on RoboTwin w/ shared backbone |

---

# 🛠️ Quick Start

## Installation

**Requirements**

- Python == 3.10.16
- PyTorch == 2.9.0
- CUDA 12.6

```bash
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu126
pip install websockets einops diffusers==0.36.0 transformers==5.0.0 accelerate msgpack opencv-python matplotlib ftfy easydict
pip install flash-attn --no-build-isolation
```

## Run Image-to-Video-Action Generation

We provide a script for image-to-video-action generation:

```bash
NGPU=1 CONFIG_NAME='robotwin_i2av' bash script/run_launch_va_server_sync.sh
```
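
The launch script starts a server. Its exact protocol is not documented here, but given the `websockets` and `msgpack` dependencies installed above, a client might look roughly like the following sketch; the URI, port, and message schema (`image`, `instruction` keys) are assumptions, purely illustrative:

```python
import asyncio

def build_request(image_bytes: bytes, instruction: str) -> dict:
    """Pack one observation and a language instruction (hypothetical schema)."""
    return {"image": image_bytes, "instruction": instruction}

async def query_server(image_bytes: bytes, instruction: str,
                       uri: str = "ws://localhost:8000") -> dict:
    # websockets and msgpack come from the pip install list above
    import msgpack
    import websockets
    async with websockets.connect(uri) as ws:
        await ws.send(msgpack.packb(build_request(image_bytes, instruction)))
        return msgpack.unpackb(await ws.recv())

if __name__ == "__main__":
    result = asyncio.run(query_server(b"<jpeg bytes>", "pick up the red block"))
    print(result)
```

Check `script/run_launch_va_server_sync.sh` in the repository for the actual endpoint and payload format.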

---

# 📊 Performance

We evaluate our model on both simulation benchmarks and real-world scenarios, achieving state-of-the-art performance.

## Simulation Evaluation (Success Rate %)

| Method (Average over 50 Tasks) | Easy SR (%) | Hard SR (%) |
| :--- | :---: | :---: |
| X-VLA | 72.9 | 72.8 |
| π₀ | 65.9 | 58.4 |
| π₀.₅ | 82.7 | 76.8 |
| Motus | 88.7 | 87.0 |
| **LingBot-VA (Ours)** | **92.9** | **91.6** |

---

# 📝 Citation

```bibtex
@article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
  journal={arXiv preprint arXiv:2601.21998},
  year={2026}
}
```

# 🪪 License

This project is released under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.

# 🧩 Acknowledgments

This work builds upon several excellent open-source projects:

- [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
- [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture