Add robotics metadata and improve model card
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,64 +1,79 @@
<p align="center">
-  <img src="assets/teaser.png" width="100%">
</p>

-**LingBot-VA**
-- **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence while maintaining their conceptual distinction.
-- **High-efficiency Execution**: A dual-stream mixture-of-transformers (MoT) architecture with Asynchronous Execution and KV Cache.
-- **Long-Horizon Performance and Generalization**: High improvements in sample efficiency, long-horizon success rates, and generalization to novel scenes.

---

-#

-- **Repository:** [https://github.com/
-- **Paper:** [Causal World Modeling for Robot Control](https://
- **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)

---

-| Model Name | Huggingface Repository | ModelScope Repository | Description |
-| :--- | :--- | :--- | :--- |
-| lingbot-va-base | [🤗 robbyant/lingbot-va-base](https://huggingface.co/robbyant/lingbot-va-base) | [🤖 Robbyant/lingbot-va-base](https://modelscope.cn/models/Robbyant/lingbot-va-base) | LingBot-VA w/ shared backbone |
-| lingbot-va-posttrain-robotwin | [🤗 robbyant/lingbot-va-posttrain-robotwin](https://huggingface.co/robbyant/lingbot-va-posttrain-robotwin) | [🤖 Robbyant/lingbot-va-posttrain-robotwin](https://modelscope.cn/models/Robbyant/lingbot-va-posttrain-robotwin) | LingBot-VA-Posttrain-Robotwin w/ shared backbone |
---

-#

```bibtex
@article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
-  journal={arXiv preprint arXiv:
  year={2026}
}
```
# 🪪 License

-This project is released under the Apache License 2.0
# 🧩 Acknowledgments
This work builds upon several excellent open-source projects:
- [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
- [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture
-- The broader open-source computer vision and robotics communities

----

-For questions, discussions, or collaborations:

-<!-- - **Issues**: Open an [issue](https://github.com/robbyant/lingbot-depth/issues) on GitHub
-- **Email**: Contact Dr. [Bin Tan](https://icetttb.github.io/) (tanbin.tan@antgroup.com) or Dr. [Nan Xue](https://xuenan.net) (xuenan.xue@antgroup.com) -->

+---
+license: apache-2.0
+pipeline_tag: robotics
+tags:
+- robotics
+- world-model
+- video-generation
+- transformer
+---

+<h1 align="center">LingBot-VA: Causal World Modeling for Robot Control</h1>

<p align="center">
+  <img src="https://huggingface.co/robbyant/lingbot-va-base/resolve/main/assets/teaser.png" width="100%">
</p>

+**LingBot-VA** is an autoregressive diffusion framework for simultaneous world modeling and robot action execution. By modeling the causality between actions and visual dynamics, it can imagine the near future and plan actions accordingly.

+- **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence.
+- **High-efficiency Execution**: Uses a dual-stream Mixture-of-Transformers (MoT) architecture with asynchronous execution and KV-cache support (sketched below).
+- **Long-Horizon Performance**: Demonstrates significant promise in long-horizon manipulation and strong generalization to novel configurations.
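
To make the interleaved rollout in these bullets concrete, here is a minimal toy sketch of the control flow: imagine the next visual latent, infer an action from it, and grow a KV cache so history is never re-encoded. Every name below (`ToyWorldModel`, the two heads) is an illustrative stand-in, not the LingBot-VA API.

```python
import torch

class ToyWorldModel(torch.nn.Module):
    """Illustrative stand-in, NOT the real dual-stream MoT implementation."""
    def __init__(self, dim=64, action_dim=7):
        super().__init__()
        self.video_head = torch.nn.Linear(dim, dim)           # "video stream": predicts next visual latent
        self.action_head = torch.nn.Linear(dim, action_dim)   # "action stream": reads the imagined latent

    def forward(self, tokens, kv_cache=None):
        # A growing cache stands in for transformer KV caching: past tokens are
        # stored once and reused, so each step only encodes the new tokens.
        kv_cache = tokens if kv_cache is None else torch.cat([kv_cache, tokens], dim=1)
        ctx = kv_cache.mean(dim=1, keepdim=True)              # toy "attention" over cached history
        return ctx, kv_cache

model = ToyWorldModel()
obs = torch.randn(1, 1, 64)        # encoded current camera frame (latent)
kv_cache = None
for step in range(3):              # interleaved rollout: imagine, act, repeat
    ctx, kv_cache = model(obs, kv_cache)
    next_latent = model.video_head(ctx)        # 1) predict the next visual state
    action = model.action_head(next_latent)    # 2) infer the action consistent with it
    print(f"step {step}: action {tuple(action.shape)}, cached tokens {kv_cache.shape[1]}")
    obs = next_latent                          # the imagined frame extends the sequence
```

In the MoT design each stream keeps its own transformer parameters while attending over the shared interleaved sequence, which is what allows action inference to run asynchronously from the slower video prediction.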

---

+# Model Sources

+- **Repository:** [https://github.com/robbyant/lingbot-va](https://github.com/robbyant/lingbot-va)
+- **Paper:** [Causal World Modeling for Robot Control](https://huggingface.co/papers/2601.21998)
- **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)

---

+# 🛠️ Quick Start
+
+## Installation
+
+```bash
+pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu126
+pip install websockets einops diffusers==0.36.0 transformers==5.0.0 accelerate msgpack opencv-python matplotlib ftfy easydict
+pip install flash-attn --no-build-isolation
+```
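
Because the pins above are strict (torch 2.9.0 against CUDA 12.6, flash-attn built against that pair), a quick sanity check can save a failed launch later; this sketch only touches packages installed above:

```python
import torch

# Expect the pinned torch 2.9.0 and a visible CUDA device on a cu126 setup.
print(torch.__version__, torch.cuda.is_available())

try:
    import flash_attn  # must be compiled against the installed torch/CUDA pair
    print("flash-attn OK:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn missing or mis-built:", err)
```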
+
+## Run Image to Video-Action Generation
+
+You can use the following command to generate video-action sequences from images:

+```bash
+NGPU=1 CONFIG_NAME='robotwin_i2av' bash script/run_launch_va_server_sync.sh
+```
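
The command above launches a generation server rather than a one-shot CLI, and the dependency list (websockets, msgpack) suggests a msgpack-over-websocket interface. Below is a rough client sketch under that assumption; the port, URL, and message schema are guesses, not the documented protocol, so check the repository's own client code for the real message format.

```python
import asyncio

import cv2
import msgpack
import websockets

async def query_server(image_path: str, instruction: str):
    frame = cv2.imread(image_path)  # BGR uint8, HxWx3
    assert frame is not None, f"could not read {image_path}"
    payload = msgpack.packb(
        # Hypothetical schema: raw pixels plus shape and a language instruction.
        {"image": frame.tobytes(), "shape": frame.shape, "instruction": instruction},
        use_bin_type=True,
    )
    # ws://localhost:8000 is a placeholder; use whatever the launch script reports.
    async with websockets.connect("ws://localhost:8000") as ws:
        await ws.send(payload)
        reply = msgpack.unpackb(await ws.recv(), raw=False)
        return reply  # e.g., a predicted action chunk and/or generated frames

if __name__ == "__main__":
    print(asyncio.run(query_server("obs.png", "pick up the red block")))
```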

---

+# 📊 Performance
+
+LingBot-VA achieves state-of-the-art results on benchmarks such as RoboTwin 2.0 and LIBERO, excelling in long-horizon tasks and sample efficiency. For detailed evaluations in simulation and the real world, please refer to the [paper](https://huggingface.co/papers/2601.21998) or the [GitHub README](https://github.com/robbyant/lingbot-va).
+
+---
+
+# 📚 Citation

```bibtex
@article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
+  journal={arXiv preprint arXiv:2601.21998},
  year={2026}
}
```
# 🪪 License

+This project is released under the [Apache License 2.0](LICENSE).
# 🧩 Acknowledgments
This work builds upon several excellent open-source projects:
- [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
- [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture
+- The broader open-source computer vision and robotics communities