---
license: apache-2.0
language:
- en
---
<h1 align="center">Causal World Modeling for Robot Control</h1>
<p align="center">
<img src="assets/teaser.png" width="100%">
</p>
**LingBot-VA** focuses on:
- **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence while maintaining their conceptual distinction.
- **High-Efficiency Execution**: A dual-stream mixture-of-transformers (MoT) architecture with asynchronous execution and KV caching.
- **Long-Horizon Performance and Generalization**: Substantial improvements in sample efficiency, long-horizon success rates, and generalization to novel scenes.
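The interleaved video-action loop above can be illustrated with a minimal conceptual sketch. This is not the official implementation or API: all names (`KVCache`, `predict_next_frame`, `infer_action`, `rollout`) are hypothetical placeholders showing only the control flow, in which frame and action tokens alternate in one autoregressive sequence and each step appends only its new tokens to a cache rather than reprocessing the history.

```python
# Conceptual sketch only (hypothetical names, not the LingBot-VA API):
# an autoregressive loop interleaving visual-dynamics tokens and action
# tokens in a single sequence, with a KV-cache stand-in that grows
# incrementally so each step processes only the newest tokens.
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    # Stands in for cached transformer attention state; here we simply
    # record which tokens have already been processed.
    seen: List[str] = field(default_factory=list)

    def extend(self, tokens: List[str]) -> None:
        self.seen.extend(tokens)


def predict_next_frame(cache: KVCache, step: int) -> str:
    # Placeholder for the video-prediction stream of the dual-stream MoT.
    return f"frame_{step}"


def infer_action(cache: KVCache, step: int) -> str:
    # Placeholder for the action-inference stream of the dual-stream MoT.
    return f"action_{step}"


def rollout(horizon: int) -> List[str]:
    """Interleave predicted frames and inferred actions autoregressively."""
    cache = KVCache()
    sequence: List[str] = []
    for t in range(horizon):
        frame = predict_next_frame(cache, t)
        cache.extend([frame])   # only the new token enters the cache
        action = infer_action(cache, t)
        cache.extend([action])
        sequence.extend([frame, action])
    return sequence


print(rollout(2))  # ['frame_0', 'action_0', 'frame_1', 'action_1']
```

The point of the sketch is the data layout: frames and actions share one sequence, so a single causal model can attend across both while the cache keeps per-step cost constant.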
---
# Model Sources
- **Repository:** [https://github.com/Robbyant/lingbot-va](https://github.com/Robbyant/lingbot-va)
- **Paper:** [https://arxiv.org/abs/2601.21998](https://arxiv.org/abs/2601.21998)
- **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)
---
# 📦 Model Download
- **Pretrained Checkpoints for Post-Training**
| Model Name | Huggingface Repository | ModelScope Repository | Description |
| :--- | :--- | :--- | :--- |
| lingbot-va-base | [🤗 robbyant/lingbot-va-base](https://huggingface.co/robbyant/lingbot-va-base) | [🤖 Robbyant/lingbot-va-base](https://modelscope.cn/models/Robbyant/lingbot-va-base) | LingBot-VA w/ shared backbone |
| lingbot-va-posttrain-robotwin | [🤗 robbyant/lingbot-va-posttrain-robotwin](https://huggingface.co/robbyant/lingbot-va-posttrain-robotwin) | [🤖 Robbyant/lingbot-va-posttrain-robotwin](https://modelscope.cn/models/Robbyant/lingbot-va-posttrain-robotwin) | LingBot-VA-Posttrain-Robotwin w/ shared backbone |
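As a hedged example of fetching one of the checkpoints above (assuming the `huggingface_hub` Python package is installed; the local directory path is an arbitrary choice, not part of the release):

```python
# Sketch, not an official quickstart: download a checkpoint repository
# listed in the table above via the huggingface_hub client.
# The local_dir path is an arbitrary choice for this example.
from huggingface_hub import snapshot_download


def fetch_checkpoint(repo_id: str, local_dir: str) -> str:
    """Download (or reuse a cached copy of) a full model repository
    snapshot and return the local path it was materialized to."""
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    path = fetch_checkpoint("robbyant/lingbot-va-base",
                            "./checkpoints/lingbot-va-base")
    print(path)
```

The same call with the other repo id fetches the RoboTwin post-trained checkpoint; the ModelScope mirrors use their own client instead.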
---
# 📚 Citation
```bibtex
@article{lingbot-va2026,
title={Causal World Modeling for Robot Control},
author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
journal={arXiv preprint arXiv:2601.21998},
year={2026}
}
```
# 🪪 License
This project is released under the Apache License 2.0. See [LICENSE](LICENSE) file for details.
# 🧩 Acknowledgments
This work builds upon several excellent open-source projects:
- [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
- [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture
- The broader open-source computer vision and robotics communities
---
For questions, discussions, or collaborations:
<!-- - **Issues**: Open an [issue](https://github.com/robbyant/lingbot-depth/issues) on GitHub
- **Email**: Contact Dr. [Bin Tan](https://icetttb.github.io/) (tanbin.tan@antgroup.com) or Dr. [Nan Xue](https://xuenan.net) (xuenan.xue@antgroup.com) -->