Diffusers · Safetensors · English

Add robotics metadata and improve model card

#1 opened by nielsr (HF Staff)

Files changed (1): README.md (+44 −29)

README.md CHANGED
@@ -1,64 +1,79 @@
- <h1 align="center">Causal World Modeling for Robot Control</h1>

  <p align="center">
- <img src="assets/teaser.png" width="100%">
  </p>

- **LingBot-VA** has focused on:
- - **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence while maintaining their conceptual distinction.
- - **High-efficiency Execution**: A dual-stream mixture-of-transformers(MoT) architecture with Asynchronous Execution and KV Cache.
- - **Long-Horizon Performance and Generalization**: High improvements in sample efficiency, long-horizon success rates, and generalization to novel scenes.

  ---

- # Model Sources

- - **Repository:** [https://github.com/Robbyant/lingbot-va](https://github.com/Robbyant/lingbot-va)
- - **Paper:** [Causal World Modeling for Robot Control](https://technology.robbyant.com/lingbot-va)
  - **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)

  ---

- # 📦 Model Download
- - **Pretrained Checkpoints for Post-Training**

- | Model Name | Huggingface Repository | ModelScope Repository | Description |
- | :--- | :--- | :--- | :--- |
- | lingbot-va-base | [🤗 robbyant/lingbot-va-base](https://huggingface.co/robbyant/lingbot-va-base) | [🤖 Robbyant/lingbot-va-base](https://modelscope.cn/models/Robbyant/lingbot-va-base) | LingBot-VA w/ shared backbone |
- | lingbot-va-posttrain-robotwin | [🤗 robbyant/lingbot-va-posttrain-robotwin](https://huggingface.co/robbyant/lingbot-va-posttrain-robotwin) | [🤖 Robbyant/lingbot-va-posttrain-robotwin](https://modelscope.cn/models/Robbyant/lingbot-va-posttrain-robotwin) | LingBot-VA-Posttrain-Robotwin w/ shared backbone |
  ---

- # 📚Citation

  ```bibtex
  @article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
- journal={arXiv preprint arXiv:[xxxx]},
  year={2026}
  }
  ```

-
  # 🪪 License

- This project is released under the Apache License 2.0. See [LICENSE](LICENSE) file for details.

  # 🧩 Acknowledgments

  This work builds upon several excellent open-source projects:
-
  - [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
  - [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture
- - The broader open-source computer vision and robotics communities
-
- ---
-
- For questions, discussions, or collaborations:
-
- <!-- - **Issues**: Open an [issue](https://github.com/robbyant/lingbot-depth/issues) on GitHub
- - **Email**: Contact Dr. [Bin Tan](https://icetttb.github.io/) (tanbin.tan@antgroup.com) or Dr. [Nan Xue](https://xuenan.net) (xuenan.xue@antgroup.com) -->
-

+ ---
+ license: apache-2.0
+ pipeline_tag: robotics
+ tags:
+ - robotics
+ - world-model
+ - video-generation
+ - transformer
+ ---

+ <h1 align="center">LingBot-VA: Causal World Modeling for Robot Control</h1>

  <p align="center">
+ <img src="https://huggingface.co/robbyant/lingbot-va-base/resolve/main/assets/teaser.png" width="100%">
  </p>

+ **LingBot-VA** is an autoregressive diffusion framework for simultaneous world modeling and robot action execution. By modeling the causality between actions and visual dynamics, it can imagine the near future and plan actions accordingly.

+ - **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence.
+ - **High-efficiency Execution**: Uses a dual-stream Mixture-of-Transformers (MoT) architecture with asynchronous execution and KV-cache support.
+ - **Long-Horizon Performance**: Demonstrates strong long-horizon manipulation performance and generalization to novel configurations.
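
The KV cache mentioned above is what keeps autoregressive video-action rollout affordable: each new step attends over cached keys and values from all past steps instead of recomputing them. Below is a minimal, dependency-free sketch of the idea; it is illustrative only and is not taken from the LingBot-VA codebase.

```python
import math

def attend(q, ks, vs):
    """Single-query scaled dot-product attention over a list of keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    w = [e / z for e in exps]
    return [sum(wi * v[j] for wi, v in zip(w, vs)) for j in range(d)]

class KVCache:
    """Append-only key/value cache: each new step attends over all past
    steps' cached keys/values without recomputing them."""
    def __init__(self):
        self.ks, self.vs = [], []

    def step(self, q, k, v):
        self.ks.append(k)
        self.vs.append(v)
        return attend(q, self.ks, self.vs)

# Incremental decoding with the cache matches full recomputation per step.
cache = KVCache()
seq = [([1.0, 0.0], [1.0, 0.0], [0.5, 0.5]),
       ([0.0, 1.0], [0.0, 1.0], [1.0, 0.0]),
       ([1.0, 1.0], [1.0, 1.0], [0.0, 1.0])]
cached_out = [cache.step(q, k, v) for q, k, v in seq]
full_out = [attend(seq[t][0], [s[1] for s in seq[:t + 1]], [s[2] for s in seq[:t + 1]])
            for t in range(len(seq))]
assert all(abs(a - b) < 1e-9
           for co, fo in zip(cached_out, full_out)
           for a, b in zip(co, fo))
```

The cached path does O(t) work at step t instead of O(t²) for a full recompute, which is why caching matters for long interleaved video-action sequences.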
  ---

+ # Model Sources

+ - **Repository:** [https://github.com/robbyant/lingbot-va](https://github.com/robbyant/lingbot-va)
+ - **Paper:** [Causal World Modeling for Robot Control](https://huggingface.co/papers/2601.21998)
  - **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)

  ---

+ # 🛠️ Quick Start
+
+ ## Installation
+
+ ```bash
+ pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu126
+ pip install websockets einops diffusers==0.36.0 transformers==5.0.0 accelerate msgpack opencv-python matplotlib ftfy easydict
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ## Run Image to Video-Action Generation
+
+ You can use the following command to generate video-action sequences from images:
+
+ ```bash
+ NGPU=1 CONFIG_NAME='robotwin_i2av' bash script/run_launch_va_server_sync.sh
+ ```
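
The server launched above drives the interleaved rollout the introduction describes: imagine the next observation, infer the action consistent with it, append both to the sequence, and repeat. A toy, model-free sketch of that control loop follows; every function name here is a hypothetical stand-in, not part of the project's API.

```python
from typing import Callable, List, Tuple

def rollout(
    init_frame: int,
    predict_frame: Callable[[List[int], List[int]], int],  # imagine next observation
    infer_action: Callable[[List[int], List[int]], int],   # action consistent with it
    horizon: int,
) -> Tuple[List[int], List[int]]:
    """Interleaved video-action rollout: frames and actions share one
    autoregressive sequence, each step conditioning on everything so far."""
    frames, actions = [init_frame], []
    for _ in range(horizon):
        nxt = predict_frame(frames, actions)        # world-model step
        act = infer_action(frames + [nxt], actions)  # action-inference step
        frames.append(nxt)
        actions.append(act)
    return frames, actions

# Toy dynamics: each "frame" is a counter, each "action" the observed delta.
frames, actions = rollout(
    init_frame=0,
    predict_frame=lambda f, a: f[-1] + 1,
    infer_action=lambda f, a: f[-1] - f[-2],
    horizon=3,
)
assert frames == [0, 1, 2, 3]
assert actions == [1, 1, 1]
```

In the real system the "frames" are predicted video latents and the "actions" are robot commands, but the alternating conditioning structure is the same.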
  ---

+ # 📊 Performance
+
+ LingBot-VA achieves state-of-the-art performance on benchmarks such as RoboTwin 2.0 and LIBERO, excelling in long-horizon tasks and sample efficiency. For detailed evaluation results in simulation and real-world settings, see the [paper](https://huggingface.co/papers/2601.21998) or the [GitHub README](https://github.com/robbyant/lingbot-va).
+
+ ---
+
+ # 📚 Citation

  ```bibtex
  @article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
+ journal={arXiv preprint arXiv:2601.21998},
  year={2026}
  }
  ```

  # 🪪 License

+ This project is released under the [Apache License 2.0](LICENSE).

  # 🧩 Acknowledgments

  This work builds upon several excellent open-source projects:

  - [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
  - [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture
+ - The broader open-source computer vision and robotics communities