Add metadata and link to paper

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +55 -23
README.md CHANGED
@@ -1,50 +1,92 @@
- <h1 align="center">Causal World Modeling for Robot Control</h1>

  <p align="center">
  <img src="assets/teaser.png" width="100%">
  </p>

- **LingBot-VA** has focused on:
  - **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence while maintaining their conceptual distinction.
- - **High-efficiency Execution**: A dual-stream mixture-of-transformers(MoT) architecture with Asynchronous Execution and KV Cache.
  - **Long-Horizon Performance and Generalization**: High improvements in sample efficiency, long-horizon success rates, and generalization to novel scenes.

-
  ---

- # Model Sources

  - **Repository:** [https://github.com/Robbyant/lingbot-va](https://github.com/Robbyant/lingbot-va)
- - **Paper:** [Causal World Modeling for Robot Control](https://technology.robbyant.com/lingbot-va)
  - **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)

  ---

-
  # 📦 Model Download
  - **Pretrained Checkpoints for Post-Training**

  | Model Name | Huggingface Repository | Description |
  | :--- | :---: | :---: |
- | lingbot-va-base &nbsp; | [🤗 robbyant/lingbot-va-base &nbsp;](https://huggingface.co/robbyant/lingbot-va-base) | LingBot-VA w/ shared backbone|
- | lingbot-va-posttrain-robotwin &nbsp; | [🤗 robbyant/lingbot-va-posttrain-robotwin &nbsp;](https://huggingface.co/robbyant/lingbot-va-posttrain-robotwin) | LingBot-VA-Posttrain-Robotwin w/ shared backbone|

  ---

- # 📚Citation

  ```bibtex
  @article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
- journal={arXiv preprint arXiv:[xxxx]},
  year={2026}
  }
  ```

-
  # 🪪 License

  This project is released under the Apache License 2.0. See [LICENSE](LICENSE) file for details.
@@ -52,15 +94,5 @@ This project is released under the Apache License 2.0. See [LICENSE](LICENSE) fi

  # 🧩 Acknowledgments

  This work builds upon several excellent open-source projects:
-
  - [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
- - [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture
- - The broader open-source computer vision and robotics communities
-
- ---
-
- For questions, discussions, or collaborations:
-
- <!-- - **Issues**: Open an [issue](https://github.com/robbyant/lingbot-depth/issues) on GitHub
- - **Email**: Contact Dr. [Bin Tan](https://https://icetttb.github.io/) (tanbin.tan@antgroup.com) or Dr. [Nan Xue](https://xuenan.net) (xuenan.xue@antgroup.com) -->
-
1
+ ---
+ license: apache-2.0
+ pipeline_tag: robotics
+ library_name: transformers
+ ---

+ <h1 align="center">Causal World Modeling for Robot Control</h1>

  <p align="center">
  <img src="assets/teaser.png" width="100%">
  </p>

+ **LingBot-VA** is an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously, introduced in the paper [Causal World Modeling for Robot Control](https://huggingface.co/papers/2601.21998).
+
+ It focuses on:
  - **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence while maintaining their conceptual distinction.
+ - **High-efficiency Execution**: A dual-stream mixture-of-transformers (MoT) architecture with Asynchronous Execution and KV Cache.
  - **Long-Horizon Performance and Generalization**: High improvements in sample efficiency, long-horizon success rates, and generalization to novel scenes.

  ---

+ # Model Sources

  - **Repository:** [https://github.com/Robbyant/lingbot-va](https://github.com/Robbyant/lingbot-va)
+ - **Paper:** [Causal World Modeling for Robot Control](https://huggingface.co/papers/2601.21998)
  - **Project Page:** [https://technology.robbyant.com/lingbot-va](https://technology.robbyant.com/lingbot-va)

  ---

  # 📦 Model Download
  - **Pretrained Checkpoints for Post-Training**

  | Model Name | Huggingface Repository | Description |
  | :--- | :---: | :---: |
+ | lingbot-va-base | [🤗 robbyant/lingbot-va-base](https://huggingface.co/robbyant/lingbot-va-base) | LingBot-VA w/ shared backbone |
+ | lingbot-va-posttrain-robotwin | [🤗 robbyant/lingbot-va-posttrain-robotwin](https://huggingface.co/robbyant/lingbot-va-posttrain-robotwin) | LingBot-VA-Posttrain-Robotwin w/ shared backbone |
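The checkpoints in the table above can also be fetched programmatically with the `huggingface_hub` library. A minimal sketch, assuming `pip install huggingface_hub`; the helper names (`checkpoint_repo`, `download_checkpoint`) are illustrative, not part of the released code:

```python
def checkpoint_repo(stage: str) -> str:
    """Map a checkpoint name from the table above to its Hugging Face repo id."""
    repos = {
        "base": "robbyant/lingbot-va-base",
        "posttrain-robotwin": "robbyant/lingbot-va-posttrain-robotwin",
    }
    return repos[stage]

def download_checkpoint(stage: str, local_dir: str) -> str:
    """Download a full checkpoint snapshot and return its local path."""
    # Imported lazily so the mapping above works without the library installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=checkpoint_repo(stage), local_dir=local_dir)

# Example (downloads several GB):
# path = download_checkpoint("base", local_dir="./checkpoints/lingbot-va-base")
```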
 
+ ---
+
+ # 🛠️ Quick Start
+
+ ## Installation
+ **Requirements**
+ - Python == 3.10.16
+ - PyTorch == 2.9.0
+ - CUDA 12.6
+
+ ```bash
+ pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu126
+ pip install websockets einops diffusers==0.36.0 transformers==5.0.0 accelerate msgpack opencv-python matplotlib ftfy easydict
+ pip install flash-attn --no-build-isolation
+ ```
+
+ ## Run Image to Video-Action Generation
+ We provide a script for image to video-action generation:
+
+ ```bash
+ NGPU=1 CONFIG_NAME='robotwin_i2av' bash script/run_launch_va_server_sync.sh
+ ```
 
  ---

+ # 📊 Performance
+
+ We evaluate our model on both simulation benchmarks and real-world scenarios, achieving state-of-the-art performance.
+
+ ## Simulation Evaluation (Success Rate %)
+
+ | Method (Average 50 Tasks) | Easy SR (%) | Hard SR (%) |
+ | :--- | :---: | :---: |
+ | X-VLA | 72.9 | 72.8 |
+ | π₀ | 65.9 | 58.4 |
+ | π₀.₅ | 82.7 | 76.8 |
+ | Motus | 88.7 | 87.0 |
+ | **LingBot-VA (Ours)** | **92.9** | **91.6** |
+
+ ---
+
+ # 📚 Citation

  ```bibtex
  @article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
+ journal={arXiv preprint arXiv:2601.21998},
  year={2026}
  }
  ```
 
 
  # 🪪 License

  This project is released under the Apache License 2.0. See [LICENSE](LICENSE) file for details.

  # 🧩 Acknowledgments

  This work builds upon several excellent open-source projects:
  - [Wan-Video](https://github.com/Wan-Video) - Vision transformer backbone
+ - [MoT](https://github.com/facebookresearch/Mixture-of-Transformers) - Mixture-of-Transformers architecture