---
license: apache-2.0
pipeline_tag: robotics
tags:
  - robotics
  - world-model
  - video-generation
  - transformer
---

# LingBot-VA: Causal World Modeling for Robot Control

LingBot-VA is an autoregressive diffusion framework for simultaneous world modeling and robot action execution. By modeling the causal relationship between actions and visual dynamics, it can imagine the near future and plan actions accordingly.

- **Autoregressive Video-Action World Modeling**: architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence.
- **High-Efficiency Execution**: uses a dual-stream Mixture-of-Transformers (MoT) architecture with asynchronous execution and KV-cache support.
- **Long-Horizon Performance**: shows strong results on long-horizon manipulation and generalizes well to novel configurations.
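To make the interleaved video-action idea concrete, here is a minimal sketch of an autoregressive rollout loop that alternates frame prediction and action inference over one shared history, with a toy KV cache standing in for the real one. All names (`KVCache`, `rollout`, the stand-in predictors) are illustrative assumptions, not the model's actual API.

```python
# Hypothetical sketch of an interleaved video-action rollout.
# KVCache, rollout, predict_frame, infer_action are illustrative names,
# not the real LingBot-VA API.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy key/value cache: past tokens are stored once, never recomputed."""
    tokens: list = field(default_factory=list)

    def append(self, tok):
        self.tokens.append(tok)


def rollout(predict_frame, infer_action, first_frame, horizon):
    """Alternate future-frame prediction and action inference over a single
    interleaved sequence, reusing the cached history at every step."""
    cache = KVCache()
    cache.append(("frame", first_frame))
    actions = []
    for _ in range(horizon):
        # 1) imagine the next observation conditioned on the full history
        frame = predict_frame(cache.tokens)
        cache.append(("frame", frame))
        # 2) infer the action consistent with that imagined future
        action = infer_action(cache.tokens)
        cache.append(("action", action))
        actions.append(action)
    return actions


# Toy stand-ins to make the loop runnable:
acts = rollout(lambda hist: f"f{len(hist)}",
               lambda hist: f"a{len(hist)}",
               "f0", horizon=3)
print(acts)  # ['a2', 'a4', 'a6']
```

The point of the interleaving is that both modalities condition on the same growing history, so an action is always inferred after the model has committed to an imagined next frame.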

## Model Sources


πŸ› οΈ Quick Start

### Installation

```shell
pip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https://download.pytorch.org/whl/cu126
pip install websockets einops diffusers==0.36.0 transformers==5.0.0 accelerate msgpack opencv-python matplotlib ftfy easydict
pip install flash-attn --no-build-isolation
```

### Run Image to Video-Action Generation

Use the following command to launch the server and generate video-action sequences from images:

```shell
NGPU=1 CONFIG_NAME='robotwin_i2av' bash script/run_launch_va_server_sync.sh
```
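The launched server communicates over a socket, but its message schema is not documented here. The sketch below only illustrates request/response framing for an image-to-video-action call; the field names (`image`, `prompt`, `frames`, `actions`) are assumptions, and `json` stands in for the `msgpack` encoding the install step pulls in.

```python
# Illustrative client-side message framing; NOT the documented protocol.
# Field names are hypothetical; json is used in place of msgpack for
# portability of this sketch.
import json


def make_request(image_path, prompt):
    """Build one image-to-video-action request (hypothetical schema)."""
    return json.dumps({"image": image_path, "prompt": prompt})


def parse_reply(payload):
    """Decode a reply into (video_frames, actions) - hypothetical fields."""
    data = json.loads(payload)
    return data.get("frames", []), data.get("actions", [])


req = make_request("obs.png", "stack the cups")
frames, actions = parse_reply('{"frames": [], "actions": [[0.1, 0.2]]}')
```

Consult the GitHub README for the actual client interface before adapting this.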

## 📊 Performance

LingBot-VA achieves state-of-the-art results on benchmarks such as RoboTwin 2.0 and LIBERO, excelling in particular at long-horizon tasks and sample efficiency. For detailed evaluation results in simulation and real-world scenarios, please refer to the paper or the GitHub README.


## 📚 Citation

```bibtex
@article{lingbot-va2026,
  title={Causal World Modeling for Robot Control},
  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},
  journal={arXiv preprint arXiv:2601.21998},
  year={2026}
}
```

## 🪪 License

This project is released under the Apache License 2.0.

## 🧩 Acknowledgments

This work builds upon several excellent open-source projects:

- Wan-Video: vision transformer backbone
- MoT: Mixture-of-Transformers architecture
- The broader open-source computer vision and robotics communities