# WAN 2.2 Video Generation Model (Stage 1 Pretrained)
Stage 1 pretrained WAN 2.2 (5B) video generation model for Motus. This checkpoint provides the video generation backbone, trained on Multi-Robot Task Trajectories, Synthetic Robot Data, and Egocentric Human Videos.
Homepage | GitHub | arXiv | Feishu
## Model Details
### Architecture
| Component | Specification |
|---|---|
| Base Model | WAN 2.2 |
| Parameters | 5B |
| Precision | bfloat16 |
### Training Details
- Stage: Stage 1 (VGM Training)
- Training Data: Multi-Robot Task Trajectories, Synthetic Robot Data, Egocentric Human Videos
- Objective: Text-conditioned image-to-video generation (TI2V)
## Hardware & Software Requirements
| Mode | VRAM | Example GPU |
|---|---|---|
| Inference | ~16 GB | RTX 4090 |
| Fine-tuning | ~40 GB | A100 (40 GB) |
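As a rough sanity check on the inference figure, the bfloat16 weights alone account for most of the footprint; the exact budget also depends on resolution, frame count, and the attention implementation. A back-of-envelope estimate:

```python
# Back-of-envelope weight memory for a 5B-parameter model in bfloat16.
params = 5_000_000_000
bytes_per_param = 2  # bfloat16 stores 2 bytes per parameter

weights_gib = params * bytes_per_param / 1024**3
print(f"{weights_gib:.1f} GiB")  # ~9.3 GiB for weights alone
```

Activations, the VAE, and framework overhead account for the remainder of the ~16 GB inference budget; optimizer states explain the much larger fine-tuning requirement.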
## Usage in Motus
### Configuration

Update your Motus config file (e.g., `configs/robotwin.yaml`):
```yaml
model:
  wan:
    checkpoint_path: "./pretrained_models/Motus_Wan2_2_5B_pretrain"  # This checkpoint
    config_path: "./pretrained_models/Motus_Wan2_2_5B_pretrain"
    vae_path: "./pretrained_models/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"  # Local VAE (not included)
    precision: "bfloat16"
```
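Before launching a run, it can help to sanity-check this section programmatically. A minimal sketch; the `validate_wan_config` helper and the key set are illustrative, not part of the Motus codebase:

```python
# Hypothetical helper: validate the `model.wan` section of a Motus config.
REQUIRED_KEYS = {"checkpoint_path", "config_path", "vae_path", "precision"}

def validate_wan_config(wan: dict) -> list:
    """Return a list of problems found in the wan config section (empty if OK)."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - wan.keys())]
    if wan.get("precision") not in {"bfloat16", "float16", "float32"}:
        problems.append(f"unsupported precision: {wan.get('precision')!r}")
    return problems

cfg = {
    "checkpoint_path": "./pretrained_models/Motus_Wan2_2_5B_pretrain",
    "config_path": "./pretrained_models/Motus_Wan2_2_5B_pretrain",
    "vae_path": "./pretrained_models/Wan2.2-TI2V-5B/Wan2.2_VAE.pth",
    "precision": "bfloat16",
}
print(validate_wan_config(cfg))  # → []
```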
### Download
```bash
# Using the Hugging Face CLI
huggingface-cli download motus-robotics/Motus_Wan2_2_5B_pretrain --local-dir ./pretrained_models/Motus_Wan2_2_5B_pretrain

# Or using Git LFS
git lfs install
git clone https://huggingface.co/motus-robotics/Motus_Wan2_2_5B_pretrain
```
### Note on VAE

The WAN VAE (`Wan2.2_VAE.pth`) is not included in this repository. You need to:

- Download the original WAN 2.2 VAE from Wan-Video/Wan2.2
- Set `vae_path` in your config to point to the local VAE file
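A quick preflight check along these lines can catch a missing VAE before a long startup. The `vae_is_ready` helper below is a hypothetical sketch, not part of the Motus codebase:

```python
from pathlib import Path

def vae_is_ready(vae_path: str) -> bool:
    """Hypothetical preflight check: the configured VAE file exists
    and has the expected .pth extension."""
    p = Path(vae_path)
    return p.is_file() and p.suffix == ".pth"

# Prints False until the VAE has been downloaded to the configured location.
print(vae_is_ready("./pretrained_models/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"))
```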
## Citation

```bibtex
@misc{motus2025,
      title={Motus: A Unified Latent Action World Model},
      author={Hongzhe Bi and Hengkai Tan and Shenghao Xie and Zeyuan Wang and Shuhe Huang and Haitian Liu and Ruowen Zhao and Yao Feng and Chendong Xiang and Yinze Rong and Hongyan Zhao and Hanyu Liu and Zhizhong Su and Lei Ma and Hang Su and Jun Zhu},
      year={2025},
      eprint={2507.23523},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://motus-robotics.github.io/motus},
}
```