Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
Paper: arXiv:2604.26694
X-WAM is a unified 4D World Action Model that jointly predicts future multi-view RGB-D videos and robot actions from video priors. Its key specifications are:
| Component | Detail |
|---|---|
| Base model | Wan2.2-TI2V-5B |
| Text encoder | UMT5-XXL |
| VAE stride | (4, 16, 16) |
| Depth branch layers | 10 |
| Action dim | 14 (dual-arm relative EE pose + gripper) |
| Proprio dim | 16 (dual-arm absolute EE pose + gripper) |
| Prediction horizon | 8 frames video / 32 actions |
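
The numbers in the table above imply a few simple shape relationships, sketched below in Python. This is an illustrative calculation only, not code from the X-WAM repository: the names `XWAMShapes`, `actions_per_frame`, `action_chunk_shape`, and `latent_hw` are hypothetical, the 480x640 input resolution in the usage note is an assumed example, and the sketch assumes that the 8 predicted frames and the 32 predicted actions cover the same time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XWAMShapes:
    """Values copied from the specification table (names are hypothetical)."""
    vae_stride: tuple = (4, 16, 16)  # (time, height, width) compression
    video_horizon: int = 8           # predicted video frames per chunk
    action_horizon: int = 32         # predicted actions per chunk
    action_dim: int = 14             # dual-arm relative EE pose + gripper
    proprio_dim: int = 16            # dual-arm absolute EE pose + gripper


def actions_per_frame(s: XWAMShapes) -> int:
    """Actions per predicted frame, assuming both horizons span the same window."""
    quotient, remainder = divmod(s.action_horizon, s.video_horizon)
    if remainder:
        raise ValueError("action horizon is not a multiple of the video horizon")
    return quotient


def action_chunk_shape(s: XWAMShapes) -> tuple:
    """Shape of one predicted action chunk: (steps, action_dim)."""
    return (s.action_horizon, s.action_dim)


def latent_hw(height: int, width: int, s: XWAMShapes) -> tuple:
    """Spatial size of the VAE latent for an input of the given pixel size."""
    _, stride_h, stride_w = s.vae_stride
    if height % stride_h or width % stride_w:
        raise ValueError("input size must be divisible by the spatial stride")
    return (height // stride_h, width // stride_w)
```

Under these assumptions, `actions_per_frame(XWAMShapes())` gives 4 actions per predicted frame, and a hypothetical 480x640 input maps to a 30x40 latent grid. The actual input resolution and temporal alignment should be taken from `config.yaml` and the code repository.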
This repository contains three checkpoints:
| Checkpoint | Path | Description | Training Steps |
|---|---|---|---|
| Pretrained | pretrained/ | Pretrained on 5,800+ hours of cross-embodiment data | 40,000 |
| RoboCasa SFT | robocasa_sft/ | Fine-tuned on RoboCasa (24 kitchen tasks) | 20,000 |
| RoboTwin SFT | robotwin_sft/ | Fine-tuned on RoboTwin 2.0 (50 dual-arm tasks) | 40,000 |
Each checkpoint directory contains:
{checkpoint_name}/
├── config.yaml          # Training config with normalization statistics
└── checkpoints/
    └── last.ckpt        # Model weights (~37GB)
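
Because each checkpoint is roughly 37GB, it can be worth confirming that a local copy has the expected layout before launching inference. The following is a minimal sketch using only the Python standard library. The helpers `checkpoint_files` and `missing_files` are hypothetical and not part of the X-WAM code; the directory and file names come from the tables and layout above, and `repo_root` is assumed to be wherever this repository was downloaded.

```python
from pathlib import Path

# Checkpoint directory names as listed in the checkpoint table.
CHECKPOINT_DIRS = ("pretrained", "robocasa_sft", "robotwin_sft")


def checkpoint_files(repo_root, name):
    """Return the expected config and weight paths for one checkpoint."""
    if name not in CHECKPOINT_DIRS:
        raise KeyError(f"unknown checkpoint {name!r}; expected one of {CHECKPOINT_DIRS}")
    base = Path(repo_root) / name
    return {
        "config": base / "config.yaml",               # training config + normalization stats
        "weights": base / "checkpoints" / "last.ckpt",  # model weights
    }


def missing_files(repo_root, name):
    """List expected files that are not present on disk (empty list means complete)."""
    return [str(p) for p in checkpoint_files(repo_root, name).values() if not p.is_file()]
```

For example, `missing_files("/path/to/download", "robotwin_sft")` returns an empty list when both `config.yaml` and `checkpoints/last.ckpt` are present. Loading the weights themselves is handled by the scripts in the code repository.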
| Benchmark | Setting | Avg Success Rate |
|---|---|---|
| RoboCasa | 24 kitchen manipulation tasks | 79.2% |
| RoboTwin 2.0 | Clean (50 tasks) | 89.8% |
| RoboTwin 2.0 | Randomized (50 tasks) | 90.7% |
Please refer to the code repository for full inference and evaluation scripts: https://github.com/sharinka0715/X-WAM
@article{guo2026xwam,
title={Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising},
author={Guo, Jun and Li, Qiwei and Li, Peiyan and Chen, Zilong and Sun, Nan and Su, Yifei and Wang, Heyun and Zhang, Yuan and Li, Xinghang and Liu, Huaping},
journal={arXiv preprint arXiv:2604.26694},
year={2026}
}
This project is licensed under the Apache License 2.0.