X-WAM

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

Model Description

X-WAM is a unified 4D World Action Model that jointly predicts future multi-view RGB-D videos and robot actions from video priors. It features:

Lightweight Depth Adaptation: Replicates the final blocks of the pretrained DiT as an interleaved depth branch for spatial reconstruction without increasing sequence length.
Asynchronous Noise Sampling (ANS): Decodes actions in fewer denoising steps (10 at inference) for real-time execution, while video generation runs the full denoising schedule (50 steps) for high fidelity.
4D Unified Modeling: Simultaneously optimizes video generation, 3D spatial reconstruction, and policy execution in a single framework.

Architecture

Component	Detail
Base model	Wan2.2-TI2V-5B
Text encoder	UMT5-XXL
VAE stride	(4, 16, 16)
Depth branch layers	10
Action dim	14 (dual-arm relative end-effector pose + gripper)
Proprio dim	16 (dual-arm absolute end-effector pose + gripper)
Prediction horizon	8 frames video / 32 actions
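
To make the VAE stride concrete, the sketch below maps a pixel-space clip to its latent grid size. It assumes the causal-VAE convention used by the Wan model family, where the first frame is encoded on its own and every further group of 4 frames adds one latent frame. The function name `latent_grid` and the 9-frame, 256×320 input are hypothetical and chosen only for illustration; they are not settings taken from X-WAM.

```python
from typing import Tuple

# (time, height, width) stride from the architecture table.
VAE_STRIDE: Tuple[int, int, int] = (4, 16, 16)


def latent_grid(
    frames: int,
    height: int,
    width: int,
    stride: Tuple[int, int, int] = VAE_STRIDE,
) -> Tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumption: causal temporal compression, i.e. the first frame maps to
    one latent frame and each later group of `stride[0]` frames adds one.
    """
    st, sh, sw = stride
    if frames < 1:
        raise ValueError("frames must be at least 1")
    if height % sh or width % sw:
        raise ValueError("height and width must be multiples of the spatial stride")
    return ((frames - 1) // st + 1, height // sh, width // sw)


# Hypothetical input: 9 frames at 256x320 pixels.
print(latent_grid(9, 256, 320))
```

Under these assumptions, the depth branch operating on the same latent grid is what keeps the sequence length unchanged.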

Checkpoints

This repository contains three checkpoints:

Checkpoint	Path	Description	Training Steps
Pretrained	`pretrained/`	Pretrained on 5,800+ hours of cross-embodiment data	40,000
RoboCasa SFT	`robocasa_sft/`	Fine-tuned on RoboCasa (24 kitchen tasks)	20,000
RoboTwin SFT	`robotwin_sft/`	Fine-tuned on RoboTwin 2.0 (50 dual-arm tasks)	40,000

Each checkpoint directory contains:

{checkpoint_name}/
├── config.yaml          # Training config with normalization statistics
└── checkpoints/
    └── last.ckpt        # Model weights (~37GB)
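
As a small illustration of the layout above, the following sketch resolves the two files for a given checkpoint and reports any that are missing. The helper names `checkpoint_files` and `missing_files` and the local root `./X-WAM-checkpoints` are hypothetical; only the directory and file names come from this card.

```python
from pathlib import Path
from typing import Dict, List

# Checkpoint directory names listed in the table above.
CHECKPOINT_NAMES = ("pretrained", "robocasa_sft", "robotwin_sft")


def checkpoint_files(repo_root: str, name: str) -> Dict[str, Path]:
    """Return the expected config and weight paths for one checkpoint."""
    if name not in CHECKPOINT_NAMES:
        raise ValueError(f"unknown checkpoint {name!r}; expected one of {CHECKPOINT_NAMES}")
    base = Path(repo_root) / name
    return {
        "config": base / "config.yaml",
        "weights": base / "checkpoints" / "last.ckpt",
    }


def missing_files(repo_root: str, name: str) -> List[str]:
    """List expected files that are not present on disk."""
    return [str(p) for p in checkpoint_files(repo_root, name).values() if not p.is_file()]


if __name__ == "__main__":
    # Hypothetical local download location.
    for ckpt in CHECKPOINT_NAMES:
        print(ckpt, "missing:", missing_files("./X-WAM-checkpoints", ckpt))
```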

Performance

Policy Evaluation

Benchmark	Setting	Avg Success Rate
RoboCasa	24 kitchen manipulation tasks	79.2%
RoboTwin 2.0	Clean (50 tasks)	89.8%
RoboTwin 2.0	Randomized (50 tasks)	90.7%

Training Details

Pretraining

Data: 5,800+ hours (1.49M episodes) from AgibotWorld-Beta, DROID, InternA1, RoboCasa MimicGen, RoboTwin 2.0
Hardware: 256× NVIDIA H20 GPUs
Batch size: 2,048 (256 GPUs × 8)
Learning rate: 1e-4, linear warmup 1,000 steps + cosine decay
Steps: 40,000
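
The pretraining schedule can be sketched as a plain function of the step index. This is a minimal illustration, not the training code: it assumes the warmup starts from zero and that the cosine phase decays to a floor of `min_lr=0.0`, neither of which is stated in this card. The function name `pretrain_lr` is hypothetical.

```python
import math


def pretrain_lr(
    step: int,
    peak_lr: float = 1e-4,
    warmup_steps: int = 1_000,
    total_steps: int = 40_000,
    min_lr: float = 0.0,  # assumption: the final learning rate is not given in the card
) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        # Linear ramp from 0 at step 0 to peak_lr at warmup_steps.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 500, 1_000, 20_500, 40_000):
        print(s, pretrain_lr(s))
```

The same function covers fine-tuning by passing `peak_lr=1e-5` and the matching `total_steps`; the fine-tuning warmup length is not listed here, so it would have to be read from `config.yaml`.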

Fine-tuning (RoboCasa / RoboTwin)

Hardware: 32× NVIDIA H20 GPUs
Batch size: 128 (32 GPUs × 4)
Learning rate: 1e-5, linear warmup + cosine decay
Steps: 20,000 (RoboCasa) / 40,000 (RoboTwin)

Inference

Action decoding: 10 denoising steps (asynchronous, via ANS)
Video generation: 50 denoising steps
Scheduler: UniPC
CFG scale: 1.0
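
Two of these settings can be illustrated in a few lines. The sketch below builds evenly spaced noise-level grids for the two step budgets and applies the standard classifier-free guidance combination, which at scale 1.0 reduces to the conditional prediction alone. It is an illustration only: the evenly spaced grid is a simplification and not the UniPC schedule, how ANS interleaves action and video steps is not described in this card, and the names `timestep_grid` and `cfg_combine` are hypothetical.

```python
from typing import List


def timestep_grid(num_steps: int) -> List[float]:
    """Evenly spaced noise levels from 1.0 (pure noise) down to 0.0 (clean).

    Simplification: a real sampler such as UniPC chooses its own spacing.
    """
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def cfg_combine(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Step budgets from the list above.
action_grid = timestep_grid(10)  # 10 action denoising steps
video_grid = timestep_grid(50)   # 50 video denoising steps

# On these grids, each action step covers the same noise range as 5 video steps.
print(len(action_grid) - 1, len(video_grid) - 1)

# With scale 1.0 the unconditional term cancels out.
print(cfg_combine([1.0, 2.0], [0.5, 0.5], scale=1.0))
```

Because the unconditional term cancels at scale 1.0, a sampler can skip the unconditional forward pass entirely at this setting.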

Usage

# Please refer to the code repository for full inference and evaluation scripts:
# https://github.com/sharinka0715/X-WAM

Citation

@article{guo2026xwam,
  title={Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising},
  author={Guo, Jun and Li, Qiwei and Li, Peiyan and Chen, Zilong and Sun, Nan and Su, Yifei and Wang, Heyun and Zhang, Yuan and Li, Xinghang and Liu, Huaping},
  journal={arXiv preprint arXiv:2604.26694},
  year={2026}
}

License

This project is licensed under the Apache License 2.0.

Paper: Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising (arXiv:2604.26694, published Apr 29)