X-WAM

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising


Model Description

X-WAM is a unified 4D World Action Model that jointly predicts future multi-view RGB-D videos and robot actions from video priors. It features:

  • Lightweight Depth Adaptation: Replicates the final blocks of the pretrained DiT as an interleaved depth branch for spatial reconstruction without increasing sequence length.
  • Asynchronous Noise Sampling (ANS): Rapidly decodes actions with fewer denoising steps for real-time execution, while dedicating the full sequence of steps to generate high-fidelity video.
  • 4D Unified Modeling: Simultaneously optimizes video generation, 3D spatial reconstruction, and policy execution in a single framework.

Architecture

Component             Detail
Base model            Wan2.2-TI2V-5B
Text encoder          UMT5-XXL
VAE stride            (4, 16, 16)
Depth branch layers   10
Action dim            14 (dual-arm relative EE pose + gripper)
Proprio dim           16 (dual-arm absolute EE pose + gripper)
Prediction horizon    8 video frames / 32 actions
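
The dimensions above imply some simple bookkeeping, sketched below. Only the numbers 14, 16, 8, and 32 come from this card; the even split across two arms, the "first half is one arm" ordering, and all function names are assumptions made for illustration, so check `config.yaml` and the code repository for the actual layout.

```python
# Values taken from the architecture table.
ACTION_DIM = 14      # dual-arm relative EE pose + gripper
PROPRIO_DIM = 16     # dual-arm absolute EE pose + gripper
VIDEO_FRAMES = 8     # predicted video frames per horizon
ACTION_STEPS = 32    # predicted actions per horizon
NUM_ARMS = 2         # "dual-arm"


def per_arm_dim(total_dim: int, num_arms: int = NUM_ARMS) -> int:
    """Dimensions per arm, assuming the vector is split evenly across arms."""
    if total_dim % num_arms != 0:
        raise ValueError(f"{total_dim} is not divisible by {num_arms} arms")
    return total_dim // num_arms


def actions_per_frame(actions: int = ACTION_STEPS, frames: int = VIDEO_FRAMES) -> int:
    """Number of predicted actions that accompany each predicted video frame."""
    if actions % frames != 0:
        raise ValueError("action horizon is not a multiple of the video horizon")
    return actions // frames


def split_by_arm(vector: list[float], num_arms: int = NUM_ARMS) -> list[list[float]]:
    """Split a flat vector into per-arm chunks (ASSUMED contiguous ordering)."""
    width = per_arm_dim(len(vector), num_arms)
    return [vector[i * width:(i + 1) * width] for i in range(num_arms)]


if __name__ == "__main__":
    print("action dims per arm:", per_arm_dim(ACTION_DIM))
    print("proprio dims per arm:", per_arm_dim(PROPRIO_DIM))
    print("actions per video frame:", actions_per_frame())
```

Under these assumptions, each arm has 7 action dimensions and 8 proprioception dimensions, and each predicted video frame corresponds to 4 actions.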

Checkpoints

This repository contains three checkpoints:

Checkpoint     Path            Description                                           Training Steps
Pretrained     pretrained/     Pretrained on 5,800+ hours of cross-embodiment data   40,000
RoboCasa SFT   robocasa_sft/   Fine-tuned on RoboCasa (24 kitchen tasks)             20,000
RoboTwin SFT   robotwin_sft/   Fine-tuned on RoboTwin 2.0 (50 dual-arm tasks)        40,000

Each checkpoint directory contains:

{checkpoint_name}/
├── config.yaml          # Training config with normalization statistics
└── checkpoints/
    └── last.ckpt        # Model weights (~37 GB)
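
As a minimal sketch of how this layout can be addressed from a script, the helper below builds the two file paths for a named checkpoint. The function name `checkpoint_files` and the local root directory are hypothetical; the directory and file names are the ones listed above. It only constructs paths and does not load the weights, which is handled by the scripts in the code repository.

```python
from pathlib import Path

# Checkpoint directory names as listed in this model card.
CHECKPOINT_NAMES = ("pretrained", "robocasa_sft", "robotwin_sft")


def checkpoint_files(root: str, name: str) -> tuple[Path, Path]:
    """Return (config.yaml path, last.ckpt path) for one checkpoint directory.

    `root` is wherever the repository was downloaded locally (an assumption;
    pick any directory you like).
    """
    if name not in CHECKPOINT_NAMES:
        raise ValueError(f"unknown checkpoint {name!r}; expected one of {CHECKPOINT_NAMES}")
    base = Path(root) / name
    return base / "config.yaml", base / "checkpoints" / "last.ckpt"


if __name__ == "__main__":
    config_path, weights_path = checkpoint_files("./X-WAM-checkpoints", "robotwin_sft")
    print(config_path)
    print(weights_path)
    # Each last.ckpt is roughly 37 GB, so check for the file before loading it.
    print("weights present:", weights_path.exists())
```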

Performance

Policy Evaluation

Benchmark      Setting                         Avg Success Rate
RoboCasa       24 kitchen manipulation tasks   79.2%
RoboTwin 2.0   Clean (50 tasks)                89.8%
RoboTwin 2.0   Randomized (50 tasks)           90.7%

Training Details

Pretraining

  • Data: 5,800+ hours (1.49M episodes) from AgibotWorld-Beta, DROID, InternA1, RoboCasa MimicGen, RoboTwin 2.0
  • Hardware: 256× NVIDIA H20 GPUs
  • Batch size: 2,048 (256 GPUs × 8)
  • Learning rate: 1e-4, linear warmup 1,000 steps + cosine decay
  • Steps: 40,000
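
The pretraining learning-rate schedule can be sketched as a plain function of the step index. The peak rate, warmup length, and total steps come from the list above; the warmup starting from zero and the cosine decaying to a final rate of zero are assumptions, since the card does not state either, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step: int,
          base_lr: float = 1e-4,
          warmup_steps: int = 1_000,
          total_steps: int = 40_000,
          min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay.

    ASSUMPTIONS: warmup starts at 0 and the cosine ends at `min_lr` = 0.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, 1_000, 20_500, 40_000):
        print(step, lr_at(step))
```

Under these assumptions the rate peaks at 1e-4 at step 1,000 and is half the peak at step 20,500, the midpoint of the decay phase.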

Fine-tuning (RoboCasa / RoboTwin)

  • Hardware: 32× NVIDIA H20 GPUs
  • Batch size: 128 (32 GPUs × 4)
  • Learning rate: 1e-5, linear warmup + cosine decay
  • Steps: 20,000 (RoboCasa) / 40,000 (RoboTwin)
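
For a rough sense of scale, the batch sizes and step counts above can be turned into a count of training samples processed per run. This is a back-of-the-envelope sketch: the `Run` class is hypothetical, and the sample count assumes one optimizer step per global batch with no gradient accumulation, which the card does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    gpus: int
    per_gpu_batch: int
    steps: int

    @property
    def global_batch(self) -> int:
        # Global batch size = number of GPUs x per-GPU batch size.
        return self.gpus * self.per_gpu_batch

    @property
    def samples_seen(self) -> int:
        # ASSUMES one optimizer step per global batch (no gradient accumulation).
        return self.global_batch * self.steps


RUNS = [
    Run("pretraining", gpus=256, per_gpu_batch=8, steps=40_000),
    Run("robocasa_sft", gpus=32, per_gpu_batch=4, steps=20_000),
    Run("robotwin_sft", gpus=32, per_gpu_batch=4, steps=40_000),
]

if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.name}: batch {run.global_batch:,}, samples {run.samples_seen:,}")
```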

Inference

  • Action decoding: 10 denoising steps (via ANS)
  • Video generation: 50 denoising steps
  • Scheduler: UniPC
  • CFG scale: 1.0
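
To illustrate what different step counts for actions and video could look like, the sketch below pairs a 10-step action schedule with a 50-step video schedule. It is only an illustration under stated assumptions: the evenly spaced noise levels stand in for the UniPC scheduler, the choice to run both modalities side by side from the first step is a guess at the mechanism, and the function names are hypothetical. The actual sampler is in the code repository.

```python
def linear_noise_levels(num_steps: int) -> list[float]:
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean), inclusive.

    ASSUMPTION: evenly spaced levels, used here as a placeholder for UniPC.
    """
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def async_schedule(action_steps: int = 10, video_steps: int = 50) -> list[tuple[float, float]]:
    """Pair (action_noise, video_noise) for each joint denoising step.

    Actions reach noise level 0.0 after `action_steps` steps and stay clean,
    while the video keeps denoising for the full `video_steps`.
    """
    action = linear_noise_levels(action_steps)
    video = linear_noise_levels(video_steps)
    schedule = []
    for i in range(video_steps + 1):
        action_noise = action[i] if i <= action_steps else 0.0
        schedule.append((action_noise, video[i]))
    return schedule


if __name__ == "__main__":
    schedule = async_schedule()
    for i in (0, 5, 10, 50):
        a, v = schedule[i]
        print(f"step {i:2d}: action noise {a:.2f}, video noise {v:.2f}")
```

In this toy schedule the actions are fully denoised while the video is still mostly noise, which is the property that would let a controller execute actions before video generation finishes.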

Usage

# Please refer to the code repository for full inference and evaluation scripts:
# https://github.com/sharinka0715/X-WAM

Citation

@article{guo2026xwam,
  title={Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising},
  author={Guo, Jun and Li, Qiwei and Li, Peiyan and Chen, Zilong and Sun, Nan and Su, Yifei and Wang, Heyun and Zhang, Yuan and Li, Xinghang and Liu, Huaping},
  journal={arXiv preprint arXiv:2604.26694},
  year={2026}
}

License

This project is licensed under the Apache License 2.0.
