sharinka0715/X-WAM-checkpoints

 ---
 license: apache-2.0
+tags:
+  - robotics
+  - vla
+  - world-model
+  - diffusion
+  - manipulation
+pipeline_tag: robotics
 ---
+<div align="center">
+# X-WAM
+**Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising**
+[![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2604.26694)
+[![Project Page](https://img.shields.io/badge/🌐-Project_Page-blue)](https://sharinka0715.github.io/X-WAM/)
+[![Code](https://img.shields.io/badge/💻-Code-orange)](https://github.com/sharinka0715/X-WAM)
+[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
+</div>
+---
+## Model Description
+**X-WAM** is a unified 4D World Action Model that jointly predicts future multi-view RGB-D videos and robot actions from video priors. It features:
+- **Lightweight Depth Adaptation**: Replicates the final blocks of the pretrained DiT as an interleaved depth branch for spatial reconstruction without increasing sequence length.
+- **Asynchronous Noise Sampling (ANS)**: Rapidly decodes actions with fewer denoising steps for real-time execution, while dedicating the full sequence of steps to generate high-fidelity video.
+- **4D Unified Modeling**: Simultaneously optimizes video generation, 3D spatial reconstruction, and policy execution in a single framework.
+### Architecture
+| Component | Detail |
+| :--- | :--- |
+| Base model | Wan2.2-TI2V-5B |
+| Text encoder | UMT5-XXL |
+| VAE stride | (4, 16, 16) |
+| Depth branch layers | 10 |
+| Action dim | 14 (dual-arm relative EE pose + gripper) |
+| Proprio dim | 16 (dual-arm absolute EE pose + gripper) |
+| Prediction horizon | 8 video frames / 32 actions |
+---
+## Checkpoints
+This repository contains three checkpoints:
+| Checkpoint | Path | Description | Training Steps |
+| :--- | :--- | :--- | :--- |
+| **Pretrained** | `pretrained/` | Pretrained on 5,800+ hours of cross-embodiment data | 40,000 |
+| **RoboCasa SFT** | `robocasa_sft/` | Fine-tuned on RoboCasa (24 kitchen tasks) | 20,000 |
+| **RoboTwin SFT** | `robotwin_sft/` | Fine-tuned on RoboTwin 2.0 (50 dual-arm tasks) | 40,000 |
+Each checkpoint directory contains:
+```
+{checkpoint_name}/
+├── config.yaml          # Training config with normalization statistics
+└── checkpoints/
+    └── last.ckpt        # Model weights (~37 GB)
+```
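The layout above can be resolved programmatically before loading anything. The following is a minimal sketch, assuming the repository has already been downloaded to a local directory; `checkpoint_files` and `missing_files` are hypothetical helper names for illustration and are not part of the X-WAM codebase.

```python
from pathlib import Path

# Directory names taken from the checkpoint table above.
CHECKPOINT_NAMES = ("pretrained", "robocasa_sft", "robotwin_sft")


def checkpoint_files(root, name):
    """Return (config_path, weights_path) for one checkpoint directory.

    `root` is the local download location (an assumption of this sketch).
    """
    if name not in CHECKPOINT_NAMES:
        raise ValueError(f"unknown checkpoint {name!r}; expected one of {CHECKPOINT_NAMES}")
    base = Path(root) / name
    return base / "config.yaml", base / "checkpoints" / "last.ckpt"


def missing_files(root, name):
    """List the expected files that are not present on disk."""
    return [path for path in checkpoint_files(root, name) if not path.is_file()]
```

Calling `missing_files(root, "robotwin_sft")` before starting an evaluation run gives an early, readable failure if the ~37 GB weights file did not finish downloading.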
+---
+## Performance
+### Policy Evaluation
+| Benchmark | Setting | Avg Success Rate |
+| :--- | :--- | :--- |
+| **RoboCasa** | 24 kitchen manipulation tasks | **79.2%** |
+| **RoboTwin 2.0** | Clean (50 tasks) | **89.8%** |
+| **RoboTwin 2.0** | Randomized (50 tasks) | **90.7%** |
+---
+## Training Details
+### Pretraining
+- **Data**: 5,800+ hours (1.49M episodes) from AgibotWorld-Beta, DROID, InternA1, RoboCasa MimicGen, RoboTwin 2.0
+- **Hardware**: 256× NVIDIA H20 GPUs
+- **Batch size**: 2,048 (256 GPUs × 8)
+- **Learning rate**: 1e-4, with 1,000 steps of linear warmup followed by cosine decay
+- **Steps**: 40,000
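The pretraining learning-rate schedule listed above can be sketched as a plain function of the step index. This is an illustration only: it assumes the warmup starts from zero, the cosine phase spans the remaining 39,000 steps, and the rate decays to `min_lr = 0.0`. None of these three details is stated in this card, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=1_000, total_steps=40_000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumptions (not from the model card): warmup starts at 0 and the
    cosine phase ends at `min_lr` exactly at `total_steps`.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the cosine phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 500, 1_000, 20_500, 40_000):
        print(step, lr_at(step))
```

The same function covers the fine-tuning runs by passing `base_lr=1e-5` and the matching `total_steps`; the fine-tuning warmup length is not given in this card, so `warmup_steps` would have to come from `config.yaml`.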
+### Fine-tuning (RoboCasa / RoboTwin)
+- **Hardware**: 32× NVIDIA H20 GPUs
+- **Batch size**: 128 (32 GPUs × 4)
+- **Learning rate**: 1e-5, with linear warmup followed by cosine decay
+- **Steps**: 20,000 (RoboCasa) / 40,000 (RoboTwin)
+### Inference
+- **Action decoding**: 10 denoising steps (via ANS)
+- **Video generation**: 50 denoising steps
+- **Scheduler**: UniPC
+- **CFG scale**: 1.0
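To make the 10-step versus 50-step split concrete, the sketch below builds a toy pair of noise-level schedules. It is one possible reading of asynchronous denoising, not the authors' implementation: it assumes both modalities advance together until the actions are clean and that only the video keeps denoising afterwards. It also uses a simple linear noise grid in place of the UniPC scheduler, and `linear_levels` and `async_schedule` are hypothetical names.

```python
def linear_levels(num_steps):
    """Noise levels from 1.0 (pure noise) to 0.0 (clean), num_steps + 1 points."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def async_schedule(action_steps=10, video_steps=50):
    """Return the (action_level, video_level) pair reached after each iteration.

    Assumption of this sketch: actions and video are stepped jointly until
    the actions reach level 0.0; later iterations refine the video only.
    """
    action_levels = linear_levels(action_steps)
    video_levels = linear_levels(video_steps)
    schedule = []
    for i in range(1, video_steps + 1):
        # Once the action grid is exhausted, the action stays clean.
        action_level = action_levels[min(i, action_steps)]
        schedule.append((action_level, video_levels[i]))
    return schedule


schedule = async_schedule()
# First iteration (1-based) after which the action noise level is zero.
first_clean = next(i for i, (act, _) in enumerate(schedule, start=1) if act == 0.0)
print(f"actions clean after {first_clean} of {len(schedule)} iterations")
```

Under these assumptions the action chunk is available for execution after one fifth of the total iterations, while the video prediction still receives all 50 steps.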
+---
+## Usage
+```python
+# Please refer to the code repository for full inference and evaluation scripts:
+# https://github.com/sharinka0715/X-WAM
+```
+---
+## Citation
+```bibtex
+@article{guo2026xwam,
+  title={Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising},
+  author={Guo, Jun and Li, Qiwei and Li, Peiyan and Chen, Zilong and Sun, Nan and Su, Yifei and Wang, Heyun and Zhang, Yuan and Li, Xinghang and Liu, Huaping},
+  journal={arXiv preprint arXiv:2604.26694},
+  year={2026}
+}
+```
+## License
+This project is licensed under the [Apache License 2.0](LICENSE).