sharinka0715 committed on
Commit
bb6fd16
·
verified ·
1 Parent(s): 7dde4fb

Update README.md

Files changed (1)
  1. README.md +126 -0
README.md CHANGED
@@ -1,3 +1,129 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ tags:
4
+ - robotics
5
+ - vla
6
+ - world-model
7
+ - diffusion
8
+ - manipulation
9
+ pipeline_tag: robotics
10
  ---
11
+
12
+ <div align="center">
13
+
14
+ # X-WAM
15
+
16
+ **Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising**
17
+
18
+ [![Paper](https://img.shields.io/badge/📄-Paper-red)](https://arxiv.org/abs/2604.26694)
19
+ [![Project Page](https://img.shields.io/badge/🌐-Project_Page-blue)](https://sharinka0715.github.io/X-WAM/)
20
+ [![Code](https://img.shields.io/badge/💻-Code-orange)](https://github.com/sharinka0715/X-WAM)
21
+ [![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](LICENSE)
22
+
23
+ </div>
24
+
25
+ ---
26
+
27
+ ## Model Description
28
+
29
+ **X-WAM** is a unified 4D World Action Model that jointly predicts future multi-view RGB-D videos and robot actions from video priors. It features:
30
+
31
+ - **Lightweight Depth Adaptation**: Replicates the final blocks of the pretrained DiT as an interleaved depth branch for spatial reconstruction without increasing sequence length.
32
+ - **Asynchronous Noise Sampling (ANS)**: Rapidly decodes actions with fewer denoising steps for real-time execution, while dedicating the full sequence of steps to generate high-fidelity video.
33
+ - **4D Unified Modeling**: Simultaneously optimizes video generation, 3D spatial reconstruction, and policy execution in a single framework.
34
+
35
+ ### Architecture
36
+
37
+ | Component | Detail |
38
+ | :--- | :--- |
39
+ | Base model | Wan2.2-TI2V-5B |
40
+ | Text encoder | UMT5-XXL |
41
+ | VAE stride | (4, 16, 16) |
42
+ | Depth branch layers | 10 |
43
+ | Action dim | 14 (dual-arm relative EE pose + gripper) |
44
+ | Proprio dim | 16 (dual-arm absolute EE pose + gripper) |
45
+ | Prediction horizon | 8 frames video / 32 actions |
46
+
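As a rough illustration of how the numbers in the table relate, the sketch below derives the action chunk shape and the number of actions per predicted video frame. It also computes a latent grid size from the VAE stride, reading the stride as (time, height, width). That reading, the helper name `latent_grid`, and the 256x256 example resolution are assumptions for illustration, not values taken from this repository.

```python
# Values copied from the architecture table.
ACTION_DIM = 14        # dual-arm relative EE pose + gripper
PROPRIO_DIM = 16       # dual-arm absolute EE pose + gripper
VIDEO_FRAMES = 8       # predicted video frames per chunk
ACTION_HORIZON = 32    # predicted actions per chunk
VAE_STRIDE = (4, 16, 16)  # assumed order: (time, height, width)

# One predicted chunk of actions: 32 steps, 14 values each.
action_chunk_shape = (ACTION_HORIZON, ACTION_DIM)

# Number of actions that accompany each predicted video frame.
actions_per_frame = ACTION_HORIZON // VIDEO_FRAMES


def latent_grid(height, width, stride=VAE_STRIDE):
    """Spatial latent grid for an image, under the assumed stride order."""
    _, stride_h, stride_w = stride
    return height // stride_h, width // stride_w


print(action_chunk_shape)      # (32, 14)
print(actions_per_frame)       # 4
print(latent_grid(256, 256))   # (16, 16); 256x256 is a hypothetical resolution
```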
47
+ ---
48
+
49
+ ## Checkpoints
50
+
51
+ This repository contains three checkpoints:
52
+
53
+ | Checkpoint | Path | Description | Training Steps |
54
+ | :--- | :--- | :--- | :--- |
55
+ | **Pretrained** | `pretrained/` | Pretrained on 5,800+ hours of cross-embodiment data | 40,000 |
56
+ | **RoboCasa SFT** | `robocasa_sft/` | Fine-tuned on RoboCasa (24 kitchen tasks) | 20,000 |
57
+ | **RoboTwin SFT** | `robotwin_sft/` | Fine-tuned on RoboTwin 2.0 (50 dual-arm tasks) | 40,000 |
58
+
59
+ Each checkpoint directory contains:
60
+ ```
61
+ {checkpoint_name}/
62
+ ├── config.yaml # Training config with normalization statistics
63
+ └── checkpoints/
64
+     └── last.ckpt # Model weights (~37GB)
65
+ ```
66
+
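The layout above can be turned into repository-relative paths with a small helper. This is a minimal sketch: the function name `checkpoint_files` is hypothetical, and it assumes `last.ckpt` sits inside the `checkpoints/` subdirectory as the tree suggests.

```python
from pathlib import PurePosixPath

# Checkpoint directories listed in the table above.
CHECKPOINT_DIRS = ("pretrained", "robocasa_sft", "robotwin_sft")


def checkpoint_files(name):
    """Return repository-relative paths for one checkpoint directory."""
    if name not in CHECKPOINT_DIRS:
        raise ValueError(f"unknown checkpoint: {name!r}")
    root = PurePosixPath(name)
    return {
        "config": str(root / "config.yaml"),
        "weights": str(root / "checkpoints" / "last.ckpt"),
    }


print(checkpoint_files("robocasa_sft"))
```

Each returned path could be passed as the `filename` argument of `huggingface_hub.hf_hub_download` together with this repository's ID. Note that each weights file is roughly 37GB.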
67
+ ---
68
+
69
+ ## Performance
70
+
71
+ ### Policy Evaluation
72
+
73
+ | Benchmark | Setting | Avg Success Rate |
74
+ | :--- | :--- | :--- |
75
+ | **RoboCasa** | 24 kitchen manipulation tasks | **79.2%** |
76
+ | **RoboTwin 2.0** | Clean (50 tasks) | **89.8%** |
77
+ | **RoboTwin 2.0** | Randomized (50 tasks) | **90.7%** |
78
+
79
+ ---
80
+
81
+ ## Training Details
82
+
83
+ ### Pretraining
84
+
85
+ - **Data**: 5,800+ hours (1.49M episodes) from AgibotWorld-Beta, DROID, InternA1, RoboCasa MimicGen, RoboTwin 2.0
86
+ - **Hardware**: 256× NVIDIA H20 GPUs
87
+ - **Batch size**: 2,048 (256 GPUs × 8)
88
+ - **Learning rate**: 1e-4, linear warmup 1,000 steps + cosine decay
89
+ - **Steps**: 40,000
90
+
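The pretraining schedule described above (linear warmup for 1,000 steps, then cosine decay over the remaining steps) can be sketched in plain Python. The sketch assumes the warmup starts from zero and the cosine decays to zero; neither endpoint is stated in this README, and `lr_at` is a hypothetical helper, not a function from the X-WAM code.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 40_000
MIN_LR = 0.0  # assumption: the final learning rate is not given in the README


def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Global batch size: 256 GPUs with 8 samples each.
global_batch = 256 * 8

for step in (0, 500, 1_000, 20_500, 40_000):
    print(step, lr_at(step))
```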
91
+ ### Fine-tuning (RoboCasa / RoboTwin)
92
+
93
+ - **Hardware**: 32× NVIDIA H20 GPUs
94
+ - **Batch size**: 128 (32 GPUs × 4)
95
+ - **Learning rate**: 1e-5, linear warmup + cosine decay
96
+ - **Steps**: 20,000 (RoboCasa) / 40,000 (RoboTwin)
97
+
98
+ ### Inference
99
+
100
+ - **Action decoding**: 10 steps (ANS asynchronous)
101
+ - **Video generation**: 50 steps
102
+ - **Scheduler**: UniPC
103
+ - **CFG scale**: 1.0
104
+
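To make the two step counts concrete, the sketch below builds one evenly spaced noise-level schedule per modality. It only illustrates that actions are decoded in 10 steps while video uses 50. It does not reproduce the X-WAM sampler or the UniPC scheduler, and the helper `noise_levels` and the linear spacing are assumptions.

```python
ACTION_STEPS = 10   # from the inference settings above
VIDEO_STEPS = 50    # from the inference settings above


def noise_levels(num_steps):
    """Evenly spaced noise levels from 1.0 (pure noise) down to 0.0 (clean)."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


action_schedule = noise_levels(ACTION_STEPS)
video_schedule = noise_levels(VIDEO_STEPS)

# Actions reach the clean level after one fifth as many steps as video.
step_ratio = VIDEO_STEPS // ACTION_STEPS

print(len(action_schedule) - 1, len(video_schedule) - 1, step_ratio)
```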
105
+ ---
106
+
107
+ ## Usage
108
+
109
+ ```python
110
+ # Please refer to the code repository for full inference and evaluation scripts:
111
+ # https://github.com/sharinka0715/X-WAM
112
+ ```
113
+
114
+ ---
115
+
116
+ ## Citation
117
+
118
+ ```bibtex
119
+ @article{guo2026xwam,
120
+ title={Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising},
121
+ author={Guo, Jun and Li, Qiwei and Li, Peiyan and Chen, Zilong and Sun, Nan and Su, Yifei and Wang, Heyun and Zhang, Yuan and Li, Xinghang and Liu, Huaping},
122
+ journal={arXiv preprint arXiv:2604.26694},
123
+ year={2026}
124
+ }
125
+ ```
126
+
127
+ ## License
128
+
129
+ This project is licensed under the [Apache License 2.0](LICENSE).