Improve model card: Add detailed usage instructions and descriptive intro

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +502 -20
README.md CHANGED
@@ -1,13 +1,13 @@
1
  ---
2
- license: apache-2.0
3
  base_model: Wan-AI/Wan2.1-T2V-14B
 
 
4
  tags:
5
  - text-to-video
6
  - diffusion
7
  - video-generation
8
  - turbodiffusion
9
  - wan2.1
10
- pipeline_tag: text-to-video
11
  ---
12
 
13
  <p align="center">
@@ -16,13 +16,508 @@ pipeline_tag: text-to-video
16
 
17
  # TurboWan2.1-T2V-14B-720P
18
 
19
- - This HuggingFace repo contains the `TurboWan2.1-T2V-14B-720P` model.
20
-
21
- - For RTX 5090 or similar GPUs, please use the `TurboWan2.1-T2V-14B-720P-quant`. For other GPUs with a bigger GPU memory than 40GB, we recommend using `TurboWan2.1-T2V-14B-720P`.
22
 
23
- - For usage instructions, please see **https://github.com/thu-ml/TurboDiffusion**
24
 
25
  - Paper: [TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times](https://arxiv.org/pdf/2512.16093)
26
 
27
 
28
  ## Citation
@@ -68,17 +563,4 @@ pipeline_tag: text-to-video
68
  booktitle={International Conference on Machine Learning (ICML)},
69
  year={2025}
70
  }
71
-
72
- @article{zhang2025sageattention2++,
73
- title={Sageattention2++: A more efficient implementation of sageattention2},
74
- author={Zhang, Jintao and Xu, Xiaoming and Wei, Jia and Huang, Haofeng and Zhang, Pengle and Xiang, Chendong and Zhu, Jun and Chen, Jianfei},
75
- journal={arXiv preprint arXiv:2505.21136},
76
- year={2025}
77
- }
78
- @article{zhang2025sageattention3,
79
- title={SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training},
80
- author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
81
- journal={arXiv preprint arXiv:2505.11594},
82
- year={2025}
83
- }
84
- ```
 
1
  ---
 
2
  base_model: Wan-AI/Wan2.1-T2V-14B
3
+ license: apache-2.0
4
+ pipeline_tag: text-to-video
5
  tags:
6
  - text-to-video
7
  - diffusion
8
  - video-generation
9
  - turbodiffusion
10
  - wan2.1
 
11
  ---
12
 
13
  <p align="center">
 
16
 
17
  # TurboWan2.1-T2V-14B-720P
18
 
19
+ This repository contains the `TurboWan2.1-T2V-14B-720P` model, which is part of the **TurboDiffusion** framework. TurboDiffusion is designed to accelerate end-to-end video diffusion generation by 100-200 times while maintaining high video quality, leveraging innovations in attention acceleration, step distillation, and W8A8 quantization. This particular model is based on `Wan-AI/Wan2.1-T2V-14B` and is optimized for 720p video generation.
 
 
20
 
21
+ - For RTX 5090 or similar GPUs, please use the `TurboWan2.1-T2V-14B-720P-quant` checkpoint. For GPUs with more than 40GB of memory, we recommend using `TurboWan2.1-T2V-14B-720P`.
22
 
23
  - Paper: [TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times](https://arxiv.org/pdf/2512.16093)
24
+ - GitHub Repository: [https://github.com/thu-ml/TurboDiffusion](https://github.com/thu-ml/TurboDiffusion)
25
+
26
+ ## Quick Start: Inference
27
+
28
+ For GPUs with more than 40GB of memory (e.g., H100), **we recommend using the unquantized checkpoint (without `-quant`) and removing `--quant_linear` from the command.**
29
+
30
+ 1. Download the Wan2.1 VAE (**applicable to both Wan2.1 and Wan2.2**) and umT5 text encoder checkpoints from the official [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) repository on Hugging Face:
31
+
32
+ ```bash
33
+ mkdir checkpoints
34
+ cd checkpoints
35
+ wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/Wan2.1_VAE.pth
36
+ wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/models_t5_umt5-xxl-enc-bf16.pth
37
+ ```
38
+
39
+ 2. Download our finetuned checkpoints:
40
+ ```bash
41
+ wget https://huggingface.co/TurboDiffusion/TurboWan2.1-T2V-1.3B-480P/resolve/main/TurboWan2.1-T2V-1.3B-480P.pth
42
+ ```
43
+
44
+ For RTX 5090, RTX 4090, or similar GPUs, please use the quantized checkpoint:
45
+
46
+ ```bash
47
+ wget https://huggingface.co/TurboDiffusion/TurboWan2.1-T2V-1.3B-480P/resolve/main/TurboWan2.1-T2V-1.3B-480P-quant.pth
48
+ ```
49
+
50
+
51
+ For the Wan2.2-I2V model, download both the high-noise and low-noise checkpoints:
52
+ ```bash
53
+ wget https://huggingface.co/TurboDiffusion/TurboWan2.2-I2V-A14B-720P/resolve/main/TurboWan2.2-I2V-A14B-high-720P.pth
54
+ wget https://huggingface.co/TurboDiffusion/TurboWan2.2-I2V-A14B-720P/resolve/main/TurboWan2.2-I2V-A14B-low-720P.pth
55
+ ```
56
+
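The `wget` commands above all follow the same `resolve/main` URL pattern. As a minimal sketch, the helper below rebuilds those URLs from a repository ID and a filename so the file list can be kept in one place. The names `resolve_url`, `checkpoint_urls`, and `CHECKPOINTS` are illustrative and not part of TurboDiffusion; the script only prints URLs and does not download anything.

```python
from urllib.parse import quote

HF_BASE = "https://huggingface.co"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for one file in a Hugging Face repo."""
    return f"{HF_BASE}/{repo_id}/resolve/{revision}/{quote(filename)}"


# Files named in steps 1 and 2 above, grouped by repository.
CHECKPOINTS = {
    "Wan-AI/Wan2.1-T2V-1.3B": [
        "Wan2.1_VAE.pth",
        "models_t5_umt5-xxl-enc-bf16.pth",
    ],
    "TurboDiffusion/TurboWan2.1-T2V-1.3B-480P": [
        "TurboWan2.1-T2V-1.3B-480P.pth",
    ],
}


def checkpoint_urls(checkpoints=None):
    """Return every download URL for the given {repo_id: [filenames]} mapping."""
    checkpoints = CHECKPOINTS if checkpoints is None else checkpoints
    return [
        resolve_url(repo_id, name)
        for repo_id, names in checkpoints.items()
        for name in names
    ]


if __name__ == "__main__":
    for url in checkpoint_urls():
        print(url)
```

The printed URLs can be passed to `wget` or any other downloader; swap in the `-quant` filename when targeting an RTX 5090 or RTX 4090.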
57
+ 3. Use the inference script for the **T2V** model:
58
+ ```bash
59
+ export PYTHONPATH=turbodiffusion
60
+
61
+ # Arguments:
62
+ # --dit_path Path to the finetuned TurboDiffusion checkpoint
63
+ # --model Model to use: Wan2.1-1.3B or Wan2.1-14B (default: Wan2.1-1.3B)
64
+ # --num_samples Number of videos to generate (default: 1)
65
+ # --num_steps Sampling steps, 1–4 (default: 4)
66
+ # --sigma_max Initial sigma for rCM (default: 80); larger choices (e.g., 1600) reduce diversity but may enhance quality
67
+ # --vae_path Path to Wan2.1 VAE (default: checkpoints/Wan2.1_VAE.pth)
68
+ # --text_encoder_path Path to umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
69
+ # --num_frames Number of frames to generate (default: 81)
70
+ # --prompt Text prompt for video generation
71
+ # --resolution Output resolution: "480p" or "720p" (default: 480p)
72
+ # --aspect_ratio Aspect ratio in W:H format (default: 16:9)
73
+ # --seed Random seed for reproducibility (default: 0)
74
+ # --save_path Output file path including extension (default: output/generated_video.mp4)
75
+ # --attention_type Attention module to use: original, sla or sagesla (default: sagesla)
76
+ # --sla_topk Top-k ratio for SLA/SageSLA attention (default: 0.1); we recommend 0.15 for better video quality
77
+ # --quant_linear Enable quantization for linear layers; pass this when using a quantized checkpoint
78
+ # --default_norm Use the original LayerNorm and RMSNorm of Wan models
79
+
80
+ python turbodiffusion/inference/wan2.1_t2v_infer.py \
81
+ --model Wan2.1-1.3B \
82
+ --dit_path checkpoints/TurboWan2.1-T2V-1.3B-480P-quant.pth \
83
+ --resolution 480p \
84
+ --prompt "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about." \
85
+ --num_samples 1 \
86
+ --num_steps 4 \
87
+ --quant_linear \
88
+ --attention_type sagesla \
89
+ --sla_topk 0.1
90
+ ```
91
+
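The example above targets the 1.3B quantized checkpoint. As a hedged sketch of how the same flags could be assembled for other checkpoints, the helper below builds the argument list and adds `--quant_linear` only when the checkpoint filename ends in `-quant.pth`. The function name `build_t2v_command`, the suffix-based detection, and the 14B checkpoint path in the usage example are assumptions made for illustration; only the flags themselves come from the argument list above.

```python
import shlex


def build_t2v_command(dit_path, prompt, model="Wan2.1-1.3B", resolution="480p",
                      num_steps=4, attention_type="sagesla", sla_topk=0.1):
    """Assemble the T2V inference command as a list of arguments."""
    if not 1 <= num_steps <= 4:
        raise ValueError("num_steps must be between 1 and 4")
    cmd = [
        "python", "turbodiffusion/inference/wan2.1_t2v_infer.py",
        "--model", model,
        "--dit_path", dit_path,
        "--resolution", resolution,
        "--prompt", prompt,
        "--num_samples", "1",
        "--num_steps", str(num_steps),
        "--attention_type", attention_type,
        "--sla_topk", str(sla_topk),
    ]
    # Assumption: quantized checkpoints are recognizable by their "-quant.pth" suffix.
    if dit_path.endswith("-quant.pth"):
        cmd.append("--quant_linear")
    return cmd


if __name__ == "__main__":
    # Hypothetical local path for the unquantized 14B 720p checkpoint.
    cmd = build_t2v_command(
        "checkpoints/TurboWan2.1-T2V-14B-720P.pth",
        prompt="A stylish woman walks down a Tokyo street.",
        model="Wan2.1-14B",
        resolution="720p",
        sla_topk=0.15,
    )
    print(shlex.join(cmd))
```

Run the printed command from the repository root with `PYTHONPATH=turbodiffusion` exported, as in the shell example.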
92
+ Or the script for the **I2V** model:
93
+ ```bash
94
+ export PYTHONPATH=turbodiffusion
95
+
96
+ # --image_path Path to the input image
97
+ # --high_noise_model_path Path to the high noise TurboDiffusion checkpoint
98
+ # --low_noise_model_path Path to the low noise TurboDiffusion checkpoint
99
+ # --boundary Timestep boundary for switching from high to low noise model (default: 0.9)
100
+ # --model Model to use: Wan2.2-A14B (default: Wan2.2-A14B)
101
+ # --num_samples Number of videos to generate (default: 1)
102
+ # --num_steps Sampling steps, 1–4 (default: 4)
103
+ # --sigma_max Initial sigma for rCM (default: 200); larger choices (e.g., 1600) reduce diversity but may enhance quality
104
+ # --vae_path Path to Wan2.2 VAE (default: checkpoints/Wan2.2_VAE.pth)
105
+ # --text_encoder_path Path to umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
106
+ # --num_frames Number of frames to generate (default: 81)
107
+ # --prompt Text prompt for video generation
108
+ # --resolution Output resolution: "480p" or "720p" (default: 720p)
109
+ # --aspect_ratio Aspect ratio in W:H format (default: 16:9)
110
+ # --adaptive_resolution Enable adaptive resolution based on input image size
111
+ # --ode Use ODE for sampling (sharper but less robust than SDE)
112
+ # --seed Random seed for reproducibility (default: 0)
113
+ # --save_path Output file path including extension (default: output/generated_video.mp4)
114
+ # --attention_type Attention module to use: original, sla or sagesla (default: sagesla)
115
+ # --sla_topk Top-k ratio for SLA/SageSLA attention (default: 0.1); we recommend 0.15 for better video quality
116
+ # --quant_linear Enable quantization for linear layers; pass this when using a quantized checkpoint
117
+ # --default_norm Use the original LayerNorm and RMSNorm of Wan models
118
+
119
+ python turbodiffusion/inference/wan2.2_i2v_infer.py \
120
+ --model Wan2.2-A14B \
121
+ --low_noise_model_path checkpoints/TurboWan2.2-I2V-A14B-low-720P-quant.pth \
122
+ --high_noise_model_path checkpoints/TurboWan2.2-I2V-A14B-high-720P-quant.pth \
123
+ --resolution 720p \
124
+ --adaptive_resolution \
125
+ --image_path assets/i2v_inputs/i2v_input_0.jpg \
126
+ --prompt "POV selfie video, ultra-messy and extremely fast. A white cat in sunglasses stands on a surfboard with a neutral look when the board suddenly whips sideways, throwing cat and camera into the water; the frame dives sharply downward, swallowed by violent bursts of bubbles, spinning turbulence, and smeared water streaks as the camera sinks. Shadows thicken, pressure ripples distort the edges, and loose bubbles rush upward past the lens, showing the camera is still sinking. Then the cat kicks upward with explosive speed, dragging the view through churning bubbles and rapidly brightening water as sunlight floods back in; the camera races upward, water streaming off the lens, and finally breaks the surface in a sudden blast of light and spray, snapping back into a crooked, frantic selfie as the cat resurfaces." \
127
+ --num_samples 1 \
128
+ --num_steps 4 \
129
+ --quant_linear \
130
+ --attention_type sagesla \
131
+ --sla_topk 0.1 \
132
+ --ode
133
+ ```
134
+
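The `--boundary` flag is described above as the timestep at which sampling switches from the high-noise to the low-noise checkpoint. The sketch below illustrates one plausible reading of that rule. It assumes timesteps normalized to [0, 1] with 1.0 meaning pure noise, and it assumes the high-noise model handles every step at or above the boundary. The function names and the example timesteps are hypothetical, and the actual inference script may differ.

```python
def pick_expert(t: float, boundary: float = 0.9) -> str:
    """Return which checkpoint handles normalized timestep t (assumed rule)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # Assumption: steps at or above the boundary use the high-noise model.
    return "high_noise" if t >= boundary else "low_noise"


def schedule_experts(timesteps, boundary: float = 0.9):
    """Pair each timestep with the checkpoint that would process it."""
    return [(t, pick_expert(t, boundary)) for t in timesteps]


if __name__ == "__main__":
    # Hypothetical 4-step schedule, from noisiest to cleanest.
    for t, expert in schedule_experts([1.0, 0.94, 0.83, 0.62]):
        print(f"t={t:.2f} -> {expert}")
```

Under these assumptions, lowering the boundary hands more of the steps to the high-noise checkpoint.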
135
+ ## Evaluation
136
+
137
+ We evaluate video generation on **a single RTX 5090 GPU**. The E2E Time refers to the end-to-end diffusion generation latency, excluding text encoding and VAE decoding.
138
+
139
+ ### Wan-2.2-I2V-A14B-720P
140
+
141
+ <table>
142
+ <tr>
143
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
144
+ <div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
145
+ <div><img src="assets/videos/i2v/original/A14B_720p/gif/0.gif" width="360"/></div>
146
+ </td>
147
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
148
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
149
+ <div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/0.gif" width="360"/></div>
150
+ </td>
151
+ </tr>
152
+ <tr>
153
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
154
+ <div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
155
+ <div><img src="assets/videos/i2v/original/A14B_720p/gif/1.gif" width="360"/></div>
156
+ </td>
157
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
158
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
159
+ <div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/1.gif" width="360"/></div>
160
+ </td>
161
+ </tr>
162
+ <tr>
163
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
164
+ <div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
165
+ <div><img src="assets/videos/i2v/original/A14B_720p/gif/2.gif" width="360"/></div>
166
+ </td>
167
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
168
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
169
+ <div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/2.gif" width="360"/></div>
170
+ </td>
171
+ </tr>
172
+ <tr>
173
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
174
+ <div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
175
+ <div><img src="assets/videos/i2v/original/A14B_720p/gif/3.gif" width="360"/></div>
176
+ </td>
177
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
178
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
179
+ <div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/3.gif" width="360"/></div>
180
+ </td>
181
+ </tr>
182
+ <tr>
183
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
184
+ <div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
185
+ <div><img src="assets/videos/i2v/original/A14B_720p/gif/4.gif" width="360"/></div>
186
+ </td>
187
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
188
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
189
+ <div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/4.gif" width="360"/></div>
190
+ </td>
191
+ </tr>
192
+ <tr>
193
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
194
+ <div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
195
+ <div><img src="assets/videos/i2v/original/A14B_720p/gif/5.gif" width="360"/></div>
196
+ </td>
197
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
198
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
199
+ <div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/5.gif" width="360"/></div>
200
+ </td>
201
+ </tr>
202
+ <tr>
203
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
204
+ <div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
205
+ <div><img src="assets/videos/i2v/original/A14B_720p/gif/6.gif" width="360"/></div>
206
+ </td>
207
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
208
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
209
+ <div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/6.gif" width="360"/></div>
210
+ </td>
211
+ </tr>
212
+ </table>
213
+
214
+
215
+ ### Wan-2.1-T2V-1.3B-480P
216
+
217
+ <table>
218
+ <tr>
219
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
220
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
221
+ <div><img src="assets/videos/original/1.3B/5.gif" width="249"/></div>
222
+ </td>
223
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
224
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
225
+ <div><img src="assets/videos/fastvideo/video_1.3B/5.gif" width="249"/></div>
226
+ </td>
227
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
228
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
229
+ <div><img src="assets/videos/turbodiffusion/1.3B/5.gif" width="249"/></div>
230
+ </td>
231
+ </tr>
232
+ <tr>
233
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
234
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
235
+ <div><img src="assets/videos/original/1.3B/0.gif" width="249"/></div>
236
+ </td>
237
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
238
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
239
+ <div><img src="assets/videos/fastvideo/video_1.3B/0.gif" width="249"/></div>
240
+ </td>
241
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
242
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
243
+ <div><img src="assets/videos/turbodiffusion/1.3B/0.gif" width="249"/></div>
244
+ </td>
245
+ </tr>
246
+ <tr>
247
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
248
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
249
+ <div><img src="assets/videos/original/1.3B/1.gif" width="249"/></div>
250
+ </td>
251
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
252
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
253
+ <div><img src="assets/videos/fastvideo/video_1.3B/1.gif" width="249"/></div>
254
+ </td>
255
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
256
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
257
+ <div><img src="assets/videos/turbodiffusion/1.3B/1.gif" width="249"/></div>
258
+ </td>
259
+ </tr>
260
+ <tr>
261
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
262
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
263
+ <div><img src="assets/videos/original/1.3B/2.gif" width="249"/></div>
264
+ </td>
265
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
266
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
267
+ <div><img src="assets/videos/fastvideo/video_1.3B/2.gif" width="249"/></div>
268
+ </td>
269
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
270
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
271
+ <div><img src="assets/videos/turbodiffusion/1.3B/2.gif" width="249"/></div>
272
+ </td>
273
+ </tr>
274
+ <tr>
275
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
276
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
277
+ <div><img src="assets/videos/original/1.3B/7.gif" width="249"/></div>
278
+ </td>
279
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
280
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
281
+ <div><img src="assets/videos/fastvideo/video_1.3B/7.gif" width="249"/></div>
282
+ </td>
283
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
284
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
285
+ <div><img src="assets/videos/turbodiffusion/1.3B/7.gif" width="249"/></div>
286
+ </td>
287
+ </tr>
288
+ <tr>
289
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
290
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
291
+ <div><img src="assets/videos/original/1.3B/11.gif" width="249"/></div>
292
+ </td>
293
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
294
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
295
+ <div><img src="assets/videos/fastvideo/video_1.3B/11.gif" width="249"/></div>
296
+ </td>
297
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
298
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
299
+ <div><img src="assets/videos/turbodiffusion/1.3B/11.gif" width="249"/></div>
300
+ </td>
301
+ </tr>
302
+ <tr>
303
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
304
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
305
+ <div><img src="assets/videos/original/1.3B/13.gif" width="249"/></div>
306
+ </td>
307
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
308
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
309
+ <div><img src="assets/videos/fastvideo/video_1.3B/13.gif" width="249"/></div>
310
+ </td>
311
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
312
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
313
+ <div><img src="assets/videos/turbodiffusion/1.3B/13.gif" width="249"/></div>
314
+ </td>
315
+ </tr>
316
+ <tr>
317
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
318
+ <div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
319
+ <div><img src="assets/videos/original/1.3B/14.gif" width="249"/></div>
320
+ </td>
321
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
322
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
323
+ <div><img src="assets/videos/fastvideo/video_1.3B/14.gif" width="249"/></div>
324
+ </td>
325
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
326
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
327
+ <div><img src="assets/videos/turbodiffusion/1.3B/14.gif" width="249"/></div>
328
+ </td>
329
+ </tr>
330
+ </table>
331
+
332
+
333
+ ### Wan-2.1-T2V-14B-720P
334
+
335
+ <table>
336
+ <tr>
337
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
338
+ <div style="font-size: 1.1em;">Original, E2E Time: 4767s</div>
339
+ <div><img src="assets/videos/original/14B_720p/0.gif" width="249"/></div>
340
+ </td>
341
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
342
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 72.6s</div>
343
+ <div><img src="assets/videos/fastvideo/video_14B_720p/0.gif" width="249"/></div>
344
+ </td>
345
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
346
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>24s</b></div>
347
+ <div><img src="assets/videos/turbodiffusion/14B_720p/0.gif" width="249"/></div>
348
+ </td>
349
+ </tr>
350
+ <tr>
351
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
352
+ <div style="font-size: 1.1em;">Original, E2E Time: 4767s</div>
353
+ <div><img src="assets/videos/original/14B_720p/3.gif" width="249"/></div>
354
+ </td>
355
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
356
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 72.6s</div>
357
+ <div><img src="assets/videos/fastvideo/video_14B_720p/3.gif" width="249"/></div>
358
+ </td>
359
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
360
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>24s</b></div>
361
+ <div><img src="assets/videos/turbodiffusion/14B_720p/3.gif" width="249"/></div>
362
+ </td>
363
+ </tr>
364
+ <tr>
365
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
366
+ <div style="font-size: 1.1em;">Original, E2E Time: 4767s</div>
367
+ <div><img src="assets/videos/original/14B_720p/6.gif" width="249"/></div>
368
+ </td>
369
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
370
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 72.6s</div>
371
+ <div><img src="assets/videos/fastvideo/video_14B_720p/6.gif" width="249"/></div>
372
+ </td>
373
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
374
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>24s</b></div>
375
+ <div><img src="assets/videos/turbodiffusion/14B_720p/6.gif" width="249"/></div>
376
+ </td>
377
+ </tr>
378
+ </table>
379
+
380
+
381
+ ### Wan-2.1-T2V-14B-480P
382
+
383
+ <table>
384
+ <tr>
385
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
386
+ <div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
387
+ <div><img src="assets/videos/original/14B_480p/0.gif" width="249"/></div>
388
+ </td>
389
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
390
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
391
+ <div><img src="assets/videos/fastvideo/video_14B_480p/0.gif" width="249"/></div>
392
+ </td>
393
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
394
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
395
+ <div><img src="assets/videos/turbodiffusion/14B_480p/0.gif" width="249"/></div>
396
+ </td>
397
+ </tr>
398
+ <tr>
399
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
400
+ <div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
401
+ <div><img src="assets/videos/original/14B_480p/1.gif" width="249"/></div>
402
+ </td>
403
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
404
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
405
+ <div><img src="assets/videos/fastvideo/video_14B_480p/1.gif" width="249"/></div>
406
+ </td>
407
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
408
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
409
+ <div><img src="assets/videos/turbodiffusion/14B_480p/1.gif" width="249"/></div>
410
+ </td>
411
+ </tr>
412
+ <tr>
413
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
414
+ <div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
415
+ <div><img src="assets/videos/original/14B_480p/4.gif" width="249"/></div>
416
+ </td>
417
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
418
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
419
+ <div><img src="assets/videos/fastvideo/video_14B_480p/4.gif" width="249"/></div>
420
+ </td>
421
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
422
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
423
+ <div><img src="assets/videos/turbodiffusion/14B_480p/4.gif" width="249"/></div>
424
+ </td>
425
+ </tr>
426
+ <tr>
427
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
428
+ <div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
429
+ <div><img src="assets/videos/original/14B_480p/5.gif" width="249"/></div>
430
+ </td>
431
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
432
+ <div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
433
+ <div><img src="assets/videos/fastvideo/video_14B_480p/5.gif" width="249"/></div>
434
+ </td>
435
+ <td align="center" style="border: 2px solid #000; padding: 10px;">
436
+ <div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
437
+ <div><img src="assets/videos/turbodiffusion/14B_480p/5.gif" width="249"/></div>
438
+ </td>
439
+ </tr>
440
+ </table>
441
+
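The E2E times reported in the tables above can be turned into speedup factors with a few lines of arithmetic. The sketch below copies the times from the tables (in seconds, on a single RTX 5090, excluding text encoding and VAE decoding) and divides each baseline by the TurboDiffusion time. The names `E2E_SECONDS` and `speedups` are illustrative only.

```python
# End-to-end diffusion latency in seconds, as reported in the tables above.
E2E_SECONDS = {
    "Wan-2.2-I2V-A14B-720P": {"Original": 4549, "TurboDiffusion": 38},
    "Wan-2.1-T2V-1.3B-480P": {"Original": 184, "FastVideo": 5.3, "TurboDiffusion": 1.9},
    "Wan-2.1-T2V-14B-720P": {"Original": 4767, "FastVideo": 72.6, "TurboDiffusion": 24},
    "Wan-2.1-T2V-14B-480P": {"Original": 1676, "FastVideo": 26.3, "TurboDiffusion": 9.9},
}


def speedups(times, target="TurboDiffusion"):
    """Return how many times faster `target` is than each other method."""
    return {name: t / times[target] for name, t in times.items() if name != target}


if __name__ == "__main__":
    for model, times in E2E_SECONDS.items():
        factors = ", ".join(f"{k}: {v:.1f}x" for k, v in speedups(times).items())
        print(f"{model}: {factors}")
```

On these numbers the speedup over the original models ranges from roughly 97x (1.3B at 480p) to roughly 199x (14B at 720p).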
442
+ ## Training
443
+
444
+ In this repo, we provide training code based on Wan2.1 and on data synthesized by Wan2.1. The training builds on the rCM codebase (https://github.com/NVlabs/rcm), with infrastructure support including FSDP2, Ulysses CP, and selective activation checkpointing (SAC). For rCM training instructions, please refer to the original rCM repository; SLA training guidance is provided here.
445
+
446
+ #### Checkpoints Downloading
447
+ Download the Wan2.1 pretrained checkpoints in `.pth` format and VAE/text encoder to `assets/checkpoints`:
448
+
449
+ ```bash
450
+ # make sure git lfs is installed
451
+ git clone https://huggingface.co/worstcoder/Wan assets/checkpoints
452
+ ```
453
+
454
+ FSDP2 relies on [Distributed Checkpoint (DCP)](https://docs.pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html) for loading and saving checkpoints. Before training, convert the `.pth` teacher checkpoints to `.dcp`:
455
+
456
+ ```bash
457
+ python -m torch.distributed.checkpoint.format_utils torch_to_dcp assets/checkpoints/Wan2.1-T2V-1.3B.pth assets/checkpoints/Wan2.1-T2V-1.3B.dcp
458
+ ```
459
+
460
+ After training, the saved `.dcp` checkpoints can be converted to `.pth` using the script `scripts/dcp_to_pth.py`.
461
+
462
+ #### Dataset Downloading
463
+
464
+ We provide Wan2.1-14B-synthesized datasets. Download to `assets/datasets` using:
465
+
466
+ ```bash
467
+ # make sure git lfs is installed
468
+ git clone https://huggingface.co/datasets/worstcoder/Wan_datasets assets/datasets
469
+ ```
470
+
471
+ #### Start Training
472
+ We implement white-box SLA training by aligning the predictions of the SLA-enabled model with those of the full-attention pretrained model. Unlike black-box training in the original paper, which tunes the pretrained model using diffusion loss, white-box training mitigates distribution shift and is less sensitive to the training data.
473
+
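The white-box objective described above aligns the SLA-enabled model's predictions with those of the full-attention pretrained model. As a minimal sketch of that idea, the function below computes a mean squared error between two prediction vectors. The choice of MSE and the name `alignment_loss` are assumptions for illustration; the paragraph above does not specify the distance used, and real training would operate on tensors rather than Python lists.

```python
def alignment_loss(student_pred, teacher_pred):
    """Mean squared error between SLA (student) and full-attention (teacher) outputs."""
    if len(student_pred) != len(teacher_pred):
        raise ValueError("predictions must have the same length")
    if not student_pred:
        raise ValueError("predictions must be non-empty")
    squared = [(s - t) ** 2 for s, t in zip(student_pred, teacher_pred)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Toy values standing in for the two models' predictions on the same noisy input.
    teacher = [0.10, -0.40, 0.25]
    student = [0.12, -0.35, 0.20]
    print(alignment_loss(student, teacher))
```

Because the target is the teacher's own prediction rather than a data sample, the loss is zero exactly when the sparse-attention model reproduces the full-attention output.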
474
+ Single-node training example:
475
+
476
+ ```bash
477
+ WORKDIR="/your/path/to/turbodiffusion"
478
+ cd $WORKDIR
479
+ export PYTHONPATH=turbodiffusion
480
+
481
+ # the "IMAGINAIRE_OUTPUT_ROOT" environment variable is the path to save experiment output files
482
+ export IMAGINAIRE_OUTPUT_ROOT=${WORKDIR}/outputs
483
+ CHECKPOINT_ROOT=${WORKDIR}/assets/checkpoints
484
+ DATASET_ROOT=${WORKDIR}/assets/datasets/Wan2.1_14B_480p_16:9_Euler-step100_shift-3.0_cfg-5.0_seed-0_250K
485
+
486
+ # your Wandb information
487
+ export WANDB_API_KEY=xxx
488
+ export WANDB_ENTITY=xxx
489
+
490
+ registry=registry_sla
491
+ experiment=wan2pt1_1pt3B_res480p_t2v_SLA
492
+
493
+ torchrun --nproc_per_node=8 \
494
+ -m scripts.train --config=rcm/configs/${registry}.py -- experiment=${experiment} \
495
+ model.config.teacher_ckpt=${CHECKPOINT_ROOT}/Wan2.1-T2V-1.3B.dcp \
496
+ model.config.tokenizer.vae_pth=${CHECKPOINT_ROOT}/Wan2.1_VAE.pth \
497
+ model.config.text_encoder_path=${CHECKPOINT_ROOT}/models_t5_umt5-xxl-enc-bf16.pth \
498
+ model.config.neg_embed_path=${CHECKPOINT_ROOT}/umT5_wan_negative_emb.pt \
499
+ dataloader_train.tar_path_pattern=${DATASET_ROOT}/shard*.tar
500
+ ```
501
+
502
+ Please refer to `turbodiffusion/rcm/configs/experiments/sla/wan2pt1_t2v.py` for the 14B config, or modify the configuration as needed.
503
+
504
+ #### Model Merging
505
+
506
+ The parameter updates from SLA training can be merged into rCM checkpoints using `turbodiffusion/scripts/merge_models.py`, enabling rCM to perform sparse attention inference. Specify `--base` as the rCM model, `--diff_base` as the pretrained model, and `--diff_target` as the SLA-tuned model.
507
+
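One way to read the three arguments above is as delta merging: the update learned by SLA training (`diff_target` minus `diff_base`) is added to the rCM weights (`base`). The sketch below shows that arithmetic on plain dictionaries of floats. It is an assumption about what the merge does, not a description of `merge_models.py`, and the function name `merge_state_dicts` is hypothetical.

```python
def merge_state_dicts(base, diff_base, diff_target):
    """Return base + (diff_target - diff_base) for every shared parameter name."""
    merged = {}
    for name, value in base.items():
        if name in diff_base and name in diff_target:
            # Apply the SLA-training update on top of the rCM weight.
            merged[name] = value + (diff_target[name] - diff_base[name])
        else:
            # Parameters the SLA run did not touch are copied unchanged.
            merged[name] = value
    return merged


if __name__ == "__main__":
    rcm = {"attn.weight": 1.0, "mlp.bias": 0.5}   # --base
    pretrained = {"attn.weight": 2.0}             # --diff_base
    sla_tuned = {"attn.weight": 2.25}             # --diff_target
    print(merge_state_dicts(rcm, pretrained, sla_tuned))
```

The same expression works elementwise on tensors, which is how it would apply to real checkpoints.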
508
+ ## Roadmap
509
+
510
+ We're actively working on the following features and improvements:
511
+
512
+ - [x] Organize and release training code
513
+ - [ ] Optimize infrastructure for better parallelism
514
+ - [ ] vLLM-Omni integration
515
+ - [ ] Support for more video generation models
516
+ - [ ] Support for autoregressive video generation models
517
+ - [ ] More hardware-level operator optimizations
518
+
519
+
520
+ We welcome community members to help maintain and extend TurboDiffusion. Join the TurboDiffusion Team and contribute with us!
521
 
522
 
523
  ## Citation
 
563
  booktitle={International Conference on Machine Learning (ICML)},
564
  year={2025}
565
  }
566
+ ```