TurboDiffusion
/

TurboWan2.1-T2V-14B-720P

@@ -1,13 +1,13 @@
 ---
-license: apache-2.0
 base_model: Wan-AI/Wan2.1-T2V-14B
 tags:
 - text-to-video
 - diffusion
 - video-generation
 - turbodiffusion
 - wan2.1
-pipeline_tag: text-to-video
 ---
 <p align="center">
@@ -16,13 +16,508 @@ pipeline_tag: text-to-video
 # TurboWan2.1-T2V-14B-720P
-- This HuggingFace repo contains the `TurboWan2.1-T2V-14B-720P` model.
-- For RTX 5090 or similar GPUs, please use the `TurboWan2.1-T2V-14B-720P-quant`. For other GPUs with a bigger GPU memory than 40GB, we recommend using `TurboWan2.1-T2V-14B-720P`.
-- For usage instructions, please see **https://github.com/thu-ml/TurboDiffusion**
 - Paper: [TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times](https://arxiv.org/pdf/2512.16093)
 ## Citation
@@ -68,17 +563,4 @@ pipeline_tag: text-to-video
   booktitle={International Conference on Machine Learning (ICML)},
   year={2025}
 }
-@article{zhang2025sageattention2++,
-  title={Sageattention2++: A more efficient implementation of sageattention2},
-  author={Zhang, Jintao and Xu, Xiaoming and Wei, Jia and Huang, Haofeng and Zhang, Pengle and Xiang, Chendong and Zhu, Jun and Chen, Jianfei},
-  journal={arXiv preprint arXiv:2505.21136},
-  year={2025}
-}
-@article{zhang2025sageattention3,
-  title={SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training},
-  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Xu, Xiaoming and Huang, Haofeng and Wang, Haoxu and Jiang, Kai and Zhu, Jun and Chen, Jianfei},
-  journal={arXiv preprint arXiv:2505.11594},
-  year={2025}
-}
-```

 ---
 base_model: Wan-AI/Wan2.1-T2V-14B
+license: apache-2.0
+pipeline_tag: text-to-video
 tags:
 - text-to-video
 - diffusion
 - video-generation
 - turbodiffusion
 - wan2.1
 ---
 <p align="center">
 # TurboWan2.1-T2V-14B-720P
+This repository contains the `TurboWan2.1-T2V-14B-720P` model, which is part of the **TurboDiffusion** framework. TurboDiffusion is designed to accelerate end-to-end video diffusion generation by 100-200 times while maintaining high video quality, leveraging innovations in attention acceleration, step distillation, and W8A8 quantization. This particular model is based on `Wan-AI/Wan2.1-T2V-14B` and is optimized for 720p video generation.
+- For RTX 5090 or similar GPUs, please use the `TurboWan2.1-T2V-14B-720P-quant` checkpoint. For GPUs with more than 40GB of memory, we recommend the unquantized `TurboWan2.1-T2V-14B-720P` checkpoint.
 - Paper: [TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times](https://arxiv.org/pdf/2512.16093)
+- GitHub Repository: [https://github.com/thu-ml/TurboDiffusion](https://github.com/thu-ml/TurboDiffusion)
+## Quick Start: Inference
+For GPUs with more than 40GB of memory (e.g., H100), **we recommend using the unquantized checkpoint (without `-quant`) and removing `--quant_linear` from the command.**
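The checkpoint guidance above can be sketched as a small helper. This is a minimal illustration and not part of TurboDiffusion: the function name `choose_checkpoint` and passing GPU memory in as a plain number are assumptions. Only the 40GB threshold, the `-quant` suffix, and the `--quant_linear` flag come from this README.

```python
def choose_checkpoint(gpu_memory_gb, base_name="TurboWan2.1-T2V-14B-720P"):
    """Return (checkpoint_name, extra_flags) following the 40GB guidance.

    Hypothetical helper: it only encodes the rule of thumb stated in the README.
    """
    if gpu_memory_gb > 40:
        # Enough memory for the unquantized model: no --quant_linear.
        return base_name, []
    # Smaller GPUs (e.g. RTX 5090): quantized checkpoint plus --quant_linear.
    return base_name + "-quant", ["--quant_linear"]


if __name__ == "__main__":
    for mem in (32, 80):
        name, flags = choose_checkpoint(mem)
        print(f"{mem} GB -> {name} {' '.join(flags)}".rstrip())
```

The returned flags are meant to be appended to the inference command shown in the steps that follow.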
+1.  Download the Wan2.1 VAE (**applicable to both Wan2.1 and Wan2.2**) and umT5 text encoder checkpoints from the official [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) repository on Hugging Face:
+    ```bash
+    mkdir checkpoints
+    cd checkpoints
+    wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/Wan2.1_VAE.pth
+    wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/models_t5_umt5-xxl-enc-bf16.pth
+    ```
+2.  Download our finetuned checkpoints:
+    ```bash
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.1-T2V-1.3B-480P/resolve/main/TurboWan2.1-T2V-1.3B-480P.pth
+    ```
+    For RTX 5090, RTX 4090, or similar GPUs, please use the quantized checkpoint:
+    ```bash
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.1-T2V-1.3B-480P/resolve/main/TurboWan2.1-T2V-1.3B-480P-quant.pth
+    ```
+    For the Wan2.2-I2V model, download both the high-noise and low-noise checkpoints:
+    ```bash
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.2-I2V-A14B-720P/resolve/main/TurboWan2.2-I2V-A14B-high-720P.pth
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.2-I2V-A14B-720P/resolve/main/TurboWan2.2-I2V-A14B-low-720P.pth
+    ```
+3.  Use the inference script for the **T2V** model:
+    ```bash
+    export PYTHONPATH=turbodiffusion
+    # Arguments:
+    # --dit_path            Path to the finetuned TurboDiffusion checkpoint
+    # --model               Model to use: Wan2.1-1.3B or Wan2.1-14B (default: Wan2.1-1.3B)
+    # --num_samples         Number of videos to generate (default: 1)
+    # --num_steps           Sampling steps, 1–4 (default: 4)
+    # --sigma_max           Initial sigma for rCM (default: 80); larger choices (e.g., 1600) reduce diversity but may enhance quality
+    # --vae_path            Path to Wan2.1 VAE (default: checkpoints/Wan2.1_VAE.pth)
+    # --text_encoder_path   Path to umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
+    # --num_frames          Number of frames to generate (default: 81)
+    # --prompt              Text prompt for video generation
+    # --resolution          Output resolution: "480p" or "720p" (default: 480p)
+    # --aspect_ratio        Aspect ratio in W:H format (default: 16:9)
+    # --seed                Random seed for reproducibility (default: 0)
+    # --save_path           Output file path including extension (default: output/generated_video.mp4)
+    # --attention_type      Attention module to use: original, sla or sagesla (default: sagesla)
+    # --sla_topk            Top-k ratio for SLA/SageSLA attention (default: 0.1); we recommend 0.15 for better video quality
+    # --quant_linear        Enable quantization for linear layers, pass this if using a quantized checkpoint
+    # --default_norm        Use the original LayerNorm and RMSNorm of Wan models
+    python turbodiffusion/inference/wan2.1_t2v_infer.py \
+        --model Wan2.1-1.3B \
+        --dit_path checkpoints/TurboWan2.1-T2V-1.3B-480P-quant.pth \
+        --resolution 480p \
+        --prompt "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about." \
+        --num_samples 1 \
+        --num_steps 4 \
+        --quant_linear \
+        --attention_type sagesla \
+        --sla_topk 0.1
+    ```
+    Or the script for the **I2V** model:
+    ```bash
+    export PYTHONPATH=turbodiffusion
+    # --image_path              Path to the input image
+    # --high_noise_model_path   Path to the high noise TurboDiffusion checkpoint
+    # --low_noise_model_path    Path to the low noise TurboDiffusion checkpoint
+    # --boundary                Timestep boundary for switching from high to low noise model (default: 0.9)
+    # --model                   Model to use: Wan2.2-A14B (default: Wan2.2-A14B)
+    # --num_samples             Number of videos to generate (default: 1)
+    # --num_steps               Sampling steps, 1–4 (default: 4)
+    # --sigma_max               Initial sigma for rCM (default: 200); larger choices (e.g., 1600) reduce diversity but may enhance quality
+    # --vae_path                Path to Wan2.2 VAE (default: checkpoints/Wan2.2_VAE.pth)
+    # --text_encoder_path       Path to umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
+    # --num_frames              Number of frames to generate (default: 81)
+    # --prompt                  Text prompt for video generation
+    # --resolution              Output resolution: "480p" or "720p" (default: 720p)
+    # --aspect_ratio            Aspect ratio in W:H format (default: 16:9)
+    # --adaptive_resolution     Enable adaptive resolution based on input image size
+    # --ode                     Use ODE for sampling (sharper but less robust than SDE)
+    # --seed                    Random seed for reproducibility (default: 0)
+    # --save_path               Output file path including extension (default: output/generated_video.mp4)
+    # --attention_type          Attention module to use: original, sla or sagesla (default: sagesla)
+    # --sla_topk                Top-k ratio for SLA/SageSLA attention (default: 0.1); we recommend 0.15 for better video quality
+    # --quant_linear            Enable quantization for linear layers, pass this if using a quantized checkpoint
+    # --default_norm            Use the original LayerNorm and RMSNorm of Wan models
+    python turbodiffusion/inference/wan2.2_i2v_infer.py \
+        --model Wan2.2-A14B \
+        --low_noise_model_path checkpoints/TurboWan2.2-I2V-A14B-low-720P-quant.pth \
+        --high_noise_model_path checkpoints/TurboWan2.2-I2V-A14B-high-720P-quant.pth \
+        --resolution 720p \
+        --adaptive_resolution \
+        --image_path assets/i2v_inputs/i2v_input_0.jpg \
+        --prompt "POV selfie video, ultra-messy and extremely fast. A white cat in sunglasses stands on a surfboard with a neutral look when the board suddenly whips sideways, throwing cat and camera into the water; the frame dives sharply downward, swallowed by violent bursts of bubbles, spinning turbulence, and smeared water streaks as the camera sinks. Shadows thicken, pressure ripples distort the edges, and loose bubbles rush upward past the lens, showing the camera is still sinking. Then the cat kicks upward with explosive speed, dragging the view through churning bubbles and rapidly brightening water as sunlight floods back in; the camera races upward, water streaming off the lens, and finally breaks the surface in a sudden blast of light and spray, snapping back into a crooked, frantic selfie as the cat resurfaces." \
+        --num_samples 1 \
+        --num_steps 4 \
+        --quant_linear \
+        --attention_type sagesla \
+        --sla_topk 0.1 \
+        --ode
+    ```
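When the T2V script is driven from Python rather than the shell, the argument list can be assembled programmatically. The sketch below is an assumption-laden convenience wrapper: `build_t2v_command` is a hypothetical name, and it only uses the script path and flags documented in the T2V example above.

```python
import shlex


def build_t2v_command(dit_path, prompt, model="Wan2.1-1.3B", resolution="480p",
                      num_steps=4, sla_topk=0.1, quantized=True):
    """Build the argv list for wan2.1_t2v_infer.py (hypothetical helper)."""
    if not 1 <= num_steps <= 4:
        # The README documents 1-4 sampling steps.
        raise ValueError("num_steps must be between 1 and 4")
    cmd = [
        "python", "turbodiffusion/inference/wan2.1_t2v_infer.py",
        "--model", model,
        "--dit_path", dit_path,
        "--resolution", resolution,
        "--prompt", prompt,
        "--num_samples", "1",
        "--num_steps", str(num_steps),
        "--attention_type", "sagesla",
        "--sla_topk", str(sla_topk),
    ]
    if quantized:
        # Only pass --quant_linear together with a quantized checkpoint.
        cmd.append("--quant_linear")
    return cmd


if __name__ == "__main__":
    cmd = build_t2v_command(
        "checkpoints/TurboWan2.1-T2V-1.3B-480P-quant.pth",
        "A stylish woman walks down a Tokyo street.",
    )
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root after setting `PYTHONPATH=turbodiffusion` in the environment.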
+## Evaluation
+We evaluate video generation on **a single RTX 5090 GPU**. The E2E Time refers to the end-to-end diffusion generation latency, excluding text encoding and VAE decoding.
+### Wan-2.2-I2V-A14B-720P
+<table>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
+<div><img src="assets/videos/i2v/original/A14B_720p/gif/0.gif" width="360"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
+<div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/0.gif" width="360"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
+<div><img src="assets/videos/i2v/original/A14B_720p/gif/1.gif" width="360"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
+<div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/1.gif" width="360"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
+<div><img src="assets/videos/i2v/original/A14B_720p/gif/2.gif" width="360"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
+<div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/2.gif" width="360"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
+<div><img src="assets/videos/i2v/original/A14B_720p/gif/3.gif" width="360"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
+<div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/3.gif" width="360"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
+<div><img src="assets/videos/i2v/original/A14B_720p/gif/4.gif" width="360"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
+<div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/4.gif" width="360"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
+<div><img src="assets/videos/i2v/original/A14B_720p/gif/5.gif" width="360"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
+<div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/5.gif" width="360"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4549s</div>
+<div><img src="assets/videos/i2v/original/A14B_720p/gif/6.gif" width="360"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>38s</b></div>
+<div><img src="assets/videos/i2v/turbodiffusion/A14B_720p/gif/6.gif" width="360"/></div>
+</td>
+</tr>
+</table>
+### Wan-2.1-T2V-1.3B-480P
+<table>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/5.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/5.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/5.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/0.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/0.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/0.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/1.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/1.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/1.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/2.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/2.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/2.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/7.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/7.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/7.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/11.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/11.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/11.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/13.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/13.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/13.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 184s</div>
+<div><img src="assets/videos/original/1.3B/14.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 5.3s</div>
+<div><img src="assets/videos/fastvideo/video_1.3B/14.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>1.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/1.3B/14.gif" width="249"/></div>
+</td>
+</tr>
+</table>
+### Wan-2.1-T2V-14B-720P
+<table>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4767s</div>
+<div><img src="assets/videos/original/14B_720p/0.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 72.6s</div>
+<div><img src="assets/videos/fastvideo/video_14B_720p/0.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>24s</b></div>
+<div><img src="assets/videos/turbodiffusion/14B_720p/0.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4767s</div>
+<div><img src="assets/videos/original/14B_720p/3.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 72.6s</div>
+<div><img src="assets/videos/fastvideo/video_14B_720p/3.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>24s</b></div>
+<div><img src="assets/videos/turbodiffusion/14B_720p/3.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 4767s</div>
+<div><img src="assets/videos/original/14B_720p/6.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 72.6s</div>
+<div><img src="assets/videos/fastvideo/video_14B_720p/6.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>24s</b></div>
+<div><img src="assets/videos/turbodiffusion/14B_720p/6.gif" width="249"/></div>
+</td>
+</tr>
+</table>
+### Wan-2.1-T2V-14B-480P
+<table>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
+<div><img src="assets/videos/original/14B_480p/0.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
+<div><img src="assets/videos/fastvideo/video_14B_480p/0.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/14B_480p/0.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
+<div><img src="assets/videos/original/14B_480p/1.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
+<div><img src="assets/videos/fastvideo/video_14B_480p/1.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/14B_480p/1.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
+<div><img src="assets/videos/original/14B_480p/4.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
+<div><img src="assets/videos/fastvideo/video_14B_480p/4.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/14B_480p/4.gif" width="249"/></div>
+</td>
+</tr>
+<tr>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">Original, E2E Time: 1676s</div>
+<div><img src="assets/videos/original/14B_480p/5.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">FastVideo, E2E Time: 26.3s</div>
+<div><img src="assets/videos/fastvideo/video_14B_480p/5.gif" width="249"/></div>
+</td>
+<td align="center" style="border: 2px solid #000; padding: 10px;">
+<div style="font-size: 1.1em;">TurboDiffusion, E2E Time: <b>9.9s</b></div>
+<div><img src="assets/videos/turbodiffusion/14B_480p/5.gif" width="249"/></div>
+</td>
+</tr>
+</table>
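The speedups implied by the tables above can be worked out directly from the reported E2E times. The snippet below is a small arithmetic sketch. The latencies are copied from the table captions, while the dictionary layout and the `speedup` helper are illustrative choices.

```python
# E2E latencies in seconds on a single RTX 5090, copied from the tables above.
E2E_SECONDS = {
    "Wan-2.2-I2V-A14B-720P": {"Original": 4549, "TurboDiffusion": 38},
    "Wan-2.1-T2V-1.3B-480P": {"Original": 184, "FastVideo": 5.3, "TurboDiffusion": 1.9},
    "Wan-2.1-T2V-14B-720P": {"Original": 4767, "FastVideo": 72.6, "TurboDiffusion": 24},
    "Wan-2.1-T2V-14B-480P": {"Original": 1676, "FastVideo": 26.3, "TurboDiffusion": 9.9},
}


def speedup(model, method):
    """Ratio of the original latency to the latency of `method`."""
    times = E2E_SECONDS[model]
    return times["Original"] / times[method]


if __name__ == "__main__":
    for model, times in E2E_SECONDS.items():
        for method in times:
            if method != "Original":
                print(f"{model}: {method} {speedup(model, method):.1f}x")
```

By this measure the TurboDiffusion speedup ranges from roughly 97x (1.3B, 480p) to roughly 199x (14B, 720p).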
+## Training
+In this repo, we provide training code based on Wan2.1 and its synthetic data. The training builds on the rCM codebase (https://github.com/NVlabs/rcm), with infrastructure support including FSDP2, Ulysses CP, and selective activation checkpointing (SAC). For rCM training instructions, please refer to the original rCM repository; SLA training guidance is provided here.
+#### Checkpoints Downloading
+Download the Wan2.1 pretrained checkpoints in `.pth` format and VAE/text encoder to `assets/checkpoints`:
+```bash
+# make sure git lfs is installed
+git clone https://huggingface.co/worstcoder/Wan assets/checkpoints
+```
+FSDP2 relies on [Distributed Checkpoint (DCP)](https://docs.pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html) for loading and saving checkpoints. Before training, convert `.pth` teacher checkpoints to `.dcp` first:
+```bash
+python -m torch.distributed.checkpoint.format_utils torch_to_dcp assets/checkpoints/Wan2.1-T2V-1.3B.pth assets/checkpoints/Wan2.1-T2V-1.3B.dcp
+```
+After training, the saved `.dcp` checkpoints can be converted to `.pth` using the script `scripts/dcp_to_pth.py`.
+#### Dataset Downloading
+We provide Wan2.1-14B-synthesized datasets. Download to `assets/datasets` using:
+```bash
+# make sure git lfs is installed
+git clone https://huggingface.co/datasets/worstcoder/Wan_datasets assets/datasets
+```
+#### Start Training
+We implement white-box SLA training by aligning the predictions of the SLA-enabled model with those of the full-attention pretrained model. Unlike black-box training in the original paper, which tunes the pretrained model using diffusion loss, white-box training mitigates distribution shift and is less sensitive to the training data.
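The difference between the two objectives can be sketched with toy functions. This is only an illustration under stated assumptions: the README does not specify the exact loss, so mean squared error is assumed here, and `student_fn` and `teacher_fn` are hypothetical stand-ins for the SLA-enabled model and the frozen full-attention model.

```python
def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def white_box_loss(student_fn, teacher_fn, noisy_input):
    """Align the SLA model's prediction with the full-attention model's.

    The target comes from the teacher, not from the training data itself.
    """
    teacher_out = teacher_fn(noisy_input)  # frozen full-attention model
    student_out = student_fn(noisy_input)  # SLA-enabled model being tuned
    return mse(student_out, teacher_out)


def black_box_loss(student_fn, noisy_input, data_target):
    """Diffusion-style loss against a target derived from the data (assumed MSE)."""
    return mse(student_fn(noisy_input), data_target)


if __name__ == "__main__":
    teacher = lambda x: [2.0 * v for v in x]
    student = lambda x: [2.0 * v + 1.0 for v in x]
    print(white_box_loss(student, teacher, [1.0, 2.0]))
```

Because the white-box target is produced by the teacher on the same input, the data only determines where the two models are compared, which is consistent with the reduced sensitivity to training data described above.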
+Single-node training example:
+```bash
+WORKDIR="/your/path/to/turbodiffusion"
+cd $WORKDIR
+export PYTHONPATH=turbodiffusion
+# the "IMAGINAIRE_OUTPUT_ROOT" environment variable is the path to save experiment output files
+export IMAGINAIRE_OUTPUT_ROOT=${WORKDIR}/outputs
+CHECKPOINT_ROOT=${WORKDIR}/assets/checkpoints
+DATASET_ROOT=${WORKDIR}/assets/datasets/Wan2.1_14B_480p_16:9_Euler-step100_shift-3.0_cfg-5.0_seed-0_250K
+# your Wandb information
+export WANDB_API_KEY=xxx
+export WANDB_ENTITY=xxx
+registry=registry_sla
+experiment=wan2pt1_1pt3B_res480p_t2v_SLA
+torchrun --nproc_per_node=8 \
+    -m scripts.train --config=rcm/configs/${registry}.py -- experiment=${experiment} \
+        model.config.teacher_ckpt=${CHECKPOINT_ROOT}/Wan2.1-T2V-1.3B.dcp \
+        model.config.tokenizer.vae_pth=${CHECKPOINT_ROOT}/Wan2.1_VAE.pth \
+        model.config.text_encoder_path=${CHECKPOINT_ROOT}/models_t5_umt5-xxl-enc-bf16.pth \
+        model.config.neg_embed_path=${CHECKPOINT_ROOT}/umT5_wan_negative_emb.pt \
+        dataloader_train.tar_path_pattern=${DATASET_ROOT}/shard*.tar
+```
+Please refer to `turbodiffusion/rcm/configs/experiments/sla/wan2pt1_t2v.py` for the 14B config, or modify it as needed.
+#### Model Merging
+The parameter updates from SLA training can be merged into rCM checkpoints using `turbodiffusion/scripts/merge_models.py`, enabling rCM to perform sparse attention inference. Specify `--base` as the rCM model, `--diff_base` as the pretrained model, and `--diff_target` as the SLA-tuned model.
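The README does not spell out the merge arithmetic. A natural reading of the three arguments is that the SLA delta (`diff_target` minus `diff_base`) is added onto `base`. The sketch below illustrates that reading on plain Python dictionaries; it is an assumption, and the actual `merge_models.py` may differ (for example in how it handles tensors or SLA-only parameters).

```python
def merge_models(base, diff_base, diff_target):
    """Return base + (diff_target - diff_base), parameter by parameter.

    Each argument maps a parameter name to a flat list of floats.
    Hypothetical sketch, not the repository's implementation.
    """
    if not (base.keys() == diff_base.keys() == diff_target.keys()):
        raise KeyError("all three checkpoints must share the same parameter names")
    merged = {}
    for name, base_w in base.items():
        merged[name] = [
            b + (t - p)  # rCM weight plus the update learned by SLA training
            for b, p, t in zip(base_w, diff_base[name], diff_target[name])
        ]
    return merged


if __name__ == "__main__":
    rcm = {"w": [1.0, 2.0]}         # --base: rCM model
    pretrained = {"w": [0.5, 0.5]}  # --diff_base: pretrained model
    sla = {"w": [1.0, 0.0]}         # --diff_target: SLA-tuned model
    print(merge_models(rcm, pretrained, sla))
```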
+## Roadmap
+We're actively working on the following features and improvements:
+- [x] Organize and release training code
+- [ ] Optimize infrastructure for better parallelism
+- [ ] vLLM-Omni integration
+- [ ] Support for more video generation models
+- [ ] Support for autoregressive video generation models
+- [ ] More hardware-level operator optimizations
+We welcome community members to help maintain and extend TurboDiffusion. You are welcome to join the TurboDiffusion Team and contribute!
 ## Citation
   booktitle={International Conference on Machine Learning (ICML)},
   year={2025}
 }
+```