TurboDiffusion/TurboWan2.1-T2V-1.3B-480P

@@ -1,13 +1,14 @@
 ---
-license: apache-2.0
 base_model: Wan-AI/Wan2.1-T2V-1.3B
 tags:
 - text-to-video
 - diffusion
 - video-generation
 - turbodiffusion
 - wan2.1
-pipeline_tag: text-to-video
 ---
 <p align="center">
@@ -16,14 +17,77 @@ pipeline_tag: text-to-video
 # TurboWan2.1-T2V-1.3B-480P
-- This HuggingFace repo contains the `TurboWan2.1-T2V-1.3B-480P` model.
-- For RTX 5090, RTX 4090, or similar GPUs, please use the `TurboWan2.1-T2V-1.3B-480P-quant`. For other GPUs with a bigger GPU memory than 40GB, we recommend using `TurboWan2.1-T2V-1.3B-480P`.
-- For usage instructions, please see **https://github.com/thu-ml/TurboDiffusion**
-- Paper: [TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times](https://arxiv.org/pdf/2512.16093)
 ## Citation
 ```
@@ -81,4 +145,4 @@ pipeline_tag: text-to-video
   journal={arXiv preprint arXiv:2505.11594},
   year={2025}
 }
-```

 ---
 base_model: Wan-AI/Wan2.1-T2V-1.3B
+license: apache-2.0
+pipeline_tag: text-to-video
 tags:
 - text-to-video
 - diffusion
 - video-generation
 - turbodiffusion
 - wan2.1
+library_name: diffusers
 ---
 <p align="center">
 # TurboWan2.1-T2V-1.3B-480P
+This Hugging Face repo contains the `TurboWan2.1-T2V-1.3B-480P` model, as presented in the paper [TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times](https://arxiv.org/pdf/2512.16093).
+For RTX 5090, RTX 4090, or similar GPUs, please use `TurboWan2.1-T2V-1.3B-480P-quant`. For GPUs with more than 40GB of memory, we recommend `TurboWan2.1-T2V-1.3B-480P`.
+For more detailed usage instructions and the full codebase, please see the [TurboDiffusion GitHub repository](https://github.com/thu-ml/TurboDiffusion).
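The checkpoint recommendation above can be sketched as a small shell helper that maps total GPU memory (in MiB) to a checkpoint file name. This is only an illustration: the `pick_checkpoint` function is hypothetical and not part of TurboDiffusion, and reading "more than 40GB" as "more than 40960 MiB" is an assumption. GPUs that fall between the consumer cards and the 40GB mark are not covered explicitly by the recommendation, so the sketch defaults to the quantized checkpoint for them.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of TurboDiffusion): choose a checkpoint by GPU memory.
# Assumption: "more than 40GB" is interpreted as strictly more than 40960 MiB.
pick_checkpoint() {
    local mem_mib="$1"
    if [ "$mem_mib" -gt 40960 ]; then
        echo "TurboWan2.1-T2V-1.3B-480P.pth"
    else
        echo "TurboWan2.1-T2V-1.3B-480P-quant.pth"
    fi
}

# Query the first GPU's total memory in MiB when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    mem_mib="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)"
    # Only print a suggestion if the query returned a plain integer.
    if [[ "$mem_mib" =~ ^[0-9]+$ ]]; then
        echo "Suggested checkpoint: $(pick_checkpoint "$mem_mib")"
    fi
fi
```

When using the quantized file name returned by this helper, remember to pass `--quant_linear` to the inference script.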
+## Sample Usage
+For GPUs with more than 40GB of memory (e.g., H100), **we recommend using the unquantized checkpoint (without `-quant`) and removing `--quant_linear` from the command.**
+1.  Download the Wan2.1 VAE (**applicable to both Wan2.1 and Wan2.2**) and umT5 text encoder checkpoints from the official [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) repository on Hugging Face:
+    ```bash
+    mkdir checkpoints
+    cd checkpoints
+    wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/Wan2.1_VAE.pth
+    wget https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/resolve/main/models_t5_umt5-xxl-enc-bf16.pth
+    ```
+2.  Download our finetuned checkpoints:
+    ```bash
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.1-T2V-1.3B-480P/resolve/main/TurboWan2.1-T2V-1.3B-480P.pth
+    ```
+    For RTX 5090, RTX 4090, or similar GPUs, please use the quantized checkpoint:
+    ```bash
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.1-T2V-1.3B-480P/resolve/main/TurboWan2.1-T2V-1.3B-480P-quant.pth
+    ```
+    For the Wan2.2-I2V model, download both the high-noise and low-noise checkpoints:
+    ```bash
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.2-I2V-A14B-720P/resolve/main/TurboWan2.2-I2V-A14B-high-720P.pth
+    wget https://huggingface.co/TurboDiffusion/TurboWan2.2-I2V-A14B-720P/resolve/main/TurboWan2.2-I2V-A14B-low-720P.pth
+    ```
+3.  Use the inference script for the **T2V** model:
+    ```bash
+    export PYTHONPATH=turbodiffusion
+    # Arguments:
+    # --dit_path            Path to the finetuned TurboDiffusion checkpoint
+    # --model               Model to use: Wan2.1-1.3B or Wan2.1-14B (default: Wan2.1-1.3B)
+    # --num_samples         Number of videos to generate (default: 1)
+    # --num_steps           Sampling steps, 1–4 (default: 4)
+    # --sigma_max           Initial sigma for rCM (default: 80); larger choices (e.g., 1600) reduce diversity but may enhance quality
+    # --vae_path            Path to Wan2.1 VAE (default: checkpoints/Wan2.1_VAE.pth)
+    # --text_encoder_path   Path to umT5 text encoder (default: checkpoints/models_t5_umt5-xxl-enc-bf16.pth)
+    # --num_frames          Number of frames to generate (default: 81)
+    # --prompt              Text prompt for video generation
+    # --resolution          Output resolution: "480p" or "720p" (default: 480p)
+    # --aspect_ratio        Aspect ratio in W:H format (default: 16:9)
+    # --seed                Random seed for reproducibility (default: 0)
+    # --save_path           Output file path including extension (default: output/generated_video.mp4)
+    # --attention_type      Attention module to use: original, sla or sagesla (default: sagesla)
+    # --sla_topk            Top-k ratio for SLA/SageSLA attention (default: 0.1); we recommend 0.15 for better video quality
+    # --quant_linear        Enable quantization for linear layers, pass this if using a quantized checkpoint
+    # --default_norm        Use the original LayerNorm and RMSNorm of Wan models
+    python turbodiffusion/inference/wan2.1_t2v_infer.py \
+        --model Wan2.1-1.3B \
+        --dit_path checkpoints/TurboWan2.1-T2V-1.3B-480P-quant.pth \
+        --resolution 480p \
+        --prompt "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about." \
+        --num_samples 1 \
+        --num_steps 4 \
+        --quant_linear \
+        --attention_type sagesla \
+        --sla_topk 0.1
+    ```
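The command in step 3 targets the quantized checkpoint. For the unquantized checkpoint, the same command applies with `--dit_path` pointing at the file without `-quant` and with `--quant_linear` removed. A minimal sketch of keeping those two settings in sync is shown below; the `build_infer_args` function, the `INFER_ARGS` array, and the short prompt are hypothetical and used only for illustration, while every flag comes from the argument list in step 3.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of TurboDiffusion): build the argument list and
# add --quant_linear only when the checkpoint file name ends in "-quant.pth".
build_infer_args() {
    local dit_path="$1" prompt="$2"
    INFER_ARGS=(
        --model Wan2.1-1.3B
        --dit_path "$dit_path"
        --resolution 480p
        --prompt "$prompt"
        --num_samples 1
        --num_steps 4
        --attention_type sagesla
        --sla_topk 0.1
    )
    # Quantized checkpoints need --quant_linear; unquantized ones omit it.
    if [[ "$dit_path" == *-quant.pth ]]; then
        INFER_ARGS+=(--quant_linear)
    fi
}

# Example: unquantized checkpoint, e.g. on a GPU with more than 40GB of memory.
build_infer_args "checkpoints/TurboWan2.1-T2V-1.3B-480P.pth" "A cat walking on grass"

# Print the command instead of running it; remove "echo" to execute it
# from the repository root after `export PYTHONPATH=turbodiffusion`.
echo python turbodiffusion/inference/wan2.1_t2v_infer.py "${INFER_ARGS[@]}"
```

Deriving the flag from the file name avoids the mismatch of loading a quantized checkpoint without `--quant_linear`, or the reverse.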
 ## Citation
 ```
   journal={arXiv preprint arXiv:2505.11594},
   year={2025}
 }
+```