## Overview

<p align="center"><img src="https://huggingface.co/RaphaelLiu/PusaV1/resolve/main/pusa_benchmark_figure_dark.png" width="80%"></p>

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability.
In this work, we present **Pusa**¹, a groundbreaking paradigm that leverages **vectorized timestep adaptation (VTA)** to enable fine-grained temporal control within a unified video diffusion framework. VTA is a non-destructive adaptation, meaning it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency—**surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000)** and **≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples)**.
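
The mechanical change behind VTA is simple to state: instead of driving every frame with one shared scalar timestep, the model receives a vector of per-frame timesteps, so frames can sit at different noise levels within a single denoising pass. The sketch below is illustrative only and is not the Pusa/Wan API; the frame count, noise values, and the `denoise` function are hypothetical placeholders.

```python
import torch

num_frames = 16

# Conventional video diffusion: one scalar timestep drives all frames in
# lockstep, forcing them to evolve synchronously.
scalar_t = torch.full((num_frames,), 0.8)   # every frame shares t = 0.8

# Vectorized timesteps: each frame carries its own noise level, so frame
# evolution can be desynchronized. For image-to-video, the conditioning
# frame can be held clean (t = 0) while the rest are still being denoised.
vector_t = scalar_t.clone()
vector_t[0] = 0.0                           # frame 0 = clean condition image

# A hypothetical denoiser would then consume the whole vector, e.g.:
# latents = denoise(latents, timesteps=vector_t, text_emb=prompt_emb)
```

Because the adaptation only changes how timesteps are supplied, different tasks reduce to choosing a per-frame timestep pattern at inference time, which is consistent with the non-destructive adaptation described above.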

Pusa not only sets a new standard for image-to-video (I2V) generation but also unlocks many other zero-shot multi-task capabilities such as start-end frames and video extension, all without task-specific training while preserving the base model's capabilities.

Pusa V1.0, with only 10 inference steps, achieves state-of-the-art performance among open-source models. It surpasses its direct baseline, `Wan-I2V`, which was trained with vastly greater resources. Our model obtains a VBench-I2V total score of **87.32%**, outperforming `Wan-I2V`'s 86.86%.
## ✨ Key Features
- **Comprehensive Multi-task Support**: