GuanjieChen committed (verified) on commit ce70e70 · 1 parent: 1ebbdc3

Update README.md

Files changed (1): README.md (+114 −39)
README.md CHANGED
@@ -1,30 +1,37 @@
- ---
- license: apache-2.0
- language:
- - en
- - zh
- base_model:
- - maxin-cn/Latte-1
- - facebook/DiT-XL-2-256
- - Tencent-Hunyuan/HunyuanDiT
- tags:
- - video
- - image
- - model-efficiency
- ---
- # Accelerating Vision Diffusion Transformers with Skip Branches
-
- This repository contains all the model checkpoints for the paper **[Accelerating Vision Diffusion Transformers with Skip Branches](https://arxiv.org/abs/2411.17616)**. In this work, we enhance standard DiT models by introducing **Skip-DiT**, which incorporates skip branches to improve feature smoothness. We also propose **Skip-Cache**, a method that leverages skip branches to cache DiT features across timesteps during inference. The effectiveness of our approach is validated on various DiT backbones for both video and image generation, demonstrating how skip branches preserve generation quality while achieving significant speedup. Experimental results show that **Skip-Cache** provides a 1.5x speedup with minimal computational cost and a 2.2x speedup with only a slight reduction in quantitative metrics. All the code and checkpoints are publicly available at [huggingface](https://huggingface.co/GuanjieChen/Skip-DiT/tree/main) and [github](https://github.com/OpenSparseLLMs/Skip-DiT.git). More visualizations can be found [here](#visualization).
-
- ### Pipeline
- ![pipeline](visuals/pipeline.jpg)
- Illustration of Skip-DiT and Skip-Cache for DiT visual generation caching. (a) The vanilla DiT block for image and video generation. (b) Skip-DiT modifies the vanilla DiT model using skip branches to connect shallow and deep DiT blocks. (c) Pipeline of Skip-Cache.
-
- ### Feature Smoothness
- ![feature](visuals/feature.jpg)
- Feature smoothness analysis of DiT in the class-to-video generation task using DDPM. Normalized disturbances, controlled by strength coefficients $\alpha$ and $\beta$, are introduced to the model with and without skip connections. We compare the similarity between the original and perturbed features. The feature difference surface of Latte, with and without skip connections, is visualized at steps 10 and 250 of DDPM. The results show significantly better feature smoothness in Skip-DiT. Furthermore, we identify feature smoothness as a critical factor limiting the effectiveness of cross-timestep feature caching in DiT. This insight provides a deeper understanding of caching efficiency and its impact on performance.
-
- ### Pretrained Models
  | Model | Task | Training Data | Backbone | Size(G) | Skip-Cache |
  |:--:|:--:|:--:|:--:|:--:|:--:|
  | [Latte-skip](https://huggingface.co/GuanjieChen/Skip-DiT/blob/main/DiT-XL-2-skip.pt) | text-to-video |Vimeo|Latte|8.76| ✅ |
@@ -36,22 +43,90 @@ Feature smoothness analysis of DiT in the class-to-video generation task using D
 
  Pretrained text-to-image Model of [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT) can be found in [Huggingface](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2/tree/main/t2i/model) and [Tencent-cloud](https://dit.hunyuan.tencent.com/download/HunyuanDiT/model-v1_2.zip).
 
- ### Demo
- ![demo1](visuals/video-demo.gif)
- (Results of Latte with skip-branches on text-to-video and class-to-video tasks. Left: text-to-video with 1.7x and 2.0x speedup. Right: class-to-video with 2.2x and 2.5x speedup. Latency is measured on one A100.)
-
- ![demo2](visuals/image-demo.jpg)
- (Results of HunYuan-DiT with skip-branches on text-to-image task. Latency is measured on one A100.)
-
- ### Acknowledgement
- Skip-DiT has been greatly inspired by the following amazing works and teams: [DeepCache](https://github.com/horseee/DeepCache), [Latte](https://github.com/Vchitect/Latte), [DiT](https://github.com/facebookresearch/DiT), and [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT), we thank all the contributors for open-sourcing.
 
  ### Visualization
- #### Text-to-Video
  ![text-to-video visualizations](visuals/case_t2v.jpg)
- #### Class-to-Video
  ![class-to-video visualizations](visuals/case_c2v.jpg)
- #### Text-to-image
  ![text-to-image visualizations](visuals/case_t2i.jpg)
- #### Class-to-image
- ![class-to-image visualizations](visuals/case_c2i.jpg)

+ ## Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints
+
+ <div align="center">
+ <a href="https://github.com/OpenSparseLLMs/Skip-DiT"><img src="https://img.shields.io/static/v1?label=Skip-DiT-Code&message=Github&color=blue&logo=github-pages"></a> &ensp;
+ <a href="https://arxiv.org/abs/2411.17616"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:Skip-DiT&color=red&logo=arxiv"></a> &ensp;
+ <a href="https://huggingface.co/GuanjieChen/Skip-DiT"><img src="https://img.shields.io/static/v1?label=Skip-DiT&message=HuggingFace&color=yellow"></a> &ensp;
+ </div>
+
+ <div align="center">
+ <img src="visuals/teaser.jpg" width="100%"></img>
+ <br>
+ <em>
+ (a) Feature similarities between standard and cache-accelerated outputs for vanilla DiT caching with FORA and Faster-Diff, and for Skip-DiT. Skip-DiT shows consistently higher feature similarity, demonstrating superior stability under caching. (b) Illustration of Skip-DiT, which modifies vanilla DiT models with long-skip-connections that connect shallow and deep DiT blocks. Dashed arrows indicate paths whose computation can be skipped during cached inference. (c) Comparison of video generation quality (PSNR) and inference speedup across DiT caching methods. Skip-DiT maintains higher generation quality even at larger speedup factors.
+ </em>
+ </div>
+ <br>
+
+ ### 🎉🎉🎉 About
+ This repository contains the official PyTorch implementation of the paper: **[Towards Stabilized and Efficient Diffusion Transformers through Long-Skip-Connections with Spectral Constraints](https://arxiv.org/abs/2411.17616)**.
+
+ Diffusion Transformers (DiT) have emerged as a powerful architecture for image and video generation, offering superior quality and scalability. However, their practical application suffers from inherent dynamic feature instability, leading to error amplification during cached inference. Through systematic analysis, we identify the absence of long-range feature preservation mechanisms as the root cause of unstable feature propagation and perturbation sensitivity. To this end, we propose Skip-DiT, an image and video generative DiT variant enhanced with Long-Skip-Connections (LSCs), the key efficiency component in U-Nets. Theoretical spectral-norm and visualization analyses demonstrate how LSCs stabilize feature dynamics. The Skip-DiT architecture and its stabilized dynamic features enable an efficient static caching mechanism that reuses deep features across timesteps while updating shallow components. Extensive experiments across image and video generation tasks demonstrate that Skip-DiT achieves: (1) 4.4x training acceleration and faster convergence, and (2) 1.5-2x inference acceleration with negligible quality loss and high fidelity to the original output, outperforming existing DiT caching methods across various quantitative metrics. Our findings establish Long-Skip-Connections as critical architectural components for stable and efficient diffusion transformers.
+ More visualizations can be found [here](#visualization).
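The long-skip-connection design can be illustrated with a short, dependency-free Python sketch. All names here (`SkipDiTSketch`, `merge`, the affine stand-in blocks) are hypothetical simplifications, not the repository's actual modules; in Skip-DiT the merge over shallow and deep token features is a learned layer, not an average.

```python
# Dependency-free sketch of U-Net-style long-skip-connections (LSCs)
# in a DiT-like block stack. Illustrative only.

def make_block(scale):
    # Stand-in for a transformer block: a simple affine map on a feature vector.
    return lambda x: [scale * v + 1.0 for v in x]

class SkipDiTSketch:
    def __init__(self, depth=6):
        assert depth % 2 == 0
        self.blocks = [make_block(0.5) for _ in range(depth)]
        self.half = depth // 2

    def merge(self, skip, x):
        # Stand-in for the learned merge over [skip; x]; here, an average.
        return [(s + v) / 2.0 for s, v in zip(skip, x)]

    def forward(self, x):
        skips = []
        # Shallow half: push each block's output onto the skip stack.
        for blk in self.blocks[: self.half]:
            x = blk(x)
            skips.append(x)
        # Deep half: pop the paired shallow feature and merge it in
        # through the skip branch before each deep block runs.
        for blk in self.blocks[self.half:]:
            x = blk(self.merge(skips.pop(), x))
        return x
```

The pairing is U-Net style: shallow block *i* connects to deep block *depth−1−i*, so every deep block sees a well-preserved shallow feature through its skip branch.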
+
+ <!-- > [**Accelerating Vision Diffusion Transformers with Skip Branches**](https://arxiv.org/abs/2411.17616)<br>
+ > [Guanjie Chen](https://scholar.google.com/citations?user=cpBU1VgAAAAJ&hl=zh-CN), [Xinyu Zhao](https://scholar.google.com/citations?hl=en&user=1cj23VYAAAAJ), [Yucheng Zhou](https://scholar.google.com/citations?user=nnbFqRAAAAAJ&hl=en), [Tianlong Chen](https://scholar.google.com/citations?user=LE3ctn0AAAAJ&hl=en), [Yu Cheng](https://scholar.google.com/citations?user=ORPxbV4AAAAJ&hl=en)
+ > (contact us: chenguanjie@sjtu.edu.cn, xinyu@cs.unc.edu) -->
+
+ ### 🌟 Feature Stability of Skip-DiT
+ ![stability](visuals/stability.jpg)
+ Visualization of the feature stability of Skip-DiT compared with vanilla DiT. Skip-DiT also shows superior training efficiency.
+
+ ### 🛒 Released Models
  | Model | Task | Training Data | Backbone | Size(G) | Skip-Cache |
  |:--:|:--:|:--:|:--:|:--:|:--:|
  | [Latte-skip](https://huggingface.co/GuanjieChen/Skip-DiT/blob/main/DiT-XL-2-skip.pt) | text-to-video |Vimeo|Latte|8.76| ✅ |
 
  Pretrained text-to-image Model of [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT) can be found in [Huggingface](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.2/tree/main/t2i/model) and [Tencent-cloud](https://dit.hunyuan.tencent.com/download/HunyuanDiT/model-v1_2.zip).
 
+ <video controls loop src="https://github.com/user-attachments/assets/90878b0e-ff69-415a-b786-e0b6587b0a0b" type="video/mp4"></video>
+ (Visualizations of Latte-Skip. You can replicate them [here](#text-to-video-inference).)
+
+ ### 🚀 Quick Start
+ #### Text-to-video Inference
+ To generate videos with Latte-skip, you need just three steps (plus an optional caching step):
+ ```shell
+ # 1. Prepare your conda environment
+ cd text-to-video ; conda env create -f environment.yaml ; conda activate latte
+ # 2. Download checkpoints of Latte and Latte-skip
+ python download.py
+ # 3. Generate videos with a single command!
+ python sample/sample_t2v.py --config ./configs/t2v/t2v_sample_skip.yaml
+ # 4. (Optional) To accelerate generation with Skip-Cache, run the following command
+ python sample/sample_t2v.py --config ./configs/t2v/t2v_sample_skip_cache.yaml --cache N2-700-50
+ ```
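Conceptually, the `--cache` option exploits the skip branches at inference time: on most timesteps only the shallow blocks are recomputed, while the deep blocks' output is reused from the last full forward pass. A minimal, dependency-free sketch of the idea — the refresh schedule, the `shallow`/`deep` stand-ins, and the averaging merge are all illustrative assumptions, not the repository's implementation:

```python
def shallow(x, t):
    # Stand-in for the first few DiT blocks (always recomputed).
    return x + 0.1 * t

def deep(x, t):
    # Stand-in for the expensive deep blocks (cached between refreshes).
    return 2.0 * x

def sample_with_skip_cache(x, timesteps, refresh_every=2):
    cached_deep = None
    deep_calls = 0
    for i, t in enumerate(timesteps):
        h = shallow(x, t)
        if cached_deep is None or i % refresh_every == 0:
            cached_deep = deep(h, t)  # full step: refresh the deep-feature cache
            deep_calls += 1
        # Cheap step: the fresh shallow output rejoins the cached deep
        # features through the skip branch (here, a simple average).
        x = (h + cached_deep) / 2.0
    return x, deep_calls
```

With `refresh_every=2` the deep stack runs on only half of the denoising steps, which is the source of the caching speedup.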
+ #### Text-to-image Inference
+ In the same way, to generate images with Hunyuan-DiT, you need just three steps (plus an optional caching step):
+ ```shell
+ # 1. Prepare your conda environment
+ cd text-to-image ; conda env create -f environment.yaml ; conda activate HunyuanDiT
+ # 2. Download checkpoints of Hunyuan-DiT
+ mkdir ckpts ; huggingface-cli download Tencent-Hunyuan/HunyuanDiT-v1.2 --local-dir ./ckpts
+ # 3. Generate images with a single command!
+ python sample_t2i.py --prompt "渔舟唱晚" --no-enhance --infer-steps 100 --image-size 1024 1024
+ # 4. (Optional) To accelerate generation with Skip-Cache, run the following command
+ python sample_t2i.py --prompt "渔舟唱晚" --no-enhance --infer-steps 100 --image-size 1024 1024 --cache --cache-step 2
+ ```
+
+ For the class-to-video and class-to-image tasks, you can find detailed instructions in `class-to-video/README.md` and `class-to-image/README.md`.
+
+ ### 🏃 Training
+
+ We have already released the training code of Latte-skip! Training takes only a few days on 8 H100 GPUs. To train the text-to-video model:
+ 1. Prepare your text-video dataset and implement it in `text-to-video/datasets/t2v_joint_dataset.py`.
+ 2. Run the two-stage training strategy:
+    1. Skip-branch warm-up: freeze all the parameters except the skip branches. Set `freeze=True` in `text-to-video/configs/train_t2v.yaml`, then run the training script at `text-to-video/train_scripts/t2v_joint_train_skip.sh`.
+    2. Full training: set `freeze=False` in `text-to-video/configs/train_t2v.yaml`, then run the training script again.
+
+ **The text-to-video model we released was trained with only 300k text-video pairs from Vimeo for around 1 week on 8 H100 GPUs.**
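The stage-1 freeze amounts to a parameter-name filter; in PyTorch this would toggle `requires_grad` over `model.named_parameters()`. A minimal sketch, where the parameter names are hypothetical and chosen only to illustrate the `freeze=True` behavior:

```python
def stage1_trainable(param_names):
    """Stage 1 (freeze=True): keep only skip-branch parameters trainable;
    every other backbone parameter stays frozen."""
    return [name for name in param_names if "skip" in name]

# Hypothetical parameter names for illustration only.
params = [
    "blocks.0.attn.qkv.weight",
    "blocks.27.mlp.fc1.weight",
    "skip_branches.0.linear.weight",
    "skip_branches.1.norm.weight",
]
```

Stage 2 (`freeze=False`) then simply trains the full parameter list.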
+
+ The training instructions for the class-to-video and class-to-image tasks can be found in `class-to-video/README.md` and `class-to-image/README.md`.
+
+ ### 🌺 Acknowledgement
+ Skip-DiT has been greatly inspired by the following amazing works and teams: [DeepCache](https://arxiv.org/abs/2312.00858), [Latte](https://github.com/Vchitect/Latte), [DiT](https://github.com/facebookresearch/DiT), and [HunYuan-DiT](https://github.com/Tencent/HunyuanDiT). We thank all the contributors for open-sourcing their work.
+
+ ### License
+ The code and model weights are licensed under [LICENSE](./class-to-image/LICENSE).
+
 
  ### Visualization
+
+ ##### 1. Teasers
+ <div align="center">
+ <img src="visuals/video-demo.gif" width="85%"></img>
+ <br>
+ <em>
+ (Results of Latte with skip-branches on text-to-video and class-to-video tasks. Left: text-to-video with 1.7x and 2.0x speedup. Right: class-to-video with 2.2x and 2.4x speedup. Latency is measured on one A100.)
+ </em>
+ </div>
+ <br>
+
+ <div align="center">
+ <img src="visuals/image-demo.jpg" width="100%"></img>
+ <br>
+ <em>
+ (Results of Hunyuan-DiT with skip-branches on the text-to-image task. Latency is measured on one A100.)
+ </em>
+ </div>
+ <br>
+
+ ##### 2. Text-to-Video
  ![text-to-video visualizations](visuals/case_t2v.jpg)
+ ##### 3. Class-to-Video
  ![class-to-video visualizations](visuals/case_c2v.jpg)
+ ##### 4. Text-to-image
  ![text-to-image visualizations](visuals/case_t2i.jpg)
+ ##### 5. Class-to-image
+ ![class-to-image visualizations](visuals/case_c2i.jpg)