Title: A Training-Free Acceleration Method Tailored for Diffusion Transformers
URL Source: https://arxiv.org/html/2406.01125
Published Time: Tue, 04 Jun 2024 01:31:32 GMT
Pengtao Chen 1† Mingzhu Shen 2† Peng Ye 1 Jianjian Cao 1 Chongjun Tu 1
Christos-Savvas Bouganis 2 Yiren Zhao 2 Tao Chen 1
1 Fudan University 2 Imperial College London
pengt.chen@gmail.com, eetchen@fudan.edu.cn
Abstract
Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. With the more successful results achieved by diffusion transformers (DiT), there is still a lack of exploration regarding the impact of the DiT structure on generation, as well as the absence of an acceleration framework tailored to the DiT architecture. To tackle these challenges, we investigate the correlation between DiT blocks and image generation. Our findings reveal that the front blocks of DiT are associated with the outline of the generated images, while the rear blocks are linked to the details. Based on this insight, we propose an overall training-free inference acceleration framework, Δ-DiT: a designed cache mechanism accelerates the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages. Specifically, a DiT-specific cache mechanism called Δ-Cache is proposed, which takes the input of the previous sampling step into account and reduces bias in the inference. Extensive experiments on PIXART-α and DiT-XL demonstrate that Δ-DiT achieves a 1.6× speedup on 20-step generation and even improves performance in most cases. In the scenario of 4-step consistency-model generation and the more challenging 1.12× acceleration, our method significantly outperforms existing methods. Our code will be publicly available.
Figure 1: Accelerating diffusion transformers such as PIXART-α and DiT-XL under 20-step sampling.
1 Introduction
In recent years, the field of generative models has experienced rapid advancements. Among these, diffusion models[1, 2, 3] have emerged as pivotal, attracting widespread attention for their ability to generate high-quality and diverse images[4]. This has also spurred the development of many meaningful applications, such as image editing[5, 6], 3D generation[7, 8, 9], and video generation[10, 11, 12, 13]. Although diffusion models have strong generation capabilities, their iterative denoising nature results in poor real-time performance.
Subsequently, numerous inference acceleration frameworks for diffusion models have been proposed, including pruning[14, 15], quantization[16, 17, 18, 19], and distillation[20, 21, 22, 23, 24] of the noise estimation network, optimized sampling solvers[25, 26, 27, 28, 29], and cache-based acceleration methods[30, 31]. However, almost all of these acceleration techniques are designed for the UNet-based[32] architecture. Recently, diffusion transformers (DiT)[33] have achieved unprecedented success, exemplified by models like SD3.0[34], PIXART-α[35], and Sora[36], surpassing current UNet models to some extent. SDXL-Lightning[37] has also highlighted the redundancy of the UNet encoder. In the current landscape where DiT holds such an advantage, there has been limited inference acceleration work for DiT models. Prior work[38] introduced an early stopping strategy for DiT blocks, which requires training and is not suitable for the current context of small-step generation. For these reasons, there is an urgent need for a new acceleration framework tailored to DiT, ideally a training-free one.
However, there is still a lack of deep investigation into the DiT structure, which restrains DiT acceleration research. First, unlike the traditional UNet, DiT has a unique isotropic architecture that lacks encoders, decoders, and skip connections across depths. Consequently, existing feature reuse mechanisms such as DeepCache[30] and Faster Diffusion[31] may lose information when applied to DiT: they cache the output feature map of a block, and the DiT model, having no skip structure, then loses the sampling input from the previous step. Second, the impact of different components of the DiT structure on generated image quality remains unexplored. DiT is composed of many blocks located at different depths that play different roles: for example, the front blocks may focus on low-level information while the back blocks focus on semantic information, but little research presents a comprehensive qualitative and quantitative analysis of these blocks. This leaves us uncertain about which blocks to target when accelerating the DiT network.
For the first challenge, we propose the Δ-Cache method, which uses the offsets of features rather than the feature maps themselves as cache objects to avoid the loss of input information. Regarding the second challenge, we discover that the transformer blocks in the front part of DiT are more relevant for generating image outlines, while those in the later part are more relevant for generating image details. Combining this with previous research[39, 40, 41], which suggests that diffusion models generate outlines in the early stages of sampling and details in the later stages, we propose a stage-adaptive inference acceleration method (Δ-DiT) that aligns with this sampling characteristic. Specifically, we apply Δ-Cache to the blocks in the rear part of DiT for approximation during early-stage sampling and to the blocks in the front part during later-stage sampling. We evaluated our method on multiple datasets, including MS-COCO2017[42] and PartiPrompts[43], and conducted experiments on various DiT-architecture models such as PIXART-α[35], DiT-XL[33], and PIXART-α-LCM[35, 22]. Extensive quantitative results demonstrate the effectiveness of our method. Specifically, based on 20-step generation, we achieved a 1.6× speedup with results comparable to or better than the baseline (FID: 39.002→35.882). In the more challenging 4-step generation scenarios, our method significantly outperformed existing baseline methods at more extreme acceleration ratios (1.12×).
The contributions of our paper are as follows:
- We first discovered that the blocks at the front of DiT focus more on generating image outlines, while the blocks at the back concentrate on details. Similarly, diffusion models generate outlines in the early sampling stages and details in the later stages. These insights inspired the design of Δ-DiT.
- We propose Δ-DiT, the first training-free inference acceleration method specifically designed for diffusion transformers, which achieves acceleration by caching the back blocks of DiT in the early sampling stage and the front blocks in the later stage.
- Δ-DiT relies on our Δ-Cache, a specialized cache mechanism designed for DiT, which effectively avoids losing information from the input (the sampled result of the previous step).
- In the experiments, Δ-DiT achieves a 1.6× speedup in 20-step generation with even better generation quality. In the scenario of 4-step consistency-model generation and the more challenging 1.12× acceleration, our method significantly outperforms existing methods.
2 Related Work
Efficient Diffusion Model. To address the problem of poor real-time performance in diffusion models, various lightweight and acceleration techniques have emerged. Methods for accelerating diffusion-based image generation can be broadly categorized into three perspectives. First, making the noise estimation network lightweight: as in traditional network compression, many efforts focus on pruning[14, 15], quantization[16, 17, 18, 19], and distillation[20, 21, 22, 23, 24] of noise estimation networks to obtain a smaller model with comparable performance. Second, optimizing the sampling time steps, a dimension unique to diffusion models: most methods explore efficient ODE solvers[25, 26, 27, 28, 29], aiming to obtain high-quality samples with fewer sampling steps, while another approach optimizes sampling time steps from the perspective of skips[44]. Lastly, jointly optimizing noise estimation networks and time steps often achieves higher acceleration ratios. For instance, OMS-DPM[40] and AutoDiffusion[45] simultaneously optimize skips and allocate noise estimation networks of specific sizes to each time step. LCM[22] organizes noise estimation networks from the time-step perspective to enable generation with fewer steps. Approaches like DeepCache[30] and Faster Diffusion[31] consider information between steps, utilizing a cache mechanism to indirectly modify the network structure for acceleration. However, most of the aforementioned work is implemented and validated on the UNet architecture. Only prior work[38] proposed an early stopping strategy for DiT; it significantly impacts generation results under the current focus on few-step generation and does not explore the impact of DiT blocks on generation.
Starting from the third acceleration perspective, this paper explores the impact of DiT blocks on generation and proposes a dedicated acceleration framework urgently needed for the DiT architecture.
Cache Mechanism. The cache mechanism is a concept in computer systems that involves temporarily storing information that may be reused in the future to improve processing speed. In large language models, the KV cache[46, 47] is widely used: it caches the K and V matrices from attention computations for reuse in the next attention calculation, thereby accelerating inference. In diffusion model generation, there are also methods based on cache principles. DeepCache[30] expedites computation across adjacent steps in UNet networks by caching the output feature maps of up-sampling blocks. Faster Diffusion[31] achieves acceleration across neighboring steps by caching the output of the UNet encoder and the feature maps at skip connections. Meanwhile, TGATE[48] accelerates the later stages of sampling by caching the output feature maps of the cross-attention module. However, these methods all cache feature maps, which is unsuitable for the isotropic architecture of DiT: directly caching block output feature maps, as in DeepCache and Faster Diffusion, would lose the previous step's sampled image information in DiT, which lacks skip connections, while caching only the output feature maps of finer-grained modules, as in TGATE, brings limited acceleration. Therefore, this paper proposes Δ-Cache, which caches feature offsets, to address these issues in DiT.
3 Preliminary
The concept of diffusion originates from a branch of non-equilibrium thermodynamics[49] in physics. In recent years, researchers have applied this concept to image generation[1, 2, 3, 4, 50], transforming the process into two stages: noise diffusion and denoising.
Noise Diffusion Stage. This is also the training phase of the diffusion model. Given an original image $\bm{x}_0$ and a random time step $t\in[1,T]$ (where $T$ is the total number of steps), the image after $t$ steps of diffusion is $\sqrt{\bar{\alpha}_t}\,\bm{x}_0+\sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon}$, where $\bar{\alpha}_t$ is a constant related to $t$. The noise estimation network is then used to estimate the noise in the diffused result, making the estimated noise $\bm{\epsilon}_{\bm{\theta}}$ as close as possible to the actual noise $\bm{\epsilon}$ added during diffusion. The training objective is as follows[1]:
$$\mathcal{L}(\bm{\theta})=\mathbb{E}_{t,\,\bm{x}_0\sim q(\bm{x}),\,\bm{\epsilon}\sim\mathcal{N}(0,1)}\left[\left\|\bm{\epsilon}-\bm{\epsilon}_{\bm{\theta}}\!\left(\sqrt{\bar{\alpha}_t}\,\bm{x}_0+\sqrt{1-\bar{\alpha}_t}\,\bm{\epsilon},\,t\right)\right\|^{2}\right],\tag{1}$$
where $q(\bm{x})$ is the dataset distribution, and $\mathcal{N}$ is the Gaussian distribution. In most current works, the noise estimation networks are based on the UNet architecture. However, in isotropic architectures like DiT, $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t)$ can be further decomposed as $\bm{f}_{N_b}(\bm{f}_{N_b-1}(\cdots(\bm{f}_1(\bm{x}_t))))=\bm{f}_{N_b}\circ\bm{f}_{N_b-1}\circ\cdots\circ\bm{f}_1(\bm{x}_t)=\bm{F}_1^{N_b}(\bm{x}_t)$, where $\bm{f}_n$ represents the mapping of the $n$-th DiT block, and $\bm{F}_1^{N_b}$ represents the composite mapping of the first through the $N_b$-th DiT blocks. In the current DiT framework, $N_b=28$.
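The decomposition above can be sketched directly. The toy blocks below only mimic the one property the decomposition relies on, namely that every DiT block is shape-preserving (isotropic, no down/up-sampling); they are illustrative stand-ins, not the real attention+MLP blocks.

```python
import numpy as np

N_b = 28  # number of DiT blocks in DiT-XL / PIXART-α

def make_block(n):
    # Toy shape-preserving stand-in for the n-th DiT block f_n.
    def f(x):
        return x + 0.01 * np.tanh(x + n)
    return f

blocks = [make_block(n) for n in range(1, N_b + 1)]

def F(x, lo=1, hi=N_b):
    """Composite mapping F_lo^hi = f_hi ∘ ... ∘ f_lo."""
    for f in blocks[lo - 1:hi]:
        x = f(x)
    return x

# eps_theta(x_t) = F_1^{N_b}(x_t): the whole network is just the composition.
```

Because every block preserves the feature-map shape, the composition can be split at any index: `F(x)` equals `F(F(x, 1, k), k + 1, N_b)` for any `k`, which is what block-level caching exploits.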
Denoising Stage. Also known as the inference or reverse phase, this is the process from Gaussian noise to a generated image, which is also the target of acceleration in this paper. Initially, a random Gaussian noise $\bm{x}_T$ is given. It is input into the noise estimation network $\bm{\epsilon}_{\bm{\theta}}$ to obtain the noise estimate $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_T)$. According to a specific sampling solver, the noisy image is denoised to produce the sample $\bm{x}_{T-1}$ after one step. After iterating this process $T$ times, the final generated image is obtained. Using DDPM[1] as an example, the iterative denoising process is as follows:
$$\bm{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\left(\bm{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t)\right)+\sigma_t\bm{z},\tag{2}$$
where $\alpha_t$, $\beta_t$, and $\sigma_t$ are constants related to $t$, and $\bm{z}\sim\mathcal{N}(\bm{0},\bm{I})$. For other solvers[25, 26, 27, 28, 29], the sampling formula differs slightly from Eq. 2, but they are all functions of $\bm{x}_t$ and $\bm{\epsilon}_{\bm{\theta}}$. In many scenarios, the noise estimation network $\bm{\epsilon}_{\bm{\theta}}(\bm{x}_t,t,\bm{c})$ has another input $\bm{c}$: conditional control information, which can be either a class embedding or a text embedding.
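Eq. 2 maps directly to code. The sketch below assumes a linear beta schedule and $\sigma_t^2=\beta_t$ (one common DDPM choice), with a placeholder noise estimator; both are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear beta schedule; alpha_bar is the cumulative product of (1 - beta).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def ddpm_step(x_t, t, eps_theta):
    """One reverse step of Eq. 2, with sigma_t^2 = beta_t."""
    eps = eps_theta(x_t, t)
    # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t).
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    # No noise is added on the final step (t = 0).
    z = rng.standard_normal(x_t.shape) if t > 0 else np.zeros_like(x_t)
    return mean + np.sqrt(betas[t]) * z
```

Iterating `ddpm_step` from $t=T-1$ down to $0$, starting from Gaussian noise, yields the generated sample; conditional models additionally pass $\bm{c}$ to `eps_theta`.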
4 Stage-adaptive Inference Acceleration for Diffusion Transformers
This section introduces our stage-adaptive, training-free inference acceleration method for the diffusion transformer. In the following, we first analyze the challenges of information reuse in the current DiT model and propose a cache method specifically designed for DiT, called Δ-Cache. Second, based on this cache method, we explore the specific effects of different parts of the blocks on generation. Finally, combining these effects with the characteristics of diffusion generation, we propose a stage-adaptive method, Δ-DiT, for accelerating DiT generation.
4.1 Tailored Cache Method for DiT
Feature reuse is an important strategy in training-free inference acceleration, with cache methods being prominent. Recently, cache methods have made significant strides in accelerating inference for diffusion models, but they primarily target UNet architectures, e.g., DeepCache[30], Faster Diffusion[31], and TGATE[48]. In this section, we explain the challenges these methods face when applied to DiT and introduce our Δ-Cache method, specifically designed to address them in DiT structures.
Challenges. Figure 2a illustrates the denoising process based on a traditional UNet over two consecutive steps ($t$ and $t-1$). The entire UNet consists of downsampling modules D, upsampling modules U, a middle module M, and skip connections. A cache method in the diffusion model saves some intermediate feature maps of the previous UNet pass and reuses them for speedup in the next. For Faster Diffusion[31], the cache positions are ①-④, meaning that during the next step, the computations for D1-D3 are skipped. DeepCache[30], on the other hand, caches at position ⑤, skipping D2, D3, M, U3, and U2. TGATE[48] caches the output feature maps of the cross-attention modules within D1-D3 and U1-U3, providing less benefit due to its finer granularity, and is not discussed in detail here. Comparing the first two methods, we can see that Faster Diffusion skips the encoder (D1-D3), thus losing the input from the previous sampling step $\bm{x}_{t-1}$. In contrast, DeepCache only skips the deep network structures, so the previous sampling result can still supervise the output via the D1-skip-U1 path. However, in the DiT framework, these methods essentially converge into a single approach, since both DeepCache and Faster Diffusion cache the feature maps output by specific blocks. Figure 2b depicts an isotropic network composed of a series of DiT blocks. If we likewise use the output feature map of a specific block in DiT (say, at position ⑦) as the cache target, this approach also loses information from the previous sampling step $\bm{x}_{t-1}$, because it directly skips the computation before B3. This issue is similar to the problem encountered in Faster Diffusion, so we refer to this cache method in DiT as the DiT version of Faster Diffusion. Lacking information from the previous samples, this method is disadvantageous for generating images and can be particularly harmful in scenarios with a small number of generation steps.
Figure 2: (a) Existing cache strategies based on the UNet architecture, such as DeepCache[30] and Faster Diffusion[31]. (b) Cache strategy based on the DiT structure, where the dashed line represents the difference between two feature maps instead of the true residual structure.
Method. To address the loss of the previous sampling result caused by caching feature maps in DiT, this paper proposes a novel cache mechanism: Δ-Cache. Instead of caching feature maps, Δ-Cache caches the deviations between feature maps. As illustrated in Figure 2b, we cache the difference between the feature maps at positions ⑥ and ⑦ rather than the feature map at position ⑦. This allows us to skip the computation of B1-B3 in the next step, while the sampling result of the previous step $\bm{x}_{t-1}$ can still be incorporated into the output of the latter step via the virtual construction Δ. Mathematically, by caching $\bm{F}_1^{N_c}(\bm{x}_t)-\bm{x}_t$, we can skip the first $N_c$ blocks in DiT, which forms the initial version of Δ-Cache tailored for DiT. Further, Δ-Cache has three advantages: (a) It resolves the degraded generation quality caused by losing previous sampling information when caching the output feature maps of DiT blocks. (b) Unlike Faster Diffusion in DiT, which can only skip the first few blocks, Δ-Cache allows skipping of front, middle, or back blocks, offering greater flexibility. (c) It is well-suited for isotropic architectures such as DiT, where the output feature map scale of each block is consistent, enabling the computation of Δ.
In the experimental section, we will demonstrate the effectiveness of this method for DiT.
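The mechanism can be sketched in a few lines. The toy shape-preserving blocks below stand in for real DiT blocks, and all names are illustrative; the point is that a cached step reuses the offset Δ while still consuming the fresh input.

```python
import numpy as np

N_b, N_c = 28, 21  # total DiT blocks; blocks covered by Δ-Cache

def make_block(n):
    # Toy shape-preserving stand-in for a DiT block.
    def f(x):
        return x + 0.01 * np.tanh(x + n)
    return f

blocks = [make_block(n) for n in range(1, N_b + 1)]

def run_blocks(x, lo, hi):
    for f in blocks[lo - 1:hi]:
        x = f(x)
    return x

def full_step(x_t):
    """Full forward pass; additionally returns the offset to cache."""
    h = run_blocks(x_t, 1, N_c)
    delta = h - x_t  # Δ-Cache object: a feature offset, not a feature map
    return run_blocks(h, N_c + 1, N_b), delta

def cached_step(x_t, delta):
    """Skip blocks 1..N_c, approximating their output as x_t + cached Δ,
    so the fresh input x_t (the previous sampling result) still enters."""
    return run_blocks(x_t + delta, N_c + 1, N_b)
```

With a cache interval of $N=2$, `full_step` and `cached_step` alternate. Caching the feature map itself would instead replace `x_t + delta` with the stale intermediate `h`, discarding the new input entirely.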
4.2 Effect of DiT Blocks on Generation
Figure 3: Images generated by applying Δ-Cache to different parts of the blocks in DiT.
In the previous part, we proposed Δ-Cache, a cache mechanism for the DiT architecture that is flexible and can be applied to different parts of the DiT blocks. However, current research lacks exploration into the specific impact of different parts of the DiT blocks on the final generated image, leading to uncertainty about which blocks Δ-Cache should target. Therefore, in this part, we conduct the first exploration of this gap using Δ-Cache.
Qualitative. DiT architectures have a large number of blocks; classical DiT models like DiT-XL and PIXART-α contain 28 blocks per network. In the following, we explore the impact of these blocks on the generated results from a coarse-grained perspective. For clarity, we designate the initial block position where Δ-Cache starts as $I$, and denote the number of blocks to cache as $N_c$. In the exploratory experiments, we use $I=1$, $4$, $7$ with $N_c=21$, representing Δ-Cache on the front, middle, and back blocks of DiT, respectively. Due to the relatively small number of generation steps, the cache interval is set to $N=2$, meaning a cache is used every 2 steps. Based on these settings, we obtain the images for the different configurations shown in Figure 3. By comparing the images generated without cache and those generated with various Δ-Cache targets, several observations can be made. Caching the front blocks results in less accurate outline generation: in Figure 3a, the outlines of the blue car and the roof on the right side are clear, whereas in Figure 3b these outlines are lost, resulting in a smoother overall image appearance. Conversely, caching the back blocks preserves the outlines better but produces poorer details; in Figure 3d, the outlines of the blue car and roof are retained, but pixel-level details are lacking. Caching the middle blocks offers a compromise, with better detail generation than caching only the back blocks and better outline generation than caching only the front blocks.
Table 1: Quantitative results of caching different blocks in PIXART-α on 500 MS-COCO2017 samples.
Quantitative. The above is only a demonstration with a single prompt. To validate our findings, we conducted quantitative verification on a subset of MS-COCO2017. To quantitatively characterize outline generation, we used the average gradient of images based on the Sobel operator and the high-frequency loss based on the Discrete Fourier Transform[51]. To quantitatively assess the model's detail generation, we used the classical blind image quality assessment metric PIQE[52], which effectively captures blockiness and is well-suited to this scenario. The quantitative results are shown in Table 1. Caching the back blocks (i.e., computing only the front blocks) results in lower high-frequency loss and a higher average gradient, indicating better outline generation. Conversely, caching the front blocks results in higher high-frequency loss, a lower average gradient, and higher PIQE scores, indicating poorer local block generation and greater distortion. This further confirms that the front blocks in DiT are more related to outline generation, while the back blocks are more related to detail generation.
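The two outline metrics can be sketched as follows. These are illustrative implementations under assumed conventions (edge padding, a circular low-pass radius for "high frequency"); the paper's exact normalization is not specified here, and PIQE is omitted as an established third-party metric.

```python
import numpy as np

def average_gradient(img):
    """Mean Sobel gradient magnitude: higher means stronger edges/outlines."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(win * kx)
            gy[i, j] = np.sum(win * ky)
    return float(np.mean(np.hypot(gx, gy)))

def high_freq_loss(img, ref, radius=4):
    """Mean |difference| of high-frequency DFT components w.r.t. a reference."""
    def high(x):
        f = np.fft.fftshift(np.fft.fft2(x.astype(float)))
        h, w = x.shape
        yy, xx = np.ogrid[:h, :w]
        # Keep only frequencies outside the centered low-pass radius.
        mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 > radius ** 2
        return f * mask
    return float(np.mean(np.abs(high(img) - high(ref))))
```

Under this reading, a sharper image yields a larger `average_gradient`, and an image whose fine structure deviates from the reference yields a larger `high_freq_loss`.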
Figure 4: Overview of Δ-DiT. The diffusion model emphasizes generating outlines early in sampling and details later. Our Δ-Cache method caches back blocks for outline-friendly generation and front blocks for detail-friendly results. In Δ-DiT, the properties of the diffusion model and Δ-Cache are aligned in stages: Δ-Cache is applied to the back blocks of the DiT during the early outline generation stage of the diffusion model, and to the front blocks during the detail generation stage. The stages are bounded by a hyperparameter $b$.
4.3 Stage-adaptive Acceleration Method for DiT
After addressing cache bottlenecks in DiT and analyzing the impact of DiT blocks on generation, this part proposes Δ Δ\Delta roman_Δ-DiT, an overall training-free inference acceleration method for the DiT architecture. It combines the above findings with the sampling characteristics of diffusion. Current research[39, 40, 41] indicates that the diffusion inference process has a characteristic: in the early stages of the denoising process, diffusion models focus on generating outlines, while in the later stages, they focus more on generating details. In Section4.2, we found that caching the back DiT blocks is favorable for outline generation while caching the front DiT blocks is favorable for detail generation. Leveraging these two findings, we propose a stage-adaptive acceleration method called Δ Δ\Delta roman_Δ-DiT. The specific process is illustrated in Figure4. From left to right, it shows the denoising process of the diffusion model (from t=T 𝑡 𝑇 t=T italic_t = italic_T to t=0 𝑡 0 t=0 italic_t = 0). The upper part of the image represents the denoising results at different timesteps, while the lower part shows the high-frequency components of the differences between adjacent timestep images. Specifically, these high-frequency components are obtained by transforming the image into the frequency domain using a Discrete Fourier Transform (DFT), then isolating the high-frequency components and transforming them back to the time domain using an Inverse Discrete Fourier Transform (IDFT). In the early sampling stage, high-frequency signals mainly focus on the outlines, while in the later sampling stage, they mainly focus on details. These high-frequency signals represent the areas the model focuses on at each timestep. Since the diffusion model focuses on outline generation in the early stage and the Δ Δ\Delta roman_Δ-Cache method for the back block is outline generation friendly, Δ Δ\Delta roman_Δ-Cache is applied to the back block in the outline generation stage. 
Conversely, since the diffusion model focuses on detail generation in the later stage and Δ-Cache on the front blocks is detail-generation friendly, Δ-Cache is applied to the front blocks during the detail generation stage.
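The DFT/IDFT step described above can be sketched with NumPy; the square low-frequency mask and the `cutoff_ratio` parameter are our own illustrative assumptions, since the paper does not specify the exact filter:

```python
import numpy as np

def high_freq_residual(img_t, img_prev, cutoff_ratio=0.25):
    """Isolate the high-frequency part of the difference between two
    adjacent-timestep images (grayscale arrays of equal shape)."""
    diff = img_t.astype(np.float64) - img_prev.astype(np.float64)
    spectrum = np.fft.fftshift(np.fft.fft2(diff))  # DFT, low freqs centered
    h, w = diff.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(h * cutoff_ratio / 2), int(w * cutoff_ratio / 2)
    # Zero out the low-frequency square around the spectrum center
    spectrum[cy - ry:cy + ry, cx - rx:cx + rx] = 0
    # IDFT back to the spatial domain
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
```

A smooth, slowly varying difference (early outline changes) survives this filter only along edges, while fine-texture changes (late detail refinement) pass through almost entirely, which is what Figure 4 visualizes.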
This training-free process has one hyperparameter, b, which marks the boundary between the outline generation stage and the detail generation stage. When the sampling-step index satisfies t ≤ b (the early, outline stage), Δ-Cache is applied to the rear blocks; when t > b (the later, detail stage), Δ-Cache is applied to the front blocks. The number of blocks that require Δ-Cache is determined by the actual computational requirements. Suppose the computation cost of one block is M_b and the expected total computation cost is M_g. As previously mentioned, the cache interval is N and the number of DiT blocks is N_b. First, we roughly determine the cache interval as N = ⌈(T × N_b × M_b) / M_g⌉. In current low-step scenarios, N is often set to 2. After determining N, the actual number of blocks to cache at each cached timestep is:
$$N_{c} = N_{b} - \Bigg(\underbrace{\frac{M_{g}-(T\bmod N)\times N_{b}\times M_{b}}{\lfloor T/N\rfloor}}_{\text{the computation budget of each $N$-step group}}-\underbrace{N_{b}\times M_{b}}_{\text{the first full DiT}}\Bigg)\bigg/\underbrace{\big(M_{b}\times(N-1)\big)}_{\text{the remaining cached steps}}$$
Although the actual computation cost cannot perfectly match the ideal budget, it is very close and generally meets the requirements. Once these parameters are determined, the inference process is fixed, enabling acceleration without any further training.
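Under the definitions above, the budgeting can be sketched as follows (M_b is normalized to the cost of one block; rounding the per-step block count is our own assumption):

```python
import math

def cache_schedule(T, N_b, M_b, M_g):
    """Derive the cache interval N and the number of cached blocks N_c
    from a target compute budget M_g (sketch of the budgeting above)."""
    full = T * N_b * M_b                 # cost of fully uncached inference
    N = math.ceil(full / M_g)            # cache interval
    # Compute budget available to each group of N steps
    per_group = (M_g - (T % N) * N_b * M_b) / (T // N)
    # Blocks still computed on each of the N-1 cached steps of a group
    computed = (per_group - N_b * M_b) / (M_b * (N - 1))
    N_c = N_b - round(computed)          # blocks served from the cache
    return N, N_c
```

For example, with T=4, N_b=28, and a 1.12× target budget, this recovers N=2 and N_c=6, matching the PIXART-α-LCM setting reported in Table 3.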
5 Experiment
5.1 Experimental Settings
Models, Evaluation Data and Solvers. We conduct experiments on three diffusion-transformer-based architectures: DiT-XL [33], PIXART-α [35], and PIXART-α-LCM [35, 22]. For DiT-XL, we use the 1000 classes of ImageNet [53] to generate 50k images for evaluation. For the PIXART-α series models, we use 1.632k prompts from PartiPrompts [43] and 5k prompts from the validation set of MS-COCO2017 [42] to evaluate the quality of generated images. For consistency, our main experiments default to the 20-step DPMSolver++ [27], the default setting of PIXART-α. For consistency-model generation, we use the 4-step LCM solver [54]. In the ablation study, we compare more advanced solvers such as DEIS [28] and EulerD [29].
Baselines. To demonstrate the effectiveness of our method, we compare it with several existing fast-generation methods as well as techniques adapted from UNet acceleration: the generation methods of the base models PIXART-α [35], DiT-XL [33], and PIXART-α-LCM [35]; a DiT acceleration strategy based on the Faster Diffusion [31] concept; TGATE [48]; and Δ-Cache applied to different parts of the blocks.
Evaluation Metrics. We use a range of metrics to evaluate generation efficiency and image quality. For efficiency, MACs and latency measure the theoretical computational complexity and the practical time to generate an image, respectively; lower MACs and latency mean higher generation efficiency, and Speed denotes the resulting acceleration ratio over the baseline. For generation quality, we use the widely applied FID [55], IS [56], and CLIP-Score [57].
Table 2: Results on PIXART-α. Gate is the hyperparameter from the TGATE paper [48]. Latency is in milliseconds, measured on an A100.

| Method | MACs ↓ | Speed ↑ | Latency ↓ | FID ↓ (MS-COCO2017) | IS ↑ (MS-COCO2017) | CLIP ↑ (MS-COCO2017) | CLIP ↑ (PartiPrompts) |
|---|---|---|---|---|---|---|---|
| PIXART-α (T=20) [35] | 85.651T | 1.00× | 2290.668 | 39.002 | 31.385 | 30.417 | 30.097 |
| PIXART-α (T=13) [35] | 55.673T | 1.54× | 1565.175 | 39.989 | 30.822 | 30.399 | 29.993 |
| Faster Diffusion (I=14) [31] | 64.238T | 1.33× | 1777.144 | 41.560 | 31.233 | 30.300 | 29.958 |
| Faster Diffusion (I=21) [31] | 53.532T | 1.60× | 1517.698 | 42.763 | 30.316 | 30.227 | 29.922 |
| TGATE (Gate=10) [48] | 61.075T | 1.40× | 1718.308 | 37.413 | 31.079 | 29.782 | 29.347 |
| TGATE (Gate=8) [48] | 56.170T | 1.52× | 1603.250 | 37.539 | 30.124 | 29.021 | 28.654 |
| Δ-Cache (Front Blocks) | 53.532T | 1.60× | 1522.346 | 41.702 | 30.276 | 30.288 | 29.964 |
| Δ-Cache (Middle Blocks) | 53.532T | 1.60× | 1522.528 | 35.907 | 33.063 | 30.183 | 30.078 |
| Δ-Cache (Last Blocks) | 53.532T | 1.60× | 1522.669 | 34.819 | 32.736 | 29.898 | 30.099 |
| Ours (b=12) | 53.532T | 1.60× | 1534.551 | 35.882 | 32.222 | 30.404 | 30.123 |
Table 3: Results on PIXART-α-LCM. The default number of generation steps T is 4.

| Method | MACs ↓ | Speed ↑ | Latency ↓ | FID ↓ (MS-COCO2017) | IS ↑ (MS-COCO2017) | CLIP ↑ (MS-COCO2017) | CLIP ↑ (PartiPrompts) |
|---|---|---|---|---|---|---|---|
| PIXART-α-LCM [35] | 8.565T | 1.00× | 415.255 | 40.433 | 30.447 | 29.989 | 29.669 |
| Faster Diffusion (I=4) [31] | 7.953T | 1.08× | 401.137 | 468.772 | 1.146 | -1.738 | 1.067 |
| Faster Diffusion (I=6) [31] | 7.647T | 1.12× | 391.081 | 468.471 | 1.146 | -1.746 | 1.057 |
| TGATE (Gate=2) [48] | 7.936T | 1.08× | 400.256 | 42.038 | 29.683 | 29.908 | 29.549 |
| TGATE (Gate=1) [48] | 7.623T | 1.12× | 398.124 | 44.198 | 27.865 | 29.074 | 28.684 |
| Ours (b=2, N_c=4) | 7.953T | 1.08× | 400.132 | 39.967 | 29.667 | 29.751 | 29.449 |
| Ours (b=2, N_c=6) | 7.647T | 1.12× | 393.469 | 40.118 | 29.177 | 29.332 | 29.226 |
5.2 Comparison with the Baseline Model
We provide a comprehensive comparison with fast-generation methods for PIXART-α on both generation efficiency and image quality in Table 2. With a 1.60× speedup, the proposed method exceeds the baseline PIXART-α on all metrics except for a small gap in the MS-COCO2017 CLIP-Score. When inference costs are aligned, we surpass PIXART-α on all metrics by a large margin (e.g., FID: 39.989 → 35.882). Moreover, our method also outperforms Faster Diffusion and TGATE on all metrics on both datasets, with similar or even higher generation efficiency. In Table 4, we demonstrate the efficiency and effectiveness of our method on the DiT-XL architecture.
Table 4: Results on DiT-XL. Because TGATE only handles cross-attention, it cannot be applied to DiT-XL.
Similar to the results on the PIXART-α architecture, the proposed method outperforms Faster Diffusion and the baseline DiT-XL in FID and IS, with similar or lower inference overhead. In both tables, Δ-Cache shows strong results, but no single variant is best on all metrics. For example, on MS-COCO2017, Δ-Cache (Last Blocks) attains the best FID (34.819), while its CLIP-Score (29.898) is lower than most other settings. Δ-DiT, in contrast, demonstrates outstanding performance across metrics and datasets.
5.3 Performance under the Consistent Model
The consistency model [54, 22] opens a new track for few-step generation. Essentially, the consistency model can be seen as an already-refined version of the base model, making it highly challenging to accelerate further. We evaluate our method in this extreme scenario, with results shown in Table 3. Methods like Faster Diffusion [31], which lack supervision from the previous step's images, perform disastrously in small-step scenarios, exhibiting poor image generation metrics (FID=383.812). Existing methods such as TGATE [48] achieve decent results in this context; however, at an acceleration ratio of approximately 1.12×, they show significant performance drops (FID: 40.433 → 44.198). In contrast, our method significantly outperforms these baselines across multiple datasets and metrics under these more extreme conditions. The compression phase of the consistency model's DiT blocks undergoes more pronounced encoding changes, hence a larger I value is chosen, in this case 8. Furthermore, FlashEval [58] can be used to quickly evaluate and determine a more appropriate value.
5.4 Ablation Study
Compatibility with advanced solvers. The main experiments use the default solver, DPMSolver++ [27]. Table 5 shows the results of our method with several more advanced solvers. The conclusions remain consistent across these solvers, demonstrating that our method is compatible with current advanced solvers. The classic DDIM solver [25] performs poorly on the 20-step PIXART-α, so it is not included in the table.
Table 5: Performance under different advanced solvers, measured on MS-COCO2017.
Effect of opposite stage adaptation. Δ-DiT applies Δ-Cache to the rear blocks during the early sampling stage and to the front blocks during the later stage. Here, we reverse the cache order, applying Δ-Cache to the front blocks during the early stages and the rear blocks during the later stages. Table 6 compares the results before and after reversing the order. The results deteriorate significantly after reversal, which further validates the correctness of our original design.
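The default and reversed orders can be sketched as a block-selection rule; the 0-based block indexing and the `reverse` flag are our own illustrative conventions:

```python
def cached_block_ids(step, b, num_blocks, n_cached, reverse=False):
    """Indices of DiT blocks served from Δ-Cache at a sampling step.

    step counts up from 1 (first denoising step) to T (last);
    b is the boundary between the outline and detail stages;
    reverse=True flips the order, as in this ablation.
    """
    rear = list(range(num_blocks - n_cached, num_blocks))
    front = list(range(n_cached))
    early = step <= b  # outline-generation stage
    if reverse:
        return front if early else rear
    # Default Δ-DiT: rear blocks early (outline), front blocks late (detail)
    return rear if early else front
```

With the PIXART-α setting (28 blocks, b=12, 14 cached blocks), steps 1–12 skip blocks 14–27 and steps 13–20 skip blocks 0–13; `reverse=True` swaps the two sets.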
Illustration of the increasing bound b. Figure 5 shows the impact of the bound on the generation results. As b increases from 0 to 20, FID and IS reach their best values around b=16, while the CLIP score peaks around b=8. At b=12, the three generation metrics are the best overall.
Illustration of the increasing number of cached blocks N_c. Figure 6 shows the impact of the number of cached blocks N_c on the generation results. As N_c increases from 0 to 28, FID reaches its optimum around N_c=14, while IS and the CLIP score peak around N_c=21. At N_c=14 or 21, the three generation metrics are the best overall.
Table 6: Results of opposite stage adaptation on PIXART-α and DiT-XL.
Figure 5: Ablation study on b.
Figure 6: Ablation study on N_c.
6 Conclusion and Limitation
This paper considers the unique structure of DiT and proposes a training-free cache mechanism, Δ-Cache, specifically designed for it. Furthermore, we qualitatively and quantitatively explore the relationship between the front blocks of DiT and outline generation, and between the rear blocks and detail generation. Based on these findings and the sampling properties of diffusion, we propose the stage-adaptive acceleration method Δ-DiT, which applies Δ-Cache to different blocks of DiT at different stages of sampling. Extensive experiments confirm the effectiveness of our approach. However, our exploration of the relationship between DiT blocks and the final generated image remains preliminary and coarse-grained, providing a foundation for more fine-grained exploration in future work. We believe that more refined search or learning strategies will yield even greater benefits.
References
- [1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
- [2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [3] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- [4] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
- [5] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6007–6017, 2023.
- [6] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- [7] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22819–22829, 2023.
- [8] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.
- [9] Shentong Mo, Enze Xie, Ruihang Chu, Lanqing Hong, Matthias Nießner, and Zhenguo Li. Dit-3d: Exploring plain diffusion transformers for 3d shape generation. Advances in neural information processing systems, 2023.
- [10] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
- [11] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
- [12] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [13] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10209–10218, 2023.
- [14] Gongfan Fang, Xinyin Ma, and Xinchao Wang. Structural pruning for diffusion models. Advances in neural information processing systems, 2023.
- [15] Dingkun Zhang, Sijia Li, Chen Chen, Qingsong Xie, and Haonan Lu. Laptop-diff: Layer pruning and normalized distillation for compressing diffusion models. arXiv preprint arXiv:2404.11098, 2024.
- [16] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1972–1981, 2023.
- [17] Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, and Eunhyeok Park. Temporal dynamic quantization for diffusion models. Advances in neural information processing systems, 2023.
- [18] Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. PTQD: accurate post-training quantization for diffusion models. Advances in neural information processing systems, 2023.
- [19] Yanjing Li, Sheng Xu, Xianbin Cao, Xiao Sun, and Baochang Zhang. Q-DM: an efficient low-bit quantized diffusion model. Advances in neural information processing systems, 2023.
- [20] Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. On architectural compression of text-to-image diffusion models. arXiv preprint arXiv:2305.15798, 2023.
- [21] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
- [22] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
- [23] Yang Zhao, Yanwu Xu, Zhisheng Xiao, and Tingbo Hou. Mobilediffusion: Subsecond text-to-image generation on mobile devices. arXiv preprint arXiv:2311.16567, 2023.
- [24] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
- [25] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
- [26] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
- [27] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
- [28] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. In International Conference on Learning Representations, 2023.
- [29] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
- [30] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. arXiv preprint arXiv:2312.00858, 2023.
- [31] Senmao Li, Taihang Hu, Fahad Shahbaz Khan, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, and Jian Yang. Faster diffusion: Rethinking the role of unet encoder in diffusion models. arXiv preprint arXiv:2312.09608, 2023.
- [32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015.
- [33] William Peebles and Saining Xie. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182. IEEE, 2023.
- [34] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
- [35] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
- [36] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024.
- [37] Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl-lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024.
- [38] Taehong Moon, Moonseok Choi, EungGu Yun, Jongmin Yoon, Gayoung Lee, and Juho Lee. Early exiting for accelerated inference in diffusion models. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
- [39] Binxu Wang and John J Vastola. Diffusion models generate images like painters: an analytical theory of outline first, details later. arXiv preprint arXiv:2303.02490, 2023.
- [40] Enshu Liu, Xuefei Ning, Zinan Lin, Huazhong Yang, and Yu Wang. OMS-DPM: optimizing the model schedule for diffusion probabilistic models. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 21915–21936, 2023.
- [41] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In International Conference on Learning Representations, 2023.
- [42] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- [43] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. Trans. Mach. Learn. Res., 2022.
- [44] Yunke Wang, Xiyu Wang, Anh-Dung Dinh, Bo Du, and Charles Xu. Learning to schedule in diffusion probabilistic models. In KDD, pages 2478–2488, 2023.
- [45] Lijiang Li, Huixia Li, Xiawu Zheng, Jie Wu, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan, Fei Chao, and Rongrong Ji. Autodiffusion: Training-free optimization of time steps and architectures for automated diffusion model acceleration. In ICCV, pages 7082–7091. IEEE, 2023.
- [46] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, and Beidi Chen. H2O: heavy-hitter oracle for efficient generative inference of large language models. Advances in neural information processing systems, 2023.
- [47] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023.
- [48] Wentian Zhang, Haozhe Liu, Jinheng Xie, Francesco Faccio, Mike Zheng Shou, and Jürgen Schmidhuber. Cross-attention makes inference cumbersome in text-to-image diffusion models. arXiv preprint arXiv:2404.02747, 2024.
- [49] Sybren Ruurds De Groot and Peter Mazur. Non-equilibrium thermodynamics. Courier Corporation, 2013.
- [50] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, pages 11895–11907, 2019.
- [51] Xingyi Yang, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Diffusion probabilistic model made slim. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22552–22562, 2023.
- [52] N Venkatanath, D Praneeth, Maruthi Chandrasekhar Bh, Sumohana S Channappayya, and Swarup S Medasani. Blind image quality evaluation using perception based features. In 2015 twenty first national conference on communications (NCC), pages 1–6. IEEE, 2015.
- [53] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- [54] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
- [55] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
- [56] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In EMNLP (1), pages 7514–7528. Association for Computational Linguistics, 2021.
- [57] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
- [58] Lin Zhao, Tianchen Zhao, Zinan Lin, Xuefei Ning, Guohao Dai, Huazhong Yang, and Yu Wang. Flasheval: Towards fast and accurate evaluation of text-to-image diffusion generative models. arXiv preprint arXiv:2403.16379, 2024.