
Title: Target-Driven Distillation: Consistency Distillation with Target Timestep Selection and Decoupled Guidance

URL Source: https://arxiv.org/html/2409.01347

Published Time: Wed, 04 Sep 2024 01:40:33 GMT

Markdown Content:

Abstract

Consistency distillation methods have demonstrated significant success in accelerating generative tasks of diffusion models. However, since previous consistency distillation methods use simple and straightforward strategies for selecting target timesteps, they usually struggle with blur and loss of detail in generated images. To address these limitations, we introduce Target-Driven Distillation (TDD), which (1) adopts a delicate selection strategy for target timesteps, increasing training efficiency; (2) uses decoupled guidance during training, making TDD open to post-tuning of the guidance scale at inference time; (3) can be optionally equipped with non-equidistant sampling and $\mathbf{x}_0$ clipping, enabling a more flexible and accurate way of sampling images. Experiments verify that TDD achieves state-of-the-art performance in few-step generation, offering a better choice among consistency distillation models.

Image 1: [Uncaptioned image]

Figure 1: Visual comparison among different methods. Additionally, we have released a detailed comparison between our method and TCD. Our method demonstrates advantages in both image complexity and clarity.

1 Introduction

Diffusion models (Sohl-Dickstein et al. 2015; Song and Ermon 2019; Karras et al. 2022) have demonstrated exceptional performance in image generation, producing high-quality and diverse images. Unlike earlier models such as GANs (Goodfellow et al. 2014; Karras, Laine, and Aila 2019) or VAEs (Kingma and Welling 2013; Sohn, Lee, and Yan 2015), diffusion models are good at modeling complex image distributions and conditioning on non-label conditions such as free-form text prompts. However, since diffusion models adopt iterative denoising processes, they usually take substantial time to generate images. To address this challenge, consistency distillation methods (Song et al. 2023; Luo et al. 2023a, b; Kim et al. 2023; Zheng et al. 2024; Wang et al. 2024) have been proposed as effective strategies to accelerate generation while maintaining image quality. These methods distill pretrained diffusion models following the self-consistency property, i.e., the predictions from any two neighboring timesteps towards the same target timestep are regularized to be the same. According to the choice of target timesteps, recent consistency distillation methods can be categorized as single-target distillation and multi-target distillation, as illustrated in Figure 2.

Single-target distillation methods follow a one-to-one mapping when choosing target timesteps, that is, they always choose the same target timestep each time they come to a certain timestep along the trajectory of PF-ODE (Song et al. 2020). One straightforward choice is mapping any timestep to the final timestep at 0 (Song et al. 2023; Luo et al. 2023a). However, these methods usually suffer from the accumulated error of long-distance predictions. Another choice is evenly partitioning the full trajectory into several sub-trajectories and mapping a timestep to the end of the sub-trajectory it belongs to (Wang et al. 2024). Although the error can be reduced by shortening the predicting distances when training, the image quality will be suboptimal when adopting a schedule with a different number of sub-trajectories during inference periods.

On the other hand, multi-target distillation methods follow a one-to-multiple mapping, that is, possibly different target timesteps may be chosen each time they come to a certain timestep. A typical choice is mapping the current timestep to a random target timestep ahead (Kim et al. 2023; Zheng et al. 2024). Theoretically, these methods are trained to predict from any to any timestep, thus may generally achieve good performance under different schedules. Yet, practically most of these predictions are redundant since we will never go through them under common denoising schedules. Hence, multi-target distillation methods usually require a high time budget to train.

To mitigate the aforementioned issues, we propose Target-Driven Distillation (TDD), a multi-target approach that emphasizes delicately selected target timesteps during distillation processes. Our method involves three key designs: Firstly, for any timestep, it selects a nearby timestep forward that falls into a few-step equidistant denoising schedule of a predefined set of schedules (e.g. 4–8 steps), which eliminates long-distance predictions while only focusing on the timesteps we will probably pass through during inference periods under different schedules. Also, TDD incorporates a stochastic offset that further pushes the selected timestep ahead towards the final target timestep, in order to accommodate non-deterministic sampling such as $\gamma$-sampling (Kim et al. 2023). Secondly, while distilling classifier-free guidance (CFG) (Ho and Salimans 2022) into the distilled models, to align with the standard training process using CFG, TDD additionally replaces a portion of the text conditions with unconditional (i.e. empty) prompts. With such a design, TDD is open to a proposed inference-time tuning technique on guidance scale, allowing user-specified balances between the accuracy and the richness of image contents conditioned on text prompts. Finally, TDD is optionally equipped with a non-equidistant sampling method doing short-distance predictions at initial steps and long-distance ones at later steps, which helps to improve overall image quality. Additionally, TDD adopts $\mathbf{x}_0$ clipping to prevent out-of-bound predictions and address the overexposure issue.

Our contributions are summarized as follows:

  • We provide a taxonomy on consistency distillation models, classifying previous works as single-target and multi-target distillation methods.
  • We propose Target-Driven Distillation, which highlights target timestep selection and decoupled guidance during distillation processes.
  • We present extensive experiments to validate the effectiveness of our proposed distillation method.

Image 2: Refer to caption

Figure 2: Comparison of different distillation methods. $\tau^{k_1}_m$ and $\tau^{k_2}_m$ represent a target timestep when divided into $k_1$ and $k_2$, respectively. LCM (a) and PCM (b) are examples of single-target distillation, where $\mathbf{x}_{t_n}$ corresponds to only one target timestep. In contrast, CTM (c) and ours (d) are multi-target distillation methods, where $\mathbf{x}_{t_n}$ can correspond to multiple target timesteps.

2 Related Work

Diffusion models (Sohl-Dickstein et al. 2015; Song and Ermon 2019; Karras et al. 2022) have demonstrated significant advantages in high-quality image synthesis (Ramesh et al. 2022; Rombach et al. 2022; Dhariwal and Nichol 2021), image editing (Meng et al. 2021; Saharia et al. 2022a; Balaji et al. 2022), and specialized tasks such as layout generation (Zheng et al. 2023; Wu et al. 2024). However, their multi-step iterative process incurs significant computational costs, hindering real-time applications. Beyond developing faster samplers (Song, Meng, and Ermon 2020; Lu et al. 2022a, b; Zhang and Chen 2022), there is growing interest in model distillation approaches (Sauer et al. 2023; Liu, Gong, and Liu 2022; Sauer et al. 2024; Yin et al. 2024). Among these, distillation methods based on consistency models have proven particularly effective in accelerating processes while preserving output similarity between the original and distilled models.

Song et al. introduced the concept of consistency models, which emphasize the importance of achieving self-consistency across arbitrary pairs of points on the same probability flow ordinary differential equation (PF-ODE) trajectory (Song et al. 2020). This approach is particularly effective when distilled from a teacher model or when incorporating modules like LCM-LoRA (Luo et al. 2023b), which can achieve few-step generation with minimal retraining resources.

However, a key limitation of these models is the increased learning difficulty when mapping points further from timestep 0, leading to suboptimal performance when mapping from pure noise in a single step. Phased Consistency Models (PCM) (Wang et al. 2024) address this by dividing the ODE trajectory into multiple sub-trajectories, reducing learning difficulty by mapping each point within a sub-trajectory to its initial point. However, in these methods, each point is mapped to a unique target timestep during distillation, resulting in suboptimal inference when using other timesteps.

Recent advancements, such as Consistency Trajectory Models (CTM) (Kim et al. 2023) and Trajectory Consistency Distillation (TCD) (Zheng et al. 2024), aim to overcome this by enabling consistency models to perform anytime-to-anytime jumps, allowing all points between timestep 0 and the inference timestep to be used as target timesteps. However, the inclusion of numerous unused target timesteps reduces training efficiency and makes the model less sensitive to fewer-step denoising timesteps.

3 Method

In this section, we first present some preliminaries in section 3.1, followed by detailed descriptions of our proposed Target-Driven Distillation in sections 3.2, 3.3 and 3.4.

3.1 Preliminaries

Diffusion Model

Diffusion models constitute a category of generative models that draw inspiration from thermodynamics and stochastic processes and encompass both a forward process and a reverse process. The forward process is modeled as a stochastic differential equation (SDE) (Song et al. 2020; Karras et al. 2022). Let $p_{\text{data}}(\mathbf{x})$ denote the data distribution and $p_t(\mathbf{x})$ the distribution of $\mathbf{x}$ at time $t$. For a given set $\{\mathbf{x}_t \mid t \in [0, T]\}$, the stochastic trajectory is described by:

$$\mathrm{d}\mathbf{x}_t = f(\mathbf{x}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t, \tag{1}$$

where $\mathbf{w}_t$ represents standard Brownian motion, $f(\mathbf{x}_t, t)$ is the drift coefficient for deterministic changes, and $g(t)$ is the diffusion coefficient for stochastic variations. At $t = 0$, we have $p_0(\mathbf{x}) \equiv p_{\text{data}}(\mathbf{x})$.

Any diffusion process described by an SDE can be represented by a deterministic process described by an ODE that shares identical marginal distributions, referred to as a Probability Flow ODE (PF-ODE). The PF-ODE is formulated as:

$$\mathrm{d}\mathbf{x} = \left[f(\mathbf{x}, t) - \frac{1}{2} g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t, \tag{2}$$

where $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ represents the gradient of the log-density of the data distribution $p_t(\mathbf{x})$, known as the score function. Empirically, we approximate this score function with a score model $s_\phi(\mathbf{x}, t)$ trained via score matching techniques. Although there are numerous methods (Song, Meng, and Ermon 2020; Lu et al. 2022a, b; Karras et al. 2022) available to solve ODE trajectories, they still necessitate a large number of sampling steps to attain high-quality generation results.
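To make the PF-ODE concrete, here is a minimal sketch on a toy problem rather than the paper's setup. It assumes $f = 0$, $g(t)^2 = 2t$, and one-dimensional Gaussian data $\mathcal{N}(0, s^2)$, so that $p_t = \mathcal{N}(0, s^2 + t^2)$ and the score is known in closed form. The function names are illustrative only. The code integrates Equation 2 from $t = T$ down to $t = 0$ with plain Euler steps and compares the result with the analytic solution.

```python
import math


def pf_ode_drift(x: float, t: float, s: float) -> float:
    """dx/dt of the PF-ODE for the assumed toy case f = 0, g(t)^2 = 2t.

    With data ~ N(0, s^2), the marginal is N(0, s^2 + t^2), so the score is
    -x / (s^2 + t^2) and dx/dt = -0.5 * g(t)^2 * score = t * x / (s^2 + t^2).
    """
    score = -x / (s * s + t * t)
    return -0.5 * (2.0 * t) * score


def euler_solve(x_T: float, T: float, s: float, num_steps: int) -> float:
    """Integrate the PF-ODE backwards from t = T to t = 0 with Euler steps."""
    x, t = x_T, T
    dt = T / num_steps
    for _ in range(num_steps):
        x -= dt * pf_ode_drift(x, t, s)  # move from t to t - dt
        t -= dt
    return x


def exact_solution(x_T: float, T: float, s: float) -> float:
    """Closed-form x_0 for the toy ODE: x_t is proportional to sqrt(s^2 + t^2)."""
    return x_T * math.sqrt((s * s) / (s * s + T * T))


# Compare a coarse and a fine Euler discretisation against the exact answer.
coarse_error = abs(euler_solve(5.0, 10.0, 1.0, 4) - exact_solution(5.0, 10.0, 1.0))
fine_error = abs(euler_solve(5.0, 10.0, 1.0, 2000) - exact_solution(5.0, 10.0, 1.0))
```

In this toy case the coarse discretisation leaves a much larger error than the fine one, which mirrors why generic ODE solvers need many sampling steps.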

Consistency Distillation

To render a unified representation across all consistency distillation methods, we define the teacher model as $\phi$, the consistency function with the student model as $\boldsymbol{f}_\theta$, the conditional prompt as $c$, and the ODE solver $\Phi(\cdots; \phi)$ predicting from a certain timestep $t_{n+1}$ to its previous timestep $t_n$ following an equidistant schedule from $T$ to $0$. With a certain point $\mathbf{x}_{t_{n+1}}$ on the trajectory at timestep $t_{n+1}$, and its previous point $\hat{\mathbf{x}}^{\phi}_{t_n}$ predicted by $\Phi(\cdots; \phi)$, the core consistency loss can be formulated as

$$\mathcal{L}_{\text{CMs}} := \left\| \boldsymbol{f}_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}, \tau) - \boldsymbol{f}_{\theta^-}(\hat{\mathbf{x}}_{t_n}^{\phi}, t_n, \tau) \right\|_2^2, \tag{3}$$

where $\boldsymbol{f}_{\theta^-}$ is the consistency function with a target model updated with the exponential moving average (EMA) from the student model, and $\tau$ refers to the target timestep.
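The EMA update of the target model can be sketched as follows. This is an illustration under assumptions: the text does not spell out the update rule, so the common convention $\theta^- \leftarrow \mu\,\theta^- + (1 - \mu)\,\theta$ is assumed, with $\mu$ playing the role of the EMA rate listed among the inputs of Algorithm 1. Parameters are modelled as flat lists of floats, and `ema_update` is a hypothetical helper name.

```python
from typing import List


def ema_update(target_params: List[float],
               student_params: List[float],
               mu: float) -> List[float]:
    """Return new target parameters: mu * target + (1 - mu) * student.

    Assumed convention; a rate mu close to 1 makes the target model change slowly.
    """
    return [mu * tp + (1.0 - mu) * sp
            for tp, sp in zip(target_params, student_params)]


# With mu = 0.5 the target moves halfway towards the student.
updated = ema_update([1.0, 0.0], [0.0, 2.0], 0.5)
```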

Among the mainstream distillation methods, the choices of $\tau$ are the most critical differences (see Figure 2). Single-target distillation methods select the same $\tau$ each time when predicting from a certain $t_{n+1}$. For example, CM (Song et al. 2023) sets $\tau = 0$ for any timestep $t_{n+1}$, while PCM (Wang et al. 2024) segments the full trajectory into $\mathcal{K}$ (e.g. 4) phased sub-trajectories, and chooses the next ending point:

$$\tau = \max\left\{\tau \in \left\{0, \frac{T}{\mathcal{K}}, \frac{2T}{\mathcal{K}}, \ldots, \frac{(\mathcal{K}-1)T}{\mathcal{K}}\right\} \;\middle|\; \tau < t_n\right\}. \tag{4}$$

On the other hand, multi-target distillation methods may select different values for $\tau$ each time predicting from $t_{n+1}$. For instance, CTM (Kim et al. 2023) selects a random $\tau$ within the interval $[0, t_n]$.
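The three target-selection rules just described can be contrasted in a few lines. This is a sketch, assuming continuous timesteps on $[0, T]$; the function names are illustrative and do not come from any of the cited codebases.

```python
import random


def cm_target(t_n: float) -> float:
    """CM: every timestep maps to the final timestep 0."""
    return 0.0


def pcm_target(t_n: float, T: float, num_phases: int) -> float:
    """PCM (Equation 4): largest phase boundary strictly below t_n.

    Requires t_n > 0, otherwise no boundary satisfies tau < t_n.
    """
    boundaries = [m * T / num_phases for m in range(num_phases)]
    return max(b for b in boundaries if b < t_n)


def ctm_target(t_n: float, rng: random.Random) -> float:
    """CTM: a random target drawn from the interval [0, t_n]."""
    return rng.uniform(0.0, t_n)


# Example with T = 1000 and 4 phases: boundaries are 0, 250, 500, 750.
example_pcm = pcm_target(600.0, 1000.0, 4)
```

The single-target rules are deterministic functions of `t_n`, while the CTM-style rule returns a different target on each call.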

Our TDD, different from previous approaches that rely on simple trajectory segmentation or selecting from all possible $\tau$, employs a strategic selection of $\tau$, detailed in section 3.2. According to the taxonomy we provide in this work, TDD is a multi-target distillation method, yet we strive to reduce training on redundant predictions that are unnecessary for inference.

3.2 Target Timestep Selection

First, TDD pre-determines a set of equidistant denoising schedules, whose numbers of denoising steps range from $\mathcal{K}_{\min}$ to $\mathcal{K}_{\max}$, that we may adopt during inference periods. In the full trajectory of a PF-ODE from $T$ to $0$, for each $k \in [\mathcal{K}_{\min}, \mathcal{K}_{\max}]$, the corresponding schedule includes timesteps $\{\tau_m^k\}_{m=0}^{k-1}$ where $\tau_m^k = \frac{mT}{k}$. Then, we can define the union of all the timesteps of these schedules as

$$\mathcal{T} = \bigcup_{k=\mathcal{K}_{\min}}^{\mathcal{K}_{\max}} \{\tau_m^k\}_{m=0}^{k-1}, \tag{5}$$

which includes all the possible timesteps that we may choose as target timesteps. Note that Equation 5 is a generalized formulation, where $\mathcal{K}_{\min} = \mathcal{K}_{\max} = 1$ for CM, $1 < \mathcal{K}_{\min} = \mathcal{K}_{\max} < N$ for PCM, and $\mathcal{K}_{\min} = \mathcal{K}_{\max} = N$ for CTM, where $N$ is the total number of predictions within the equidistant schedule used by the ODE solver $\Phi$. As for our TDD, we cover commonly used few-step denoising schedules. For instance, typical values for $\mathcal{K}_{\min}$ and $\mathcal{K}_{\max}$ are $4$ and $8$, respectively.
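The construction of $\mathcal{T}$ in Equation 5 can be sketched directly. The value $T = 1000$ below is an assumption for illustration, and `build_target_set` is a hypothetical helper name. Exact fractions are used so that timesteps shared by several schedules (for example $T/2$) are merged without floating-point issues.

```python
from fractions import Fraction
from typing import List


def build_target_set(T: int, k_min: int, k_max: int) -> List[Fraction]:
    """Union over k in [k_min, k_max] of the schedule {m * T / k : m = 0..k-1}."""
    targets = set()
    for k in range(k_min, k_max + 1):
        targets.update(Fraction(m * T, k) for m in range(k))
    return sorted(targets)


# Typical TDD setting from the text: schedules with 4 to 8 steps.
tdd_targets = build_target_set(1000, 4, 8)

# Special cases noted for Equation 5: K_min = K_max = 1 leaves only timestep 0,
# and K_min = K_max = 4 gives a single four-phase schedule.
cm_targets = build_target_set(1000, 1, 1)
pcm_targets = build_target_set(1000, 4, 4)
```

Under these assumed values, the union for 4 to 8 steps contains 22 distinct timesteps.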

Based on the condition, we establish the consistency function as:

$$\boldsymbol{f}: (\mathbf{x}_t, t, \tau) \mapsto \mathbf{x}_\tau, \tag{6}$$

where $t \in [0, T]$ and $\tau \in \mathcal{T}$, and we expect that the predicted results to the specific target timestep $\tau$ will be consistent.
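As a small illustration of what consistency towards a target timestep means, consider a toy PF-ODE whose trajectories are known in closed form. The setup is an assumption for illustration only: $f = 0$, $g(t)^2 = 2t$, and one-dimensional data $\mathcal{N}(0, s^2)$, for which $x_t \propto \sqrt{s^2 + t^2}$ along a trajectory. The exact consistency function then maps any point on a trajectory to the same $\mathbf{x}_\tau$.

```python
import math


def toy_consistency_fn(x_t: float, t: float, tau: float, s: float = 1.0) -> float:
    """Exact map (x_t, t, tau) -> x_tau for the assumed toy PF-ODE."""
    return x_t * math.sqrt((s * s + tau * tau) / (s * s + t * t))


# Three points on one trajectory, at t = 10, 8 and 6.
x_10 = 5.0
x_8 = toy_consistency_fn(x_10, 10.0, 8.0)
x_6 = toy_consistency_fn(x_10, 10.0, 6.0)

# Predictions to the same target tau = 2.5 from each of the three points.
pred_from_10 = toy_consistency_fn(x_10, 10.0, 2.5)
pred_from_8 = toy_consistency_fn(x_8, 8.0, 2.5)
pred_from_6 = toy_consistency_fn(x_6, 6.0, 2.5)
```

All three predictions agree up to floating-point rounding, which is the property a learned $\boldsymbol{f}_\theta$ is trained to approximate.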

Training

Although $\mathcal{T}$ is already a selected set of timesteps, predicting to an arbitrary timestep in $\mathcal{T}$ still introduces redundancy, as it is unnecessary for the model to learn long-distance predictions from a large timestep $t$ to a small $\tau$ in the context of few-step sampling. Therefore, we introduce an additional constraint $e = \frac{T}{\mathcal{K}_{\min}}$. This constraint further narrows possible choices at timestep $t$, reducing the learning difficulty. Formally, we uniformly select

$$\tau_m \sim \mathcal{U}(\{\tau \in \mathcal{T} \mid t - e \leq \tau \leq t\}). \tag{7}$$

Besides, $\gamma$-sampling (Kim et al. 2023) is commonly used in few-step generation to introduce randomness and stabilize outputs. To accommodate this, we introduce an additional hyperparameter $\eta \in [0, 1]$. The final consistency target timesteps are selected following

$$\tilde{\tau}_m \sim \mathcal{U}([(1 - \eta)\tau_m, \tau_m]). \tag{8}$$
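The two sampling steps in Equations 7 and 8 can be sketched as follows. The concrete values ($T = 1000$, $\mathcal{K}_{\min} = 4$, $\mathcal{K}_{\max} = 8$, $\eta = 0.3$) and the helper name `sample_target_timestep` are assumptions for illustration, not the authors' implementation.

```python
import random
from typing import List, Tuple


def sample_target_timestep(t: float,
                           target_set: List[float],
                           e: float,
                           eta: float,
                           rng: random.Random) -> Tuple[float, float]:
    """Draw (tau_m, tau_tilde) following Equations 7 and 8."""
    # Equation 7: uniform choice among targets in T that lie in [t - e, t].
    candidates = [tau for tau in target_set if t - e <= tau <= t]
    tau_m = rng.choice(candidates)
    # Equation 8: stochastic offset that pushes the target towards timestep 0.
    tau_tilde = rng.uniform((1.0 - eta) * tau_m, tau_m)
    return tau_m, tau_tilde


# Assumed settings: T = 1000 with schedules of 4 to 8 steps.
T, K_MIN, K_MAX = 1000.0, 4, 8
TARGET_SET = sorted({m * T / k for k in range(K_MIN, K_MAX + 1) for m in range(k)})
E = T / K_MIN  # window size e = T / K_min = 250

rng = random.Random(0)
tau_m, tau_tilde = sample_target_timestep(600.0, TARGET_SET, E, 0.3, rng)
```

Because the $\mathcal{K}_{\min}$-step schedule has spacing exactly $e$, the candidate window $[t - e, t]$ always contains at least one element of $\mathcal{T}$ for $t \in [0, T]$. Setting $\eta = 0$ recovers $\tilde{\tau}_m = \tau_m$.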

Define the solution using the master teacher model $T_\phi$ from $\mathbf{x}_{t_{n+1}}$ to $\mathbf{x}_{t_n}$ with a PF-ODE solver as follows:

$$\hat{\mathbf{x}}^{\phi}_{t_n} = \Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}, t_n; T_\phi), \tag{9}$$

where $\Phi(\cdots; T_\phi)$ is the update function and $\hat{\mathbf{x}}^{\phi}_{t_n}$ is an accurate estimate of $\mathbf{x}_{t_n}$ from $\mathbf{x}_{t_{n+1}}$. The loss function of TDD can be defined as:

$$\mathcal{L}_{\text{TDD}}(\theta, \theta^-; \phi) := E\left[\sigma(t_n, \tilde{\tau}_m) \left\| \boldsymbol{f}_\theta(\mathbf{x}_{t_{n+1}}, t_{n+1}, \tilde{\tau}_m) - \boldsymbol{f}_{\theta^-}(\hat{\mathbf{x}}_{t_n}^{\phi}, t_n, \tilde{\tau}_m) \right\|_2^2\right], \tag{10}$$

where the expectation is over $\mathbf{x} \sim p_{\text{data}}$, $n \sim \mathcal{U}[1, N-1]$, $\mathbf{x}_{t_{n+1}} \sim \mathcal{N}(\mathbf{x}; t_{n+1}^2 I)$, and $\hat{\mathbf{x}}_{t_n}^{\phi}$ is defined by Equation 9. Here, $\mathcal{U}[1, N-1]$ denotes a uniform distribution over $1$ to $N-1$, where $N$ is a positive integer. $\sigma(\cdot, \cdot)$ is a positive weighting function; following CM, we set $\sigma(t_n, \tilde{\tau}_m) \equiv 1$. For a detailed description of our algorithm, please refer to Algorithm 1.
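The per-sample term inside the expectation of Equation 10 can be sketched as below. This is an illustration under assumptions: the student and target consistency functions are passed in as plain Python callables on lists of floats, whereas a real implementation would use neural networks and an autodiff framework. The name `tdd_loss_single` is hypothetical.

```python
from typing import Callable, List

ConsistencyFn = Callable[[List[float], float, float], List[float]]


def tdd_loss_single(f_student: ConsistencyFn,
                    f_target: ConsistencyFn,
                    x_next: List[float], t_next: float,
                    x_hat: List[float], t_n: float,
                    tau_tilde: float,
                    weight: float = 1.0) -> float:
    """Weighted squared L2 distance between the two predictions at tau_tilde.

    x_next is the noisy sample at t_{n+1}; x_hat is the teacher's ODE-solver
    estimate at t_n (Equation 9). weight = 1 matches sigma(t_n, tau) = 1.
    """
    pred_student = f_student(x_next, t_next, tau_tilde)
    pred_target = f_target(x_hat, t_n, tau_tilde)
    sq_norm = sum((a - b) ** 2 for a, b in zip(pred_student, pred_target))
    return weight * sq_norm


# Toy stand-ins for the student and EMA target models (not real networks).
def toy_student(x: List[float], t: float, tau: float) -> List[float]:
    return [0.5 * xi for xi in x]


def toy_target(x: List[float], t: float, tau: float) -> List[float]:
    return list(x)


loss = tdd_loss_single(toy_student, toy_target,
                       [2.0, 4.0], 800.0, [1.0, 1.0], 780.0, 600.0)
```

Averaging this quantity over sampled data points, timesteps, and targets gives a Monte Carlo estimate of the expectation.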

Algorithm 1 Target-Driven Distillation

Input: dataset $\mathcal{D}$, learning rate $\delta$, the update function of ODE solver $\Phi(\cdots; \cdot)$, EMA rate $\mu$, noise schedule $\alpha_t, \sigma_t$, number of ODE steps $N$, fixed guidance scale $\omega'$, empty prompt ratio $\rho$.

Parameter: initial model parameter $\theta$

Output:

1: $\mathcal{T} \leftarrow \emptyset$

2: for $k \in \{\mathcal{K}_{\min}, \mathcal{K}_{\min}+1, \ldots, \mathcal{K}_{\max}\}$ do

3: Set time steps $\tau^k_m \in \{\tau^k_0, \tau^k_1, \ldots, \tau^k_{k-1}\}$

4: Add time steps to $\mathcal{T}$

5: end for

6: let $e = \frac{T}{\mathcal{K}_{\min}}$

7: repeat

8: Sample $(z, c) \sim \mathcal{D}$, $\tau_m \sim \mathcal{U}(\{\tau \in \mathcal{T} \mid t - e \leq \tau \leq t\})$

9: Sample $n \sim \mathcal{U}[1, N-1]$, $\tilde{\tau}_m \sim \mathcal{U}([(1 - \eta)\tau_m, \tau_m])$

10: Sample $\mathbf{x}_{t_{n+1}} \sim \mathcal{N}(\alpha_{t_{n+1}} \mathbf{x}, \sigma_{t_{n+1}}^2 \mathbf{I})$

11: if probability $\rho$ then

12: $\hat{\mathbf{x}}^{\phi, \omega'}_{t_n} \leftarrow (1 + \omega')\,\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}, t_n, c; T_\phi) - \omega'\,\Phi(\mathbf{x}_{t_{n+1}}, t_{n+1}, t_n; T_\phi)$

13:else

14:

𝐱^t n ϕ,w′←Φ⁢(𝐱 t n+1,t n+1,t n;T ϕ)←subscript superscript^𝐱 italic-ϕ superscript 𝑤′subscript 𝑡 𝑛 Φ subscript 𝐱 subscript 𝑡 𝑛 1 subscript 𝑡 𝑛 1 subscript 𝑡 𝑛 subscript 𝑇 italic-ϕ\hat{\mathbf{x}}^{\phi,w^{\prime}}{t{n}}\leftarrow\Phi(\mathbf{x}{t{n+1}},% t_{n+1},t_{n};T_{\phi})over^ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT italic_ϕ , italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ← roman_Φ ( bold_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ; italic_T start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT )

15:end if

16:

ℒ TDD w′:=‖𝒇 θ⁢(𝐱 t n+1,t n+1,τm)−𝒇 θ−⁢(𝐱^t n ϕ,w′,t n,τm)‖2 2 assign subscript superscript ℒ superscript 𝑤′TDD superscript subscript norm subscript 𝒇 𝜃 subscript 𝐱 subscript 𝑡 𝑛 1 subscript 𝑡 𝑛 1 subscript𝜏 𝑚 subscript 𝒇 superscript 𝜃 subscript superscript^𝐱 italic-ϕ superscript 𝑤′subscript 𝑡 𝑛 subscript 𝑡 𝑛 subscript𝜏 𝑚 2 2\mathcal{L}^{w^{\prime}}{\text{TDD}}:=\left|\boldsymbol{f}{\theta}(\mathbf{% x}{t{n+1}},t_{n+1},\tilde{\tau}{m})-\boldsymbol{f}{\theta^{-}}(\hat{% \mathbf{x}}^{\phi,w^{\prime}}{t{n}},t_{n},\tilde{\tau}{m})\right|{2}^{2}caligraphic_L start_POSTSUPERSCRIPT italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT TDD end_POSTSUBSCRIPT := ∥ bold_italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT , over~ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) - bold_italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( over^ start_ARG bold_x end_ARG start_POSTSUPERSCRIPT italic_ϕ , italic_w start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , over~ start_ARG italic_τ end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

17:

θ←θ−δ⁢∇θ ℒ⁢(θ,θ−;ϕ)←𝜃 𝜃 𝛿 subscript∇𝜃 ℒ 𝜃 superscript 𝜃 italic-ϕ\theta\leftarrow\theta-\delta\nabla_{\theta}\mathcal{L}(\theta,\theta^{-};\phi)italic_θ ← italic_θ - italic_δ ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT caligraphic_L ( italic_θ , italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ; italic_ϕ )

18:

θ−←sg⁢(μ⁢θ−+(1−μ)⁢θ)←superscript 𝜃 sg 𝜇 superscript 𝜃 1 𝜇 𝜃\theta^{-}\leftarrow\text{sg}(\mu\theta^{-}+(1-\mu)\theta)italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ← sg ( italic_μ italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT + ( 1 - italic_μ ) italic_θ )

19:until convergence
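The target timestep selection in steps 1 to 9 of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes $\tau^{k}_{m} = mT/k$ (equidistant points starting at 0), $T = 1000$, and that $t$ in step 8 is the current noisy timestep. The function names `build_target_set` and `sample_target` are hypothetical.

```python
import random

def build_target_set(k_min, k_max, T=1000):
    """Union of the per-k target timesteps for k = k_min..k_max (steps 1-5)."""
    targets = set()
    for k in range(k_min, k_max + 1):
        # Assumption: tau^k_m = m * T / k, i.e. k equidistant points starting at 0.
        targets.update(m * T / k for m in range(k))
    return sorted(targets)

def sample_target(t, targets, k_min, T=1000, eta=0.3, rng=random):
    """Pick tau_m from the window [t - e, t], then relax it by eta (steps 6, 8, 9)."""
    e = T / k_min
    window = [tau for tau in targets if t - e <= tau <= t]
    tau_m = rng.choice(window)
    # tau_tilde is uniform on [(1 - eta) * tau_m, tau_m].
    tau_tilde = rng.uniform((1.0 - eta) * tau_m, tau_m)
    return tau_m, tau_tilde

targets = build_target_set(4, 8)
tau_m, tau_tilde = sample_target(600, targets, k_min=4, rng=random.Random(0))
```

Under these assumptions the window is never empty for $0 \leq t \leq T$, because the $\mathcal{K}_{\min}$ grid alone places a target in every interval of length $e$.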

Figure 3: Illustration of TDD distillation training and sampling processes. Fig (a) shows the distillation process, where $\tau^{k}$ represents equidistant timesteps within segments. Fig (b) compares non-equidistant sampling with standard sampling for 5-step inference.

3.3 Decoupled Guidance

Distillation with Decoupled Guidance

Classifier-free guidance (CFG) lets a model control the generation results without relying on an external classifier, effectively modulating the influence of conditional signals. In current consistency distillation methods, to keep training stable, it is common to use the sample $\hat{\mathbf{x}}^{\phi,w'}_{t_n}$, generated by the teacher model with classifier-free guidance, as the reference when optimizing the student model's generated samples. We believe that $w'$ solely represents the diversity constraint in the distillation process, controlling the complexity and generalization of the learning target. This allows for faster learning with fewer parameters. Therefore, $w'$ should be treated separately from the CFG scale $w$ used during inference in consistency models. Consequently, regardless of whether $w' > 0$, following (Ho and Salimans 2022), it is essential to include both unconditional and conditional training samples in the training process. Based on this, we replace a portion of the conditions with an empty prompt and do not apply CFG enhancement to them. For conditions that are not empty, the loss function of TDD can be updated as follows:

$$\mathcal{L}^{w'}_{\text{TDD}} := \left\| \boldsymbol{f}_{\theta}(\mathbf{x}_{t_{n+1}}, t_{n+1}, \tau) - \boldsymbol{f}_{\theta^{-}}(\hat{\mathbf{x}}^{\phi,w'}_{t_n}, t_n, \tau) \right\|_2^2. \tag{11}$$
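As a rough sketch of how the reference sample could be built under decoupled guidance, the Python below follows the prose of this section: with probability $\rho$ the prompt is emptied and the teacher takes a plain unconditional solver step, otherwise the step is CFG-augmented with the fixed scale $w'$. The names `teacher_target`, `tdd_loss` and `toy_step` are hypothetical, and `toy_step` is only a stand-in for the ODE solver update $\Phi$.

```python
import numpy as np

def teacher_target(solver_step, x_next, t_next, t, cond, w_prime, rho, rng):
    """Return (reference sample at t, condition passed to the student)."""
    if rng.random() < rho:
        # Empty prompt: unconditional solver step only, no CFG enhancement.
        return solver_step(x_next, t_next, t, None), None
    x_cond = solver_step(x_next, t_next, t, cond)
    x_uncond = solver_step(x_next, t_next, t, None)
    # Non-empty prompt: CFG-augmented step with the fixed training scale w'.
    return (1.0 + w_prime) * x_cond - w_prime * x_uncond, cond

def tdd_loss(student_out, target_out):
    """Squared L2 distance between student and EMA-target predictions, as in Eq. (11)."""
    return float(np.sum((student_out - target_out) ** 2))

# Toy stand-in for the solver update: a condition shifts the result by 1.
def toy_step(x, t_next, t, cond):
    return 0.5 * x + (0.0 if cond is None else 1.0)

x = np.ones(4)
guided, c1 = teacher_target(toy_step, x, 500, 480, "a cat", 3.0, 0.0,
                            np.random.default_rng(0))
unguided, c2 = teacher_target(toy_step, x, 500, 480, "a cat", 3.0, 1.0,
                              np.random.default_rng(0))
```

In a real training loop the two arguments of `tdd_loss` would be the outputs of $\boldsymbol{f}_{\theta}$ and $\boldsymbol{f}_{\theta^{-}}$ evaluated at the same target timestep.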

Figure 4: Qualitative comparison of different methods under NFE for 4 to 8 steps.

Guidance Scale Tuning

Define $\epsilon_{\theta}(\mathbf{x}_t)$ as the consistency model. At each inference step, the noise predicted by the model can be expressed as:

$$\hat{\epsilon_{w}} = (1+w)\,\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t, c) - w\,\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t), \tag{12}$$

where $\mathbf{x}^{w'}_{t}$ represents the input state at time step $t$ of the consistency model distilled with the diversity guidance scale $w'$.

Although setting a high value for $w'$ (e.g., $w' > 7$) can enhance certain aspects of the generated images, it simultaneously results in significantly reduced image complexity and excessively high contrast.

Is there a way to address this issue without retraining, so that inference again works with a small CFG scale, as in the original model? By incorporating the unconditional branch into training, we get $\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t, c) \propto (1+w')\,\epsilon_{\phi}(\mathbf{x}_t, t, c) - w'\,\epsilon_{\phi}(\mathbf{x}_t, t)$ and $\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t) \propto \epsilon_{\phi}(\mathbf{x}_t, t)$, where $\epsilon_{\phi}(\cdot)$ is the teacher model.

Denote $(1+w)\,\epsilon_{\phi}(\mathbf{x}_t, t, c) - w\,\epsilon_{\phi}(\mathbf{x}_t, t)$ as $\epsilon_{w}$, representing the noise inferred by the original model when using the normal guidance scale $w$ at each time step. After simplification, we finally obtain:

$$\epsilon_{w} \approx [\hat{\epsilon_{w}} + w'\,\epsilon_{\theta}(\mathbf{x}_t, t)] / (1+w'). \tag{13}$$

This equation suggests that regulating $\hat{\epsilon_{w}}$ with the normal $w$ can approximate the output of the original model $\epsilon_{w}$. For a more detailed derivation, please refer to the appendix.
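The relation between Eq. (12) and Eq. (13) can be checked numerically. The sketch below assumes the two proportionalities hold as exact equalities, which is an idealization, and uses random vectors in place of real network outputs. The function name `tuned_noise` is hypothetical.

```python
import numpy as np

def tuned_noise(eps_cond, eps_uncond, w, w_prime):
    """Apply Eq. (12), then Eq. (13), to a model distilled with scale w'."""
    eps_hat = (1.0 + w) * eps_cond - w * eps_uncond            # Eq. (12)
    return (eps_hat + w_prime * eps_uncond) / (1.0 + w_prime)  # Eq. (13)

rng = np.random.default_rng(0)
teacher_cond = rng.normal(size=8)    # stands in for eps_phi(x_t, t, c)
teacher_uncond = rng.normal(size=8)  # stands in for eps_phi(x_t, t)
w, w_prime = 3.0, 7.0

# Idealized student: the conditional branch has absorbed the training scale w',
# while the unconditional branch matches the teacher.
student_cond = (1.0 + w_prime) * teacher_cond - w_prime * teacher_uncond
student_uncond = teacher_uncond

recovered = tuned_noise(student_cond, student_uncond, w, w_prime)
expected = (1.0 + w) * teacher_cond - w * teacher_uncond  # teacher's eps_w
```

Under that idealization `recovered` equals the teacher's guided noise $\epsilon_{w}$ up to floating-point error, so the approximation in Eq. (13) comes only from how well the proportionalities hold.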

In the aforementioned formula, the distillation diversity constraint $w'$ is known and fixed. The parameter $w$ can be inferred based on the standard teacher model's ratio. Furthermore, for the current consistency model, even though it has not yet learned the unguided path, this formula can still be approximately utilized for inference, as no other learning has been conducted. This can be expressed as follows:

$$\epsilon'_{w} \approx [\hat{\epsilon'_{w}} + \overline{w'}\,\epsilon'_{\theta}(\mathbf{x}_t, t)] / (1+\overline{w'}), \tag{14}$$

where $\epsilon'_{\theta}(\cdot)$ is the current consistency model and $\overline{w'} = (w'_{\min} + w'_{\max})/2$.

3.4 Sample

Since we have trained on multiple equidistant target timesteps, we can extend this to non-equidistant sampling. By reducing the sampling interval during high-noise periods, we can mitigate the generation difficulty and achieve better synthesis results. As shown in Figure 3 (b), we ensure that the inference process passes through the target timesteps corresponding to $\mathcal{K}_{\min}$. As the inference steps increase, additional target timesteps are gradually inserted between these key timesteps. For example, with $\mathcal{K}_{\min}=4$, as the number of steps increases, we insert 8-step ($\mathcal{K}_{\max}$) target timesteps within each adjacent 4-step target interval to enhance generation quality.
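One possible way to build such a schedule is sketched below. It is an illustration under several assumptions that the text does not confirm: target timesteps are $mT/k$ with $T = 1000$, the extra $\mathcal{K}_{\max}$-grid timesteps are inserted starting from the high-noise end, and $\mathcal{K}_{\max}$ is a multiple of $\mathcal{K}_{\min}$ so the two grids nest. The function name `non_equidistant_targets` is hypothetical.

```python
def non_equidistant_targets(num_steps, k_min=4, k_max=8, T=1000):
    """Target timestep of each inference step, from high noise to low noise."""
    if not k_min <= num_steps <= k_max:
        raise ValueError("num_steps must lie between k_min and k_max")
    # Key timesteps: the k_min grid is always part of the schedule.
    keys = {m * T / k_min for m in range(k_min)}
    # Candidates from the k_max grid, highest-noise first (assumed insertion order).
    extras = sorted({m * T / k_max for m in range(k_max)} - keys, reverse=True)
    chosen = keys | set(extras[: num_steps - k_min])
    return sorted(chosen, reverse=True)

five_step = non_equidistant_targets(5)
```

With these defaults the 5-step schedule keeps the four key targets and adds one 8-step target in the noisiest interval, so the first two steps are half as long as the remaining three.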

In addition, the $\gamma$-sampler proposed in CTM (Kim et al. 2023) alternates forward and backward jumps along the solution trajectory to solve $\mathbf{x}_0$, allowing control over the randomness ratio through $\gamma$, which can enhance generation quality to some extent. Solving for $\mathbf{x}_s$ from $\mathbf{x}_t$ can be represented as follows:

$$\mathbf{x}_s = \frac{\alpha_s}{\alpha_t}\mathbf{x}_t - \sigma_s\left(e^{h_s} - 1\right)\epsilon_{\theta}(\mathbf{x}_t, t), \tag{15}$$

where $h_s = \lambda_s - \lambda_t$ and $\lambda$ represents the log signal-to-noise ratio, $\lambda = \log(\alpha/\sigma)$. However, in few-step sampling with high CFG inference, the noise distribution after the first step significantly deviates from the expected distribution. To address this, we adopt an $\mathbf{x}_0$ formulation. Approximating $x_{\theta}(\mathbf{x}_t, t) \approx \mathbf{x}_0 = (\mathbf{x}_t - \sigma_t\epsilon)/\alpha_t$, we can derive:

$$\mathbf{x}_s = \frac{\sigma_s}{\sigma_t}\mathbf{x}_t - \alpha_s\left(e^{-h_s} - 1\right)x_{\theta}(\mathbf{x}_t, t). \tag{16}$$
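The noise-prediction update of Eq. (15) and the $\mathbf{x}_0$-prediction update of Eq. (16) describe the same step when $\mathbf{x}_0 = (\mathbf{x}_t - \sigma_t\epsilon)/\alpha_t$ holds exactly. The short check below illustrates this with arbitrary schedule values and random vectors. The function names `step_eps` and `step_x0` are hypothetical.

```python
import numpy as np

def log_snr(alpha, sigma):
    """lambda = log(alpha / sigma), as defined in the text."""
    return np.log(alpha / sigma)

def step_eps(x_t, eps, alpha_t, sigma_t, alpha_s, sigma_s):
    """Eq. (15): update from t to s using a noise prediction."""
    h = log_snr(alpha_s, sigma_s) - log_snr(alpha_t, sigma_t)
    return (alpha_s / alpha_t) * x_t - sigma_s * (np.exp(h) - 1.0) * eps

def step_x0(x_t, x0, alpha_t, sigma_t, alpha_s, sigma_s):
    """Eq. (16): the same update using an x0 prediction."""
    h = log_snr(alpha_s, sigma_s) - log_snr(alpha_t, sigma_t)
    return (sigma_s / sigma_t) * x_t - alpha_s * (np.exp(-h) - 1.0) * x0

rng = np.random.default_rng(0)
x_t, eps = rng.normal(size=8), rng.normal(size=8)
alpha_t, sigma_t = 0.6, 0.8   # noisier state (arbitrary illustrative values)
alpha_s, sigma_s = 0.8, 0.6   # cleaner state
x0 = (x_t - sigma_t * eps) / alpha_t

from_eps = step_eps(x_t, eps, alpha_t, sigma_t, alpha_s, sigma_s)
from_x0 = step_x0(x_t, x0, alpha_t, sigma_t, alpha_s, sigma_s)
```

The two results agree to floating-point precision. The practical difference is that the $\mathbf{x}_0$ form exposes a quantity that can be clipped before the update.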

Following prior works (Saharia et al. 2022b; Lu et al. 2022b), we apply the clipping method $\mathcal{C}$, which clips each latent variable to a specific percentile of its absolute value and normalizes it to prevent saturation of the latent variables. Letting $\hat{\mathbf{x}_0} = \mathcal{C}(x_{\theta}(\mathbf{x}_t, t))$, we ultimately obtain:

$$\hat{\mathbf{x}_s} = \frac{\sigma_s}{\sigma_t}\mathbf{x}_t - \alpha_s\left(e^{-h_s} - 1\right)\hat{\mathbf{x}_0}. \tag{17}$$
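A minimal sketch of percentile clipping followed by the update of Eq. (17) is given below. The percentile value, the floor of 1 on the threshold (borrowed from dynamic thresholding in the cited prior work), and the use of a single threshold for the whole tensor are assumptions, since the text does not specify them. The function names `clip_x0` and `clipped_step` are hypothetical.

```python
import numpy as np

def clip_x0(x0, quantile=0.995):
    """Clip to a percentile of |x0| and rescale by that threshold."""
    s = float(np.quantile(np.abs(x0), quantile))
    s = max(s, 1.0)  # assumed floor, so small predictions are left unchanged
    return np.clip(x0, -s, s) / s

def clipped_step(x_t, x0_pred, alpha_t, sigma_t, alpha_s, sigma_s, quantile=0.995):
    """Eq. (17): x0-form update with the clipped prediction."""
    h = np.log(alpha_s / sigma_s) - np.log(alpha_t / sigma_t)
    x0_hat = clip_x0(x0_pred, quantile)
    return (sigma_s / sigma_t) * x_t - alpha_s * (np.exp(-h) - 1.0) * x0_hat

sample = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
rescaled = clip_x0(sample, quantile=1.0)  # threshold 4: nothing clipped, all rescaled
clipped = clip_x0(sample, quantile=0.5)   # threshold 1: outliers clipped to +/-1
```

When the prediction already lies within $[-1, 1]$ the floor keeps the threshold at 1, so `clipped_step` reduces to the unclipped update of Eq. (16).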

When using the $\gamma$-sampler, the transition from timestep $t$ to the next timestep $p$ can be expressed as

$$\mathbf{x}_p = \frac{\alpha_p}{\alpha_s}\hat{\mathbf{x}_s} + \sqrt{1 - \frac{\alpha_p^2}{\alpha_s^2}}\,\mathbf{z}, \quad \mathbf{z} \sim \mathcal{N}(0, I), \tag{18}$$

where $s = (1-\gamma)p$.
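The re-noising step of Eq. (18) can be sketched as follows. The check at the end assumes a variance-preserving schedule with $\alpha^2 + \sigma^2 = 1$, which the text does not state explicitly, and the schedule values are arbitrary. The function names `intermediate_time` and `renoise` are hypothetical.

```python
import numpy as np

def intermediate_time(p, gamma):
    """s = (1 - gamma) * p: the less noisy time the model jumps to first."""
    return (1.0 - gamma) * p

def renoise(x_s_hat, alpha_p, alpha_s, z):
    """Eq. (18): diffuse the estimate at s forward to the next timestep p."""
    scale = alpha_p / alpha_s
    return scale * x_s_hat + np.sqrt(1.0 - scale ** 2) * z

# Under alpha^2 + sigma^2 = 1 the re-noised sample has the noise level of p.
alpha_s, alpha_p = 0.8, 0.6
sigma_s_sq = 1.0 - alpha_s ** 2
noise_var = (alpha_p / alpha_s) ** 2 * sigma_s_sq + (1.0 - (alpha_p / alpha_s) ** 2)

rng = np.random.default_rng(0)
x_p = renoise(np.ones(3), alpha_p, alpha_s, rng.standard_normal(3))
```

With $\gamma = 0$ we have $s = p$, the square-root term vanishes, and sampling becomes deterministic. Larger $\gamma$ injects more fresh noise at every step.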

Table 1: Quantitative Comparison under different metrics and datasets. Our method consistently outperforms others in FID and PickScore, achieving higher image complexity without sacrificing quality.

4 Experiments

4.1 Dataset

We use a subset of the Laion-5B (Schuhmann et al. 2022) High-Res dataset for training. All images in the dataset have an aesthetic score above 5.5, totaling approximately 260 million. Additionally, we evaluate the performance using the COCO-2014 (Lin et al. 2014) validation set, split into 30k (COCO-30K) and 2k (COCO-2K) captions for assessing different metrics. We also use the PartiPrompts (Yu et al. 2022) dataset to benchmark performance, which includes over 1600 prompts across various categories and challenge aspects.

4.2 Backbone

We employ SDXL (Podell et al. 2023) as the backbone for our experiments. Specifically, we trained a LoRA (Hu et al. 2021) through distillation.

4.3 Metrics

We evaluate image generation quality using Fréchet Inception Distance (FID) (Heusel et al. 2017) and assess image content richness with the Image Complexity Score (IC) (Feng et al. 2022). Additionally, we use PickScore (Kirstain et al. 2023) to measure human preference. FID and IC are tested on the COCO-30K dataset, while PickScore is evaluated on COCO-2K and PartiPrompts.

4.4 Performance Comparison

In this section, we present a comprehensive performance evaluation of our proposed method against several baselines, including LCM, PCM, and TCD, across the COCO-30K, PartiPrompts, and COCO-2K datasets, as detailed in Table 1. We introduce two methods, “Ours” and “Ours*”, representing normal sampling and non-equidistant sampling (4-step and 8-step as normal sampling), respectively. Additionally, “Ours(adv)” incorporates PCM’s (Wang et al. 2024) adversarial process during distillation, demonstrating that our method can effectively integrate adversarial training.

As shown in Figure 4, we qualitatively compare Ours and Ours(adv) with other methods across different inference steps. Our model outperforms others in image quality and text-image alignment, especially in the 4 to 8-step range. Quantitative results in Table 1 show that Ours achieves the best FID, though FID values tend to increase with more steps. While our method may not always achieve the top Image Complexity (IC), it avoids generating less detailed or cluttered images, unlike CTM, as seen in Figure 4. However, the high image complexity observed in CTM may also be attributed to certain visual artifacts and high-frequency noise, which we will elaborate on in the Appendix. Furthermore, evaluations using PickScore on the COCO-2K and PartiPrompts datasets show that Ours and Ours* consistently rank first or second in most cases. Overall, our method demonstrates superior performance and a well-balanced approach across metrics.

4.5 Ablation Study

Effect of Target Timestep Selection

To demonstrate the advantages of Target-Timestep-Selection, we compared the performance of models trained on mappings required for 4-step inference (e.g., PCM) against those trained on mappings required for 4–8 step inference. We maintained consistent settings with a batch size of 128, a learning rate of 5e-06, and trained for a total of 15,000 steps. As shown in Figure 5, when using deterministic sampling (i.e., $\gamma=0$), the model trained on mappings for 4–8 step inference showed only a slight advantage at 4 and 5 steps. However, when incorporating randomness into sampling (i.e., $\gamma=0.2$), the model trained on 4–8 step mappings outperformed the model trained on 4-step mappings across all 4–8 steps. Furthermore, when we extended the mapping range by $\eta=0.3$ to better accommodate the randomness in sampling (as indicated by the red line in Figure 5), inference with $\gamma=0.2$ achieved a well-balanced performance across 4–8 steps, avoiding poor performance at 4–5 steps while also maintaining solid performance at 6–8 steps.

Figure 5: Qualitative comparison between Target-Driven Multi-Target Distillation (TDD, 4-8 step target timesteps distillation) and Single-Target Distillation (PCM, 4-step target timesteps distillation).

Effect of Distillation with Decoupled Guidance and Guidance Scale Tuning

To demonstrate the advantages of distillation with decoupled guidance, we conducted experiments with a batch size of 128, $\mathcal{K}_{\min}=4$, $\mathcal{K}_{\max}=8$, and $\eta=0.3$, training two models with empty prompt ratios of 0 and 0.2. After 15k steps, we performed inference using a CFG of 3, as shown in Figure 6 (b). This approach effectively stabilized image quality and reduced visual artifacts. Additionally, we applied the guidance scale tuning to models like TCD and PCM, which were distilled with higher CFG values (corresponding to original CFGs of 9 and 5.5, respectively, when inferred with CFG 1). Guidance scale tuning successfully converted these models to use normal CFG values during inference, significantly enhancing image content richness by reducing the CFG.

Figure 6: (a) Ablation comparison of distillation with decoupled guidance. (b) Ablation comparison of guidance scale tuning.

Effect of $\mathbf{x}_0$ Clipping Sample

In Figure 7, we demonstrate the advantages of $\mathbf{x}_0$ clipping. For some samples inferred with a low guidance scale, such as the one on the far left, certain defects may appear during inference. Increasing the guidance can alleviate these issues to some extent, but it also increases contrast, as seen in the middle image of Figure 7. By applying clipping in the initial steps, we can partially correct these defects without increasing contrast. Additional examples and results from applying clipping beyond the first denoising step are provided in the Appendix.

5 Conclusion

Consistency distillation methods have proven effective in accelerating diffusion models’ generative tasks. However, previous methods often face issues such as blurriness and detail loss due to simplistic strategies in target timestep selection. We propose Target-Driven Distillation (TDD), which addresses these limitations by (1) employing a refined strategy for selecting target timesteps, thus enhancing training efficiency; (2) using decoupled guidance during training, which allows for post-tuning of the guidance scale during inference; and (3) incorporating optional non-equidistant sampling and $\mathbf{x}_0$ clipping for more flexible and precise image sampling. Experiments demonstrate that TDD achieves state-of-the-art performance in few-step generation, providing a superior option among consistency distillation methods.

Figure 7: Ablation comparison of $\mathbf{x}_0$ clipping sample. The prompt used is “a dog wearing a blue dress”. The images on the left are with CFG = 2, the middle with CFG = 3, and the right with CFG = 3 and $\mathbf{x}_0$ clipping applied.

References

  • Balaji et al. (2022) Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Zhang, Q.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; et al. 2022. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324.
  • Dhariwal and Nichol (2021) Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34: 8780–8794.
  • Feng et al. (2022) Feng, T.; Zhai, Y.; Yang, J.; Liang, J.; Fan, D.-P.; Zhang, J.; Shao, L.; and Tao, D. 2022. IC9600: a benchmark dataset for automatic image complexity assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7): 8577–8593.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. Advances in neural information processing systems, 27.
  • Heusel et al. (2017) Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30.
  • Ho and Salimans (2022) Ho, J.; and Salimans, T. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
  • Hu et al. (2021) Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Karras et al. (2022) Karras, T.; Aittala, M.; Aila, T.; and Laine, S. 2022. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35: 26565–26577.
  • Karras, Laine, and Aila (2019) Karras, T.; Laine, S.; and Aila, T. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 4401–4410.
  • Kim et al. (2023) Kim, D.; Lai, C.-H.; Liao, W.-H.; Murata, N.; Takida, Y.; Uesaka, T.; He, Y.; Mitsufuji, Y.; and Ermon, S. 2023. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279.
  • Kingma and Welling (2013) Kingma, D.P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Kirstain et al. (2023) Kirstain, Y.; Polyak, A.; Singer, U.; Matiana, S.; Penna, J.; and Levy, O. 2023. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36: 36652–36663.
  • Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C.L. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 740–755. Springer.
  • Liu, Gong, and Liu (2022) Liu, X.; Gong, C.; and Liu, Q. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
  • Lu et al. (2022a) Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; and Zhu, J. 2022a. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095.
  • Lu et al. (2022b) Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; and Zhu, J. 2022b. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095.
  • Luo et al. (2023a) Luo, S.; Tan, Y.; Huang, L.; Li, J.; and Zhao, H. 2023a. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.
  • Luo et al. (2023b) Luo, S.; Tan, Y.; Patil, S.; Gu, D.; von Platen, P.; Passos, A.; Huang, L.; Li, J.; and Zhao, H. 2023b. Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556.
  • Meng et al. (2021) Meng, C.; Song, Y.; Song, J.; Wu, J.; Zhu, J.-Y.; and Ermon, S. 2021. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.
  • Podell et al. (2023) Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.
  • Ramesh et al. (2022) Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; and Chen, M. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2): 3.
  • Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10684–10695.
  • Saharia et al. (2022a) Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; and Norouzi, M. 2022a. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 conference proceedings, 1–10.
  • Saharia et al. (2022b) Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. 2022b. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35: 36479–36494.
  • Sauer et al. (2024) Sauer, A.; Boesel, F.; Dockhorn, T.; Blattmann, A.; Esser, P.; and Rombach, R. 2024. Fast high-resolution image synthesis with latent adversarial diffusion distillation. arXiv preprint arXiv:2403.12015.
  • Sauer et al. (2023) Sauer, A.; Lorenz, D.; Blattmann, A.; and Rombach, R. 2023. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042.
  • Schuhmann et al. (2022) Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35: 25278–25294.
  • Sohl-Dickstein et al. (2015) Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, 2256–2265. PMLR.
  • Sohn, Lee, and Yan (2015) Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28.
  • Song, Meng, and Ermon (2020) Song, J.; Meng, C.; and Ermon, S. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
  • Song et al. (2023) Song, Y.; Dhariwal, P.; Chen, M.; and Sutskever, I. 2023. Consistency models. arXiv preprint arXiv:2303.01469.
  • Song and Ermon (2019) Song, Y.; and Ermon, S. 2019. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32.
  • Song et al. (2020) Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; and Poole, B. 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
  • Wang et al. (2024) Wang, F.-Y.; Huang, Z.; Bergman, A.W.; Shen, D.; Gao, P.; Lingelbach, M.; Sun, K.; Bian, W.; Song, G.; Liu, Y.; et al. 2024. Phased Consistency Model. arXiv preprint arXiv:2405.18407.
  • Wu et al. (2024) Wu, T.; Li, X.; Qi, Z.; Hu, D.; Wang, X.; Shan, Y.; and Li, X. 2024. SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 6126–6134.
  • Yin et al. (2024) Yin, T.; Gharbi, M.; Zhang, R.; Shechtman, E.; Durand, F.; Freeman, W.T.; and Park, T. 2024. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6613–6623.
  • Yu et al. (2022) Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K.; et al. 2022. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3): 5.
  • Zhang and Chen (2022) Zhang, Q.; and Chen, Y. 2022. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902.
  • Zheng et al. (2023) Zheng, G.; Zhou, X.; Li, X.; Qi, Z.; Shan, Y.; and Li, X. 2023. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22490–22499.
  • Zheng et al. (2024) Zheng, J.; Hu, M.; Fan, Z.; Wang, C.; Ding, C.; Tao, D.; and Cham, T.-J. 2024. Trajectory consistency distillation. arXiv preprint arXiv:2402.19159.

Appendix A Appendix

A.1 Guidance Scale Tuning Details

Define $\epsilon_{\theta}(\mathbf{x}_{t})$ as the noise predicted by the consistency model at each inference step. According to Classifier-Free Guidance (CFG), the noise can be expressed as:

$$\hat{\epsilon_{w}} = (1+w)\,\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t, c) - w\,\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t), \tag{19}$$

where $w'$ denotes the guidance scale used during the distillation process, and $\mathbf{x}^{w'}_{t}$ represents the sample at time step $t$ obtained from the model distilled with this guidance scale.

By replacing a portion of the conditions with empty prompts, we obtain the approximations:

$$\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t, c) \approx (1+w')\,\epsilon_{\phi}(\mathbf{x}_{t}, t, c) - w'\,\epsilon_{\phi}(\mathbf{x}_{t}, t) \tag{20}$$

and

$$\epsilon_{\theta}(\mathbf{x}^{w'}_{t}, t) \approx \epsilon_{\phi}(\mathbf{x}_{t}, t), \tag{21}$$

where $\epsilon_{\phi}(*)$ is the master model. Thus, we derive:

$$
\begin{aligned}
\hat{\epsilon_{w}} \approx{} & (1+w)\,[(1+w')\,\epsilon_{\phi}(\mathbf{x}_{t}, t, c) - w'\,\epsilon_{\phi}(\mathbf{x}_{t}, t)] - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) \\
\approx{} & (1+w)(1+w')\,\epsilon_{\phi}(\mathbf{x}_{t}, t, c) - (1+w)\,w'\,\epsilon_{\phi}(\mathbf{x}_{t}, t) - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) \\
\approx{} & (1+w')\,[(1+w)\,\epsilon_{\phi}(\mathbf{x}_{t}, t, c) - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t)] \\
& + (1+w')\,w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) - (1+w)\,w'\,\epsilon_{\phi}(\mathbf{x}_{t}, t) - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) \\
\approx{} & (1+w')\,[(1+w)\,\epsilon_{\phi}(\mathbf{x}_{t}, t, c) - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t)] \\
& + w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) + w'w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) - w'\,\epsilon_{\phi}(\mathbf{x}_{t}, t) - w'w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t) \\
\approx{} & (1+w')\,[(1+w)\,\epsilon_{\phi}(\mathbf{x}_{t}, t, c) - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t)] - w'\,\epsilon_{\phi}(\mathbf{x}_{t}, t).
\end{aligned}
\tag{22}
$$

Let $\epsilon_{w}$ denote $(1+w)\,\epsilon_{\phi}(\mathbf{x}_{t}, t, c) - w\,\epsilon_{\phi}(\mathbf{x}_{t}, t)$, the noise predicted by the original model using the normal guidance scale $w$ at each time step. We then obtain:

$$\hat{\epsilon_{w}} \approx (1+w')\,\epsilon_{w} - w'\,\epsilon_{\phi}(\mathbf{x}_{t}, t). \tag{23}$$

Given $\epsilon_{w}$ as the expected output, we get:

$$\epsilon_{w} \approx [\hat{\epsilon_{w}} + w'\,\epsilon_{\theta}(\mathbf{x}_{t}, t)] / (1+w'). \tag{24}$$

Since $w'$ is a fixed constant and both $\epsilon_{w}$ and $\hat{\epsilon_{w}}$ are governed by the same guidance scale $w$, the formula allows us to tune the guidance scale to its normal range by incorporating an additional unconditional noise.

Additionally, for models distilled using a range of guidance scales, we can use the average of this range, $\overline{w'} = (w'_{\min} + w'_{\max})/2$, as a substitute for $w'$. The expression then becomes:

$$\epsilon'_{w} \approx [\hat{\epsilon'_{w}} + \overline{w'}\,\epsilon'_{\theta}(\mathbf{x}_{t}, t)] / (1+\overline{w'}). \tag{25}$$
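The tuning rule in Eq. (24) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names `cfg`, `tuned_noise`, and `average_scale` are hypothetical, and the "student" outputs are synthetic arrays constructed so that the approximations in Eqs. (20) and (21) hold exactly.

```python
import numpy as np


def cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance in the form used above: (1 + w) * cond - w * uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond


def tuned_noise(eps_theta_cond, eps_theta_uncond, w, w_prime):
    """Eq. (24): divide out the distillation guidance scale w_prime.

    eps_theta_cond / eps_theta_uncond are the distilled model's conditional
    and unconditional noise predictions (hypothetical inputs here).
    """
    eps_hat = cfg(eps_theta_cond, eps_theta_uncond, w)  # Eq. (19)
    return (eps_hat + w_prime * eps_theta_uncond) / (1.0 + w_prime)


def average_scale(w_min, w_max):
    """Substitute for w_prime when distilling over a range of scales (Eq. 25)."""
    return (w_min + w_max) / 2.0


# Synthetic check: pretend Eqs. (20) and (21) hold exactly.
rng = np.random.default_rng(0)
eps_phi_cond = rng.standard_normal(4)    # stand-in for the original model, conditional
eps_phi_uncond = rng.standard_normal(4)  # stand-in for the original model, unconditional
w, w_prime = 2.0, 3.5

eps_theta_cond = cfg(eps_phi_cond, eps_phi_uncond, w_prime)  # Eq. (20)
eps_theta_uncond = eps_phi_uncond                            # Eq. (21)

recovered = tuned_noise(eps_theta_cond, eps_theta_uncond, w, w_prime)
target = cfg(eps_phi_cond, eps_phi_uncond, w)  # epsilon_w of the original model
print(np.allclose(recovered, target))
```

Under these idealized assumptions the recovered noise coincides with the original model's guided prediction at scale $w$; with a real distilled model the match is only approximate.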

Figure 8: Comparison of image complexity. We present complexity maps for selected generated images from TCD and TDD.

A.2 Experimental Details

Datasets

For our experiments, we use a curated subset of the Laion-5B High-Resolution dataset, selecting images with an aesthetic score exceeding 5.5. This subset comprises approximately 260 million high-quality images, providing a diverse and extensive foundation for training our models. To evaluate our models’ performance, we employ the COCO-2014 validation set, using two subsets: COCO-30K, containing 30,000 captions, and COCO-2K, containing 2,000 captions. These subsets are used to assess a range of metrics. Additionally, we benchmark our models using the PartiPrompts dataset, which consists of over 1,600 prompts spanning various categories and challenging aspects. This dataset is particularly valuable for testing the model’s generalization and adaptability across diverse and complex scenarios.

Training Details

In our experiments, for the main results comparison, we used the SDXL LoRA versions of LCM, TCD, and PCM with open-source weights, where PCM was distilled with small CFG across 4 phases. For our model, we similarly chose SDXL as the backbone for distillation, setting the LoRA rank to 64. For the non-adversarial version, we employed a learning rate of 1e-6 with a batch size of 512 for 20,000 iterations. For the adversarial version, the learning rate was 2e-6 with the adversarial model set to 1e-5 and a batch size of 448. $\mathcal{K}_{\min}$ and $\mathcal{K}_{\max}$ are set to 4 and 8, respectively, with $\eta$ set to 0.3 and the ratio of empty prompts set to 0.2. We used DDIM as the solver with $N=250$. Additionally, we used a fixed guidance scale of $w'=3.5$.

In our non-equidistant sampling method, we incrementally inserted timesteps from an 8-step denoising process into the intervals of a 4-step denoising process. Specifically, for step counts between 4 and 8, the selected timesteps were as follows:

  • For 5 steps: [999., 875., 751., 499., 251.].
  • For 6 steps: [999., 875., 751., 627., 499., 251.].
  • For 7 steps: [999., 875., 751., 627., 499., 375., 251.].
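The schedules listed above can be reproduced with a short sketch. It assumes, as an inference from the three lists rather than a statement in the text, that the 4-step schedule is [999, 751, 499, 251] and that the intermediate timesteps are inserted in the order 875, 627, 375. The function name `non_equidistant_timesteps` is hypothetical.

```python
def non_equidistant_timesteps(num_steps,
                              base=(999, 751, 499, 251),
                              extra=(875, 627, 375)):
    """Build a schedule by inserting 8-step timesteps into a 4-step schedule.

    base:  assumed 4-step schedule (the timesteps common to all listed schedules).
    extra: assumed insertion order of the intermediate 8-step timesteps.
    """
    n_extra = num_steps - len(base)
    if not 0 <= n_extra <= len(extra):
        raise ValueError("num_steps must lie between len(base) and len(base) + len(extra)")
    # Insert the first n_extra intermediate timesteps, then restore descending order.
    return sorted(tuple(base) + tuple(extra)[:n_extra], reverse=True)


for n in (5, 6, 7):
    print(n, non_equidistant_timesteps(n))
```

Keeping the 4-step timesteps fixed means every intermediate schedule still visits the timesteps the 4-step sampler uses, with the added steps only refining the intervals between them.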

For the ablation experiments, we consistently employed the non-adversarial distillation approach with the following settings: a learning rate of 5e-6, a batch size of 128, and the DDIM solver with $N=50$. We trained all the ablation models for 15,000 iterations.

To ensure fairness in the main results comparison, given the varying CFG values used in prior work for distillation, we standardize the guidance scales for inference based on the distillation guidance scales used in LCM and TCD. Specifically, we use CFG = 1.0 for LCM and TCD, CFG = 1.6 for PCM, and CFG = 2.0 for our method.

Figure 9: Further comparison of image complexity results.

Analyzing Image Complexity

Although our model does not achieve the highest image complexity metrics, we identify factors that may influence this assessment. These factors primarily fall into two categories: visual artifacts and high-frequency noise, which can cause the IC model to misinterpret additional content, and unstable generation, which results in chaotic images that inflate the IC score.

As shown in Figure 8, visual artifacts appear in animal fur and elderly facial hair, leading the model to mistakenly perceive these areas as more complex. This is merely a result of generation defects. Additionally, as shown in Figure 9, the bodies of the tiger and mouse exhibit line irregularities and content disarray due to generation instability, which also contributes to inflated complexity metrics. The presence of extraneous lines and colors in the backgrounds further increases complexity. This issue is particularly pronounced in the “Yin-Yang” image, where instability causes a more noticeable rise in complexity. However, this increased complexity is a result of defects rather than meaningful content. Our goal is to achieve clean, coherent, and meaningful image content. Thus, despite a slight decrease in image complexity, our method effectively balances image quality and content richness.

Figure 10: Qualitative results of TDD with 4-step inference, comparing the effect of applying $\mathbf{x}_{0}$ clipping for different numbers of initial steps.

$\mathbf{x}_{0}$ Clipping Sample Details

We further examine the $\mathbf{x}_{0}$ clipping sample using the TDD model (non-adversarial) with a guidance scale of 3.5 and 4-step sampling. Without $\mathbf{x}_{0}$ clipping, the image exhibits excessively high contrast, resulting in an overall unrealistic appearance. In contrast, when applying $\mathbf{x}_{0}$ clipping at the initial step, the image becomes more natural and reveals additional details, such as the scenery outside the window and the garnishes in the food, as shown in Figure 10. However, increasing the number of steps where $\mathbf{x}_{0}$ clipping is applied, from just the first step to every step, results in images that progressively become more washed out and blurry. Subsequent $\mathbf{x}_{0}$ clipping operations do not enhance realism but instead lead to reduced clarity. Therefore, we recommend applying $\mathbf{x}_{0}$ clipping only at the first step for higher CFG values to ensure an improvement in image quality.
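The recommendation to clip only at the first step can be illustrated with a minimal sketch. This is not the authors' code. It assumes the standard DDIM relation for recovering the predicted $\mathbf{x}_{0}$ from a noise prediction, and the names `predict_x0` and `clip_x0_if_early`, as well as the symmetric clipping bound, are hypothetical choices made for illustration.

```python
import numpy as np


def predict_x0(x_t, eps, alpha_bar_t):
    """Standard DDIM estimate of x_0 from x_t and a noise prediction eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)


def clip_x0_if_early(x0_pred, step_index, clip_steps=1, bound=1.0):
    """Clip the predicted x_0 only during the first `clip_steps` sampling steps.

    clip_steps=1 corresponds to clipping at the first step only.
    bound is an assumed symmetric range, not a value taken from the paper.
    """
    if step_index < clip_steps:
        return np.clip(x0_pred, -bound, bound)
    return x0_pred


# Toy example: an over-saturated prediction is clipped at step 0 and left alone afterwards.
x0_pred = np.array([-3.0, 0.5, 2.0])
print(clip_x0_if_early(x0_pred, step_index=0))
print(clip_x0_if_early(x0_pred, step_index=1))
```

Exposing `clip_steps` as a parameter makes it easy to reproduce the comparison in Figure 10, where clipping is applied for different numbers of initial steps.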

More Visualizations

We also illustrate some additional samples:

  • using other base models finetuned from SDXL in Figure 11;

Figure 11: Qualitative results of TDD using different base models. Base 1: realvisxlV40; Base 2: SDXLUnstableDiffusers-YamerMIX.

Figure 12: Qualitative results of TDD using different LoRA adapters with 4-step inference. LoRA 1: SDXL-GundamV3; LoRA 2: Ice; LoRA 3: Papercut; LoRA 4: CLAYMATEV2.03.

Figure 13: Qualitative results of TDD using ControlNets based on Canny (top) and Depth (bottom) with 4-step inference.

Figure 14: Samples generated by TDD using four or five steps with Stable Diffusion XL.

Figure 15: Samples generated by TDD using six or seven steps with Stable Diffusion XL.

Figure 16: Samples generated by TDD using eight steps with Stable Diffusion XL.
