Title: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

URL Source: https://arxiv.org/html/2310.00224

Published Time: Tue, 03 Oct 2023 01:01:05 GMT

Nithin Gopalakrishnan Nair¹* Anoop Cherian² Suhas Lohit² Ye Wang²

Toshiaki Koike-Akino² Vishal M. Patel¹ Tim K. Marks²

¹Johns Hopkins University ²Mitsubishi Electric Research Laboratories (MERL)

{ngopala2,vpatel36}@jhu.edu {acherian,slohit,yewang,koike,tmarks}@merl.com

https://merl.com/demos/steered-diffusion

Abstract

Conditional generative models typically demand large annotated training sets to achieve high-quality synthesis. As a result, there has been significant interest in designing models that perform plug-and-play generation, i.e., to use a predefined or pretrained model, which is not explicitly trained on the generative task, to guide the generative process (e.g., using language). However, such guidance is typically useful only towards synthesizing high-level semantics rather than editing fine-grained details as in image-to-image translation tasks. To this end, and capitalizing on the powerful fine-grained generative control offered by the recent diffusion-based generative models, we introduce Steered Diffusion, a generalized framework for photorealistic zero-shot conditional image generation using a diffusion model trained for unconditional generation. The key idea is to steer the image generation of the diffusion model at inference time via designing a loss using a pre-trained inverse model that characterizes the conditional task. This loss modulates the sampling trajectory of the diffusion process. Our framework allows for easy incorporation of multiple conditions during inference. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution. Our results demonstrate clear qualitative and quantitative improvements over state-of-the-art diffusion-based plug-and-play models while adding negligible additional computational cost.

Figure 1: An illustration of various applications of our method. We use a diffusion model trained unconditionally and apply conditioning with our proposed algorithm only at test time. We present results on six tasks: (a) image inpainting, (b) colorization, (c) image super-resolution, (d) semantic generation, (e) identity replication, and (f) text-based image editing. In part (f), the text prompts for the first and second columns, respectively, are “This person has blonde hair” and “This person has wavy hair.”

\* Work done during internship at MERL.

1 Introduction

Deep diffusion-based probabilistic generative models [14, 40, 8] are quickly emerging as one of the most powerful methods to synthesize high-quality content and have shown the potential to revolutionize content creation not only in computer vision, but also in many other areas including speech, audio, and language. Such models (e.g., Imagen [36], Stable Diffusion [34]) have demonstrated outstanding synthesis results in conditional generation tasks, such as text-conditioned image synthesis [2, 33] and image reconstruction [35, 38, 31]. However, these models do not typically possess zero-shot conditional generative abilities when used directly (zero-shot capabilities of the kind commonly seen in language foundation models such as GPT-3 [4]), and often demand large amounts of annotated and paired (multimodal) data for conditional generation, which may be challenging to obtain [15].

One way to circumvent this need for large annotated training sets is to leverage predefined models as plug-and-play modules [25, 12, 24] in an otherwise unconditionally trained diffusion model. Specifically, in such approaches, a model is first trained in an unconditional setting (without labels). During inference, the plug-and-play modules (networks separately trained for a particular conditional task, e.g., image captioning) are incorporated in the reverse diffusion process to guide the intermediate samples in the Markov chain in specific directions that satisfy the desired condition. Prior works, such as [25, 26], have proposed similar methods in which the authors derive text- or class-conditioned samples from Generative Adversarial Networks (GANs) [11] that were trained without labels. To achieve this, they iteratively refine the noise input of the GAN until the sample satisfies the condition. Very recently, Graikos et al. [12] proposed a diffusion-based plug-and-play method that enables using unconditional diffusion models for conditional generation with class labels. Both of these methods are specifically designed for tasks involving label-level semantics. However, they do not address the use of unconditional models for general image-to-image translation tasks, which require synthesizing visual content conditioned on fine-grained details in the source image. There are also works that propose diffusion models for image-to-image translation, such as for image super-resolution and inpainting [5, 20]; however, these methods are task-specific and do not generalize well to new tasks or new types of inverse problems (as demonstrated in Section 5). In this work, we present a generic framework that can generalize to any image-to-image translation task.


Figure 2: An illustration of the difference between existing plug-and-play generation approaches (e.g., [12]) and the proposed approach. Existing plug-and-play works operate with an energy function $V$ of the noisy latent $x_t$. In contrast, our model uses the implicit prediction of the diffusion model (i.e., a coarse estimate of the clean image $x_0$) in its energy function $V_1$, which allows the use of any pre-trained network for steering. In addition, our model provides a looping mechanism $V_2$, which iterates $N$ times at each timestep $t$ to enhance generation quality.

In this paper, we derive the necessary theory and formulate an algorithm, which we call Steered Diffusion, for diffusion-based image editing and image-to-image translation; our model is subsequently validated on a wide range of tasks. Steered Diffusion is motivated by the energy-based formulation of diffusion probabilistic models [10]. In general, inference in a generative model can be thought of as deriving samples from a learned distribution. Recall that every probability density function can be formulated as an energy field that describes an unnormalized estimate of how the distribution density varies in space [13, 25]. If one needs to find points in space that best match a given condition, one can utilize gradient-based optimization to find points in the field that have the highest density value for the condition. The gradient-based optimization scheme can be viewed as a modulation of the energy toward the desired direction. Previous work has applied this idea to GANs [25, 26] and obtained reasonable results for label-based generation tasks. Due to their model structure, diffusion models are ideal candidates for such an energy modulation. One key challenge remains: designing a good energy estimator that is robust to all noise levels. Previously, classifier-based guidance [27, 8] has been proposed, which can be viewed as an energy modulation utilizing a pretrained classifier trained on noisy images. This imposes the limitation that the guiding function must be noise-robust. In this work, we propose an alternative solution that does not need noise-robust networks but can use any network, by utilizing the diffusion model as an implicit denoiser. Figure 2 gives a brief overview of how our approach differs from existing methods.

We present experiments using Steered Diffusion on multiple conditional generative tasks on faces as well as generic images, as portrayed in Figure 1. We present results on (i) identity replication [7], (ii) semantic image generation [29], (iii) linear inverse problems [21], and (iv) text-conditioned image editing. Although our method is generic, we focus our evaluations on faces. Before presenting our framework in detail, we summarize the key contributions of our work:

- We propose Steered Diffusion, a general plug-and-play framework that can utilize various pre-existing models to steer an unconditional diffusion model.

- We present the first work applicable to both label-level synthesis and image-to-image translation tasks, and demonstrate its effectiveness for various applications.

- We propose an implicit conditioning-based sampling strategy that significantly boosts the performance of conditional sampling from unconditional diffusion models compared with previous methods.

- We introduce a new strategy that uses multiple steps of projected gradient descent to improve sample quality.

2 Background

2.1 Related Work

Early works on unpaired image-to-image translation utilize a cycle-consistency loss between the input and target domains [46, 9]. Newer works, such as [16], have introduced a contrastive learning-based approach in which a contrastive loss between corresponding patches of the input and target domains is minimized. Consistency-based methods often fail to generate photorealistic images; hence, conditional generative models are preferred when labeled data are available. A few works [37, 38] utilize diffusion models for conditional image-to-image translation because of their photorealistic generation quality.

Guiding diffusion models at inference time has been explored by several works, such as [24]. The first method that proposed inference-time conditioning [8] uses a pretrained noise-robust classifier to guide the inference of an unconditional model. GLIDE [27] proposed a method for conditioning using text. Earlier work on plug-and-play modelling for generative models utilized GANs and performed iterative refinement on the GAN latent space [25]. This method uses a predefined classifier or text-captioning network to estimate a loss between the desired label or text caption and the one generated from the GAN generator. This loss is backpropagated to refine the noise input of the GAN iteratively until the generator predicts the desired output. Recently, [12] proposed a method that uses diffusion models as a plug-and-play prior for class-conditioned generation. Several works have addressed the task of image-to-image translation using unconditional diffusion models [18, 5, 20, 1], but each of these proposes a task-specific inference scheme. For example, ILVR [5] performs image super-resolution, and RePaint [20] performs image inpainting. Blended Diffusion [1] proposes a method for text-conditioned image editing. DDRM [18] proposes an inference-time scheme offering a general solution for linear inverse problems such as colorization and super-resolution.

2.2 Concurrent Work

In concurrent work that also explores zero-shot conditional generation using diffusion models, [3] used a text-to-image model [34] and a two-step forward and backward universal guidance process, but it works well only after heavy optimization on network-based inverse problems such as semantic generation and identity generation. Another concurrent work [42] explored the use of unconditional diffusion models for linear inverse problems using a pseudo-inverse model. In contrast to these prior works, our Steered Diffusion algorithm generalizes well to both image-to-image translation tasks and high-level label-based generation tasks.

2.3 Denoising Diffusion Probabilistic Models

Denoising diffusion probabilistic models (DDPMs) [40, 14] belong to a class of generative models in which the model learns the distribution of data through a Markovian sampling process. DDPMs consist of a forward process and a reverse process. Let $x_t$ denote the latent state of an input image at timestep $t$ in a diffusion process. The sampling operation $q(\cdot)$ for the forward process in DDPM is defined as:

$$q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad (1)$$

where $\{\beta_t\}$ is a predefined variance schedule and $I$ is the identity matrix.

The forward process can be considered a noising operation, where the next state $x_t$ is obtained from the current state $x_{t-1}$ by adding a small amount of Gaussian noise according to the variance schedule at that timestep. The state $x_t$ at timestep $t$ can also be sampled directly from the initial state $x_0$, using:

$$q(x_t \mid x_0) := \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big), \qquad (2)$$

or equivalently,

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \qquad (3)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ and $\alpha_t = 1 - \beta_t$.
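As an illustration, the closed-form forward sampling of Eq. (3) takes only a few lines. This is a minimal sketch; the linear $\beta$ schedule, the number of timesteps, and the tensor shapes below are assumptions for the example, not the paper's exact settings.

```python
import numpy as np

# Assumed linear variance schedule (illustrative values, not the paper's).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) directly via Eq. (3)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 64, 64))  # stand-in for a normalized image
x500 = q_sample(x0, t=500, rng=rng)    # heavily noised latent
```

By the last timestep, $\bar{\alpha}_T$ is close to zero, so $x_T$ is essentially pure Gaussian noise, which is what the reverse process starts from.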

In [40], it is shown that if the number of timesteps is large and the increments in $\{\beta_t\}$ are small, then each step in the reverse sampling process can also be approximated by a Gaussian. If $\mu_\theta$ and $\Sigma_\theta$ respectively denote the mean and the covariance of this Gaussian, modeled via neural networks with parameters $\theta$, then each reverse step samples the state $x_{t-1}$ according to:

$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big). \qquad (4)$$

The parameters $\theta$ are obtained by minimizing the variational lower bound on the negative log-likelihood of the data distribution.
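Under the common $\epsilon$-prediction parameterization, one reverse step of Eq. (4) can be sketched as below. The schedule and the fixed choice $\Sigma_\theta = \beta_t I$ are illustrative assumptions, and `eps_pred` stands in for a trained network's output $\epsilon_\theta(x_t, t)$.

```python
import numpy as np

# Assumed schedule, mirroring the forward-process sketch.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(xt, t, eps_pred, rng):
    """One ancestral sampling step x_t -> x_{t-1}: mu_theta is expressed
    through the predicted noise eps_pred; variance is fixed to beta_t."""
    mean = (xt - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) \
           / np.sqrt(alphas[t])
    if t == 0:
        return mean  # the final step is taken deterministically
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
```

Iterating this function from $t = T-1$ down to $0$, starting from pure Gaussian noise, yields an unconditional sample; the conditional variant in Section 3 modifies exactly this step.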

3 Proposed Method


Figure 3: An illustration of Steered Diffusion. During each step of the sampling process, the implicit prediction is steered in the direction of the condition using a steering network or predefined function. Note that this figure only illustrates the idea and does not show the actual sampled images; in actual sampling, the steering process is much more gradual, not sudden as portrayed in this image.

3.1 Steered Diffusion at Inference Time

Our work is motivated by the energy-based formulation of diffusion models. For any probability density function, the corresponding energy-based model (EBM) is defined by:

$$p_\theta(x) = \frac{\exp\big(-V(x)\big)}{Z}, \qquad (5)$$

where $V(x)$ denotes the corresponding energy function across states $x$, and $Z$ denotes a normalization constant. To derive samples from this distribution, one can utilize the Langevin equation [39] describing the state transition of a particle in the presence of an energy field. For diffusion models, the sampling step is

$$x_{t-1} = x_t - \nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \qquad (6)$$

The term $\nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t)$ is called the score function of the density $p_\theta(x_{t-1} \mid x_t)$. One key advantage of the energy-based formulation is that it allows modulation of the energy function to satisfy given criteria. This was initially introduced as classifier guidance [8], which allows label-conditional sampling from an unconditionally trained diffusion model utilizing a noise-robust classifier. In the remainder of this section, we motivate how the functionality of unconditional diffusion models can be extended to conditional tasks. Consider a conditional sampling scenario based on a condition $c$, for sampling from a state $x_t$ to a state $x_{t-1}$. The conditional transition probability $p_\theta(x_{t-1} \mid x_t, c)$ can be decomposed as

$$p_\theta(x_{t-1} \mid x_t, c) \propto \frac{p_\theta(x_{t-1} \mid x_t)\,p(c \mid x_{t-1})}{p(c \mid x_t)}. \qquad (7)$$

Hence, for any timestep $t$, the effective score for the conditional transition can be found by taking the log of each probability density in the EBM formulation (5) of the individual densities, and can be represented as

$$\nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t, c) = \nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t) - \nabla_{x_t} V_1(x_t, c) + \nabla_{x_t} V_2(x_{t-1}, c), \qquad (8)$$

where $V_1$ and $V_2$ are the corresponding energy functions that model the conditional distributions of $x_t$ and $x_{t-1}$ given a condition $c$. Specifically, they project the higher-dimensional $x_t$ to the lower-dimensional space of $c$ and measure the distance between the mapped value and $c$. The better this measure, the more effectively it can be used to generate conditional samples from an unconditional model. Using Eq. (8), the conditional sampling equation for the reverse process is

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta^{(t)}(x_t, t)\right) - \nabla_{x_t} V_1(x_t, c) + \nabla_{x_t} V_2(x_{t-1}, c) + \sigma_t \epsilon. \qquad (9)$$

Here $\epsilon_\theta^{(t)}(x_t, t)$ is the network prediction at timestep $t$, and $\sigma_t$ is the corresponding variance of the reverse step. The formulation in Eq. (9) shows that the energy function requires a functional mapping from a noisy $x_t$ to $c$.
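The conditional step in Eq. (9) can be sketched as the unconditional mean shifted by the two guidance gradients before noise is added. In this sketch, the callables `grad_v1` and `grad_v2` are hypothetical stand-ins for gradients obtained by backpropagating the energies through a steering network, and $\nabla V_2$ is evaluated at a provisional $x_{t-1}$ (the unconditional mean), since $x_{t-1}$ itself is not yet available.

```python
import numpy as np

# Assumed schedule, as in the earlier sketches.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def steered_step(xt, t, eps_pred, grad_v1, grad_v2, rng):
    """One conditional reverse step following Eq. (9): the unconditional
    DDPM mean is shifted by -grad(V1)(x_t) and +grad(V2)(x_{t-1})."""
    mean = (xt - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) \
           / np.sqrt(alphas[t])
    x_prov = mean  # provisional x_{t-1} at which V2's gradient is evaluated
    guided = mean - grad_v1(xt) + grad_v2(x_prov)
    return guided + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
```

For instance, a toy quadratic energy $V(x, c) = \tfrac{\lambda}{2}\lVert x - c\rVert^2$ gives the gradient $\lambda(x - c)$, which nudges each step toward $c$; in practice the energies come from a pre-trained mapping network, as discussed next.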

In many applications of interest, the mapping function from $x_t$ to $c$ is complex and can be modeled effectively using deep networks. For example, in image-to-image translation tasks such as image generation from semantic maps [29], the image is $x_t$ and the semantic map is the condition $c$. Similarly, for text-to-image generation, the image is $x_t$ and $c$ corresponds to the text. Ideally, we would like to employ an existing (pre-trained) deep neural network for the mapping from $x_t$ to $c$. However, deep networks are usually trained on clean images, which limits the usability of existing pre-trained networks for mapping directly from the noisy image $x_t$ to $c$. One workaround would be to use noise-robust networks to map from $x_t$ to $c$, but training noise-robust networks for conditional mapping can be computationally expensive. Moreover, a network trained on multiple different noise levels often yields lower mapping performance, as it cannot handle all noise levels accurately; we validate this claim experimentally in Section 5.6. Alternatively, one could compose two mapping functions: a first that denoises $x_t$, and a second that maps from the denoised image to $c$.

Rather than training a separate denoising network, however, we observe that diffusion models are inherently trained as denoisers, and reconstruction quality improves as time proceeds in the reverse sampling of the diffusion process. Because of this capacity, we can use a reverse sampling step to make a coarse prediction of the denoised image from any timestep $t$.

Hence, we modify our original energy expression (8) to:

$$\nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t, c) = \nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t) - \nabla_{x_t} V_1(x_{0|t}, c) - \delta_1 + \nabla_{x_t} V_2(x_{0|t-1}, c) + \delta_2, \qquad (10)$$

where we define the implicit step prediction $x_{0|t}$ as:

$$x_{0|t}=\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\bar{\alpha}_t}}.\tag{11}$$

Here, we assume $x_t$ and $x_{t-1}$ are first denoised to $x_{0|t}$ and $x_{0|t-1}$, respectively. The terms $\delta_1$ and $\delta_2$ capture the errors arising from the shift in domain from $x_t$ to $x_0$ and from $x_{t-1}$ to $x_0$; for large $t$, $\delta_1\approx\delta_2$, as the implicit predictions at nearby steps tend to be similar.
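
In code, the implicit prediction of Eq. (11) is a one-line rescaling by the noise schedule. The sketch below is illustrative: `eps_pred` is a stand-in for the network output $\epsilon_\theta^{(t)}(x_t)$, not the actual model.

```python
import numpy as np

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Implicit prediction x_{0|t} from Eq. (11): invert the forward process
    x_t = sqrt(alpha_bar)*x_0 + sqrt(1 - alpha_bar)*eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Sanity check: noising a known x_0 and denoising with the true eps recovers x_0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
a_bar = 0.6
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
recovered = predict_x0(x_t, eps, a_bar)
```

With the true noise, the inversion is exact; with a learned $\epsilon_\theta$, it is a coarse estimate that sharpens as $t$ decreases.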

As shown in our experiments, the energy function should be selected according to the task. A simple heuristic is to reuse the training loss of the mapping network. For semantic generation, a good energy function is the cross-entropy loss between the semantic map predicted at any timestep and the input semantic map. For identity replication, a good choice is the negative cosine similarity between the embeddings that a recognition network produces for the input and target images. For text-to-image generation, it is the CLIP loss [32]. As a rule of thumb, the energy function can be chosen by inspecting the loss used to train the pre-trained network (or an inverse function) that maps from the image $x$ to the condition $c$.
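
To make this rule of thumb concrete, two of the energies named above are sketched below. The inputs (predicted logits, face embeddings) would in practice come from pre-trained mapping networks; here they are plain arrays used only for illustration.

```python
import numpy as np

def cross_entropy_energy(pred_logits, target_map):
    """Semantic generation: cross-entropy between the semantic map predicted
    from x_{0|t} and the input map (classes along the last axis, one-hot target)."""
    logp = pred_logits - np.log(np.sum(np.exp(pred_logits), axis=-1, keepdims=True))
    return -np.mean(np.sum(target_map * logp, axis=-1))

def identity_energy(emb_pred, emb_target):
    """Identity replication: negative cosine similarity between embeddings
    from a face recognition network; minimized when identities match."""
    cos = np.dot(emb_pred, emb_target) / (
        np.linalg.norm(emb_pred) * np.linalg.norm(emb_target))
    return -cos
```

Because each energy is differentiable in its first argument, its gradient with respect to $x_t$ (through the mapping network and $x_{0|t}$) can serve directly as the steering term in Eq. (15).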

Algorithm 1 Steered Diffusion

1: Input: energy function $V$, condition $c$
2: Sample $x_T \sim \mathcal{N}(x_T; 0, I)$
3: for $t = T-1, \ldots, 1$ do
4:   for $n = N, \ldots, 1$ do
5:     Sample $\epsilon \sim \mathcal{N}(\epsilon; 0, I)$
6:     $x_{0|t} = \big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)\big)/\sqrt{\bar{\alpha}_t}$
7:     Compute $x_{0|t}^{\mathrm{feas}}$ with $V, c$ using Eq. (15)
8:     if $n > 1$ then
9:       Compute $x_t^{uc}$ using Eq. (13)
10:    else
11:      Compute $x_{t-1}^{uc}$ using Eq. (13)
12:    end if
13:  end for
14: end for
15: return $x_0$
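
The overall structure of the sampling loop can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `eps_model` and `energy_grad` are hypothetical placeholders for the pretrained noise predictor $\epsilon_\theta$ and for $\nabla V(\cdot, c)$ (in practice obtained by automatic differentiation), and the schedule values are arbitrary.

```python
import numpy as np

def steered_diffusion(eps_model, energy_grad, alpha_bar, T, N, k, sigma, shape, rng):
    """Sketch of Algorithm 1: DDIM-style sampling with Implicit Steering Control."""
    x_t = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        for n in range(N, 0, -1):                       # multi-step modulation
            eps = rng.standard_normal(shape)
            eps_pred = eps_model(x_t, t)
            # Eq. (11): implicit prediction of x_0 from x_t
            x0_t = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
            # Eq. (15): steer the implicit prediction towards the condition
            x0_feas = x0_t - k(t) * energy_grad(x0_t)
            # Eq. (13), with the steered prediction substituted for x_{0|t}
            a_prev = alpha_bar[t - 1]
            dir_xt = (x_t - np.sqrt(alpha_bar[t]) * x0_feas) / np.sqrt(1 - alpha_bar[t])
            x_prev = (np.sqrt(a_prev) * x0_feas
                      + np.sqrt(max(1 - a_prev - sigma[t] ** 2, 0.0)) * dir_xt
                      + sigma[t] * eps)
            if n > 1:
                # loop back: re-noise x_{t-1} to the x_t noise level and steer again
                beta = 1 - alpha_bar[t] / a_prev
                x_t = np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * rng.standard_normal(shape)
            else:
                x_t = x_prev
    return x_t
```

With $N > 1$, the inner loop performs the multi-step implicit modulation described in Sec. 4.2.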

3.2 Revisiting Sampling in Diffusion Models

To obtain a closed-form expression for efficiently plugging the energy-based formulation into the reverse sampling process, we take inspiration from DDIM [41] and revisit the reverse sampling operation of diffusion models. From $p_\theta(x_{1:T})$, one can generate a sample $x_{t-1}$ from a sample $x_t$ by:

$$x_{t-1}^{\text{uc}}=\sqrt{\bar{\alpha}_{t-1}}\cdot\underbrace{\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\bar{\alpha}_t}}}_{\text{``predicted }x_0\text{''}}+\underbrace{\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(x_t)}_{\text{``direction pointing to }x_t\text{''}}+\underbrace{\sigma_t\epsilon}_{\text{random noise}},\tag{12}$$

as in Song et al. [41]. Using Eq. (11), we can rewrite the unconditional sampling step Eq. (12) in terms of $x_{0|t}$:

$$x_{t-1}^{\text{uc}}=\sqrt{\bar{\alpha}_{t-1}}\,x_{0|t}+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\frac{x_t-\sqrt{\bar{\alpha}_t}\,x_{0|t}}{\sqrt{1-\bar{\alpha}_t}}+\sigma_t\epsilon,\tag{13}$$

Here the superscript uc denotes the unconditional sample, obtained without any steering while transitioning from $x_t$ to $x_{t-1}$. The conditional sampling step (8) can then be rewritten as

$$x_{t-1}=x_{t-1}^{\text{uc}}-\nabla_{x_t}V_1(x_{0|t},c)+\nabla_{x_t}V_2(x_{0|t-1},c).\tag{14}$$

Through this, we can modulate $x_0$ directly, as $x_{0|t}$ is also a function of $\epsilon_\theta^{(t)}(x_t)$. Following Eq. (14), a rough estimate of the desired $x_0$ for conditional sampling, denoted $x_{0|t}^{\mathrm{feas}}$, can be obtained using

$$x_{0|t}^{\mathrm{feas}}=x_{0|t}-k(t)\,\nabla_{x_t}\left(V_1(x_{0|t},c)-V_2(x_{0|t-1},c)\right),\tag{15}$$

where $k(t)$ is a scaling factor defining the strength of the regularization. We call the process of finding $x_{0|t}^{\mathrm{feas}}$ from $x_{0|t}$ Implicit Steering Control (ISC), and we call the new sampling process steered diffusion. The exact algorithm is illustrated in Fig. 3 and explained in Algorithm 1.

4 Tips for Improved Performance

4.1 Linear Inverse Problems

For optimization-based inverse problems such as text-to-image generation and semantic-map-to-image generation, the exact mapping function is not always available. For linear inverse problems such as colorization, super-resolution, and image inpainting, on the other hand, the mapping is simply a known linear function, and Eq. (15) can be written more simply. In these cases, the exact mapping to the latent space of the condition is known, so one can decompose the implicit prediction at each timestep along the direction of the condition and replace that component by the desired ideal condition. That is, if our predicted sample should map to a condition $c$, then the modified implicit prediction step becomes

$$x_{0|t}^{\mathrm{feas}}=x_{0|t}+k(t)\left(D(y)-D(x_{0|t})\right),\quad\text{where }c=D(y).\tag{16}$$

Here, $y$ is the clean image and $D$ is the known degradation model. Our sampling procedure ensures that this series of operations preserves the consistency of the domains of $x_{t-1}$ and $x_{0|t}^{\mathrm{feas}}$ with the original data distribution at the corresponding timesteps. An illustration is shown in Figure 4.
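
As a concrete linear example, the sketch below uses a 2x low-pass filter as the degradation $D$ (average-pool then nearest-neighbor upsample, so $D$ maps an image back into image space and Eq. (16) is shape-consistent). This ILVR-style choice of $D$ is an assumption for illustration, not the paper's exact operator.

```python
import numpy as np

def lowpass(x, f=2):
    """Degradation D: f-times average-pool, then nearest-neighbor upsample,
    so D(x) lives in image space (low-frequency content of x)."""
    h, w = x.shape
    pooled = x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, f, axis=0), f, axis=1)

def linear_steer(x0_t, y, k=1.0):
    """Eq. (16): replace the low-frequency component of the implicit
    prediction x_{0|t} with that of the reference image y."""
    return x0_t + k * (lowpass(y) - lowpass(x0_t))
```

Because this $D$ is linear and idempotent, with $k=1$ the steered prediction satisfies $D(x_{0|t}^{\mathrm{feas}}) = D(y)$ exactly, i.e., the condition is enforced while the high-frequency content of $x_{0|t}$ is preserved.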


Figure 4: An illustration of the steering function for linear inverse problems. For linear inverse problems, the component of the implicit prediction along the degradation direction can be replaced by the ground truth condition.

4.2 Multi-Step Implicit Modulation

Our experiments show (see Fig. 7) that performing the refinement of Eq. (15) on the implicit step prediction multiple times per timestep significantly boosts the conditioning quality for more ill-posed conditions such as image inpainting and colorization. A similar observation was made in [20]. Specifically, at a particular timestep $t$, we iterate the procedure of steering towards the next sampling step $x_{t-1}$ and then adding noise to return to $x_t$. The corresponding procedure is given in Algorithm 1. An example is shown in Fig. 7, where more realistic images are generated using the multiple-step sampling scheme (row labeled "OURS multi"). Effectively, the $V_2$ term in Eq. (15) can be thought of as enabling multistep sampling in which we modulate the current step by looking ahead to the next sampling step. Examining Eq. (7), i.e., the score contributions of the different regularization functions, the term $\nabla_{x_t}V_1(x_t,c)$ modulates $x_t$ based on its current state, while the term $\nabla_{x_t}V_2(x_{t-1},c)$ is a look-ahead correction in which the derivative is taken with respect to the future prediction. This is exactly what happens when looping back from $x_{t-1}$ to $x_t$: $x_t$ is modulated iteratively by looking forward to what the future prediction would be.
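
The loop-back step itself is a single forward diffusion step from $x_{t-1}$ back to the $x_t$ noise level; a minimal sketch (the schedule values in the test are illustrative):

```python
import numpy as np

def renoise(x_prev, alpha_bar_t, alpha_bar_prev, rng):
    """One forward step x_{t-1} -> x_t, i.e., sample q(x_t | x_{t-1})
    with per-step alpha_t = alpha_bar_t / alpha_bar_{t-1}."""
    alpha_t = alpha_bar_t / alpha_bar_prev
    return (np.sqrt(alpha_t) * x_prev
            + np.sqrt(1.0 - alpha_t) * rng.standard_normal(x_prev.shape))
```

Alternating this re-noising with the steering of Eq. (15) yields the iterative look-ahead modulation described above.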

4.3 Choosing the Scaling Factor $k(t)$

In Eq. (15), $k(t)$ denotes the strength of the regularization constraint. A very small $k(t)$ provides no effective regularization, while a large $k(t)$ causes the diffusion process to leave the latent-space manifold. Since the derivative of the regularization function is itself a score value, analogous to the normal scaling of the score function, the appropriate time-varying normalization factor is $\sqrt{1-\bar{\alpha}_t}$. The exact value of $k(t)$ for each task is given in Table 1. For linear inverse problems we use a constant $k(t)=1$, which provided the best results.
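
The resulting time-varying scale is a one-liner; the global constant $K$ is task-dependent (per-task values are in Table 1, and the value in the test below is arbitrary):

```python
import numpy as np

def k_schedule(alpha_bar, K=1.0):
    """k(t) = K * sqrt(1 - alpha_bar_t): large steering steps early in
    sampling (small alpha_bar_t), shrinking toward zero as t -> 0."""
    return K * np.sqrt(1.0 - alpha_bar)
```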

5 Experiments

We evaluate our method qualitatively and quantitatively on four image-to-image translation tasks (semantic layout to face image translation, face inpainting, face colorization, and face super-resolution) as well as two high-level vision tasks: identity-based image generation and text-guided image editing. Unlike existing approaches, our method is completely zero-shot and applies to a wide variety of tasks. For a fair evaluation, we compare with the diffusion-based approaches best suited to each task. We also compare semantic layout to image translation performance with that of task-specific unsupervised methods. We choose the unconditional model released by [5] as the pretrained unconditional diffusion model in all of our experiments with faces. Note that the sampling schemes in ILVR [5] and RePaint [20] can be thought of as operating at time $t$ rather than at the implicit step as in our method. Hence, the comparisons with these methods for super-resolution and inpainting in our experiments below can be considered an additional ablation study highlighting the improvement from our implicit sampling.

Table 1: Parameter set for each application.

5.1 Implementation Details

For our experiments, we utilize pixel-level unconditional diffusion models. For faces, we use the model trained on the FFHQ dataset [17] that was released by [5]. For generic images, we use the model trained on ImageNet [6] released in ADM [8]. All our experiments use 100 sampling steps.

5.2 Semantic Face Generation

Figure 5: Qualitative comparisons for semantic generation. Panels, left to right: (a) Labels, (b) CycleGAN, (c) CUT, (d) ILVR, (e) OURS.

Figure 6: Results on 8x super-resolution.

To evaluate how our method performs on generic image-to-image translation tasks, we evaluate its performance on semantic layout to face generation, using the CelebA dataset. To generate the semantic labels, we use facer [45] and create 11 label classes for each face. Since no other unconditional model can perform fully test-time semantic generation, we compare with fully unsupervised image translation methods: CycleGAN [46], CUT [28], and ILVR [5].

The corresponding qualitative results are shown in Fig. 5. CUT, CycleGAN, and ILVR produce unrealistic facial images with large artifacts or create low-resolution faces. In contrast, our method consistently creates good-quality, realistic faces. We present the quantitative results in Table 2: our method obtains the best FID scores of all methods and the best mIoU score among the inference-time techniques.

5.3 Face Super-Resolution

Figure 7: Qualitative comparisons for colorization. The row labeled "OURS multi" refers to the use of multi-step sampling, as described in Sec. 4.2.

Figure 8: Qualitative comparisons for inpainting with thin, medium, and thick masks. Panels repeat (Degraded, RePaint, Ours) for each mask type.

Figure 9: Qualitative comparisons for text-based image editing. Panels: (a) Original image, (b) "Photo of a young man", (c) "She has wavy hair", (d) "She has blonde hair", (e) "Photo of an old woman", (f) "She is angry", (g) "She is sad".

We evaluate face super-resolution on the CelebA dataset [19]. As baselines, we use fully inference-time methods in which no task-specific training is performed. The first baseline is PULSE [22], a self-supervised upsampling technique based on GANs. The second is ILVR [5], which, like our method, performs super-resolution with an unconditional pre-trained diffusion model; however, in ILVR, sampling happens at timestep $t$ rather than at the implicit step as in our algorithm. In total, we use 300 images for evaluation. We present some qualitative results in Fig. 6. For ILVR [5], we use 100 sampling timesteps, the same as in our case. PULSE [22] and ILVR [5] are unable to restore the correct identity and leave blur artifacts after restoration. In contrast, steered diffusion (our method) restores photorealistic facial images. The quantitative evaluations are presented in Table 3; our method yields a 0.18 improvement in perceptual similarity, a 6.95 dB improvement in PSNR, and a 0.24 improvement in SSIM over all of the other comparison methods.

Table 2: Quantitative results for semantic generation

Table 3: Quantitative results for super-resolution

5.4 Face Colorization

As a baseline method, we modify ILVR [5] to suit the task of colorization: rather than enforcing the constraint at every step, we start the sampling process from a noised grayscale image and enforce consistency between the generated and original grayscale images. In total, we use 300 images for evaluation. The corresponding results can be seen in Fig. 7. Our method reconstructs photorealistic faces with naturalistic colors compared to ILVR [5]. The corresponding quantitative metrics are presented in Table 4: we get a significant boost in performance, with an FID score of 19, an LPIPS [44] score of 0.19, and an NIQE [23] score of 1.5.

Table 4: Quantitative results for colorization

Table 5: Quantitative results for inpainting.

Figure 10: Sample variation with scaling factor $k(t)=K\sqrt{1-\bar{\alpha}_t}$. Panels: (a) Semantic map, (b) $K=0$, (c) $K=200$, (d) $K=2000$, (e) $K=20000$, (f) $K=2\times 10^5$, (g) $K=2\times 10^6$.

Figure 11: Qualitative comparisons for colorization. Panels: (a) Degraded, (b) Noise-robust classifier, (c) OURS.

5.5 Inpainting, Image Editing, and Identity Replication

Our method can also use multiple conditions simultaneously; we provide an illustration in Figure 9, where we condition on an identity-preserving network and a text caption at the same time. The figure shows that our method generalizes well to a diverse range of captions. To preserve identity, we use the VGGFace network [30]. To enforce the captions, we use FARL [45], which is pre-trained on pairs of faces and corresponding text. For generic identity replication, as in Figure 1, we use the FARL face embedder.

For our image inpainting experiments, we use the subset released by [43] and evaluate on three different kinds of masks. Our method obtains better results than existing baselines across all mask variations. Qualitative and quantitative results are shown in Fig. 8 and Table 5, respectively.

5.6 Ablation Study

Effect of scaling factor $k(t)$ for semantic generation: In this section, we analyze how the scaling factor affects sample quality under the complex conditioning of semantic generation. Fig. 10 shows the variation in sample quality, starting from the same initial noise, for different scaling factors $k(t)=K\sqrt{1-\bar{\alpha}_t}$. Sample quality is poor for very low scaling factors, and for very high scaling factors the diffusion process escapes the manifold of natural face images. We show the variation in sample quality for a fixed scaling factor versus a time-varying scaling factor in Fig. 12, which demonstrates that a time-varying scale factor produces more realistic samples. This is because the effective variance of the noise schedule, which controls the amount of regularization possible at a particular timestep, decreases as the generation process proceeds. Hence a larger tweak is permissible in the early steps of diffusion, and only very small tweaks are permitted in the later steps.

Noise-robust classifier: To validate the claims in Sec. 3.1, we train a noise-robust inverse mapper for the task of colorization and show the outputs of the noise-robust classifier and of our diffusion approach at different noise levels in Fig. 11. The noise-robust classifier fails to preserve key details that our approach preserves.

Limitations: Although our method generalizes to a wide range of tasks, one limitation is that the value of the scaling factor $k(t)$ must be found empirically for each task. However, once a few images are used to tune $k(t)$, the model generalizes well to other conditioning images for the same task. Like any other conditional generation model capable of image editing, our method has potential societal impacts, and care must be taken in applying it.

Figure 12: A comparison of a time-varying scaling factor with a non-time-varying one.

6 Conclusion

In this paper, we propose the first plug-and-play conditional generation framework that generalizes to both image-to-image translation tasks and label-based generation tasks. To this end, we use the energy-based formulation of diffusion models and modulate the inference process using a task-specific predefined network or other preexisting function. Furthermore, we introduce a novel implicit-sampling technique that improves sample quality across multiple tasks. Our experiments on a variety of tasks show that our method generalizes across them and outperforms existing methods that do not require additional training.

References

  • [1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
  • [2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • [3] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. arXiv preprint arXiv:2302.07121, 2023.
  • [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [5] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [7] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019.
  • [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
  • [9] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, Kun Zhang, and Dacheng Tao. Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [10] Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P Kingma. Learning energy-based models by diffusion recovery likelihood. arXiv preprint arXiv:2012.08125, 2020.
  • [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [12] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. arXiv preprint arXiv:2206.09012, 2022.
  • [13] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.
  • [14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • [15] Xun Huang, Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Multimodal conditional image synthesis with product-of-experts gans. arXiv preprint arXiv:2112.05130, 2021.
  • [16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • [18] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv preprint arXiv:2201.11793, 2022.
  • [19] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [20] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  • [21] Kangfu Mei, Nithin Gopalakrishnan Nair, and Vishal M Patel. Bi-noising diffusion: Towards conditional diffusion models with generative restoration priors. arXiv preprint arXiv:2212.07352, 2022.
  • [22] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2445, 2020.
  • [23] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2012.
  • [24] Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. Unite and conquer: Plug & play multi-modal synthesis using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6070–6079, 2023.
  • [25] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4467–4477, 2017.
  • [26] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in neural information processing systems, 29, 2016.
  • [27] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [28] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 319–345. Springer, 2020.
  • [29] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019.
  • [30] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
  • [31] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10619–10629, 2022.
  • [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [33] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • [35] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
  • [36] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • [37] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
  • [38] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [39] Ken Sekimoto. Langevin equation and thermodynamics. Progress of Theoretical Physics Supplement, 130:17–27, 1998.
  • [40] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [42] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2022.
  • [43] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
  • [44] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [45] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. arXiv preprint arXiv:2112.03109, 2021.
  • [46] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

Supplementary Material

A1. Visualization of diffusion steering process

We present visualizations of intermediate outputs of the diffusion steering process in Figure 14 and Figure 15. The left half shows sampling without the steering loss, and the right half shows sampling with it. With the steering loss, the generated images are consistent with the semantic maps from an early stage, and the results continue to improve; the consistency grows stronger as the timesteps progress.

A2. Illustrating sample diversity using our method

We present non-cherry-picked results for various conditional generation tasks to demonstrate the photorealism and diversity of the images our method generates. We use the same noise across different examples, and the conditioning input is shown as the first image in each sequence.
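The fixed-noise protocol used in these figures (the same noise at the same sample position, regardless of the conditioning input) can be reproduced, for instance, by seeding each position's noise draw with its index. The helper below is an illustrative assumption, not the authors' code:

```python
import numpy as np

def noise_for_position(position, shape, base_seed=0):
    """Return identical Gaussian noise for a given sample position
    across all conditioning inputs, so that differences between the
    rows of a comparison figure reflect only the condition, not the seed."""
    rng = np.random.default_rng(base_seed + position)
    return rng.standard_normal(shape)
```

Sampling position `i` for every condition with `noise_for_position(i, ...)` then makes columns of the figures directly comparable.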

[Figure 13 panels, images omitted in this extraction: (a) Labels, (b) CycleGAN, (c) CycleGAN, (d) CUT, (e) CUT, (f) ILVR, (g) ILVR, (h) OURS, (i) OURS.]

Figure 13: Qualitative comparisons for the segmentation labels from Figure 5 of the main paper.


Figure 14: Evolution of the sampling process without and with guidance.


Figure 15: Evolution of the sampling process without and with guidance.


Figure 16: Non-cherry-picked samples from our method for the same semantic map. The upper-left image shows the semantic map. Samples at the same position across different examples use the same random seeds.


Figure 17: Non-cherry-picked samples from our method for the same semantic map. The upper-left image shows the semantic map. Samples at the same position across different examples use the same random seeds.


Figure 18: Non-cherry-picked samples from our method for the same identity image. The identity conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 19: Non-cherry-picked samples from our method for the same identity image. The identity conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 20: Non-cherry-picked grayscale-to-RGB colorization samples from our method. The grayscale conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 21: Non-cherry-picked grayscale-to-RGB colorization samples from our method. The grayscale conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 22: Non-cherry-picked super-resolution samples from our method. The low-resolution conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 23: Non-cherry-picked super-resolution samples from our method. The low-resolution conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.
