Title: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

URL Source: https://arxiv.org/html/2310.00224

Published Time: Tue, 03 Oct 2023 01:01:05 GMT

Nithin Gopalakrishnan Nair¹* Anoop Cherian² Suhas Lohit² Ye Wang²

Toshiaki Koike-Akino² Vishal M. Patel¹ Tim K. Marks²

¹Johns Hopkins University ²Mitsubishi Electric Research Laboratories (MERL)

{ngopala2,vpatel36}@jhu.edu {acherian,slohit,yewang,koike,tmarks}@merl.com

https://merl.com/demos/steered-diffusion

Abstract

Conditional generative models typically demand large annotated training sets to achieve high-quality synthesis. As a result, there has been significant interest in designing models that perform plug-and-play generation, i.e., to use a predefined or pretrained model, which is not explicitly trained on the generative task, to guide the generative process (e.g., using language). However, such guidance is typically useful only towards synthesizing high-level semantics rather than editing fine-grained details as in image-to-image translation tasks. To this end, and capitalizing on the powerful fine-grained generative control offered by the recent diffusion-based generative models, we introduce Steered Diffusion, a generalized framework for photorealistic zero-shot conditional image generation using a diffusion model trained for unconditional generation. The key idea is to steer the image generation of the diffusion model at inference time via designing a loss using a pre-trained inverse model that characterizes the conditional task. This loss modulates the sampling trajectory of the diffusion process. Our framework allows for easy incorporation of multiple conditions during inference. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution. Our results demonstrate clear qualitative and quantitative improvements over state-of-the-art diffusion-based plug-and-play models while adding negligible additional computational cost.

Figure 1: An illustration of various applications of our method. We use a diffusion model trained unconditionally and apply conditioning with our proposed algorithm only at test time. We present results on six tasks: (a) image inpainting, (b) colorization, (c) image super-resolution, (d) semantic generation, (e) identity replication, and (f) text-based image editing. In part (f), the text prompts for the first and second columns, respectively, are “This person has blonde hair” and “This person has wavy hair.”

\* Work done during internship at MERL.

1 Introduction

Deep diffusion-based probabilistic generative models [14, 40, 8] are quickly emerging as one of the most powerful methods to synthesize high-quality content and have shown the potential to revolutionize content creation not only in computer vision, but also in many other areas including speech, audio, and language. Such models (e.g., Imagen [36], Stable Diffusion [34]) have demonstrated outstanding synthesis results in conditional generation tasks, such as text-conditioned image synthesis [2, 33] and image reconstruction [35, 38, 31]. However, these models do not typically possess zero-shot conditional generative abilities when used directly (zero-shot capabilities of the kind commonly seen in language foundation models such as GPT-3 [4]), and often demand large amounts of annotated and paired (multimodal) data for conditional generation, which may be challenging to obtain [15].

One way to circumvent this need for large annotated training sets is to leverage predefined models as plug-and-play modules [25, 12, 24] in an otherwise unconditionally trained diffusion model. Specifically, in such approaches, a model is first trained in an unconditional setting (without labels). During inference, the plug-and-play modules (networks separately trained for a particular conditional task, e.g., image captioning) are incorporated in the reverse diffusion process to guide the intermediate samples in the Markov chain in specific directions that satisfy the desired condition. Prior works, such as [25, 26], have proposed similar methods in which the authors derive text- or class-conditioned samples from Generative Adversarial Networks (GANs) [11] that were trained without labels. To achieve this, they iteratively refine the noise input of the GAN until the sample satisfies the condition. Very recently, Graikos et al. [12] proposed a diffusion-based plug-and-play method that enables using unconditional diffusion models for conditional generation with class labels. Both of these methods are specifically designed for tasks involving label-level semantics. However, they do not address the use of unconditional models for general image-to-image translation tasks, which require synthesizing visual content conditioned on fine-grained details in the source image. There are also works that propose diffusion models for image-to-image translation, such as for image super-resolution and inpainting [5, 20]; however, these methods are task-specific and do not generalize well to new tasks or new types of inverse problems (as demonstrated in Section 5). In this work, we present a generic framework that can generalize to any image-to-image translation task.


Figure 2: An illustration of the difference between existing plug-and-play generation approaches (e.g., [12]) and the proposed approach. Existing plug-and-play works operate with an energy function $V$ of the noisy latent $x_t$. In contrast, our model uses the implicit prediction of the diffusion model (i.e., a coarse estimate of the clean image $x_0$) in its energy function $V_1$, which allows the use of any pre-trained network for steering. In addition, our model provides a looping mechanism $V_2$, which iterates $N$ times at each timestep $t$ to enhance generation quality.

In this paper, we derive the necessary theory and formulate an algorithm, which we call Steered Diffusion, for diffusion-based image editing and image-to-image translation; our model is subsequently validated on a wide range of tasks. Steered Diffusion is motivated by the energy-based formulation of diffusion probabilistic models [10]. In general, inference in a generative model can be thought of as deriving samples from a learned distribution. Recall that every probability density function can be formulated as an energy field that describes an unnormalized estimate of how the distribution density varies in space [13, 25]. If one needs to find points in space that best match a given condition, one can utilize gradient-based optimization to find points in the field that have the highest density value for the condition. The gradient-based optimization scheme can be viewed as a modulation of the energy toward the desired direction. Previous work has applied this idea to GANs [25, 26] and obtained reasonable results for label-based generation tasks. Due to their model structure, diffusion models are ideal candidates for such an energy modulation. One key challenge remains: designing a good energy estimator that is robust to all noise levels. Previously, classifier-based guidance [27, 8] has been proposed, which can be viewed as an energy modulation utilizing a pretrained classifier trained on noisy images. This imposes the limitation that the guiding function must be noise-robust. In this work, we propose an alternative solution that does not need noise-robust networks but can use any network, by utilizing the diffusion model as an implicit denoiser. Figure 2 gives a brief overview of how our approach differs from existing methods.

We present experiments using Steered Diffusion on multiple conditional generative tasks on faces as well as generic images, as portrayed in Figure 1. We present results on (i) identity replication [7], (ii) semantic image generation [29], (iii) linear inverse problems [21], and (iv) text-conditioned image editing. Although our method is generic, we focus our evaluations on faces. Before presenting our framework in detail, we summarize the key contributions of our work:

- We propose Steered Diffusion, a general plug-and-play framework that can utilize various pre-existing models to steer an unconditional diffusion model.

- We present the first work applicable to both label-level synthesis and image-to-image translation tasks, and demonstrate its effectiveness for various applications.

- We propose an implicit conditioning-based sampling strategy that significantly boosts the performance of conditional sampling from unconditional diffusion models compared with previous methods.

- We introduce a new strategy that uses multiple steps of projected gradient descent to improve sample quality.

2 Background

2.1 Related Work

Early works on unpaired image-to-image translation utilize a cycle-consistency loss between the input and target domains [46, 9]. Newer works, such as [16], have introduced a contrastive learning-based approach in which a contrastive loss between corresponding patches of the input and target domains is minimized. Consistency-based methods often fail to generate photorealistic images; hence, conditional generative models are preferred when labeled data are available. A few works [37, 38] utilize diffusion models for conditional image-to-image translation because of their photorealistic generation quality.

Guiding diffusion models at inference time has been explored by several works, such as [24]. The first method that proposed inference-time conditioning [8] uses a pretrained noise-robust classifier to guide the inference of an unconditional model. GLIDE [27] proposed a method for conditioning using text. Earlier work on plug-and-play modelling for generative models utilized GANs and performed iterative refinement on the GAN latent space [25]. This method uses a predefined classifier or text-captioning network to estimate a loss between the desired label or text caption and the one generated from the GAN generator. This loss is backpropagated to refine the noise input of the GAN iteratively until the generator predicts the desired output. Recently, [12] proposed a method that uses diffusion models as a plug-and-play prior for class-conditioned generation. Several works have addressed the task of image-to-image translation using unconditional diffusion models [18, 5, 20, 1], but each of these proposes a task-specific inference scheme. For example, ILVR [5] performs image super-resolution, and RePaint [20] performs image inpainting. Blended Diffusion [1] proposes a method for text-conditioned image editing. DDRM [18] proposes an inference-time scheme offering a general solution for linear inverse problems such as colorization and super-resolution.

2.2 Concurrent Work

In concurrent work that also explores zero-shot conditional generation using diffusion models, [3] used a text-to-image model [34] and a two-step forward and backward universal guidance process, but it works well only after heavy optimization on network-based inverse problems such as semantic generation and identity generation. Another concurrent work [42] explored the use of unconditional diffusion models for linear inverse problems using a pseudo-inverse model. In contrast to these prior works, our Steered Diffusion algorithm generalizes well to both image-to-image translation tasks and high-level label-based generation tasks.

2.3 Denoising Diffusion Probabilistic Models

Denoising diffusion probabilistic models (DDPMs) [40, 14] belong to a class of generative models in which the model learns the distribution of data through a Markovian sampling process. DDPMs consist of a forward process and a reverse process. Let $x_t$ denote the latent state of an input image at timestep $t$ in a diffusion process. The sampling operation $q(\cdot)$ for the forward process in DDPM is defined as:

$$q(x_t \mid x_{t-1}) := \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad (1)$$

where $\{\beta_t\}$ is a predefined variance schedule and $I$ is the identity matrix.

The forward process can be considered a noising operation, where the next state $x_t$ is obtained from the current state $x_{t-1}$ by adding a small amount of Gaussian noise according to the variance schedule at that timestep. The state $x_t$ at timestep $t$ can also be sampled directly from the initial state $x_0$, using:

$$q(x_t \mid x_0) := \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big), \qquad (2)$$

or equivalently,

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \qquad (3)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ and $\alpha_t = 1 - \beta_t$.
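As an illustration, the closed-form forward sampling of Eq. (3) takes only a few lines. This is a minimal sketch; the linear $\beta$ schedule, the number of timesteps, and the tensor shapes below are assumptions for the example, not the paper's exact settings.

```python
import numpy as np

# Assumed linear variance schedule (illustrative values, not the paper's).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) directly via Eq. (3)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 64, 64))  # stand-in for a normalized image
x500 = q_sample(x0, t=500, rng=rng)    # heavily noised latent
```

By the last timestep, $\bar{\alpha}_T$ is close to zero, so $x_T$ is essentially pure Gaussian noise, which is what the reverse process starts from.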

In [40], it is shown that if the number of timesteps is large and the increments in $\{\beta_t\}$ are small, then each step in the reverse sampling process can also be approximated by a Gaussian. If $\mu_\theta$ and $\Sigma_\theta$ respectively denote the mean and the covariance of this Gaussian, modeled via neural networks with parameters $\theta$, then each reverse step samples the state $x_{t-1}$ according to:

$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big). \qquad (4)$$

The parameters $\theta$ are obtained by minimizing the variational lower bound on the negative log-likelihood of the data distribution.
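Under the common $\epsilon$-prediction parameterization, one reverse step of Eq. (4) can be sketched as below. The schedule and the fixed choice $\Sigma_\theta = \beta_t I$ are illustrative assumptions, and `eps_pred` stands in for a trained network's output $\epsilon_\theta(x_t, t)$.

```python
import numpy as np

# Assumed schedule, mirroring the forward-process sketch.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(xt, t, eps_pred, rng):
    """One ancestral sampling step x_t -> x_{t-1}: mu_theta is expressed
    through the predicted noise eps_pred; variance is fixed to beta_t."""
    mean = (xt - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) \
           / np.sqrt(alphas[t])
    if t == 0:
        return mean  # the final step is taken deterministically
    return mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
```

Iterating this function from $t = T-1$ down to $0$, starting from pure Gaussian noise, yields an unconditional sample; the conditional variant in Section 3 modifies exactly this step.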

3 Proposed Method


Figure 3: An illustration of Steered Diffusion. During each step of the sampling process, the implicit prediction is steered in the direction of the condition using a steering network or predefined function. Note that this figure only illustrates the idea and does not show the actual sampled images; in actual sampling, the steering process is much more gradual, not sudden as portrayed in this image.

3.1 Steered Diffusion at Inference Time

Our work is motivated by the energy-based formulation of diffusion models. For any probability density function, the corresponding energy-based model (EBM) is defined by:

$$p_\theta(x) = \frac{\exp\big(-V(x)\big)}{Z}, \qquad (5)$$

where $V(x)$ denotes the corresponding energy function across states $x$, and $Z$ denotes a normalization constant. To derive samples from this distribution, one can utilize the Langevin equation [39] describing the state transition of a particle in the presence of an energy field. For diffusion models, the sampling step is

$$x_{t-1} = x_t - \nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \qquad (6)$$

The term $\nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t)$ is called the score function of the density $p_\theta(x_{t-1} \mid x_t)$. One key advantage of the energy-based formulation is that it allows modulation of the energy function to satisfy given criteria. This was initially introduced as classifier guidance [8], which allows label-conditional sampling from an unconditionally trained diffusion model utilizing a noise-robust classifier. In the remainder of this section, we motivate how the functionality of unconditional diffusion models can be extended to conditional tasks. Consider a conditional sampling scenario based on a condition $c$, for sampling from a state $x_t$ to a state $x_{t-1}$. The conditional transition probability $p_\theta(x_{t-1} \mid x_t, c)$ can be decomposed as

$$p_\theta(x_{t-1} \mid x_t, c) \propto \frac{p_\theta(x_{t-1} \mid x_t)\,p(c \mid x_{t-1})}{p(c \mid x_t)}. \qquad (7)$$

Hence, for any timestep $t$, the effective score for the conditional transition can be found by taking the log of each probability density in the EBM formulation (5) of the individual densities, and can be represented as

$$\nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t, c) = \nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t) - \nabla_{x_t} V_1(x_t, c) + \nabla_{x_t} V_2(x_{t-1}, c), \qquad (8)$$

where $V_1$ and $V_2$ are the corresponding energy functions that model the conditional distributions of $x_t$ and $x_{t-1}$ given a condition $c$. Specifically, they project the higher-dimensional $x_t$ to the lower-dimensional space of $c$ and measure the distance between the mapped value and $c$. The better this measure, the more effectively it can be used to generate conditional samples from an unconditional model. Using Eq. (8), the conditional sampling equation for the reverse process is

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta^{(t)}(x_t, t)\right) - \nabla_{x_t} V_1(x_t, c) + \nabla_{x_t} V_2(x_{t-1}, c) + \sigma_t \epsilon. \qquad (9)$$

Here $\epsilon_\theta^{(t)}(x_t, t)$ is the network prediction at timestep $t$, and $\sigma_t$ is the corresponding variance of the reverse step. The formulation in Eq. (9) shows that the energy function requires a functional mapping from a noisy $x_t$ to $c$.
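The conditional step in Eq. (9) can be sketched as the unconditional mean shifted by the two guidance gradients before noise is added. In this sketch, the callables `grad_v1` and `grad_v2` are hypothetical stand-ins for gradients obtained by backpropagating the energies through a steering network, and $\nabla V_2$ is evaluated at a provisional $x_{t-1}$ (the unconditional mean), since $x_{t-1}$ itself is not yet available.

```python
import numpy as np

# Assumed schedule, as in the earlier sketches.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def steered_step(xt, t, eps_pred, grad_v1, grad_v2, rng):
    """One conditional reverse step following Eq. (9): the unconditional
    DDPM mean is shifted by -grad(V1)(x_t) and +grad(V2)(x_{t-1})."""
    mean = (xt - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) \
           / np.sqrt(alphas[t])
    x_prov = mean  # provisional x_{t-1} at which V2's gradient is evaluated
    guided = mean - grad_v1(xt) + grad_v2(x_prov)
    return guided + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
```

For instance, a toy quadratic energy $V(x, c) = \tfrac{\lambda}{2}\lVert x - c\rVert^2$ gives the gradient $\lambda(x - c)$, which nudges each step toward $c$; in practice the energies come from a pre-trained mapping network, as discussed next.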

In many applications of interest, the mapping function from $x_t$ to $c$ is complex and can be modeled effectively using deep networks. For example, in image-to-image translation tasks such as image generation from semantic maps [29], the image is $x_t$ and the semantic map is the condition $c$. Similarly, for text-to-image generation, the image is $x_t$ and $c$ corresponds to the text. Ideally, we would like to employ an existing (pre-trained) deep neural network for the mapping from $x_t$ to $c$. However, deep networks are usually trained on clean images, which limits the usability of existing pre-trained networks for mapping directly from the noisy image $x_t$ to $c$. One workaround would be to use noise-robust networks to map from $x_t$ to $c$, but training noise-robust networks for conditional mapping can be computationally expensive. Moreover, a network trained on multiple different noise levels often yields lower mapping performance, as it cannot handle all noise levels accurately; we validate this claim experimentally in Section 5.6. Alternatively, one could compose two mapping functions: a first that denoises $x_t$, and a second that maps from the denoised image to $c$.

Rather than training a separate denoising network, however, we observe that diffusion models are inherently trained as denoisers, and reconstruction quality improves as time proceeds in the reverse sampling of the diffusion process. Because of this capacity, we can use a reverse sampling step to make a coarse prediction of the denoised image from any timestep $t$.

Hence, we modify our original energy expression (8) to:

$$\nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t, c) = \nabla_{x_t} \log p_\theta(x_{t-1} \mid x_t) - \nabla_{x_t} V_1(x_{0|t}, c) - \delta_1 + \nabla_{x_t} V_2(x_{0|t-1}, c) + \delta_2, \qquad (10)$$

where we define the implicit step prediction $x_{0|t}$ as:

$$x_{0|t}=\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\bar{\alpha}_t}}.\tag{11}$$

Here, we assume $x_t$ and $x_{t-1}$ are first denoised to $x_{0|t}$ and $x_{0|t-1}$, respectively. The terms $\delta_1$ and $\delta_2$ capture the errors arising from the shift in domain from $x_t$ to $x_0$ and from $x_{t-1}$ to $x_0$; for large $t$, $\delta_1\approx\delta_2$, as the implicit predictions at nearby steps tend to be similar.
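
In code, the implicit prediction of Eq. (11) is a one-line rescaling by the noise schedule. The sketch below is illustrative: `eps_pred` is a stand-in for the network output $\epsilon_\theta^{(t)}(x_t)$, not the actual model.

```python
import numpy as np

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Implicit prediction x_{0|t} from Eq. (11): invert the forward process
    x_t = sqrt(alpha_bar)*x_0 + sqrt(1 - alpha_bar)*eps."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Sanity check: noising a known x_0 and denoising with the true eps recovers x_0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
eps = rng.standard_normal(4)
a_bar = 0.6
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
recovered = predict_x0(x_t, eps, a_bar)
```

With the true noise, the inversion is exact; with a learned $\epsilon_\theta$, it is a coarse estimate that sharpens as $t$ decreases.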

As shown in our experiments, the energy function should be selected according to the task. A simple heuristic is to reuse the training loss of the mapping network. For semantic generation, a good energy function is the cross-entropy loss between the semantic map predicted at any timestep and the input semantic map. For identity replication, a good choice is the negative cosine similarity between the embeddings that a recognition network produces for the input and target images. For text-to-image generation, it is the CLIP loss [32]. As a rule of thumb, the energy function can be chosen by inspecting the loss used to train the pre-trained network (or an inverse function) that maps from the image $x$ to the condition $c$.
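
To make this rule of thumb concrete, two of the energies named above are sketched below. The inputs (predicted logits, face embeddings) would in practice come from pre-trained mapping networks; here they are plain arrays used only for illustration.

```python
import numpy as np

def cross_entropy_energy(pred_logits, target_map):
    """Semantic generation: cross-entropy between the semantic map predicted
    from x_{0|t} and the input map (classes along the last axis, one-hot target)."""
    logp = pred_logits - np.log(np.sum(np.exp(pred_logits), axis=-1, keepdims=True))
    return -np.mean(np.sum(target_map * logp, axis=-1))

def identity_energy(emb_pred, emb_target):
    """Identity replication: negative cosine similarity between embeddings
    from a face recognition network; minimized when identities match."""
    cos = np.dot(emb_pred, emb_target) / (
        np.linalg.norm(emb_pred) * np.linalg.norm(emb_target))
    return -cos
```

Because each energy is differentiable in its first argument, its gradient with respect to $x_t$ (through the mapping network and $x_{0|t}$) can serve directly as the steering term in Eq. (15).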

Algorithm 1 Steered Diffusion

1: Input: energy function $V$, condition $c$
2: Sample $x_T \sim \mathcal{N}(x_T; 0, I)$
3: for $t = T-1, \ldots, 1$ do
4:   for $n = N, \ldots, 1$ do
5:     Sample $\epsilon \sim \mathcal{N}(\epsilon; 0, I)$
6:     $x_{0|t} = \big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)\big)/\sqrt{\bar{\alpha}_t}$
7:     Compute $x_{0|t}^{\mathrm{feas}}$ with $V, c$ using Eq. (15)
8:     if $n > 1$ then
9:       Compute $x_t^{uc}$ using Eq. (13)
10:    else
11:      Compute $x_{t-1}^{uc}$ using Eq. (13)
12:    end if
13:  end for
14: end for
15: return $x_0$
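
The overall structure of the sampling loop can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `eps_model` and `energy_grad` are hypothetical placeholders for the pretrained noise predictor $\epsilon_\theta$ and for $\nabla V(\cdot, c)$ (in practice obtained by automatic differentiation), and the schedule values are arbitrary.

```python
import numpy as np

def steered_diffusion(eps_model, energy_grad, alpha_bar, T, N, k, sigma, shape, rng):
    """Sketch of Algorithm 1: DDIM-style sampling with Implicit Steering Control."""
    x_t = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        for n in range(N, 0, -1):                       # multi-step modulation
            eps = rng.standard_normal(shape)
            eps_pred = eps_model(x_t, t)
            # Eq. (11): implicit prediction of x_0 from x_t
            x0_t = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
            # Eq. (15): steer the implicit prediction towards the condition
            x0_feas = x0_t - k(t) * energy_grad(x0_t)
            # Eq. (13), with the steered prediction substituted for x_{0|t}
            a_prev = alpha_bar[t - 1]
            dir_xt = (x_t - np.sqrt(alpha_bar[t]) * x0_feas) / np.sqrt(1 - alpha_bar[t])
            x_prev = (np.sqrt(a_prev) * x0_feas
                      + np.sqrt(max(1 - a_prev - sigma[t] ** 2, 0.0)) * dir_xt
                      + sigma[t] * eps)
            if n > 1:
                # loop back: re-noise x_{t-1} to the x_t noise level and steer again
                beta = 1 - alpha_bar[t] / a_prev
                x_t = np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * rng.standard_normal(shape)
            else:
                x_t = x_prev
    return x_t
```

With $N > 1$, the inner loop performs the multi-step implicit modulation described in Sec. 4.2.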

3.2 Revisiting Sampling in Diffusion Models

To obtain a closed-form expression for efficiently plugging the energy-based formulation into the reverse sampling process, we take inspiration from DDIM [41] and revisit the reverse sampling operation of diffusion models. From $p_\theta(x_{1:T})$, one can generate a sample $x_{t-1}$ from a sample $x_t$ by:

$$x_{t-1}^{\text{uc}}=\sqrt{\bar{\alpha}_{t-1}}\cdot\underbrace{\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\bar{\alpha}_t}}}_{\text{``predicted }x_0\text{''}}+\underbrace{\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(x_t)}_{\text{``direction pointing to }x_t\text{''}}+\underbrace{\sigma_t\epsilon}_{\text{random noise}},\tag{12}$$

as in Song et al. [41]. Using Eq. (11), we can rewrite the unconditional sampling step Eq. (12) in terms of $x_{0|t}$:

$$x_{t-1}^{\text{uc}}=\sqrt{\bar{\alpha}_{t-1}}\,x_{0|t}+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\frac{x_t-\sqrt{\bar{\alpha}_t}\,x_{0|t}}{\sqrt{1-\bar{\alpha}_t}}+\sigma_t\epsilon,\tag{13}$$

Here the superscript uc denotes the unconditional sample, obtained without any steering while transitioning from $x_t$ to $x_{t-1}$. The conditional sampling step (8) can then be rewritten as

$$x_{t-1}=x_{t-1}^{\text{uc}}-\nabla_{x_t}V_1(x_{0|t},c)+\nabla_{x_t}V_2(x_{0|t-1},c).\tag{14}$$

Through this, we can modulate $x_0$ directly, as $x_{0|t}$ is also a function of $\epsilon_\theta^{(t)}(x_t)$. Following Eq. (14), a rough estimate of the desired $x_0$ for conditional sampling, denoted $x_{0|t}^{\mathrm{feas}}$, can be obtained using

$$x_{0|t}^{\mathrm{feas}}=x_{0|t}-k(t)\,\nabla_{x_t}\left(V_1(x_{0|t},c)-V_2(x_{0|t-1},c)\right),\tag{15}$$

where $k(t)$ is a scaling factor defining the strength of the regularization. We call the process of finding $x_{0|t}^{\mathrm{feas}}$ from $x_{0|t}$ Implicit Steering Control (ISC), and we call the new sampling process steered diffusion. The exact algorithm is illustrated in Fig. 3 and explained in Algorithm 1.

4 Tips for Improved Performance

4.1 Linear Inverse Problems

For optimization-based inverse problems such as text-to-image generation and semantic-map-to-image generation, the exact mapping function is not always available. For linear inverse problems such as colorization, super-resolution, and image inpainting, on the other hand, the mapping is simply a known linear function, and Eq. (15) can be written more simply. In these cases, the exact mapping to the latent space of the condition is known, so one can decompose the implicit prediction at each timestep along the direction of the condition and replace that component by the desired ideal condition. That is, if our predicted sample should map to a condition $c$, then the modified implicit prediction step becomes

$$x_{0|t}^{\mathrm{feas}}=x_{0|t}+k(t)\left(D(y)-D(x_{0|t})\right),\quad\text{where }c=D(y).\tag{16}$$

Here, $y$ is the clean image and $D$ is the known degradation model. Our sampling procedure ensures that this series of operations preserves the consistency of the domains of $x_{t-1}$ and $x_{0|t}^{\mathrm{feas}}$ with the original data distribution at the corresponding timesteps. An illustration is shown in Figure 4.
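
As a concrete linear example, the sketch below uses a 2x low-pass filter as the degradation $D$ (average-pool then nearest-neighbor upsample, so $D$ maps an image back into image space and Eq. (16) is shape-consistent). This ILVR-style choice of $D$ is an assumption for illustration, not the paper's exact operator.

```python
import numpy as np

def lowpass(x, f=2):
    """Degradation D: f-times average-pool, then nearest-neighbor upsample,
    so D(x) lives in image space (low-frequency content of x)."""
    h, w = x.shape
    pooled = x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, f, axis=0), f, axis=1)

def linear_steer(x0_t, y, k=1.0):
    """Eq. (16): replace the low-frequency component of the implicit
    prediction x_{0|t} with that of the reference image y."""
    return x0_t + k * (lowpass(y) - lowpass(x0_t))
```

Because this $D$ is linear and idempotent, with $k=1$ the steered prediction satisfies $D(x_{0|t}^{\mathrm{feas}}) = D(y)$ exactly, i.e., the condition is enforced while the high-frequency content of $x_{0|t}$ is preserved.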


Figure 4: An illustration of the steering function for linear inverse problems. For linear inverse problems, the component of the implicit prediction along the degradation direction can be replaced by the ground truth condition.

4.2 Multi-Step Implicit Modulation

Our experiments show (see Fig. 7) that performing the refinement of Eq. (15) on the implicit step prediction multiple times per timestep significantly boosts the conditioning quality for more ill-posed conditions such as image inpainting and colorization. A similar observation was made in [20]. Specifically, at a particular timestep $t$, we iterate the procedure of steering towards the next sampling step $x_{t-1}$ and then adding noise to return to $x_t$. The corresponding procedure is given in Algorithm 1. An example is shown in Fig. 7, where more realistic images are generated using the multiple-step sampling scheme (row labeled "OURS multi"). Effectively, the $V_2$ term in Eq. (15) can be thought of as enabling multistep sampling in which we modulate the current step by looking ahead to the next sampling step. Examining Eq. (7), i.e., the score contributions of the different regularization functions, the term $\nabla_{x_t}V_1(x_t,c)$ modulates $x_t$ based on its current state, while the term $\nabla_{x_t}V_2(x_{t-1},c)$ is a look-ahead correction in which the derivative is taken with respect to the future prediction. This is exactly what happens when looping back from $x_{t-1}$ to $x_t$: $x_t$ is modulated iteratively by looking forward to what the future prediction would be.
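
The loop-back step itself is a single forward diffusion step from $x_{t-1}$ back to the $x_t$ noise level; a minimal sketch (the schedule values in the test are illustrative):

```python
import numpy as np

def renoise(x_prev, alpha_bar_t, alpha_bar_prev, rng):
    """One forward step x_{t-1} -> x_t, i.e., sample q(x_t | x_{t-1})
    with per-step alpha_t = alpha_bar_t / alpha_bar_{t-1}."""
    alpha_t = alpha_bar_t / alpha_bar_prev
    return (np.sqrt(alpha_t) * x_prev
            + np.sqrt(1.0 - alpha_t) * rng.standard_normal(x_prev.shape))
```

Alternating this re-noising with the steering of Eq. (15) yields the iterative look-ahead modulation described above.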

4.3 Choosing the Scaling Factor $k(t)$

In Eq. (15), $k(t)$ denotes the strength of the regularization constraint. A very small $k(t)$ provides no effective regularization, while a large $k(t)$ causes the diffusion process to leave the latent-space manifold. Since the derivative of the regularization function is itself a score value, analogous to the normal scaling of the score function, the appropriate time-varying normalization factor is $\sqrt{1-\bar{\alpha}_t}$. The exact value of $k(t)$ for each task is given in Table 1. For linear inverse problems we use a constant $k(t)=1$, which provided the best results.
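
The resulting time-varying scale is a one-liner; the global constant $K$ is task-dependent (per-task values are in Table 1, and the value in the test below is arbitrary):

```python
import numpy as np

def k_schedule(alpha_bar, K=1.0):
    """k(t) = K * sqrt(1 - alpha_bar_t): large steering steps early in
    sampling (small alpha_bar_t), shrinking toward zero as t -> 0."""
    return K * np.sqrt(1.0 - alpha_bar)
```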

5 Experiments

We evaluate our method qualitatively and quantitatively on four image-to-image translation tasks (semantic layout to face image translation, face inpainting, face colorization, and face super-resolution) as well as two high-level vision tasks: identity-based image generation and text-guided image editing. Unlike existing approaches, our method is completely zero-shot and applies to a wide variety of tasks. For a fair evaluation, we compare with the diffusion-based approaches best suited to each task. We also compare semantic layout to image translation performance with that of task-specific unsupervised methods. We choose the unconditional model released by [5] as the pretrained unconditional diffusion model in all of our experiments with faces. Note that the sampling schemes in ILVR [5] and RePaint [20] can be thought of as operating at time $t$ rather than at the implicit step as in our method. Hence, the comparisons with these methods for super-resolution and inpainting in our experiments below can be considered an additional ablation study highlighting the improvement from our implicit sampling.

Table 1: Parameter set for each application.

5.1 Implementation Details

For our experiments, we utilize pixel-level unconditional diffusion models. For faces, we use the model trained on the FFHQ dataset [17] that was released by [5]. For generic images, we use the model trained on ImageNet [6] released in ADM [8]. All our experiments use 100 sampling steps.

5.2 Semantic Face Generation

Figure 5: Qualitative comparisons for semantic generation. Panels, left to right: (a) Labels, (b) CycleGAN, (c) CUT, (d) ILVR, (e) OURS.

Figure 6: Results on 8x super-resolution.

To evaluate how our method performs on generic image-to-image translation tasks, we evaluate its performance on semantic layout to face generation, using the CelebA dataset. To generate the semantic labels, we use facer [45] and create 11 label classes for each face. Since no other unconditional model can perform fully test-time semantic generation, we compare with fully unsupervised image translation methods: CycleGAN [46], CUT [28], and ILVR [5].

The corresponding qualitative results are shown in Fig. 5. CUT, CycleGAN, and ILVR produce unrealistic facial images with large artifacts or create low-resolution faces. In contrast, our method consistently creates good-quality, realistic faces. We present the quantitative results in Table 2: our method obtains the best FID scores of all methods and the best mIoU score among the inference-time techniques.

5.3 Face Super-Resolution

Figure 7: Qualitative comparisons for colorization. The row labeled "OURS multi" refers to the use of multi-step sampling, as described in Sec. 4.2.

Figure 8: Qualitative comparisons for inpainting with thin, medium, and thick masks. Panels repeat (Degraded, RePaint, Ours) for each mask type.

Figure 9: Qualitative comparisons for text-based image editing. Panels: (a) Original image, (b) "Photo of a young man", (c) "She has wavy hair", (d) "She has blonde hair", (e) "Photo of an old woman", (f) "She is angry", (g) "She is sad".

We evaluate face super-resolution on the CelebA dataset [19]. As baselines, we use fully inference-time methods in which no task-specific training is performed. The first baseline is PULSE [22], a self-supervised upsampling technique based on GANs. The second is ILVR [5], which, like our method, performs super-resolution with an unconditional pre-trained diffusion model; however, in ILVR, sampling happens at timestep $t$ rather than at the implicit step as in our algorithm. In total, we use 300 images for evaluation. We present some qualitative results in Fig. 6. For ILVR [5], we use 100 sampling timesteps, the same as in our case. PULSE [22] and ILVR [5] are unable to restore the correct identity and leave blur artifacts after restoration. In contrast, steered diffusion (our method) restores photorealistic facial images. The quantitative evaluations are presented in Table 3; our method yields a 0.18 improvement in perceptual similarity, a 6.95 dB improvement in PSNR, and a 0.24 improvement in SSIM over all of the other comparison methods.

Table 2: Quantitative results for semantic generation

Table 3: Quantitative results for super-resolution

5.4 Face Colorization

As a baseline method, we modify ILVR [5] to suit the task of colorization: rather than enforcing the constraint at every step, we start the sampling process from a noised grayscale image and enforce consistency between the generated and original grayscale images. In total, we use 300 images for evaluation. The corresponding results can be seen in Fig. 7. Our method reconstructs photorealistic faces with naturalistic colors compared to ILVR [5]. The corresponding quantitative metrics are presented in Table 4: we get a significant boost in performance, with an FID score of 19, an LPIPS [44] score of 0.19, and an NIQE [23] score of 1.5.

Table 4: Quantitative results for colorization

Table 5: Quantitative results for inpainting.

Figure 10: Sample variation with scaling factor $k(t)=K\sqrt{1-\bar{\alpha}_t}$. Panels: (a) Semantic map, (b) $K=0$, (c) $K=200$, (d) $K=2000$, (e) $K=20000$, (f) $K=2\times 10^5$, (g) $K=2\times 10^6$.

Figure 11: Qualitative comparisons for colorization. Panels: (a) Degraded, (b) Noise-robust classifier, (c) OURS.

5.5 Inpainting, Image Editing, and Identity Replication

Our method can also use multiple conditions simultaneously; we provide an illustration in Figure 9, where we condition on an identity-preserving network and a text caption at the same time. The figure shows that our method generalizes well to a diverse range of captions. To preserve identity, we use the VGGFace network [30]. To enforce the captions, we use FARL [45], which is pre-trained on pairs of faces and corresponding text. For generic identity replication, as in Figure 1, we use the FARL face embedder.

For our image inpainting experiments, we use the subset released by [43] and evaluate on three different kinds of masks. Our method obtains better results than existing baselines across all mask variations. Qualitative and quantitative results are shown in Fig. 8 and Table 5, respectively.

5.6 Ablation Study

Effect of scaling factor $k(t)$ for semantic generation: In this section, we analyze how the scaling factor affects sample quality under the complex conditioning of semantic generation. Fig. 10 shows the variation in sample quality, starting from the same initial noise, for different scaling factors $k(t)=K\sqrt{1-\bar{\alpha}_t}$. Sample quality is poor for very low scaling factors, and for very high scaling factors the diffusion process escapes the manifold of natural face images. We show the variation in sample quality for a fixed scaling factor versus a time-varying scaling factor in Fig. 12, which demonstrates that a time-varying scale factor produces more realistic samples. This is because the effective variance of the noise schedule, which controls the amount of regularization possible at a particular timestep, decreases as the generation process proceeds. Hence a larger tweak is permissible in the early steps of diffusion, and only very small tweaks are permitted in the later steps.

Noise-robust classifier: To validate the claims in Sec. 3.1, we train a noise-robust inverse mapper for the task of colorization and show the outputs of the noise-robust classifier and of our diffusion approach at different noise levels in Fig. 11. The noise-robust classifier fails to preserve key details that our approach preserves.

Limitations: Although our method generalizes to a wide range of tasks, one limitation is that the value of the scaling factor $k(t)$ must be found empirically for each task. However, once a few images are used to tune $k(t)$, the model generalizes well to other conditioning images for the same task. Like any other conditional generation model capable of image editing, our method has potential societal impacts, and care must be taken in applying it.

Figure 12: A comparison of a time-varying scaling factor with a non-time-varying one.

6 Conclusion

In this paper, we propose the first plug-and-play conditional generation framework that generalizes to both image-to-image translation tasks and label-based generation tasks. To this end, we use the energy-based formulation of diffusion models and modulate the inference process using a task-specific predefined network or other preexisting function. Furthermore, we introduce a novel implicit-sampling technique that improves sample quality across multiple tasks. Our experiments on a variety of tasks show that our method generalizes across them and outperforms existing methods that do not require additional training.

References

  • [1] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
  • [2] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • [3] Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Universal guidance for diffusion models. arXiv preprint arXiv:2302.07121, 2023.
  • [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [5] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. arXiv preprint arXiv:2108.02938, 2021.
  • [6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [7] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019.
  • [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
  • [9] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, Kun Zhang, and Dacheng Tao. Geometry-Consistent Generative Adversarial Networks for One-Sided Unsupervised Domain Mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [10] Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P Kingma. Learning energy-based models by diffusion recovery likelihood. arXiv preprint arXiv:2012.08125, 2020.
  • [11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [12] Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. Diffusion models as plug-and-play priors. arXiv preprint arXiv:2206.09012, 2022.
  • [13] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263, 2019.
  • [14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • [15] Xun Huang, Arun Mallya, Ting-Chun Wang, and Ming-Yu Liu. Multimodal conditional image synthesis with product-of-experts gans. arXiv preprint arXiv:2112.05130, 2021.
  • [16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [17] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • [18] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. arXiv preprint arXiv:2201.11793, 2022.
  • [19] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
  • [20] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  • [21] Kangfu Mei, Nithin Gopalakrishnan Nair, and Vishal M Patel. Bi-noising diffusion: Towards conditional diffusion models with generative restoration priors. arXiv preprint arXiv:2212.07352, 2022.
  • [22] Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2437–2445, 2020.
  • [23] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, 2012.
  • [24] Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, and Vishal M Patel. Unite and conquer: Plug & play multi-modal synthesis using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6070–6079, 2023.
  • [25] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4467–4477, 2017.
  • [26] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in neural information processing systems, 29, 2016.
  • [27] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [28] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 319–345. Springer, 2020.
  • [29] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2337–2346, 2019.
  • [30] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
  • [31] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, and Supasorn Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10619–10629, 2022.
  • [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [33] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • [35] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
  • [36] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • [37] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
  • [38] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [39] Ken Sekimoto. Langevin equation and thermodynamics. Progress of Theoretical Physics Supplement, 130:17–27, 1998.
  • [40] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [42] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2022.
  • [43] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
  • [44] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [45] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. arXiv preprint arXiv:2112.03109, 2021.
  • [46] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

Supplementary Material

A1. Visualization of diffusion steering process

We present visualizations of intermediate outputs of the diffusion steering process in Figure 14 and Figure 15. The left half shows sampling without the steering loss, and the right half shows sampling with it. With the steering loss, the generated images are consistent with the semantic maps from an early stage, and the results continue to improve; the consistency grows stronger as the timesteps progress.

A2. Illustrating sample diversity using our method

We present non-cherry-picked results for various conditional generation tasks to demonstrate the photorealism and diversity of the images our method generates. We use the same noise across different examples, and the conditioning input is shown as the first image in each sequence.
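The fixed-noise protocol used in these figures (the same noise at the same sample position, regardless of the conditioning input) can be reproduced, for instance, by seeding each position's noise draw with its index. The helper below is an illustrative assumption, not the authors' code:

```python
import numpy as np

def noise_for_position(position, shape, base_seed=0):
    """Return identical Gaussian noise for a given sample position
    across all conditioning inputs, so that differences between the
    rows of a comparison figure reflect only the condition, not the seed."""
    rng = np.random.default_rng(base_seed + position)
    return rng.standard_normal(shape)
```

Sampling position `i` for every condition with `noise_for_position(i, ...)` then makes columns of the figures directly comparable.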

[Figure 13 panels, images omitted in this extraction: (a) Labels, (b) CycleGAN, (c) CycleGAN, (d) CUT, (e) CUT, (f) ILVR, (g) ILVR, (h) OURS, (i) OURS.]

Figure 13: Qualitative comparisons for the segmentation labels from Figure 5 of the main paper.


Figure 14: Evolution of the sampling process without and with guidance.


Figure 15: Evolution of the sampling process without and with guidance.


Figure 16: Non-cherry-picked samples from our method for the same semantic map. The upper-left image shows the semantic map. Samples at the same position across different examples use the same random seeds.


Figure 17: Non-cherry-picked samples from our method for the same semantic map. The upper-left image shows the semantic map. Samples at the same position across different examples use the same random seeds.


Figure 18: Non-cherry-picked samples from our method for the same identity image. The identity conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 19: Non-cherry-picked samples from our method for the same identity image. The identity conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 20: Non-cherry-picked grayscale-to-RGB colorization samples from our method. The grayscale conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 21: Non-cherry-picked grayscale-to-RGB colorization samples from our method. The grayscale conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 22: Non-cherry-picked super-resolution samples from our method. The low-resolution conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.


Figure 23: Non-cherry-picked super-resolution samples from our method. The low-resolution conditioning image is shown separately from the generated samples. Samples at the same position across different examples use the same random seeds.
