Title: Harnessing the Latent Diffusion Model for Training-Free Image Style Transfer
URL Source: https://arxiv.org/html/2410.01366
Published Time: Thu, 03 Oct 2024 00:42:07 GMT
Kento Masui¹, Mayu Otani¹ (ORCID 0000-0001-9923-2669), Masahiro Nomura¹ (ORCID 0000-0002-4945-5984), Hideki Nakayama² (ORCID 0000-0001-8726-2780)

¹ CyberAgent, Japan. Email: {masui_kento,otani_mayu,nomura_masahiro}@cyberagent.co.jp
² The University of Tokyo, Japan. Email: nakayama@nlab.ci.i.u-tokyo.ac.jp
Abstract
Diffusion models have recently shown the ability to generate high-quality images. However, controlling their generation process still poses challenges. Image style transfer, which transfers the visual attributes of a style image to another content image, is one of those challenges. A typical obstacle in this task is the requirement of additional training of a pre-trained model. We propose a training-free style transfer algorithm, Style Tracking Reverse Diffusion Process (STRDP), for a pre-trained Latent Diffusion Model (LDM). Our algorithm employs the Adaptive Instance Normalization (AdaIN) function in a distinct manner during the reverse diffusion process of an LDM while tracking the encoding history of the style image. This enables style transfer in the latent space of the LDM for reduced computational cost and provides compatibility with various LDM models. Through a series of experiments and a user study, we show that our method can quickly transfer the style of an image without additional training. The speed, compatibility, and training-free nature of our algorithm facilitate agile experimentation with combinations of styles and LDMs for extensive applications.
Keywords:
Image Style Transfer · Latent Diffusion Model · Generative Models
Figure 1: Our image style transfer results. Our algorithm is able to transfer the visual style to a content image using a pre-trained latent diffusion model, without the need for additional training or heavy optimization. Unlike most existing approaches, our method preserves the original color of the content.
1 Introduction
Neural image style transfer, pioneered by Gatys et al.[7], is a task that transfers the visual features of a style image to another target image. In the realm of art, for instance, this technique has been used to transform photographs into painting-like images, such as those resembling Van Gogh’s style. It has also been applied to represent target images as if they were composed of a given texture image.
An essential challenge in style transfer has been its computational cost. In early research on image style transfer, Gatys et al. demonstrated that a feature space capable of separating style and content can be obtained using VGG[28], which was trained on a large dataset. However, their method suffered from slow computation because it directly optimizes image pixels. Following Gatys et al., Huang et al. introduced the Adaptive Instance Normalization (AdaIN)[11] function to approximate the style transfer effect by training a neural network. Although Huang et al. achieved a faster inference time for style transfer, training an additional neural network on top of VGG[28] still requires a considerable amount of time.
Recent generative models have made promising improvements in the speed and quality of image generation. Diffusion models[29, 23, 5, 2] have enabled high-quality image generation using large-scale datasets such as LAION[26]. In particular, a Latent Diffusion Model (LDM) known as Stable Diffusion[23] successfully learns a generative model in a latent space corresponding to the images, rather than in the image space itself. Handling the diffusion process in this smaller space improves computational speed and reduces memory usage, making generative diffusion models feasible on consumer-grade GPUs.
In this paper, we propose a quick style transfer algorithm that requires no additional training. Our motivation is to take advantage of the high-quality and efficient image generation capability of LDMs without introducing additional training. LDMs do not have the ability to transfer image styles; therefore, we revisit AdaIN by Huang et al., a function originally developed for rapid style transfer. However, the original work by Huang et al. is not designed to work with the LDM architecture and also requires additional training. Moreover, naive application of AdaIN to the LDM's latent variable does not produce the desired style transfer due to its limited number of channels. To overcome this issue, we propose STRDP, an algorithm that alters the denoising process of an LDM by iteratively applying AdaIN in a distinct, repetitive manner within the U-Net architecture of the LDM. Our experiments show that our method handles color and texture styles separately, allowing styled image generation while preserving the original colors of the content image, as shown in Fig. 1.
In summary, our contributions are as follows:
- We propose an algorithm called STRDP for a pre-trained LDM to perform style transfer without additional training. We achieve this by repeatedly applying the AdaIN function in a distinct manner in the U-Net architecture of an LDM during the reverse diffusion steps.
- We show that our method runs faster than other diffusion-based or training-free methods, while achieving style transfer and color preservation.
- We designed our algorithm to be compatible with various LDM-based models and techniques for extensive applications.
2 Related Work
2.1 Style Transfer Methods
Direct Image Optimization:
Gatys et al.[7] introduced style and content losses using VGG features for style transfer. They also advocated preserving the colors of the content image through post-processing algorithms such as histogram matching[6]. Kolkin et al.[14] employed a loss term based on the Earth Mover's Distance. These methods directly optimize image pixels through loss functions, taking tens to hundreds of seconds depending on image resolution.
Image Feature Transformation:
There are approaches with an encoder-decoder architecture that first encode a content image and a style image into features[11, 17, 27, 20, 31, 12, 10]. The features of the content image are then modified to carry the style image's feature statistics. Finally, the modified features are decoded back into an image with the style transfer effect.
Huang et al. proposed the AdaIN function to transfer the statistics of style features onto content image features. They also proposed a framework with a neural network that approximates the image optimization done by Gatys et al., using the pre-trained VGG model as an encoder. The mean and variance of the content image features in VGG are replaced with those of the style image features using the AdaIN function, and the modified features are fed to a trained decoder to produce the final result. Their decoder is trained to convert the modified features into an image that minimizes the style and content losses proposed in [7].
These methods are fast, as they require only a single forward pass of the model, allowing style transfer in under a second. However, they require considerable time to train the encoder-decoder architecture on which they rely.
2.2 Diffusion Models and Controllability
In the field of image generation, diffusion models[29] have achieved success in terms of generation quality with training on large datasets. On the other hand, controlling the generation process poses a challenge. There are several approaches for controlling a trained diffusion model, as we discuss below.
Guidance:
In Guided Diffusion[5] and Stable Diffusion[23], the generative model is defined and learned as a conditional model. Guided Diffusion provides gradients from a separate model as guidance during denoising. However, both methods require training and the use of a model separate from the generative model itself, which poses a challenge. Kwon et al.[16] also implemented style transfer in a diffusion model with guidance, but their approach is not designed to work in the latent space that we target for reduced computational cost.
Additive Control:
Some approaches introduce additional neural network modules or parameters into the text-to-image model to enhance controllability. ControlNet[32] incorporates a trainable clone of a diffusion model and fine-tunes new parameters with conditional input, enabling the model to generate images based on various conditions such as line art, depth maps, etc. LoRA[9] is an approach first proposed for large language models. It introduces additional parameters into the transformer layers of a model. By training these parameters with different datasets and objective functions, the model can follow additional context, similar to fine-tuning the original model.
Tuning:
Some approaches fine-tune the pre-trained LDM for controllability[25, 13, 18]. InST by Zhang et al.[34] is a method built on a text-to-image diffusion model; it optimizes a text embedding rather than model parameters to obtain a vector that expresses the style image. However, this optimization takes 20 minutes.
We have reviewed various approaches for controlling diffusion models. However, these approaches either require additional training to enhance controllability or are not designed to work with the latent space of LDM.
3 Background
Diffusion Models:
The Diffusion Model (DM)[29] is a generative model that approximates the distribution of data $x$ by first converting $x$ into a simple Gaussian distribution with a diffusion process and then learning the reverse diffusion process. The diffusion process adds small amounts of Gaussian noise to the original data $x_0$ over $\mathrm{T}$ steps until $x_\mathrm{T}$ becomes Gaussian noise.
Ho et al. implemented this diffusion model as the Denoising Diffusion Probabilistic Model (DDPM)[8] by formulating each step of the reverse diffusion process as a denoising problem. They introduced a neural network $\epsilon_\theta(x_t, t)$ to predict the noise added to $x_t$, so that $\epsilon_\theta(x_t, t)$ can be applied iteratively over $\mathrm{T}$ steps to reconstruct the original data $x_0$. This $\epsilon_\theta(x_t, t)$ is modeled to implicitly predict the original data $x_0$ from any $x_t$.
Latent Diffusion Models:
Following the formulation of DDPM, the Latent Diffusion Model (LDM) was proposed by Rombach et al.[23]. The LDM learns a latent space $z$ corresponding to input data $x$ using an autoencoder composed of an encoder $E$ and a decoder $D$, with $z = E(x)$ and $x = D(z)$. The LDM introduces $\epsilon_\theta(z_t, t)$ to generate $z$. As $z$ is smaller than $x$, the computational cost is significantly reduced. Note that when working in the image domain, $\epsilon_\theta(z_t, t)$ is typically modeled by a denoising U-Net[5, 24] for its ability to retain the spatial structure of the image. Unfortunately, the LDM's efficiency in latent space also introduces a challenge: typical approaches for controlling the diffusion model, such as guidance, cannot be applied to a pre-trained LDM without additional training. Successive work in this field includes higher-resolution image generation with SDXL[21] and faster image generation with LCM-LoRA[19].
Adaptive Instance Normalization:
To control the reverse diffusion process of a diffusion model for image style transfer, we need to modify $\epsilon_\theta(z_t, t)$. However, since $z_t$ lies in the latent space, the statistical properties of $z_t$ must remain within the valid range of the autoencoder. Therefore, we employ the AdaIN[11] function by Huang et al. to control the style properties of $z_t$. AdaIN works by replacing the mean and standard deviation of each CNN[15] filter activation for the original image with those of the style image, as follows:
$$\text{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y). \quad (1)$$
Here, $x \in \mathbb{R}^{C \times H \times W}$ is the activation of a CNN filter for the content image, and $y \in \mathbb{R}^{C \times H \times W}$ is the activation of a CNN filter for the style image. We denote $C$, $H$, and $W$ as the channel, height, and width of the activations of the convolutional layer. Note that $\mu(\cdot) \in \mathbb{R}^C$ and $\sigma(\cdot) \in \mathbb{R}^C$ are the channel-wise mean and standard deviation. Huang et al. demonstrate that these activation statistics capture visual styles for style transfer. A drawback of the framework by Huang et al. is the requirement to train a decoder that converts the output features into an image.
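Eq. 1 is short enough to sketch directly in numpy. The function below operates on a single $(C, H, W)$ activation map; the small `eps` in the denominator is our own addition for numerical safety and is not part of Eq. 1.

```python
import numpy as np

def adain(x, y, eps=1e-5):
    """AdaIN (Eq. 1): align the channel-wise mean and standard deviation of
    content activations x with those of style activations y.
    Both x and y are (C, H, W) arrays."""
    # Channel-wise statistics, computed over the spatial dimensions (H, W).
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    sigma_x = x.std(axis=(1, 2), keepdims=True)
    mu_y = y.mean(axis=(1, 2), keepdims=True)
    sigma_y = y.std(axis=(1, 2), keepdims=True)
    # Normalize x per channel, then rescale and shift to y's statistics.
    return sigma_y * (x - mu_x) / (sigma_x + eps) + mu_y
```

After this call, each channel of the output has (up to `eps`) the mean and standard deviation of the corresponding channel of `y`, while the spatial layout of `x` is preserved.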
4 Methodology
Figure 2: The architecture of our image style transfer with the style-tracking reverse diffusion process. We first add noise to the latent variables of the style and content images for $\mathrm{T}'$ steps. We keep a history of latent variables from the style image's forward diffusion process as $z_{s,t}$. In the reverse diffusion steps, we gather the CNN filter activation statistics in $\epsilon_\theta$ from the style branch and transfer them to the corresponding content activations using AdaIN. This scheme allows us to transfer the image's style without training any module. We also visualize the latent variables $z_t$ and the predicted noise $\hat{\epsilon}_t$ involved in this architecture as color images. A detailed diagram of $\tilde{\epsilon}_\theta$ is shown in Fig. 3.
Figure 3: A diagram of $\tilde{\epsilon}_\theta$, which repeatedly applies AdaIN during the forward pass of the denoising U-Net. AdaIN is introduced at every convolutional layer to transfer filter activation statistics from the style image.
Our key idea is to transfer visual style via representation statistics in the reverse diffusion process of a pre-trained LDM using AdaIN. Figure 2 illustrates the pipeline of our visual style transfer.

A straightforward approach to controlling the image generation process in an LDM is to modulate $z_t$; however, AdaIN is not applicable to $z_t$ due to its small number of channels. We explain how we overcome this issue in Sec. 4.2.
4.1 Overview
We utilize the forward and reverse diffusion processes of the LDM. First, the style image and the content image are converted into the latent space. Then, in the forward diffusion process, we iteratively add noise to the representations to produce a sequence of noisy latent representations. In the reverse diffusion process, we employ AdaIN repeatedly to integrate the CNN filter activations of style and content, obtaining a stylized latent variable. Finally, we decode the stylized latent variable with the decoder of the LDM to obtain the stylized image.
Forward Diffusion Process:
We use DDIM[30] as the base sampling algorithm, since it enables faster sampling without losing quality. Initially, we obtain the latent representations $z_{s,0} \in \mathbb{R}^{C_z \times H_z \times W_z}$ and $z_{c,0} \in \mathbb{R}^{C_z \times H_z \times W_z}$ of the style image $x_s$ and content image $x_c$ using the encoder of the LDM. Here, $C_z$, $H_z$, and $W_z$ denote the channel, height, and width of the LDM's latent variable.

Given an LDM scheduled with a maximum of $\mathrm{T}$ DDIM steps, we apply the forward diffusion process to add noise to these representations over $\mathrm{T}'$ steps, yielding $z_{c,\mathrm{T}'}$ and $z_{s,\mathrm{T}'}$. We control $\mathrm{T}'$ with a strength parameter $S \in [0, 1]$ as $\mathrm{T}' = \mathrm{round}(S \cdot \mathrm{T})$. During this phase, we record $z_{s,t}$ up to $z_{s,\mathrm{T}'}$ to use in the decoding phase for the prediction of $\hat{z}_0$. The strength $S$ controls how much noise we add to the original content and style before applying style transfer. If $S = 1$, the original data becomes complete Gaussian noise at step $\mathrm{T}' = \mathrm{T}$, and we cannot reconstruct the original image in the reverse diffusion process. Therefore, we need to select a strength $S$ that still retains the style and content information in $z_{s,\mathrm{T}'}$ and $z_{c,\mathrm{T}'}$ at step $\mathrm{T}'$.
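This noising-and-recording phase can be sketched in numpy. The function name, the `alphas` array of cumulative schedule values, and the closed-form jump from $z_0$ to $z_t$ are our illustrative assumptions; an actual implementation would use the LDM's own DDIM scheduler.

```python
import numpy as np

def forward_diffuse_with_history(z0, alphas, S, rng):
    """Add noise to a latent z0 for T' = round(S * T) steps, recording every
    intermediate z_t. `alphas` holds the cumulative schedule value for each of
    the T steps. Returns (history, T_prime) with history[0] = z0."""
    T = len(alphas)
    T_prime = round(S * T)  # strength S in [0, 1] selects how far to diffuse
    history = [z0]
    for t in range(1, T_prime + 1):
        noise = rng.standard_normal(z0.shape)
        # Closed-form noising: z_t = sqrt(alpha_t) z_0 + sqrt(1 - alpha_t) eps.
        z_t = np.sqrt(alphas[t - 1]) * z0 + np.sqrt(1.0 - alphas[t - 1]) * noise
        history.append(z_t)
    return history, T_prime
```

Running this once for the style latent gives the history $z_{s,0}, \ldots, z_{s,\mathrm{T}'}$ that the reverse process later tracks; running it for the content latent only requires the endpoint $z_{c,\mathrm{T}'}$.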
Reverse Diffusion Process:
Starting from the noisy latent representations $z_{c,\mathrm{T}'}$ and $z_{s,\mathrm{T}'}$ obtained through the forward diffusion process, we transfer the style properties from $z_{s,t}$ to $z_t$ while reconstructing the image. To derive the final latent variable $\hat{z}_0$ from $z_{c,\mathrm{T}'}$, we track the encoding history $z_{s,t}$. We refer to this process as the Style-Tracking Reverse Diffusion Process (STRDP). During this process, we introduce AdaIN into $\epsilon_\theta$ to transfer the style information. After the reverse process yields $\hat{z}_0$, the LDM decoder converts $\hat{z}_0$ into the final image $\hat{x}$. Note that $S$ needs to be large enough to accumulate the style effect over the $\mathrm{T}'$ steps of our reverse diffusion process. Therefore, $S$ controls the trade-off between the reconstruction of the content and the amount of style effect applied to the result.
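The overall loop can be summarized in a hedged numpy sketch. Here `eps_tilde(z_t, z_st, t)` is a placeholder for the AdaIN-embedded parallel U-Net of Sec. 4.2, `style_history[t]` is the recorded $z_{s,t}$, and the deterministic DDIM update ($\sigma_t = 0$) is inlined; all names are illustrative, not the authors' code.

```python
import numpy as np

def strdp(z_c_T, style_history, eps_tilde, alphas, T_prime):
    """Style-Tracking Reverse Diffusion Process (sketch).
    Walks back from step T' to 0, at each step looking up the style latent
    recorded during the forward pass and denoising with eps_tilde."""
    z_t = z_c_T
    for t in range(T_prime, 0, -1):
        z_st = style_history[t]            # track the style encoding history
        eps_hat = eps_tilde(z_t, z_st, t)  # noise prediction, AdaIN inside
        alpha_t = alphas[t - 1]
        alpha_prev = alphas[t - 2] if t > 1 else 1.0
        # Deterministic DDIM update (Eq. 2 with sigma_t = 0).
        z0_pred = (z_t - np.sqrt(1.0 - alpha_t) * eps_hat) / np.sqrt(alpha_t)
        z_t = np.sqrt(alpha_prev) * z0_pred + np.sqrt(1.0 - alpha_prev) * eps_hat
    return z_t  # pass to the LDM decoder D(.) for the final image
```

The only change relative to plain DDIM sampling is the per-step lookup of $z_{s,t}$ and the swapped-in noise predictor; everything else is untouched, which is what makes the method training-free.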
4.2 Reverse Diffusion Process with AdaIN and denoising U-Net:
To describe our reverse diffusion algorithm, we start from the following denoising equation with LDM and DDIM:
$$z_{t-1} = \sqrt{\alpha_{t-1}} \underbrace{\left(\frac{z_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(z_t)}{\sqrt{\alpha_t}}\right)}_{\text{predicted } z_0} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta(z_t)}_{\text{direction pointing to } z_t} + \underbrace{\sigma_t\,\epsilon_t}_{\text{random noise}}. \quad (2)$$
Here, $\alpha_t$ refers to the diffusion scheduling parameters in DDIM, while $\sigma_t$ indicates the amount of randomness introduced in the reverse diffusion process. Typically, we set $\sigma_t = 0$ to obtain a deterministic result without the random noise $\epsilon_t$.
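A single step of Eq. 2 can be written as a small numpy function; `eps_pred` stands in for the output of $\epsilon_\theta(z_t)$, and the default $\sigma_t = 0$ gives the deterministic variant used here.

```python
import numpy as np

def ddim_step(z_t, eps_pred, alpha_t, alpha_prev, sigma_t=0.0, noise=None):
    """One reverse step of Eq. 2. With sigma_t = 0 the update is deterministic."""
    # "Predicted z_0": invert the forward process using the noise estimate.
    z0_pred = (z_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    # "Direction pointing to z_t": re-inject the estimated noise.
    direction = np.sqrt(1.0 - alpha_prev - sigma_t**2) * eps_pred
    z_prev = np.sqrt(alpha_prev) * z0_pred + direction
    # Optional "random noise" term, active only when sigma_t > 0.
    if sigma_t > 0.0 and noise is not None:
        z_prev = z_prev + sigma_t * noise
    return z_prev
```

A quick sanity check: if `eps_pred` equals the exact noise used to produce $z_t$ from $z_0$, the step lands exactly on the point the schedule prescribes for $t-1$.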
Replacing $\epsilon_\theta$ with $\tilde{\epsilon}_\theta$:
Since AdaIN for style transfer is not directly effective on $z_t$ due to its small number of channels, we apply AdaIN at the CNN filters of $\epsilon_\theta$, as they have enough channels to express the style[11]. To this end, we replace the original denoising U-Net $\epsilon_\theta(z_t)$ in Eq. 2 with our custom parallel U-Net $\tilde{\epsilon}_\theta(z_t, z_{s,t})$ shown in Fig. 3.
Our parallel U-Net $\tilde{\epsilon}_\theta(z_t, z_{s,t})$ is constructed from the original U-Net $\epsilon_\theta$ and an AdaIN-embedded $\epsilon'_\theta$. We repeatedly apply the AdaIN function at every CNN filter in the model. For the $l$-th layer of $\epsilon_\theta$ and $\epsilon'_\theta$, we denote $C_l$, $H_l$, and $W_l$ as the layer's channel, height, and width. We collect the CNN feature maps $f_{sl} \in \mathbb{R}^{C_l \times H_l \times W_l}$ from all convolutional layers while running $\epsilon_\theta(z_{s,t})$. Then, we transfer $\mu(f_{sl}) \in \mathbb{R}^{C_l}$ and $\sigma(f_{sl}) \in \mathbb{R}^{C_l}$ to the corresponding feature maps $f_{cl} \in \mathbb{R}^{C_l \times H_l \times W_l}$ in the AdaIN-embedded $\epsilon'_\theta$.
We denote the style-transferred feature maps $\tilde{f}_l \in \mathbb{R}^{C_l \times H_l \times W_l}$ for the $l$-th layer of $\tilde{\epsilon}_\theta$ as follows:

$$\tilde{f}_l = \text{AdaIN}(f_{cl}(z_t, z_{s,t}), f_{sl}(z_{s,t})), \quad (3)$$

which corresponds to the AdaIN layers depicted in Fig. 3.
Note that $\tilde{f}_l$ depends on its incoming layer activations, which are themselves results of AdaIN, as shown in Fig. 3. This means that AdaIN is applied repeatedly during a single U-Net forward pass, in contrast to the typical usage, which applies AdaIN only once to a feature. This repeated application prevents feature values from drifting outside the valid range during a U-Net forward pass.
What we enforce with AdaIN is that $\mu(\tilde{f}_l)=\mu(f_{sl})$ and $\sigma(\tilde{f}_l)=\sigma(f_{sl})$. This statistical constraint is conceptually similar to asking $\tilde{\epsilon}_\theta$ to predict a noise $\hat{\epsilon}_t$ that implicitly predicts $\hat{z}_0$ while satisfying the constraint. In other words, $\tilde{\epsilon}_\theta$ responds to $z_t$ as if the information from $z_{s,t}$ were present in $z_t$, regardless of its position. The positional invariance of the constraint comes from AdaIN using $\mu(\cdot)$ and $\sigma(\cdot)$ for each channel, which eliminates positional information.
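As a concrete reference, the channel-wise statistic transfer above can be sketched in a few lines of array code. This is an illustrative NumPy sketch, not the authors' implementation; inside the U-Net, the feature maps would be framework tensors rather than NumPy arrays.

```python
import numpy as np

def adain(f_c: np.ndarray, f_s: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """AdaIN: align the per-channel mean and std of content features f_c
    (shape C x H x W) with those of style features f_s."""
    mu_c = f_c.mean(axis=(1, 2), keepdims=True)          # mu(f_cl), shape (C, 1, 1)
    sigma_c = f_c.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    mu_s = f_s.mean(axis=(1, 2), keepdims=True)          # mu(f_sl)
    sigma_s = f_s.std(axis=(1, 2), keepdims=True)        # sigma(f_sl)
    # Normalize away content statistics, then impose style statistics
    return (f_c - mu_c) / sigma_c * sigma_s + mu_s
```

In Eq. 3, each layer's output $\tilde{f}_l$ is produced by applying this operation to the corresponding content/style feature-map pair.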
Denoising Equation with $\tilde{\epsilon}_\theta$:
With our AdaIN-embedded $\tilde{\epsilon}_\theta$, we obtain the denoising step of our STRDP by replacing $\epsilon_\theta(z_t)$ in Eq. 2 with $\tilde{\epsilon}_\theta(z_t, z_{s,t})$ as follows:

$$z_{t-1} = \sqrt{\alpha_{t-1}}\,\underbrace{\left(\frac{z_t - \sqrt{1-\alpha_t}\,\tilde{\epsilon}_\theta(z_t, z_{s,t})}{\sqrt{\alpha_t}}\right)}_{\text{predicted } z_0} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\tilde{\epsilon}_\theta(z_t, z_{s,t})}_{\text{direction pointing to } z_t} + \underbrace{\sigma_t\epsilon_t}_{\text{random noise}}. \tag{4}$$
We keep all hyperparameters the same as in the original LDM implementation. We also do not modify the scheduling parameters, since the AdaIN function already adjusts the scale of $\tilde{\epsilon}_\theta(z_t, z_{s,t})$ to match that of $\epsilon_\theta(z_{s,t})$.
In summary, $\tilde{\epsilon}_\theta$ performs style transfer by repeatedly applying AdaIN to the CNN feature maps of $\epsilon'_\theta$, using the statistics collected from $\epsilon_\theta(z_{s,t})$, at each step of the reverse diffusion process. This particular use of AdaIN in the feature space, as in Eq. 3, enables successful style transfer without additional training. We provide a more detailed explanation of our algorithm, as well as the source code, in the supplementary material.
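The update in Eq. 4 is a standard DDIM step with the noise predictor swapped for the AdaIN-embedded one. A minimal NumPy sketch, assuming the prediction `eps_tilde` for the current step has already been computed:

```python
import numpy as np

def strdp_ddim_step(z_t, eps_tilde, alpha_t, alpha_prev, sigma_t, noise):
    """One reverse step of Eq. 4; eps_tilde = tilde_eps_theta(z_t, z_{s,t})."""
    # "Predicted z_0" term
    z0_pred = (z_t - np.sqrt(1.0 - alpha_t) * eps_tilde) / np.sqrt(alpha_t)
    # "Direction pointing to z_t" term
    direction = np.sqrt(1.0 - alpha_prev - sigma_t ** 2) * eps_tilde
    # "Random noise" term (sigma_t = 0 gives the deterministic DDIM sampler)
    return np.sqrt(alpha_prev) * z0_pred + direction + sigma_t * noise
```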
5 Experiments
We compare the proposed method with prior style transfer methods, ranging from non-diffusion-based to diffusion-based models. For the LDM, we used Stable Diffusion LDM-8 [23] with $\{C_z, H_z, W_z\}=\{4, 64, 64\}$ as the default choice.
Figure 4: Qualitative comparisons of stylized images by baseline methods and ours. Our method has a texture transfer effect while preserving the color of a content image.
5.1 Comparison with State-of-the-Art Image Style Transfer
Figure 4 shows examples of stylized images. With the strength parameter $S=0.5$, the results show that our method successfully creates stylized images. A distinct difference lies in how these methods handle color. In the top row, most prior methods recolor the output using the style image's colors, while ours retains the original image's color tones. This color-keeping property is beneficial for users who want to stylize an image without significantly altering its color scheme. For Gatys et al. and Huang et al., we observe severe artifacts in the output images. InST often overlooks the content image and redraws the contents with those in the style image, as seen in the third and fourth rows. QuantArt, on the other hand, produces a conservative transformation. Although DiffuseIT transfers the style color, its results sometimes contain unrelated objects from either the style or the content, such as the spotlight rendered in the third row.
5.2 Effect of Strength $S$
Figure 5: Visualization of the effects of $S$. The style effect becomes more apparent as we increase $S$. We can see a trade-off between the style effect and deformation, which is due to the increased number of reverse diffusion steps in the LDM as $S$ grows.
Figure 5 visualizes the impact of the strength $S\in[0,1]$ on the output results. $S$ linearly interpolates the number of reverse diffusion steps from 0 to $T=50$. The results show that the style transfer effect of AdaIN is strengthened as we increase $S$. At the same time, the increased number of reverse diffusion steps causes the LDM to lose the original content information, except for the color. This makes our style transfer effect a trade-off against the deformation effect.
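Under this linear-interpolation scheme, the mapping from $S$ to a step count can be sketched as follows; the exact rounding convention used by the authors is an assumption here.

```python
def num_reverse_steps(S: float, T: int = 50) -> int:
    """Linearly interpolate the number of reverse diffusion steps
    by the strength S in [0, 1], out of T total steps."""
    assert 0.0 <= S <= 1.0
    return round(S * T)
```

For example, `num_reverse_steps(0.5)` runs 25 of the 50 reverse steps, while $S=1$ runs the full reverse process.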
5.3 Computational Costs
Table 1: Computational cost per image, training requirement, and diffusion model choice for a 512×512 pixel image. Our method consumes more memory than non-diffusion models because the base model, i.e., the LDM [23], is larger. However, our method does not require additional training or optimization for style control.
| Method | VRAM | Process Time | Training | Diffusion Model |
|---|---|---|---|---|
| Gatys[7] | 1.3GB | 16.4 sec | Not Required | N/A |
| Huang[11] | 0.3GB | 0.04 sec | Required | N/A |
| StyTr²[4] | 1.6GB | 0.35 sec | Required | N/A |
| QuantArt[10] | 1.1GB | 0.02 sec | Required | N/A |
| InST[34] | 7.9GB | 3.61 sec | Required | LDM |
| DiffuseIT[16] | 8.6GB | 60.61 sec | Not Required | DM |
| Ours (S=0.3) | 8.4GB | 2.70 sec | Not Required | LDM |
We measured the amount of VRAM required and the computation time for the actual processing (excluding model loading). Based on the comparison with the baselines in Tab. 1, our method consumes more memory than the existing non-diffusion-based models [7, 11, 4, 10] due to the underlying LDM's size. Since we do not introduce additional parameters to the base LDM, the amount of memory required is equivalent to that of the LDM. The training-based methods [11, 4, 10] are lightweight at inference time, but they require computationally heavy training beforehand. Compared to the existing training-free methods [7, 16], ours has an advantage in terms of speed.
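The "Process Time" column reflects the wall-clock time of the stylization call alone, excluding model loading. A generic timing wrapper of the kind one might use for such a measurement (illustrative, not the authors' benchmarking code):

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call,
    excluding any setup done before the call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```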
5.4 Quantitative Evaluations
Table 2: The average style loss, content loss, and quantitative metrics obtained from baseline methods and ours. As the strength $S$ increases, our method produces a lower style loss: $S$ controls how much style information appears in the result, while the content loss increases with $S$ due to the additional diffusion steps. The remaining metrics, from LPIPS to CLIP, consistently indicate this trade-off characteristic of our method.
| Method | Style Loss ↓ | Content Loss ↓ | LPIPS ↓ | PSNR ↑ | SSIM ↑ | CLIP ↑ |
|---|---|---|---|---|---|---|
| Gatys et al. | 0.012 | 16.0 | 0.69 | 16.05 | 0.40 | 0.62 |
| Huang et al. | 0.016 | 14.8 | 0.70 | 13.00 | 0.29 | 0.62 |
| StyTr² | 0.014 | 12.5 | 0.63 | 14.53 | 0.54 | 0.75 |
| QuantArt | 0.045 | 11.0 | 0.58 | 15.08 | 0.37 | 0.80 |
| InST | 0.060 | 19.9 | 0.62 | 14.50 | 0.31 | 0.54 |
| DiffuseIT | 0.057 | 12.6 | 0.69 | 12.34 | 0.36 | 0.70 |
| Ours (S=0.1) | 0.054 | 4.9 | 0.33 | 23.28 | 0.63 | 0.93 |
| Ours (S=0.3) | 0.045 | 11.5 | 0.48 | 20.56 | 0.52 | 0.77 |
| Ours (S=0.5) | 0.038 | 16.8 | 0.57 | 18.20 | 0.42 | 0.61 |
| Ours (S=0.7) | 0.036 | 19.6 | 0.63 | 16.39 | 0.37 | 0.53 |
| Ours (S=0.9) | 0.034 | 22.1 | 0.70 | 13.61 | 0.28 | 0.51 |
To quantify the effects of style transfer, we measured the style loss and content loss used by Gatys et al. [7], along with LPIPS [33], PSNR, SSIM, and CLIP [22] similarity as supplementary metrics. A smaller style loss indicates that the final image represents the transferred visual style. On the other hand, a smaller content loss indicates that the final image preserves its original content. The amount of reflected style and the preservation of content are likely to exhibit a trade-off. We sampled 100 pairs of style and content images from WikiArt [1], ImageNet [3], and Places365 [35] and generated stylized images using each method. The averages of each metric are shown in Table 2.
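For reference, the Gram-matrix style loss and feature-space content loss of Gatys et al. [7] can be sketched as below. This is a simplified single-layer NumPy version; the actual losses sum these terms over several VGG layers.

```python
import numpy as np

def gram(f: np.ndarray) -> np.ndarray:
    """Gram matrix of a C x H x W feature map, normalized by spatial size."""
    C = f.shape[0]
    X = f.reshape(C, -1)
    return X @ X.T / X.shape[1]

def style_loss(f_out: np.ndarray, f_style: np.ndarray) -> float:
    # Mean squared difference between Gram matrices (style statistics)
    return float(np.mean((gram(f_out) - gram(f_style)) ** 2))

def content_loss(f_out: np.ndarray, f_content: np.ndarray) -> float:
    # Mean squared difference between raw feature maps (content)
    return float(np.mean((f_out - f_content) ** 2))
```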
As we increase the number of diffusion steps via $S$, we observe that the style loss decreases. This result aligns with Fig. 5, since the style information becomes more prominent in the final results as $S$ grows. Our method yields larger style loss values overall; however, we attribute this to our method's tendency to transfer style while keeping the original color of the content image. Because color transfer dominates the VGG-based style loss, our loss values do not decrease beyond a certain level. For CLIP similarity, we measured cosine similarity in the CLIP embedding space. LPIPS, PSNR, SSIM, and CLIP scores indicate how well content information is retained and show a tendency aligned with the content losses for our method. In summary, $S$ controls the trade-off between the amount of style transfer effect and content retention.
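The CLIP similarity above is cosine similarity between CLIP image embeddings of the stylized output and the content image. Given two embedding vectors, the computation reduces to:

```python
import numpy as np

def clip_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors
    (e.g., CLIP image features of two images)."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))
```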
5.5 Compatibility across LDM Variants
Figure 6: Image style transfer results using our method ($S=0.5$) with text prompts and model variants. The model follows a text prompt to maintain the features of the specified object in (b), such as the bird's beak. The results with SDXL [21] (c) show that our algorithm can transfer style with the SDXL architecture for high resolution. The results with LCM-LoRA [19] (d) are obtained with 2 reverse steps in 0.56 seconds; this configuration achieves style transfer faster than the standard LDM (a).
We examine the compatibility of our method in Fig. 6, providing results with (b) text prompts, (c) SDXL [21], and (d) LCM-LoRA [19]. SDXL is an extension of the LDM for generating high-resolution images. LCM-LoRA combines fine-tuning and a sampling algorithm to accelerate an LDM. For text prompts (b), we provide a word that represents the content of an image as a prompt to guide the model to generate style-transferred images while keeping their content. In (a), our method without a text prompt, the bird in the first row is strongly deformed by the abstract art style, losing details such as the existence of a beak. On the contrary, with a prompt as in (b), the bird's original shape is preserved even with the abstract style image. The results with SDXL (c) show that our algorithm can transfer style with the SDXL architecture at higher image resolution (1024×1024). Our algorithm is also compatible with (d) LCM-LoRA, enabling style transfer in 2 reverse steps compared to the 25 reverse steps of (a). Although fine details of a style image are suppressed due to the reduced reverse steps, this enables fast style transfer in less than a second.
In summary, we observe that our method can be plugged into models and techniques commonly introduced to LDM variants without training, for quick experimentations. We include the implementations for (c) and (d) in the Supplementary.
5.6 Effect of Feature Selection and AdaIN
Figure 7: Ablations of applying AdaIN in various feature spaces in the reverse diffusion process with $S=0.5$. Only our approach transfers the style effect, while the alternatives that apply AdaIN to different features fail to transfer styles. The LDM also fails to reconstruct images from features modified by WCT.
To validate our design choice for AdaIN, we performed an ablation study over the feature spaces to which AdaIN is applied. The choices of source→target pairs for AdaIN are (a): $f_{sl}\rightarrow f_{cl}$, i.e., ours, (b): $\epsilon_\theta(z_{s,t})\rightarrow\epsilon_\theta(z_t)$, and (c): $z_{s,t}\rightarrow z_t$. We are also interested in which feature transformation algorithms can be effectively incorporated into the diffusion model. To this end, we compared AdaIN with the whitening and coloring transform (WCT) [17] in (d).
- (a) $\tilde{f}_l=\mathrm{AdaIN}(f_{cl}(z_t, z_{s,t}),\, f_{sl}(z_{s,t}))$: The proposed method, which repeatedly applies AdaIN in a U-Net forward pass.
- (b) $\hat{\epsilon}_t=\mathrm{AdaIN}(\epsilon_\theta(z_t),\, \epsilon_\theta(z_{s,t}))$: Applies AdaIN to the predicted noise instead of the CNN features in $\tilde{\epsilon}_\theta$ as in (a).
- (c) $\hat{\epsilon}_t=\epsilon_\theta(\mathrm{AdaIN}(z_t, z_{s,t}))$: A naive approach that applies AdaIN between the latent variables of content and style.
- (d) $\tilde{f}_l=\mathrm{WCT}(f_{cl}(z_t, z_{s,t}),\, f_{sl}(z_{s,t}))$: Applies WCT instead of AdaIN in (a). WCT is also known as a method for transferring style statistics by aligning the covariance matrices of features.
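For contrast with AdaIN, variant (d) uses WCT [17], which aligns full covariance matrices via eigendecomposition rather than per-channel means and standard deviations. A NumPy sketch of WCT on a single C×H×W feature map follows; the small regularizer `eps` is an assumption added for numerical stability.

```python
import numpy as np

def wct(f_c: np.ndarray, f_s: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Whitening and coloring transform: make cov(output) match cov(f_s)."""
    C, H, W = f_c.shape
    Xc = f_c.reshape(C, -1)
    Xs = f_s.reshape(C, -1)
    mc, ms = Xc.mean(1, keepdims=True), Xs.mean(1, keepdims=True)
    Xc, Xs = Xc - mc, Xs - ms
    # Whiten: remove the content covariance via eigendecomposition
    ec, Dc = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1] + eps * np.eye(C))
    whitened = Dc @ np.diag(ec.clip(min=eps) ** -0.5) @ Dc.T @ Xc
    # Color: impose the style covariance
    es, Ds = np.linalg.eigh(Xs @ Xs.T / Xs.shape[1] + eps * np.eye(C))
    colored = Ds @ np.diag(es.clip(min=eps) ** 0.5) @ Ds.T @ whitened
    return (colored + ms).reshape(C, H, W)
```

Each call requires two eigendecompositions, which is one reason applying WCT at every CNN filter is far more expensive than AdaIN's mean/std matching.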
The results are displayed in Figure 7. We can see the successful style transfer effect of our method in (a). (b) barely shows any style transfer effect, implying that the statistics of $\epsilon_\theta(z_{s,t})$ do not convey the style. (c) only transfers global color. The comparison between (a), ours, and (d) demonstrates the effectiveness of AdaIN: (d) WCT fails to generate a stylized image. We believe this is because WCT produces extreme values that lie outside the valid domain of the denoising U-Net. Moreover, introducing WCT at every CNN filter is extremely inefficient due to the computational cost of eigenvalue decomposition. In summary, only our method (a) effectively transfers the style by applying AdaIN from $f_{sl}$ to $f_{cl}$, which validates our design choice.
5.7 Color Transfer with Histogram Matching
Figure 8: Our style transfer results with a histogram matching algorithm at $S=0.5$. While our algorithm preserves the original content colors, histogram matching can additionally transfer the color style of a style image.
Our proposed method applies texture style transfer while preserving the color of the content image. However, there may be situations where one also wishes to transfer the color distribution of the style image. While color transfer is not our primary focus, histogram matching can be employed as an additional post-processing step when needed. As demonstrated in Fig. 8, histogram matching enables the simultaneous transfer of texture style and color distribution.
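Histogram matching itself is independent of the diffusion model. A per-channel, sort-based NumPy sketch of such a post-processing step is shown below; the authors' exact variant may differ (e.g., it may operate in a different color space).

```python
import numpy as np

def match_histograms(content: np.ndarray, style: np.ndarray) -> np.ndarray:
    """Per-channel histogram matching for H x W x C images:
    each content channel's value distribution is mapped onto the style's."""
    out = np.empty_like(content)
    for c in range(content.shape[-1]):
        src = content[..., c].ravel()
        ref_sorted = np.sort(style[..., c].ravel())
        # Rank of each content pixel -> corresponding quantile of the style
        idx = np.linspace(0, ref_sorted.size - 1, src.size).astype(int)
        mapped = np.empty_like(src)
        mapped[np.argsort(src)] = ref_sorted[idx]
        out[..., c] = mapped.reshape(content.shape[:-1])
    return out
```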
5.8 User Study
Table 3: Average scores from the user study. Bold and underlined values indicate the first- and second-best scores. The scores show that our method's ability to transfer texture is comparable to Huang et al. [11] and other diffusion-based methods. Our method also achieved higher scores for color preservation.
| Model | Method | Overall Quality ↑ | Color Preservation ↑ | Texture Transfer ↑ | Shape Preservation ↑ |
|---|---|---|---|---|---|
| Non-Diffusion | Huang[11] | 2.29 | 2.20 | 2.66 | 3.38 |
| Non-Diffusion | Gatys[7] | 2.89 | 2.45 | 3.65 | 3.74 |
| Non-Diffusion | StyTr²[4] | 3.54 | 2.59 | 3.65 | 4.15 |
| Non-Diffusion | QuantArt[10] | 3.34 | 2.91 | 2.19 | 4.00 |
| Diffusion | InST[34] | 2.17 | 2.00 | 1.92 | 1.75 |
| Diffusion | DiffuseIT[16] | 2.59 | 2.30 | 2.44 | 2.92 |
| Diffusion | Ours | 3.39 | 4.14 | 2.83 | 3.97 |
We conducted a user study to evaluate the overall quality and three characteristics of the image style transfer task: color preservation, texture transfer, and shape preservation.
In the experiment, we sampled style and content pairs from the Places365 and ImageNet datasets. We asked annotators to evaluate the output images on a 5-point Likert scale, ensuring that at least three individuals answered each question, which yielded 5,280 responses in total. The strength $S$ of our method was fixed at 0.3. The results of the user study are shown in Tab. 3.
In Tab. 3, our method achieves the second-best score for overall quality. Although the three characteristics do not indicate the superiority of any single method, our method received a particularly high rating for color preservation, reflecting its tendency to retain color. For texture transfer and shape preservation, our method is rated as producing the style transfer effect while retaining shape comparably to previous methods. We provide a detailed explanation of the user study in the supplementary material.
5.9 Limitations
There are several limitations to our proposed method. First, it is not fast enough to transfer the style effect in real time due to the base LDM. The development of faster sampling for diffusion models, such as LCM-LoRA, may mitigate this issue in the future. Second, the memory requirement of our method is comparatively large; this is again due to the base LDM, which already consumes more memory than the models used in prior methods, such as VGG.
6 Conclusion
In this work, we presented a training-free style transfer algorithm, STRDP, built on a pretrained LDM. We introduced a custom U-Net architecture that repeatedly applies the AdaIN function in each layer during the reverse diffusion process. This approach allows quick style transfer without additional training or optimization. We demonstrated the style transfer performance of our algorithm through quantitative metrics and a user study; in particular, our algorithm runs faster than prior diffusion-based and training-free methods. Furthermore, we showed the compatibility of our algorithm with common LDM models and techniques, such as text prompts, SDXL for high resolution, and LCM-LoRA for faster generation.
In summary, the speed, compatibility, and training-free characteristics of our method make it suitable for quick experimentation with image style transfer and extensive LDM-based techniques and applications.
References
- [1] Visual art encyclopedia (2023), http://www.wikiart.org/, accessed: 2023-05-10
- [2] Avrahami, O., Lischinski, D., Fried, O.: Blended diffusion for text-driven editing of natural images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18208–18218 (2022)
- [3] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
- [4] Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., Xu, C.: Stytr2: Image style transfer with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11326–11336 (2022)
- [5] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
- [6] Gatys, L., Bethge, M., Hertzmann, A., et al.: Preserving color in neural artistic style transfer. arXiv preprint arXiv:1606.05897 (2016)
- [7] Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2414–2423 (2016)
- [8] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [9] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
- [10] Huang, S., An, J., Wei, D., Luo, J., Pfister, H.: Quantart: Quantizing image style transfer towards high visual fidelity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5947–5956 (2023)
- [11] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE international conference on computer vision. pp. 1501–1510 (2017)
- [12] Huo, J., Jin, S., Li, W., Wu, J., Lai, Y.K., Shi, Y., Gao, Y.: Manifold alignment for semantically aligned style transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14861–14869 (2021)
- [13] Kim, G., Kwon, T., Ye, J.C.: Diffusionclip: Text-guided diffusion models for robust image manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2426–2435 (2022)
- [14] Kolkin, N., Salavon, J., Shakhnarovich, G.: Style transfer by relaxed optimal transport and self-similarity. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10051–10060 (2019)
- [15] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Communications of the ACM 60(6), 84–90 (2017)
- [16] Kwon, G., Ye, J.C.: Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264 (2022)
- [17] Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. Advances in neural information processing systems 30 (2017)
- [18] Lu, H., Tunanyan, H., Wang, K., et al.: Specialist diffusion: Plug-and-play sample-efficient fine-tuning of text-to-image diffusion models to learn any unseen style. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
- [19] Luo, S., Tan, Y., Patil, S., Gu, D., von Platen, P., Passos, A., Huang, L., Li, J., Zhao, H.: Lcm-lora: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556 (2023)
- [20] Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5880–5888 (2019)
- [21] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
- [22] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [23] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
- [24] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
- [25] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242 (2022)
- [26] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
- [27] Sheng, L., Lin, Z., Shao, J., Wang, X.: Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8242–8250 (2018)
- [28] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
- [29] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning. pp. 2256–2265. PMLR (2015)
- [30] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
- [31] Yoo, J., Uh, Y., Chun, S., Kang, B., Ha, J.W.: Photorealistic style transfer via wavelet transforms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9036–9045 (2019)
- [32] Zhang, L., Agrawala, M.: Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543 (2023)
- [33] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
- [34] Zhang, Y., Huang, N., Tang, F., Huang, H.: Inversion-based style transfer with diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)
- [35] Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: A 10 million image database for scene recognition. In: IEEE Transactions on Pattern Analysis and Machine Intelligence. vol.40, pp. 1452–1464. IEEE (2017)