Title: Deep Optimal Transport: A Practical Algorithm for Photo-realistic Image Restoration
URL Source: https://arxiv.org/html/2306.02342
Published Time: Tue, 13 Aug 2024 01:00:11 GMT
Theo J. Adrai
Technion–Israel Institute of Technology
Computer Science
Guy Ohayon
Technion–Israel Institute of Technology
Computer Science
Michael Elad
Technion–Israel Institute of Technology
Computer Science
Tomer Michaeli
Technion–Israel Institute of Technology
Electrical Engineering
Abstract
We propose an image restoration algorithm that can control the perceptual quality and/or the mean square error (MSE) of any pre-trained model, trading one for the other at test time. Our algorithm is few-shot: Given about a dozen images restored by the model, it can significantly improve the perceptual quality and/or the MSE of the model for newly restored images without further training. Our approach is motivated by a recent theoretical result that links the minimum MSE (MMSE) predictor to the predictor that minimizes the MSE under a perfect perceptual quality constraint. Specifically, it has been shown that the latter can be obtained by optimally transporting the output of the former, such that its distribution matches the source data. Thus, to improve the perceptual quality of a predictor that was originally trained to minimize MSE, we approximate the optimal transport by a linear transformation in the latent space of a variational auto-encoder, which we compute in closed-form using empirical means and covariances. Going beyond the theory, we find that applying the same procedure to models that were initially trained to achieve high perceptual quality typically improves their perceptual quality even further. And by interpolating the results with the original output of the model, we can improve their MSE at the expense of perceptual quality. We illustrate our method on a variety of degradations applied to general content images of arbitrary dimensions.
1 Introduction
Figure 1: The $\mathcal{W}_{2}$-MSE trade-off[1].
Many image restoration algorithms aim to recover a clean source image from its degraded version. The performance of such algorithms is often evaluated in terms of their average distortion, which measures the discrepancy between restored images and their corresponding clean sources, as well as perceptual quality, which refers to the extent to which restored images resemble natural images. The work in [2] exposed a fundamental trade-off between distortion and perceptual quality, where the latter is measured using a perceptual index that quantifies the statistical divergence between the distribution of restored images and the distribution of natural images. The trade-off curve reveals the predictor that achieves the lowest possible distortion, denoted as $\mathbf{D_{max}}$, while maintaining perfect perceptual quality (refer to fig. 1).
Following the methodology introduced by [2], it has become common practice to compare restoration methods on the perception-distortion (PD) plane, with many methods aiming to reach the elusive $\mathbf{D_{max}}$ point. In this paper, we present a practical approach to approximate the $\mathbf{D_{max}}$ predictor where distortion is measured using the MSE and perceptual quality is measured by the Wasserstein-2 distance ($\mathcal{W}_{2}$) between the distributions of restored and real images. Our approach is based on the recent work [1], which demonstrated that the $\mathbf{D_{max}}$ predictor can be obtained using optimal transport (OT) from the output distribution of the MMSE predictor to the distribution of natural images. Thus, applying an optimal transport plan to the MMSE restoration of a degraded image yields a $\mathbf{D_{max}}$ estimate.
Figure 2: Our few-shot algorithm improves the visual quality of any estimator at test time. For example, we can improve the photo-realism of DDRM[3] even further.
Although progress has been made in finding OT plans between image distributions[4, 5, 6], it remains a challenging task, particularly for high-dimensional distributions. Therefore, we propose an approximation method for the $\mathbf{D_{max}}$ estimator by performing transportation in the latent space of a pre-trained auto-encoder. A similar strategy was successfully employed in the context of image generation in[7], showing effectiveness in reducing complexity and preserving details.
Inspired by the style transfer literature[8, 9, 10], we assume that the latent representations follow a multivariate Gaussian (MVG) distribution. Thus, by considering the first- and second-order statistics of the embedded MMSE estimates and embedded natural images, we can compute the well-known closed-form solution of the OT operator between two Gaussians. To further reduce complexity, we make additional assumptions about the structure of the latent covariance matrices, enabling the computation of the OT operator with as few as 10 unpaired MMSE restored and clean samples. This approach leads to a few-shot algorithm that significantly enhances visual quality.
Interestingly, our method can even improve the visual quality of generative models that were trained to achieve high perceptual quality in the first place (see fig. 2). Furthermore, by adjusting a single interpolation parameter, we can trade off perception for distortion, resulting in marginal improvements in the distortion performance of some regression models that were trained to prioritize source fidelity. We demonstrate the improved photo-realism of our approach on a variety of tasks and models, including GAN and diffusion-based methods, using high-resolution (e.g., $512^{2}$ px) general-content images with arbitrary aspect ratios. Our code is publicly available at https://github.com/theoad/dot-dmax.
2 Related Work
Throughout the paper, we distinguish between two kinds of restoration algorithms: distortion and perception focused. The former category includes traditional methods that minimize distortion (e.g., MSE)[11, 12, 13, 14, 15]. The latter category includes more recent works that usually involve generative models like Generative Adversarial Networks (GANs)[16, 17, 18], or diffusion-based techniques[3, 19].
This paper searches for the theoretical $\mathbf{D_{max}}$ estimator, which minimizes the MSE under a perfect perceptual quality constraint. However, our method is not a stand-alone restoration algorithm, i.e., its input is not the degraded image. Rather, it can be applied on top of any existing predictor. We provide a new way to potentially improve the performance (either MSE or perceptual) of any given estimator in a few-shot, plug-and-play fashion.
Provided with an image latent representation method (e.g., an auto-encoder), our algorithm applies a linear transformation on all the overlapping patches of its input (after encoding). In this regard, its functioning is not far from classical image restoration methods[15, 20, 14].
2.1 Wasserstein-2 transport
While many successful approaches exist to compute the $\mathcal{W}_{2}$ distance between discrete, low- or medium-dimensional densities[21, 22], it is far more challenging to determine an optimal transport plan in the continuous, high-dimensional setting. In fact, even the task of computing the Wasserstein distance (without its optimal plan) on empirical distributions has drawn significant attention with WGANs[23]. Computing the transport operator requires optimizing the $\mathcal{W}_{2}$ distance with an additional ordering constraint on the generator, which has proved to be a challenging task when dealing with real-world data sets[6, 4, 5]. One way to sidestep this difficulty is to use the Gelbrich distance[1, 24], which lower bounds the $\mathcal{W}_{2}$ distance and depends only on the first two moments of the distributions. Nevertheless, it is only a good estimate when the support of the distributions has elliptical level sets. To address this, one can find a low-dimensional embedding where (i) high dimensionality is no longer an issue, (ii) the distributions are not degenerate, and (iii) the Gelbrich distance equals the Wasserstein-2 distance. A possible option is to use the bottleneck of an auto-encoder. This approach was adopted by style transfer works[8, 9, 10], and is also widely used by tools that compare image distributions, like the Fréchet Inception Distance (FID)[25]. Both use a convolutional encoder and treat the pixels of the latent embedding (the vectors that span the channel dimension) as samples from an MVG distribution.
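To make the "first two moments only" property concrete, here is a minimal NumPy sketch of the Gelbrich distance between two distributions summarized by a mean vector and a covariance matrix. The function names `sqrtm_psd` and `gelbrich_distance` are illustrative and are not taken from the authors' code; the formula is the standard one, which coincides with $\mathcal{W}_{2}$ when both distributions are Gaussian.

```python
import numpy as np

def sqrtm_psd(mat):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gelbrich_distance(mu1, cov1, mu2, cov2):
    """Gelbrich distance computed from means and covariances only."""
    root1 = sqrtm_psd(cov1)
    cross = sqrtm_psd(root1 @ cov2 @ root1)
    squared = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))  # clamp round-off below zero
```

For instance, with `mu1 = 0`, `cov1 = I`, `mu2 = (3, 4)` and `cov2 = 4 I` in two dimensions, the squared distance is 25 from the means plus 2 from the covariances.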
3 Background
3.1 Optimal transport in Wasserstein Space
In this section, we briefly introduce key concepts of optimal transport theory that we draw from[26].
Let $\mu$ and $\nu$ be probability measures on $\mathbb{R}^{n}$. The set of all transport plans, which are probability measures $\pi$ on $\mathbb{R}^{n}\times\mathbb{R}^{n}$ with marginals $\mu$ and $\nu$, is denoted by $\Pi(\mu,\nu)$. The Wasserstein-2 distance between $\mu$ and $\nu$ is defined as follows:
$$\mathcal{W}_{2}^{2}(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\mathbb{E}_{x,y\sim\pi}\left[\left\lVert x-y\right\rVert_{2}^{2}\right]. \tag{1}$$
A transport plan that achieves this infimum is called an optimal transport plan between $\mu$ and $\nu$. If $\mu$ has a density (i.e., is absolutely continuous w.r.t. the Lebesgue measure), there exists a measurable function $\text{T}_{\mu\longrightarrow\nu}:\mathbb{R}^{n}\longrightarrow\mathbb{R}^{n}$ such that, if $\mathbf{x}_{1}\sim\mu$ and $\mathbf{x}_{2}\sim\nu$ are two random variables jointly distributed according to the optimal plan, then $\mathbf{x}_{2}\stackrel{a.s}{=}\text{T}_{\mu\longrightarrow\nu}(\mathbf{x}_{1})$. We refer to $\text{T}_{\mu\longrightarrow\nu}$ as the optimal transport operator between $\mu$ and $\nu$. As in[5], we abuse this notation even when $\pi$ is non-degenerate, in which case $\text{T}_{\mu\longrightarrow\nu}$ represents a one-to-many (stochastic) mapping.
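As a small illustration of eq. 1 and of a deterministic transport operator, consider the one-dimensional empirical case, where the optimal plan between two equal-size samples with uniform weights simply matches sorted values. The sketch below assumes distinct sample values and uses hypothetical helper names; it is a toy case and not part of the proposed method.

```python
import numpy as np

def w2_empirical_1d(a, b):
    """W2 between two equal-size 1-D samples with uniform weights.

    In one dimension the optimal plan pairs the i-th smallest point of `a`
    with the i-th smallest point of `b`.
    """
    a_sorted, b_sorted = np.sort(a), np.sort(b)
    return float(np.sqrt(np.mean((a_sorted - b_sorted) ** 2)))

def transport_1d(a, b):
    """Send each point of `a` to its matched point of `b` (monotone map)."""
    ranks = np.argsort(np.argsort(a))  # rank of every element of `a`
    return np.sort(b)[ranks]
```

Shifting a sample by a constant, for example, gives a distance equal to the absolute value of that constant, and `transport_1d` then returns the shifted sample in the original order.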
Additionally, when considering two multivariate Gaussians (MVGs) $\mathbf{x}_{1}\sim\mathcal{N}(\mu_{\mathbf{x}_{1}},\Sigma_{\mathbf{x}_{1}})$ and $\mathbf{x}_{2}\sim\mathcal{N}(\mu_{\mathbf{x}_{2}},\Sigma_{\mathbf{x}_{2}})$, and assuming that $\Sigma_{\mathbf{x}_{1}}$ and $\Sigma_{\mathbf{x}_{2}}$ are non-singular, there exists a closed-form solution for the optimal transport operator, which is deterministic and linear:
$$\text{T}_{p_{\mathbf{x}_{1}}\longrightarrow p_{\mathbf{x}_{2}}}^{\text{MVG}}(x_{1})=\Sigma_{\mathbf{x}_{1}}^{-\frac{1}{2}}\left(\Sigma_{\mathbf{x}_{1}}^{\frac{1}{2}}\Sigma_{\mathbf{x}_{2}}\Sigma_{\mathbf{x}_{1}}^{\frac{1}{2}}\right)^{\frac{1}{2}}\Sigma_{\mathbf{x}_{1}}^{-\frac{1}{2}}\cdot(x_{1}-\mu_{\mathbf{x}_{1}})+\mu_{\mathbf{x}_{2}}, \tag{2}$$
where a symmetric and positive definite square root of the matrices is chosen.
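The closed form in eq. 2 can be sketched in a few lines of NumPy. This is a minimal illustration under the stated non-singularity assumption, not the authors' implementation: the names `sqrtm_psd` and `gaussian_ot_operator` are hypothetical, and the symmetric square root is taken via an eigendecomposition.

```python
import numpy as np

def sqrtm_psd(mat):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_ot_operator(mu1, cov1, mu2, cov2):
    """Return (A, b) such that T(x) = A @ x + b is the affine map of eq. 2."""
    root1 = sqrtm_psd(cov1)
    inv_root1 = np.linalg.inv(root1)  # requires cov1 to be non-singular
    A = inv_root1 @ sqrtm_psd(root1 @ cov2 @ root1) @ inv_root1
    b = mu2 - A @ mu1                 # so that T(mu1) = mu2
    return A, b
```

A quick sanity check is that the map sends the source statistics onto the target ones: `A @ cov1 @ A.T` should equal `cov2` up to round-off, and `A @ mu1 + b` should equal `mu2`.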
3.2 Wasserstein-2 MSE tradeoff
We build upon the problem setting introduced in[2, 1] to establish our analysis. We consider the following scenario: $\mathbf{x}\in\mathbb{R}^{n}$ represents a source natural image, $\mathbf{y}\in\mathbb{R}^{m}$ represents its degraded version, and we assume that the posterior $p_{\mathbf{x}|\mathbf{y}}(\cdot|y)$ is non-degenerate for almost any $y$. Our objective is to construct an estimator $\mathbf{\hat{x}}$ that predicts $\mathbf{x}$ given $\mathbf{y}$. A valid estimator $\mathbf{\hat{x}}$ should be independent of $\mathbf{x}$ given $\mathbf{y}$. Finally, $p_{\mathbf{x}}$, $p_{\mathbf{x}^{*}}$ and $p_{\mathbf{\hat{x}}_{0}}$ denote the probability distributions associated with the random variables $\mathbf{x}$, $\mathbf{x}^{*}$ and $\mathbf{\hat{x}}_{0}$, respectively, where $\mathbf{x}^{*}$ and $\mathbf{\hat{x}}_{0}$ are defined next.
Let $\mathbf{x}^{*}=\mathbb{E}[\mathbf{x}|\mathbf{y}]$ denote the MMSE estimator, which achieves the minimal MSE, i.e., $\text{MSE}(\mathbf{x},\mathbf{x}^{*})=D_{\min}$. Additionally, let $\mathbf{\hat{x}}_{0}$ denote the $\mathbf{D_{max}}$ estimator, which, among all estimators satisfying $\mathcal{W}_{2}(p_{\mathbf{x}},p_{\mathbf{\hat{x}}_{0}})=0$, attains the minimal MSE, namely, $\text{MSE}(\mathbf{x},\mathbf{\hat{x}}_{0})=D_{\max}$ (refer to fig. 1). Notably, as discussed in[1], these estimators have a compelling property: their joint distribution $p_{\mathbf{\hat{x}}_{0},\mathbf{x}^{*}}$ is an optimal transport plan between $\mathbf{x}^{*}$ and $\mathbf{x}$, characterized by the following optimization problem:
$$p_{\mathbf{\hat{x}}_{0},\mathbf{x}^{*}}\in\operatorname*{arg\,min}_{p_{\mathbf{x}_{1},\mathbf{x}_{2}}\in\Pi(p_{\mathbf{x}},p_{\mathbf{x}^{*}})}\mathbb{E}\left[\left\lVert\mathbf{x}_{1}-\mathbf{x}_{2}\right\rVert_{2}^{2}\right]. \tag{3}$$
In other words, finding $\mathbf{\hat{x}}_{0}$ is equivalent to finding an optimal transport plan from $\mathbf{x}^{*}$ to $\mathbf{x}$. Then, the $\mathbf{D_{max}}$ estimator is simply $\mathbf{\hat{x}}_{0}=\text{T}_{p_{\mathbf{x}^{*}}\longrightarrow p_{\mathbf{x}}}(\mathbf{x}^{*})$.
This estimator is particularly useful because it allows one to obtain any point on the perception-distortion function through a naive linear interpolation with the MMSE estimator. Specifically, we can define the interpolated estimator $\mathbf{\hat{x}}_{P}$ as follows:
$$\mathbf{\hat{x}}_{P}=(1-\alpha)\mathbf{\hat{x}}_{0}+\alpha\mathbf{x}^{*}, \tag{4}$$
where $0\leq\alpha\leq 1$ is an interpolation constant[1] that depends on the perceptual index of $\mathbf{x}^{*}$ and the desired perceptual index $0\leq P=\mathcal{W}_{2}(\mathbf{x},\mathbf{\hat{x}}_{P})$ (refer to fig. 1).
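To see how $\alpha$ moves the estimator along the trade-off, the following toy sketch applies eq. 4 in a scalar Gaussian setting, where $\mathcal{W}_{2}$ has a closed form. All numbers are made up for illustration, the helper names are hypothetical, and the scalar Gaussian assumption is ours; in this toy case the perceptual index of the interpolant grows linearly with $\alpha$.

```python
import numpy as np

def interpolate(x_hat0, x_star, alpha):
    """Eq. 4: alpha = 0 returns x_hat0, alpha = 1 returns x_star."""
    return (1.0 - alpha) * x_hat0 + alpha * x_star

def w2_gauss_1d(mu1, sigma1, mu2, sigma2):
    """Closed-form W2 between two scalar Gaussians."""
    return float(np.hypot(mu1 - mu2, sigma1 - sigma2))

# Toy statistics (assumed values, not from the paper).
mu_x, sigma_x = 0.0, 2.0   # source distribution p_x
mu_s, sigma_s = 0.0, 0.5   # smoother, MMSE-like output p_{x*}

def stats_of_interpolant(alpha):
    """Mean and std of x_hat_P when x_hat0 is the scalar Gaussian OT of x*."""
    scale = sigma_x / sigma_s               # slope of the scalar OT map
    slope = (1.0 - alpha) * scale + alpha   # x_hat_P is affine in x*
    mean = (1.0 - alpha) * mu_x + alpha * mu_s
    return mean, slope * sigma_s
```

With these values, `w2_gauss_1d(mu_x, sigma_x, *stats_of_interpolant(alpha))` equals `alpha` times the distance between $p_{\mathbf{x}}$ and $p_{\mathbf{x}^{*}}$, so `alpha = 0` gives perfect perceptual quality in this toy model and `alpha = 1` recovers the MMSE-like output.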
Figure 3: Trading perception and distortion using out-of-the-box predictors, wrapped with our method. Using eq. 4 with $\alpha\in[0,1]$ we interpolate a given predictor (orange) and our improved $\mathbf{D_{max}}$ estimation (green), to approximate the PD FID-MSE function (blue curve). With $\alpha\in[-1,0]\cup[1,2]$ we extrapolate outside of the PD curve (light gray), beyond the theory-inspired area, to further improve performance.
4 Method
We start by describing the general flow of our proposed algorithm, and then move to elaborate on each of its components.
Theoretically speaking, our algorithm, combined with any given MMSE estimator, is an approximation of the $\mathbf{D_{max}}$ estimator. In practice, however, it can be combined with any type of estimator and potentially improve its perceptual quality. That is, it can be combined with an estimator that optimizes distortion (e.g., SwinIR[11]) and improve its perceptual quality at the expense of distortion, or it can be combined with an estimator that optimizes perceptual quality (e.g., DDRM[3]) and improve its perceptual quality even further. Because it operates only on the outputs of the estimator, our algorithm is agnostic to the type of degradation. To clarify, our algorithm is not a restoration algorithm by itself, but rather a wrapper that can potentially improve the perceptual quality of any given estimator.
Figure 4: With a pre-trained VAE, we estimate the first- and second-order statistics of the latent patches of natural images and the restorations of some given estimator. At inference time, we use the closed-form OT operator between MVG distributions (eq. 2) to transport the latent representation of a given restored sample, which, after decoding, increases the visual quality of the restored sample. For a fully detailed explanation of the algorithm, see section 4.
4.1 The algorithm
The main goal of our algorithm is to approximate the optimal transport plan between $p_{\hat{\mathbf{x}}}$ and $p_{\mathbf{x}}$, namely, $\text{T}_{p_{\hat{\mathbf{x}}}\longrightarrow p_{\mathbf{x}}}$, where $\hat{\mathbf{x}}$ is a given estimator. Theoretically, with such an operator, one could optimally transport $\hat{\mathbf{x}}$ such that, with minimal loss in MSE performance, the transported estimator would attain perfect perceptual quality. Computing $\text{T}_{p_{\hat{\mathbf{x}}}\longrightarrow p_{\mathbf{x}}}$ for high-dimensional distributions is a difficult task, involving complex (often adversarial) optimization (see discussion in section 2). To address this, we make several assumptions and approximations that allow us to efficiently compute a closed-form transport operator that approximates $\text{T}_{p_{\hat{\mathbf{x}}}\longrightarrow p_{\mathbf{x}}}$. The flow of the algorithm is presented in fig. 4, and goes as follows:
Encoding: In the training stage we encode $N$ natural images $\{x^{(i)}\}_{i=1}^{N}$ and $N$ restored samples $\{\hat{x}^{(i)}\}_{i=1}^{N}$ (unpaired) into their latent representations, $\{x_{e}^{(i)}\}_{i=1}^{N}$ and $\{\hat{x}_{e}^{(i)}\}_{i=1}^{N}$, respectively. The size of each image sample is denoted by $(3,H,W)$, where $H$ and $W$ are the height and width, respectively, and the size of their latent representation is denoted by $(c,H_{e},W_{e})$. In the inference stage we perform the same process but only on a single estimate $\hat{x}$, resulting again in a latent representation of size $(c,H_{e},W_{e})$.
Unfold: From each latent representation we extract all the overlapping patches of size $(c,p,p)$, where $p$ is the height and width of each patch.
Flatten each patch and aggregate: We flatten all the extracted patches to obtain 1-dimensional vectors of size $cp^{2}$, which we assume to come from an MVG distribution. In the training stage we compute their empirical mean and covariance matrix (aggregating over the $N$ dimension), and compute the optimal transport in closed-form $\text{T}_{p_{\hat{\mathbf{x}}_{e}}\longrightarrow p_{\mathbf{x}_{e}}}^{\text{MVG}}$ using eq. 2.
Matmul and unflatten: We apply the pre-computed transport operator using a simple matrix-vector multiplication on the flattened version of the patches extracted from the latent representation $\hat{x}_{e}$. We then reshape each vector to the original patch size $(c,p,p)$ (unflatten).
Fold: We rearrange the transported patches back to the original size of the latent representation (reversing the unfold operation). Since the patches overlap, we simply average the shared pixels.
Decoding: To produce our final enhanced estimation $\hat{x}_{0}$, we decode the transported latent image back to the pixel space using the decoder of the VAE.
Together, the inference steps form an end-to-end approximation of the desired transport operator $\text{T}_{p_{\hat{\mathbf{x}}}\longrightarrow p_{\mathbf{x}}}$. In appendix B we elaborate on the choices and practical considerations of our algorithm.
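The latent-space part of the steps above (unfold, flatten, matmul, unflatten, fold) can be sketched with plain NumPy as follows. This is a rough illustration under several assumptions of ours rather than the authors' implementation: patches are taken with stride 1, statistics are pooled over all patch positions and all images, the VAE encoder and decoder are left out, and the affine operator `(A, b)` is assumed to have been computed beforehand from eq. 2. All function names are hypothetical.

```python
import numpy as np

def unfold(latent, p):
    """Extract all overlapping (c, p, p) patches (stride 1) as flat rows."""
    c, h, w = latent.shape
    rows = [latent[:, i:i + p, j:j + p].reshape(-1)
            for i in range(h - p + 1) for j in range(w - p + 1)]
    return np.stack(rows)  # shape: (num_patches, c * p * p)

def fold(patches, shape, p):
    """Put patches back in place, averaging pixels shared by several patches."""
    c, h, w = shape
    total = np.zeros(shape)
    count = np.zeros(shape)
    k = 0
    for i in range(h - p + 1):          # same patch order as in unfold
        for j in range(w - p + 1):
            total[:, i:i + p, j:j + p] += patches[k].reshape(c, p, p)
            count[:, i:i + p, j:j + p] += 1.0
            k += 1
    return total / count

def patch_statistics(latents, p):
    """Training stage: empirical mean and covariance of flattened patches."""
    flat = np.concatenate([unfold(z, p) for z in latents])
    return flat.mean(axis=0), np.cov(flat, rowvar=False)

def transport_latent(latent, A, b, p):
    """Inference stage: apply x -> A @ x + b to every flattened patch."""
    moved = unfold(latent, p) @ A.T + b
    return fold(moved, latent.shape, p)
```

In use, one would call `patch_statistics` once on the encoded natural images and once on the encoded restorations, build `(A, b)` from the two sets of statistics via eq. 2, and then pass each newly encoded restoration through `transport_latent` before decoding. With the identity operator the latent is returned unchanged, which is a convenient sanity check of the unfold/fold pair.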
5 Experiments
In all of our experiments we use the encoder and decoder of the VAE[27] from stable-diffusion[7].
The pre-trained models we evaluate: We apply our latent transport method (described in section 4) to SwinIR[11], Restormer[12] and Swin2SR[13], all of which were trained to minimize average pixel distortion using a supervised regression loss on paired image samples. Additionally, we apply our algorithm to models that are trained to achieve high perceptual quality, and show that we can improve their visual quality even further. To this end, we test two benchmark models in high perceptual quality image restoration: ESRGAN[16], a GAN-based method, and DDRM[3], a diffusion-based method. Beyond our original goal of improving perceptual quality, we demonstrate that we can also traverse the $\mathcal{W}_{2}$-MSE tradeoff using any restoration model (e.g., SwinIR, ESRGAN). To do so, we pick any of the aforementioned algorithms and apply our method to improve its perceptual quality, leading to a new estimator. We then interpolate between the original algorithm and its improved version using eq. 4, adjusting $\alpha$ to traverse the tradeoff. To clarify, we plug the original algorithm in as $\mathbf{x}^{*}$ in eq. 4 (instead of the theoretical MMSE estimator), and plug our improved version in as $\hat{\mathbf{x}}_{0}$ (instead of the theoretical $\mathbf{D_{max}}$ estimator).
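The interpolation described above can be written in a few lines. Eq. 4 is not reproduced in this section, so the sketch assumes it is the linear combination $\hat{\mathbf{x}}_{\alpha}=(1-\alpha)\,\hat{\mathbf{x}}_{0}+\alpha\,\mathbf{x}^{*}$, which matches the notation used here ($\alpha=0$ gives the transported output and $\alpha=1$ gives the original restoration). The function name `interpolate` is hypothetical.

```python
import numpy as np


def interpolate(x_star, x_hat0, alpha):
    """Blend the original restoration x_star with its transported version x_hat0.

    Assumed form of eq. 4: alpha = 0 returns x_hat0, alpha = 1 returns x_star.
    Values of alpha outside [0, 1] extrapolate along the same line.
    """
    x_star = np.asarray(x_star, dtype=float)
    x_hat0 = np.asarray(x_hat0, dtype=float)
    return (1.0 - alpha) * x_hat0 + alpha * x_star
```

Both inputs are pixel-space images of the same shape, so the blend requires no further pass through the restoration model or the VAE.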
Restoration tasks: We showcase our algorithm on Single-Image Super-Resolution (SISR), denoising of Additive White Gaussian Noise (AWGN), JPEG[28] decompression, Noisy Super-Resolution (NSR) and Compressed Super-Resolution (CSR). Training and inference of our algorithm are performed on each restoration model separately, and the evaluation is performed on the restoration task that corresponds to the given model.
Transport operator computation: The transport operator is computed using two disjoint sets of 10 randomly picked images from the ImageNet[29] dataset train split. The first set is used to approximate the predictor’s latent statistics ($\mu_{\hat{\mathbf{x}}_{e}}$, $\Sigma_{\hat{\mathbf{x}}_{e}}$): we degrade each image according to the restoration task the predictor is intended to solve, compute the 10 restored outputs, embed the results and compute the embeddings’ statistics. The second set is embedded into the latent representation without further modification and serves to approximate the natural image latent statistics ($\mu_{\mathbf{x}_{e}}$, $\Sigma_{\mathbf{x}_{e}}$).
Metrics: In addition to Peak Signal-to-Noise Ratio (PSNR), we evaluate distortion performance with the Structural Similarity Index Measure (SSIM)[30] and Learned Perceptual Image Patch Similarity (LPIPS)[31], both of which are better suited to comparing natural images[31].
Nonetheless, LPIPS remains by definition a distortion (full-reference) measure: it is non-negative and zero when the two images are identical[2]. Interestingly, the original perception-distortion paper[2] already classified the VGG loss[32], the ancestor of LPIPS, as a distortion measure for which the tradeoff exists (but is less severe).
Therefore, to evaluate perceptual quality, we use the Inception Score (IS)[33], the Fréchet Inception Distance (FID)[25] and the Kernel Inception Distance (KID)[34] following popular image restoration papers[7, 35, 3].
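For context, FID compares two sets of Inception features by fitting a Gaussian to each set and computing the Fréchet (squared Wasserstein-2) distance between the two Gaussians. The sketch below shows only that distance computation on arbitrary feature arrays. The Inception network is omitted, and the function names are illustrative rather than taken from any FID library.

```python
import numpy as np


def gaussian_stats(features):
    """Empirical mean and covariance of an (n_samples, dim) feature array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared Wasserstein-2 distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^{1/2}) is the sum of the square roots of the eigenvalues
    # of cov1 @ cov2, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)
```

The distance is zero when the two Gaussians coincide, which is why a lower FID indicates feature statistics closer to those of natural images.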
Data sets: It is impractical to perform a serious quantitative evaluation of the perception-distortion tradeoff on real-world datasets (e.g., SIDD, DND, RealSR), which have too few samples to compute FID. Hence, for all models except DDRM[3] and Swin2SR[13], we report the performance on the 50,000 validation samples of ImageNet[29] following[7, 35]. Because of its computational complexity, DDRM[3] reported its performance on a subset of 1,000 ImageNet[29] validation samples. For Swin2SR[13], we use the official DIV2K[36] restored samples provided by the authors. Although our algorithm can be applied to images with arbitrary aspect ratios, all the tested models were trained on square images. Thus, we resize the samples to $512\times 512$ pixels following the pre-processing procedure of DDRM[3]. Finally, we conduct the qualitative evaluation on popular samples from DIV2K or Set14.
5.1 Quantitative results
As reported in table 1, our algorithm can trade perceptual quality for distortion (and vice versa) at test time. We sometimes even manage to improve the pre-trained predictor’s PSNR, even for regression models like SwinIR and Swin2SR. When using $\alpha\leq 0$, we systematically improve the predictor’s perceptual performance (FID, KID, IS), even for estimators which were designed to achieve photo-realism in the first place, e.g., ESRGAN and DDRM. On Non-Local-Means (NLM)[15], an older, non-deep-learning denoising algorithm, our method marginally improves all metrics (with $\alpha=0.8$ in table 1).
While, in theory, our procedure to traverse the perception-distortion tradeoff should only include values of $\alpha$ in the range $[0,1]$ (see eq. 4), we also tried values outside of this range. As shown in fig. 3, with $\alpha\in[-1,0]\cup[1,2]$ we can obtain even better PD curves, and sometimes improve the perceptual quality and/or the distortion of the methods even further. For instance, the PD curve of Swin2SR obtained using $\alpha\in[1,2]$ is strictly better than the one obtained using $\alpha\in[0,1]$. This deviation from the theory can be explained by the implementation choices discussed in section 4: we perform the transport in the latent space, not the pixel space. Additionally, we use FID as the perceptual index to measure visual quality, whereas the theory presented in section 3.2 only concerns the Wasserstein-2 distance. In practice, the sharpened details that appear in $\hat{\mathbf{x}}_{0}$ and not in $\mathbf{x}^{*}$ are either amplified when $\alpha\in[-1,0]$ or subtracted from $\mathbf{x}^{*}$ (instead of being added to it) when $\alpha\in[1,2]$. We leave the formal analysis of this interesting phenomenon for future research.
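To make the amplification and subtraction explicit, suppose eq. 4 has the linear form suggested by the notation, with $\alpha=0$ giving $\hat{\mathbf{x}}_{0}$ and $\alpha=1$ giving $\mathbf{x}^{*}$ (an assumption, since eq. 4 is not reproduced in this section). Rearranging then isolates the detail term $\hat{\mathbf{x}}_{0}-\mathbf{x}^{*}$:

```latex
\hat{\mathbf{x}}_{\alpha}
  = (1-\alpha)\,\hat{\mathbf{x}}_{0} + \alpha\,\mathbf{x}^{*}
  = \mathbf{x}^{*} + (1-\alpha)\,\bigl(\hat{\mathbf{x}}_{0} - \mathbf{x}^{*}\bigr).
```

Under this form, $\alpha\in[-1,0]$ gives a coefficient $1-\alpha\in[1,2]$, so the details are added with a gain of at least one. For $\alpha\in[1,2]$ the coefficient lies in $[-1,0]$, so the same details are subtracted from $\mathbf{x}^{*}$.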
Table 1: Using eq. 4, our algorithm can trade off perception and distortion at inference time on any predictor[16, 11, 12, 13, 3, 15] and image restoration task. For each task, we report the performance of $\hat{\mathbf{x}}_{0}$, which consistently improves perceptual metrics on all tasks and models (aside from NLM). We also report other interesting choices of $\alpha$ that optimize perception and distortion (for more details about this choice, refer to section 5.1).
Distortion metrics: PSNR, SSIM, LPIPS. Perception metrics: FID, IS, KID.

| Task | Signal | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | IS ↑ | KID $\times 10^{3}$ ↓ |
|---|---|---|---|---|---|---|---|
| | $\mathbf{x}$ | $\infty$ | 1 | 0 | 0 | 240.53 ± 4.42 | 0 |
| | $\textbf{D}(\textbf{E}(\mathbf{x}))$ | 27.10 | 0.81 | 0.13 | 0.24 | 234.71 ± 4.04 | 0.02 ± 0.07 |
| $\text{SISR}_{\times 4}$ | SwinIR[11] | 28.10 | 0.84 | 0.24 | 2.54 | 201.52 ± 4.85 | 1.24 ± 0.24 |
| | $\hat{\mathbf{x}}_{0.9}$ | 28.15 | 0.84 | 0.24 | 2.80 | 198.69 ± 2.97 | 1.38 ± 0.24 |
| | $\hat{\mathbf{x}}_{-0.2}$ | 25.08 | 0.77 | 0.25 | 1.19 | 216.74 ± 4.26 | 0.38 ± 0.89 |
| | $\hat{\mathbf{x}}_{0}$ | 25.48 | 0.78 | 0.23 | 1.39 | 214.63 ± 5.50 | 0.69 ± 0.23 |
| $\text{JPEG}_{q=10}$ | SwinIR[11] | 29.68 | 0.86 | 0.30 | 8.95 | 161.73 ± 3.36 | 6.52 ± 0.77 |
| | $\hat{\mathbf{x}}_{1.1}$ | 29.58 | 0.86 | 0.30 | 8.36 | 166.50 ± 3.12 | 6.08 ± 0.75 |
| | $\hat{\mathbf{x}}_{-0.2}$ | 23.74 | 0.76 | 0.31 | 7.56 | 166.65 ± 3.58 | 5.68 ± 0.83 |
| | $\hat{\mathbf{x}}_{0}$ | 24.84 | 0.78 | 0.30 | 8.14 | 163.14 ± 3.93 | 6.15 ± 0.77 |
| $\text{AWGN}_{\sigma=50}$ | Restormer[12] | 30.18 | 0.86 | 0.26 | 5.21 | 178.62 ± 2.83 | 3.29 ± 0.56 |
| | $\hat{\mathbf{x}}_{1.1}$ | 30.09 | 0.86 | 0.25 | 4.63 | 183.36 ± 3.20 | 2.61 ± 1.53 |
| | $\hat{\mathbf{x}}_{1.7}$ | 27.26 | 0.82 | 0.25 | 2.73 | 198.93 ± 5.13 | 1.76 ± 1.58 |
| | $\hat{\mathbf{x}}_{0}$ | 25.31 | 0.78 | 0.27 | 4.42 | 182.86 ± 2.21 | 2.93 ± 1.62 |
| $\text{SR}_{\times 4}\text{JPEG}_{q=10}$ | Swin2SR[13] | 19.75 | 0.55 | 0.53 | 205.00 | 5.95 ± 0.49 | 40.68 ± 3.34 |
| | $\hat{\mathbf{x}}_{0.8}$ | 19.81 | 0.55 | 0.53 | 209.82 | 5.91 ± 0.69 | 43.28 ± 3.86 |
| | $\hat{\mathbf{x}}_{1.9}$ | 18.44 | 0.49 | 0.51 | 168.12 | 6.36 ± 0.69 | 19.95 ± 2.84 |
| | $\hat{\mathbf{x}}_{0}$ | 18.45 | 0.48 | 0.51 | 183.80 | 6.55 ± 0.61 | 29.07 ± 3.58 |
| $\text{SISR}_{\times 4}$ | ESRGAN[16] | 26.77 | 0.80 | 0.21 | 1.06 | 221.68 ± 3.06 | 0.43 ± 0.14 |
| | $\hat{\mathbf{x}}_{0.7}$ | 27.00 | 0.81 | 0.21 | 1.51 | 215.87 ± 3.64 | 0.56 ± 0.21 |
| | $\hat{\mathbf{x}}_{-0.2}$ | 24.84 | 0.74 | 0.23 | 0.80 | 221.89 ± 2.53 | 0.30 ± 0.20 |
| | $\hat{\mathbf{x}}_{0}$ | 25.33 | 0.74 | 0.22 | 0.89 | 220.96 ± 3.19 | 0.34 ± 0.18 |
| $\text{SR}_{\times 4}\text{AWGN}_{\sigma=50}$ | DDRM[3] | 26.10 | 0.75 | 0.34 | 36.44 | 43.52 ± 3.33 | 5.09 |
| | $\hat{\mathbf{x}}_{1.2}$ | 25.91 | 0.75 | 0.33 | 33.68 | 44.90 ± 4.06 | 3.88 |
| | $\hat{\mathbf{x}}_{1.7}$ | 24.48 | 0.70 | 0.35 | 29.05 | 47.91 ± 2.69 | 1.47 |
| | $\hat{\mathbf{x}}_{0}$ | 23.19 | 0.69 | 0.35 | 29.71 | 46.36 ± 4.18 | 1.91 |
| $\text{AWGN}_{\sigma=50}$ | NLM[15] | 26.09 | 0.71 | 0.44 | 12.84 | 148.71 ± 3.75 | 8.73 ± 0.91 |
| | $\hat{\mathbf{x}}_{0.8}$ | 26.24 | 0.72 | 0.43 | 12.46 | 148.78 ± 2.49 | 8.60 ± 0.95 |
| | $\hat{\mathbf{x}}_{0}$ | 24.80 | 0.71 | 0.42 | 14.88 | 140.65 ± 2.10 | 10.90 ± 1.15 |
Choosing the right value of $\alpha$: Like any other hyper-parameter, $\alpha$ can improve the performance with some tuning when approaching a new task or a new data set (refer to table 1). We argue that the few-shot nature of our algorithm makes this tuning practical: $\alpha$ does not need to be set before some expensive training is performed. Once $\hat{\mathbf{x}}_{\alpha=0}$ is computed, any $\hat{\mathbf{x}}_{\alpha}$ can be obtained using eq. 4 without additional cost. In any case, as reported in table 1, $\alpha=0$ consistently improves perceptual quality for all the tasks and models considered (as expected from the theory). We consider it to be a satisfying default choice, so manually adjusting $\alpha$ is not a great concern.
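The low cost of tuning $\alpha$ can be illustrated with a small sweep: given one original restoration and its transported version, each candidate $\alpha$ costs only a weighted sum and a metric evaluation. The sketch below scores candidates with PSNR against a reference image, as one might do on a held-out validation pair. It assumes the linear form of eq. 4 with $\alpha=0$ giving the transported output, and images scaled to $[0,1]$. The names `psnr` and `sweep_alpha` are hypothetical.

```python
import numpy as np


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((reference - estimate) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))


def sweep_alpha(x_ref, x_star, x_hat0, alphas):
    """Return {alpha: PSNR} for the interpolated estimates (assumed form of eq. 4)."""
    return {a: psnr(x_ref, (1.0 - a) * x_hat0 + a * x_star) for a in alphas}
```

A perceptual metric such as FID could be swept in the same way, although it has to be computed over a set of images rather than a single pair.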
5.2 Qualitative results
Qualitative results on arbitrary image sizes and aspect ratios are shown in fig. 5. We observe a consistent improvement in photo-realism when transporting the outputs of existing restoration algorithms with our method. Hence, the qualitative results align with the quantitative perceptual performance gains.
Figure 5: Our method (third column from the left) notably improves the results of several benchmark predictors (second column from the left) on various degradations.
5.3 Training details & ablation study
All the results presented in figs. 3, 1 and 5 were obtained using the same hyper-parameters. We used the “f8-ft-MSE” fine-tuned version of stable-diffusion’s VAE from Hugging Face’s diffusers library[37]. For the training stage we use 20 randomly-drawn images from the ImageNet training set (10 images which we use as the natural images set, and 10 images which we degrade and then restore with the estimator). We used a patch size of $p=3$ in the latent space.
Thanks to its simplicity, for each restoration task, our few-shot algorithm requires just a single GPU, and a few seconds for both training and inference.
We now detail several practical considerations of our algorithm, which we empirically evaluate on the popular $\text{SISR}_{\times 4}$ task for the ESRGAN estimator.
Patch size: We experiment with increasing patch sizes when unfolding the latent image (see section B.2). $p\in\{3,5\}$ yielded the best PSNR and FID. A smaller patch size ($p=1$) resulted in worse FID, and larger sizes ($7\leq p\leq 15$) yielded slightly worse PSNR.
Training size: As discussed in section B.4, each image contributes thousands of samples to the computation of the OT operator. Still, we expect the empirical statistics estimation to benefit from a larger sample size $S$. To confirm this, we repeated the visual enhancement experiments while varying the number of training samples. Surprisingly, we observe no change in the performance of the evaluated metrics for $S\in\{10^{5},10^{4},10^{3},10^{2}\}$, i.e., approximating the distribution statistics with 100 samples is as good as using 100,000 samples, and this is true regardless of the chosen patch size. Moreover, with $S=10$, we observe no performance drop relative to larger sample sizes only for small patch sizes, $p\leq 5$. This suggests that our method can be successfully deployed in few-shot settings, where the number of available samples is small.
Paired vs. unpaired samples: Surprisingly, using paired images to compute the distribution parameters yielded better PSNR but worse FID. We suspect that using paired samples induces a bias which results in worse covariance estimation.
Transporting the degraded measurement directly: Applying our algorithm to the degraded input directly led to unsatisfactory results, as shown in section B.7.
Re-applying the algorithm on $\hat{\mathbf{x}}_{0}$: We tested this idea on super-resolution when conducting our evaluations. The performance does not improve (it even degrades slightly) when the algorithm is applied a second time. The explanation is simple: after the test images have been transported once using the VAE, their latent distribution aligns with that of the natural images. Hence, transporting a second time does nothing (the transport operator is the identity matrix). We are only left with the reconstruction error introduced by the encoding and decoding of the images, which deteriorates the MSE performance.
Does the selection of the training data have an impact on the restoration performance?: Our experiments showed that the class of images does not have a significant impact on the performance (e.g., one could use images of cars to improve images of dogs). However, the resolution of the images does play a significant role in attaining the best performance. That is, to transport $512\times 512$ images, it is best to use training images of the same resolution. This drawback is somewhat mitigated by the few-shot nature of the algorithm.
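To give a sense of scale for the statement that each image contributes thousands of samples, the following sketch counts the patches extracted from one image and the dimension of each flattened patch. It assumes a latent with 4 channels at 1/8 of the input resolution (as the “f8” name of the VAE suggests), a stride of 1 and no padding. None of these settings is stated in this section, and the function name is hypothetical.

```python
def patch_statistics_size(height, width, p, c=4, factor=8, stride=1):
    """Return (patches per image, flattened patch dimension c*p*p).

    Assumes the latent has c channels and spatial size (height/factor, width/factor),
    and that patches are extracted with the given stride and no padding.
    """
    h, w = height // factor, width // factor
    n_patches = ((h - p) // stride + 1) * ((w - p) // stride + 1)
    return n_patches, c * p * p
```

Under these assumptions, a $512\times 512$ image with $p=3$ yields 3,844 patches of dimension 36, and $p=5$ yields 3,600 patches of dimension 100.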
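The identity claim can be checked directly, assuming eq. 2 is the standard closed-form Gaussian transport map (it is not reproduced in this section). If the source and target statistics coincide, with common mean $\mu$ and covariance $\Sigma$, the map reduces to the identity:

```latex
\text{T}(\mathbf{v})
  = \Sigma^{-1/2}\bigl(\Sigma^{1/2}\,\Sigma\,\Sigma^{1/2}\bigr)^{1/2}\Sigma^{-1/2}(\mathbf{v}-\mu) + \mu
  = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2}(\mathbf{v}-\mu) + \mu
  = \mathbf{v}.
```

In practice the statistics after a first transport match only approximately, so the second operator is close to, rather than exactly, the identity.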
6 Discussion
Figure 6: Our method’s reconstruction capabilities are bounded by those of the VAE. Our algorithm is not able to preserve complex visual structures such as face identity (top row) or text (middle row).
Limitations: The pre-trained VAE used in our experiments exhibits a rate of $R=48$ on $512^{2}$ images, which inevitably translates into sub-optimal distortion performance[38]. Thus, the distortion performance of our estimates is bounded by that of the pre-trained VAE. That is, even encoding and decoding a completely clean and natural image does not result in perfect reconstruction. Most notably, the VAE sometimes fails to reconstruct human faces, as well as text images, and such weaknesses affect our algorithm as well (see fig. 6).
Finally, it has been recently shown that the posterior sampler is the only estimator attaining perfect perceptual quality while producing outputs that are perfectly consistent with the degraded input[39]. As such, $\mathbf{D_{max}}$ cannot be expected to produce consistent restorations.
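One reading consistent with $R=48$ is the ratio between the number of pixel-space values and the number of latent values. This is an assumption, since the rate is not defined in this section. It takes 3 color channels for the input and a 4-channel latent at 1/8 of the spatial resolution:

```latex
R = \frac{3 \cdot 512 \cdot 512}{4 \cdot 64 \cdot 64}
  = \frac{786{,}432}{16{,}384}
  = 48.
```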
Potential impact: Instead of using sophisticated and data-hungry generative models, we show it is possible to obtain photo-realistic results using simple tools like MMSE estimators and VAEs. We hope our few-shot algorithm will inspire other simple and practical image restoration methods.
Potential misuse: Our algorithm aims at improving the perceptual quality of existing algorithms. However, when using a biased training set, this could potentially cause bias in the enhanced restoration as well. This could potentially harm the results of medical image diagnosis, for example.
Acknowledgements
This research was partially supported by the Council For Higher Education - Planning and Budgeting Committee.
References
- [1] D. Freirich, T. Michaeli, and R. Meir, “A theory of the distortion-perception tradeoff in Wasserstein space,” in Advances in Neural Information Processing Systems, 2021.
- [2] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [3] B. Kawar, M. Elad, S. Ermon, and J. Song, “Denoising diffusion restoration models,” in Advances in Neural Information Processing Systems, 2022.
- [4] A. Korotin, V. Egiazarian, A. Asadulaev, A. Safin, and E. Burnaev, “Wasserstein-2 generative networks,” in International Conference on Learning Representations, 2021.
- [5] A. Korotin, D. Selikhanovych, and E. Burnaev, “Neural optimal transport,” in The Eleventh International Conference on Learning Representations, 2023.
- [6] A. Makkuva, A. Taghvaei, S. Oh, and J. Lee, “Optimal transport mapping via input convex neural networks,” in Proceedings of the 37th International Conference on Machine Learning, 2020.
- [7] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [8] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” in Advances in Neural Information Processing Systems, 2017.
- [9] M. Lu, H. Zhao, A. Yao, Y. Chen, F. Xu, and L. Zhang, “A closed-form solution to universal style transfer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
- [10] Y. Mroueh, “Wasserstein style transfer,” 2019.
- [11] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using Swin transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021.
- [12] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [13] M. V. Conde, U.-J. Choi, M. Burchi, and R. Timofte, “Swin2SR: SwinV2 transformer for compressed image super-resolution and restoration,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2022.
- [14] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, 2007.
- [15] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 2005.
- [16] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “ESRGAN: Enhanced super-resolution generative adversarial networks,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
- [17] G. Ohayon, T. Adrai, G. Vaksman, M. Elad, and P. Milanfar, “High perceptual quality image denoising with a posterior sampling CGAN,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021.
- [18] B. Kawar, G. Vaksman, and M. Elad, “Stochastic image denoising by sampling from the posterior distribution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021.
- [19] B. Kawar, G. Vaksman, and M. Elad, “SNIPS: Solving noisy inverse problems stochastically,” in Advances in Neural Information Processing Systems, 2021.
- [20] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, 2006.
- [21] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in Advances in Neural Information Processing Systems, 2013.
- [22] G. Peyré and M. Cuturi, “Computational optimal transport,” Foundations and Trends in Machine Learning, 2019.
- [23] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proceedings of the 34th International Conference on Machine Learning, 2017.
- [24] V. Panaretos and Y. Zemel, An Invitation to Statistics in Wasserstein Space. Creative Media Partners, LLC, 2020.
- [25] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Advances in Neural Information Processing Systems, 2017.
- [26] C. Villani, Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften, Springer Berlin Heidelberg, 2008.
- [27] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
- [28] G.Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, 1992.
- [29] J.Deng, W.Dong, R.Socher, L.-J. Li, K.Li, and L.Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
- [30] Z.Wang, A.Bovik, H.Sheikh, and E.Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, 2004.
- [31] R.Zhang, P.Isola, A.A. Efros, E.Shechtman, and O.Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [32] J.Johnson, A.Alahi, and L.Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision (ECCV), 2016.
- [33] T.Salimans, I.Goodfellow, W.Zaremba, V.Cheung, A.Radford, X.Chen, and X.Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016.
- [34] M.Bińkowski, D.J. Sutherland, M.Arbel, and A.Gretton, “Demystifying MMD GANs,” in International Conference on Learning Representations, 2018.
- [35] C.Saharia, J.Ho, W.Chan, T.Salimans, D.Fleet, and M.Norouzi, “Image super-resolution via iterative refinement,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- [36] E.Agustsson and R.Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
- [37] P.von Platen, S.Patil, A.Lozhkov, P.Cuenca, N.Lambert, K.Rasul, M.Davaadorj, and T.Wolf, “Diffusers: State-of-the-art diffusion models.” https://github.com/huggingface/diffusers, 2022.
- [38] A.Alemi, B.Poole, I.Fischer, J.Dillon, R.A. Saurous, and K.Murphy, “Fixing a broken ELBO,” in Proceedings of the 35th International Conference on Machine Learning, 2018.
- [39] G.Ohayon, T.Adrai, M.Elad, and T.Michaeli, “Reasons for the superiority of stochastic estimators over deterministic ones: Robustness, consistency and perceptual quality,” 2022.
- [40] J.T. Flam, S.Chatterjee, K.Kansanen, and T.Ekman, “On mmse estimation: A linear model under gaussian mixture statistics,” IEEE Transactions on Signal Processing, 2012.
- [41] I.Kligvasser, T.Shaham, Y.Bahat, and T.Michaeli, “Deep self-dissimilarities as powerful visual fingerprints,” Advances in Neural Information Processing Systems, 2021.
- [42] T.Rott Shaham, T.Dekel, and T.Michaeli, “Singan: Learning a generative model from a single natural image,” in Computer Vision (ICCV), IEEE International Conference on, 2019.
Deep Optimal Transport: A Practical Algorithm for Photo-realistic Image Restoration - Supplementary Material
Appendix A Background and extensions
A.1 Numerical Example
Figure 7: 2D Gaussian mixture denoising. Source samples are shown in blue. The MMSE estimator ($\mathbf{x}^{*}$, orange) attains the best MSE but the worst perceptual index $\mathcal{W}_{2}$. The posterior samples ($\mathbf{x}|\mathbf{y}$, purple) attain the best perceptual index but half of the optimal MSE performance. The $\mathbf{D_{max}}$ estimator ($\hat{\mathbf{x}}_{0}$, green) maintains the MSE of $\mathbf{x}^{*}$ while attaining a perceptual quality close to $\mathbf{x}|\mathbf{y}$. The DP curve is obtained by interpolating $\hat{\mathbf{x}}_{0}$ and $\mathbf{x}^{*}$ using eq. 4.
To guide the reader in understanding the MMSE transport paradigm, we showcase our method on a 2-dimensional denoising problem. To avoid an overly trivial uni-modal example, we draw the clean signal from a 4-component Gaussian mixture with non-trivial covariances. We derive linear MMSE and posterior estimators from [40] and then apply the closed-form transport operator introduced in eq. 3.
Note that, to avoid deviating from our actual method, we refrain from using more advanced transport operators that are better suited to multi-modal data. Indeed, those are not practical for real-world image datasets, as they require many more samples than are actually available.
We summarize the experiment results in fig. 7. We observe that we obtain the best perceptual quality by sampling from the posterior distribution. However, we witness a significant decrease in MSE performance, as predicted by [2]. In contrast, the $\mathbf{D_{max}}$ estimator enjoys a good perceptual index while maintaining a close-to-optimal distortion performance.
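The transport step of this toy experiment can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for fig. 7: it uses the standard closed-form optimal transport map between two Gaussians, which we assume matches eq. 3, and the data, the function names (`sqrtm_psd`, `gaussian_transport`, `empirical_stats`) and the crude "shrunk" estimates are hypothetical stand-ins.

```python
import numpy as np


def sqrtm_psd(mat):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_transport(x, mu_src, cov_src, mu_tgt, cov_tgt):
    """Map the rows of x from N(mu_src, cov_src) to N(mu_tgt, cov_tgt)."""
    s = sqrtm_psd(cov_src)
    s_inv = np.linalg.inv(s)  # requires a non-singular source covariance
    a = s_inv @ sqrtm_psd(s @ cov_tgt @ s) @ s_inv  # symmetric linear map
    return (x - mu_src) @ a + mu_tgt


def empirical_stats(x):
    """Empirical mean and (unbiased) covariance of the rows of x."""
    return x.mean(axis=0), np.cov(x, rowvar=False)


rng = np.random.default_rng(0)
# Hypothetical toy data: the "estimates" are over-smoothed versions of the clean points.
mixing = np.array([[2.0, 0.6], [0.0, 0.5]])
clean = rng.normal(size=(5000, 2)) @ mixing + np.array([1.0, -1.0])
estimates = 0.4 * clean + 0.1 * rng.normal(size=clean.shape)

mu_e, cov_e = empirical_stats(estimates)
mu_c, cov_c = empirical_stats(clean)
transported = gaussian_transport(estimates, mu_e, cov_e, mu_c, cov_c)
```

After the map is applied, the empirical mean and covariance of `transported` coincide with those of `clean`, which is the first- and second-order matching that the linear operator is meant to achieve.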
A.2 Stochastic transport operator
Throughout our experiments, we found that increasing the patch size $p$ can cause numerical instabilities. Recall that the linear transport operator presented in eq. 3 uses the inverse square root of the source covariance matrix $\Sigma_{\mathbf{x}_{1}}$. When $p$ is large (typically $p \geq 7$), we obtain ill-conditioned covariance matrices. When the smallest singular value is still positive, we add a small stability constant to the matrix diagonal to ensure that it is strictly positive definite. However, numerical errors sometimes add up to negative eigenvalues, even though we tried to avoid overflow when summing over the images by using 64-bit precision. In this case, we clamp the negative eigenvalues to zero and use the stochastic (one-to-many) transport operator proposed by [1],
$$
\text{T}_{p_{\mathbf{x}_{1}} \longrightarrow p_{\mathbf{x}_{2}}}^{\text{stochastic}}(x_{1}) = \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \left( \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \Sigma_{\mathbf{x}_{1}} \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \right)^{\frac{1}{2}} \Sigma_{\mathbf{x}_{2}}^{-\frac{1}{2}} \Sigma_{\mathbf{x}_{1}}^{\dagger} (x_{1} - \mu_{\mathbf{x}_{1}}) + \mu_{\mathbf{x}_{2}} + w, \tag{5}
$$

where $\Sigma_{\mathbf{x}_{1}}^{\dagger}$ denotes the pseudo-inverse of $\Sigma_{\mathbf{x}_{1}}$ (after negative eigenvalues were clamped) and $w \sim \mathcal{N}\left(0, \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \left( I - \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \text{T}^{*} \Sigma_{\mathbf{x}_{1}}^{\dagger} \text{T}^{*} \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \right)^{\frac{1}{2}} \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \right)$, with $\text{T}^{*} = \Sigma_{\mathbf{x}_{2}}^{-\frac{1}{2}} \left( \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \Sigma_{\mathbf{x}_{1}} \Sigma_{\mathbf{x}_{2}}^{\frac{1}{2}} \right)^{\frac{1}{2}} \Sigma_{\mathbf{x}_{2}}^{-\frac{1}{2}}$.
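As an illustration of eq. 5, the following NumPy sketch builds the linear part and the noise covariance of the stochastic operator for a rank-deficient source covariance. It is a sketch under stated assumptions rather than our implementation: the helper names (`sym_power`, `stochastic_transport`), the relative eigenvalue threshold `rtol` and the random test matrices are all hypothetical, and the target covariance is assumed to be invertible.

```python
import numpy as np


def sym_power(mat, power, rtol=1e-10):
    """Power of a symmetric PSD matrix; near-zero eigenvalues are mapped to zero.

    With power=-1 this is the pseudo-inverse, with power=0.5 the square root.
    """
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # clamp negative eigenvalues to zero
    keep = vals > rtol * vals.max()
    out = np.zeros_like(vals)
    out[keep] = vals[keep] ** power
    return (vecs * out) @ vecs.T


def stochastic_transport(cov_src, cov_tgt):
    """Return the linear part and the noise covariance of the operator in eq. 5.

    cov_src may be singular; cov_tgt is assumed to be invertible.
    """
    s2 = sym_power(cov_tgt, 0.5)
    s2_inv = sym_power(cov_tgt, -0.5)
    src_pinv = sym_power(cov_src, -1.0)
    mid = sym_power(s2 @ cov_src @ s2, 0.5)
    t_star = s2_inv @ mid @ s2_inv
    lin = s2 @ mid @ s2_inv @ src_pinv
    proj = s2 @ t_star @ src_pinv @ t_star @ s2
    # proj is symmetric and idempotent, so (I - proj) equals its own square root.
    cov_w = s2 @ (np.eye(cov_tgt.shape[0]) - proj) @ s2
    return lin, 0.5 * (cov_w + cov_w.T)  # symmetrize against round-off


rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(3, 3)))
cov_src = basis @ np.diag([2.0, 1.0, 0.0]) @ basis.T  # rank-deficient source
g = rng.normal(size=(3, 3))
cov_tgt = g @ g.T + 0.1 * np.eye(3)
mu_src, mu_tgt = np.zeros(3), np.ones(3)

lin, cov_w = stochastic_transport(cov_src, cov_tgt)
x1 = mu_src + rng.normal(size=(1000, 3)) @ sym_power(cov_src, 0.5)
w = rng.normal(size=(1000, 3)) @ sym_power(cov_w, 0.5)
x2 = (x1 - mu_src) @ lin.T + mu_tgt + w  # one stochastic transport per row
```

The noise term compensates for the directions that the singular source covariance cannot reach: the covariance of the linear part plus `cov_w` adds up to `cov_tgt`, and `cov_w` vanishes when the source covariance has full rank.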
Appendix B Practical choices and considerations in our algorithm
B.1 Working in latent space
We adopt the latent transport approach, where the images are embedded into the latent space of a pre-trained auto-encoder. Let $\mathbf{E}(\cdot)$ and $\mathbf{D}(\cdot)$ denote the encoder and decoder, respectively. Even if $\mathbf{D}(\mathbf{E}(t)) = t$, it is likely that $\mathbf{E}(\cdot)$ “deforms” the space, i.e., $\left\lVert \mathbf{E}(s) - \mathbf{E}(t) \right\rVert \neq \left\lVert s - t \right\rVert$, which means that the optimal transport plan in the latent space could differ from the plan we seek in the pixel space (the cost function in eq. 3 has changed). We can address this by modifying the latent cost function to account for the deformation via the following change of variables
$$
\mathbb{E}\left[ \left\lVert \hat{\mathbf{x}} - \mathbf{x} \right\rVert^{2} \right] = \mathbb{E}\left[ \frac{\left\lVert \mathbf{E}(\hat{\mathbf{x}}) - \mathbf{E}(\mathbf{x}) \right\rVert^{2}}{\left| \frac{\partial \mathbf{E}(\mathbf{x})}{\partial \mathbf{x}} \right| \cdot \left| \frac{\partial \mathbf{E}(\hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}} \right|} \right], \tag{6}
$$
where $\left| \frac{\partial \mathbf{E}(\mathbf{x})}{\partial \mathbf{x}} \right|$ is the determinant of the Jacobian matrix of $\mathbf{E}(\cdot)$. However, this is not a practical solution, since we lose access to the closed-form solution of eq. 2. Note that the latent MSE approximation is usually desirable when dealing with natural images (e.g., to devise image quality measures [41] or perceptual quality metrics [25]). This is also true in our case, but it means that we can no longer claim to obtain the $\mathbf{D_{max}}$ estimator.
That said, we argue that switching to a latent cost is actually a strength rather than a weakness of our method. Indeed, the MSE between deep latent variables has been shown to be a better fit for comparing natural images than working directly in the pixel space [31]. The authors of [7] trained their VAE (which is used in our experiments) to remove “imperceptible details” from the latent representation, in order to better focus on higher-level image semantics. In section 5.1 we validate this claim by showing that our algorithm maintains the “perceptual” discrepancy performance of the original estimator (e.g., LPIPS).
B.2 Overlapping patches extraction strategy
For Convolutional Neural Network (CNN) encoders (this methodology can easily be extrapolated to other encoder architectures), let $(c, H_{e}, W_{e})$ denote the shape of the latent representation (CNN encoders produce 3-dimensional encoded tensors), where $H_{e}, W_{e}$ are the spatial extents and $c$ is the number of channels (i.e., the number of convolution kernels in the last convolution layer). The covariance matrices $\Sigma_{\hat{\mathbf{x}}_{e}},\ \Sigma_{\mathbf{x}_{e}}$ contain $\frac{(c \cdot H_{e} \cdot W_{e})^{2}}{2}$ parameters, which may require a large number of samples for large latent images with $H_{e}, W_{e} \gg 1$. To mitigate the quadratic dependency on $H_{e} \cdot W_{e}$, we assume that the latent pixels depend only on the pixels in their close neighborhood. In practice, we unfold the latent representation, extracting all overlapping patches of shape $(c, p, p)$. A similar approximation exists in the style-transfer literature [8, 9], where only individual pixels are considered instead of patches (i.e., a special case of our approach with $p = 1$). In section 5.3 we empirically show that increasing $p$ improves the perceptual quality at the expense of MSE performance, provided that enough training samples are available.
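The unfolding step can be sketched with NumPy's sliding-window view. This is an illustrative sketch only: the function name `unfold_patches`, the reflect padding (chosen so that every latent pixel yields exactly one patch) and the restriction to odd $p$ are assumptions of the example, not details specified above.

```python
import numpy as np


def unfold_patches(latent, p):
    """Return all overlapping (c, p, p) patches of a (c, H, W) latent as rows.

    Reflect padding (an assumption of this sketch) keeps one patch per latent
    pixel, so the output has shape (H * W, c * p * p). p is assumed to be odd.
    """
    c, h, w = latent.shape
    r = p // 2
    padded = np.pad(latent, ((0, 0), (r, r), (r, r)), mode="reflect")
    # windows has shape (c, H, W, p, p): one p x p window per spatial location
    windows = np.lib.stride_tricks.sliding_window_view(padded, (p, p), axis=(1, 2))
    return windows.transpose(1, 2, 0, 3, 4).reshape(h * w, c * p * p)


rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 64, 64))  # hypothetical latent with c = 4
patches = unfold_patches(latent, 3)    # shape (4096, 36)
pixels = unfold_patches(latent, 1)     # p = 1: the per-pixel special case, shape (4096, 4)
```

Each row of `patches` is one flattened $(c, p, p)$ patch, so the covariance to estimate is of size $cp^{2} \times cp^{2}$ regardless of the spatial extent of the latent image.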
B.3 Shared distribution
When dealing with natural image scenes, it is beneficial to assume that overlapping patches share common statistical attributes [41, 42]. In the case of a CNN-encoded image, this approximation remains satisfactory because we ultimately look at filter activations, which are spatially invariant, with each latent patch having the same receptive field. Therefore, we assume that the overlapping patches are all samples from the same distribution. This approach dramatically reduces the number of estimated parameters, and also multiplies the number of samples at our disposal by $H_{e} \cdot W_{e}$, which alleviates the curse of dimensionality. We demonstrate these practical benefits in section 5.3. In practice, given $N$ images, we “flatten” all the extracted patches to vectors $\underline{v}_{cp^{2} \times 1}$, which we stack into a sample matrix $\underline{\underline{X}}_{N H_{e} W_{e} \times cp^{2}}$. We then aggregate the samples to compute the MVG statistics: $\mu = X^{T}\mathbf{1}$, $\Sigma = \frac{N H_{e} W_{e}}{N H_{e} W_{e} - 1}(X - \mu)(X - \mu)^{T}$. As $N H_{e} W_{e}$ may be very large, we perform all computations in double precision. When training, this process is done twice: once for the natural image samples, and once for the restored samples we wish to transport.
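One way to aggregate these statistics without holding all patches in memory is to accumulate first and second moments in double precision, image by image. The sketch below is an assumption about how such an accumulation could look: the class name `PatchStatistics` is hypothetical, and the formulas are the standard sample mean and unbiased sample covariance with explicit normalization, which is how we read the shorthand expressions for $\mu$ and $\Sigma$ here.

```python
import numpy as np


class PatchStatistics:
    """Accumulate the mean and covariance of patch vectors over several images."""

    def __init__(self, dim):
        self.count = 0
        self.total = np.zeros(dim, dtype=np.float64)
        self.outer = np.zeros((dim, dim), dtype=np.float64)

    def update(self, patches):
        """Add a (num_patches, dim) batch, e.g. all patches of one image."""
        patches = np.asarray(patches, dtype=np.float64)  # accumulate in double precision
        self.count += patches.shape[0]
        self.total += patches.sum(axis=0)
        self.outer += patches.T @ patches

    def finalize(self):
        mean = self.total / self.count
        # Unbiased covariance from the accumulated first and second moments.
        cov = (self.outer - self.count * np.outer(mean, mean)) / (self.count - 1)
        return mean, cov


rng = np.random.default_rng(0)
# Hypothetical stand-in for the patches of three images (4096 patches of dimension 36 each).
batches = [rng.normal(size=(4096, 36)).astype(np.float32) for _ in range(3)]

stats = PatchStatistics(dim=36)
for batch in batches:
    stats.update(batch)
mean, cov = stats.finalize()
```

Running one accumulator over the natural images and a second one over the restored images yields the two sets of statistics needed by the transport operator.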
B.4 Size of the latent representation
When increasing the capacity of models with a fixed encoding rate, deepening is preferable to widening. Indeed, increasing $c$ makes the covariance estimation dramatically harder, while increasing $H_{e}, W_{e}$ enlarges the sample pool. Therefore, the VAE from [7], with $c = 4$ and $H_{e}, W_{e} \gg 1$, is a particularly good candidate for our method. For $p = 3$, for instance, the covariance matrix has only $1296$ parameters, while each $512^{2}$ image contributes $4096$ samples to its estimation. As we see next, this greatly reduces the number of training samples needed to estimate the covariance matrices and allows us to compute the transport operator in a few-shot manner.
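These counts can be checked directly. The spatial downsampling factor of 8 used below is an assumption, inferred from the $4096$ samples that each $512^{2}$ image contributes:

$$
cp^{2} = 4 \cdot 3^{2} = 36, \qquad (cp^{2})^{2} = 36^{2} = 1296, \qquad H_{e} W_{e} = \left( \frac{512}{8} \right)^{2} = 64^{2} = 4096 .
$$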
B.5 Transport
In a single pass on a data set of natural images and a (possibly different) data set of restored samples, we compute $\text{T}_{p_{\hat{\mathbf{x}}_{e}} \longrightarrow p_{\mathbf{x}_{e}}}^{\text{MVG}}$ (see eq. 2). Note that each latent distribution could sometimes be degenerate, especially for severe degradations. Fortunately, the classical MVG transport operator can be generalized to ill-posed settings where $\Sigma_{\hat{\mathbf{x}}}$ is a singular matrix (see eq. 5).
B.6 Decoding
Since the transported patches overlap, we “fold” them back into a latent image $\hat{\mathbf{x}}_{0,latent}$ by averaging. The latent image is then decoded back to the pixel space, i.e., $\hat{\mathbf{x}}_{0} = \mathbf{D}(\hat{\mathbf{x}}_{0,latent})$. Since $\mathbf{E}(\cdot)$ is not invertible, the decoder $\mathbf{D}(\cdot)$ is used as a convenient approximation in the training domain of the auto-encoder. A corollary of this approximation is that the auto-encoder should in theory be trained on the image distribution we aim to transport, which weakens our claim to a fully blind algorithm.
All the steps described above are summarized in fig. 4.
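The folding-by-averaging step can be sketched as follows. This is a hypothetical illustration that pairs with a reflect-padded unfolding (repeated here so that the example is self-contained); the function names, the padding convention and the odd patch size are assumptions of the sketch, and the decoder call is omitted.

```python
import numpy as np


def unfold_patches(latent, p):
    """All overlapping (c, p, p) patches of a (c, H, W) latent, as (H * W, c * p * p) rows."""
    c, h, w = latent.shape
    r = p // 2
    padded = np.pad(latent, ((0, 0), (r, r), (r, r)), mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (p, p), axis=(1, 2))
    return windows.transpose(1, 2, 0, 3, 4).reshape(h * w, c * p * p)


def fold_patches(patches, c, h, w, p):
    """Average overlapping patches, given as (h * w, c * p * p) rows, into a (c, h, w) latent."""
    r = p // 2
    canvas = np.zeros((c, h + 2 * r, w + 2 * r), dtype=np.float64)
    counts = np.zeros((h + 2 * r, w + 2 * r), dtype=np.float64)
    windows = patches.reshape(h, w, c, p, p)
    for di in range(p):
        for dj in range(p):
            # Entry (di, dj) of the patch centred at (i, j) lands on padded pixel (i + di, j + dj).
            canvas[:, di:di + h, dj:dj + w] += windows[:, :, :, di, dj].transpose(2, 0, 1)
            counts[di:di + h, dj:dj + w] += 1.0
    averaged = canvas / counts  # every padded pixel is covered by at least one patch
    return averaged[:, r:r + h, r:r + w]  # drop the padded border


rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 16, 16))
patches = unfold_patches(latent, 3)
# In the actual pipeline the patches would be transported before folding.
restored = fold_patches(patches, c=4, h=16, w=16, p=3)
```

Folding untouched patches returns the original latent, which is a convenient sanity check before inserting the transport between the two steps.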
B.7 Transporting the degraded measurement
We also tried applying our algorithm to the degraded measurement directly. We observe, both qualitatively and quantitatively, that transporting the degraded measurement $\mathbf{y}$ amplifies the degradation (refer to fig. 8).
Figure 8: Transporting the degraded measurement ($\text{JPEG}_{q=10}$) directly is not enough to restore the image. It can sometimes even exacerbate the degradation. Quantitatively, the degraded sample $\mathbf{y}$ has better PSNR and FID than its transported version (respectively 27.26 dB and 13.88 FID vs. 23.69 dB and 15.88 FID).

