Title: A Diffusion Model with Gradient Guidance for Infrared Image Super-Resolution
URL Source: https://arxiv.org/html/2503.01187
Published Time: Tue, 04 Mar 2025 02:48:54 GMT
Xingyuan Li 1, Zirui Wang 1, Yang Zou 2, Zhixin Chen 3,
Jun Ma 1, Zhiying Jiang 4, Long Ma 1, Jinyuan Liu 1†
1 Dalian University of Technology 2 Northwestern Polytechnical University
3 Waseda University 4 Dalian Maritime University
xingyuan_lxy@163.com ziruiwang0625@gmail.com
Abstract
∗ Equal contribution. † Corresponding author.
Infrared imaging is essential for autonomous driving and robotic operations as a supportive modality due to its reliable performance in challenging environments. Despite its popularity, the limitations of infrared cameras, such as low spatial resolution and complex degradations, consistently challenge imaging quality and subsequent visual tasks. Hence, infrared image super-resolution (IISR) has been developed to address this challenge. While recent developments in diffusion models have greatly advanced this field, current methods either ignore the unique modal characteristics of infrared imaging or overlook machine perception requirements. To bridge these gaps, we propose DifIISR, an infrared image super-resolution diffusion model optimized for visual quality and perceptual performance. Our approach achieves task-based guidance for diffusion by injecting gradients derived from visual and perceptual priors into the noise during the reverse process. Specifically, we introduce an infrared thermal spectrum distribution regulation to preserve visual fidelity, ensuring that the reconstructed infrared images closely align with high-resolution images by matching their frequency components. Subsequently, we incorporate various visual foundational models as the perceptual guidance for downstream visual tasks, infusing generalizable perceptual features beneficial for detection and segmentation. As a result, our approach achieves superior visual results while attaining state-of-the-art downstream task performance. Code is available at https://github.com/zirui0625/DifIISR.
Figure 1: The left side shows a comparison between existing super-resolution methods and our proposed DifIISR. Our method introduces additional visual guidance based on the Fourier Transform, as well as foundational model-based perception guidance. This allows our approach to achieve optimal performance in both visual and perceptual space. The right side demonstrates that our method outperforms other methods in both detection and segmentation tasks.
1 Introduction
The objective of infrared image super-resolution (IISR) is to reconstruct a high-resolution (HR) infrared image from its low-resolution (LR) counterpart[48]. The consistent performance of infrared imaging under challenging conditions allows its application to span various fields[28, 41, 42], such as object detection[37, 38, 25], semantic segmentation[21, 27], and autonomous driving[39, 2]. Despite its great potential, the inherent limitations of infrared cameras — such as high noise levels, reduced spatial resolution, and limited dynamic range — continually affect the quality of infrared images.
Conventionally, CNN-based methods[59, 60, 58, 30] address this challenge by mapping interpolated LR images to HR images and then enhancing the details (e.g., SRCNN[32]). Although CNN-based super-resolution methods have significantly advanced this field, they are limited by the local receptive field of convolution operations. To overcome this, Transformer-based methods[4, 3, 57, 53] model long-range dependencies to capture global context. Liang et al.[23] proposed SwinIR, which significantly improves super-resolution performance by integrating CNNs with the Swin Transformer. More recently, Li et al.[22] proposed CoRPLE, which leverages a Contourlet residual framework to restore infrared-specific high-frequency features.
Recently, the diffusion model has introduced a novel paradigm for image super-resolution tasks, offering a fresh approach that goes beyond the CNN- and Transformer-based methods[11], leveraging its capacity to learn implicit priors of the underlying data distribution[36]. Yue et al. proposed ResShift[52], which applies an iterative sampling procedure to shift the residual between the LR and the desired HR image during inference. Unlike other diffusion models, Wang et al.[46] accelerate the diffusion-based SR model to a single inference step while maintaining satisfactory performance. These methods generally achieve visually pleasing results when applied to visible images.
However, existing methods often fail to extend effectively to infrared imaging, particularly in downstream tasks such as infrared image object detection and semantic segmentation. A common approach for task-oriented infrared image super-resolution is to adapt an RGB super-resolution model to infrared data, and then connect it to a downstream detection or segmentation module. Unfortunately, this approach faces two significant challenges: 1) Ignoring the unique modal characteristics of infrared imaging, which include distinct thermal spectrum distributions. Infrared image reconstruction quality is particularly sensitive to high-frequency components due to longer wavelengths and reduced atmospheric scattering effects. 2) Overlooking the machine perception requirements. While the model may reconstruct visually appealing images, these results are often sub-optimal for specific perceptual tasks. The objectives of visual domain optimization and perceptual domain optimization can differ significantly[25]. For instance, diffusion-based super-resolution models typically focus on “seeking visually appealing” results, often at the expense of structural information of targets and textural details critical for machine vision. Given these limitations, we ask, “Why not develop a super-resolution model that reconstructs infrared images to be both visually appealing and perceptually salient?”
To this end, as shown in Figure 1, we propose a task-oriented infrared image super-resolution method that optimizes the diffusion process through gradient-based guidance. Specifically, we inject the gradient of a designed prior loss into the noise estimation at each training step, refining the model’s performance across iterations. Our guidance consists of two components. First, to ensure visual consistency, we introduce visual guidance via infrared thermal spectral distribution modulation, which ensures the reconstructed images align with high-resolution counterparts by preserving their spectral characteristics. Second, we integrate perceptual guidance by leveraging powerful pre-trained vision models, such as VGG[34] and SAM[19], to infuse the diffusion process with generalized perceptual features. Extensive experiments demonstrate that our proposed method excels in both visual quality and downstream task performance. Our contributions can be summarized as follows:
- •We propose a solution for infrared image super-resolution by integrating gradient-based priors into the noise during diffusion, enabling task-based guidance in sampling, and achieving simultaneous optimization in both visual and perceptual-specific domains.
- •We introduce a thermal spectrum distribution regulation to preserve the visual fidelity of infrared images, guiding the diffusion process to learn the unique infrared image frequency distribution.
- •We propose perceptual guidance for the diffusion process, incorporating generalizable perceptual features from foundational models for visual tasks. This notably enhances performance in detection and segmentation.
2 Related work
2.1 Image Super-Resolution
Since the pioneering work of SRCNN[13] was proposed, deep learning has gradually become the mainstream approach for image super-resolution (SR). The initial works[13, 18, 60, 20] mainly focused on utilizing convolutional neural networks (CNNs)[12] for image super-resolution tasks and optimizing the network by minimizing the mean square error (MSE) between the super-resolved image (SR) and their corresponding high-resolution (HR) counterparts. Subsequently, GAN-based super-resolution methods were proposed, drawing significant attention. For example, both BSRGAN[54] and Real-ESRGAN[45] employ GANs for super-resolution tasks and introduce training samples with more realistic types of degradations to achieve better results. While these methods improve the quality of the low-resolution images, they often fail to produce stable outcomes, resulting in artifacts in the images. LDL[24] and DeSRA[50] attempt to address this issue, but they still struggle to generate images with natural details. Recently, diffusion models have been widely applied to image super-resolution tasks, such as ResShift[52] and SinSR[46]. However, these methods are not designed specifically for the characteristics of infrared images and overlook the requirements of machine perception[61], so they do not perform well in infrared image super-resolution (IISR).
2.2 Diffusion Methods
The Denoising Diffusion Probabilistic Model (DDPM)[14] is a generative model known for its stability and controllability, and it has attracted widespread attention since it was proposed. The main focus of the diffusion model is to train a denoising autoencoder, which estimates the reverse of the Markov diffusion process by predicting the noise. Diffusion models were initially applied to image generation tasks and have been continuously improved in recent years[35, 1, 31, 36, 29]. ControlNet[55] introduces control conditions into pre-trained diffusion models, expanding the application scope of diffusion models in image generation. DDIM[35] proposes a non-Markovian generation method, significantly enhancing the inference speed of diffusion models. Diffusion models have demonstrated exceptional capabilities not only in image generation but also in various other tasks, showing great potential. With the introduction of several related methods[52, 46, 33, 8, 9, 16], diffusion models have also been validated to achieve remarkable results in the field of image super-resolution.
Figure 2: Overall architecture of our proposed method: the vanilla super-resolution diffusion process is marked in black, whereas our proposed additional visual and perceptual priors are marked in red.
3 Preliminaries
Diffusion models. We first introduce the background of Denoising Diffusion Probabilistic Models[14]. DDPM draws samples $x_0 \sim p_{\text{data}}(x)$ from the data distribution. In a diffusion model, noise is gradually added to the sampled $x_0$ over time steps up to $T$, eventually yielding $x_T \sim \mathcal{N}(0, \mathbf{I})$, which can be approximated as a standard Gaussian distribution. This is referred to as the forward process of the diffusion model, and it can be represented as:
$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\,x_0,\ (1-\alpha_t)\mathbf{I}\big), \tag{1}$$
where $\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$ and the $\beta_s$ form a fixed or learned variance schedule. After obtaining $x_T$, the denoising model $\epsilon_\phi$ learns to predict the noise $\epsilon$ added during the forward process, thereby removing the noise from $x_T$. Specifically, the denoising model $\epsilon_\phi$ predicts the noise by optimizing the re-weighted evidence lower bound, which can be written as:
$$\mathcal{L}_{\text{simple}}(\phi) = \mathbb{E}_{x_0, t, \epsilon}\left[\big\|\epsilon_\phi(x_t, t) - \epsilon\big\|^2\right]. \tag{2}$$
In this formula, $\epsilon_\phi(x_t, t)$ is the noise predicted by the model, and $t$ is randomly sampled from a predefined range of time steps. During training, the denoising model $\epsilon_\phi$ is optimized by minimizing $\mathcal{L}_{\text{simple}}(\phi)$, ultimately yielding a model capable of accurately predicting the noise.
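The forward process of Eq. (1) and the training objective of Eq. (2) can be sketched in a few lines of NumPy. The linear schedule below (the `beta_start` and `beta_end` values) is a common DDPM choice and an assumption on our part, not a value specified in the paper:

```python
import numpy as np

def make_alphas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative schedule: alpha_t = prod_{s=1..t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alphas, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_t) x_0, (1 - alpha_t) I), Eq. (1)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas[t]) * x0 + np.sqrt(1.0 - alphas[t]) * eps
    return x_t, eps

def simple_loss(eps_pred, eps):
    """L_simple: squared error between predicted and true noise, Eq. (2)."""
    return np.mean((eps_pred - eps) ** 2)
```

At small $t$, $\alpha_t$ is close to 1 and $x_t$ stays near $x_0$; as $t \to T$, $\alpha_t \to 0$ and $x_t$ approaches pure Gaussian noise.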
After training the denoising model $\epsilon_\phi$, we sample $x_T \sim \mathcal{N}(0, \mathbf{I})$ and iteratively refine it using the denoising model. This is known as the reverse process, which can be represented as:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big), \tag{3}$$
where $\mu_\theta(x_t, t)$ is the mean of the transition from step $t$ to $t-1$ and $\Sigma_\theta(x_t, t)$ is its covariance. Because DDPM sampling is slow, DDIM proposes a non-Markovian diffusion process that significantly improves sampling speed. The improved sampling formula can be expressed as:
$$x_{t-1} = \sqrt{\alpha_{t-1}}\,\hat{x}_0(x_t) + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\,\epsilon_\phi(x_t, t) + \sigma_t z, \tag{4}$$
where $\sigma_t$ is the variance of the noise and $z$ follows a standard normal distribution. $\hat{x}_0(x_t)$ is the prediction of $x_0$ from $x_t$, given by:
$$\hat{x}_0(x_t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \sqrt{1 - \alpha_t}\,\epsilon_\phi(x_t, t)\right). \tag{5}$$
When $\sigma_t = 0$, the DDIM sampling process becomes deterministic, which allows samples to be drawn from the noise quickly.
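A minimal sketch of one DDIM reverse step, combining Eqs. (4) and (5); here `eps_pred` stands in for the output of a trained denoiser $\epsilon_\phi(x_t, t)$:

```python
import numpy as np

def predict_x0(x_t, eps_pred, alpha_t):
    """Eq. (5): estimate the clean image x_0 from x_t and the predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, sigma_t=0.0, z=0.0):
    """Eq. (4): one reverse step; sigma_t = 0 gives the deterministic DDIM update."""
    x0_hat = predict_x0(x_t, eps_pred, alpha_t)
    direction = np.sqrt(1.0 - alpha_prev - sigma_t**2) * eps_pred
    return np.sqrt(alpha_prev) * x0_hat + direction + sigma_t * z
```

If the denoiser predicts the added noise exactly, `predict_x0` recovers $x_0$ and the deterministic step re-noises it to the $t-1$ level, which is the consistency property DDIM exploits for fast sampling.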
4 Method
Overview. Our main objective is to address the problem of infrared image super-resolution using a diffusion model enhanced by gradient-based guidance, as shown in Figure 2. Specifically, inspired by[10], we propose a method that fine-tunes the diffusion model by introducing an additional guidance mechanism. Unlike previous approaches where loss constraints are directly added numerically during training, we compute the gradient of the loss and inject it into the noise predicted at each denoising step. This correction optimizes the denoising process iteratively, refining the model’s output at every stage. In addition, we incorporate a dual optimization approach combining visual and perceptual aspects to better adapt the diffusion model to the task of infrared image super-resolution.
4.1 Loss-gradient Guidance
The reverse process of diffusion models often requires multiple constraints to generate stable, high-quality images. Most methods tend to guide the reverse process by adding weighted constraints to the final loss function. In contrast, our approach addresses this issue from the perspective of posterior sampling. Inspired by[10], we introduce additional priors and compute the gradient of the resulting loss function, injecting the gradient into the noise estimated at each step to better handle the problem of adding constraints during the reverse process of diffusion models.
Generally, the noise predicted by the denoising model at timestep $t$ is correlated with the score of the denoising model at the current timestep[36]. Specifically, it can be represented as:
$$\epsilon_\phi(x_t, t) = -\sqrt{1 - \alpha_t}\,\nabla_{x_t} \log p(x_t), \tag{6}$$
where $\nabla_{x_t} \log p(x_t)$ is the gradient of the log probability density $\log p(x_t)$ with respect to $x_t$. However, we now need to consider not only $\nabla_{x_t} \log p(x_t)$ but also the optimization of the guidance $g$ during diffusion-model sampling. In our work, $g$ is the guidance obtained by feeding $x_0$ into $\mathcal{M}$, where $\mathcal{M}$ is a forward operator; that is, $g = \mathcal{M}(x_0)$. The score of the denoising model at timestep $t$ therefore becomes $\nabla_{x_t} \log p(x_t \mid g)$.
$\nabla_{x_t} \log p(x_t \mid g)$ is unknown, so we must derive it from the known $\nabla_{x_t} \log p(x_t)$. According to Bayes’ theorem, we can write:
$$\nabla_{x_t} \log p(x_t \mid g) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(g \mid x_t). \tag{7}$$
Since $\nabla_{x_t} \log p(x_t)$ is known, the problem reduces from computing $\nabla_{x_t} \log p(x_t \mid g)$ to computing $\nabla_{x_t} \log p(g \mid x_t)$. Inspired by[10], we can derive the formula:
$$\nabla_{x_t} \log p(g \mid x_t) \simeq \nabla_{x_t} \log p\big(g \mid \hat{x}_0(x_t)\big) \simeq -\rho\,\nabla_{x_t} \big\|g - \mathcal{M}(\hat{x}_0(x_t))\big\|_2^2, \tag{8}$$
where $\nabla_{x_t} \|g - \mathcal{M}(\hat{x}_0(x_t))\|_2^2$ can also be written as $\nabla \mathcal{L}_g$. Therefore, we can express the noise prediction adjusted according to condition $g$ as:
$$\epsilon'_\phi = \epsilon_\phi(x_t, t) + \rho \sqrt{1 - \alpha_t}\,\nabla_{x_t} \big\|g - \mathcal{M}(\hat{x}_0(x_t))\big\|_2^2 = \epsilon_\phi(x_t, t) + \rho \sqrt{1 - \alpha_t}\,\nabla \mathcal{L}_g, \tag{9}$$
where $\epsilon'_\phi$ is the adjusted noise, obtained by adding the gradient of the guidance loss $\nabla \mathcal{L}_g$ to the noise predicted by the original denoising model.
Thus, by applying gradient guidance to the noise predicted by the diffusion model, we impose constraints on the reverse process of the diffusion model. More detailed derivations and pseudocode of our method can be found in the supplementary materials.
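As a rough sketch of Eq. (9), assuming a generic forward operator `M` and a denoiser output `eps_pred` (both placeholders, not the paper's actual modules); the finite-difference gradient below is purely illustrative, since a real implementation would obtain $\nabla \mathcal{L}_g$ by automatic differentiation:

```python
import numpy as np

def guided_noise(eps_pred, grad_Lg, alpha_t, rho):
    """Eq. (9): eps' = eps_phi + rho * sqrt(1 - alpha_t) * grad_{x_t} L_g."""
    return eps_pred + rho * np.sqrt(1.0 - alpha_t) * grad_Lg

def guidance_gradient(x_t, eps_pred, alpha_t, g, M):
    """grad_{x_t} ||g - M(x0_hat(x_t))||_2^2 via central finite differences.

    Illustrative only: autograd would backpropagate through M and the
    x0_hat prediction of Eq. (5) instead of looping over pixels.
    """
    def loss(x):
        x0_hat = (x - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
        return np.sum((g - M(x0_hat)) ** 2)

    h = 1e-5
    grad = np.zeros_like(x_t)
    for idx in np.ndindex(x_t.shape):
        xp, xm = x_t.copy(), x_t.copy()
        xp[idx] += h
        xm[idx] -= h
        grad[idx] = (loss(xp) - loss(xm)) / (2.0 * h)
    return grad
```

When the guidance target $g$ already matches $\mathcal{M}(\hat{x}_0)$, the gradient vanishes and the adjusted noise reduces to the unguided prediction, so the guidance only perturbs sampling where the prior is violated.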
4.2 Visual-perceptual Dual Optimization
We now describe the composition of our guidance $\mathcal{L}_g$ in detail. Specifically, $\mathcal{L}_g$ can be divided into two parts: the visual loss $\mathcal{L}_{\text{visual}}$ and the perceptual loss $\mathcal{L}_{\text{perceptual}}$.
$\mathcal{V}$ - visual guidance. To guide the diffusion process toward reconstructing infrared images with infrared-specific visual characteristics, we propose $\mathcal{L}_{\text{visual}}$, which regularizes the distribution of high- and low-frequency information as the visual guidance $\mathcal{V}$. Here, $\mathcal{V}$ replaces the forward operator $\mathcal{M}$ in Eq. (8). Given the HR image $\mathbf{I}_{HR}$ and the super-resolved image $\mathbf{I}_{SR}$, we first use the Fast Fourier Transform (FFT) to map their spatial-domain representations into the frequency domain, formally:
$$\hat{\mathbf{I}}_{HR} = \mathcal{F}(\mathbf{I}_{HR}), \quad \hat{\mathbf{I}}_{SR} = \mathcal{F}(\mathbf{I}_{SR}), \qquad \mathcal{F}(u, v) = \sum_{x=0}^{H-1} \sum_{y=0}^{W-1} I(x, y)\, e^{-i\frac{2\pi}{H}ux}\, e^{-i\frac{2\pi}{W}vy}. \tag{10}$$
where $\mathcal{F}(u,v)$ denotes the FFT of the image at frequency coordinates $(u,v)$, and $\hat{\mathbf{I}}_{HR}$ is the transformed HR image. In the frequency domain, we first shift the zero-frequency component, which represents the mean intensity of the image, to the center of the spectrum for both the HR and SR images, yielding $\hat{\mathbf{I}}_{HR}^{\text{shift}}$ and $\hat{\mathbf{I}}_{SR}^{\text{shift}}$. We then compute the magnitude spectra $\mathbf{M}_{HR}$ and $\mathbf{M}_{SR}$ by applying logarithmic compression to the Fourier-transformed images; this step balances the contributions of high- and low-frequency components during comparison. To make the loss match frequency distribution patterns rather than absolute intensity differences, we normalize the magnitude spectra to zero mean and unit variance, obtaining the normalized spectra $\mathbf{M}_{HR}^{\text{norm}}$ and $\mathbf{M}_{SR}^{\text{norm}}$.
Finally, the Visual Loss $\mathcal{L}_{\text{visual}}$ is computed as the mean squared error between the normalized magnitude spectra of the HR and SR images:
$$\mathcal{L}_{\text{visual}}=\biggl(\overbrace{N\bigl(\log(1+|\hat{\mathbf{I}}_{HR}^{\text{shift}}|)\bigr)}^{\mathbf{M}_{HR}^{\text{norm}}}-\overbrace{N\bigl(\log(1+|\hat{\mathbf{I}}_{SR}^{\text{shift}}|)\bigr)}^{\mathbf{M}_{SR}^{\text{norm}}}\biggr)^{2}, \tag{11}$$
where $N(\cdot)$ represents the normalization operation. The Visual Loss plays a critical role in preserving the frequency distribution of the infrared image.
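As a concrete illustration, the visual-guidance pipeline above (FFT, zero-frequency shift, logarithmic compression, normalization, MSE) can be sketched in NumPy. This is a minimal sketch; the function and variable names are ours, not from the released code:

```python
import numpy as np

def visual_loss(hr: np.ndarray, sr: np.ndarray) -> float:
    """MSE between the normalized log-magnitude spectra of the HR and
    SR images, following Eqs. (10)-(11). Illustrative sketch only."""
    def norm_spectrum(img: np.ndarray) -> np.ndarray:
        f = np.fft.fft2(img)               # 2-D FFT, Eq. (10)
        f = np.fft.fftshift(f)             # move zero frequency to center
        mag = np.log1p(np.abs(f))          # log compression: log(1 + |.|)
        # normalize to zero mean, unit variance (the N(.) operator)
        return (mag - mag.mean()) / (mag.std() + 1e-8)

    m_hr = norm_spectrum(hr)
    m_sr = norm_spectrum(sr)
    return float(np.mean((m_hr - m_sr) ** 2))  # Eq. (11)
```

The loss is zero when the two spectra match exactly and grows as their frequency distributions diverge, independent of absolute intensity.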
| Methods | Set5 CLIP-IQA↑ | Set5 MUSIQ↑ | Set15 CLIP-IQA↑ | Set15 MUSIQ↑ | Set20 CLIP-IQA↑ | Set20 MUSIQ↑ |
|---|---|---|---|---|---|---|
| Low Resolution* | 0.2167 | 24.609 | 0.2049 | 23.063 | 0.2230 | 22.446 |
| ESRGAN [44] (ECCV'18) | 0.2130 | 40.819 | 0.2038 | 40.745 | 0.1804 | 36.654 |
| RealSR-JPEG [15] (CVPR'20) | 0.3615 | 48.419 | 0.3573 | 49.225 | 0.3277 | 47.213 |
| BSRGAN [54] (CVPR'21) | 0.3290 | 53.119 | 0.3194 | 52.644 | 0.3301 | 51.917 |
| SwinIR [23] (CVPR'21) | 0.2160 | 37.156 | 0.2230 | 37.970 | 0.2258 | 34.919 |
| RealESRGAN [45] (ICCV'21) | 0.2780 | 54.306 | 0.2424 | 53.163 | 0.2523 | 51.647 |
| HAT [5] (CVPR'23) | 0.2298 | 38.050 | 0.2377 | 39.743 | 0.2466 | 35.633 |
| DAT [6] (ICCV'23) | 0.2297 | 37.538 | 0.2410 | 39.419 | 0.2518 | 35.750 |
| ResShift [52] (NeurIPS'23) | 0.4701 | 50.769 | 0.4428 | 52.871 | 0.4082 | 51.244 |
| CoRPLE [22] (ECCV'24) | 0.2339 | 36.281 | 0.2281 | 36.458 | 0.2281 | 34.270 |
| SinSR [46] (CVPR'24) | 0.5877 | 54.355 | 0.5762 | 54.106 | 0.5357 | 53.187 |
| Bi-DiffSR [7] (NeurIPS'24) | 0.3151 | 35.356 | 0.2758 | 36.102 | 0.2674 | 36.537 |
| DifIISR (Ours) | 0.6144 | 55.194 | 0.5906 | 54.504 | 0.5484 | 53.636 |
| High Resolution | 0.2200 | 34.066 | 0.2161 | 34.410 | 0.2139 | 32.024 |

*The low-resolution image is evaluated after being enlarged by interpolation to match the resolution of the high-resolution image.
Table 1: No-reference metric comparison of infrared image super-resolution with SOTA methods on the $\text{M}^3\text{FD}$ dataset.
| Dataset | Metric | ResShift | SinSR | Bi-DiffSR | DifIISR |
|---|---|---|---|---|---|
| Set5 | PSNR↑ | 30.101 | 31.645 | 32.022 | 32.279 |
| Set5 | SSIM↑ | 0.8329 | 0.8481 | 0.8579 | 0.8637 |
| Set5 | LPIPS↓ | 0.3179 | 0.2737 | 0.2816 | 0.2704 |
| Set15 | PSNR↑ | 30.283 | 31.988 | 32.145 | 32.351 |
| Set15 | SSIM↑ | 0.8228 | 0.8426 | 0.8471 | 0.8578 |
| Set15 | LPIPS↓ | 0.3537 | 0.2817 | 0.2924 | 0.2845 |
| Set20 | PSNR↑ | 30.976 | 33.438 | 33.447 | 33.451 |
| Set20 | SSIM↑ | 0.8446 | 0.8853 | 0.8874 | 0.8941 |
| Set20 | LPIPS↓ | 0.3507 | 0.2549 | 0.2820 | 0.2735 |
Table 2: Reference-based metric comparison with diffusion-based methods on the $\text{M}^3\text{FD}$ dataset.
$\mathcal{P}$ - perceptual guidance. To regularize the diffusion process so that it better aligns with machine perception, we adopt $\mathcal{L}_{\text{perceptual}}$, which consists of the VGG Loss $\mathcal{L}_{\text{VGG}}$ and the Segmentation Loss $\mathcal{L}_{\text{seg}}$, as the perceptual guidance $\mathcal{P}$. Here, $\mathcal{P}$ replaces the forward operator $\mathcal{M}$ in Eq. 8. Given the HR image $\mathbf{I}_{HR}$ and the super-resolved image $\mathbf{I}_{SR}$, the VGG Loss is the mean squared error between features of the HR and SR images extracted by a pre-trained deep network. This guides the model to capture nuanced aspects of images, including textures, edges, and shapes, which are crucial for preserving visual fidelity. To enhance the semantic fidelity of the reconstructed images, we further regulate the diffusion process with the Segment Anything Model (SAM) [19] and propose the Segmentation Loss $\mathcal{L}_{\text{seg}}$.
Given the HR image $\mathbf{I}_{HR}$ and the super-resolved image $\mathbf{I}_{SR}$, we use a frozen SAM to produce the segmentation masks $\mathbf{S}_{HR}$ and $\mathbf{S}_{SR}$, respectively. The Segmentation Loss $\mathcal{L}_{\text{seg}}$ is then the mean squared error between $\mathbf{S}_{HR}$ and $\mathbf{S}_{SR}$, providing effective high-level supervision for the reconstructed images.
The Perceptual Loss $\mathcal{L}_{\text{perceptual}}$ integrates the VGG-based and segmentation-based losses:
$$\mathcal{L}_{\text{perceptual}}=\overbrace{\left\|\phi_{l}(\mathbf{I}_{HR})-\phi_{l}(\mathbf{I}_{SR})\right\|_{2}^{2}}^{\mathcal{L}_{\text{VGG}}}+\overbrace{\left\|\mathbf{S}_{HR}-\mathbf{S}_{SR}\right\|_{2}^{2}}^{\mathcal{L}_{\text{seg}}}, \tag{12}$$
where $\phi_{l}(\cdot)$ represents the feature map extracted from the $l$-th layer of a pre-trained deep neural network (VGG-16 in our experiments).
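A minimal sketch of Eq. 12 follows. Here `phi` stands in for the frozen VGG-16 layer-$l$ feature extractor and `sam_masks` for the locked SAM; both are passed as callables because the actual networks are beyond the scope of a short example, so this illustrates only the structure of the loss:

```python
import numpy as np

def perceptual_loss(i_hr, i_sr, phi, sam_masks):
    """Eq. (12): squared-error VGG feature term plus squared-error
    SAM mask term. `phi` and `sam_masks` are placeholder callables
    standing in for the pre-trained networks (an assumption of this
    sketch, not the paper's implementation)."""
    l_vgg = np.sum((phi(i_hr) - phi(i_sr)) ** 2)     # feature-space MSE term
    s_hr, s_sr = sam_masks(i_hr), sam_masks(i_sr)    # frozen-SAM masks
    l_seg = np.sum((s_hr - s_sr) ** 2)               # mask-space MSE term
    return float(l_vgg + l_seg)
```

In practice `phi` would be a VGG-16 layer and `sam_masks` a locked SAM forward pass, with gradients flowing only into the SR image.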
The incorporation of visual and perceptual guidance refines each iteration of the diffusion, facilitating a more optimized denoising procedure. This not only improves visual fidelity but also enhances perceptual performance.
Figure 3: Visual comparison of infrared image super-resolution with SOTA methods on the $\text{M}^3\text{FD}$ dataset.
5 Experiments
5.1 Experimental Settings
Dataset and evaluation metrics. To ensure a fair comparison, we use the same training set [25] and test sets [25, 51, 40] as CoRPLE [22]. We train on the infrared image dataset $\text{M}^3\text{FD}$ [25] and evaluate on three datasets: $\text{M}^3\text{FD}$ [25], RoadScene [51], and TNO [40]. We adopt five metrics for quantitative evaluation: CLIP-IQA [43], MUSIQ [17], PSNR, LPIPS [56], and SSIM [47]. Among them, CLIP-IQA and MUSIQ are no-reference metrics: CLIP-IQA leverages the CLIP model [32] to assess image quality, while MUSIQ uses multi-scale feature extraction for quality evaluation. We rely mainly on CLIP-IQA and MUSIQ to compare the performance of different methods.
Implementation Details. Our network is trained on a GeForce RTX 4090 GPU. The backbone model and experimental settings largely follow ResShift [52]. Notably, ResShift uses the residual between the high-resolution (HR) and low-resolution (LR) images as the noise of the diffusion model, which allows us to apply gradient guidance directly on this HR-LR residual. During training, our approach differs from ResShift in that we first run 200K iterations on the new training set so the model develops basic infrared image super-resolution capability; we then add the conditional (visual and perceptual) guidance and train for an additional 50K iterations to obtain the final model.
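Conceptually, the guidance mechanism injects the gradient of the visual/perceptual loss into each reverse diffusion step. The toy sketch below shows only this control flow; `denoise_fn`, `guidance_grad_fn`, and the fixed step size `eta` are illustrative assumptions, not the paper's exact update rule or schedule:

```python
import numpy as np

def guided_reverse_step(x_t, denoise_fn, guidance_grad_fn, eta=0.1):
    """One reverse-diffusion step with gradient guidance: run the
    plain (ResShift-style) denoising update, then steer the estimate
    down the guidance loss. All callables and `eta` are placeholders
    for illustration."""
    x_prev = denoise_fn(x_t)            # unguided denoising update
    grad = guidance_grad_fn(x_prev)     # gradient of L_visual + L_perceptual
    return x_prev - eta * grad          # inject the gradient into the step
```

Repeated over the reverse trajectory, the injected gradients pull the sample toward minima of the guidance losses while the denoiser handles image formation.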
5.2 Experiments on Infrared SR
We perform a comprehensive comparison of our approach with eleven SOTA methods: ESRGAN [44], RealSR-JPEG [15], BSRGAN [54], SwinIR [23], RealESRGAN [45], HAT [5], DAT [6], ResShift [52], CoRPLE [22], Bi-DiffSR [7], and SinSR [46]. Tables 1 and 2 present the quantitative comparison with these methods, and Figure 3 presents the qualitative results.
Quantitative Comparison. Table 1 compares CLIP-IQA and MUSIQ scores on the $\text{M}^3\text{FD}$ dataset against various methods. CLIP-IQA inherits the powerful representation capabilities of CLIP and delivers stable, robust assessments of the perceptual quality of natural images; our method's lead on this metric across all three test sets indicates that our results better align with the human perceptual system. Our method also achieves the best MUSIQ scores among all competitors, demonstrating excellent results under multi-scale image quality assessment as well.
It is worth noting that we also compared our method against the HR images themselves on the no-reference metrics. Our method significantly outperforms the HR images on no-reference visual quality, indicating an enhancement even over the reference. This improvement, however, introduces a trade-off: on reference-based metrics such as PSNR, LPIPS, and SSIM, our approach shows less of an advantage over traditional methods, since our results deviate more from the HR images. Nevertheless, our method still leads diffusion-based methods on reference-based metrics, as shown in Table 2. This demonstrates that our approach leverages the powerful generative capabilities of diffusion to produce high-quality images while preserving essential detail features of the HR images under both visual and perceptual guidance.
Qualitative Results. The qualitative results in Figure 3 highlight the superiority of our method in visual quality compared to other approaches; additional examples can be found in the supplementary materials. We selected one image from each of the three datasets (Set5, Set15, and Set20) for a comprehensive qualitative analysis. Our method achieves more natural details in portraits, avoiding color discrepancies and better matching the contours of the ground-truth image. For vehicle details, our method accurately reproduces the grille at the front of the vehicle, whereas other methods tend to blur it. This demonstrates that our method also has distinct advantages in qualitative comparison.
5.3 Ablation Study
Experiments on the effectiveness of guidance. We conducted ablation experiments to evaluate the effectiveness of the visual and perceptual guidance on the infrared super-resolution task, as shown in Table 3. We assessed the super-resolution results under four conditions: without guidance, with only visual guidance, with only perceptual guidance, and with both. The results show that infrared image super-resolution performance is best when both types of guidance are applied.
Experiments on the guidance combinations. We conducted ablation experiments on different guidance combinations, as shown in Table 4. In the perceptual-based setup, where the gradient of the perceptual loss provides the guidance, we performed three sets of experiments: (1) without the visual loss, (2) directly adding the visual loss $\sum\mathcal{L}$ to the objective, and (3) injecting the gradient of the loss $\nabla\mathcal{L}$ into the noise. The results demonstrate that injecting the gradient of the loss into the noise yields the best performance. We also conducted experiments in the visual-based setup, and the results follow the same trend.
| Visual | Perceptual | PSNR | CLIP-IQA | mAP | mIoU |
|---|---|---|---|---|---|
| - | - | 33.466 | 0.5102 | 31.2 | 40.9 |
| ✓ | - | 34.528 | 0.5365 | 31.7 | 41.3 |
| - | ✓ | 33.923 | 0.5230 | 32.8 | 42.2 |
| ✓ | ✓ | 34.575 | 0.5379 | 33.1 | 42.4 |
Table 3: Ablation study on the effectiveness of multiple guidance.
Figure 4: Detection performance comparison of infrared image super-resolution with SOTA methods on the $\text{M}^3\text{FD}$ dataset.
Figure 5: Segmentation performance comparison of infrared image super-resolution with SOTA methods on the FMB dataset.
Figure 6: Quantitative comparison of detection and segmentation results with SOTA methods.
5.4 Experiments on Infrared Object Detection
Setup. We employ YOLOv5-s for infrared image object detection, fine-tuning it on the $\text{M}^3\text{FD}$ dataset. The primary evaluation metric is mean Average Precision over varying IoU thresholds (mAP@.5:.95). The model is fine-tuned with a batch size of 16 using the SGD optimizer with a learning rate of 0.01.
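For reference, the IoU overlap measure underlying the mAP@.5:.95 metric can be computed for two axis-aligned boxes as follows; this is a minimal sketch of the standard definition, not code from the evaluation pipeline:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates. mAP@.5:.95 averages AP over
    IoU acceptance thresholds from 0.5 to 0.95."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])   # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```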
Quantitative Comparison. The left section of Figure 6 presents a quantitative comparison of detection results across SOTA methods. In the top-right quadrant of the plot, the overall mAP of each model is displayed, while the other three quadrants represent the performance across individual categories. Our model consistently outperforms all other models in each detection category, demonstrating its superior ability in object detection tasks. Notably, in the truck detection category, our model achieves a 5.6% improvement over the best-competing method, underscoring its robustness in identifying challenging classes.
| Guidance | PSNR | CLIP-IQA | mAP | mIoU |
|---|---|---|---|---|
| Perceptual (base) | 33.923 | 0.5230 | 32.8 | 42.2 |
| + Visual, $\sum\mathcal{L}$ | 34.061 | 0.5342 | 32.5 | 42.0 |
| + Visual, $\nabla\mathcal{L}$ | 34.575 | 0.5379 | 33.1 | 42.4 |
| Visual (base) | 34.528 | 0.5365 | 31.7 | 41.3 |
| + Task, $\sum\mathcal{L}$ | 34.561 | 0.5371 | 32.8 | 41.9 |
| + Task, $\nabla\mathcal{L}$ | 34.575 | 0.5379 | 33.1 | 42.4 |
Table 4: Ablation study for different guidance combinations.
Qualitative Comparison. The qualitative results in Figure 4 demonstrate the superiority of our method in object detection. Other methods frequently miss at least one label or make errors. For example, in the first row, some methods fail to detect the person on the right side of the image, with none capable of detecting both signs above simultaneously. In the second row, certain methods miss the people farthest away, and others are unable to recognize the partially obstructed car. Only our method consistently achieves the best detection prediction results.
5.5 Experiments on Infrared Image Segmentation
Setup. We perform semantic segmentation on the FMB dataset [26]. The SegFormer-b1 model [49] is used as the backbone, with intersection-over-union (IoU) as the primary evaluation metric. Supervised by a cross-entropy loss, the model is trained with the AdamW optimizer, a learning rate of 6e-05, and a weight decay of 0.01. Training spans 25,000 iterations with a batch size of 8.
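The mIoU metric used here averages per-class IoU over the classes present in the evaluation. A minimal sketch of this standard computation (not the evaluation code used in the paper):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes for label maps.
    Classes absent from both prediction and ground truth are skipped
    so they do not distort the average."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                          # class absent everywhere
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union)
    return float(np.mean(ious))
```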
Quantitative Comparison. The right section of Figure 6 presents a quantitative comparison of semantic segmentation results. The top-right quadrant of the circle compares mIoU, while the remaining three quadrants depict performance across the three primary segmentation classes. Overall, our model achieves the best results in each category. Notably, it achieves the largest improvement on the truck class, 7.4%, and also improves by 5.4% and 3.0% on the car and human classes, respectively.
Qualitative Comparison. Figure 5 presents a qualitative comparison of segmentation results from various SOTA methods. These results reveal that other methods often fail to segment complete objects or miss relevant elements entirely. For example, other models segment only part of the sign, leaving portions of it undetected; RealESRGAN shows some improvement but still falls short of our method. Similarly, in the right image, other models fail to recognize the farthest poles and cannot fully capture the shapes of the people.
6 Conclusion
In this paper, we propose a task-oriented infrared image super-resolution diffusion model, namely DifIISR. Specifically, we introduce infrared thermal spectral distribution modulation as visual guidance to ensure consistency with high-resolution images by matching frequency components. In addition, we incorporate foundational vision models to provide perception guidance, which enhances detection and segmentation performance. With the above guidance, our method further optimizes each iteration of the standard diffusion process, refining the model at each denoising step and achieving superior visual and perceptual performance.
References
- Austin et al. [2021] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. In Proceedings of the Advances in Neural Information Processing Systems, pages 17981–17993, 2021.
- Cao et al. [2023] Bing Cao, Yiming Sun, Pengfei Zhu, and Qinghua Hu. Multi-modal gated mixture of local-to-global experts for dynamic image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 23555–23564, 2023.
- Cao et al. [2021] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
- Chen et al. [2021] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
- Chen et al. [2023a] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22367–22377, 2023a.
- Chen et al. [2023b] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, and Fisher Yu. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12312–12321, 2023b.
- Chen et al. [2024] Zheng Chen, Haotong Qin, Yong Guo, Xiongfei Su, Xin Yuan, Linghe Kong, and Yulun Zhang. Binarized diffusion model for image super-resolution. arXiv preprint arXiv:2406.05723, 2024.
- Choi et al. [2021] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, and Sungroh Yoon. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of IEEE/CVF International Conference on Computer Vision, pages 14347–14356, 2021.
- Chung et al. [2022] Hyungjin Chung, Byeongsu Sim, and Jong Chul Ye. Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12413–12422, 2022.
- Chung et al. [2023] Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In Proceedings of the International Conference on Learning Representations, 2023.
- Cui and Harada [2024] Ziteng Cui and Tatsuya Harada. Raw-adapter: Adapting pre-trained visual model to camera raw images. In European Conference on Computer Vision, pages 37–56. Springer, 2024.
- Cui et al. [2022] Ziteng Cui, Kunchang Li, Lin Gu, Shenghan Su, Peng Gao, Zhengkai Jiang, Yu Qiao, and Tatsuya Harada. You only need 90k parameters to adapt light: a light weight transformer for image enhancement and exposure correction. arXiv preprint arXiv:2205.14871, 2022.
- Dong et al. [2014] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision, pages 184–199, 2014.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, pages 6840–6851, 2020.
- Ji et al. [2020] Xiaozhong Ji, Yun Cao, Ying Tai, Chengjie Wang, Jilin Li, and Feiyue Huang. Real-world super-resolution via kernel estimation and noise injection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 466–467, 2020.
- Kawar et al. [2022] Bahjat Kawar, Michael Elad, Stefano Ermon, and Jiaming Song. Denoising diffusion restoration models. In Proceedings of the Advances in Neural Information Processing Systems, pages 23593–23606, 2022.
- Ke et al. [2021] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.
- Kim et al. [2016] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
- Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- Li et al. [2018] Juncheng Li, Faming Fang, Kangfu Mei, and Guixu Zhang. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision, pages 517–532, 2018.
- Li et al. [2023] Xingyuan Li, Yang Zou, Jinyuan Liu, Zhiying Jiang, Long Ma, Xin Fan, and Risheng Liu. From text to pixels: a context-aware semantic synergy solution for infrared and visible image fusion. arXiv preprint arXiv:2401.00421, 2023.
- Li et al. [2024] Xingyuan Li, Jinyuan Liu, Zhixin Chen, Yang Zou, Long Ma, Xin Fan, and Risheng Liu. Contourlet residual for prompt learning enhanced infrared image super-resolution. In Proceedings of the European Conference on Computer Vision, pages 270–288, 2024.
- Liang et al. [2021] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833–1844, 2021.
- Liang et al. [2022] Jie Liang, Hui Zeng, and Lei Zhang. Details or artifacts: A locally discriminative learning approach to realistic image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5657–5666, 2022.
- Liu et al. [2022] Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5811, 2022.
- Liu et al. [2023] Jinyuan Liu, Zhu Liu, Guanyao Wu, Long Ma, Risheng Liu, Wei Zhong, Zhongxuan Luo, and Xin Fan. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In Proceedings of the International Conference on Computer Vision, 2023.
- Liu et al. [2024] Jinyuan Liu, Xingyuan Li, Zirui Wang, Zhiying Jiang, Wei Zhong, Wei Fan, and Bin Xu. Promptfusion: Harmonized semantic prompt learning for infrared and visible image fusion. IEEE/CAA Journal of Automatica Sinica, 2024.
- Liu et al. [2020] Risheng Liu, Jinyuan Liu, Zhiying Jiang, Xin Fan, and Zhongxuan Luo. A bilevel integrated model with data-driven layer ensemble for multi-modality image fusion. IEEE Transactions on Image Processing, 30:1261–1274, 2020.
- Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Proceedings of the Advances in Neural Information Processing Systems, pages 5775–5787, 2022.
- Luo et al. [2020] Xiaotong Luo, Yuan Xie, Yulun Zhang, Yanyun Qu, Cuihua Li, and Yun Fu. Latticenet: Towards lightweight image super-resolution with lattice block. In Proceedings of the European Conference on Computer Vision, pages 272–289, 2020.
- Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, pages 8162–8171, 2021.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, pages 8748–8763, 2021.
- Saharia et al. [2022] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4713–4726, 2022.
- Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, 2015.
- Song et al. [2021a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of International Conference on Learning Representations, 2021a.
- Song et al. [2021b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations, 2021b.
- Sun et al. [2022a] Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Detfusion: A detection-driven infrared and visible image fusion network. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4003–4011, 2022a.
- Sun et al. [2022b] Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Transactions on Circuits and Systems for Video Technology, 32(10):6700–6713, 2022b.
- Sun et al. [2024] Yiming Sun, Bing Cao, Pengfei Zhu, and Qinghua Hu. Dynamic brightness adaptation for robust multi-modal image fusion. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 1317–1325, 2024.
- Toet [2017] Alexander Toet. The tno multiband image data collection. Data in brief, 15:249–251, 2017.
- Wang et al. [2022] Di Wang, Jinyuan Liu, Xin Fan, and Risheng Liu. Unsupervised misaligned infrared and visible image fusion via cross-modality image generation and registration. arXiv preprint arXiv:2205.11876, 2022.
- Wang et al. [2023a] Di Wang, Jinyuan Liu, Risheng Liu, and Xin Fan. An interactively reinforced paradigm for joint infrared-visible image fusion and saliency object detection. Information Fusion, 98:101828, 2023a.
- Wang et al. [2023b] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2555–2563, 2023b.
- Wang et al. [2018] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision Workshops, 2018.
- Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1905–1914, 2021.
- Wang et al. [2024] Yufei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C Kot, and Bihan Wen. SinSR: Diffusion-based image super-resolution in a single step. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25796–25805, 2024.
- Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- Wang et al. [2020] Zhihao Wang, Jian Chen, and Steven CH Hoi. Deep learning for image super-resolution: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3365–3387, 2020.
- Xie et al. [2021] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the Advances in Neural Information Processing Systems, 2021.
- Xie et al. [2023] Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, and Chao Dong. DeSRA: Detect and delete the artifacts of GAN-based real-world super-resolution models. In Proceedings of the International Conference on Machine Learning, pages 38204–38226, 2023.
- Xu et al. [2020] Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):502–518, 2020.
- Yue et al. [2024] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. ResShift: Efficient diffusion model for image super-resolution by residual shifting. In Proceedings of the Advances in Neural Information Processing Systems, 2024.
- Zamir et al. [2022] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5728–5739, 2022.
- Zhang et al. [2021] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4791–4800, 2021.
- Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Zhang et al. [2018a] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018a.
- Zhang et al. [2022] Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super-resolution. In Proceedings of the European Conference on Computer Vision, pages 649–667, 2022.
- Zhang et al. [2015] Yongbing Zhang, Yulun Zhang, Jian Zhang, and Qionghai Dai. CCR: Clustering and collaborative representation for fast single image super-resolution. IEEE Transactions on Multimedia, 18(3):405–417, 2015.
- Zhang et al. [2018b] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, pages 286–301, 2018b.
- Zhang et al. [2018c] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018c.
- Zou et al. [2024] Yang Zou, Zhixin Chen, Zhipeng Zhang, Xingyuan Li, Long Ma, Jinyuan Liu, Peng Wang, and Yanning Zhang. Contourlet refinement gate framework for thermal spectrum distribution regularized infrared image super-resolution. arXiv preprint arXiv:2411.12530, 2024.