A Vision Transformer Approach for Efficient Near-Field Irregular SAR Super-Resolution

Source: https://arxiv.org/html/2305.02074

[Josiah W. Smith](https://orcid.org/0000-0002-3388-4805)

Department of Electrical and Computer Engineering

The University of Texas at Dallas

Richardson, TX 75080

josiah.smith@utdallas.edu

[Yusef Alimam](https://orcid.org/0000-0001-7229-1765)

Department of Electrical and Computer Engineering

The University of Texas at Dallas

Richardson, TX 75080

[Geetika Vedula](https://orcid.org/0000-0001-7229-1765)

Department of Electrical and Computer Engineering

The University of Texas at Dallas

Richardson, TX 75080

[Murat Torlak](https://orcid.org/0000-0001-7229-1765)

Department of Electrical and Computer Engineering

The University of Texas at Dallas

Richardson, TX 75080

torlak@utdallas.edu

Abstract

In this paper, we develop a novel super-resolution algorithm for near-field synthetic aperture radar (SAR) under irregular scanning geometries. As fifth-generation (5G) millimeter-wave (mmWave) devices become increasingly affordable and available, high-resolution SAR imaging is feasible for end-user applications and non-laboratory environments. Emerging applications such as freehand imaging, wherein a handheld radar is scanned throughout space by a user, unmanned aerial vehicle (UAV) imaging, and automotive SAR face several unique challenges for high-resolution imaging. First, recovering a SAR image requires knowledge of the array positions throughout the scan. While recent work has introduced camera-based positioning systems capable of adequately estimating the position, efficient image recovery is a requirement to enable edge and Internet of Things (IoT) technologies. Efficient algorithms for non-cooperative near-field SAR sampling have been explored in recent work, but they suffer from image defocusing under position estimation error and can only produce medium-fidelity images. In this paper, we introduce a mobile-friendly vision transformer (ViT) architecture to address position estimation error and perform SAR image super-resolution (SR) under irregular sampling geometries. The proposed algorithm, Mobile-SRViT, is the first to employ a ViT approach for SAR image enhancement and is validated in simulation and via empirical studies.

©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. doi: 10.1109/WMCS55582.2022.9866326

† This work was supported in part by Texas Instruments through the Foundational Technology Research Centre and the Texas Analog Center of Excellence.

Keywords: 5G · drone mmWave imaging · freehand imaging · irregular sampling · mmWave imaging · synthetic aperture radar (SAR)

1 Introduction

Millimeter-wave (mmWave) imaging systems have garnered significant attention in recent years as ultrawideband (UWB) devices become increasingly affordable. Radar devices operating in the mmWave spectrum have been applied to concealed weapon detection [1], non-destructive testing [2], automotive imaging [3], and hand gesture recognition and tracking [4, 5, 6]. Synthetic aperture radar (SAR) imaging is of particular interest and involves scanning a radar across space to create a synthetic aperture much larger than the radar itself [7]. Traditional SAR imaging requires high-precision laboratory equipment for exact positioning of the antennas throughout the scan. Efficient SAR imaging algorithms leveraging the fast Fourier transform (FFT) to recover the image from the radar data have been explored extensively in the literature [1, 8, 9]. These efficient algorithms strictly require specific synthetic aperture geometries, e.g., planar [1, 8], cylindrical [9], etc., to achieve high-resolution imaging. However, with the emergence of fifth-generation (5G) and Internet of Things (IoT) technologies, near-field SAR sensing at the edge has already received attention at both the system and algorithm levels [10, 11]. One application of interest, known as freehand imaging, involves using a smartphone or handheld radar device to perform the SAR scan. Although these applications operate at frequencies similar to traditional laboratory SAR [12], they suffer from two primary constraints: 1) the resulting SAR array generally does not conform to the traditional geometries required by existing algorithms [1, 8, 9], and 2) since the image computation typically takes place on a low-power or mobile device, the computational load must be reduced compared to conventional imaging. As a result, recovering high-fidelity images under such conditions remains an open challenge.

Previous work on efficient near-field SAR imaging proposes a decomposition of the irregular multiple-input multiple-output SAR (MIMO-SAR) array into a multi-planar imaging scenario, wherein the multistatic samples are taken across a volume in space and then projected onto a reference plane for efficient image recovery, referred to as the efficient multi-planar multistatic (EMPM) algorithm [11]. This algorithm achieves imaging quality similar to the gold-standard backprojection algorithm (BPA) at a fraction of the computational load. However, prior analyses [11, 13] do not account for the position estimation errors present in practical implementations [10]. Such position errors cause defocusing and distortion in SAR images, as the algorithm improperly computes the matched filter weights based on the noisy position estimates. Without knowledge of the exact positions, removing distortion from images recovered by either the EMPM or the BPA remains an open challenge. For a practical system, the EMPM is capable of efficiently reconstructing only a medium-fidelity image.

Separately, deep learning approaches for optical image super-resolution have been extended into the radar domain for SAR image super-resolution [6, 14, 15, 16]. Using convolutional neural network (CNN) architectures, previous efforts have seen success in improving SAR resolution [14, 15] and removing multistatic artifacts [16]. However, these techniques operate on SAR images collected using traditional techniques in laboratory environments and are not suitable for the irregular sampling geometry explored in this paper. Nevertheless, deep learning has seen tremendous success in both the optical domain, for image restoration [17] and super-resolution [18], and the radar domain [14, 15, 16]. Hence, deep learning may be a suitable solution for near-field irregular SAR artifact mitigation and super-resolution.

Meanwhile, recent advances in computer vision have seen a shift from CNN-based architectures towards the attention mechanism [19] using Vision Transformer (ViT) techniques [20] to achieve performance gains with smaller model sizes [21, 22, 23]. In [23], the MobileViT architecture is presented, leveraging a transformer architecture for image classification. Later, the transformer architecture was employed for optical image super-resolution and artifact mitigation in [17]. Transformer techniques have appeared in recent work on radar image classification [24] and gesture recognition [5]; however, transformers have yet to be employed for SAR image super-resolution. In this paper, we introduce a novel transformer-based architecture for SAR image super-resolution under irregular sampling geometries called Mobile-SRViT. The proposed algorithm operates on images recovered by the EMPM algorithm [11] and produces high-fidelity images of intricate targets. We validate our mobile-friendly algorithm using simulation and empirical data from a near-field SAR scenario with irregular scanning geometry.

The remainder of this paper is formatted as follows. Section 2 overviews the requisite signal model for near-field irregularly sampled SAR. In Section 3, we detail our proposed algorithm. Experimental results are presented in Section 4, followed by conclusions in Section 5.

2 Signal Model

In this section, the signal model for non-cooperative SAR in the near-field is briefly introduced. Considering a multi-planar multistatic array with transmitter (Tx) and receiver (Rx) antennas located at $(x_T, y_T, z_\ell)$ and $(x_R, y_R, z_\ell)$, respectively, the received signal at the $\ell$-th Tx/Rx pair can be modeled as

$$s(x_T,x_R,y_T,y_R,z_\ell,t)=\iint \frac{o(x,y)}{R_\ell^T R_\ell^R}\, p\left(t-\frac{R_\ell^T}{c}-\frac{R_\ell^R}{c}\right) dx\, dy, \tag{1}$$

assuming the Born approximation and an isotropic antenna, where $o(x,y)$ is known as the target reflectivity at the plane $z=z_0$, $p(t)$ is the signal at the transmitter, $t$ is the fast-time variable, $c$ is the speed of light, and $R_\ell^T$, $R_\ell^R$ are given by

$$\begin{split} R_\ell^T &= \left[(x_T-x)^2+(y_T-y)^2+(z_\ell-z_0)^2\right]^{\frac{1}{2}}, \\ R_\ell^R &= \left[(x_R-x)^2+(y_R-y)^2+(z_\ell-z_0)^2\right]^{\frac{1}{2}}. \end{split} \tag{2}$$

Taking the Fourier transform of (1) with respect to time yields the frequency spectrum, which can be expressed as

$$s(x_T,x_R,y_T,y_R,z_\ell,f)=P(f)\iint \frac{o(x,y)}{R_\ell^T R_\ell^R}\, e^{-jk(R_\ell^T+R_\ell^R)}\, dx\, dy, \tag{3}$$

where $P(f)$ is the spectrum of $p(t)$ and the instantaneous wavenumber is given by $k=2\pi f/c$.
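To make the signal model concrete, the following is a minimal NumPy sketch of (3) for a collection of point scatterers, ignoring the transmit spectrum $P(f)$. The function name and array layout are illustrative, not part of the original system.

```python
import numpy as np

def multistatic_response(tx, rx, targets, freqs, c=299792458.0):
    """Simulate the frequency-domain multistatic response per (3) for
    point scatterers, omitting the transmit spectrum P(f).

    tx, rx  : (L, 3) arrays of Tx and Rx positions for each sample.
    targets : (M, 4) array of [x, y, z, reflectivity] point scatterers.
    freqs   : (K,) array of frequencies in Hz.
    Returns : (L, K) complex response s(ell, f).
    """
    k = 2 * np.pi * freqs / c                      # wavenumbers, (K,)
    s = np.zeros((tx.shape[0], freqs.size), dtype=complex)
    for x, y, z, amp in targets:
        p = np.array([x, y, z])
        R_T = np.linalg.norm(tx - p, axis=1)       # Tx-to-target range, (L,)
        R_R = np.linalg.norm(rx - p, axis=1)       # target-to-Rx range, (L,)
        # Spherical spreading and two-way phase from (3)
        s += amp / (R_T * R_R)[:, None] * np.exp(-1j * np.outer(R_T + R_R, k))
    return s
```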

By the analysis in [11], the multi-planar multistatic data can be projected to an equivalent virtual planar sampled array by

$$\hat{s}(x',y',f) \approx s(x_T,x_R,y_T,y_R,z_\ell,f)\, e^{jk\beta_\ell}, \tag{4}$$

where

$$\beta_\ell = 2d_\ell^z + \frac{(d_\ell^x)^2+(d_\ell^y)^2}{4Z_0}, \tag{5}$$

$d_\ell^x$ and $d_\ell^y$ are the $x$- and $y$-components of the Tx/Rx baseline, $d_\ell^z$ is the offset of the $\ell$-th sample from the reference plane $Z_0$ [11], and $\hat{s}(x',y',f)$ is the virtual planar monostatic array with virtual coplanar antenna positions on the $Z_0$ plane located at the midpoint of each Tx/Rx pair. After this compensation step, the image can be recovered using the efficient Fourier-based range migration algorithm (RMA), yielding the 2-D reflectivity image $\hat{o}(x,y)$ [7, 11]. Besides the EMPM, whose complexity is $O(N^2\log N)$, the BPA also does not require a specific sampling geometry. However, the complexity of the BPA is $O(N^4)$, which is generally prohibitive for end-user applications [8, 11]. Furthermore, the recovered image from either algorithm is degraded by distortion and defocusing due to errors in the position estimates. In an attempt to address these artifacts and improve SAR imaging resolution, we propose the novel Mobile-SRViT architecture.
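As an illustration of the projection step in (4)-(5), the following sketch applies the phase compensation to simulated samples. It assumes, per the convention of [11], that $d_\ell^x$, $d_\ell^y$ are the Tx/Rx baseline components and $d_\ell^z$ is the element offset from $Z_0$; the remainder of the EMPM pipeline (uniform resampling and the RMA) is omitted.

```python
import numpy as np

def empm_project(s, tx, rx, freqs, Z0, c=299792458.0):
    """Project multi-planar multistatic samples onto a virtual planar
    monostatic array per (4)-(5). Sketch of the compensation step only;
    the full EMPM algorithm is detailed in [11].

    s      : (L, K) complex samples s(x_T, x_R, y_T, y_R, z_ell, f).
    tx, rx : (L, 3) Tx/Rx positions; the virtual element is their midpoint.
    Z0     : reference plane of the virtual array.
    """
    k = 2 * np.pi * freqs / c                 # wavenumbers, (K,)
    mid = 0.5 * (tx + rx)                     # virtual monostatic positions
    d = tx - rx                               # Tx/Rx baseline per sample
    dz = mid[:, 2] - Z0                       # assumed: offset from Z0 plane
    beta = 2 * dz + (d[:, 0]**2 + d[:, 1]**2) / (4 * Z0)   # (5)
    s_hat = s * np.exp(1j * np.outer(beta, k))              # (4)
    return s_hat, mid[:, :2]                  # compensated data and (x', y')
```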


Figure 1: Operation of the Mobile-SRViT: the low resolution image produced by the EMPM algorithm is restored by the Mobile-SRViT algorithm. The ground truth image is shown for reference.


Figure 2: Mobile-SRViT architecture.

3 Irregular SAR Super-Resolution using Vision Transformers

In this section, we detail the proposed transformer-based approach for near-field irregular SAR super-resolution and artifact mitigation. An overview of the Mobile-SRViT algorithm is given in Fig. 1. The raw radar data are processed by the EMPM algorithm to produce a low-resolution image with distortion, blur, and defocusing caused by imaging non-idealities. The proposed Mobile-SRViT operates on this image to produce a super-resolution image, restoring image quality while preserving key high-frequency details of the targets.

The architecture of the Mobile-SRViT is based on the MobileViT network employed for image classification in [23]. Since our network is designed for image-to-image processing, the convolution layers are modified to adhere to a fully convolutional neural network (FCNN) framework, similar to that of [6]. Fig. 2 details the implementation of the Mobile-SRViT algorithm, where "MV2" refers to the MobileNetV2 block proposed in [22]. Our algorithm adopts the approach of [23] such that the image is first processed by several MobileNetV2 convolution blocks before alternating between MobileViT and MobileNetV2 operations. The MobileViT block is intended to model the global and local information of the input data with fewer parameters than the traditional ViT [20]. Given an input tensor $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels, the MobileViT block first applies a $3\times3$ convolution layer followed by a $1\times1$, or pointwise, convolution layer to produce a tensor $\mathbf{X}_L\in\mathbb{R}^{H\times W\times d}$. The last $1\times1$ convolution layer reduces the channels to match the input image. In order to learn global representations, $\mathbf{X}_L$ is unfolded into $N$ non-overlapping patches $\mathbf{X}_U\in\mathbb{R}^{P\times N\times d}$, where $P=4$ and $N=HW/4$. Each of the $P$ patches is processed using a transformer architecture to encode the inter-patch relationships, yielding

$$\mathbf{X}_G(p) = \mathrm{Transformer}\left(\mathbf{X}_U(p)\right), \quad p \in [1,\dots,P]. \tag{6}$$

Whereas most ViT implementations lose the positional location of each patch [20, 21], the MobileViT retains the patch order and the pixel order within each patch. As a result, $\mathbf{X}_G\in\mathbb{R}^{P\times N\times d}$ can be directly folded to obtain $\mathbf{X}_F\in\mathbb{R}^{H\times W\times d}$. The resulting tensor $\mathbf{X}_F$ is projected to a low $C$-dimensional space via a $1\times1$ convolution layer before being concatenated with $\mathbf{X}$, yielding $\mathbf{X}_O\in\mathbb{R}^{H\times W\times 2C}$. Finally, using a $3\times3$ convolution, $\mathbf{X}_O$ is fused to form the output tensor $\mathbf{Y}$ of identical size to $\mathbf{X}$. Interestingly, the receptive field of the MobileViT block is $H\times W$ because $\mathbf{X}_U(p)$ encodes local information from a $3\times3$ region via convolutions and each pixel in $\mathbf{X}_G(p)$ encodes global information over $P$ patches [23]. Our implementation maintains $C=16$ until the last MobileViT block, where $C=32$; $d=2C$ for each MobileViT block; and the number of transformer layers in the three MobileViT blocks is $L=\{2,4,3\}$. The images are of size $256\times256$ and the patch size employed is $16\times16$. With this architecture, the proposed Mobile-SRViT has 69,122 parameters. Loss is computed using the pixel-to-pixel $L_1$ metric as

L p⁢2⁢p=‖𝐗 S⁢R−𝐗 H⁢R‖1.subscript 𝐿 𝑝 2 𝑝 subscript norm subscript 𝐗 𝑆 𝑅 subscript 𝐗 𝐻 𝑅 1 L_{p2p}=||\mathbf{X}{SR}-\mathbf{X}{HR}||_{1}.italic_L start_POSTSUBSCRIPT italic_p 2 italic_p end_POSTSUBSCRIPT = | | bold_X start_POSTSUBSCRIPT italic_S italic_R end_POSTSUBSCRIPT - bold_X start_POSTSUBSCRIPT italic_H italic_R end_POSTSUBSCRIPT | | start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT .(7)

Prior attempts at near-field SAR super-resolution have been purely CNN-based [6, 14, 15, 16], but the Mobile-SRViT detailed in this paper is the first to leverage a transformer architecture for SAR imaging.
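For concreteness, the following PyTorch sketch implements a MobileViT-style block as described above: local $3\times3$ and pointwise convolutions, unfolding into patches, a transformer applied per pixel position across patches, folding back, projection, concatenation with the input, and a $3\times3$ fusion. The hyperparameters and layer choices here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Sketch of a MobileViT-style block per [23]; d = 2C and L transformer
    layers follow the text above, other settings are assumed."""

    def __init__(self, C=16, d=32, L=2, patch=2, heads=2):
        super().__init__()
        self.patch = patch                       # patch side length (P = patch**2)
        self.local = nn.Sequential(              # 3x3 conv then 1x1 pointwise: X -> X_L
            nn.Conv2d(C, C, 3, padding=1), nn.SiLU(),
            nn.Conv2d(C, d, 1))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=2 * d,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=L)
        self.proj = nn.Conv2d(d, C, 1)           # project X_F back to C channels
        self.fuse = nn.Conv2d(2 * C, C, 3, padding=1)  # fuse concat([X, X_F])

    def forward(self, x):
        B, C, H, W = x.shape                     # H, W must be divisible by patch
        p = self.patch
        y = self.local(x)                        # X_L, shape (B, d, H, W)
        d = y.shape[1]
        # Unfold into non-overlapping p x p patches: (B*P, N, d), so that
        # attention relates corresponding pixel positions across patches.
        y = y.reshape(B, d, H // p, p, W // p, p)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * p * p, -1, d)
        y = self.transformer(y)                  # global mixing, per (6)
        # Fold back to (B, d, H, W), preserving patch and pixel order.
        y = y.reshape(B, p, p, H // p, W // p, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        y = self.proj(y)                         # (B, C, H, W)
        return self.fuse(torch.cat([x, y], dim=1))

# Example: y = MobileViTBlock(C=16)(torch.randn(1, 16, 256, 256))
```

Because the block preserves the spatial size of its input, it can be freely interleaved with MobileNetV2 stages in the fully convolutional pipeline of Fig. 2.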


Figure 3: Imaging results using the Mobile-SRViT on synthetic data. The images in the first column, (a) and (d), are produced by the EMPM algorithm and input to the Mobile-SRViT. The images in the second column, (b) and (e), are the super-resolution images output from the Mobile-SRViT. The images in the third column, (c) and (f), are the ground truth images.

3.1 Training the Mobile-SRViT

Our efficient algorithm is trained for 50 epochs using the ADAM optimizer on a single RTX 3090 GPU with 24 GB of memory, with 4096 samples used for training and 1024 samples for evaluation. The training data are generated using the procedure detailed in [6, 11, 14]. Whereas prior methods [6, 14, 15, 16] train on SAR images of exclusively point scatterers, we introduce more sophisticated targets consisting of solid and hollow objects in addition to randomly placed point scatterers, such as the example images in Fig. 1. By including more complex targets in the training dataset, our algorithm is able to generalize to solid and hollow targets. Each sample is generated with additive white Gaussian noise (AWGN) with a signal-to-noise ratio in the range $[-10, 50]$ dB and includes AWGN positioning errors with a standard deviation of 1 mm along the $x$-, $y$-, and $z$-directions to emulate a practical scenario. Training took approximately 6 hours, with an inference time during validation of 10 ms per sample.
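A minimal training-loop sketch consistent with this setup is given below, assuming PyTorch. The learning rate and the data-loader interface (pairs of EMPM and ground-truth images) are assumptions, as the paper does not report them.

```python
import torch
import torch.nn as nn

def train(model, train_loader, device="cuda", epochs=50, lr=1e-3):
    """Training sketch per Section 3.1: ADAM optimizer, 50 epochs, and the
    pixel-to-pixel L1 loss of (7). lr is an assumed value."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = nn.L1Loss()                           # implements (7) per pixel
    for epoch in range(epochs):
        for x_lr, x_hr in train_loader:        # EMPM image, ground-truth image
            x_lr, x_hr = x_lr.to(device), x_hr.to(device)
            x_sr = model(x_lr)                 # super-resolved output X_SR
            loss = l1(x_sr, x_hr)
            opt.zero_grad()
            loss.backward()
            opt.step()
```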

4 Experimental Results

In this section, we conduct simulation and empirical experiments to verify the proposed algorithm. To evaluate the performance of the Mobile-SRViT algorithm, we first compare its numerical performance to the EMPM, BPA, and RMA. The RMA requires planar sampling and is unable to recover an image from an irregularly sampled array, as discussed in [11]. The gold-standard BPA places no requirements on SAR array geometry and is well-suited for irregular scanning geometries. However, it is computationally prohibitive, particularly for mobile applications, as it computes the pixel-wise matched filter for every sampling location and frequency. The EMPM, on the other hand, is an efficient RMA-based algorithm but assumes the samples are taken across a small volume relative to a reference plane $Z_0$ [11]. The proposed Mobile-SRViT algorithm attempts to compensate for the distortion present in the EMPM images due to these assumptions in addition to imaging non-idealities.

Table 1: Quantitative performance of the Mobile-SRViT compared to the BPA, EMPM, and RMA.

Using a test dataset consisting of 1024 samples similar to those in the training dataset but never seen by the network, we apply the Mobile-SRViT, BPA, EMPM, and RMA to the samples and measure the peak signal-to-noise ratio (PSNR), root-mean-square error (RMSE), and computation time per sample. Results are shown in Table 1 with the best evaluations marked in boldface. All experiments are conducted on a desktop PC equipped with a 12-core AMD Ryzen 9 3900X running at 4.6 GHz with 64 GB of memory. As expected, the RMA is unable to achieve image quality, in terms of PSNR and RMSE, comparable to the BPA or EMPM but is highly efficient. The EMPM achieves identical computation time to the RMA with much higher PSNR and lower RMSE, approaching the BPA. On the other hand, the BPA boasts the highest PSNR and lowest RMSE of the classical algorithms but requires significantly more computation time. The Mobile-SRViT is superior to the other algorithms, even outperforming the BPA in PSNR and RMSE, with a total computation time of 1.113 s required to compute the EMPM and pass the image through the network. This quantitative analysis demonstrates the superiority of the proposed method over previous techniques in terms of both computational efficiency and image quality.
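For reference, the PSNR and RMSE metrics used in Table 1 can be computed as in the following sketch, which assumes images normalized to a known peak value.

```python
import numpy as np

def rmse(x, ref):
    """Root-mean-square error between a reconstruction and the ground truth."""
    return np.sqrt(np.mean((x - ref) ** 2))

def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming images scaled to [0, peak]."""
    return 20 * np.log10(peak / rmse(x, ref))
```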

We further validate the performance of the proposed algorithm via visual inspection of both simulated and empirical data. Two samples from the testing dataset are compared in Fig. 3. For each sample, the proposed SAR super-resolution network is able to recover the solid object in addition to the point scatterers and to mitigate distortion caused by position estimation errors and system limitations. The super-resolution images, Figs. 3(b) and 3(e), are quite similar to the ideal images, Figs. 3(c) and 3(f), showing an improvement over the medium-fidelity images recovered by the EMPM.

To evaluate the performance of our proposed algorithm on empirical data, we first perform a SAR scan with irregular scanning geometry, as shown in Fig. 4(a). After reconstructing the image with the EMPM, shown in Fig. 4(b), the Mobile-SRViT is applied to achieve the super-resolution image shown in Fig. 4(c). The proposed algorithm not only recovers a better-resolved image but also mitigates multistatic artifacts visible in the EMPM image [7, 11].


Figure 4: Imaging results using the Mobile-SRViT on empirical data from a near-field SAR scenario with irregular sampling geometry and positioning error: (a) the SAR sampling geometry, (b) the image reconstructed by the EMPM, and (c) the super-resolution image produced by the Mobile-SRViT.

5 Conclusion

In this paper, we introduced a novel algorithm for near-field SAR super-resolution under irregular sampling geometries using a vision transformer architecture. Compared to previous methods for SAR super-resolution, our improved technique addresses the more challenging problem of image enhancement under non-ideal sampling conditions. The proposed algorithm enables numerous applications such as freehand smartphone imaging, UAV SAR, and automotive imaging. Using the efficient medium-fidelity EMPM algorithm developed in [11], we train a novel image-to-image network using state-of-the-art CNN [22] and ViT [23] techniques suitable for mobile applications with low latency and a small model size. The robust algorithm is verified in simulation and empirical studies to outperform the state-of-the-art techniques in terms of both image quality and computational complexity.

References

  • [1] D. M. Sheen, D. L. McMakin, and T. E. Hall, “Three-dimensional millimeter-wave imaging for concealed weapon detection,” IEEE Trans. Microw. Theory Techn., vol. 49, no. 9, pp. 1581–1592, Sep. 2001.
  • [2] M. T. Ghasr, S. Kharkovsky, R. Bohnert, B. Hirst, and R. Zoughi, “30 GHz linear high-resolution and rapid millimeter wave imaging system for NDE,” IEEE Trans. Antennas Propag., vol. 61, no. 9, pp. 4733–4740, Jun. 2013.
  • [3] M. Garcia-Fernandez, Y. Alvarez-Lopez, and F. L. Heras, “3D-SAR processing of UAV-mounted GPR measurements: Dealing with non-uniform sampling,” in Proc. 14th Eur. Conf. Antennas Propag. (EuCAP), Copenhagen, Denmark, Aug. 2020, pp. 1–5.
  • [4] J. W. Smith, S. Thiagarajan, R. Willis, Y. Makris, and M. Torlak, “Improved static hand gesture classification on deep convolutional neural networks using novel sterile training technique,” IEEE Access, vol. 9, pp. 10893–10902, Jan. 2021.
  • [5] L. Zheng, J. Bai, X. Zhu, L. Huang, C. Shan, Q. Wu, and L. Zhang, “Dynamic hand gesture recognition in in-vehicle environment based on FMCW radar and transformer,” Sensors, vol. 21, no. 19, p. 6368, Sep. 2021.
  • [6] J. W. Smith, O. Furxhi, and M. Torlak, “An FCNN-based super-resolution mmWave radar framework for contactless musical instrument interface,” IEEE Trans. Multimedia, pp. 1–1, May 2021.
  • [7] M. E. Yanik and M. Torlak, “Near-field MIMO-SAR millimeter-wave imaging with sparsely sampled aperture data,” IEEE Access, vol. 7, pp. 31801–31819, Mar. 2019.
  • [8] M. E. Yanik, D. Wang, and M. Torlak, “Development and demonstration of MIMO-SAR mmWave imaging testbeds,” IEEE Access, vol. 8, pp. 126019–126038, Jul. 2020.
  • [9] J. W. Smith, M. E. Yanik, and M. Torlak, “Near-field MIMO-ISAR millimeter-wave imaging,” in Proc. IEEE Radar Conf. (RadarConf), Florence, Italy, Sep. 2020, pp. 1–6.
  • [10] G. Álvarez Narciandi, J. Laviada, and F. Las-Heras, “Towards turning smartphones into mmWave scanners,” IEEE Access, vol. 9, pp. 45147–45154, Mar. 2021.
  • [11] J. W. Smith and M. Torlak, “Efficient 3-D near-field MIMO-SAR imaging for irregular scanning geometries,” IEEE Access, vol. 10, pp. 10283–10294, Jan. 2022.
  • [12] M. E. Yanik, D. Wang, and M. Torlak, “3-D MIMO-SAR imaging using multi-chip cascaded millimeter-wave sensors,” in Proc. IEEE Global Conf. Signal Inf. Process. (GlobalSIP), Ottawa, ON, Canada, Nov. 2019, pp. 1–5.
  • [13] G. Álvarez Narciandi, J. Laviada, Y. Álvarez López, G. Ducournau, C. Luxey, C. Belem-Goncalves, F. Gianesello, N. Nachabe, C. D. Rio, and F. Las-Heras, “Freehand system for antenna diagnosis based on amplitude-only data,” IEEE Trans. Antennas Propag., vol. 69, no. 8, pp. 4988–4998, Feb. 2021.
  • [14] J. Gao, B. Deng, Y. Qin, H. Wang, and X. Li, “Enhanced radar imaging using a complex-valued convolutional neural network,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 1, pp. 35–39, Sep. 2018.
  • [15] H. Jing, S. Li, K. Miao, S. Wang, X. Cui, G. Zhao, and H. Sun, “Enhanced millimeter-wave 3-D imaging via complex-valued fully convolutional neural network,” Electronics, vol. 11, no. 1, p. 147, 2022.
  • [16] Y. Dai, T. Jin, H. Li, Y. Song, and J. Hu, “Imaging enhancement via CNN in MIMO virtual array-based radar,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 9, pp. 7449–7458, Nov. 2021.
  • [17] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using Swin transformer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshop (ICCVW), Montreal, Canada, Oct. 2021, pp. 1833–1844.
  • [18] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, Jul. 2017, pp. 136–144.
  • [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), Long Beach, CA, USA, Dec. 2017, pp. 5998–6008.
  • [20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Montreal, Canada, Oct. 2021, pp. 10012–10022.
  • [22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Salt Lake City, UT, USA, Jun. 2018, pp. 4510–4520.
  • [23] S. Mehta and M. Rastegari, “MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer,” arXiv preprint arXiv:2110.02178, Oct. 2021.
  • [24] H. Dong, L. Zhang, and B. Zou, “Exploring vision transformers for polarimetric SAR image classification,” IEEE Trans. Geosci. Remote Sens., Nov. 2021.
