
Title: Bi-Grid Reconstruction for Image Anomaly Detection

URL Source: https://arxiv.org/html/2504.00609

Huichuan Huang, Zhiqing Zhong, Guangyu Wei, Yonghao Wan, Wenlong Sun, Aimin Feng\*

\*Corresponding author: Aimin Feng (amfeng@nuaa.edu.cn)

Nanjing University of Aeronautics and Astronautics, China

MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, China

{hchuang, zhiqing, weiguangyu, wangyonghao, wenlong.sun, amfeng}@nuaa.edu.cn

Abstract

In image anomaly detection, significant advancements have been made using un- and self-supervised methods with datasets containing only normal samples. However, these approaches often struggle with fine-grained anomalies. This paper introduces GRAD: Bi-Grid Reconstruction for Image Anomaly Detection, which employs two continuous grids to enhance anomaly detection from both normal and abnormal perspectives. This work contributes: 1) grids used as feature repositories, which improve generalization and mitigate the Identical Shortcut (IS) issue; 2) an abnormal feature grid that refines normal feature boundaries, boosting detection of fine-grained defects; 3) the Feature Block Paste (FBP) module, which synthesizes various anomalies at the feature level for quick abnormal grid deployment. GRAD’s robust representation capabilities also allow it to handle multiple classes with a single model. Evaluations on MVTecAD, VisA, and GoodsAD show significant performance improvements in fine-grained anomaly detection. GRAD excels in overall accuracy and in discerning subtle differences, demonstrating its superiority over existing methods.

Index Terms:

Image Anomaly Detection, Self-supervised Method, Reconstruction Method, Grid Sampling

I Introduction

Image anomaly detection and localization aim to identify and precisely segment abnormal regions in images, with applications spanning industrial inspection, medical imaging, and video surveillance. However, this task faces challenges due to the scarcity of abnormal samples and the diversity of anomaly patterns, ranging from minor scratches to significant structural damage in industrial production. Under these challenges, there is an increasing interest in developing unsupervised and self-supervised methods.

In anomaly detection, notable unsupervised methods include PaDiM [1], SPADE [2], and PatchCore [3], which utilize an external vector database to store features extracted from normal samples. During inference, anomalies are detected by calculating the Euclidean distance between the test sample and its nearest neighbor in the database. While effective, these methods face limitations due to their discrete feature storage, which hampers generalization and requires the retention of a large number of diverse normal features. This results in high spatial complexity and resource-intensive search operations.
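The nearest-neighbor scoring used by such memory-bank methods can be illustrated with a minimal NumPy sketch. It is not the implementation of PaDiM, SPADE, or PatchCore; the function name `nearest_neighbor_scores`, the bank size, and the feature dimension are assumptions chosen for illustration. It also makes the cost visible: every query is compared against every stored feature.

```python
import numpy as np

def nearest_neighbor_scores(bank, queries):
    """For each query feature, return the Euclidean distance to its nearest bank entry."""
    # Broadcast (Q, 1, C) against (1, N, C) to get all pairwise differences: (Q, N, C).
    diff = queries[:, None, :] - bank[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))   # (Q, N) pairwise distances
    return dists.min(axis=1)                    # (Q,) distance to nearest neighbor

rng = np.random.default_rng(0)
bank = rng.normal(size=(500, 16))      # hypothetical bank of normal patch features
normal_query = bank[:3] + 0.01         # queries close to stored normal features
odd_query = bank[:3] + 5.0             # queries far from everything stored
scores_normal = nearest_neighbor_scores(bank, normal_query)
scores_odd = nearest_neighbor_scores(bank, odd_query)
```

In this toy setup the shifted queries receive much larger scores than the near-duplicates, which is the behavior the distance-based detectors rely on.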

Figure 1: In the comparison of complex products and fine-grained anomalies, our model shows significant advantages over other models.

Figure 2: Overall framework of GRAD. The input samples are first processed by a pre-trained feature extractor to obtain initial features (the subsequent FBP module is only activated during the training of the anomaly grid). These features are then mapped to 2D coordinates through the coordinate mapping module. Based on these coordinates, sampling is performed from the normal and anomaly grids. The sampling results are fused and refined through the feature refinement module to produce the final reconstructed features. The comparison between these reconstructed features and the initial features yields the final anomaly detection results. (Note: the top-left corner markers of the abnormal grid and the normal grid are offset from each other, indicating that the two grids alternate during training.)

To address the shortcomings of insufficient generalization in the aforementioned methods, approaches such as MemAE [4] and DAAD [5] have been developed. These methods incorporate discrete repositories into the reconstruction task to generate generalized normal features and detect anomalies by comparing samples before and after reconstruction. By leveraging attention mechanisms, these models gather diverse normal features from the repository, resulting in stronger robustness to test data, and thus enhancing generalization performance. However, controlling the training of generative models remains a challenge. Overgeneralization can lead to the Identical Shortcut (IS) issue, where the input sample is mapped too closely to the reconstructed sample, as highlighted in UniAD [6].

To balance generalization, CRAD [7] proposes using a continuous grid instead of discrete feature storage in the reconstruction task. Grid sampling improves generalization by using interpolation techniques, and compared to methods that rely on storing numerous features in memory, this approach reduces the risk of generating entirely new features (i.e., unseen anomalies in our context), thereby helping to avoid the IS issue.
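The interpolation behind grid sampling can be sketched as follows. This is a minimal illustration assuming a 2D grid, bilinear interpolation, and coordinates normalized to [0, 1]; the function name `bilinear_sample` is hypothetical, and the sampling actually used in CRAD may differ. Because the result is a convex combination of the four neighboring grid entries, a sampled feature always stays within the range spanned by stored features.

```python
import numpy as np

def bilinear_sample(grid, u, v):
    """Sample a (Gh, Gw, C) grid at continuous coordinates u, v in [0, 1]."""
    gh, gw, _ = grid.shape
    x = u * (gw - 1)                      # continuous column position
    y = v * (gh - 1)                      # continuous row position
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, gw - 1), min(y0 + 1, gh - 1)
    wx, wy = x - x0, y - y0               # interpolation weights in [0, 1]
    top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
    bottom = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
    return (1 - wy) * top + wy * bottom

# A 2x2 grid with one channel; the centre is the average of the four corners.
grid = np.array([[[0.0], [1.0]],
                 [[2.0], [3.0]]])
center = bilinear_sample(grid, 0.5, 0.5)
```

The lookup touches a fixed number of grid cells regardless of how many features were seen in training, which is where the constant-time cost of grid sampling comes from.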

Although the aforementioned unsupervised methods have shown good performance, the boundaries of normal data they define often lack sufficient accuracy due to the absence of real anomaly data. This is particularly problematic when dealing with fine-grained defects, where over- or under-detection frequently occurs. To address these issues, we propose GRAD, which introduces an anomaly grid that stores abnormal features in addition to the normal grid that stores normal features. This complements the model with knowledge learned from accessible synthetic anomalies, refining the boundaries of normal features and thereby enhancing the model’s performance in detecting fine-grained anomalies in complex products. As shown in Figure 1, our model demonstrates significant improvements over previous methods in handling more complex products and fine-grained anomalies. Given that training models like DFMGAN [8] and AnomalyDiffusion [9] to synthesize realistic anomalies requires substantial computational resources, we also designed a Feature Block Paste (FBP) module. This module synthesizes diverse anomalies at the feature level with controllable shape, size, intensity, and position to facilitate the rapid training of a usable anomaly grid.

Our comprehensive analysis confirms GRAD as an effective AD solution, addressing limitations of existing methods and contributing to the integration of synthetic anomalies with unsupervised approaches. The main contributions of this paper are summarized as follows:

• We propose a novel anomaly classification and localization method called GRAD. This method introduces an abnormal grid that incorporates knowledge from synthetic anomalies to refine the boundaries of normal features, significantly enhancing the detection performance for fine-grained anomalies.
• We design a lightweight method for anomaly synthesis at the feature level, called FBP, which allows flexible control over the location, size, intensity, and shape of synthetic anomalies.
• We tested GRAD on three image anomaly detection datasets: MVTec-AD [10], VisA [11], and GoodsAD [12]. The results show that GRAD achieves top-tier anomaly detection performance under a unified setting, overcoming the limitations of existing methods in detecting fine-grained anomalies.

II Related Work

This section reviews unsupervised anomaly detection methods, including reconstruction-based, embedding-based, and synthesis-based approaches. It also highlights the effectiveness of grid feature sampling for reconstruction-based anomaly detection.

II-A Unsupervised Anomaly Detection

Existing unsupervised anomaly detection methods can be broadly categorized into three types:

a) Reconstruction-based methods: These methods assume that models trained on normal samples reconstruct normal areas well but struggle with anomalies. Early efforts used various generative models such as AE [13, 14, 15, 16, 17], GAN [18, 19, 20], Transformer [21, 22], and Diffusion Model [23] to learn the normal data distribution, attempt to replicate the input data, and detect anomalies through reconstruction errors.

b) Embedding-based methods: These methods extract and store normal image representations from pre-trained networks, identifying anomalies via feature comparison. SPADE uses a multi-resolution semantic pyramid, PaDiM models the normal class with multivariate Gaussian distributions, and PatchCore employs greedy coreset subsampling for a memory-efficient approach. Anomalies are detected through feature cataloging and comparison.

c) Synthesis-based methods: These methods create anomalies on normal images, turning anomaly detection into supervised learning. CutPaste [24] cuts and pastes patches randomly. DRÆM [25] synthesizes pseudo anomalies using Perlin noise combined with out-of-distribution samples. NSA [26] merges scaled patches with Poisson image editing. SimpleNet [27] adds Gaussian noise in the feature space.

II-B Grid Feature Representation

In the evolution of neural fields or neural representations, grid-based representations of signals parameterized by coordinate functions have proven effective across a range of applications, such as image and video processing [28], 3D reconstruction [29], and novel view synthesis [30]. These grid structures efficiently capture high-frequency details without spectral bias and facilitate effective feature generalization through continuous feature spaces.

Considering the benefits of grid representation, CRAD uses a continuous grid for anomaly detection, replacing discrete feature memory banks to improve generalization and address the Identical Shortcut problem. Furthermore, it combines global and local perspectives to capture structural features and detect anomalies across multiple classes, making it well suited to unified anomaly detection. In terms of computational complexity, the O(1) time complexity of grid lookups also compares favorably with the O(n) time complexity of searching discrete memory banks.

III Method

GRAD mainly consists of a Feature Extractor, a Bi-Grid Reconstruction module, and an FBP module. In this section, we will explain these three components in sequence and provide details on our training and inference processes at the end.

Figure 3: (a) Multiple anomaly patterns yield similar feature maps from the pretrained extractor. (b) Our FBP module can transform normal images into abnormal ones in the feature space.

Figure 4: Qualitative results of GRAD on three datasets. The rows of the figure show anomaly images, the corresponding ground truths, and results from different methods. Notably, even for extremely subtle anomalies in categories such as Macaroni2, Drink_Bottle, and Food_Bottle, our model provides precise localization results.

TABLE I: Image- and pixel-level AUROC↑ / AUPR↑ on the GoodsAD dataset. The * next to SimpleNet indicates that it is trained under the separated setting.

III-A Feature Extractor

We define the feature extraction process as a preliminary step for subsequent work. Training and test sets are $\mathcal{X}_{Train}$ and $\mathcal{X}_{Test}$, respectively, with $\mathcal{X}_{Train}$ containing only normal samples and $\mathcal{X}_{Test}$ including both normal and abnormal samples. For a sample $x_{i}\in\mathbb{R}^{3\times H\times W}$, we use an EfficientNet-b6 [31] pre-trained on ImageNet to extract features $\Phi^{i}\sim\Phi(x_{i})$. Due to data bias in the pre-trained network [3], we adapt it by selecting intermediate layers. For example, we select layers 3 and 4 from EfficientNet-b6 layers 1 to 5, denoting them as $\phi^{l,i}$, where $l\in L=\{3,4\}$ represents the selected layers.

Next, we align the feature maps from different levels to the same size, and finally concatenate them along the channel dimension to obtain the aligned features for this stage:

$$\phi_{aligned}(x_{i})=f_{cat}(\{f_{resize}(\phi^{l,i},(H_{max},W_{max}))\mid l\in L\})\tag{1}$$

where $H_{max}$ and $W_{max}$ are the maximum height and width over all feature maps.
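The alignment step of Eq. (1) can be sketched in NumPy as below. This is only an illustration: the text does not state which interpolation $f_{resize}$ uses, so nearest-neighbor resizing is an assumption here, and the function names `resize_nearest` and `align_features` as well as the toy channel counts are hypothetical.

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map (assumed f_resize)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h   # source row index for each output row
    cols = np.arange(out_w) * w // out_w   # source column index for each output column
    return feat[:, rows[:, None], cols[None, :]]

def align_features(feature_maps):
    """Resize every map to (H_max, W_max) and concatenate along channels (f_cat)."""
    h_max = max(f.shape[1] for f in feature_maps)
    w_max = max(f.shape[2] for f in feature_maps)
    resized = [resize_nearest(f, h_max, w_max) for f in feature_maps]
    return np.concatenate(resized, axis=0)

# Toy stand-ins for the layer-3 and layer-4 feature maps of one image.
phi3 = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
phi4 = np.arange(5 * 2 * 2, dtype=float).reshape(5, 2, 2)
aligned = align_features([phi3, phi4])   # shape (3 + 5, 4, 4)
```

The coarser map is upsampled to the resolution of the finer one, so every spatial location of the aligned tensor carries channels from both layers.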

III-B Bi-Grid Reconstruction

The Bi-Grid Reconstruction includes a normal grid and an abnormal grid:

Normal Grid: Trained solely on normal samples, this grid reconstructs input features into normal features via grid sampling. It addresses the Identical Shortcut (IS) problem by interpolating normal features for abnormal patches.
Abnormal Grid: Trained with artificially synthesized anomalies or external anomaly samples, this grid helps refine normal feature boundaries during anomaly detection.

The normal grid captures both local and global features, while the abnormal grid categorizes features into normal and anomaly classes using masks and contrastive learning. We utilize the grid sampling method to obtain the following two features from the normal and abnormal grids, respectively: $\hat{x}_{i,n}$ and $\hat{x}_{i,a}$. Due to space limitations, please refer to [7] for specific details.

The features from two grids are fused through element-wise addition to produce preliminary reconstructed features:

$$\hat{x}^{rec}_{i}=\lambda\hat{x}_{i,n}\oplus(1-\lambda)\hat{x}_{i,a}\tag{2}$$

Here are some details regarding the training and inference of GRAD:

Training: During the training phase, the normal grid of GRAD learns normal patterns through a reconstruction task using the Mean Squared Error (MSE) loss as the objective function:

$$\mathcal{L}_{rec}=\frac{1}{CHW}\|\phi_{aligned}(x_{i})-\hat{x}^{rec}_{i}\|_{2}\tag{3}$$

where $\phi_{aligned}(x)$ is the aligned feature of the input and $\hat{x}$ is the feature reconstructed by the grid.
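As a worked illustration of Eq. (2) and Eq. (3), the sketch below fuses two sampled features and evaluates the reconstruction objective. Several choices are assumptions: the value of $\lambda$ is not given in the text, so `lam=0.75` is arbitrary; and since the text names the loss as MSE, the sketch averages squared differences over $C\times H\times W$, whereas Eq. (3) as printed shows an unsquared norm. The function names are hypothetical.

```python
import numpy as np

def fuse(x_n, x_a, lam):
    """Element-wise fusion of normal-grid and abnormal-grid features, as in Eq. (2)."""
    return lam * x_n + (1.0 - lam) * x_a

def reconstruction_loss(phi_aligned, x_rec):
    """Mean squared error over all C*H*W entries (assumed reading of Eq. (3))."""
    c, h, w = phi_aligned.shape
    return float(((phi_aligned - x_rec) ** 2).sum() / (c * h * w))

phi_aligned = np.ones((2, 3, 3))       # toy aligned input features
x_n = np.ones((2, 3, 3))               # toy sample from the normal grid
x_a = np.zeros((2, 3, 3))              # toy sample from the abnormal grid
x_rec = fuse(x_n, x_a, lam=0.75)       # lam is a hypothetical value
loss = reconstruction_loss(phi_aligned, x_rec)
```

With these toy inputs every reconstructed entry equals 0.75, so each squared error is 0.0625 and the mean is the same value.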

For the anomaly grid, we adopt a contrastive learning objective that increases the distance between the stored normal and anomaly features. We use the following truncated L1 loss:

$$\mathcal{L}_{con}=\sum_{D^{+}}\frac{\max(0,th-d^{+})}{Len(D^{+})}+\sum_{D^{-}}\frac{\max(0,-th+d^{-})}{Len(D^{-})}\tag{4}$$

where $th$ is a manually set threshold that creates a buffer zone around the separation boundary ($th=0.5$ in our experiments); $D^{+}$ and $D^{-}$ are the sets of positive and negative sample pairs constructed via masks from the FBP module or the ground truth; $d^{+}$ denotes the similarity of positive pairs and $d^{-}$ that of negative pairs. GRAD is trained in two stages: the anomaly grid is trained first, then its parameters are frozen and the normal grid is trained.
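Eq. (4) can be read as two hinge terms, one pushing positive-pair similarities above $th$ and one pushing negative-pair similarities below it. The sketch below assumes the pair similarities have already been computed (the text does not specify the similarity function), and the function name `truncated_l1_contrastive` is hypothetical.

```python
import numpy as np

def truncated_l1_contrastive(d_pos, d_neg, th=0.5):
    """Truncated L1 loss of Eq. (4) on precomputed pair similarities."""
    d_pos = np.asarray(d_pos, dtype=float)
    d_neg = np.asarray(d_neg, dtype=float)
    # Positive pairs are penalised only while their similarity is below th.
    pos_term = np.maximum(0.0, th - d_pos).mean()
    # Negative pairs are penalised only while their similarity is above th.
    neg_term = np.maximum(0.0, d_neg - th).mean()
    return float(pos_term + neg_term)

# Two positive and two negative pairs; one of each violates the threshold by 0.2.
loss_con = truncated_l1_contrastive(d_pos=[0.9, 0.3], d_neg=[0.1, 0.7], th=0.5)
```

Pairs that already sit on the correct side of the threshold contribute zero, so gradients concentrate on pairs near or across the boundary.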

Inference: The fused features from both grids are refined using a similarity-based feature refinement module to enhance detection confidence. The final anomaly score is obtained by comparing the reconstructed features with the original aligned features:

$$pred=\|\phi_{aligned}(x_{i})-\hat{x}^{rec}_{i}\|_{2}\tag{5}$$
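A minimal sketch of how Eq. (5) might be turned into a pixel-level anomaly map: the L2 norm is taken over the channel axis at each spatial location. Taking the maximum of the map as the image-level score is a common convention and an assumption here, since the text does not state how the image-level score is derived; the function name `anomaly_map` is hypothetical.

```python
import numpy as np

def anomaly_map(phi_aligned, x_rec):
    """Per-location L2 distance between (C, H, W) input and reconstructed features."""
    return np.linalg.norm(phi_aligned - x_rec, axis=0)   # shape (H, W)

phi_aligned = np.zeros((4, 3, 3))      # toy aligned features
x_rec = np.zeros((4, 3, 3))            # toy reconstruction
x_rec[:, 1, 1] = 1.0                   # reconstruction disagrees at one location
score_map = anomaly_map(phi_aligned, x_rec)
image_score = float(score_map.max())   # assumed image-level aggregation
```

Only the location where input and reconstruction differ receives a non-zero score, which is what makes the map usable for localization.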

III-C Feature Block Paste

The FBP module is designed to facilitate the rapid training of an anomaly grid that is ready for deployment. Compared to the method of adding Gaussian noise used by SimpleNet[27], the anomalies synthesized using FBP are more diverse, resulting in a trained anomaly grid that achieves superior performance. We describe the FBP module as follows:

$$\phi_{pse\_ano},mask=FBP(\phi_{nor},M,B,I,P)\tag{6}$$

where $\phi_{pse\_ano}$ and $mask$ represent the generated pseudo-anomalies and their corresponding annotation information, respectively. FBP takes five parameters: the feature map $\phi_{nor}$ obtained from a normal image through the pretrained backbone, and the parameters $M$, $B$, $I$, $P$, which control the shape, size, intensity, and position of the generated anomalies, respectively.

Specifically, the FBP module operates by defining the block size $B$, block intensity $I$, and block center coordinates $P=(x_{c},y_{c})$. We generate the initialization mask $M$ as $M=\text{zeros}(2B+1,2B+1)$, a $(2B+1)\times(2B+1)$ matrix initialized to zero. A random walk mask is created by selecting the initial position $(x_{0},y_{0})=(B,B)$ and randomly choosing the number of steps $N$ from $[B,2B]$. The random walk updates the position as $(x_{k+1},y_{k+1})=(x_{k}+\Delta x,y_{k}+\Delta y)$ with $\Delta x,\Delta y\in\{-1,0,1\}$, marking the corresponding position in $M$ as 1.
Finally, we initialize the block paste tensor $P$ of size $1\times 1\times H\times W$ (initialized to zero), paste the block with intensity $I$ at the marked positions in $M$, apply Gaussian blur to obtain $P_{\text{blurred}}$, and paste $P_{\text{blurred}}$ onto the feature map.
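The procedure above can be sketched in NumPy as follows. This is an illustrative reading, not the authors' implementation, and it rests on several assumptions that the text leaves open: the walk is clipped to stay inside $M$, the start cell is marked, the blurred block is added to every channel of the feature map, the returned mask is the set of unblurred marked cells, and the blur uses a hand-written separable Gaussian with a hypothetical `sigma`. All function names are hypothetical.

```python
import numpy as np

def random_walk_mask(b, rng):
    """Random-walk shape mask M of size (2B+1, 2B+1), starting at the centre (B, B)."""
    size = 2 * b + 1
    mask = np.zeros((size, size), dtype=np.float32)
    x, y = b, b
    mask[y, x] = 1.0                                   # assumption: start cell is marked
    n_steps = int(rng.integers(b, 2 * b + 1))          # N drawn from [B, 2B]
    for _ in range(n_steps):
        dx, dy = rng.integers(-1, 2, size=2)           # each step in {-1, 0, 1}
        x = int(np.clip(x + dx, 0, size - 1))          # assumption: clip to stay inside M
        y = int(np.clip(y + dy, 0, size - 1))
        mask[y, x] = 1.0
    return mask

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur of an (H, W) array with zero padding."""
    radius = max(1, int(3 * sigma))
    ax = np.arange(-radius, radius + 1)
    kernel = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="constant")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def feature_block_paste(phi_nor, b, intensity, center, rng, sigma=1.0):
    """Return (pseudo-anomalous features, mask) for a (C, H, W) normal feature map."""
    _, h, w = phi_nor.shape
    walk = random_walk_mask(b, rng)
    paste = np.zeros((h, w), dtype=np.float32)         # block paste tensor, zero-initialised
    xc, yc = center
    for my, mx in zip(*np.nonzero(walk)):
        y, x = yc + my - b, xc + mx - b                # shift walk cells to the block centre
        if 0 <= y < h and 0 <= x < w:
            paste[y, x] = intensity
    blurred = gaussian_blur(paste, sigma)
    pseudo = phi_nor + blurred[None, :, :]             # assumption: additive paste on all channels
    mask = (paste != 0).astype(np.float32)             # assumption: mask is the unblurred block
    return pseudo, mask

rng = np.random.default_rng(0)
phi_nor = np.zeros((4, 16, 16))                        # toy normal feature map
pseudo, mask = feature_block_paste(phi_nor, b=2, intensity=3.0, center=(8, 8), rng=rng)
```

Varying `b`, `intensity`, `center`, and the random seed changes the size, strength, location, and shape of the synthetic anomaly, which mirrors the four controls described in the text.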

IV Experiments

IV-A Experiments Setup

The methods used in our experiments follow a unified setup, where only one model is trained for all categories in the dataset, rather than a one-model-per-category approach, with the exception of SimpleNet.

Datasets. We assessed GRAD on three datasets: MVTec-AD, VisA, and GoodsAD. MVTec-AD is a benchmark for industrial anomaly detection, VisA offers detailed pixel-level annotations for real-world scenarios, and GoodsAD focuses on anomalies in retail products, expanding the scope of anomaly detection to retail automation. Our experiments on these datasets evaluate the methods’ performance and adaptability across various contexts.

Methods. We assembled a benchmark of advanced unsupervised anomaly detection methods, spanning reconstruction-based, synthesis-based, and embedding-based categories. The methods evaluated include PaDiM, RIAD [32], DFR [33], UniAD, PatchCore, SimpleNet, and CRAD.

Metrics. Adhering to standard conventions, we employ both the Area Under the Receiver Operating Characteristic curve (AUROC/AUC) and the Area Under the Precision-Recall curve (AUPR/AP) as metrics for assessing the performance of our models.
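For readers unfamiliar with the two metrics, the sketch below computes them from anomaly scores and binary labels using only NumPy. AUROC is computed as the fraction of (anomalous, normal) pairs ranked correctly, with ties counted as half; AUPR is computed as average precision, without special handling of tied scores. The function names are hypothetical, and evaluation code in practice would typically rely on an established library.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random anomalous sample scores higher than a random normal one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each anomalous sample."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(-scores, kind="stable")        # highest score first
    ranked = labels[order]
    precision = np.cumsum(ranked) / np.arange(1, len(ranked) + 1)
    return float((precision * ranked).sum() / ranked.sum())

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]                                 # 1 marks an anomalous sample
roc_value = auroc(scores, labels)                     # 3 of 4 pairs ranked correctly
ap_value = average_precision(scores, labels)          # (1/1 + 2/3) / 2
```

The same functions apply at the image level (one score per image) and at the pixel level (one score per pixel, flattened).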

Figure 5: Comparing inference speed (FPS), I-AUROC, and memory occupancy on GoodsAD showcases the comprehensive performance of our model.

IV-B Comparison with Other Methods

As shown in Table I, our model excels in both image-level and pixel-level evaluations on the GoodsAD dataset, achieving the highest mean AUROC of 79.4% (+3.4%) and AUPR of 81.4% (+2.8%) at the image level, and the leading AUROC of 96.8% (+1.1%) and AUPR of 36.4% (+3.0%) at the pixel level. Additionally, the visualization results in Figure 4 demonstrate that our model significantly outperforms existing methods in complex scenarios like GoodsAD and in detecting fine-grained defects. For more metrics and visualization results on MVTec-AD and VisA, please refer to the Appendix.

Additionally, we compared GRAD with other methods in terms of memory usage and detection speed, as shown in Figure 5. Our model’s memory usage is only 4.8 GB, significantly lower than PaDiM’s 70.4 GB and SimpleNet’s 10.69 GB. In terms of speed, our model achieves approximately 75 FPS, a moderate speed among the compared methods. This indicates that our model not only outperforms other methods in detection accuracy but also maintains competitive spatio-temporal efficiency.

IV-C Ablation Study

Normal-Grid and Abnormal-Grid. We conducted an ablation study to assess the contribution of the Normal-Grid (N-Grid) and Abnormal-Grid (A-Grid) to model performance. The results, summarized in Table II, indicate that using both N-Grid and A-Grid yields the best results on both the MVTec-AD and GoodsAD datasets. Specifically, with both grids the model achieved 99.8% (+0.8%) and 54.2% (+2.6%) Image/Pixel AUPR on MVTec-AD, and 81.4% (+4.1%) and 36.4% (+6.6%) Image/Pixel AUPR on GoodsAD. This indicates that knowledge learned from synthetic anomalies, introduced in addition to the normal grid, further enhances the performance of the model. For more ablation studies, please refer to the supplementary materials.

TABLE II: Ablation study for N-Grid and A-Grid

V Conclusion

In this paper, we introduce GRAD, which incorporates an abnormal grid along with the FBP anomaly synthesis module, addressing the limitations of existing unsupervised and self-supervised methods in handling fine-grained anomalies. The success of GRAD lies in its use of an additional abnormal grid to refine the boundaries of normal features, and in the Feature Block Paste (FBP) module for efficient and flexible anomaly synthesis at the feature level. Comprehensive experiments on industrial datasets such as MVTecAD, VisA, and the latest GoodsAD demonstrate GRAD’s superior performance and improved detection accuracy.

Acknowledgment

This work is supported by ‘the Fundamental Research Funds for Central Universities, NO.NJ2024031’.

References

[1] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier, “Padim: a patch distribution modeling framework for anomaly detection and localization,” 2020.
[2] Niv Cohen and Yedid Hoshen, “Sub-image anomaly detection with deep pyramid correspondences,” 2021.
[3] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler, “Towards total recall in industrial anomaly detection,” 2022.
[4] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel, “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” 2019.
[5] Jinlei Hou, Yingying Zhang, Qiaoyong Zhong, Di Xie, Shiliang Pu, and Hong Zhou, “Divide-and-assemble: Learning block-wise memory for unsupervised anomaly detection,” 2021.
[6] Zhiyuan You, Lei Cui, Yujun Shen, Kai Yang, Xin Lu, Yu Zheng, and Xinyi Le, “A unified model for multi-class anomaly detection,” 2022.
[7] Joo Chan Lee, Taejune Kim, Eunbyung Park, Simon S. Woo, and Jong Hwan Ko, “Continuous memory representation for anomaly detection,” 2024.
[8] Yuxuan Duan, Yan Hong, Li Niu, and Liqing Zhang, “Few-shot defect image generation via defect-aware feature manipulation,” 2023.
[9] Teng Hu, Jiangning Zhang, Ran Yi, Yuzhen Du, Xu Chen, Liang Liu, Yabiao Wang, and Chengjie Wang, “Anomalydiffusion: Few-shot anomaly image generation with diffusion model,” 2024.
[10] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger, “Mvtec ad–a comprehensive real-world dataset for unsupervised anomaly detection,” in CVPR, 2019, pp. 9592–9600.
[11] Yihang Li, Shuichiro Shimizu, Weiqi Gu, Chenhui Chu, and Sadao Kurohashi, “Visa: An ambiguous subtitles dataset for visual scene-aware machine translation,” 2022.
[12] Jian Zhang, Runwei Ding, Miaoju Ban, and Linhui Dai, “Pku-goodsad: A supermarket goods dataset for unsupervised anomaly detection and segmentation,” IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2008–2015, Mar. 2024.
[13] Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, and Carsten Steger, “Improving unsupervised defect segmentation by applying structural similarity to autoencoders,” arXiv preprint arXiv:1807.02011, 2018.
[14] Jinghui Chen, Saket Sathe, Charu Aggarwal, and Deepak Turaga, “Outlier detection with autoencoder ensembles,” in Proceedings of the 2017 SIAM international conference on data mining. SIAM, 2017, pp. 90–98.
[15] Chong Zhou and Randy C Paffenroth, “Anomaly detection with robust deep autoencoders,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 665–674.
[16] David Dehaene, Oriel Frigo, Sébastien Combrexelle, and Pierre Eline, “Iterative energy-based projection on a normal data manifold for anomaly localization,” arXiv preprint arXiv:2002.03734, 2020.
[17] Wenqian Liu, Runze Li, Meng Zheng, Srikrishna Karanam, Ziyan Wu, Bir Bhanu, Richard J Radke, and Octavia Camps, “Towards visually explaining variational autoencoders,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8642–8651.
[18] Mohammad Sabokrou, Mohammad Khalooei, Mahmood Fathy, and Ehsan Adeli, “Adversarially learned one-class classifier for novelty detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3379–3388.
[19] Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Georg Langs, and Ursula Schmidt-Erfurth, “f-anogan: Fast unsupervised anomaly detection with generative adversarial networks,” Medical image analysis, vol. 54, pp. 30–44, 2019.
[20] Yufei Liang, Jiangning Zhang, Shiwei Zhao, Runze Wu, Yong Liu, and Shuwen Pan, “Omni-frequency channel-selection representations for unsupervised anomaly detection,” IEEE Transactions on Image Processing, 2023.
[21] Jonathan Pirnay and Keng Chai, “Inpainting transformer for anomaly detection,” 2021.
[22] Xincheng Yao, Ruoqi Li, Zefeng Qian, Yan Luo, and Chongyang Zhang, “Focus the discrepancy: Intra- and inter-correlation learning for image anomaly detection,” 2023.
[23] Haoyang He, Jiangning Zhang, Hongxu Chen, Xuhai Chen, Zhishan Li, Xu Chen, Yabiao Wang, Chengjie Wang, and Lei Xie, “Diad: A diffusion-based framework for multi-class anomaly detection,” in AAAI, 2024.
[24] Chun-Liang Li, Kihyuk Sohn, Jinsung Yoon, and Tomas Pfister, “Cutpaste: Self-supervised learning for anomaly detection and localization,” 2021.
[25] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj, “Draem – a discriminatively trained reconstruction embedding for surface anomaly detection,” 2021.
[26] Hannah M Schlüter, Jeremy Tan, Benjamin Hou, and Bernhard Kainz, “Natural synthetic anomalies for self-supervised anomaly detection and localization,” in European Conference on Computer Vision. Springer, 2022, pp. 474–489.
[27] Zhikang Liu, Yiming Zhou, Yuansheng Xu, and Zilei Wang, “Simplenet: A simple network for image anomaly detection and localization,” 2023.
[28] Jun Gao, Zian Wang, Jinchen Xuan, and Sanja Fidler, “Beyond fixed grid: Learning geometric image representation with a deformable grid,” 2023.
[29] Chiyu Max Jiang, Avneesh Sud, Ameesh Makadia, Jingwei Huang, Matthias Nießner, and Thomas Funkhouser, “Local implicit grid representations for 3d scenes,” 2020.
[30] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su, “Tensorf: Tensorial radiance fields,” 2022.
[31] Mingxing Tan and Quoc V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” 2020.
[32] Vitjan Zavrtanik, Matej Kristan, and Danijel Skočaj, “Reconstruction by inpainting for visual anomaly detection,” Pattern Recognition, vol. 112, pp. 107706, 2021.
[33] Yong Shi, Jie Yang, and Zhiquan Qi, “Unsupervised anomaly segmentation via deep feature reconstruction,” Neurocomputing, vol. 424, pp. 9–22, Feb. 2021.
