Title: MobileIQA: Exploiting Mobile-level Diverse Opinion Network For No-Reference Image Quality Assessment Using Knowledge Distillation
URL Source: https://arxiv.org/html/2409.01212
Published Time: Wed, 04 Sep 2024 01:32:00 GMT
Markdown Content:

Affiliations:
1. State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
2. School of Artificial Intelligence, University of Chinese Academy of Sciences
3. Beijing Union University
4. China University of Petroleum
5. Hebei University
6. PeopleAI Inc., Beijing, China
7. School of Information Science and Technology, ShanghaiTech University
8. Shanghai Transsion Information Technology Limited

Emails: chenzewen2022@ia.ac.cn, 20221081210206@buu.edu.cn, {cup_zy1, hcguo_hbu}@163.com, 1418319765@qq.com, 20231081210210@buu.edu.cn, jun_wang@ia.ac.cn, {bli, wmhu}@nlpr.ia.ac.cn, {dehua.liu, hesong.li}@transsion.com

Authors: Sunhan Xu, Yun Zeng (ORCID 0009-0004-5934-1528), Haochen Guo, Jian Guo, Shuai Liu, Juan Wang (ORCID 0000-0002-3848-9433), Bing Li (ORCID 0000-0001-6114-1411), Weiming Hu (ORCID 0000-0001-9237-8825), Dehua Liu, Hesong Li
Abstract
With the rising demand for high-resolution (HR) images, No-Reference Image Quality Assessment (NR-IQA) has gained increasing attention, as it can evaluate image quality in real time on mobile devices and enhance the user experience. However, existing NR-IQA methods often resize or crop HR images to a small resolution, which leads to a loss of important details. Moreover, most of them have high computational complexity, which hinders their application on mobile devices with limited computational resources. To address these challenges, we propose MobileIQA, a novel approach that utilizes lightweight backbones to efficiently assess image quality while preserving image details through high-resolution input. MobileIQA employs the proposed multi-view attention learning (MAL) module to capture diverse opinions, simulating the subjective opinions provided by different annotators during dataset annotation. The model uses a teacher model to guide the learning of a student model through knowledge distillation. This method significantly reduces computational complexity while maintaining high performance. Experiments demonstrate that MobileIQA outperforms recent IQA methods in both evaluation metrics and computational efficiency. The code is available at https://github.com/chencn2020/MobileIQA.
Keywords:
NR-IQA High Resolution Computing Efficiency
1 Introduction
Image quality assessment (IQA) is a long-standing research topic in image processing. According to the availability of reference images, IQA can be categorized into three types: full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA), and no-reference IQA (NR-IQA). Among them, NR-IQA has gained the most attention since it removes the dependence on reference images, which are unavailable in many real-world applications.
Figure 1: Comparison among SOTA IQA methods on UHD-IQA [10] validation set in terms of KROCC, SROCC, PLCC and MACs.
With the development of mobile imaging technology, capturing high-resolution (HR) images (such as 4K) using mobile devices, such as cameras and smartphones, has become increasingly popular. The higher the quality of these images is, the better the user experience will be. Therefore, evaluating the quality of HR images in real-time on mobile devices is crucial.
Over the past decades, numerous efforts have been devoted to NR-IQA, such as developing sophisticated networks [17, 33, 5], proposing proxy tasks [6, 19, 29], and introducing Vision-Language Models (VLMs) [32, 31]. Although these methods have improved the performance of IQA models in various aspects, they still encounter two major challenges when assessing the quality of HR images on mobile devices. (1) Limited input resolution: Most methods resize or crop HR images to a smaller resolution, typically 224×224, which covers only about 1% of the pixels of a 4K image. This process results in the loss of important image details, thereby limiting the model's generalization and performance. (2) High computational complexity: Most of these methods employ computationally intensive backbones such as ResNet [9] or the vision transformer (ViT) [7]. However, the limited computational resources available on mobile devices make it challenging to run these models efficiently on such platforms. These two challenges significantly hinder the application of existing IQA methods on mobile devices.
In this paper, we introduce MobileIQA, which achieves outstanding performance with significantly fewer multiply-accumulate operations (MACs) to tackle these challenges. IQA is a highly subjective task: different individuals perceive quality differently, leading to variations in their ratings of the same image. Therefore, the ground truth (GT) label of an image is defined as the average of subjective scores provided by multiple human annotators, namely the mean opinion score (MOS). Mimicking the human rating process, we develop a multi-view attention learning (MAL) module for MobileIQA to implicitly learn diverse opinion features by capturing complementary contexts from various perspectives. The opinion features collected from different MALs are integrated into a comprehensive quality score, facilitating more reliable quality assessment.
MobileIQA consists of a teacher model (MobileViT-IQA) and a student model (MobileNet-IQA), which use the lightweight MobileViT [24] and MobileNet [13] as backbones, respectively. These lightweight backbones support a maximum input resolution of 1907×1231, effectively preserving the details in HR images. Although MobileViT-IQA outperforms MobileNet-IQA thanks to its global attention mechanism, it is less computationally efficient. To address this, we employ knowledge distillation, using MobileViT-IQA as the teacher network to guide the learning of MobileNet-IQA. This approach significantly reduces computational complexity while improving the performance of MobileNet-IQA. As shown in Fig. 1, our model demonstrates excellent performance in terms of three evaluation metrics and MACs compared to recent IQA models. Overall, our contributions are summarized as follows:
1. We propose MobileIQA, which integrates diverse opinion features produced by our carefully designed MAL modules, effectively enhancing model performance.
2. We employ knowledge distillation to transfer knowledge from the teacher network to the student network, significantly reducing computational complexity while maintaining performance.
3. Extensive experimental results demonstrate that our MobileNet-IQA achieves higher accuracy and computational efficiency, significantly outperforming many advanced methods.
2 Related Works
Due to the remarkable progress in vision applications, considerable attention has been focused on elevating the performance of IQA. As a pioneering work, [15] designs a convolutional neural network (CNN) for IQA to extract image features, which is later extended to a multi-task CNN [16]. However, insufficient training samples limit the effective learning of CNN-based models. For this reason, some methods [27, 25, 34] employ pre-trained networks, such as ResNet [9] and ViT [7], as feature extractors. However, recent research [39, 6] points out that these popular networks, pre-trained for high-level tasks, are not well suited to IQA. Therefore, some works pre-train models on related pretext tasks, e.g., image restoration [18, 20], quality ranking [19, 21], and contrastive learning [38, 23]. Other methods enhance IQA performance by introducing auxiliary information. For instance, Wang et al. and Saha et al. [28, 26] integrate textual information into IQA. Zhang et al. [37] explore the relationship among multiple tasks, namely IQA, scene classification, and distortion classification. Additionally, many methods use ensemble learning to aggregate IQA-related knowledge for more effective learning. [22] collects a set of existing IQA models for annotation; the annotated samples are used to train a model that learns both the quality score and its uncertainty. Some methods [29, 35, 36] propose multi-dataset training strategies. The IQA task has also been approached as a quality ranking problem. Gao et al. [8] utilize cross-entropy loss to measure the discrepancy between predicted quality rankings and GT binary labels for each image pair. Liu et al. [19] use hinge loss to define the optimization objective for quality ranking learning, while Ma et al. [21] apply learning-to-rank algorithms such as RankNet [3] and ListNet [4] to train IQA models on numerous image pairs.
Although existing methods have improved IQA performance by addressing various aspects of the model, they take traditional computer vision resolutions, such as 224×224 or 256×256, as input, which limits their adaptability to the HR IQA task. Additionally, most of them utilize computationally intensive backbones like ResNet or ViT, making them challenging to deploy on resource-constrained mobile devices. To address this, we propose MobileIQA, a mobile-level IQA model based on diverse opinions and knowledge distillation. By leveraging lightweight backbones and employing knowledge distillation, our model significantly reduces computational complexity while maintaining performance.
3 Proposed Method
3.1 Model Design
In this work, we present a novel network called MobileIQA, which uses teacher-student distillation [14] as the training technique. Both the teacher and student models use lightweight backbones for feature extraction and collect diverse opinions by capturing complementary attention contexts to make a comprehensive decision on the image quality score. Fig. 2 shows the architecture of the teacher network (MobileViT-IQA), which mainly consists of four parts: (1) a pre-trained MobileViT employed for multi-level feature perception; (2) local distortion aware (LDA) modules used for unifying multi-level feature dimensions; (3) multi-view attention learning (MAL) modules proposed for opinion collection; and (4) an image quality score regression module designed for quality estimation. The architecture of MobileNet-IQA is the same, but uses MobileNet as the backbone. In the following, we introduce MobileViT-IQA in detail.
Figure 2: Framework of the teacher model (MobileViT-IQA). The student model (MobileNet-IQA) shares the same framework, but takes MobileNet as backbone.
3.1.1 (A) Multi-level Feature Perception.
The blocks in MobileViT replace the local processing of traditional CNNs with global processing via transformers, integrating characteristics of both CNNs and ViTs. This architecture enables MobileViT to learn representations more efficiently. Given an image $I\in\mathbb{R}^{3\times H\times W}$, we extract features from the MobileViT. Prior work shows that multi-level features are useful for the IQA task [6, 5, 29, 27, 12]. Thus, we extract multi-level features from the five stages of MobileViT, denoted as $f_{j}\in\mathbb{R}^{C_{j}\times H_{j}\times W_{j}}$, where $C_{j}$, $H_{j}$, and $W_{j}$ represent the dimensions of the feature map at the $j$-th stage and $1\leq j\leq 5$.
3.1.2 (B) Local Distortion Aware Module.
The Local Distortion Aware (LDA) module serves two key functions: (1) it extracts local features using a CNN with a small receptive field; (2) it standardizes the dimensions of these features using an adaptive pooling operation. Specifically, for an input feature $f_{j}\in\mathbb{R}^{C_{j}\times H_{j}\times W_{j}}$, a $1\times 1$ convolution is applied to double the channel dimension to $2C_{j}$. After GELU activation, the adaptive pooling operation reshapes the feature into $\mathbb{R}^{2C_{j}\times D\times N}$, where $D$ and $N$ denote the pooled dimensions. Another $1\times 1$ convolution reduces the channel dimension back to $C_{j}$, producing an aware feature $f_{j}\in\mathbb{R}^{C_{j}\times D\times N}$ for the $j$-th stage.
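The LDA computation can be sketched in NumPy as follows; the channel count, pooled sizes `D` and `N`, and the random weights below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def adaptive_avg_pool(x, out_h, out_w):
    # x: (C, H, W) -> (C, out_h, out_w), using the usual bin boundaries
    c, h, w = x.shape
    out = np.zeros((c, out_h, out_w))
    for i in range(out_h):
        h0, h1 = (i * h) // out_h, -(-((i + 1) * h) // out_h)  # floor / ceil
        for j in range(out_w):
            w0, w1 = (j * w) // out_w, -(-((j + 1) * w) // out_w)
            out[:, i, j] = x[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def lda(f, w_up, w_down, D=16, N=16):
    # Expand channels C -> 2C, GELU, pool to (D, N), reduce channels back to C.
    x = gelu(conv1x1(f, w_up))
    x = adaptive_avg_pool(x, D, N)
    return conv1x1(x, w_down)

rng = np.random.default_rng(0)
C, H, W, D, N = 8, 60, 40, 16, 16
f = rng.standard_normal((C, H, W))
out = lda(f, rng.standard_normal((2 * C, C)), rng.standard_normal((C, 2 * C)), D, N)
print(out.shape)  # (8, 16, 16)
```

Whatever the stage's spatial size, every stage ends up with the same $C_j\times D\times N$ shape, which is what lets the later MALs concatenate features across stages.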
3.1.3 (C) Multi-view Attention Learning Module.
The critical component of MobileIQA is the multi-view attention learning (MAL) module. The motivation behind it is that individuals often have diverse subjective perceptions and regions of interest when viewing the same image. To this end, we employ multiple MALs to learn attention from different viewpoints. Each MAL is initialized with different weights and updated independently to encourage diversity and avoid redundant output features. The number of MALs can be flexibly set as a hyper-parameter; in this work, we set it to 3 and show its effect on model performance in our experiments.
As shown in Fig. 2, the MAL starts from $N$ self-attention (SA) blocks, each of which processes a basic feature $\mathbf{f}_{j}$ ($1\leq j\leq N$). The outputs of all the SAs are concatenated, forming a multi-level aware feature $\mathbf{F}\in\mathbb{R}^{C\times D\times N}$. Then $\mathbf{F}$ passes through two branches, i.e., a feature-wise SA branch and a channel-wise SA branch, which apply an SA across the spatial and channel dimensions, respectively, to capture complementary non-local contexts and generate multi-view attention maps. In particular, for the channel-wise SA, the feature $\mathbf{F}$ is first reshaped and permuted from $C\times D\times N$ to $D\times(C\times N)$. After the SA, the output feature is permuted and reshaped back to the original size $C\times D\times N$. Subsequently, the outputs of the two branches are added and average-pooled, generating an opinion feature. The design of the two branches has two key advantages. First, applying SA along different dimensions promotes diverse attention learning, yielding complementary information. Second, contextualized long-range relationships are aggregated, benefiting global quality perception.
In MobileIQA, there are four MALs in total. Three of them independently extract opinion features from the five-level features captured from the LDAs, representing the perspectives of different annotators during data annotation. The fourth MAL fuses these three opinion features into a final quality feature.
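The two-branch reshape-and-attend scheme can be sketched as below. This is a minimal NumPy illustration: single-head attention, random weights, and pooling over the level axis $N$ are our assumptions, not the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # x: (tokens, dim); single-head scaled dot-product attention
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

C, D, N = 8, 6, 4
rng = np.random.default_rng(0)
F = rng.standard_normal((C, D, N))  # multi-level aware feature

# Feature-wise branch: the C channels are tokens of dimension D*N.
Ff = F.reshape(C, D * N)
wq, wk, wv = (rng.standard_normal((D * N, D * N)) for _ in range(3))
feat_branch = self_attention(Ff, wq, wk, wv).reshape(C, D, N)

# Channel-wise branch: permute (C, D, N) -> (D, C*N), attend, permute back.
Fc = F.transpose(1, 0, 2).reshape(D, C * N)
uq, uk, uv = (rng.standard_normal((C * N, C * N)) for _ in range(3))
chan_branch = self_attention(Fc, uq, uk, uv).reshape(D, C, N).transpose(1, 0, 2)

# Sum the branches and average-pool (here over the level axis) -> opinion feature.
opinion = (feat_branch + chan_branch).mean(axis=2)
print(opinion.shape)  # (8, 6)
```

The point of the two reshapes is that the same attention primitive sees different token sets: rows of channels in one branch, rows of spatial positions in the other, so the learned contexts are complementary rather than redundant.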
3.1.4 (D) Image Quality Score Regression.
Assume that $M$ opinion features are generated from the $M$ MALs employed in MobileIQA. To derive a global quality score from the collected opinion features, we utilize an additional MAL. This MAL integrates diverse contextual perspectives, resulting in a comprehensive opinion feature that captures the essential information. This feature is then processed by two convolution layers with kernel sizes of $1\times 1$ and $3\times 3$ to reduce the number of channels, followed by two fully connected layers that transform the feature size from 128 to 64 and from 64 to 1, yielding the predicted quality score.
3.2 Knowledge Distillation
Despite the superior performance of MobileViT-IQA, its computational complexity still poses a burden on mobile devices. In contrast, MobileNet-IQA requires less computation but does not match the performance of MobileViT-IQA. To address this, we design a distillation process, illustrated in Fig. 3, where MobileViT-IQA serves as the teacher model, guiding the learning of the student model MobileNet-IQA. Since MobileNet-IQA and MobileViT-IQA share the same architecture except for the backbone, the distillation process is straightforward and efficient. Considering that the different MALs in MobileIQA simulate the opinions of different evaluators, we apply an MSE loss to minimize the discrepancy between the MAL outputs of the teacher and student models, thereby enabling the MALs in the student to approximate the opinions of the MALs in the teacher.
Figure 3: Knowledge distillation process. MSE loss is used to minimize the discrepancy between the Student Opinion Features and the Teacher Opinion Features.
Specifically, given an image $I\in\mathbb{R}^{3\times H\times W}$, the teacher and student models extract multi-level aware features for all five stages, $f_{i}^{T}$ and $f_{i}^{S}$, respectively. These features are then processed by the three MALs in both models, producing teacher opinion features $\mathbf{F}_{i}^{T}$ and student opinion features $\mathbf{F}_{i}^{S}$. The discrepancy between these two sets of opinion features is minimized using an MSE loss, effectively allowing the teacher model to guide the student model in how to assess images. This process can be formulated as follows:
$$l_{d}=\frac{1}{3}\sum_{i=1}^{3}\mathbf{MSE}(\mathbf{F}_{i}^{T},\mathbf{F}_{i}^{S}).\qquad(1)$$
Meanwhile, to improve the score prediction accuracy of the student model, we additionally employ an MSE loss during distillation to minimize the discrepancy between the student's predicted scores and the GT scores. The optimization objective of the distillation is to minimize the following loss function:
$$l=l_{d}+\alpha\times\mathbf{MSE}(P,G),\qquad(2)$$
where $P$ represents the predicted score and $G$ the ground truth, with $\alpha$ denoting a constant weight.
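Eqs. (1) and (2) translate directly into code. The feature shapes and the toy values below are illustrative; $\alpha=2$ matches the implementation details later in the paper:

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def distillation_loss(teacher_feats, student_feats, pred, gt, alpha=2.0):
    # Eq. (1): average MSE over the three MAL opinion features;
    # Eq. (2): add the alpha-weighted score-regression MSE.
    l_d = np.mean([mse(t, s) for t, s in zip(teacher_feats, student_feats)])
    return l_d + alpha * mse(pred, gt)

rng = np.random.default_rng(0)
T = [rng.standard_normal((8, 6)) for _ in range(3)]        # teacher opinion features
S = [t + 0.1 * rng.standard_normal(t.shape) for t in T]    # noisy student features
loss = distillation_loss(T, S, pred=np.array([0.52]), gt=np.array([0.50]))
zero = distillation_loss(T, T, pred=np.array([0.5]), gt=np.array([0.5]))
print(loss > 0.0, zero)  # True 0.0
```

The loss vanishes exactly when the student's opinion features and predicted score coincide with the teacher's features and the GT, which is the fixed point the distillation drives toward.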
4 Experiments
4.1 Datasets
We train and evaluate our model on the UHD-IQA [10] dataset, which contains 6,073 HR images in total, of which 4,269 and 904 are used for training and validation, respectively. The organizers of the UHD-IQA Challenge: Pushing the Boundaries of Blind Photo Quality Assessment [11], held at the AIM 2024 Workshop (https://www.cvlai.net/aim/2024/), use the remaining 900 inaccessible images as the test set to evaluate performance. For training and distillation, only the training set of UHD-IQA is used, without any additional datasets.
4.2 Evaluation Metrics
We evaluate the performance of IQA models using five metrics: Kendall Rank Correlation Coefficient (KRCC), Spearman Rank-Order Correlation Coefficient (SRCC), Pearson Linear Correlation Coefficient (PLCC), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). SRCC and KRCC assess monotonicity, PLCC measures the linearity of the model's predictions, and RMSE and MAE indicate prediction accuracy. An effective IQA model should have KRCC, SRCC, and PLCC values approaching 1, while minimizing RMSE and MAE toward 0.
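The three correlation metrics follow directly from their definitions; a NumPy sketch is below (the tau-a variant of Kendall's coefficient, without tie correction, is a simplifying assumption, and the MOS/prediction values are made up):

```python
import numpy as np

def plcc(x, y):
    # Pearson linear correlation
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def rankdata(x):
    # Average ranks (handles ties), 1-based
    x = np.asarray(x, float)
    order = np.argsort(x, kind='mergesort')
    ranks, sx = np.empty(len(x)), x[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def srcc(x, y):
    # Spearman = Pearson on ranks
    return plcc(rankdata(x), rankdata(y))

def krcc(x, y):
    # Kendall tau-a: (concordant - discordant) / total pairs
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, s = len(x), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(s / (n * (n - 1) / 2))

mos  = [3.1, 4.2, 2.0, 4.8, 3.6]
pred = [3.0, 4.0, 2.2, 4.9, 3.3]  # same ordering as MOS, slightly off in value
print(plcc(mos, pred) > 0.95, srcc(mos, pred), krcc(mos, pred))  # True 1.0 1.0
```

The example shows why both families of metrics are reported: a prediction that preserves the ranking perfectly (SRCC = KRCC = 1) can still deviate in absolute value, which only PLCC, RMSE, and MAE detect.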
4.3 Implementation Details
We take the pre-trained mobilevitv2_200 and mobilenetv3_large_100 as the backbones of MobileViT-IQA and MobileNet-IQA, respectively. Unless explicitly specified, the number of MALs is set to 3, and the input images are resized to 1907×1231, the maximum training resolution our hardware can support, during both training and testing. We set the constant $\alpha=2$ in Eq. (2). We use the Adam optimizer with a learning rate of $1\times 10^{-5}$ and a weight decay of $1\times 10^{-5}$. The learning rate is adjusted using cosine annealing every 50 epochs. We train the teacher model for 100 epochs with a batch size of 4 and the student model for 300 epochs with a batch size of 8 on one NVIDIA RTX A800.
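The cosine annealing schedule mentioned above can be sketched as follows; restarting the cycle every `t_max = 50` epochs and a minimum learning rate of 0 are our assumptions:

```python
import math

def cosine_annealing_lr(epoch, base_lr=1e-5, t_max=50, eta_min=0.0):
    # Cosine annealing with period t_max; the schedule restarts each cycle.
    t = epoch % t_max
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))

print(cosine_annealing_lr(0))   # 1e-05  (start of cycle: full learning rate)
print(cosine_annealing_lr(25))  # 5e-06  (midpoint: half the learning rate)
```

The rate decays smoothly from `base_lr` toward `eta_min` over each 50-epoch cycle, then resets, matching the "every 50 epochs" adjustment in the text.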
4.4 Comparisons With State-of-the-Arts
We compare our model with six advanced IQA models, namely HyperIQA [27], Effnet-2C-MLSP [30], CONTRIQUE [23], ARNIQA [2], CLIP-IQA+ [28], and QualiCLIP [1]. Following [10], the computational efficiency of all models is measured by the number of MACs required for a forward pass with the same input size of 3840×2160.
The results on the validation and test sets of the UHD-IQA dataset are shown in Tab. 1 and Tab. 2. The proposed MobileNet-IQA significantly outperforms the compared methods in terms of both performance and computational efficiency. In particular, compared to the state-of-the-art (SOTA) model QualiCLIP, our MobileNet-IQA achieves, on the validation and test sets respectively, improvements of 4.49% and 4.91% in KRCC, 4.12% and 4.28% in PLCC, 44.30% and 20.48% in RMSE, 46.88% and 30.30% in MAE, and 2.38% and 2.34% in SRCC, while reducing computational complexity by 88.90%. Compared to HyperIQA, whose MACs are closest to ours, MobileNet-IQA significantly outperforms it in all five metrics, with improvements ranging from 38.18% to 330.22% on the validation set and from 34.29% to 633.98% on the test set. These results highlight the clear advantages of our proposed method over existing IQA models.
It is worth noting that, through knowledge distillation, the performance of MobileNet-IQA across the five metrics is only slightly lower than that of the teacher model (MobileViT-IQA), with a maximum drop of just 0.003, while improving computational efficiency by approximately 91.66%. This demonstrates that our network architecture and knowledge distillation approach significantly improve computational efficiency while maintaining the performance of the student network.
We also list the results of the AIM 2024 UHD-IQA Challenge in Tab. 3. Our model achieves fourth place, which further demonstrates its effectiveness.
Table 1: Evaluation of the baselines on the validation set. "↑" means higher is better; "↓" means lower is better. Best and second-best results are highlighted in bold and underlined, respectively.
| Method | KRCC↑ | PLCC↑ | RMSE↓ | MAE↓ | SRCC↑ | MACs (G)↓ |
|---|---|---|---|---|---|---|
| HyperIQA [27] | 0.359 | 0.182 | 0.087 | 0.055 | 0.524 | 211 |
| Effnet-2C-MLSP [30] | 0.445 | 0.627 | 0.060 | 0.050 | 0.615 | 345 |
| CONTRIQUE [23] | 0.521 | 0.712 | 0.049 | 0.038 | 0.716 | 855 |
| ARNIQA [2] | 0.523 | 0.717 | 0.050 | 0.039 | 0.718 | 855 |
| CLIP-IQA+ [28] | 0.546 | 0.732 | 0.108 | 0.087 | 0.743 | 895 |
| QualiCLIP [1] | 0.557 | 0.752 | 0.079 | 0.064 | 0.757 | 901 |
| MobileViT-IQA (Teacher) | 0.585 | 0.784 | 0.043 | 0.034 | 0.777 | 1199 |
| MobileNet-IQA (Student) | 0.582 | 0.783 | 0.044 | 0.034 | 0.775 | 100 |
Table 2: Evaluation of the performance of the baselines on the test set. Best and second-best results are highlighted in bold and underlined, respectively.
Table 3: The results on the private test set of the AIM 2024 UHD-IQA Challenge. (Results exceeding the competition's computational limits are excluded.)
4.5 Discussion about the Number of the MAL
To explore the effect of the number of MALs, $M$, on the performance of our model, we re-train MobileViT-IQA with different settings of $M$ (1, 2, and 3). The results on the validation set are shown in Tab. 4. As the number of MALs increases, the performance of MobileViT-IQA consistently improves. This indicates that incorporating more MALs benefits performance, since more complementary contexts are learned. Additionally, we find that the discrepancy metrics (RMSE and MAE) remain unchanged, while the consistency metrics (KRCC, PLCC, and SRCC) vary significantly. We speculate that the additional complementary contexts provided by different MALs contribute to a more stable prediction of quality scores, leading to more reliable ranking and correlation rather than changes in absolute scores.
Table 4: The impact of the number of MALs on the performance of MobileViT-IQA on the validation set. The average results of KRCC, PLCC, and SRCC are provided. The best results are marked in bold.
4.6 Discussion about the Impact of the Input Image Resolution
To investigate the impact of input resolution on model performance, we resize the original 4K images (3840×2160) into smaller sizes, namely 238×153, 224×224, 317×205, 476×307, 1271×820, and 1907×1231. We re-train MobileViT-IQA at each of these resolutions. The results are summarized in Tab. 5, where "Area Rate" denotes the ratio of the input resolution to the 4K resolution.
The results indicate that when the input area rate is less than 1% of the 4K resolution (e.g., 224×224), there is a significant drop in model performance. This degradation is due to the substantial loss of detail when high-resolution images are resized to low resolutions. As the resolution increases, model performance improves significantly. Specifically, when the resolution is increased from 476×307 (1.76%) to 1271×820 (12.57%), an increase in area rate of roughly 7.1×, the performance metrics improve by 6.8% to 20.0%. However, further increasing the resolution from 1271×820 (12.57%) to 1907×1231 (28.30%) yields minimal improvement. This could be due to the relatively small ratio between these two resolutions (about 2.25×), which may not significantly affect the model. Due to GPU computational limitations, we have not yet investigated higher resolutions.
Table 5: The impact of the resolution of input images on the performance of MobileViT-IQA on the validation set. The average results of KRCC, PLCC and SRCC are provided. The best results are highlighted in bold.
4.7 Visualization Analysis on the MAL
To validate that the proposed MALs learn diverse attention, we compute the cosine similarity between the weights of each pair of MALs and show the results in Fig. 4-(A) and (B). All similarity scores except those on the diagonal are extremely low, meaning there is little redundancy between MALs. Moreover, we compute the cosine similarity between the MALs of the teacher (MobileViT-IQA) and the student (MobileNet-IQA) to check whether the student learns from the teacher. As illustrated in Fig. 4-(C), the high diagonal similarity indicates that the distillation is effective at the corresponding positions, i.e., the student has successfully learned how to assess images from the teacher.
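The pairwise similarity analysis can be sketched as below; the MAL weight dimensionality and the use of random vectors to stand in for independently initialized weights are illustrative assumptions:

```python
import numpy as np

def cosine_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_mal_similarity(mals_a, mals_b):
    # mals_*: lists of flattened weight vectors, one per MAL
    return np.array([[cosine_sim(wa, wb) for wb in mals_b] for wa in mals_a])

rng = np.random.default_rng(0)
# Independently initialized weights are nearly orthogonal in high dimensions,
# so low off-diagonal similarity is the expected signature of non-redundant MALs.
mals = [rng.standard_normal(10_000) for _ in range(3)]
sim = pairwise_mal_similarity(mals, mals)
print(np.allclose(np.diag(sim), 1.0), np.abs(sim - np.eye(3)).max() < 0.1)  # True True
```

Comparing the teacher's and student's MAL weights with the same function yields the cross-model matrix of Fig. 4-(C), where a strong diagonal indicates position-wise distillation.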
More intuitively, we visualize the outputs of the different MALs in Fig. 5. It can be observed that the MALs attend to distinct regions: for example, the first MAL pays more attention to local regions, while the second and third MALs attend to both global and local features. These examples show that each MAL effectively learns complementary opinion features.
Figure 4: (A), (B) and (C) represent the cosine similarities of pairwise MALs within the MobileViT-IQA, MobileNet-IQA, and between MobileViT-IQA and MobileNet-IQA.
Figure 5: Attention maps produced by different MALs. The number of MALs is set to 3.
4.8 Running On Mobile Phones
To validate that the proposed MobileNet-IQA can run on mobile devices, we convert MobileNet-IQA and HyperIQA [27] to TensorFlow Lite (TFLite) and evaluate inference efficiency with AI Benchmark (https://ai-benchmark.com/). We conduct experiments on two mobile phones: a Xiaomi 10S and an HONOR Magic5 Pro. As illustrated in Fig. 6, we set the inference mode to FP16 and run the models on a single CPU. Each measurement is repeated 10 times, and the average of the 10 runs is reported as the final inference time (ms). The results in Tab. 6 demonstrate that MobileNet-IQA (1271×820) is not only faster than HyperIQA but also surpasses it in overall model performance, further confirming the effectiveness of our approach.
Figure 6: The AI Benchmark inference platform.
Table 6: Inference time comparisons between MobileNet-IQA and HyperIQA on different mobile phones. Model performance in terms of KRCC, PLCC, and SRCC is provided for better comparison. The best results are marked in bold. (Note: HyperIQA randomly crops 224×224 patches 25 times from the input image and averages the quality scores of these 25 patches, while MobileNet-IQA predicts the quality score directly from the full input image. In this experiment, the batch size (BS) for HyperIQA is 25, whereas the BS for MobileNet-IQA is 1.)
4.9 Ablation Studies
In this paper, we develop MobileIQA based on the MAL module and employ knowledge distillation (KD) to train the student model (MobileNet-IQA) under the guidance of the teacher model (MobileViT-IQA).
To validate the effectiveness of these two key components, we conduct the following experiments. First, we remove the three MALs from MobileViT-IQA and re-train the model (W/O MAL). Then, we re-train MobileNet-IQA directly, without guidance from the teacher model (W/O KD). The results in Tab. 7 reveal that removing either component degrades the model's performance. The variant without the MAL (W/O MAL) shows the most pronounced decline, validating the significance of diverse opinion feature learning. In addition, without guidance from the teacher model, the W/O KD variant also shows a noticeable drop in performance, indicating that knowledge distillation effectively transfers reliable knowledge from the teacher to the student and enhances the student's performance. That such a simple knowledge distillation scheme achieves this effect further validates the rationale behind our design of the diverse opinion network based on the MAL module.
Table 7: Ablation studies on the critical components of our framework on the validation set. The average results of KRCC, PLCC and SRCC are provided. The best results are marked in black bold.
5 Conclusion
In this paper, we introduce MobileIQA, an innovative framework comprising a powerful teacher model (MobileViT-IQA) and a lightweight student model (MobileNet-IQA), which use the lightweight networks MobileViT and MobileNet as their respective backbones. We significantly increase the input resolution from 224×224 to 1907×1231, enhancing model performance by capturing more image detail. Furthermore, both models incorporate our proposed Multi-view Attention Learning (MAL) modules, which provide diverse perspectives on the input image and improve network performance. The student model is trained under the guidance of the teacher model, achieving strong performance at a much lower computational cost. Extensive experiments demonstrate the superior accuracy and computational efficiency of our approach.
Acknowledgements
This work was partially supported by the Humboldt Foundation. We thank the AIM 2024 sponsors: Meta Reality Labs, KuaiShou, Huawei, Sony Interactive Entertainment and University of Würzburg (Computer Vision Lab). Additionally, this work is also supported by the Key Research and Development Program of Xinjiang Urumqi Autonomous Region under Grant No. 2023B01005 and the Natural Science Foundation of China under Grants 62122086 and 62202470. Bing Li is also supported by the Youth Innovation Promotion Association, CAS.
References
- [1] Agnolucci, L., Galteri, L., Bertini, M.: Quality-aware image-text alignment for real-world image quality assessment. arXiv preprint arXiv:2403.11176 (2024)
- [2] Agnolucci, L., Galteri, L., Bertini, M., Del Bimbo, A.: ARNIQA: Learning distortion manifold for image quality assessment. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 189–198 (2024)
- [3] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: International Conference on Machine Learning. pp. 89–96 (2005)
- [4] Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th International Conference on Machine Learning. pp. 129–136 (2007)
- [5] Chen, Z., Qin, H., Wang, J., Yuan, C., Li, B., Hu, W., Wang, L.: PromptIQA: Boosting the performance and generalization for no-reference image quality assessment via prompts. arXiv preprint arXiv:2403.04993 (2024)
- [6] Chen, Z., Wang, J., Li, B., Yuan, C., Xiong, W., Cheng, R., Hu, W.: Teacher-guided learning for blind image quality assessment. In: Proceedings of the Asian Conference on Computer Vision. pp. 2457–2474 (2022)
- [7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- [8] Gao, F., Tao, D., Gao, X., Li, X.: Learning to rank for blind image quality assessment. arXiv e-prints (2013)
- [9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
- [10] Hosu, V., Agnolucci, L., Wiedemann, O., Iso, D.: UHD-IQA benchmark database: Pushing the boundaries of blind photo quality assessment. arXiv preprint arXiv:2406.17472 (2024)
- [11] Hosu, V., Conde, M.V., Timofte, R., Agnolucci, L., Zadtootaghaj, S., Barman, N., et al.: AIM 2024 challenge on UHD blind photo quality assessment. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2024)
- [12] Hosu, V., Goldlucke, B., Saupe, D.: Effective aesthetics prediction with multi-level spatially pooled features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9375–9383 (2019)
- [13] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
- [14] Hu, C., Li, X., Liu, D., Wu, H., Chen, X., Wang, J., Liu, X.: Teacher-student architecture for knowledge distillation: A survey. arXiv preprint arXiv:2308.04268 (2023)
- [15] Kang, L., Ye, P., Li, Y., Doermann, D.: Convolutional neural networks for no-reference image quality assessment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1733–1740 (2014)
- [16] Kang, L., Ye, P., Li, Y., Doermann, D.: Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks. In: 2015 IEEE International Conference on Image Processing (ICIP). pp. 2791–2795. IEEE (2015)
- [17] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5148–5157 (2021)
- [18] Lin, K.Y., Wang, G.: Hallucinated-IQA: No-reference image quality assessment via adversarial learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 732–741 (2018)
- [19] Liu, X., Van De Weijer, J., Bagdanov, A.D.: RankIQA: Learning from rankings for no-reference image quality assessment. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1040–1049 (2017)
- [20] Ma, J., Wu, J., Li, L., Dong, W., Xie, X., Shi, G., Lin, W.: Blind image quality assessment with active inference. IEEE Transactions on Image Processing 30, 3650–3663 (2021)
- [21] Ma, K., Liu, W., Liu, T., Wang, Z., Tao, D.: dipIQ: Blind image quality assessment by learning-to-rank discriminable image pairs. IEEE Transactions on Image Processing 26(8), 3951–3964 (2017)
- [22] Ma, K., Liu, X., Fang, Y., Simoncelli, E.P.: Blind image quality assessment by learning from multiple annotators. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 2344–2348. IEEE (2019)
- [23] Madhusudana, P.C., Birkbeck, N., Wang, Y., Adsumilli, B., Bovik, A.C.: Image quality assessment using contrastive learning. IEEE Transactions on Image Processing 31, 4149–4161 (2022)
- [24] Mehta, S., Rastegari, M.: MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178 (2021)
- [25] Qin, G., Hu, R., Liu, Y., Zheng, X., Liu, H., Li, X., Zhang, Y.: Data-efficient image quality assessment with attention-panel decoder. Proceedings of the AAAI Conference on Artificial Intelligence 37, 2091–2100 (2023)
- [26] Saha, A., Mishra, S., Bovik, A.C.: Re-IQA: Unsupervised learning for image quality assessment in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5846–5855 (2023)
- [27] Su, S., Yan, Q., Zhu, Y., Zhang, C., Ge, X., Sun, J., Zhang, Y.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3667–3676 (2020)
- [28] Wang, J., Chan, K.C., Loy, C.C.: Exploring CLIP for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2555–2563 (2023)
- [29] Wang, J., Chen, Z., Yuan, C., Li, B., Ma, W., Hu, W.: Hierarchical curriculum learning for no-reference image quality assessment. International Journal of Computer Vision 131(11), 3074–3093 (2023)
- [30] Wiedemann, O., Hosu, V., Su, S., Saupe, D.: KonX: Cross-resolution image quality assessment. Quality and User Experience 8(1), 8 (Dec 2023). https://doi.org/10.1007/s41233-023-00061-8
- [31] Wu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Wang, A., Li, C., Sun, W., Yan, Q., Zhai, G., et al.: Q-Bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181 (2023)
- [32] Wu, H., Zhang, Z., Zhang, E., Chen, C., Liao, L., Wang, A., Xu, K., Li, C., Hou, J., Zhai, G., et al.: Q-Instruct: Improving low-level visual abilities for multi-modality foundation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 25490–25500 (2024)
- [33] Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., Yang, Y.: MANIQA: Multi-dimension attention network for no-reference image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1191–1200 (2022)
- [34] Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology 30(1), 36–47 (2018)
- [35] Zhang, W., Ma, K., Zhai, G., Yang, X.: Learning to blindly assess image quality in the laboratory and wild. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 111–115. IEEE (2020)
- [36] Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Transactions on Image Processing 30, 3474–3486 (2021)
- [37] Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14071–14081 (2023)
- [38] Zhao, K., Yuan, K., Sun, M., Li, M., Wen, X.: Quality-aware pre-trained models for blind image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22302–22313 (2023)
- [39] Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: MetaIQA: Deep meta-learning for no-reference image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14143–14152 (2020)