Title: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

URL Source: https://arxiv.org/html/2208.08340

Published Time: Thu, 13 Jul 2023 16:50:04 GMT

Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Yinghui Xing*, Qirui Wu*, De Cheng✉, Shizhou Zhang, Guoqiang Liang, Peng Wang, Yanning Zhang

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62101453, 62201467, and 62176198; in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2021A1515110544; in part by the Natural Science Basic Research Program of Shaanxi under Grants 2022JQ-686 and 2019JQ-158; in part by a project funded by the China Postdoctoral Science Foundation under Grant 2022TQ0260; and in part by the Young Talent Fund of Xi'an Association for Science and Technology under Grant 959202313088.

Yinghui Xing, Qirui Wu, Shizhou Zhang, Guoqiang Liang, Peng Wang, and Yanning Zhang are with the School of Computer Science, Northwestern Polytechnical University, Xi'an, China. Yinghui Xing is also with the Research & Development Institute of Northwestern Polytechnical University in Shenzhen. De Cheng is with the School of Telecommunications Engineering, Xidian University, Xi'an, China. The corresponding author is De Cheng (email: dcheng@xidian.edu.cn). *The first two authors contributed equally to this work.

Abstract

With the emergence of large pretrained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes for information beneficial to downstream tasks from the general knowledge stored in the pretrained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as text prompts on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier", while the visual features computed by the image encoder remain unaffected, leading to suboptimal solutions. In this paper, we propose a novel dual-modality prompt tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a class-aware visual prompt tuning (CAVPT) scheme is further proposed in our DPT. In this scheme, the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, encoding both the downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.

Index Terms:

Few-shot learning, Transfer learning, Image Classification, Prompt Tuning, Vision-Language Model

I Introduction

Recently, studies in large-scale vision-language models (VLM), such as CLIP[1] and ALIGN[2], have achieved remarkable progress in representation learning[3, 4, 5]. Benefiting from huge amounts of image-text data, the pretrained large-scale vision-language model is able to learn open-set visual concepts generated from natural language, thus further allowing zero-shot transfer to downstream tasks. Specifically, the vision-language model is composed of two components: the image encoder and the text encoder. When a new classification task arrives, one can synthesize the classifier by feeding the natural language description of the classes to the text encoder. Then, the similarity between the โ€œclassifierโ€ and the image features generated by the image encoder is computed.

Image 1: Refer to caption

(a)

Image 2: Refer to caption

(b)

Image 3: Refer to caption

(c)

Figure 1: Visualization of the attention map of the image encoder. (a) Original Image. (b) Zero-Shot CLIP/CoOp. (c) Our DPT. The images are selected from OxfordPets and Caltech101. The GT annotated object is marked by a red box. Best viewed in color.

However, efficiently adapting these pretrained large-scale vision-language models to downstream tasks presents its own challenges. Recent studies show that "prompting" is a simple and effective method[1], yet designing a proper prompt is a nontrivial task. It usually requires extensive domain expertise and a significant amount of time for manual word tuning, and even with massive tuning, we cannot guarantee that the obtained prompt is optimal for the downstream task.

Recent studies on prompt learning for vision representation have been mainly inspired by prompt tuning approaches in natural language processing (NLP)[6, 7, 8], e.g., the representative CoOp[9]. These methods model the learnable context of a prompt with continuous representations and train the model with these learnable prompts end-to-end while keeping the pretrained parameters fixed. Although these methods have achieved great success and show promising performance, they only learn prompts for the text encoder.

From the perspective of conventional visual recognition, a typical vision model can be roughly divided into a feature extraction module and a classifier. Similarly, the process of feeding the text prompt into the text encoder can be viewed as the synthesis of a classifier, while the image encoder plays the role of the feature extraction module. Assume that the large-scale pretrained vision-language models have already captured most of the general knowledge (visual concepts) needed for the downstream tasks. What the prompting mechanism does is query the information beneficial to the downstream tasks from the pretrained model. As shown in Figure 1, for an input image with multiple visual objects (concepts), e.g., the first case contains a person and a motorbike, the image encoder extracts the visual features of all the objects, i.e., the attention maps of Zero-Shot CLIP and CoOp highlight both the person and the motorbike. However, the downstream task requires the output class label to be "motorbike", the ground-truth annotation. CoOp tries to make the model output "motorbike" by adjusting the "classifier" alone while keeping the highlighted "person" and "motorbike" visual features unchanged. There is a consensus in the vision community that features matter[10]! Therefore, we believe that adopting prompt tuning for the text encoder alone while directly using the fixed image encoder for the downstream task is suboptimal. In this paper, we introduce visual prompts in the image input space and propose a dual-modality prompt tuning (DPT) paradigm that learns text prompts and visual prompts for the text and image encoders simultaneously, thus adapting the pretrained model to downstream tasks by adjusting both the "classifier" and the "visual features".

Specifically, for visual prompt tuning in a ViT-based image encoder, we introduce a small number of trainable parameters into the input of the transformer blocks while keeping the pretrained image encoder fixed. Inserting visual prompts can directly adjust the image patch token embeddings, and hence the image features, through the self-attention weights and by absorbing the prompt-derived value vectors. To help the pretrained model transfer better to the downstream task, we further introduce a class-aware visual prompt tuning (CAVPT) mechanism into our DPT framework so that the final image feature concentrates more on the target visual concept. We thus aim to encode both the task-related information and visual instance information into the visual prompts. The class-aware visual prompt is dynamically generated by performing cross attention between text prompt features and visual image patch embeddings and is expected to include richer semantic features of the target visual objects. Thus, the final image feature, which is computed by absorbing information from the image patch embeddings and our class-aware visual prompts, can concentrate more on the classes of the downstream task. Finally, the overall DPT paradigm learns text prompts, visual prompts, and class-aware visual prompts simultaneously. As shown in Figure 1, tuning the pretrained models with our DPT yields a more focused task-aware visual attention area.

The main contributions of this paper can be summarized in terms of the following three aspects:

  • We propose a new dual-modality prompt tuning paradigm for tuning large pretrained vision-language models by simultaneously learning visual and text prompts for both the image and text encoders.
  • To encourage the visual prompts to explicitly contain downstream task-related information, we further introduce the class-aware visual prompt into our DPT. It is dynamically generated by performing cross attention between text prompt features and visual token embeddings.
  • Extensive experimental results on 11 datasets demonstrate the effectiveness of the proposed method, its superiority over other prompt tuning approaches by a large margin, and its generalization ability.

The remainder of this paper is organized as follows. Section II introduces the related works. Details of our proposed method are elaborated in Section III. In Section IV, we report the results of comprehensive experiments on the 11 datasets used in prompt tuning, which demonstrate the effectiveness of our method. Finally, the conclusion of our work is presented in Section V.

Image 4: Refer to caption

Figure 2: The overall architecture of our proposed DPT method. It consists of three learnable components: the text prompt, the visual prompt, and the class-aware visual prompt generated by a Class-Aware Visual Prompt Tuning (CAVPT) generator module, whose detailed architecture is illustrated in Fig. 3.

II Related Work

II-A Vision-Language Pretrained Models

Learning visual representations under the supervision of natural language has been demonstrated to be effective and has attracted much attention[11, 2, 12, 1]. For vision-language models, image-text matching and cross-modal contrastive learning are two important issues. In CLIP[1], two encoders related to the vision and language modalities are designed, and the image and text embeddings are aligned using a symmetric cross-entropy loss. Similarly, ALIGN[2] also utilizes a dual-encoder architecture, but it projects the image and text embeddings into the same semantic space to calculate similarity scores between the vision and language modalities, which makes the vision-language interaction more efficient. Both models are pretrained on large-scale image-text datasets with a contrastive loss and can be transferred to downstream tasks. Research on transferring CLIP to various downstream tasks, such as image classification[13, 14, 9, 15], video-text retrieval[16], tracking[17], and so on[18, 19, 20, 21], is thriving. To boost the performance of CLIP on downstream tasks, CLIP-Adapter[13] introduced feature adapters on either the visual or language branch and fine-tuned them on the few-shot classification task. Zhang et al.[14] further proposed a training-free CLIP-Adapter (i.e., TIP-Adapter), which creates the weights via a key-value cache model constructed from the few-shot training set. With much less training, TIP-Adapter is more efficient than CLIP-Adapter. As an alternative framework for reducing the gap between the objective forms of model pretraining and fine-tuning, prompt-based learning has become an active topic in both the NLP and computer vision communities. However, the discrepancy between the two modalities causes difficulties in tuning the prompt. Recently, Zhou et al.[9] proposed a context optimization (CoOp) strategy to automatically learn the optimal prompts, which greatly boosts recognition accuracy. Our work also focuses on transferring the pretrained vision-language model to downstream tasks through prompting.

II-B Prompt Learning

Prompt learning originated in the NLP community[6, 7, 22] and originally referred to the application of a fixed function to the input tokens, which provides the model with an instruction about the task. In the computer vision community, prompt learning has been explored in both visual models[23, 24, 25] and vision-language models[1, 9, 15, 18, 26]. In particular, visual prompt tuning (VPT)[23] has achieved significant performance gains with only a small number of additional parameters, i.e., prompts, in the input space. Vision-language models have been investigated in image classification[9, 26, 15, 27, 28, 29], video recognition[30], and cross-modal learning[31, 32, 33]. Among them, CoOp[9] achieves continuous prompt optimization from downstream data to adapt the pretrained vision-language models. However, CoOp may introduce improper prompt tuning steps, which could hamper general knowledge probing[26]. To improve the generalization ability of CLIP, Zhu et al.[26] proposed a novel prompt tuning method, namely ProGrad, to address the conflicts between each tuning step and the general knowledge CLIP has predicted. Conditional CoOp (CoCoOp)[15] extended CoOp by learning an input-conditional token for each image to improve the cross-domain generalization ability of CoOp. Motivated by the fact that contrastive loss can improve the generalization ability of models, Sahoo et al.[34] introduced a contrastive prompt tuning approach, which augments the standard cross-entropy loss with two additional contrastive loss terms to learn generalizable prompts without introducing any additional parameters. Lu et al.[27] learned the output embeddings of prompts instead of the input embeddings and employed a Gaussian distribution to model them effectively. Bahng et al.[28] proposed a prompting method for CNN networks to adapt the pretrained vision-language models to downstream tasks. In contrast, Zhang et al.[29] used a neural architecture search algorithm to identify the optimal configuration with adapters and prompts as small components.

Most of the existing methods tune the prompts in the text encoders alone and neglect the clues in visual features. Our work proposes a dual-modality prompt tuning paradigm, which introduces both the text prompt and visual prompt for the vision-language model. Furthermore, a class-aware visual prompt is proposed to enable the image feature to pay more attention to the target foreground object for downstream tasks.

II-C Transfer Learning

Benefiting from large amounts of annotated data, the performance of deep neural networks has been greatly boosted. However, due to labeling costs, collecting large-scale training datasets with accurate annotations is cumbersome[14]. Transfer learning[35, 36, 37, 38, 39], which aims to transfer general knowledge from one domain to related domains with limited training data, has been proven to be a possible solution to few-shot learning[40, 41, 42, 43, 44, 45, 46, 47]. Some works tune a small number of parameters while keeping most parameters of the pretrained model frozen. For example, [37] adapted the pretrained network by training a lightweight side network that was fused with the frozen pretrained network via summation. [38] proposed a new memory-efficient bias module, i.e., the lite residual module, to refine the feature extractor by learning small residual feature maps. Rebuffi et al.[39] introduced a residual adapter to the model and only trained the adapter network to improve the accuracy of domain-specific representations.

On the other hand, some self-supervised learning-based methods, such as MoCo[48], BYOL[49], and MAE[50], can also alleviate the requirement of large-scale training data. Recently, vision-language models pretrained on large-scale image-text pairs have demonstrated their superiority. Therefore, it is crucial to excavate the potential of these models for downstream tasks. This paper focuses on transferring knowledge learned from them to downstream tasks through prompting.

III Methodology

In this section, we first revisit the CLIP model. Then, we elaborate on each component of the proposed dual-modality prompt tuning (DPT) paradigm, including the text prompts, visual prompts, and class-aware visual prompts. The framework of our proposed DPT is illustrated in Figure 2. Finally, we provide the loss function of DPT and a warm-up strategy to accelerate training.

III-A Contrastive Language-Image Pretraining (CLIP) Model

The CLIP model aims to align the image feature space and the text feature space, which gives the model the capability of zero-shot transfer to downstream tasks. CLIP is composed of two encoders: one designed for images and the other for text. The text encoder adopts a transformer[51] to encode the text information. The image encoder can either be a CNN model, such as ResNet[5], or a vision transformer, such as ViT[52]. In our method, we choose ViT as the image encoder to be compatible with the visual prompts in[23].

With 400 million image-text pairs, CLIP is trained under the contrastive learning framework, where associated image-text pairs are treated as positive samples and non-associated pairs as negative samples. After that, all parameters of the pretrained CLIP model are kept frozen for downstream tasks without any fine-tuning. In downstream tasks, a hand-crafted prompt is fed into the text end to synthesize a zero-shot linear classifier by embedding the class names of the target dataset. Taking the classification task as an example, the "[CLASS]" token is first extended by a template, such as "a photo of a [CLASS]". Then, the sentence is treated as a prompt and encoded by the text encoder to derive a weight vector $\mathbf{w}_i$, $i \in \{1,\dots,K\}$, where $K$ is the total number of categories. At the same time, the image feature $\mathbf{x}$ is obtained by the image encoder. The prediction probability can be calculated by

pโข(y=iโˆฃ๐ฑ)=expโก(simโก(๐ฑ,๐ฐ i)/ฯ„)โˆ‘j=1 K expโก(simโก(๐ฑ,๐ฐ j)/ฯ„),๐‘ ๐‘ฆ conditional ๐‘– ๐ฑ sim ๐ฑ subscript ๐ฐ ๐‘– ๐œ superscript subscript ๐‘— 1 ๐พ sim ๐ฑ subscript ๐ฐ ๐‘— ๐œ p(y=i\mid{\bf x})=\frac{\exp\left(\operatorname{sim}\left({\bf x},{\bf w}{i}% \right)/\tau\right)}{\sum{j=1}^{K}\exp\left(\operatorname{sim}\left({\bf x},{% \bf w}_{j}\right)/\tau\right)},italic_p ( italic_y = italic_i โˆฃ bold_x ) = divide start_ARG roman_exp ( roman_sim ( bold_x , bold_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) / italic_ฯ„ ) end_ARG start_ARG โˆ‘ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_exp ( roman_sim ( bold_x , bold_w start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) / italic_ฯ„ ) end_ARG ,(1)

where $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity and $\tau$ is the temperature coefficient learned by CLIP.
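As an illustration, Eq. (1) can be sketched in a few lines of NumPy; the feature dimension, class count, and temperature below are arbitrary placeholders rather than values from the paper.

```python
import numpy as np

def cosine_sim(x, W):
    """Cosine similarity between a feature vector x and each row of W."""
    x = x / np.linalg.norm(x)
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return W @ x

def clip_probs(x, W, tau=0.01):
    """Eq. (1): softmax over temperature-scaled cosine similarities.

    x   : (d,)   image feature from the image encoder
    W   : (K, d) class weight vectors w_i synthesized by the text encoder
    tau : temperature coefficient (here a toy value)
    """
    logits = cosine_sim(x, W) / tau
    logits -= logits.max()            # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

# toy example with random features (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=8)
W = rng.normal(size=(3, 8))
p = clip_probs(x, W)
```

The output `p` is a valid probability distribution over the $K$ classes, and the predicted label is simply `p.argmax()`.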

Image 5: Refer to caption

Figure 3: The detailed architecture of the proposed class-aware visual prompt tuning (CAVPT) generator module.

III-B Text Prompt and Visual Prompt

Text Prompt. It is known that hand-crafting prompts for the CLIP model may take considerable time and require expertise for word tuning, as a slight change in wording may lead to significant performance degradation. Motivated by prompt tuning in NLP models, CoOp[9] introduced a set of tunable word embedding vectors to learn machine-favorable prompts for the text end, which we call text prompts. Let $\{\mathbf{u}_1,\mathbf{u}_2,\dots,\mathbf{u}_M\}$ denote $M$ learnable context vectors, and let the word embedding of the text class token be represented by $\mathbf{c}_i$, $i \in \{1,\dots,K\}$; then, the prompt for the $i$-th class can be denoted as $\mathbf{t}_i=\{\mathbf{u}_1,\mathbf{u}_2,\dots,\mathbf{u}_M,\mathbf{c}_i\}$. By forwarding $\mathbf{t}_i$ into the text encoder $g(\cdot)$, we obtain a classification weight vector for the $i$-th visual concept. The corresponding prediction probability can be calculated by

pโข(y=iโˆฃ๐ฑ)=expโก(simโก(๐ฑ,gโข(๐ญ i))/ฯ„)โˆ‘j=1 K expโก(simโก(๐ฑ,gโข(๐ญ j))/ฯ„),๐‘ ๐‘ฆ conditional ๐‘– ๐ฑ sim ๐ฑ ๐‘” subscript ๐ญ ๐‘– ๐œ superscript subscript ๐‘— 1 ๐พ sim ๐ฑ ๐‘” subscript ๐ญ ๐‘— ๐œ p(y=i\mid{\bf x})=\frac{\exp\left(\operatorname{sim}\left({\bf x},g({\bf t}{i% })\right)/\tau\right)}{\sum{j=1}^{K}\exp\left(\operatorname{sim}\left({\bf x}% ,g({\bf t}_{j})\right)/\tau\right)},italic_p ( italic_y = italic_i โˆฃ bold_x ) = divide start_ARG roman_exp ( roman_sim ( bold_x , italic_g ( bold_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) / italic_ฯ„ ) end_ARG start_ARG โˆ‘ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT roman_exp ( roman_sim ( bold_x , italic_g ( bold_t start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) / italic_ฯ„ ) end_ARG ,(2)

where ๐ฑ ๐ฑ{\bf x}bold_x represents the extracted image features, and gโข(โ‹…)๐‘”โ‹…g(\cdot)italic_g ( โ‹… ) denotes the text encoder.

Visual Prompt. For vision-language models, there are two encoders, one for each modality. Tuning text prompts alone is not enough to reduce the gap between pretraining and downstream tasks, thus leading to suboptimal results. Motivated by visual prompt tuning (VPT)[23], proposed for tuning vision transformers, we introduce visual prompts into the image encoder of the CLIP model. The image patches $\{\mathbf{I}_j\in\mathbb{R}^{3\times h\times w}\mid j\in\mathbb{N},\,1\leq j\leq N_p\}$ are first embedded into a $d$-dimensional latent space as follows:

$$\mathbf{e}_0^{j}=\operatorname{Embed}(\mathbf{I}_j), \quad \mathbf{e}_0^{j}\in\mathbb{R}^{d},\; j=1,2,\dots,N_p. \quad (3)$$
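Eq. (3) corresponds to the standard ViT patch embedding: each image is cut into $N_p$ non-overlapping patches, which are flattened and linearly projected into $\mathbb{R}^d$. A NumPy sketch with toy sizes (a real ViT learns the projection weights; here they are random for illustration):

```python
import numpy as np

def patchify(image, h, w):
    """Split a (3, H, W) image into non-overlapping (3, h, w) patches."""
    C, H, W = image.shape
    patches = []
    for top in range(0, H, h):
        for left in range(0, W, w):
            patches.append(image[:, top:top + h, left:left + w])
    return np.stack(patches)               # (N_p, 3, h, w)

d = 16                                     # latent dimension (toy value)
h = w = 8
rng = np.random.default_rng(0)
image = rng.normal(size=(3, 32, 32))
patches = patchify(image, h, w)            # N_p = (32/8)**2 = 16 patches
W_embed = rng.normal(size=(3 * h * w, d))  # stand-in for the learned projection

# Eq. (3): Embed(I_j) -- flatten each patch and project it into R^d
E0 = patches.reshape(len(patches), -1) @ W_embed   # (N_p, d)
```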

Let ๐„ l={๐ž l jโˆˆโ„ dโˆฃjโˆˆโ„•,1โ‰คjโ‰คN p}subscript ๐„ ๐‘™ conditional-set superscript subscript ๐ž ๐‘™ ๐‘— superscript โ„ ๐‘‘ formulae-sequence ๐‘— โ„• 1 ๐‘— subscript ๐‘ ๐‘{\bf E}{l}=\left{{\bf e}{l}^{j}\in\mathbb{R}^{d}\mid j\in\mathbb{N},1\leq j% \leq N_{p}\right}bold_E start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = { bold_e start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT โˆˆ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT โˆฃ italic_j โˆˆ blackboard_N , 1 โ‰ค italic_j โ‰ค italic_N start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT } and ๐ l={๐ฉ l iโˆˆโ„ dโˆฃiโˆˆโ„•,1โ‰คiโ‰คP}subscript ๐ ๐‘™ conditional-set subscript superscript ๐ฉ ๐‘– ๐‘™ superscript โ„ ๐‘‘ formulae-sequence ๐‘– โ„• 1 ๐‘– ๐‘ƒ{\bf P}{l}=\left{{\bf p}^{i}{l}\in\mathbb{R}^{d}\mid i\in\mathbb{N},1\leq i% \leq P\right}bold_P start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = { bold_p start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT โˆˆ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT โˆฃ italic_i โˆˆ blackboard_N , 1 โ‰ค italic_i โ‰ค italic_P } represent a collection of image patch embeddings and visual prompts for the l tโขh subscript ๐‘™ ๐‘ก โ„Ž l_{th}italic_l start_POSTSUBSCRIPT italic_t italic_h end_POSTSUBSCRIPT transformer layer, respectively. Suppose ๐ฌ lโˆˆโ„ d subscript ๐ฌ ๐‘™ superscript โ„ ๐‘‘{\bf s}_{l}\in\mathbb{R}^{d}bold_s start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT โˆˆ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT is a learnable class token in the image encoder, which is different from the text class token used in text prompt that the latter is a category-related word embedding. There are two versions of visual prompts, VPT-Shallow and VPT-Deep, in[23]. 
We empirically found that VPT-Deep can achieve superior performances (see TableIII-B), and hence we take VPT-Deep into our implementation in SectionIV.

Visual prompts are introduced into each of the transformer layers; that is,

$$[\mathbf{s}_l,\, \_\,,\mathbf{E}_l]=\Phi_l\left([\mathbf{s}_{l-1},\mathbf{P}_{l-1},\mathbf{E}_{l-1}]\right), \quad l=1,2,\dots,L. \quad (4)$$

Generally, performance is positively correlated with prompt depth; therefore, we utilize VPT-Deep in our model. $\mathbf{s}_L$ is then projected by a linear projection layer $LP$ to obtain the final image feature. For simplicity, the whole process of image feature extraction can be represented by

$$\mathbf{x}'=f\left([\mathbf{s}_0,\mathbf{P}_0,\cdots,\mathbf{P}_L,\mathbf{E}_0]\right), \quad (5)$$

where $f(\cdot)$ denotes the image encoder.
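The layer-wise prompt insertion of Eq. (4) can be sketched as the loop below. `transformer_layer` is a stand-in for the frozen ViT block $\Phi_l$ (any shape-preserving map serves for illustration), and all dimensions are toy values; the point is that fresh prompts are inserted at every layer while the layer outputs at the prompt positions are discarded (the "_" slot in Eq. (4)).

```python
import numpy as np

def transformer_layer(tokens):
    """Stand-in for Phi_l: any (n, d) -> (n, d) map; a real ViT block goes here."""
    return np.tanh(tokens)

L, P, Np, d = 3, 2, 16, 8               # layers, prompts per layer, patches, dim
rng = np.random.default_rng(0)
s = rng.normal(size=(1, d))             # learnable image class token s_0
E = rng.normal(size=(Np, d))            # patch embeddings E_0
prompts = [rng.normal(size=(P, d)) for _ in range(L)]   # P_0 .. P_{L-1}

# Eq. (4): at each layer, insert the prompts P_{l-1}, run the frozen block,
# keep s_l and E_l, and drop the outputs at the prompt positions.
for l in range(L):
    out = transformer_layer(np.vstack([s, prompts[l], E]))
    s, E = out[:1], out[1 + P:]

x = s[0]    # s_L; the real model applies the linear projection LP afterwards
```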

Note that the calculation process of the image encoder, i.e., the ViT model, can be viewed as global scene reasoning: $\mathbf{s}_l$ pools visual concepts from the image patch embeddings layer by layer. With the help of visual prompts, the target visual concept of the downstream task can be further highlighted in $\mathbf{s}_l$ via the self-attention operation in each transformer layer. By inserting visual prompts into each transformer layer, the self-attention operation for $\mathbf{s}_l$ is affected in two ways, as the prompts extend both the keys and the values: 1) the attention weights can change so that $\mathbf{s}_l$ concentrates more on the image patch embeddings that include the target concept; 2) the visual prompts also serve as value vectors for the self-attention operation, so $\mathbf{s}_l$ may absorb the additional information that the visual prompts have learned.

However, naive visual prompts are devised as unconstrained learnable vectors, and they can only learn downstream task-related information implicitly by being tuned on downstream task datasets. In this work, we propose class-aware visual prompt tuning (CAVPT) to generate visual prompts by utilizing both task-related information from the text side and instance-wise information from the visual side.

TABLE I: Main results on 11 datasets under the 16-shot setting.

| Methods | EuroSAT | Caltech101 | Oxford Flowers | Food101 | FGVC Aircraft | DTD | OxfordPets | Stanford Cars | Sun397 | UCF101 | ImageNet | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZSCLIP[1] | 45.49 | 91.28 | 66.63 | 80.62 | 19.08 | 44.03 | 87.38 | 60.19 | 62.06 | 63.52 | 59.61 | 61.81 |
| CoOp[9] | 83.12 | 94.45 | 95.07 | 78.20 | 33.94 | 67.20 | 88.88 | 75.79 | 72.31 | 79.10 | 66.55 | 75.87 |
| CoCoOp[15] | 74.99 | 94.01 | 79.97 | 82.36 | 23.64 | 59.34 | 90.98 | 64.25 | 69.75 | 73.13 | 65.07 | 70.68 |
| ProGrad[26] | 82.49 | 95.18 | 94.60 | 81.15 | 32.50 | 65.98 | 90.43 | 74.85 | 73.22 | 78.52 | 66.60 | 75.96 |
| ProDA[27] | 83.28 | 95.50 | 95.98 | 81.89 | 34.68 | 70.76 | 90.60 | 77.64 | 75.07 | 81.85 | 67.62 | 77.72 |
| VPT | 92.17 | 94.85 | 93.80 | 81.29 | 39.98 | 67.16 | 90.32 | 72.03 | 69.84 | 80.17 | 64.17 | 76.89 |
| VLP | 91.90 | 95.10 | 96.05 | 78.42 | 42.92 | 68.06 | 90.33 | 78.81 | 72.12 | 82.04 | 66.91 | 78.42 |
| DPT | 91.16 | 95.61 | 96.60 | 79.25 | 48.37 | 70.16 | 91.22 | 82.55 | 70.97 | 81.43 | 66.85 | 79.47 |

III-C Class-Aware Visual Prompt Tuning

Class-aware visual prompts aim to explicitly encode task-related information. Our CAVPT generator takes two sides of inputs, the instance-wise information from the visual side and the task-related information from the text side. The text prompts features computed by the text encoder with all the text class tokens well represents the task-related information. However, when we input the text prompts features with all the text class tokens into the CAVPT generator, the computational complexity of CAVPT generator is linearly increased with the number of classes on each downstream task. To reduce the computational complexity of our CAVPT generator into constant, we select top-K N subscript ๐พ ๐‘ K_{N}italic_K start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT text prompts features with the help of a Zero-Shot CLIP Inference module (the right part of Figure2). Note that the final performance is not sensitive to K N subscript ๐พ ๐‘ K_{N}italic_K start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT and The final performance fluctuates with only 0.1%โˆผ0.2%similar-to percent 0.1 percent 0.2 0.1%\sim 0.2%0.1 % โˆผ 0.2 % when setting different K N subscript ๐พ ๐‘ K_{N}italic_K start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT. Then, feeding the text prompts with the top-K N subscript ๐พ ๐‘ K_{N}italic_K start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT text class token [CLASS] into the text encoder produces K N subscript ๐พ ๐‘ K_{N}italic_K start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT feature vectors, i.e., ๐  jโˆˆโ„ D,1โ‰คjโ‰คK N formulae-sequence subscript ๐  ๐‘— superscript โ„ ๐ท 1 ๐‘— subscript ๐พ ๐‘{\bf g}{j}\in\mathbb{R}^{D},1\leq j\leq K{N}bold_g start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT โˆˆ blackboard_R start_POSTSUPERSCRIPT italic_D end_POSTSUPERSCRIPT , 1 โ‰ค italic_j โ‰ค italic_K start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT, in which the task-related information are encoded. 
A class-aware visual prompt is generated dynamically by performing cross-attention between text prompt features from the text side and the inputs of the transformer block from the visual side, as illustrated in Figure 3.
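The top-$K_N$ selection above can be sketched as follows (a minimal NumPy illustration with hypothetical names and toy random features, not the paper's actual Zero-Shot CLIP Inference code):

```python
import numpy as np

def select_topk_classes(image_feat, text_feats, k_n):
    # Zero-shot CLIP scoring: cosine similarity between the image
    # feature and each class's text feature.
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = txt @ img                      # (K,) similarity per class
    # Keep the K_N highest-scoring classes (best first).
    return np.argsort(logits)[::-1][:k_n]

# Toy example: 5 classes, feature dimension 4.
rng = np.random.default_rng(0)
topk = select_topk_classes(rng.normal(size=4), rng.normal(size=(5, 4)), k_n=2)
```

Only the text prompt features of these $K_N$ classes are then passed to the CAVPT generator, so its cost no longer scales with the total number of classes.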

After the mapping of a fully connected layer, we obtain $K_N$ query vectors $\mathbf{q}_j\in\mathbb{R}^{d}$, $1\leq j\leq K_N$. The key and value vectors $\mathbf{k}\in\mathbb{R}^{n\times d}$ are both obtained from the corresponding visual transformer layer's inputs, including the image patch embeddings, the image class token embedding, and the visual prompts, where $n$ denotes their total number. Our proposed class-aware visual prompt $\tilde{\mathbf{P}}^{j}_{l}\in\mathbb{R}^{d}$ for the $l$-th layer is computed as

๐จ l j=๐š‚๐š˜๐š๐š๐š–๐šŠ๐šกโข(๐ช jโข๐– qโข(๐ค๐– k)T d k)โข๐ค๐– v,1โ‰คjโ‰คK N,formulae-sequence subscript superscript ๐จ ๐‘— ๐‘™ ๐š‚๐š˜๐š๐š๐š–๐šŠ๐šก subscript ๐ช ๐‘— subscript ๐– ๐‘ž superscript subscript ๐ค๐– ๐‘˜ ๐‘‡ subscript ๐‘‘ ๐‘˜ subscript ๐ค๐– ๐‘ฃ 1 ๐‘— subscript ๐พ ๐‘{\bf o}^{j}{l}=\verb"Softmax"(\frac{{\bf q}{j}{\bf W}{q}({\bf k}{\bf W}{k}% )^{T}}{\sqrt{d_{k}}}){\bf k}{\bf W}{v},1\leq j\leq K{N},bold_o start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = typewriter_Softmax ( divide start_ARG bold_q start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT bold_W start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( bold_kW start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_ARG end_ARG ) bold_kW start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , 1 โ‰ค italic_j โ‰ค italic_K start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ,(6)

$$\tilde{\mathbf{P}}^{j}_{l}=LN(\mathbf{o}^{j}_{l}+\mathbf{q}_{j}),\quad 1\leq j\leq K_N, \tag{7}$$

where $LN(\cdot)$ denotes layer normalization, and $\mathbf{W}_{q}\in\mathbb{R}^{d\times d_{k}}$, $\mathbf{W}_{k}\in\mathbb{R}^{d\times d_{k}}$, and $\mathbf{W}_{v}\in\mathbb{R}^{d\times d}$ denote the parameters of the cross attention.
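Equations (6) and (7) can be sketched in plain NumPy as follows (a single-head illustration with toy dimensions; the function names and shapes are our assumptions, not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cavpt_generator(q, k, Wq, Wk, Wv):
    """Eq. (6): scaled cross-attention from text-prompt queries q (K_N, d)
    to visual-side keys/values k (n, d); Eq. (7): Add & Norm residual."""
    d_k = Wq.shape[1]
    attn = softmax((q @ Wq) @ (k @ Wk).T / np.sqrt(d_k))   # (K_N, n)
    o = attn @ (k @ Wv)                                    # Eq. (6), (K_N, d)
    return layer_norm(o + q)                               # Eq. (7)

# Toy dimensions: K_N = 2 queries, n = 5 visual tokens, d = 8, d_k = 4.
rng = np.random.default_rng(0)
P = cavpt_generator(rng.normal(size=(2, 8)), rng.normal(size=(5, 8)),
                    rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
                    rng.normal(size=(8, 8)))
```

Each of the $K_N$ queries yields one class-aware prompt vector of dimension $d$, which is then prepended to the next visual transformer layer's input.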

To reinforce the effect of the class-aware visual prompt, we additionally introduce a $K$-way classifier on top of the $K_N$ outputs of the $LN$ layer, and a cross entropy loss is enforced on the $K$-way logits as follows:

$$\mathcal{L}_{ce}^{ca}=-\sum_{i}\mathbf{y}_{i}\log p_{i},\quad 1\leq i\leq K, \tag{8}$$

where $p_{i}$ denotes the $i$-th logit from classifying $LN(\mathbf{o}^{j}_{l})$, $K$ denotes the number of classes, and $\mathbf{y}$ denotes the one-hot encoding of the ground-truth target class. Note that only the $\mathbf{o}^{j}_{l}$ derived from the query $\mathbf{q}_{j}$ that corresponds to the ground-truth target class is classified.
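A minimal sketch of this class-aware loss, assuming a linear $K$-way head `W_cls` (a hypothetical name) applied to the normalized output derived from the ground-truth class's query:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def class_aware_ce(o_gt, W_cls, y):
    """Eq. (8) sketch: only the cross-attention output o_gt derived from
    the ground-truth class's query is classified by a K-way linear head;
    a standard cross entropy against label y is returned."""
    logits = layer_norm(o_gt) @ W_cls       # (K,) class logits
    p = np.exp(logits - logits.max())
    p /= p.sum()                            # softmax probabilities
    return -np.log(p[y])

# Toy example: d = 8, K = 3 classes, ground-truth class 1.
rng = np.random.default_rng(0)
loss = class_aware_ce(rng.normal(size=8), rng.normal(size=(8, 3)), y=1)
```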

As the image class token embedding in deeper layers usually contains more task-related semantic information, the class-aware visual prompt is only applied to the last few layers of the image encoder in our implementation.


Figure 4: Main results on the 11 datasets with 1, 2, 4, 8, and 16 shots with ViT-B/32. Note that we also compare our methods with CPT [33] on average accuracy.

III-D Training of DPT

A cross entropy loss is adopted to minimize the distance between the ground-truth annotation and the prediction probability computed by Equation (2):

$$\mathcal{L}_{ce}=-\sum_{i}\mathbf{y}_{i}\log p(y=i\mid\mathbf{x}^{\prime\prime}),\quad 1\leq i\leq K, \tag{9}$$

where ๐ฒ ๐ฒ{\bf y}bold_y denotes the ground-truth annotation, pโข(y=iโˆฃ๐ฑโ€ฒโ€ฒ)๐‘ ๐‘ฆ conditional ๐‘– superscript ๐ฑโ€ฒโ€ฒp(y=i\mid{\bf x}^{\prime\prime})italic_p ( italic_y = italic_i โˆฃ bold_x start_POSTSUPERSCRIPT โ€ฒ โ€ฒ end_POSTSUPERSCRIPT ) denotes the predicted probability from Equation(2), and ๐ฑโ€ฒโ€ฒsuperscript ๐ฑโ€ฒโ€ฒ{\bf x}^{\prime\prime}bold_x start_POSTSUPERSCRIPT โ€ฒ โ€ฒ end_POSTSUPERSCRIPT is the final obtained image feature,

$$\mathbf{x}^{\prime\prime}=f\left(\left[\mathbf{s}_{0},\mathbf{P}_{0},\cdots,\mathbf{P}_{l},\tilde{\mathbf{P}}^{j}_{l+1},\cdots,\tilde{\mathbf{P}}^{j}_{L},\mathbf{E}_{0}\right]\right). \tag{10}$$

The total loss function combines the two cross-entropy losses with a balancing hyperparameter ฮฑ ๐›ผ\alpha italic_ฮฑ as follows:

$$\mathcal{L}_{total}=\alpha\mathcal{L}_{ce}^{ca}+\mathcal{L}_{ce}. \tag{11}$$
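Equations (9) and (11) amount to a weighted sum of two cross entropy terms; a toy numerical sketch (the logits and $\alpha$ below are illustrative, not values from the paper):

```python
import numpy as np

def cross_entropy(logits, y):
    # Softmax followed by negative log-likelihood of the true class y.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[y])

alpha = 0.5                                  # balancing weight (illustrative)
logits_main = np.array([2.0, 0.5, -1.0])     # from the similarity head, Eq. (2)
logits_ca = np.array([1.5, 0.2, -0.5])       # from the K-way CAVPT classifier
y = 0                                        # ground-truth class index
loss_total = alpha * cross_entropy(logits_ca, y) + cross_entropy(logits_main, y)
```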

III-E Warm-up Strategy

To accelerate the training process, we adopt a general knowledge-guided warm-up strategy in the first few epochs of training. Considering that the CLIP model stores general knowledge, we train our model to learn from zero-shot CLIP. The loss function used in the first few epochs is

$$\mathcal{L}=\mathcal{L}_{coop}+\mathcal{L}_{vpt}+\beta\mathcal{L}_{ce}+\alpha\mathcal{L}_{ce}^{ca}, \tag{12}$$

where $\mathcal{L}_{coop}$ is the loss function used for CoOp training, $\mathcal{L}_{vpt}$ is the loss function used in VPT training, and $\mathcal{L}_{ce}$ is the loss function used in VLP training; $\beta$ is a balancing hyperparameter. For $\mathcal{L}_{coop}$, we use the cross entropy loss to minimize the distance between the ground-truth annotation and the prediction probability computed by Equation (2):

$$\mathcal{L}_{coop}=-\sum_{i}\mathbf{y}_{i}\log p(y=i\mid\mathbf{x}),\quad 1\leq i\leq K. \tag{13}$$

For $\mathcal{L}_{vpt}$, the predicted probability is computed by Equation (1) instead of Equation (2):

$$\mathcal{L}_{vpt}=-\sum_{i}\mathbf{y}_{i}\log p(y=i\mid\mathbf{x}^{\prime\prime}),\quad 1\leq i\leq K. \tag{14}$$

By changing the loss function $\mathcal{L}_{ce}$ in the first few epochs of training to Equation (12), we use general knowledge to guide the warm-up process. During training, the proposed DPT keeps all parameters of both the image and text encoders fixed, while optimizing the text prompt, the visual prompt, and the parameters for generating the class-aware visual prompt.
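For concreteness, the warm-up objective of Equation (12) is a plain weighted sum of the four branch losses (all numbers below are illustrative, not tuned values from the paper):

```python
# Toy sketch of Eq. (12): the warm-up objective sums the CoOp, VPT, and
# VLP cross entropy losses plus the class-aware loss, with balancing
# weights alpha and beta.
l_coop, l_vpt, l_ce, l_ca = 0.9, 1.1, 0.8, 1.3   # hypothetical branch losses
alpha, beta = 0.5, 0.5                           # illustrative weights
warmup_loss = l_coop + l_vpt + beta * l_ce + alpha * l_ca  # = 3.05
```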

III-F Discussion on CAVPT

Fig. 3 illustrates the detailed computation process of the proposed class-aware visual prompts. As shown in Fig. 3, the CAVPT generator takes two types of inputs: text prompt features from the text side, which encode task-related information, and image patch embeddings from the image side, which represent visual instance information. First, the CAVPT generator performs cross-attention between the text prompt features and the image patch embeddings, where the query vectors are mapped from the text prompt features while the keys and values are derived from the image patch embeddings. Through the cross-attention operation, image patch embeddings that carry more semantic information about objects belonging to the classes of the downstream task are highlighted, so the outputs of the cross-attention contain more features of the ground-truth objects. Our class-aware visual prompts are then generated with an additional "Add & Norm" operation, similar to a typical transformer layer. As the class-aware visual prompts include richer semantic features of the ground-truth target objects, the final image feature, which is computed by absorbing information from both the image patch embeddings and the class-aware visual prompts, can concentrate more on the classes of the downstream task.

TABLE II: Results of 11 datasets under 16-shots setting with ViT-B/16.

| Methods | EuroSAT | Caltech101 | Oxford Flowers | Food101 | FGVC Aircraft | DTD | OxfordPets | Stanford Cars | Sun397 | UCF101 | ImageNet | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ZSCLIP [1] | 47.69 | 93.75 | 70.69 | 85.97 | 24.81 | 43.09 | 89.07 | 65.55 | 62.61 | 67.54 | 64.51 | 65.03 |
| CoOp [9] | 83.74 | 95.17 | 96.73 | 84.17 | 44.06 | 69.60 | 92.07 | 82.73 | 74.54 | 82.59 | 71.62 | 79.73 |
| CoCoOp [15] | 72.07 | 95.71 | 88.74 | 87.37 | 30.09 | 62.53 | 93.33 | 71.60 | 72.36 | 77.90 | 70.38 | 74.73 |
| ProGrad [26] | 84.29 | 95.89 | 96.30 | 86.68 | 41.23 | 68.83 | 93.25 | 81.71 | 75.10 | 81.16 | 71.94 | 79.67 |
| ProDA [27] | 85.17 | 96.23 | 97.54 | 87.29 | 44.40 | 72.46 | 93.42 | 83.89 | 77.19 | 85.12 | 72.73 | 81.40 |
| VPT | 92.67 | 96.27 | 96.59 | 87.03 | 51.11 | 71.26 | 92.76 | 81.44 | 72.93 | 85.19 | 69.98 | 81.57 |
| VLP | 91.87 | 96.08 | 97.37 | 84.57 | 52.99 | 72.20 | 93.11 | 85.62 | 74.48 | 86.36 | 72.46 | 82.46 |
| DPT | 92.10 | 96.06 | 97.59 | 85.00 | 57.85 | 72.65 | 93.45 | 88.24 | 74.29 | 85.31 | 72.49 | 83.18 |


Figure 5: Main results on the 11 datasets with 1, 2, 4, 8, and 16 shots with ViT-B/16. Note that we also compare our methods with CPT [33] on average accuracy.
