Title: Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need

URL Source: https://arxiv.org/html/2303.07338

  1. 1 Introduction
  2. 2 Related Work
     1. 2.1 Class-Incremental Learning (CIL)
     2. 2.2 CIL with Pre-Trained Models
     3. 2.3 Parameter-Efficient Tuning for Pre-Trained Models
  3. 3 From Old Classes to New Classes
     1. 3.1 Class-Incremental Learning
     2. 3.2 Adaptivity and Generalizability in Class-Incremental Learning
  4. 4 Aper: AdaPt and mERge PTMs for CIL
     1. 4.1 Training Procedure of Aper
     2. 4.2 Adapting the PTM
     3. 4.3 Discussions on related concepts
  5. 5 Experiments
     1. 5.1 Implementation Details
     2. 5.2 Benchmark Comparison
     3. 5.3 Ablation Study
        1. 5.3.1 Downscale features
        2. 5.3.2 Sub-modules
        3. 5.3.3 Parameter scale
        4. 5.3.4 Influence of Adapting Stages
        5. 5.3.5 Different PTMs
     4. 5.4 Visualization of Incremental Sessions
     5. 5.5 Experiments with Multiple Domains
  6. 6 Conclusion
  7. 7 Acknowledgements
  8. 8 Data Availability Statement

Da-Wei Zhou, Zi-Wen Cai, Han-Jia Ye, De-Chuan Zhan: School of Artificial Intelligence, National Key Laboratory for Novel Software Technology, Nanjing University, China. Emails: zhoudw@lamda.nju.edu.cn, caizw@lamda.nju.edu.cn, yehj@lamda.nju.edu.cn, zhandc@nju.edu.cn

Ziwei Liu: S-Lab, Nanyang Technological University, Singapore. Email: ziwei.liu@ntu.edu.sg

Han-Jia Ye and Ziwei Liu are corresponding authors. Work done when Da-Wei Zhou was a visiting scholar at NTU.

Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need

Da-Wei Zhou Zi-Wen Cai Han-Jia Ye De-Chuan Zhan Ziwei Liu

(Received: date / Accepted: date)

Abstract

Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones. Traditional CIL models are trained from scratch to continually acquire knowledge as data evolves. Recently, pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL. In contrast to traditional methods, PTMs possess generalizable embeddings, which can be easily transferred for CIL. In this work, we revisit CIL with PTMs and argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transfer. 1) We first reveal that a frozen PTM can already provide generalizable embeddings for CIL. Surprisingly, a simple baseline (SimpleCIL), which continually sets the classifiers of the PTM to prototype features, can beat the state-of-the-art even without training on the downstream task. 2) Due to the distribution gap between pre-trained and downstream datasets, the PTM can be further cultivated with adaptivity via model adaptation. We propose AdaPt and mERge (Aper), which aggregates the embeddings of the PTM and adapted models for classifier construction. Aper is a general framework that can be orthogonally combined with any parameter-efficient tuning method, holding the advantages of the PTM's generalizability and the adapted model's adaptivity. 3) Additionally, considering that previous ImageNet-based benchmarks are unsuitable in the era of PTMs due to data overlap, we propose four new benchmarks for assessment, namely ImageNet-A, ObjectNet, OmniBenchmark, and VTAB. Extensive experiments validate the effectiveness of Aper with a unified and concise framework. Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.

Keywords:

Class-Incremental Learning, Pre-Trained Models, Continual Learning, Catastrophic Forgetting

1 Introduction

With the advancement of deep learning, deep models have achieved impressive feats in many fields [zhong2021neighborhood; liu2017cross; akbari2023alternating; wang2019learning; jaimes2007multimodal; ning2024moiretracker; yang2002detecting; ning2023rf]. However, most research focuses on recognizing a limited number of classes in static environments. In the real world, applications often deal with streaming data with incoming new classes [gomes2017survey]. To address this issue, Class-Incremental Learning (CIL) has been proposed, which allows the model to learn from the evolving data and continuously build a unified classification model. Nevertheless, when new classes are added sequentially, the notorious catastrophic forgetting occurs [french1999catastrophic], which erases the previously learned knowledge. While typical CIL methods assume that the model is "trained from scratch," recent advancements in pre-training [han2021pre] have made Pre-Trained Models (PTMs) more accessible for designing models for downstream tasks. These PTMs are often trained on massive corpora [radford2021learning] or abundant images [deng2009imagenet; ridnik2021imagenet] with handcrafted tricks [steiner2021train], resulting in strong generalizability. Consequently, several methods [wang2022learning; wang2022dualprompt; wang2022s; villa2022pivot] propose to leverage PTMs for better incremental learning.

Powerful PTMs alleviate the burden of CIL [zhou2023class]. However, upon revisiting the objective of CIL, we find essential differences between these protocols. Without PTMs, CIL models are trained from random initialization to continually acquire the knowledge of new classes and build a unified embedding space, which requires adaptivity for sequential updating. In contrast, PTMs are trained with massive datasets, which makes it easier to achieve a powerful embedding space with strong generalizability. To use a human learning analogy, non-PTM methods aim to teach an infant to grow up and continually acquire knowledge through college, while PTM-based methods teach an experienced adult to do the same thing, which is much easier.

Figure 1: Comparison of different PTM-based CIL methods on the VTAB dataset. The X-axis stands for the number of tunable parameters, and the Y-axis represents the average accuracy. The radius stands for the training time. Although consuming more tunable parameters and training time, the current state-of-the-art (i.e., L2P and DualPrompt) still performs worse than the baseline method SimpleCIL. By contrast, our Aper consistently improves the baseline at tiny cost. For a fair comparison, all methods are based on pre-trained ViT-B/16-IN1K. SimpleCIL utilizes the training set to calculate the average embeddings.

To evaluate the generalizability of PTMs, we formulate a CIL task using the VTAB [zhai2019large] dataset and test the performance of state-of-the-art PTM-based methods [wang2022dualprompt; wang2022learning] with a pre-trained ViT-B/16-IN1K in Figure 1. As a comparison, we present a simple baseline, SimpleCIL, to evaluate the quality of the pre-trained features. With the pre-trained embedding function frozen, SimpleCIL sets the classifier weights to the average embeddings [snell2017prototypical] of each new class for classification. The average embeddings are calculated with training images. In this way, SimpleCIL serves as a direct indicator of the quality of pre-trained features. If PTMs possess generalizable features, directly matching the average pattern to each query instance could also achieve competitive results. Surprisingly, Figure 1 shows that SimpleCIL outperforms the current SOTA by 5% even without any tuning on these downstream tasks, verifying the strong generalizability of PTMs in knowledge transfer.

Although PTMs are generalizable for CIL, a domain gap may still exist between the pre-trained and incremental datasets [zhou2022domain; you2020co]. For instance, an ImageNet pre-trained model may not generalize well to out-of-distribution [hendrycks2021natural] or specialized tasks [alfassy2022feta]. Under such circumstances, freezing the embedding for knowledge transfer is not a "panacea." Accordingly, adaptivity becomes essential to enable the model to grasp task-specific features. Nevertheless, sequentially tuning the PTM harms its structural information and weakens its generalizability [kumarfine], leading to irreversible forgetting of previous knowledge. Is there a way to unify the generalizability of the PTM with the adaptivity of the adapted model?

In this paper, we present AdaPt and mERge (Aper) for CIL, which employs the PTM to achieve generalizability and adaptivity in a unified framework. To improve adaptivity, we adapt the PTM in the first incremental stage via parameter-efficient tuning. Adapting the model helps obtain task-specific features and fills the domain gap between the PTM and the incremental data. We then concatenate the adapted model with the PTM to extract average embeddings as the classifier, thereby maintaining generalizability. Aper restricts model tuning to the first stage, striking a balance between adaptivity and generalizability. Moreover, typical ImageNet-based CIL benchmarks are unsuitable for evaluation due to the overlap between pre-trained and downstream tasks. Therefore, we benchmark PTM-based CIL with four new datasets that have large domain gaps with the pre-trained data. Extensive experiments under various settings demonstrate the effectiveness of Aper. Our main contributions can be summarized as follows:

  • With extensive empirical evaluations, we reveal that a simple baseline (i.e., SimpleCIL) can directly transfer the strong generalizability of PTMs to CIL, even outperforming the current state-of-the-art without training on the downstream task;
  • We propose Aper, which employs the PTM to achieve generalizability and adaptivity in a unified framework. It enjoys the generalizability of PTMs via a prototype-based classifier and strong adaptivity to downstream tasks via model adaptation. Due to its uniformity, Aper can be applied to different network structures and different tuning techniques;
  • Due to the overlap between pre-trained data and traditional CIL benchmarks, we benchmark pre-trained model-based CIL with several new datasets that have large domain gaps with ImageNet. Extensive experiments on these benchmark datasets verify Aper's state-of-the-art performance.

2 Related Work

2.1 Class-Incremental Learning (CIL)

Class-incremental learning enables a learning system to continually incorporate new concepts without forgetting old ones [zhou2023class; wang2023comprehensive; masana2022class; zhang2023vision; li2023dpps; cai2024single; yang2024exploring; li2023configure; ding2023structural; zheng2023preserving; zheng2024multi]. Typical CIL methods can be divided into several categories. Exemplar-based methods save and replay exemplars from old classes to recover former knowledge [aljundi2019gradient; chaudhry2018riemannian; iscen2020memory; liu2020mnemonics]. Apart from directly saving exemplars, other methods save features [zhao2021memory; iscen2020memory; zhu2021prototype] or use generative models [shin2017continual; jiang2021ib; smith2021always; gao2023ddgr] to construct the memory. Knowledge distillation-based methods aim to align the outputs of the old and new models during updating, thereby maintaining knowledge of old concepts [li2017learning; rebuffi2017icarl; douillard2020podnet; zhang2020class; hu2021distilling]. The alignment can be built in several aspects, resulting in different optimization targets. iCaRL [rebuffi2017icarl] and LwF [li2017learning] utilize logit distillation, which requires the output logits on old classes to be the same. LUCIR [hou2019learning] utilizes feature distillation and forces the output features to be the same across models. Following works distill other feature products to resist forgetting, e.g., attention maps [dhar2019learning], weighted feature maps [kang2022class], pooled features [douillard2020podnet], causal effects [hu2021distilling], subspace features [simon2021learning], and spatial/temporal features [zhao2021video]. Additionally, other works consider distilling the relational information among a group of instances [dong2021few; gao2022rdfcil; tao2020few; dong2023heterogeneous]. The third group finds the inductive bias in the incremental model and designs rectification algorithms for unbiased predictions. BiC [wu2019large] and IL2M [belouadah2019il2m] calibrate the logit scales between old and new classes to resist forgetting former classes. WA [zhao2020maintaining] directly normalizes the fully connected layer to alleviate its influence on the final prediction. SDC [yu2020semantic] estimates the prototype drift of former classes via new class instances. TEEN [wang2024few] designs a prototype calibration process that adjusts the few-shot prototypes of new classes for better classification. Following works also consider rectifying biased BN statistics [pham2021continual] and feature representations [shi2022mimicking]. Recently, network expansion-based methods have shown competitive performance; they can be further divided into neuron-wise, backbone-wise, and token-wise expansion. Neuron-wise expansion [yoon2018lifelong; xu2018reinforced] aims to expand the network's width to enhance its representation ability. Backbone-wise expansion methods [yan2021dynamically; wang2022foster; zhou2022model; wang2023beef] build a holistic embedding by training a separate backbone for each new task and aggregating them as the final representation. Finally, token-wise expansion [douillard2022dytox; wang2022dualprompt; wang2022learning; wang2022s] adds lightweight tokens to adapt the model while preserving its knowledge. CIL algorithms are also widely adopted in other real-world applications, e.g., federated learning [dong2022federated], semantic segmentation [cermelli2022incremental], text-to-image diffusion [sun2024create], and object detection [dong2021i3dol; perez2020incremental].

2.2 CIL with Pre-Trained Models

Pre-trained model-based CIL [zhou2024continual] is becoming popular with the increasing prevalence of pre-trained models [dosovitskiy2020image; radford2021learning]. The aim is to sequentially adjust the PTM to a stream of data with new classes without forgetting. L2P [wang2022learning] applies visual prompt tuning [jia2022visual] to CIL based on the pre-trained Vision Transformer [dosovitskiy2020image] and learns a prompt pool to select instance-specific prompts. During training, L2P retrieves the prompts nearest to the query instance and appends them to get the instance-specific embedding. DualPrompt [wang2022dualprompt] extends L2P with general and expert prompts. Specifically, general prompts are equally assigned to all tasks, while expert prompts are selected for a specific task via the prompt retrieval process. Unlike the key-value search in L2P, CODA-Prompt [smith2023coda] improves the prompt selection process with an attention mechanism so that the reweighted prompt can reflect the task-specific information of all seen classes. Furthermore, DAP [jung2023generating] learns a prompt generator that produces instance-specific prompts instead of relying on the complex prompt retrieval process. SLCA [zhang2023slca] explores the classifier rectification process [zhu2021prototype] during model updating. ESN [wang2022isolation] adopts an anchor-based energy self-normalization strategy to aggregate multiple pre-trained classifiers. CPP [li2024steering] designs task-specific prompt tuning with a contrastive learning objective. LAE [gao2023unified] utilizes an exponential moving average (EMA) between online and offline models to resist forgetting. Although LAE also considers learning multiple models for CIL, our work differs from it in the inference format and the updating policy. Apart from the single modality for visual recognition, recent research also involves incremental learning of pre-trained vision-language models [yu2022coca; yuan2021florence; tschannen2022image]. When changing ViT into CLIP [radford2021learning], S-Prompts [wang2022s] and Pivot [villa2022pivot] extend L2P by learning prompts for both text and image modalities [zhou2022learning]. A contemporary work [liu2023large] explores the application of PTMs in class-incremental novel class discovery [roy2022class], showing that a frozen PTM can detect and learn new classes with high performance. However, apart from the frozen PTM, this manuscript also explores the effect of downstream data in adapting the model, aiming to unify the generalizability of the PTM and the adaptivity to the downstream data.

2.3 Parameter-Efficient Tuning for Pre-Trained Models

Parameter-efficient tuning aims to adapt the pre-trained model to downstream tasks by tuning only a small number of (extra) parameters. Compared to fully finetuning, parameter-efficient tuning obtains competitive or even better performance at a much lower cost. Visual prompt tuning (VPT) [jia2022visual] prepends tunable prefix tokens [li2021prefix] to the input or hidden layers. LoRA [hulora] learns low-rank matrices to approximate parameter updates. AdaptFormer [chenadaptformer] learns extra adapter [rebuffi2017learning] modules with downsize and upsize projections. AdapterFusion [pfeiffer2021adapterfusion] merges the learned adapters with a fusion module. SSF [lianscaling] addresses the scaling and shifting operations for model tuning. BitFit [zaken2021bitfit] only tunes the bias terms in the pre-trained model, and FacT [jie2023fact] tensorizes the weights of each ViT into a single 3D tensor and updates it during finetuning. Apart from additional modules in the network, visual prompting [bahng2022visual] proposes learning tunable parameters in the input space. MAM-Adapter [hetowards] formulates these works in a unified framework, and NOAH [zhang2022neural] searches for the optimal design of prompt modules for downstream tasks. Apart from tuning a pre-trained Vision Transformer, CoOp [zhou2022learning] and CoCoOp [zhou2022conditional] explore the application of prompt tuning to CLIP [radford2021learning] via learning textual prompts. Maple [khattak2023maple] further promotes strong coupling between the vision-language prompts to ensure mutual synergy.

3 From Old Classes to New Classes

3.1 Class-Incremental Learning

CIL aims to learn from an evolving data stream with new classes to build a unified classifier [rebuffi2017icarl]. There is a sequence of $B$ training tasks $\{\mathcal{D}^{1},\mathcal{D}^{2},\cdots,\mathcal{D}^{B}\}$, where $\mathcal{D}^{b}=\{(\mathbf{x}_{i}^{b},y_{i}^{b})\}_{i=1}^{n_{b}}$ is the $b$-th incremental step with $n_{b}$ instances. Here, the training instance $\mathbf{x}_{i}^{b}\in\mathbb{R}^{D}$ belongs to class $y_{i}\in Y_{b}$, where $Y_{b}$ is the label space of task $b$, and $Y_{b}\cap Y_{b'}=\varnothing$ for $b\neq b'$. During the $b$-th training stage, we can only access data from $\mathcal{D}^{b}$ for model updating. This paper focuses on the exemplar-free CIL setting [zhu2021prototype; wang2022learning], where no historical data can be fetched for rehearsal. The goal of CIL is to build a unified model for all seen classes incrementally, i.e., acquiring knowledge from new classes while preserving knowledge from former ones. The model's capability is evaluated over all seen classes $\mathcal{Y}_{b}=Y_{1}\cup\cdots\cup Y_{b}$ after each incremental task. Formally, the target is to fit a model $f(\mathbf{x}):X\rightarrow\mathcal{Y}_{b}$ that minimizes the empirical risk across all testing datasets:

$$\frac{1}{N}\sum_{(\mathbf{x}_{j},y_{j})\in\mathcal{D}_{t}^{1}\cup\cdots\cup\mathcal{D}_{t}^{b}}\ell\left(f(\mathbf{x}_{j}),y_{j}\right), \tag{1}$$

where $\ell(\cdot,\cdot)$ measures the discrepancy between prediction and ground-truth label, $\mathcal{D}_{t}^{b}$ denotes the testing set of task $b$, and $N$ is the number of instances. A good CIL model satisfying Eq. (1) has discriminability among all classes, which strikes a balance between learning new classes and remembering old ones.

Following [wang2022learning; wang2022dualprompt; zhou2024continual], we assume the availability of a model pre-trained (e.g., a ViT [dosovitskiy2020image] or ResNet [he2016deep]) on ImageNet [deng2009imagenet], which we use as the initialization of $f(\mathbf{x})$. For clarity, we decouple the deep model into two parts: $f(\mathbf{x})=W^{\top}\phi(\mathbf{x})$, where $\phi(\cdot):\mathbb{R}^{D}\rightarrow\mathbb{R}^{d}$ is the embedding function and $W\in\mathbb{R}^{d\times|\mathcal{Y}_{b}|}$ is the classification head. We denote the classifier for class $k$ as $\mathbf{w}_{k}$: $W=[\mathbf{w}_{1},\cdots,\mathbf{w}_{|\mathcal{Y}_{b}|}]$. For convolutional networks, we refer to the features after pooling as $\phi(\mathbf{x})$. In a plain ViT, the input encoding layer transforms the image into a sequence of output features $\mathbf{x}_{e}\in\mathbb{R}^{L\times d}$, where $L$ is the sequence length. We assume the first token in $\mathbf{x}_{e}$ to be the [CLS] token to simplify notation. $\mathbf{x}_{e}$ is then fed into the subsequent layers (i.e., multi-head self-attention and MLP) to produce the final embeddings. For ViT, we treat the embedded [CLS] token as $\phi(\mathbf{x})$.
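To make this notation concrete, here is a minimal sketch of extracting $\phi(\mathbf{x})$ from a pre-trained ViT with the timm library. This is an illustration, not the paper's released code: the checkpoint name and the token-sequence output of `forward_features` are assumptions about the installed timm version.

```python
# A minimal sketch: phi(x) as the embedded [CLS] token of a ViT-B/16.
import timm
import torch

vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
vit.eval()

x = torch.randn(2, 3, 224, 224)           # dummy batch of images
with torch.no_grad():
    tokens = vit.forward_features(x)      # assumed shape: (B, L, d) token sequence
    phi_x = tokens[:, 0]                  # first token is [CLS]; phi(x) in R^d
print(phi_x.shape)                        # torch.Size([2, 768])
```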

Figure 2: Performance on new and old classes in CIL with a PTM. (a) Accuracy of new classes; (b) accuracy of old classes. Sequentially finetuning the model fills the domain gap and performs better on new classes, while freezing the model has better generalizability and performs better on old classes.

3.2 Adaptivity and Generalizability in Class-Incremental Learning

CIL with Adaptivity: Before introducing PTMs into CIL, models are trained from scratch to gradually acquire knowledge of new classes. The common solution is to update the incremental model with cross-entropy loss, which equips the model with adaptivity to adapt to new tasks:

$$\mathcal{L}=\sum_{(\mathbf{x}_{i},y_{i})\in\mathcal{D}^{b}}\ell\left(f(\mathbf{x}_{i}),y_{i}\right)+\mathcal{L}_{reg}, \tag{2}$$

where $\mathcal{L}_{reg}$ stands for the regularization terms to resist forgetting, e.g., knowledge distillation [hinton2015distilling; li2017learning] or parameter regularization [kirkpatrick2017overcoming].

CIL with Generalizability: With the introduction of PTMs to CIL [wang2022learning], continual learners are born with generalizability, which can be directly transferred to downstream tasks without learning. Correspondingly, we define a simple baseline, SimpleCIL, to transfer the PTM to incremental tasks. With the embedding function $\phi(\cdot)$ frozen throughout the learning process, we extract the average embedding (i.e., prototype [snell2017prototypical]) of each class:

$$\mathbf{p}_{i}=\frac{1}{K}\sum_{j=1}^{|\mathcal{D}^{b}|}\mathbb{I}(y_{j}=i)\,\phi(\mathbf{x}_{j}), \tag{3}$$

where $K=\sum_{j=1}^{|\mathcal{D}^{b}|}\mathbb{I}(y_{j}=i)$, and $\mathbb{I}(\cdot)$ is the indicator function. The averaged embedding represents the most common pattern of the corresponding class. We set the prototype as the classifier, i.e., $\mathbf{w}_{i}=\mathbf{p}_{i}$, to directly adjust the PTM for CIL. SimpleCIL demonstrates competitive performance in Figure 1, confirming the strong generalizability of pre-trained models.
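For illustration, the following is a minimal PyTorch sketch of SimpleCIL as described above: Eq. (3) for the prototypes, and prototype matching for prediction. `phi` stands for any frozen feature extractor returning (B, d) features; the cosine-similarity inference anticipates the classifier detailed in Section 4.1.

```python
# A minimal SimpleCIL sketch: frozen features, class prototypes as classifier.
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_prototypes(phi, loader, num_classes, dim):
    """Eq. (3): per-class average embedding over the current task's data."""
    protos = torch.zeros(num_classes, dim)
    counts = torch.zeros(num_classes)
    for x, y in loader:
        feats = phi(x)                        # (B, d) frozen features
        protos.index_add_(0, y, feats)        # accumulate per-class sums
        counts.index_add_(0, y, torch.ones_like(y, dtype=torch.float))
    return protos / counts.clamp(min=1).unsqueeze(1)

@torch.no_grad()
def predict(phi, x, prototypes):
    """Assign each query to the class with the most similar prototype."""
    feats = F.normalize(phi(x), dim=1)        # (B, d)
    W = F.normalize(prototypes, dim=1)        # (C, d)
    return (feats @ W.T).argmax(dim=1)        # cosine-similarity logits
```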

Generalizability vs. Adaptivity: Eq. (2) and Eq. (3) address different aspects of CIL models. The former aims to enhance adaptivity by enabling the model to be gradually tuned. By contrast, the latter highlights the model's generalizability by freezing it throughout the learning process. To understand their roles in CIL, we conduct an experiment on CIFAR100 with 20 incremental tasks and compare the performance of finetuning versus SimpleCIL. Both methods are based on pre-trained ViT-B/16-IN21K, and we separately report the performance on new ($Y_{b}$) and old ($\mathcal{Y}_{b-1}$) classes in Figure 2. Specifically, SimpleCIL relies on the generalizability of the PTM, which works competitively even without training on the target dataset. It can nevertheless be further improved to grasp task-specific features: finetuning shows better performance on new classes with the help of adaptivity. However, finetuning suffers catastrophic forgetting of old classes since the features continually change.

To summarize, these characteristics are two core aspects of CIL — adaptivity enables the model to bridge the domain gap between pre-training and incremental learning. At the same time, generalizability encourages knowledge transfer from pre-training to incremental learning. Therefore, both of them should be cultivated to facilitate CIL.

Figure 3: Illustration of Aper. Left: the training protocol of Aper. We adapt the PTM using the first-stage training set $\mathcal{D}^{1}$ and then concatenate the embedding functions of the PTM and the adapted model to maintain generalizability and adaptivity. The aggregated embedding function $[\phi^{*}(\cdot),\phi(\cdot)]$ is frozen throughout the following stages, and we extract the prototypes via Eq. (6) to set the classifier. Middle: adapting a pre-trained ViT for CIL. We provide VPT-Deep/Shallow, Scale & Shift, and Adapter for model adaptation. Right: adapting a pre-trained CNN for CIL. We provide BN tuning and Scale & Shift for model adaptation. Aper is a general framework that can be orthogonally combined with these adapting techniques. Red modules in the figure are trainable, while gray ones are frozen.

4 Aper: AdaPt and mERge PTMs for CIL

Motivated by the potential for enhancing generalizability and adaptivity, can we achieve these characteristics in a unified framework? Specifically, we aim to achieve this goal from two aspects. On the one hand, to bridge the domain gap between the PTM and downstream datasets, model adaptation is essential to move the PTM towards incremental data. On the other hand, since the adapted model may lose the generalizability of high-level features, we attempt to merge the adapted model and PTM into a unified network for future tasks. The merged embedding function is kept frozen throughout the incremental learning process, transferring the generalizable embedding of model sets to incoming new classes. In this way, generalizability and adaptivity are achieved in the unified framework. We first introduce the framework of Aper and then discuss the specific techniques for model adaptation.

4.1 Training Procedure of Aper

Although PTMs have discriminating features, there may exist a significant domain gap between the pre-trained dataset and the incremental data. For example, the PTM is optimized to capture the characteristics of ImageNet classes, while the incremental data stream may correspond to specialized data that requires domain knowledge or exhibits extensive concept drift from ImageNet. To bridge this gap, an adapting process can be developed with the incremental data:

$$f^{*}(\mathbf{x})=\mathcal{F}(f(\mathbf{x}),\mathcal{D},\Theta), \tag{4}$$

where the adapting algorithm $\mathcal{F}$ takes the current model $f(\mathbf{x})$ and the dataset $\mathcal{D}$ as input. It optimizes the parameter set $\Theta$ and produces the adapted model $f^{*}(\mathbf{x})$, which gains the domain-specific knowledge of the corresponding dataset. We introduce the variations of $\mathcal{F}$ in Section 4.2. If we could obtain all the incremental training sets at once, adapting the model via $\mathcal{F}(f(\mathbf{x}),\mathcal{D}^{1}\cup\mathcal{D}^{2}\cup\cdots\cup\mathcal{D}^{B},\Theta)$ would transfer the knowledge from the PTM to the incremental dataset and grasp the task-specific features for better performance.

However, since data in CIL arrive sequentially, we cannot hold all the training sets at once. Continuously adapting the model would consequently result in catastrophic forgetting (as shown in Figure 2(b)). Hence, an alternative choice is to adapt the model only in the first incremental stage:

$$f^{*}(\mathbf{x})=\mathcal{F}(f(\mathbf{x}),\mathcal{D}^{1},\Theta). \tag{5}$$

Since $\mathcal{D}^{1}$ is a subset of the incremental data stream, it also possesses domain-specific knowledge that could facilitate model adaptation. The tuning process enhances the adaptivity of the CIL model, and the next question is how to ensure generalizability. Since Eq. (5) forces the original generalizable features to become more specialized to the downstream task, high-level features irrelevant to $\mathcal{D}^{1}$ shall be overwritten and forgotten. Therefore, a better solution is to concatenate the features extracted by the PTM and the adapted model, i.e., $[\phi^{*}(\mathbf{x}),\phi(\mathbf{x})]$, where $\phi^{*}(\mathbf{x})$ and $\phi(\mathbf{x})$ stand for the adapted and pre-trained embedding functions, respectively.

To maintain generalizability, we freeze the concatenated embedding function $[\phi^{*}(\cdot),\phi(\cdot)]$ after adaptation and extract prototypes for the following classes:

$$\mathbf{p}_{i}=\frac{1}{K}\sum_{j=1}^{|\mathcal{D}^{b}|}\mathbb{I}(y_{j}=i)\,[\phi^{*}(\mathbf{x}_{j}),\phi(\mathbf{x}_{j})], \tag{6}$$

where $K=\sum_{j=1}^{|\mathcal{D}^{b}|}\mathbb{I}(y_{j}=i)$. Compared to Eq. (3), Eq. (6) contains additional information from the adapted model, which incorporates domain-specific features for better recognition. These prototypes reveal the most common patterns of the adapted and pre-trained models, ensuring both generalizability and adaptivity. We directly adopt the class prototype as the classifier weight, i.e., $\mathbf{w}_{i}=\mathbf{p}_{i}$, and utilize a cosine classifier for classification: $f(\mathbf{x})=\left(\frac{W}{\|W\|_{2}}\right)^{\top}\left(\frac{[\phi^{*}(\mathbf{x}),\phi(\mathbf{x})]}{\|[\phi^{*}(\mathbf{x}),\phi(\mathbf{x})]\|_{2}}\right)$. Based on the similarity between the instance embedding and the class prototypes, it assigns a higher probability to the class with a more similar prototype.
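A minimal sketch of the merged embedding and cosine classifier just described, reusing the `build_prototypes` helper from the SimpleCIL sketch; `phi_adapted` and `phi_pretrained` are assumed frozen extractors with matching output dimension $d$.

```python
# Merged embedding [phi*(x), phi(x)] and the cosine classifier of Section 4.1.
import torch
import torch.nn.functional as F

@torch.no_grad()
def merged_embed(phi_adapted, phi_pretrained, x):
    """Concatenate adapted and pre-trained features: (B, 2d)."""
    return torch.cat([phi_adapted(x), phi_pretrained(x)], dim=1)

@torch.no_grad()
def cosine_logits(feats, prototypes):
    """Normalized prototypes applied to the normalized merged embedding."""
    feats = F.normalize(feats, dim=1)      # (B, 2d)
    W = F.normalize(prototypes, dim=1)     # (C, 2d), rows are w_i = p_i
    return feats @ W.T                     # cosine similarity per class
```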

Effect of Adapt and Merge: We illustrate Aper in Figure 3 (left). Although $\mathcal{D}^{1}$ is a subset of the entire training set, adapting with it still helps transfer the PTM from the upstream dataset to the downstream task. The adapting process can be viewed as further pre-training, which adapts the PTM to the incremental dataset and bridges the domain gap. By merging the embedding functions of the PTM and the adapted model, the extracted features are more representative than either alone. Additionally, since the model is only trainable in the first incremental task, the efficiency of Aper is comparable to SimpleCIL, which does not require sequential tuning. On the other hand, since the model is frozen in the subsequent tasks, it does not suffer catastrophic forgetting of former concepts. We give the pseudo-code of Aper in Algorithm 1. Given the pre-trained model, we first adapt it with the first training set via Eq. (5) to obtain the adapted model. Afterward, we freeze both the pre-trained and adapted models and merge their embeddings. For each subsequent task, we get a new dataset and replace the classifier weights with the prototypical features (i.e., class centers). In the extreme case where the adaptation process in Eq. (5) does not change the PTM at all, Aper degrades to SimpleCIL, which guarantees the performance lower bound.

Algorithm 1 AdaPt and mERge (Aper) for CIL

Input: Incremental datasets $\{\mathcal{D}^{1},\mathcal{D}^{2},\cdots,\mathcal{D}^{B}\}$; pre-trained model $f(\mathbf{x})$;

Output: Updated model;

1: Adapt the model to $\mathcal{D}^{1}$ via Eq. (5); ▷ Model adapt
2: Freeze the embedding functions $\phi^{*}(\cdot)$ and $\phi(\cdot)$;
3: Merge the embeddings, i.e., $[\phi^{*}(\mathbf{x}),\phi(\mathbf{x})]$; ▷ Model merge
4: for $b=1,2,\cdots,B$ do ▷ Incremental learning
5:  Get the incremental training set $\mathcal{D}^{b}$;
6:  Extract the prototypes via Eq. (6);
7:  Replace the classifier with the prototypes;
8: return the updated model
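For concreteness, below is a minimal Python sketch of Algorithm 1 under the notation above. It is a sketch rather than the authors' released code: `adapt` stands for the Eq. (5) tuning step (and is assumed to return a tuned copy of the PTM), while `build_prototypes` and `merged_embed` are the illustrative helpers from the earlier sketches.

```python
# A minimal Aper training loop: adapt on D^1, then prototype-only updates.
import torch

def aper(pretrained, tasks, num_classes, dim):
    """tasks: list of (loader, class_ids) pairs for D^1 ... D^B."""
    adapted = adapt(pretrained, tasks[0][0])          # Eq. (5): tune on D^1 only
    for model in (adapted, pretrained):               # freeze both backbones
        for p in model.parameters():
            p.requires_grad_(False)

    phi = lambda x: merged_embed(adapted, pretrained, x)  # [phi*(x), phi(x)]
    prototypes = torch.zeros(num_classes, 2 * dim)
    for loader, class_ids in tasks:                   # incremental stages b=1..B
        protos = build_prototypes(phi, loader, num_classes, 2 * dim)
        prototypes[class_ids] = protos[class_ids]     # Eq. (6) for new classes
    return phi, prototypes                            # frozen embedding + classifier
```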

4.2 Adapting the PTM

To bridge the distribution gap between the pre-trained and incremental datasets, Aper's performance depends on an effective adapting algorithm $\mathcal{F}$. In this section, we discuss six specializations of $\mathcal{F}$ in Aper that can handle different types of PTMs, such as ViTs and CNNs.

Fully Finetune: This is a common solution when transferring the model to downstream tasks. It involves tuning all parameters in the adapting process, i.e., $\Theta=\theta_{\phi}\cup\theta_{W}$, and minimizing the discrepancy between the model's output and the ground truth:

$$\min_{\theta_{\phi}\cup\theta_{W}}\ \sum_{(\mathbf{x}_{j},y_{j})\in\mathcal{D}^{1}}\ell\left(f(\mathbf{x}_{j}),y_{j}\right). \tag{7}$$

However, the tuning cost could be relatively high for large-scale PTMs, e.g., ViTs. Therefore, some parameter-efficient tuning techniques can alleviate the tuning cost and be better solutions.

Visual Prompt Tuning (VPT) [jia2022visual]: This is a lightweight tuning technique for adapting ViTs, which only prepends some learnable prompts $\mathbf{P}\in\mathbb{R}^{p\times d}$ to form the extended features $[\mathbf{P},\mathbf{x}_{e}]$, where $\mathbf{x}_{e}$ denotes the encoded features of the input image. The extended features are then fed into the subsequent layers of the ViT to calculate the final embeddings. There are two variations of VPT: VPT-Deep, which prepends the prompts at every attention layer, and VPT-Shallow, which only prepends the prompts at the first layer. During optimization, the pre-trained weights of the embedding function are frozen, and only the prompts and the classification head are optimized, i.e., $\Theta=\theta_{\mathbf{P}}\cup\theta_{W}$.
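As a concrete illustration, here is a minimal VPT-Shallow sketch in PyTorch. It is an assumption-laden sketch rather than a reference implementation: `vit.encode` stands for whatever produces the (B, L, d) token sequence with [CLS] first, and `vit.blocks` for the transformer layers of a standard ViT.

```python
# A minimal VPT-Shallow sketch: learnable prompts prepended to the tokens.
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    def __init__(self, vit, prompt_len=5, dim=768):
        super().__init__()
        self.vit, self.prompt_len = vit, prompt_len
        for p in self.vit.parameters():              # pre-trained weights frozen
            p.requires_grad_(False)
        # learnable prompts P in R^{p x d}
        self.prompts = nn.Parameter(torch.randn(1, prompt_len, dim) * 0.02)

    def forward(self, x):
        tokens = self.vit.encode(x)                  # assumed: (B, L, d), [CLS] first
        prompts = self.prompts.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([prompts, tokens], dim=1) # extended features [P, x_e]
        for blk in self.vit.blocks:                  # assumed transformer blocks
            tokens = blk(tokens)
        return tokens[:, self.prompt_len]            # embedded [CLS] as phi(x)
```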

Scale & Shift (SSF) [lianscaling]: This technique aims to adjust the feature activations by scaling and shifting. It appends an extra SSF layer after each operation layer (i.e., MSA and MLP) and adjusts the output of these operations. Given the input $\mathbf{x}_{i}\in\mathbb{R}^{L\times d}$, the output $\mathbf{x}_{o}\in\mathbb{R}^{L\times d}$ is formulated as:

$$\mathbf{x}_{o}=\gamma\otimes\mathbf{x}_{i}+\beta, \tag{8}$$

where $\gamma\in\mathbb{R}^{d}$ and $\beta\in\mathbb{R}^{d}$ are the scale and shift factors, respectively, and $\otimes$ is the Hadamard product (element-wise multiplication). The model optimizes the SSF layers and the classifier, i.e., $\Theta=\theta_{SSF}\cup\theta_{W}$, to trace the features of new tasks.
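A minimal sketch of an SSF layer implementing Eq. (8); appending one such layer after each frozen operation layer and training only $\gamma$, $\beta$ (plus the classifier) recovers the scheme described above.

```python
# A minimal SSF layer: learnable per-channel scale and shift (Eq. (8)).
import torch
import torch.nn as nn

class SSF(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # scale, init to identity
        self.beta = nn.Parameter(torch.zeros(dim))   # shift, init to zero

    def forward(self, x):                            # x: (..., d)
        return x * self.gamma + self.beta            # broadcast over all tokens
```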

Adapter [houlsby2019parameter; chenadaptformer]: This is a bottleneck module which contains a down-projection $W_{\text{down}}\in\mathbb{R}^{d\times r}$ to reduce the feature dimension, a non-linear activation function, and an up-projection $W_{\text{up}}\in\mathbb{R}^{r\times d}$ to project back to the original dimension. We follow [chenadaptformer] to equip the original MLP structure in ViT with the adapter. We denote the input of the MLP layer as $\mathbf{x}_{\ell}$, and the output of AdaptMLP is formulated as:

$$\text{MLP}(\mathbf{x}_{\ell})+\text{ReLU}(\mathbf{x}_{\ell}W_{\text{down}})W_{\text{up}}. \tag{9}$$

With the pre-trained weights frozen, it optimizes the adapter and the classification head, i.e., $\Theta=\theta_{W_{\text{down}}}\cup\theta_{W_{\text{up}}}\cup\theta_{W}$.
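A minimal sketch of the AdaptMLP structure of Eq. (9): a trainable bottleneck branch added in parallel to a frozen MLP block of the ViT. Zero-initializing the up-projection, a common trick, makes the module start as an identity branch; this detail is our assumption, not stated in the text.

```python
# A minimal AdaptMLP sketch: frozen MLP plus a parallel bottleneck branch.
import torch
import torch.nn as nn

class AdaptMLP(nn.Module):
    def __init__(self, mlp, dim=768, r=16):
        super().__init__()
        self.mlp = mlp                                # frozen pre-trained MLP
        for p in self.mlp.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(dim, r, bias=False)     # W_down: d -> r
        self.up = nn.Linear(r, dim, bias=False)       # W_up:   r -> d
        nn.init.zeros_(self.up.weight)                # assumed: identity at start

    def forward(self, x):                             # Eq. (9)
        return self.mlp(x) + self.up(torch.relu(self.down(x)))
```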

Batch Normalization Tuning: If the PTM is a convolutional network, we can adjust the BN [ioffe2015batch] parameters. Since the running mean and variance in BN match the upstream data distribution, they can be unstable for downstream tasks. Correspondingly, we reset the running statistics in BN and adapt them to the current data via forward passes. No backpropagation is required, making this tuning quick and simple for the pre-trained model.
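A minimal sketch of this statistic re-estimation, assuming a standard PyTorch CNN with BatchNorm2d layers:

```python
# BN tuning: reset running statistics and re-estimate them by forward passes.
import torch
import torch.nn as nn

@torch.no_grad()
def tune_bn(model, loader):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()       # clear the upstream mean/variance
    model.train()                         # BN updates its stats in train mode
    for x, _ in loader:
        model(x)                          # forward passes; no backpropagation
    model.eval()
    return model
```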

Discussions: We visualize the adapting process of Aper in Figure 3. Compared to fully finetuning, parameter-efficient tuning adjusts the PTM towards the downstream task with lightweight modules. Besides, recent parameter-efficient tuning methods show stronger performance than fully finetuning, indicating stronger adaptivity in the adapting process. The adapted model can capture the specialized features in the incremental data, leading to better adaptivity. Since L2P and DualPrompt are based on pre-trained ViTs, they cannot be deployed with CNNs. In contrast, Aper is a general framework that efficiently handles diverse structures: it can be combined with VPT/SSF/Adapter for ViTs and with SSF/BN tuning for CNNs. Since Aper adopts the prototype-based classifier, the linear classifier $W$ is dropped after adaptation.

Table 1: Average and last performance comparison on seven datasets with ViT-B/16-IN21K as the backbone. 'IN-R/A' stands for 'ImageNet-R/A,' 'ObjNet' stands for 'ObjectNet,' and 'OmniBench' stands for 'OmniBenchmark.' The settings are CIFAR B0 Inc5, CUB B0 Inc10, IN-R B0 Inc5, IN-A B0 Inc10, ObjNet B0 Inc10, OmniBench B0 Inc30, and VTAB B0 Inc10; $\bar{\mathcal{A}}$ and $\mathcal{A}_B$ denote the average and last accuracy. The best performance is shown in bold.

| Method | CIFAR $\bar{\mathcal{A}}$ | CIFAR $\mathcal{A}_B$ | CUB $\bar{\mathcal{A}}$ | CUB $\mathcal{A}_B$ | IN-R $\bar{\mathcal{A}}$ | IN-R $\mathcal{A}_B$ | IN-A $\bar{\mathcal{A}}$ | IN-A $\mathcal{A}_B$ | ObjNet $\bar{\mathcal{A}}$ | ObjNet $\mathcal{A}_B$ | OmniBench $\bar{\mathcal{A}}$ | OmniBench $\mathcal{A}_B$ | VTAB $\bar{\mathcal{A}}$ | VTAB $\mathcal{A}_B$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Finetune | 38.90 | 20.17 | 26.08 | 13.96 | 21.61 | 10.79 | 21.60 | | | | | | | |
| Finetune Adapter [chenadaptformer] | 60.51 | 49.32 | 66.84 | 52.99 | 47.59 | 40.28 | 43.05 | | | | | | | |
| LwF [li2017learning] | 46.29 | 41.07 | 48.97 | 32.03 | 39.93 | 26.47 | 35.39 | | | | | | | |
| SDC [yu2020semantic] | 68.21 | 63.05 | 70.62 | 66.37 | 52.17 | 49.20 | 26.65 | | | | | | | |
| L2P [wang2022learning] | 85.94 | 79.93 | 67.05 | 56.25 | 66.53 | 59.22 | 47.16 | | | | | | | |
| DualPrompt [wang2022dualprompt] | 87.87 | 81.15 | 77.47 | 66.54 | 63.31 | 55.22 | 52.56 | | | | | | | |
| CODA-Prompt [smith2023coda] | 89.11 | 81.96 | 84.00 | 73.37 | 64.42 | 55.08 | 48.51 | | | | | | | |
| CPP [li2024steering] | 85.21 | 78.64 | 86.60 | 85.27 | 64.33 | 60.74 | 53.70 | | | | | | | |
| LAE [gao2023unified] | **92.47** | **87.62** | 83.13 | 77.78 | 69.05 | 63.17 | 57.19 | | | | | | | |
| SimpleCIL | 87.57 | 81.26 | 92.20 | **86.73** | 62.58 | 54.55 | 60.50 | | | | | | | |
| Aper w/ Finetune | 87.67 | 81.27 | 91.82 | 86.39 | 70.51 | 62.42 | 61.57 | | | | | | | |
| Aper w/ VPT-Shallow | 90.43 | 84.57 | 92.02 | 86.51 | 66.63 | 58.32 | 57.72 | | | | | | | |
| Aper w/ VPT-Deep | 88.46 | 82.17 | 91.02 | 84.99 | 68.79 | 60.48 | 60.59 | | | | | | | |
| Aper w/ SSF | 87.78 | 81.98 | 91.72 | 86.13 | 68.94 | 60.60 | **62.81** | | | | | | | |
| Aper w/ Adapter | 90.65 | 85.15 | **92.21** | **86.73** | **72.35** | **64.33** | 60.53 | | | | | | | |

4.3 Discussions on related concepts

There is a well-known concept in CIL, namely the "stability-plasticity dilemma" [grossberg2012studies; mermillod2013stability; mirzadeh2020understanding]. Specifically, "stability" denotes the ability of a continual learner to remember old knowledge, while "plasticity" refers to the ability to learn new concepts. These concepts are related to the "generalizability and adaptivity" discussed in this paper, but there are some main differences that need to be highlighted.

Firstly, the "stability-plasticity dilemma" mainly refers to training from scratch (i.e., from randomly initialized weights), where the model needs to balance learning new concepts against remembering old ones. These are two ultimate goals of continual learning, which do not conflict with the "generalizability and adaptivity" raised in this paper. Secondly, "generalizability and adaptivity" are new characteristics in the era of PTMs. Specifically, a randomly initialized model does not have such "generalizability" and cannot be directly applied to downstream tasks. However, continual learners are born with "generalizability" when starting from a PTM, and we observe that a simple baseline shows strong performance. Furthermore, we find that "generalizability" is insufficient for all downstream tasks, especially when the downstream tasks come from a different distribution. In this case, we need to enhance the PTM's "adaptivity" by further tuning it on the downstream task. Finally, by aggregating the features extracted by the pre-trained and adapted models, we unify these characteristics in a single model.

In summary, the “generalizability and adaptivity” in this paper is a new characteristic in class-incremental learning with pre-trained models. We aim to unify these characteristics in CIL and propose our Aper by aggregating the adapted and pre-trained models.

Figure 4: (a) CIFAR B50 Inc5; (b) CUB B100 Inc5; (c) ImageNet-A B100 Inc5; (d) ImageNet-R B100 Inc5; (e) OmniBenchmark B150 Inc15; (f) ObjectNet B100 Inc5; (g) CUB B0 Inc5; (h) ImageNet-A B0 Inc5; (i) CIFAR B0 Inc5. (a)~(f): incremental performance with ViT-B/16-IN1K as the backbone when half of the total classes are base classes. (g)~(i): incremental performance when using ResNet18 as the backbone. Since L2P and DualPrompt cannot be deployed with ResNet, we do not report their performance in (g)~(i). Aper consistently improves the performance of different backbones, i.e., ViT and CNN.

5 Experiments

This section compares Aper with state-of-the-art methods on benchmark datasets to show its superiority. Due to the overlap between pre-trained datasets and traditional class-incremental learning benchmarks, we also advocate four new benchmarks for evaluating PTM-based methods. Ablations and visualizations verify the effectiveness of Aper with new classes. We also explore the performance of different pre-trained models in class-incremental learning.

5.1 Implementation Details

Dataset: Following [wang2022dualprompt; yu2020semantic], we evaluate the performance on CIFAR100 [krizhevsky2009learning], CUB200 [WahCUB2002011], and ImageNet-R [hendrycks2021many]. Since PTMs are often trained with ImageNet21K [deng2009imagenet], evaluating PTM-based methods on ImageNet is meaningless. Hence, we advocate four new datasets that have large domain gaps with ImageNet, namely ImageNet-A [hendrycks2021natural], ObjectNet [barbu2019objectnet], OmniBenchmark [zhang2022benchmarking], and VTAB [zhai2019large]. Among them, ImageNet-A and ObjectNet contain challenging samples that ImageNet pre-trained models cannot handle, while OmniBenchmark and VTAB contain diverse classes from multiple complex realms. To construct the CIL tasks, we sample 200 classes from ObjectNet and ImageNet-A, and 300 from OmniBenchmark. We sample 5 datasets from VTAB, each containing 10 classes, to construct the cross-domain CIL setting. The aim of using these subsets is to ensure a simple split of training classes. Following [rebuffi2017icarl], we shuffle the classes with the same random seed and split them into 'B(ase)-$m$, Inc-$n$' settings: the first dataset contains $m$ classes, and each following dataset contains $n$ classes. $m=0$ means the total classes are equally divided into each task.

Comparison methods: We first compare to SOTA PTM-based CIL methods: L2P (wang2022learning), DualPrompt (wang2022dualprompt), CODA-Prompt (smith2023coda), CPP (li2024steering), and LAE (gao2023unified). Additionally, we modify classical CIL methods, i.e., LwF (li2017learning), SDC (yu2020semantic), iCaRL (rebuffi2017icarl), LUCIR (hou2019learning), DER (yan2021dynamically), FOSTER (wang2022foster), MEMO (zhou2022model), and FACT (zhou2022forward), to utilize the same PTM as the initialization. Apart from SimpleCIL, we also report the baseline of sequentially finetuning the model, denoted as Finetune.

Training details: We use PyTorch (paszke2019pytorch) and Pilot (sun2023pilot) to deploy all models on Tesla V100 GPUs with the same network backbone. As there are various PTMs publicly available (rw2019timm), we follow wang2022learning and wang2022dualprompt to choose the most representative ones, denoted as ViT-B/16-IN1K and ViT-B/16-IN21K. Both are pre-trained on ImageNet21K, while the former is additionally finetuned on ImageNet1K. During adaptation, we train the model with a batch size of 48 for 20 epochs and use SGD with momentum for optimization. The learning rate starts from 0.01 and decays with cosine annealing. The prompt length $p$ is 5 for VPT, and the projection dimension $r$ is 16 for Adapter.
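A minimal PyTorch sketch of this adaptation recipe might look as follows (the function name, and the assumption that only the parameter-efficient modules are left trainable, are ours):

```python
import torch
import torch.nn.functional as F

def adapt_first_stage(model, loader, epochs=20, lr=0.01):
    """Sketch of the recipe above: SGD with momentum and cosine
    annealing over 20 epochs, tuning only the parameters left
    trainable in `model` (model and loader are assumed to exist)."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, labels in loader:          # batch size 48 in the paper
            loss = F.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```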

Evaluation protocol: Following rebuffi2017icarl, we denote the accuracy after the $b$-th stage as $\mathcal{A}_b$. We use $\mathcal{A}_B$ (the performance after the last stage) and $\bar{\mathcal{A}} = \frac{1}{B}\sum_{b=1}^{B}\mathcal{A}_b$ (the average performance along incremental stages) as measurements.
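In code, both measurements reduce to a few lines (the helper below is purely illustrative):

```python
def incremental_metrics(stage_accs):
    """stage_accs[b] is the accuracy over all classes seen so far,
    measured after stage b+1; returns (A_B, average accuracy)."""
    return stage_accs[-1], sum(stage_accs) / len(stage_accs)

final_acc, avg_acc = incremental_metrics([92.1, 90.4, 88.7])  # toy values
```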

5.2 Benchmark Comparison

We report the performance against SOTA methods in Table 1, where all methods are based on the pre-trained ViT-B/16-IN21K. We also train these models with the pre-trained ViT-B/16-IN1K and show the incremental trend in Figure 4(a)-4(f). These data splits include settings with large and small base classes for a holistic evaluation.

Firstly, we can infer that the embeddings of PTMs are generalizable and can be directly applied for CIL to beat the SOTA. Specifically, the baseline SimpleCIL outperforms DualPrompt by 20% on CUB and 8% on ImageNet-A in terms of $\mathcal{A}_B$. However, strong PTMs can be further improved when adapted by Aper, as downstream tasks have a large domain gap with the pre-training dataset. Specifically, we find Aper consistently outperforms SimpleCIL on seven benchmark datasets. In contrast, sequentially finetuning the model suffers severe forgetting, which verifies the effectiveness of the adapt-and-merge protocol. L2P and DualPrompt suffer from forgetting because prompts are overwritten in the later stages and the linear layer’s weight norms become imbalanced. Since Aper only requires tuning the PTM in the first stage, it requires less training time and fewer extra parameters than L2P and DualPrompt, as shown in Figure 1. Among the variations of adapting techniques, we find SSF and Adapter are more efficient than VPT. We also compare to state-of-the-art traditional CIL methods and modify their backbones into pre-trained ViT for a fair comparison. However, we can infer from Table 2 that these methods work poorly without exemplars. Hence, Aper achieves the best performance among various CIL algorithms under this fair comparison.
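For concreteness, a sketch of the prototype-based classifier that SimpleCIL builds on a frozen encoder could look like this (the cosine-similarity decision rule and all names are our reading of the description, not the released implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_prototypes(encoder, loader, num_classes, feat_dim):
    """Average the frozen encoder's embeddings per class to obtain
    prototype classifier weights."""
    protos = torch.zeros(num_classes, feat_dim)
    counts = torch.zeros(num_classes)
    for images, labels in loader:
        feats = encoder(images)                       # [batch, feat_dim]
        protos.index_add_(0, labels, feats)
        counts.index_add_(0, labels, torch.ones(len(labels)))
    return protos / counts.clamp(min=1).unsqueeze(1)

@torch.no_grad()
def predict(encoder, protos, images):
    """Classify by cosine similarity to the class prototypes."""
    feats = F.normalize(encoder(images), dim=1)
    return (feats @ F.normalize(protos, dim=1).t()).argmax(dim=1)
```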

Apart from ViTs, Aper also works well with pre-trained CNNs. We adopt the ImageNet1K pre-trained ResNet18 (he2016deep) for evaluation and plot the incremental performance in Figure 4(g)-4(i). Results show that Aper consistently boosts the performance of pre-trained ViTs and CNNs. Specifically, we find that a simple BN tuning technique achieves better performance than fully or partially finetuning the ResNet.
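One plausible reading of such BN tuning, sketched below, is to freeze all weights except the BatchNorm affine parameters while letting the running statistics adapt to downstream data (this is an assumption about the exact recipe, not the authors’ stated implementation):

```python
import torch.nn as nn

def enable_bn_tuning_only(resnet):
    """Freeze every weight except the BatchNorm affine parameters, and
    keep BN layers in train mode so running statistics follow the
    downstream distribution (our assumed reading of 'BN tuning')."""
    for p in resnet.parameters():
        p.requires_grad = False
    for m in resnet.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.requires_grad = True
            m.bias.requires_grad = True
            m.train()
    return resnet
```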

Lastly, as shown in Table 1, the performance on typical benchmarks is approaching saturation as they have a small domain gap with ImageNet. By contrast, due to the large domain gap between our newly established benchmarks and ImageNet, there is still space for improvement, indicating the effectiveness and necessity of these new benchmarks.

Figure 5: Ablation study. (a) PCA projected features; (b) randomly sampled features; (c) projected dimension and accuracy; (d) compositional components; (e) number of parameters; (f) adapting stages. (a)-(c): we use PCA or random sampling to downscale the dimension of the aggregated embeddings. (d): we compare Aper to its sub-modules for ablation. (e): the number of total parameters of the compared methods; shaded bars denote parameters used during training but dropped during inference. (f): the accuracy trend as the number of adapting stages changes. Adapt-$T$ denotes that the model is adapted for the first $T$ incremental tasks; $T=0$ denotes SimpleCIL.

Table 2: Comparison to SOTA classical CIL methods with ViT-B/16-IN1K. All methods are deployed without exemplars.

| Method | ObjNet B0 Inc20 $\bar{\mathcal{A}}$ | ObjNet B0 Inc20 $\mathcal{A}_B$ | IN-A B0 Inc20 $\bar{\mathcal{A}}$ | IN-A B0 Inc20 $\mathcal{A}_B$ |
| --- | --- | --- | --- | --- |
| iCaRL | 33.43 | 19.18 | 29.22 | 16.16 |
| LUCIR | 41.17 | 25.89 | 31.09 | 18.59 |
| DER | 35.47 | 23.19 | 33.85 | 22.27 |
| FOSTER | 37.83 | 25.07 | 34.82 | 23.01 |
| MEMO | 38.52 | 25.41 | 36.37 | 24.46 |
| FACT | 60.59 | 50.96 | 60.13 | 49.82 |
| SimpleCIL | 62.11 | 51.13 | 59.67 | 49.44 |
| Aper w/ SSF | 68.75 | 56.79 | 63.59 | 52.67 |

5.3 Ablation Study

In this section, we conduct an ablation study to investigate the influence of each part of Aper, e.g., using sampled features, parts of its components, and different tuning stages. We also analyze the parameter counts of different methods.

5.3.1 Downscale features

Since the feature of Aper aggregates the PTM and the adapted model, it has a larger feature dimension than a vanilla PTM (e.g., 1536 versus 768). We conduct an ablation with Aper w/ SSF on ObjectNet B100 Inc5 to show whether all these dimensions are essential for CIL. Specifically, we train a PCA (pearson1901liii) model in the first stage to reduce the embedding dimension for the following stages. Denoting the target dimension as $k$, we train the PCA model $\text{PCA}([\phi^{*}(\mathbf{x}), \phi(\mathbf{x})]): \mathbb{R}^{d} \to \mathbb{R}^{k}$ and append it to the feature extractor. Hence, the features and prototypes are projected to $k$ dimensions. We plot the performance as $k$ changes in Figure 5(a). Notably, Aper obtains performance competitive with DualPrompt (with 768 dims) even when the features are projected to 30 dims.
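A sketch of this first-stage projection with scikit-learn follows (the placeholder features and helper name are ours; the authors’ exact pipeline may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

d, k = 1536, 30                              # concatenated dim, target dim
first_stage_feats = np.random.randn(5000, d)  # placeholder first-stage features
pca = PCA(n_components=k).fit(first_stage_feats)

def project(feats):
    """Applied to every later feature and prototype: R^d -> R^k."""
    return pca.transform(feats)
```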

Apart from the PCA projection, we also experiment with randomly sampling $k$ features from the original feature space and report the results in Figure 5(b). The conclusions are consistent with the former ones, showing that randomly sampling 200 dimensions of the concatenated space achieves the same performance scale as DualPrompt. We show the accuracy-dimension curves in Figure 5(c).
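The random-sampling variant is even simpler: draw $k$ coordinate indices once and reuse them everywhere (the seed below is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.RandomState(0)                    # arbitrary fixed seed
idx = rng.choice(1536, size=200, replace=False)   # pick 200 of 1536 dims once

def subsample(feats):
    """Reuse the same index set for all features and prototypes."""
    return feats[:, idx]
```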

5.3.2 Sub-modules

Since Aper concatenates the PTM and the adapted model, we conduct ablations on ImageNet-A Base100 Inc5 with ViT-B/16-IN21K to compare Aper w/ Finetune against its sub-modules. Specifically, we build SimpleCIL with $\phi(\cdot)$ and $\phi^{*}(\cdot)$, respectively, denoted as SimpleCIL-PTM and SimpleCIL-Adapted. The former represents the capability of the PTM, while the latter stands for the power of the adapted model; both are compositional modules of Aper. Besides, we build SimpleCIL on the concatenation of pre-trained ViT-B/16-IN21K and ViT-B/16-IN1K, denoted as SimpleCIL-21K+1K. It utilizes the aggregated features of two embedding functions and has the same dimension as Aper.
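The aggregated embedding itself is just a concatenation of the two extractors’ outputs, e.g.:

```python
import torch

@torch.no_grad()
def aggregated_embedding(adapted, ptm, images):
    """Concatenate the adapted embedding phi*(x) with the frozen
    pre-trained embedding phi(x), doubling the feature dimension."""
    return torch.cat([adapted(images), ptm(images)], dim=1)  # [B, 2d]
```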

As shown in Figure 5(d), SimpleCIL-PTM performs the worst among all variations, indicating that although pre-trained features are effective and generalizable, extracting features on the downstream task is still required for better representations. In comparison, SimpleCIL-Adapted outperforms SimpleCIL-PTM, indicating the importance of model adaptation and adaptivity. However, adapting the model also overwrites the high-level features, which reduces the model’s generalizability: the adapted model suffers more extensive performance degradation than vanilla SimpleCIL, indicating the effect of generalizability in resisting forgetting. Finally, Aper w/ Finetune outperforms all of these sub-modules with the help of unified adaptivity and generalizability.

5.3.3 Parameter scale

The parameter scale is another core factor influencing CIL algorithms in real-world applications, e.g., edge devices or mobile phones. In this section, we compare the parameter scale of different methods to investigate the possibility of real-world applications. In the comparison, all methods are based on the pre-trained ViT. For methods requiring backbone expansion (e.g., DER, MEMO, and FOSTER), we also use pre-trained ViT as the initialization of new backbones.

We list the total number of parameters of all compared methods in Figure 5(e), which indicates that Aper obtains better performance than the compared methods with the same or fewer parameters. Since L2P and DualPrompt maintain prompt pools, they rely on another pre-trained ViT as the ‘retriever’ to search for the instance-specific prompt; hence, Aper shares the same scale of total parameters as these methods. Additionally, since Aper utilizes parameter-efficient tuning techniques to obtain the adapted model, most of the parameters in the adapted model are shared with the pre-trained weights. Hence, the memory budget of Aper can be further reduced, which we leave for future work.

5.3.4 Influence of Adapting Stages

In Aper, we only adapt the pre-trained model in the first incremental stage with $\mathcal{D}^{1}$. There are two reasons: 1) sequentially tuning the model suffers catastrophic forgetting; 2) since we utilize a prototype-based classifier, tuning the model across multiple stages results in incompatible features between former and new prototypes.

In this section, we conduct an ablation to determine the influence of the number of adapting stages and report the results in Figure 5(f). We conduct the experiment on the CIFAR100 Base0 Inc10 setting with pre-trained ViT-B/16-IN21K, which yields 10 incremental stages in total. We denote the number of tuning stages as $T$ and train Aper w/ Adapter for the ablation. Specifically, we vary $T$ among $\{0, 1, 2, \cdots, 10\}$ to determine its influence on the final performance. In the first $T$ stages, we adapt the PTM incrementally with the adapter and replace the classifier with prototypes. Afterward, we freeze the encoding functions and only extract prototypes for the following stages. $T=0$ denotes vanilla SimpleCIL. To prevent forgetting, we freeze the classifier weights of former classes when learning new classes.
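A schematic of this ablation loop, with hypothetical helpers `adapt_fn` and `build_protos_fn` standing in for the adapter update and prototype extraction:

```python
def run_with_adapting_stages(ptm, task_loaders, T, adapt_fn, build_protos_fn):
    """Ablation sketch: adapt the backbone on the first T tasks only,
    then freeze it and merely append class prototypes afterwards.
    T = 0 reduces to SimpleCIL (the backbone is never adapted)."""
    model, classifier = ptm, []
    for t, loader in enumerate(task_loaders):
        if t < T:
            model = adapt_fn(model, loader)               # e.g., adapter tuning
        classifier.extend(build_protos_fn(model, loader))  # new-class prototypes
    return model, classifier
```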

As shown in the figure, tuning the model in the first stage only achieves the best performance among all settings: multi-stage tuning harms generalizability and results in incompatible features between former and new classes.

Figure 6: CIL with different kinds of PTMs on ImageNet-R Base0 Inc20. Aper consistently improves the performance of different PTMs. We report the best variation of Aper and its relative improvement over SimpleCIL.

5.3.5 Different PTMs

Observing the performance gap between ViT-B/16-IN21K and ViT-B/16-IN1K, we seek to explore different kinds of PTMs on ImageNet-R Base0 Inc20. We choose publicly available PTMs, i.e., ResNet18/50/152 (he2016deep), ViT-B/16-IN1K/21K, ViT-L/16-IN1K, ViT-B/16-DINO-v2 (oquab2023dinov2), ViT-B/16-SAM (chen2022vision), ViT-B/16-MAE (he2022masked), and ViT-B/16-CLIP (radford2021learning, image encoder) for a holistic evaluation, and report the results in Figure 6. We can draw three main conclusions. Firstly, pre-trained ViTs show better generalizability than ResNets by achieving better performance with SimpleCIL; the main reason lies in the more extensive training data and larger parameter counts. Secondly, larger ViTs generalize better than smaller ones, and ViTs trained with a supervised loss generally perform better than unsupervised ones; however, DINO-v2 shows the best performance due to its 1.2B training instances. Thirdly, owing to the massive training corpus and cross-modal information, CLIP performs better than ImageNet21K pre-trained ViTs. Finally, we find the best Aper variation consistently improves the performance of SimpleCIL for any PTM, thus validating its effectiveness.
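These backbones can be obtained, for instance, through the timm library the paper relies on; a loading sketch follows (exact model identifiers vary across timm versions, so treat the names as examples):

```python
import timm

# Frozen feature extractors via timm; num_classes=0 strips the head.
vit_in21k = timm.create_model('vit_base_patch16_224_in21k',
                              pretrained=True, num_classes=0).eval()
resnet18 = timm.create_model('resnet18', pretrained=True, num_classes=0).eval()
feat_dim = vit_in21k.num_features  # 768 for ViT-B/16
```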

Figure 7: (a) first stage; (b) second stage; (c) Grad-CAM. Top: visualization of the decision boundary on CIFAR100 between two incremental tasks. Dots represent old classes, and triangles stand for new classes; decision boundaries are shown with the shaded regions. Bottom: Grad-CAM visualizations of PTM and Aper; important regions are highlighted with warm colors.

5.4 Visualization of Incremental Sessions

In this section, we visualize the learned decision boundaries with t-SNE (van2008visualizing) on the CIFAR100 dataset between two incremental stages, as shown in Figure 7(a) and 7(b). We visualize the classes from the first and second incremental tasks with colorful dots and triangles; correspondingly, the class prototypes are represented by squares. As we can infer from these figures, the PTM works competitively and separates the instances well into their corresponding classes. The class prototypes are situated at the center of each class, verifying their representativeness for recognition. When extending the model from the first to the second stage, we find Aper performs well on both old and new classes. These visualizations verify the generalizability and adaptivity of Aper.
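Such a 2-D projection can be reproduced with scikit-learn’s t-SNE; the sketch below uses placeholder embeddings in place of the real extractor outputs:

```python
import numpy as np
from sklearn.manifold import TSNE

feats = np.random.randn(500, 768)    # placeholder instance embeddings
protos = np.random.randn(10, 768)    # placeholder class prototypes
points = TSNE(n_components=2, random_state=0).fit_transform(
    np.vstack([feats, protos]))      # project instances and prototypes jointly
emb_2d, proto_2d = points[:500], points[500:]
```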

We also visualize Grad-CAM (selvaraju2017grad) results on the OmniBenchmark dataset based on pre-trained ResNet18. Grad-CAM highlights the critical regions of the image for predicting the corresponding concept. The results, shown in Figure 7(c), indicate Aper concentrates more on task-specific features than vanilla PTMs. Hence, the visualizations verify the importance of adaptivity in PTM-based class-incremental learning.

Table 3: Experiments on domain-incremental learning and cross-domain CIL. All methods are implemented with pre-trained ViT-B/16-IN1K.

| Method | Office-Home $\bar{\mathcal{A}}$ | DomainNet $\bar{\mathcal{A}}$ |
| --- | --- | --- |
| L2P | 71.61 | 59.48 |
| DualPrompt | 70.85 | 58.26 |
| SimpleCIL | 72.18 | 75.85 |
| Aper w/ Finetune | 75.44 | 77.47 |
| Aper w/ VPT-Shallow | 71.44 | 63.26 |
| Aper w/ VPT-Deep | 71.06 | 61.27 |
| Aper w/ SSF | 80.81 | 83.14 |
| Aper w/ Adapter | 72.46 | 63.73 |

5.5 Experiments with Multiple Domains

This paper mainly considers the CIL setting, where all the data come from the same domain. However, we also consider two challenging tasks to investigate the performance under significant domain gaps. 1) Domain-incremental learning with the Office-Home dataset (venkateswara2017deep). Office-Home is a benchmark dataset for domain adaptation containing four domains, each consisting of 65 categories. The four domains are art, clip-art, product, and real-world, with an average of around 70 images per class and a maximum of 99 images in a class. We follow wang2022learning to organize the domain-incremental learning scenario, where each task contains the classes of a new domain. 2) Class-incremental learning with cross-domain data. We follow smith2023coda and split DomainNet (peng2019moment) into five tasks. DomainNet is a dataset of common objects in six different domains, all of which include 345 classes; the domains are clip-art, real, sketch, infograph, painting, and quickdraw. To construct a class-incremental learning setting, we split the dataset into five tasks, each containing 69 classes from an individual domain.

The above settings contain datasets from multiple domains, and we report the experimental results in Table 3. We adopt the same pre-trained ViT-B/16-IN1K for all compared methods. Although Aper is not specially designed for settings with multiple domains, it still shows competitive performance against L2P and DualPrompt in these cross-domain tasks.

6 Conclusion

Learning with incremental classes is of great importance in real-world applications; it requires adaptivity for model updating and generalizability for knowledge transfer. In this paper, we systematically revisit CIL with PTMs and draw three conclusions. Firstly, a frozen PTM can provide generalizable embeddings for CIL, enabling a prototype-based classifier to outperform the current state-of-the-art. Secondly, due to the distribution gap between pre-trained and downstream datasets, PTMs can be further harnessed to enhance their adaptivity. To this end, we propose Aper, which can be orthogonally combined with any parameter-efficient tuning method to unify generalizability and adaptivity for CIL. Lastly, due to data overlapping, traditional ImageNet-based benchmarks are unsuitable for evaluation in the era of PTMs. Hence, we propose four new benchmarks to evaluate PTM-based CIL methods. Extensive experiments verify Aper’s state-of-the-art performance. Future work includes exploring task-specific tuning methods and structures.

Limitations: The limitations are two-fold. Firstly, the model cannot make full use of exemplars, since the adapted model should fully reflect the downstream features; the problem turns into exemplar-based CIL if sufficient old-class instances are available, where adaptivity can be further addressed through data rehearsal. Secondly, since Aper is only adapted in the first incremental stage, it may face challenges when extensive domain gaps exist during the continual learning process, i.e., domain-incremental learning. In summary, extending Aper to these more challenging scenarios is an interesting direction for future work.

7 Acknowledgements

This work is partially supported by National Science and Technology Major Project (2022ZD0114805), Fundamental Research Funds for the Central Universities (2024300373), NSFC (62376118, 62006112, 62250069, 61921006), Collaborative Innovation Center of Novel Software Technology and Industrialization, China Scholarship Council, Ministry of Education, Singapore, under its MOE AcRF Tier 2 (MOET2EP20221-0012), NTU NAP, and the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative.

8 Data Availability Statement

Code is available at https://github.com/zhoudw-zdw/RevisitingCIL.

References

  • [1] Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, and Hartwig Adam. Alternating gradient descent and mixture-of-experts for integrated multimodal perception. arXiv preprint arXiv:2305.06324, 2023.
  • [2] Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Peter Staar, et al. Feta: Towards specializing foundational models for expert task applications. NeurIPS, pages 29873–29888, 2022.
  • [3] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. In NeurIPS, pages 11816–11825, 2019.
  • [4] Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv preprint arXiv:2203.17274, 2022.
  • [5] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 32, 2019.
  • [6] Eden Belouadah and Adrian Popescu. Il2m: Class incremental learning with dual memory. In ICCV, pages 583–592, 2019.
  • [7] Xiaoxu Cai, Jianwen Lou, Jiajun Bu, Junyu Dong, Haishuai Wang, and Hui Yu. Single depth image 3d face reconstruction via domain adaptive learning. Frontiers of Computer Science, 18(1), 2024.
  • [8] Fabio Cermelli, Dario Fontanel, Antonio Tavera, Marco Ciccone, and Barbara Caputo. Incremental learning in semantic segmentation from image labels. In CVPR, pages 4371–4381, 2022.
  • [9] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In ECCV, pages 532–547, 2018.
  • [10] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. NeurIPS, pages 16664–16678, 2022.
  • [11] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. In ICLR, 2022.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [13] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In CVPR, pages 5138–5146, 2019.
  • [14] Zhiyong Ding, Haibin Xie, Peng Li, and Xin Xu. A structural developmental neural network with information saturation for continual unsupervised learning. CAAI Transactions on Intelligence Technology, 8(3):780–795, 2023.
  • [15] Jiahua Dong, Yang Cong, Gan Sun, Bingtao Ma, and Lichen Wang. I3dol: Incremental 3d object learning without catastrophic forgetting. In AAAI, pages 6066–6074, 2021.
  • [16] Jiahua Dong, Wenqi Liang, Yang Cong, and Gan Sun. Heterogeneous forgetting compensation for class-incremental learning. In ICCV, pages 11742–11751, 2023.
  • [17] Jiahua Dong, Lixu Wang, Zhen Fang, Gan Sun, Shichao Xu, Xiao Wang, and Qi Zhu. Federated class-incremental learning. In CVPR, pages 10164–10173, 2022.
  • [18] Songlin Dong, Xiaopeng Hong, Xiaoyu Tao, Xinyuan Chang, Xing Wei, and Yihong Gong. Few-shot class-incremental learning via relation knowledge distillation. In AAAI, pages 1255–1263, 2021.
  • [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  • [20] Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, and Eduardo Valle. Podnet: Pooled outputs distillation for small-tasks incremental learning. In ECCV, pages 86–102, 2020.
  • [21] Arthur Douillard, Alexandre Ramé, Guillaume Couairon, and Matthieu Cord. Dytox: Transformers for continual learning with dynamic token expansion. In CVPR, pages 9285–9295, 2022.
  • [22] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • [23] Qiankun Gao, Chen Zhao, Bernard Ghanem, and Jian Zhang. R-DFCIL: relation-guided representation learning for data-free class incremental learning. In ECCV, pages 423–439, 2022.
  • [24] Qiankun Gao, Chen Zhao, Yifan Sun, Teng Xi, Gang Zhang, Bernard Ghanem, and Jian Zhang. A unified continual learning framework with general parameter-efficient tuning. In ICCV, pages 11483–11493, 2023.
  • [25] Rui Gao and Weiwei Liu. Ddgr: continual learning with deep diffusion-based generative replay. In ICML, pages 10744–10763, 2023.
  • [26] Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet. A survey on ensemble learning for data stream classification. CSUR, 50(2):1–36, 2017.
  • [27] Stephen T Grossberg. Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control, volume 70. Springer Science & Business Media, 2012.
  • [28] Xu Han, Zhengyan Zhang, Ning Ding, Yuxian Gu, Xiao Liu, Yuqi Huo, Jiezhong Qiu, Yuan Yao, Ao Zhang, Liang Zhang, et al. Pre-trained models: Past, present and future. AI Open, 2:225–250, 2021.
  • [29] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In ICLR, 2022.
  • [30] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022.
  • [31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [32] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, pages 8340–8349, 2021.
  • [33] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262–15271, 2021.
  • [34] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [35] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In CVPR, pages 831–839, 2019.
  • [36] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In ICML, pages 2790–2799, 2019.
  • [37] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2022.
  • [38] Xinting Hu, Kaihua Tang, Chunyan Miao, Xian-Sheng Hua, and Hanwang Zhang. Distilling causal effect of data in class-incremental learning. In CVPR, pages 3957–3966, 2021.
  • [39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
  • [40] Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, and Cordelia Schmid. Memory-efficient incremental learning through feature adaptation. In ECCV, pages 699–715, 2020.
  • [41] Alejandro Jaimes and Nicu Sebe. Multimodal human–computer interaction: A survey. Computer vision and image understanding, 108(1-2):116–134, 2007.
  • [42] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge J. Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, pages 709–727. Springer, 2022.
  • [43] Jian Jiang, Edoardo Cetin, and Oya Celiktutan. Ib-drr-incremental learning with information-back discrete representation replay. In CVPRW, pages 3533–3542, 2021.
  • [44] Shibo Jie and Zhi-Hong Deng. Fact: Factor-tuning for lightweight adaptation on vision transformer. In AAAI, pages 1060–1068, 2023.
  • [45] Dahuin Jung, Dongyoon Han, Jihwan Bang, and Hwanjun Song. Generating instance-level prompts for rehearsal-free continual learning. In ICCV, pages 11847–11857, 2023.
  • [46] Minsoo Kang, Jaeyoo Park, and Bohyung Han. Class-incremental learning by knowledge distillation with adaptive feature consolidation. In CVPR, pages 16071–16080, 2022.
  • [47] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In CVPR, pages 19113–19122, 2023.
  • [48] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
  • [49] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [50] Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. In ICLR, 2022.
  • [51] Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, and Xu Yang. How to configure good in-context sequence for visual question answering. arXiv preprint arXiv:2312.01571, 2023.
  • [52] Long Li, Jianbo Huang, Liang Chang, Jian Weng, Jia Chen, and Jingjing Li. Dpps: A novel dual privacy-preserving scheme for enhancing query privacy in continuous location-based services. Frontiers of Computer Science, 17(5):175814, 2023.
  • [53] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL, pages 4582–4597, 2021.
  • [54] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
  • [55] Zhuowei Li, Long Zhao, Zizhao Zhang, Han Zhang, Di Liu, Ting Liu, and Dimitris N Metaxas. Steering prototypes with prompt-tuning for rehearsal-free continual learning. In WACV, pages 2523–2533, 2024.
  • [56] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. NeurIPS, 35:109–123, 2022.
  • [57] Hong Liu, Rongrong Ji, Yongjian Wu, Feiyue Huang, and Baochang Zhang. Cross-modality binary code learning via fusion similarity hashing. In CVPR, pages 7380–7388, 2017.
  • [58] Mingxuan Liu, Subhankar Roy, Zhun Zhong, Nicu Sebe, and Elisa Ricci. Large-scale pre-trained models are surprisingly strong in incremental novel class discovery. arXiv preprint arXiv:2303.15975, 2023.
  • [59] Yaoyao Liu, Yuting Su, An-An Liu, Bernt Schiele, and Qianru Sun. Mnemonics training: Multi-class incremental learning without forgetting. In CVPR, pages 12245–12254, 2020.
  • [60] Marc Masana, Xialei Liu, Bartłomiej Twardowski, Mikel Menta, Andrew D Bagdanov, and Joost Van De Weijer. Class-incremental learning: survey and performance evaluation on image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):5513–5533, 2022.
  • [61] Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology, 4:54654, 2013.
  • [62] Seyed Iman Mirzadeh, Mehrdad Farajtabar, Razvan Pascanu, and Hassan Ghasemzadeh. Understanding the role of training regimes in continual learning. NeurIPS, 33:7308–7320, 2020.
  • [63] Jingyi Ning, Lei Xie, Yi Li, Yingying Chen, Yanling Bu, Chuyu Wang, Sanglu Lu, and Baoliu Ye. Moirétracker: Continuous camera-to-screen 6-dof pose tracking based on moiré pattern. IEEE Journal on Selected Areas in Communications, 2024.
  • [64] Jingyi Ning, Lei Xie, Chuyu Wang, Yanling Bu, Fengyuan Xu, Da-Wei Zhou, Sanglu Lu, and Baoliu Ye. Rf-badge: Vital sign-based authentication via rfid tag array on badges. IEEE Transactions on Mobile Computing, 22(02):1170–1184, 2023.
  • [65] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • [66] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8026–8037, 2019.
  • [67] Karl Pearson. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11):559–572, 1901.
  • [68] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415, 2019.
  • [69] Juan-Manuel Perez-Rua, Xiatian Zhu, Timothy M Hospedales, and Tao Xiang. Incremental few-shot object detection. In CVPR, pages 13846–13855, 2020.
  • [70] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. In ACL, pages 487–503, 2021.
  • [71] Quang Pham, Chenghao Liu, and HOI Steven. Continual normalization: Rethinking batch normalization for online continual learning. In ICLR, 2022.
  • [72] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  • [73] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. NIPS, pages 506–516, 2017.
  • [74] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In CVPR, pages 2001–2010, 2017.
  • [75] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.
  • [76] Subhankar Roy, Mingxuan Liu, Zhun Zhong, Nicu Sebe, and Elisa Ricci. Class-incremental novel class discovery. In ECCV, pages 317–333, 2022.
  • [77] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [78] Yujun Shi, Kuangqi Zhou, Jian Liang, Zihang Jiang, Jiashi Feng, Philip HS Torr, Song Bai, and Vincent YF Tan. Mimicking the oracle: An initial phase decorrelation approach for class incremental learning. In CVPR, pages 16722–16731, 2022.
  • [79] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In NIPS, pages 2990–2999, 2017.
  • [80] Christian Simon, Piotr Koniusz, and Mehrtash Harandi. On learning the geodesic path for incremental learning. In CVPR, pages 1591–1600, 2021.
  • [81] James Smith, Yen-Chang Hsu, Jonathan Balloch, Yilin Shen, Hongxia Jin, and Zsolt Kira. Always be dreaming: A new approach for data-free class-incremental learning. In ICCV, pages 9374–9384, 2021.
  • [82] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In CVPR, pages 11909–11919, 2023.
  • [83] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4080–4090, 2017.
  • [84] Andreas Peter Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. Transactions on Machine Learning Research, 2022.
  • [85] Gan Sun, Wenqi Liang, Jiahua Dong, Jun Li, Zhengming Ding, and Yang Cong. Create your world: Lifelong text-to-image diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [86] Hai-Long Sun, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Pilot: A pre-trained model-based continual learning toolbox. arXiv preprint arXiv:2309.07117, 2023.
  • [87] Xiaoyu Tao, Xiaopeng Hong, Xinyuan Chang, Songlin Dong, Xing Wei, and Yihong Gong. Few-shot class-incremental learning. In CVPR, pages 12183–12192, 2020.
  • [88] Michael Tschannen, Basil Mustafa, and Neil Houlsby. Image-and-language understanding from pixels only. arXiv preprint arXiv:2212.08045, 2022.
  • [89] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • [90] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, pages 5018–5027, 2017.
  • [91] Andrés Villa, Juan León Alcázar, Motasem Alfarra, Kumail Alhamoud, Julio Hurtado, Fabian Caba Heilbron, Alvaro Soto, and Bernard Ghanem. Pivot: Prompting for video continual learning. In CVPR, pages 24214–24223, 2023.
  • [92] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [93] Fu-Yun Wang, Da-Wei Zhou, Liu Liu, Han-Jia Ye, Yatao Bian, De-Chuan Zhan, and Peilin Zhao. BEEF: Bi-compatible class-incremental learning via energy-based expansion and fusion. In ICLR, 2023.
  • [94] Fu-Yun Wang, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Foster: Feature boosting and compression for class-incremental learning. In ECCV, pages 398–414, 2022.
  • [95] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023.
  • [96] Qi-Wei Wang, Da-Wei Zhou, Yi-Kai Zhang, De-Chuan Zhan, and Han-Jia Ye. Few-shot class-incremental learning via training-free prototype calibration. NeurIPS, 36, 2023.
  • [97] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam’s razor for domain incremental learning. NeurIPS, 35:5682–5695, 2022.
  • [98] Yabin Wang, Zhiheng Ma, Zhiwu Huang, Yaowei Wang, Zhou Su, and Xiaopeng Hong. Isolation and impartial aggregation: A paradigm of incremental learning without interference. In AAAI, pages 10209–10217, 2023.
  • [99] Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, and Shin’ichi Satoh. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In CVPR, pages 618–626, 2019.
  • [100] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In ECCV, pages 631–648, 2022.
  • [101] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In CVPR, pages 139–149, 2022.
  • [102] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • [103] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In CVPR, pages 374–382, 2019.
  • [104] Ju Xu and Zhanxing Zhu. Reinforced continual learning. In NeurIPS, pages 899–908, 2018.
  • [105] Shipeng Yan, Jiangwei Xie, and Xuming He. Der: Dynamically expandable representation for class incremental learning. In CVPR, pages 3014–3023, 2021.
  • [106] Ming-Hsuan Yang. Detecting faces in images: A survey. IEEE Transactions on pattern analysis and machine intelligence, 24(1):34–58, 2002.
  • [107] Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, and Xin Geng. Exploring diverse in-context configurations for image captioning. NeurIPS, 36, 2024.
  • [108] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In ICLR, 2018.
  • [109] Kaichao You, Zhi Kou, Mingsheng Long, and Jianmin Wang. Co-tuning for transfer learning. NeurIPS, pages 17236–17246, 2020.
  • [110] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022.
  • [111] Lu Yu, Bartlomiej Twardowski, Xialei Liu, Luis Herranz, Kai Wang, Yongmei Cheng, Shangling Jui, and Joost van de Weijer. Semantic drift compensation for class-incremental learning. In CVPR, pages 6982–6991, 2020.
  • [112] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  • [113] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In ACL, pages 1–9, 2022.
  • [114] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.
  • [115] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. Slca: Slow learner with classifier alignment for continual learning on a pre-trained model. In ICCV, pages 19148–19158, 2023.
  • [116] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. Class-incremental learning via deep model consolidation. In WACV, pages 1131–1140, 2020.
  • [117] Tianwei Zhang, Huayan Zhang, and Xiaofei Li. Vision-audio fusion slam in dynamic environments. CAAI Transactions on Intelligence Technology, 8(4):1364–1373, 2023.
  • [118] Yuanhan Zhang, Zhenfei Yin, Jing Shao, and Ziwei Liu. Benchmarking omni-vision representation through the lens of visual realms. In ECCV, pages 594–611. Springer, 2022.
  • [119] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Neural prompt search. arXiv preprint arXiv:2206.04673, 2022.
  • [120] Bowen Zhao, Xi Xiao, Guojun Gan, Bin Zhang, and Shu-Tao Xia. Maintaining discrimination and fairness in class incremental learning. In CVPR, pages 13208–13217, 2020.
  • [121] Hanbin Zhao, Xin Qin, Shihao Su, Yongjian Fu, Zibo Lin, and Xi Li. When video classification meets incremental classes. In ACM MM, pages 880–889, 2021.
  • [122] Hanbin Zhao, Hui Wang, Yongjian Fu, Fei Wu, and Xi Li. Memory-efficient class-incremental learning for image classification. IEEE Transactions on Neural Networks and Learning Systems, 33(10):5966–5977, 2021.
  • [123] Bowen Zheng, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Preserving locality in vision transformers for class incremental learning. In ICME, pages 1157–1162, 2023.
  • [124] Bowen Zheng, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. Multi-layer rehearsal feature augmentation for class-incremental learning. In ICML, pages 61649–61663, 2024.
  • [125] Zhun Zhong, Enrico Fini, Subhankar Roy, Zhiming Luo, Elisa Ricci, and Nicu Sebe. Neighborhood contrastive learning for novel class discovery. In CVPR, pages 10867–10875, 2021.
  • [126] Da-Wei Zhou, Hai-Long Sun, Jingyi Ning, Han-Jia Ye, and De-Chuan Zhan. Continual learning with pre-trained models: A survey. In IJCAI, pages 8363–8371, 2024.
  • [127] Da-Wei Zhou, Fu-Yun Wang, Han-Jia Ye, Liang Ma, Shiliang Pu, and De-Chuan Zhan. Forward compatible few-shot class-incremental learning. In CVPR, pages 9046–9056, 2022.
  • [128] Da-Wei Zhou, Qi-Wei Wang, Zhi-Hong Qi, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Class-incremental learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [129] Da-Wei Zhou, Qi-Wei Wang, Han-Jia Ye, and De-Chuan Zhan. A model or 603 exemplars: Towards memory-efficient class-incremental learning. In ICLR, 2023.
  • [130] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415, 2022.
  • [131] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
  • [132] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
  • [133] Fei Zhu, Xu-Yao Zhang, Chuang Wang, Fei Yin, and Cheng-Lin Liu. Prototype augmentation and self-supervision for incremental learning. In CVPR, pages 5871–5880, 2021.

