natmin322 committed · Commit 9ef4bf7 · 0 Parent(s):
.gitattributes ADDED
@@ -0,0 +1,12 @@
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.jsonl filter=lfs diff=lfs merge=lfs -text
+ *.pdf filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.txt filter=lfs diff=lfs merge=lfs -text
+ *.yaml filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.pyc filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
LwI ADDED
@@ -0,0 +1 @@
+ Subproject commit e310f5b926b8dda4991b8240dc008e7accd40bdd
LwI_baseline ADDED
@@ -0,0 +1 @@
+ Subproject commit e310f5b926b8dda4991b8240dc008e7accd40bdd
method.md ADDED
@@ -0,0 +1,458 @@
+ # RIEMANNIAN TOPOLOGICAL ALIGNMENT (RTA) FOR CONTINUAL LEARNING
+
+ ## I. MOTIVATION & THEORETICAL FOUNDATION
+
+ ### Problem Statement
+ In Continual Learning (CL), the encoder drifts as it learns new tasks, which leads to catastrophic forgetting. Existing methods (e.g., MINION v17) preserve knowledge only at the output level and do not model the geometry of the feature distribution.
+
+ ### Core Insight
+ After normalization, features lie on the hypersphere $\mathbb{S}^{d-1}$, not in Euclidean space. Therefore:
+ - Distances and angles between features should be measured with the Riemannian metric, not Euclidean distance
+ - The distributional structure (covariance) on a curved manifold differs fundamentally from the Euclidean case
+ - Preserving topology means preserving the Fisher Information Metric (FIM), not just the weights
+
+ ### Transition from MINION v17 → RTA
+ **MINION v17 limitations:**
+ - Isotropic vMF model: assumes the same spread in every direction
+ - Linear Procrustes alignment: error accumulates across layers
+ - No feature-drift detection; it only aligns parameters
+ - No formal definition of "knowledge preservation"
+
+ **RTA improvements:**
+ - Bingham distribution (anisotropic): can represent ellipsoidal clusters
+ - Parallel transport on the manifold: preserves metric relationships
+ - Feature-level monitoring + Riemannian distillation
+ - Formalizes preservation via the Fisher Information Metric
+
+ ---
+
+ ## II. FRAMEWORK COMPONENTS
+
+ ### Stage 1: Anisotropic Probability Modeling
+
+ #### From vMF (isotropic) to Bingham (anisotropic)
+
+ The standard von Mises-Fisher model captures only rotationally symmetric clusters:
+ $$f(z; \mu, \kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2} I_{d/2-1}(\kappa)} \exp(\kappa \mu^T z)$$
+
+ This assumes that **every direction away from the center $\mu$ is equally likely**, which is a poor fit because:
+ - Feature dimensions carry different meanings
+ - Task-specific dimensions have higher variance
+ - Catastrophic forgetting occurs when task-specific dimensions are overwritten
+
+ #### Bingham Distribution: the Anisotropic Alternative
+
+ On the hypersphere $\mathbb{S}^{d-1}$ we use the **Bingham distribution**:
+ $$f(z; A_c) = \frac{1}{F(A_c)} \exp(z^T A_c z), \quad z \in \mathbb{S}^{d-1}$$
+
+ **Advantages:**
+ - $A_c = \sum_{i=1}^{d} \lambda_i v_i v_i^T$ is a symmetric matrix
+ - Eigenvectors $\{v_i\}$: the principal axes of the feature cluster
+ - Eigenvalues $\{\lambda_i\}$: the extent of the cluster along each axis (anisotropy)
+ - Automatically learns ellipsoidal cluster shapes rather than forcing circular ones
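
The eigen-decomposition above can be sketched in NumPy. This is a minimal illustration rather than the project's implementation: the helper names `bingham_matrix` and `bingham_log_unnormalized` and the example eigenvalues are assumptions, and the normalizing constant $F(A_c)$ is deliberately omitted because it has no simple closed form.

```python
import numpy as np

def bingham_matrix(eigvecs, eigvals):
    """Assemble A = sum_i lambda_i v_i v_i^T from orthonormal columns v_i."""
    return (eigvecs * eigvals) @ eigvecs.T

def bingham_log_unnormalized(z, A):
    """Return z^T A z for unit-normalized z; log F(A) is not included."""
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return np.einsum("...i,ij,...j->...", z, A, z)

rng = np.random.default_rng(0)
d = 4
V, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthonormal principal axes (assumed)
lam = np.array([0.0, -1.0, -5.0, -20.0])       # illustrative anisotropic eigenvalues
A = bingham_matrix(V, lam)

# Recover the axes; eigh returns eigenvalues in ascending order.
w, U = np.linalg.eigh(A)
mode = U[:, -1]                                 # axis with the largest eigenvalue

z = rng.normal(size=d)
score_pos = bingham_log_unnormalized(z, A)
score_neg = bingham_log_unnormalized(-z, A)     # same value: antipodal symmetry
```

Because the density depends on $z$ only through $z^T A_c z$, it assigns $z$ and $-z$ the same value, which is worth keeping in mind when it replaces a vMF model that has a signed mean direction.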
+
+ **Per-class modeling:**
+ $$P_c^{(t)} = \{A_c^{(t)}, \text{variance}_c^{(t)}\}$$
+
+ This stores the **full covariance structure**, not just the mean and concentration as in vMF.
+
+ ### Stage 2: Locking Topology via Riemannian Knowledge Distillation
+
+ #### Problem: Catastrophic Forgetting from Topology Shift
+
+ When the encoder is updated on task $t$, the mean and covariance of the old classes change:
+ - **Mean shift**: $\mu_c^{(t-1)} \to \tilde{\mu}_c^{(t)}$
+ - **Axis rotation**: $V_{c}^{(t-1)} \to V_{c}^{(t)}$
+ - **Anisotropy change**: $\Lambda_c^{(t-1)} \to \Lambda_c^{(t)}$
+
+ → **The topology is deformed**, even if the output predictions still look reasonable
+
+ #### Solution: Riemannian Kullback-Leibler Divergence
+
+ Instead of relying only on output-level distillation:
+ $$\mathcal{L}_{old} = \text{KL}(p_{old}(y|x) \| p_{new}(y|x))$$
+
+ we add a **Riemannian KL on the parameter manifold**:
+ $$\mathcal{L}_{geo} = D_{RKL}(P_{old}^{(t-1)} \| P_{new}^{(t)})$$
+
+ **Formal definition:**
+ $$D_{RKL}(P_1 \| P_2) = \int_{\Theta} P_1(\theta) \log \frac{P_1(\theta)}{P_2(\theta)} d\theta$$
+
+ where $\Theta$ is equipped with the **Fisher Information Metric (FIM)**:
+ $$g_{ij}(\theta) = \mathbb{E}_{x,y \sim P(\cdot|\theta)} \left[ \frac{\partial \log p(y|x;\theta)}{\partial \theta_i} \frac{\partial \log p(y|x;\theta)}{\partial \theta_j} \right]$$
+
+ #### Interpretation: Preserving Information
+ - KL divergence under the FIM measures "how far the parameters can move while still preserving the classification boundary"
+ - Geometry lock: if $D_{RKL} \approx 0$, the structure of $P_{old}$ is intact
+ - Automatic trade-off between new-task performance and retention of old tasks (no need to tune multiple λ's)
+
+ #### Implementation Detail
+ Per-layer:
+ $$\mathcal{L}_{geo} = \sum_{l=1}^{L} D_{RKL}^{(l)}(A_c^{(t-1)} \| A_c^{(t)})$$
+
+ Approximated by the **Bures-Wasserstein distance** between covariances:
+ $$W_2^2(A_c^{old}, A_c^{new}) = \text{Tr}\left(A_c^{old} + A_c^{new} - 2\left((A_c^{old})^{1/2} A_c^{new} (A_c^{old})^{1/2}\right)^{1/2}\right)$$
+
+ ### Stage 3: Drift Correction via Parallel Transport on Manifold
+
+ #### Limitation of Procrustes Rotation (MINION v17)
+
+ Procrustes finds the optimal rotation matrix $R^*$ that aligns $W_0$ to $W_1$:
+ $$R^* = \arg\min_{R^T R = I} \|R W_0 - W_1\|_F$$
+
+ **Issues:**
+ 1. Assumes a **Euclidean metric**, but the features lie on a hypersphere
+ 2. **Error accumulation**: applied across $L$ layers, errors compound exponentially
+ 3. Does not preserve **inner products** on the manifold
+ 4. Does not capture **non-linear drift** (e.g., simultaneous rotation and dilation)
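
For reference, the Procrustes objective above has a closed-form solution via an SVD. The sketch below assumes $W_0$ and $W_1$ are $d \times n$ matrices whose columns are matched pairs; `orthogonal_procrustes` is an illustrative name, and the result is an orthogonal matrix (it is not forced to have determinant $+1$).

```python
import numpy as np

def orthogonal_procrustes(W0, W1):
    """Orthogonal R minimizing ||R @ W0 - W1||_F, from the SVD of W1 @ W0.T."""
    U, _, Vt = np.linalg.svd(W1 @ W0.T)
    return U @ Vt

rng = np.random.default_rng(0)
d, n = 5, 50
W0 = rng.normal(size=(d, n))
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a known orthogonal map
W1 = R_true @ W0

R_hat = orthogonal_procrustes(W0, W1)              # should recover R_true
residual = np.linalg.norm(R_hat @ W0 - W1)
```

In this noise-free toy case the known map is recovered; with real drift the residual stays positive, and that residual is the quantity the list above is concerned about.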
+
+ #### Riemannian Alternative: Parallel Transport
+
+ **Intuition**: On a curved manifold, when moving from point A to point B, how do we "move" a vector while keeping its "orientation"?
+
+ **Answer**: Parallel transport, which moves the vector along the **geodesic** from A to B.
+
+ #### Mathematical Framework
+
+ Suppose the feature distribution drifts from $\mu_c^{old}$ to $\mu_c^{new}$ on $\mathbb{S}^{d-1}$:
+
+ **Step 1: Determine the Geodesic**
+ The shortest curve on the sphere joining $\mu_c^{old}$ and $\mu_c^{new}$:
+ $$\gamma(t) = \frac{\sin((1-t)\theta)}{\sin\theta} \mu_c^{old} + \frac{\sin(t\theta)}{\sin\theta} \mu_c^{new}, \quad t \in [0,1]$$
+
+ where $\theta = \arccos(\mu_c^{old} \cdot \mu_c^{new})$ is the geodesic distance.
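
The geodesic between two unit vectors is the familiar spherical linear interpolation. A minimal sketch, assuming both means are unit-norm and not antipodal; `geodesic_angle` and `slerp` are illustrative names:

```python
import numpy as np

def geodesic_angle(p, q):
    """Great-circle distance between unit vectors p and q."""
    return float(np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)))

def slerp(p, q, t):
    """Point at fraction t in [0, 1] along the geodesic from p to q."""
    theta = geodesic_angle(p, q)
    if theta < 1e-8:          # coincident points: nothing to interpolate
        return p.copy()
    return (np.sin((1 - t) * theta) * p + np.sin(t * theta) * q) / np.sin(theta)

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])
mid = slerp(p, q, 0.5)        # stays on the sphere, halfway in angle
```

The division by $\sin\theta$ is what keeps every interpolated point at unit norm; the clipping inside `geodesic_angle` guards against dot products that land slightly outside $[-1, 1]$ in floating point.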
+
+ **Step 2: Transport the Covariance**
+ The covariance matrix $A_c^{old}$ is moved along the geodesic to obtain $A_c^{aligned}$:
+
+ $$A_c^{aligned} = \text{ParallelTransport}_{\gamma}(A_c^{old})$$
+
+ **Step 3: Compute the Parallel Transport**
+ On the sphere, parallel transport of a tangent vector $v$ along the geodesic is defined by the **Levi-Civita connection**:
+
+ $$\frac{D v}{dt} = 0 \quad \text{along } \gamma(t)$$
+
+ **Explicit formula for the Bingham covariance:**
+ $$A_c^{aligned} = A_c^{old} - (\theta \cot(\theta) - 1)(A_c^{old} \mu_c^{old})(\mu_c^{old})^T$$
+
+ #### Advantages over Procrustes
+ 1. **Metric preserving**: $\langle v, w \rangle_{aligned} = \langle v, w \rangle_{old}$ (inner products preserved)
+ 2. **Path-independent**: The result depends only on the endpoints $\mu_c^{old}$ and $\mu_c^{new}$ (transport is taken along the geodesic between them), not on how the drift unfolded during training
+ 3. **Error bounded**: Error does not accumulate across layers (orthogonality guaranteed)
+ 4. **Theoretically sound**: Grounded in Riemannian geometry rather than ad hoc
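
One standard way to realize parallel transport along a great circle is the rotation that acts in the plane spanned by the two means and leaves the orthogonal complement fixed. The sketch below uses that textbook construction rather than the closed form quoted above; the function names are illustrative, and the means are assumed to be unit vectors that are not antipodal.

```python
import numpy as np

def transport_rotation(p, q):
    """Rotation taking unit vector p to q within span{p, q}."""
    c = float(np.dot(p, q))
    if c <= -1.0 + 1e-12:
        raise ValueError("antipodal points: the minimizing geodesic is not unique")
    K = np.outer(q, p) - np.outer(p, q)          # skew-symmetric generator
    return np.eye(p.size) + K + (K @ K) / (1.0 + c)

def transport_vector(v, p, q):
    """Parallel-transport a tangent vector v at p to the tangent space at q."""
    return transport_rotation(p, q) @ v

def transport_matrix(A, p, q):
    """Carry a symmetric matrix along the same geodesic: R A R^T."""
    R = transport_rotation(p, q)
    return R @ A @ R.T

rng = np.random.default_rng(0)
d = 6
p = rng.normal(size=d); p /= np.linalg.norm(p)
q = rng.normal(size=d); q /= np.linalg.norm(q)
v = rng.normal(size=d); v -= np.dot(v, p) * p    # make v tangent at p
S = rng.normal(size=(d, d)); A = S @ S.T         # toy symmetric PSD matrix

R = transport_rotation(p, q)
v_t = transport_vector(v, p, q)
A_t = transport_matrix(A, p, q)
```

For a tangent vector this reduces to the closed form $v - \frac{\langle q, v \rangle}{1 + \langle p, q \rangle}(p + q)$. Conjugating by $R$ leaves the eigenvalues of $A$ unchanged, so the anisotropy profile is carried over while the axes are re-expressed at the new mean.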
+
+ #### Implementation Consideration
+ In practice, only $M=1$ exemplar from each old class is needed to estimate $\mu_c^{new}$:
+ - Compute $\mu_c^{obs} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} z_i^{(new)}$ on the test set of class $c$, then renormalize it to unit length so that it lies on $\mathbb{S}^{d-1}$
+ - Update the geodesic angle: $\theta = \arccos(\mu_c^{old} \cdot \mu_c^{obs})$
+ - Apply parallel transport to all $A_c$ parameters
+
+ ### Stage 4: Unified Learning Objective
+
+ #### Full Loss Function
+
+ The objective combines discrimination and retention. The mutual-information term is maximized, so it enters the loss with a negative sign:
+
+ $$\mathcal{L}_{total} = \underbrace{\mathcal{L}_{CE}(f(x), y)}_{\text{new task}} - \lambda_1 \underbrace{\mathcal{I}(z; y)}_{\text{discriminativity}} + \lambda_2 \underbrace{D_{RKL}(P_{old} \| P_{new})}_{\text{geometry lock}}$$
+
+ **Details of each term:**
+
+ **Term 1: Task-specific Cross-Entropy**
+ $$\mathcal{L}_{CE} = -\log p(y|x; \theta_t)$$
+ Standard supervised loss on the new task $t$.
+
+ **Term 2: Mutual Information (Discriminativity)**
+ $$\mathcal{I}(z; y) = H(y) - H(y|z) = \mathbb{E}_{z,y}[\log p(y|z)] - \mathbb{E}_y[\log p(y)]$$
+
+ Estimated via **InfoNCE** (contrastive learning):
+ $$\mathcal{I} \approx \mathbb{E}_{(x,y)} \left[ \log \frac{\exp(z^T z_{pos}/\tau)}{\sum_{k} \exp(z^T z_k/\tau)} \right]$$
+
+ Purpose: ensure the features still carry high-fidelity discriminative information for class separation.
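
The InfoNCE estimate can be sketched as a row-wise log-softmax over similarity logits. This is a simplified in-batch version under stated assumptions: row $i$ of `z_pos` is the positive for row $i$ of `z`, every other row serves as a negative, and `info_nce` is an illustrative name rather than an existing API.

```python
import numpy as np

def info_nce(z, z_pos, tau=0.05):
    """Mean log-probability of the matching positive (always <= 0)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / tau                            # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
z_pos = z + 0.01 * rng.normal(size=z.shape)           # positives close to their anchors

aligned = info_nce(z, z_pos)                          # close to 0
shuffled = info_nce(z, np.roll(z_pos, 1, axis=0))     # mismatched pairs score lower
```

Higher values mean positives are easier to pick out, so a training loop would minimize the negative of this quantity.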
+
+ **Term 3: Riemannian KL Distillation**
+ $$D_{RKL}(P_{old} \| P_{new}) = \sum_{c \in \text{old}} W_2(A_c^{old}, A_c^{new})$$
+
+ + Applies the parallel transport correction from Stage 3
+ + Minimizes the covariance shift across all layers
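
A minimal NumPy sketch of the Bures-Wasserstein term, assuming the matrices being compared are symmetric positive semi-definite (as covariance or scatter matrices are). The helper names are illustrative; the matrix square root is taken through an eigen-decomposition, and the function returns the squared distance.

```python
import numpy as np

def sqrtm_psd(M):
    """Square root of a symmetric PSD matrix via eigen-decomposition."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def bures_wasserstein_sq(A, B):
    """Squared Bures-Wasserstein distance Tr(A + B - 2 (A^1/2 B A^1/2)^1/2)."""
    A_half = sqrtm_psd(A)
    cross = sqrtm_psd(A_half @ B @ A_half)
    return max(float(np.trace(A + B - 2.0 * cross)), 0.0)

A = np.diag([1.0, 4.0])
B = np.diag([9.0, 16.0])
dist_sq = bures_wasserstein_sq(A, B)   # commuting case: (1-3)^2 + (2-4)^2 = 8
```

For commuting matrices the distance reduces to the Euclidean distance between the square roots of the eigenvalues, which gives a quick sanity check. Each call costs an $O(d^3)$ eigen-decomposition, twice.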
+
+ #### Dynamic Weight Scheduling
+
+ Instead of fixed $\lambda_1, \lambda_2$, use **adaptive weighting**:
+
+ $$\lambda_1(t) = \lambda_1^{init} \times (1 - \frac{t}{T})^p, \quad p \in [1,2]$$
+ $$\lambda_2(t) = \lambda_2^{init} \times (1 + \frac{t}{T})^q, \quad q \in [1,2]$$
+
+ - Early updates: emphasize task learning ($\lambda_1$ high, $\lambda_2$ low)
+ - Later updates: emphasize retention ($\lambda_1$ low, $\lambda_2$ high)
+ - $t = $ number of gradient updates
+ - $T = $ total updates in task
+
+ #### Per-Layer Adaptation
+
+ Because early layers drift little (general features) while late layers drift more (task-specific):
+
+ $$\lambda_2^{(l)} = \lambda_2 \times (1 + \alpha \cdot l / L)^{\beta}$$
+
+ with $\alpha, \beta > 0$ selected via validation.
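
The two schedules and the per-layer scaling are simple enough to write down directly. In this sketch the defaults follow the recommended settings listed later in this file where they exist ($\lambda_1^{init} = 0.1$, $\lambda_2^{init} = 0.01$, $\alpha = 0.5$, exponent 1.5); `beta=1.0` and the function names are assumptions made for illustration.

```python
def lambda_1(t, T, init=0.1, p=1.5):
    """Mutual-information weight: decays from init to 0 over T updates."""
    return init * (1.0 - t / T) ** p

def lambda_2(t, T, init=0.01, q=1.5):
    """Geometry-lock weight: grows from init to init * 2**q over T updates."""
    return init * (1.0 + t / T) ** q

def lambda_2_layer(lam2, layer, num_layers, alpha=0.5, beta=1.0):
    """Per-layer scaling: deeper layers receive a larger geometry-lock weight."""
    return lam2 * (1.0 + alpha * layer / num_layers) ** beta

T = 100
start = (lambda_1(0, T), lambda_2(0, T))
end = (lambda_1(T, T), lambda_2(T, T))
deepest = lambda_2_layer(lambda_2(T, T), layer=12, num_layers=12)
```

Under these formulas $\lambda_2$ grows by at most a factor of $2^q$ within a task (about 2.8 for $q = 1.5$), while $\lambda_1$ reaches exactly zero at the last update.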
+
+ ---
+
+ ## III. COMPARATIVE ANALYSIS: RTA vs. MINION v17
+
+ | Criterion | MINION v17 | RTA | Advantage |
+ |-----------|-----------|-----|-----------|
+ | **Distribution Model** | von Mises-Fisher (isotropic) | Bingham (anisotropic) | RTA captures task-specific anisotropy |
+ | **Parameter Geometry** | Euclidean assumptions | Riemannian manifold | RTA preserves topology on $\mathbb{S}^{d-1}$ |
+ | **Drift Correction** | Procrustes (linear rotation) | Parallel transport (geodesic path) | RTA avoids error accumulation |
+ | **Knowledge Retention** | KL divergence on outputs | Riemannian KL + FIM-weighted | RTA locks feature topology, not just predictions |
+ | **Adaptation** | Fixed ensemble weights | Dynamic per-layer scheduling | RTA adapts to feature drift rate |
+ | **Drift Detection** | None (implicit in weight change) | Explicit geodesic distance | RTA quantifies drift magnitude |
+ | **$M=1$ Reliability** | Low (mean estimate unstable) | Medium-High (only for geodesic direction) | RTA robust with single exemplar |
+ | **Computational Cost** | $O(d^2)$ per layer | $O(d^3)$ for eigendecomposition | RTA slightly higher cost, justified by robustness |
+
+ **Summary**: RTA offers theoretical guarantees on metric preservation, automatic feature-level monitoring, and principled drift correction. MINION v17 is faster but more ad hoc.
+
+ ---
+
+ ## IV. THEORETICAL JUSTIFICATION
+
+ ### Why Bingham > von Mises-Fisher?
+
+ Consider binary classification on the sphere, with features on a hemisphere of $\mathbb{S}^{d-1}$:
+ - Class 0 features: clustered around $\mu_0$
+ - Class 1 features: clustered around $\mu_1$
+
+ **vMF assumption**: all eigenvectors of the covariance share the same eigenvalue, governed by a single concentration $\kappa$
+ → Circular clusters, which is risky when:
+ - Task-specific directions overlap (confusable features)
+ - Early stopping causes under-learning in some dimensions
+
+ **Bingham modeling**: the eigenvalues $\lambda_i$ differ
+ → Ellipsoidal clusters capture:
+ - Discriminative dimensions (high $\lambda_i$)
+ - Non-discriminative "noise" dimensions (low $\lambda_i$)
+ - An automatically learned importance weighting per dimension
+
+ ### Why Parallel Transport > Procrustes?
+
+ **Procrustes on Hypersphere:**
+ Applying $\hat{z} = R z$ with $R \in SO(d)$ to a feature $z \in \mathbb{S}^{d-1}$ gives:
+ $$\|R z\|_2 = \|z\|_2 = 1 \checkmark$$
+
+ But **repeated across layers:**
+ $$z^{(L)} = R_L \cdots R_2 R_1 z^{(0)}$$
+
+ Due to finite numerical precision, $\|z^{(L)}\|_2 \approx 1 - \epsilon L$ (the error accumulates!)
+
+ **Parallel Transport preservation:**
+ For a vector $v \in T_p \mathbb{S}^{d-1}$ and its parallel transport $\text{PT}_\gamma(v)$ along the geodesic $\gamma$ starting at $p$:
+ $$\|\text{PT}_\gamma(v)\|_{\gamma(t)} = \|v\|_p \quad \text{for all } t \in [0,1]$$
+ $$\langle \text{PT}_\gamma(v), \gamma(t) \rangle = 0 \quad \text{(stays tangent to the sphere)}$$
+
+ → **No accumulation**, guaranteed metric preservation.
+
+ ### Why RKL > Output-level KL?
+
+ **Output-level KL:**
+ $$\text{KL}(p_t(y|x) \| p_{t+1}(y|x))$$
+
+ Problem: this can be minimized if $p_{t+1}$ merely "softens" its predictions via temperature scaling, even though the features shift dramatically!
+
+ **RKL via Fisher Information Metric:**
+ $$D_{RKL}(\theta_t \| \theta_{t+1}) \approx \frac{1}{2} \Delta\theta^T \, \text{FIM}(\theta_t) \, \Delta\theta, \quad \Delta\theta = \theta_{t+1} - \theta_t$$
+
+ If $D_{RKL} \approx 0$:
+ - Decision boundaries are stable
+ - Features preserve their discriminative structure
+ - Weight changes stay within the "safe region"
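
A small numerical check of the local picture behind the FIM: for nearby parameters, the KL divergence is approximately the quadratic form $\frac{1}{2} \Delta\theta^T F(\theta) \Delta\theta$. The sketch uses a softmax (categorical) model over logits as a stand-in; the model, the values of `theta` and `delta`, and the helper names are illustrative assumptions.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def fisher_softmax(theta):
    """Fisher information of a categorical model w.r.t. its logits: diag(p) - p p^T."""
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.5, -1.0, 2.0, 0.0])
delta = 1e-2 * np.array([1.0, -2.0, 0.5, 1.5])   # a small parameter step

exact = kl(softmax(theta), softmax(theta + delta))
approx = 0.5 * delta @ fisher_softmax(theta) @ delta
```

The two numbers agree up to third-order terms in the step size, which is why a Fisher-weighted penalty on $\Delta\theta$ acts as a local surrogate for the KL between the old and new models.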
+
+ ---
+
+ ## V. ALGORITHMIC DETAILS & IMPLEMENTATION
+
+ ### Training Algorithm (RTA-CL)
+
+ **Input**: Current task data $D_t$, old learned distributions $\{P_c^{(t-1)}\}_{c \in C_{old}}$, network $f_\theta$
+
+ **Output**: Updated parameters $\theta_t$, updated distributions $\{P_c^{(t)}\}$
+
+ ```
+ Algorithm: Continual Learning with RTA
+
+ for each task t = 1, 2, ..., T:
+
+     # Phase 1: Collect Feature Statistics
+     Z_c = []  # Buffer per old class
+     for c in C_old:
+         Z_c = collect_features(D_test^c, f_{θ_{t-1}})  # M=1 exemplar per class
+         μ_c^{obs} ← mean(Z_c)
+
+     # Phase 2: Detect Drift & Compute Geodesics
+     geodesic_dist = []
+     for c in C_old:
+         θ_c ← arccos(μ_c^{old} · μ_c^{obs})  # geodesic angle
+         geodesic_dist.append(θ_c)
+
+     # Phase 3: Train on New Task
+     for epoch = 1 to num_epochs:
+         for batch (x, y) in D_t:
+
+             # Forward pass
+             z = encoder(x)  # features on sphere
+             logits = classifier(z)
+
+             # Task loss
+             L_CE = CrossEntropy(logits, y)
+
+             # Mutual information (discriminativity)
+             L_MI = -InfoNCE(z, y)
+
+             # Geometry lock with drift correction
+             L_geo = 0
+             for c in C_old:
+                 # Parallel transport correction
+                 A_c^{aligned} = ParallelTransport(
+                     A_c^{old},
+                     μ_c^{old},
+                     μ_c^{obs}
+                 )
+
+                 # Compute current covariance
+                 A_c^{new} = compute_covariance(
+                     features_c^{new}, method='Bingham_MLE'
+                 )
+
+                 # Wasserstein distance between old and new
+                 L_geo += W_2(A_c^{aligned}, A_c^{new})
+
+             # Adaptive weighting
+             λ₁ = λ₁_init * (1 - epoch/num_epochs)^1.5
+             λ₂ = λ₂_init * (1 + epoch/num_epochs)^1.5
+
+             # Total loss
+             L_total = L_CE + λ₁*L_MI + λ₂*L_geo
+
+             # Backward
+             θ ← θ - α ∇L_total
+
+     # Phase 4: Update Distributions for Next Task
+     θ_{t} ← θ
+     for c in C_old ∪ C_new:
+         A_c^{(t)} ← compute_covariance(
+             collect_features(D_train^c, f_{θ_t}),
+             method='Bingham_MLE'
+         )
+         P_c^{(t)} = {A_c^{(t)}, variance_c^{(t)}}
+ ```
+
+ ### Computational Complexity Analysis
+
+ | Operation | Complexity | Notes |
+ |-----------|-----------|-------|
+ | Bingham MLE (per class) | $O(d^3 + n_c d^2)$ | eigendecomposition dominates |
+ | Parallel Transport | $O(d^2)$ | simple matrix-vector ops |
+ | Wasserstein $W_2$ | $O(d^3)$ | one matrix sqrt call |
+ | Drift detection (M=1) | $O(d)$ | just a dot product |
+ | Per-batch overhead | $O(d^2)$ | computing $A_c$ during training |
+
+ **Total per task**:
+ - Training: $O(N_{epochs} \times N_{batches} \times d^2)$ (manageable)
+ - Evaluation: $O(|C_{old}| \times d^3)$ (one-time, after training)
+
+ **Memory**: $O(L \times |C_{old}| \times d^2)$ for storing the covariance matrices (reasonable)
+
+ ### Hyperparameter Settings (Recommended)
+
+ ```
+ λ₁_init = 0.1          # mutual information weight
+ λ₂_init = 0.01         # RKL weight (start small)
+ α_layer = 0.5          # per-layer RKL scaling
+ τ = 0.05               # temperature for InfoNCE
+ warmup_epochs = 5      # before applying geometry loss
+ num_exemplars_M = 1    # per old class (memory efficient)
+ ```
+
+ ---
+
+ ## VI. COMPARATIVE ANALYSIS & EXPECTED IMPACT
+
+ ### RTA vs. MINION v17 (Detailed)
+
+ | Criterion | MINION v17 | RTA | Advantage |
+ |-----------|-----------|-----|-----------|
+ | **Distribution Model** | von Mises-Fisher (isotropic) | Bingham (anisotropic) | RTA captures task-specific anisotropy |
+ | **Parameter Geometry** | Euclidean assumptions | Riemannian manifold | RTA preserves topology on $\mathbb{S}^{d-1}$ |
+ | **Drift Correction** | Procrustes (linear rotation) | Parallel transport (geodesic path) | RTA avoids error accumulation |
+ | **Knowledge Retention** | KL divergence on outputs | Riemannian KL + FIM-weighted | RTA locks feature topology |
+ | **Adaptation** | Fixed ensemble weights | Dynamic per-layer scheduling | RTA adapts to feature drift rate |
+ | **Drift Detection** | Implicit | Explicit geodesic distance | RTA quantifies drift magnitude |
+ | **$M=1$ Reliability** | Low | Medium-High | RTA robust with one exemplar |
+ | **Computational Cost** | $O(d^2)$ per layer | $O(d^3)$ per task | RTA's cost is justified for architectures with $d < 2048$ |
+
+ ### Expected Benefits
+
+ 1. **Theoretical Soundness** ✅
+    - Formalized from Riemannian geometry + information theory
+    - Metric preservation guaranteed (no accumulation error)
+    - FIM-weighted retention (principled trade-off)
+
+ 2. **Feature-Level Monitoring** ✅
+    - Explicit encoder drift detection (geodesic angle)
+    - Adapts the weighting per layer based on drift rate
+    - Early warning: predicts forgetting before it happens
+
+ 3. **Robustness with Few Exemplars** ✅
+    - Only M=1 exemplar per class required
+    - Used only for geodesic direction (not mean estimation)
+    - Stable covariance via Bingham MLE regularization
+
+ 4. **Anisotropy Learning** ✅
+    - Auto-discovers task-specific dimensions
+    - Protects important features while allowing updates in noise dimensions
+    - Implicit soft attention to discriminative directions
+
+ ### Limitations & Mitigation
+
+ 1. **Computational Cost** ⚠️
+    - Eigendecomposition ($O(d^3)$) per task
+    - Practical for $d < 2048$, problematic for ViT ($d > 4096$)
+    - **Mitigation**: Low-rank Bingham approximation (top-k eigenvectors)
+
+ 2. **Small M Assumption** ⚠️
+    - M=1 is not reliable if the exemplar is an outlier
+    - **Mitigation**: Robust covariance (Huber-type)
+
+ 3. **Hyperparameter Tuning** ⚠️
+    - Multiple $\lambda$'s to tune
+    - **Mitigation**: Automatic scheduling via validation
+
+ 4. **Feature Normalization Requirement** ⚠️
+    - Assumes normalized embeddings
+    - **Mitigation**: Standard practice in modern architectures
+
+ ---
+
+ ## VII. CONCLUSION & RECOMMENDATIONS
+
+ ### Summary: Why RTA is "Tighter" than MINION v17
+
+ 1. ✅ **Rigorous Mathematics**: Bingham + Riemannian geometry in a unified framework
+ 2. ✅ **Explicit Monitoring**: Tracks feature drift via geodesic distance
+ 3. ✅ **Metric Preservation**: Parallel transport guarantees no accumulation error
+ 4. ✅ **Formal Retention**: RKL via Fisher Information Metric (not ad hoc)
+ 5. ✅ **Adaptive Learning**: Per-layer + dynamic weighting based on measured drift
+
+ ### Trade-offs
+
+ - Higher computational cost (eigendecomposition per task)
+ - More hyperparameters (automatic scheduling helps)
+ - Requires normalized features (acceptable for modern architectures)
+
+ ### When to Use RTA
+
+ **Use RTA if:**
+ - ✅ Catastrophic forgetting is the main bottleneck
+ - ✅ Feature drift is large (domain shift / diverse tasks)
+ - ✅ You can afford $O(d^3)$ computation per task
+ - ✅ $d < 2048$ (typical CNN/small transformer)
+
+ **Use simpler methods (EWC, LwI) if:**
+ - ✅ Only incremental learning is needed (similar domains)
+ - ✅ Memory/compute is severely limited
+ - ✅ The model is large ($d > 4096$)
+
+ **Hybrid approach:**
+ - Apply RTA to early+middle layers (detect drift early)
+ - Simple EWC regularization on the final layer (cheap)
+ - Expected: roughly 70% of the benefits at 40% of the cost
nlp_paper_survey/download_all.sh ADDED
@@ -0,0 +1,161 @@
+ #!/bin/bash
+ # Download all 109 NLP/LLM CL papers
+ cd /Users/nnminh322/Desktop/personal/Continual/nlp_paper_survey/papers || exit 1
+
+ # download NAME URL: fetch URL into NAME unless a copy of at least 50000 bytes already exists.
+ # Note: `stat -f%z` is the BSD/macOS form for printing a file's size in bytes.
+ download() {
+   local name="$1"
+   local url="$2"
+   if [ ! -f "$name" ] || [ "$(stat -f%z "$name" 2>/dev/null || echo 0)" -lt 50000 ]; then
+     curl -sL --max-time 30 -o "$name" "$url"
+     local size
+     size=$(stat -f%z "$name" 2>/dev/null || echo 0)
+     if [ "$size" -gt 50000 ]; then
+       echo "OK: $name ($size bytes)"
+     else
+       echo "FAIL: $name ($size bytes)"
+     fi
+   else
+     echo "SKIP: $name (already exists)"
+   fi
+ }
+
+ # NeurIPS 2025
+ download "02_MINGLE_NeurIPS2025.pdf" "https://openreview.net/pdf/2319ec18a77bdfff982549ce8b4354498ed4e21f.pdf"
+ download "03_MemEIC_NeurIPS2025.pdf" "https://openreview.net/pdf/e5209240370185240edeb5bc9a5296481fd5702b.pdf"
+ download "04_MedKnowledgeInjection_NeurIPS2025.pdf" "https://openreview.net/pdf/f58613387e653de73eedc61bbe68e42c49e50e9b.pdf"
+ download "05_Bisecle_NeurIPS2025.pdf" "https://openreview.net/pdf/801fcac64a080b370564e842cd4b56a2e8974a22.pdf"
+ download "06_ContinualMultimodalCL_NeurIPS2025.pdf" "https://openreview.net/pdf/69cbc69bcea23845cc672a33eaf78a0dcec49445.pdf"
+ download "07_DemystifyingLMForgetting_NeurIPS2025.pdf" "https://openreview.net/pdf/067610017dd177dfa6a1029074e8116d3570ba5f.pdf"
+ download "08_CausalLoRA_NeurIPS2025.pdf" "https://openreview.net/pdf/e9a25c29e826baac254646ad60582c54bfae9652.pdf"
+ download "09_IntraInterModalForgetting_NeurIPS2025.pdf" "https://openreview.net/pdf/5ad880771d74330c927c6033c68c49e31e0d5ae9.pdf"
+ download "10_MEMOIR_NeurIPS2025.pdf" "https://openreview.net/pdf/5d7652bd81dac5caf264059bf95575f2c88415e9.pdf"
+ download "11_SelfEvolvingPseudoRehearsal_NeurIPS2025.pdf" "https://openreview.net/pdf/02bb9435749398313a19d1e2d4389699ebe7d3bf.pdf"
+ download "12_ReliableLifelongMultimodalEditing_NeurIPS2025.pdf" "https://openreview.net/pdf/e9c3d5a5e890cfc52d58bb0f8c853abe590119fe.pdf"
+
+ # ICCV 2025
+ download "13_MindTheGap_ICCV2025.pdf" "https://arxiv.org/pdf/2507.09118"
+ download "14_SMoLoRA_ICCV2025.pdf" "https://arxiv.org/pdf/2411.13949"
+ download "15_DMNSP_ICCV2025.pdf" "https://openaccess.thecvf.com/content/ICCV2025/papers/Kang_Dynamic_Multi-Layer_Null_Space_Projection_for_Vision-Language_Continual_Learning_ICCV_2025_paper.pdf"
+ download "16_AskAndRemember_ICCV2025.pdf" "https://arxiv.org/pdf/2502.04469"
+ download "17_TWIST_SCOUT_ICCV2025.pdf" "https://arxiv.org/pdf/2410.10491"
+ download "18_InstructionGroundedVP_ICCV2025.pdf" "https://arxiv.org/pdf/2508.00260"
+ download "19_ExternalKnowledgeCLIP_ICCV2025.pdf" "https://arxiv.org/pdf/2503.08510"
+ download "20_DualDriftVQA_ICCV2025.pdf" "https://openaccess.thecvf.com/content/ICCV2025/papers/Zhang_Overcoming_Dual_Drift_for_Continual_Long-Tailed_Visual_Question_Answering_ICCV_2025_paper.pdf"
+ download "21_PLAN_ICCV2025.pdf" "https://openaccess.thecvf.com/content/ICCV2025/papers/Wang_PLAN_Proactive_Low-Rank_Allocation_for_Continual_Learning_ICCV_2025_paper.pdf"
+
+ # ACL 2025
+ download "22_KnowledgeDecoupling_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.646.pdf"
+ download "23_SerialLifelongEditing_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.1492.pdf"
+ download "24_EfficientDomainCPT_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.1578.pdf"
+ download "25_NSE_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.815.pdf"
+ download "26_CLoRA_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.940.pdf"
+ download "27_HiDeLLaVA_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.666.pdf"
+ download "28_MultiModalityExpansion_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.1491.pdf"
+ download "29_GORP_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.721.pdf"
+ download "30_DGAR_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.537.pdf"
+ download "31_LearnToMemorize_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.1385.pdf"
+ download "32_DontHalfListen_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.1153.pdf"
+ download "33_RecurrentKIF_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.1328.pdf"
+ download "34_TiCLM_ACL2025.pdf" "https://aclanthology.org/2025.acl-long.1551.pdf"
+
+ # ICML 2025
+ download "35_FeatureDistributions_ICML2025.pdf" "https://openreview.net/pdf?id=6udKBHc0Mr"
+ download "36_ReinforcedLifelongEditing_ICML2025.pdf" "https://arxiv.org/pdf/2502.05759"
+ download "37_LimitsLifelongKE_ICML2025.pdf" "https://arxiv.org/pdf/2503.05683"
+ # NOTE: the next entry reuses the URL of 37 (arXiv 2503.05683); one of the two is probably wrong, verify before running.
+ download "38_KnowledgeSwapping_ICML2025.pdf" "https://arxiv.org/pdf/2503.05683"
+ download "39_LearningDynamicsCPT_ICML2025.pdf" "https://arxiv.org/pdf/2505.07796"
+ download "40_LargeContInstructAssistant_ICML2025.pdf" "https://arxiv.org/pdf/2410.10868"
+ download "41_TreeLoRA_ICML2025.pdf" "https://arxiv.org/pdf/2506.10355"
+ download "42_ALKN_ICML2025.pdf" "https://openreview.net/pdf?id=tcK4PV3VN4"
+ download "43_RAGtoMemory_ICML2025.pdf" "https://arxiv.org/pdf/2502.14802"
+ download "44_SEFE_ICML2025.pdf" "https://arxiv.org/pdf/2505.02486"
+ download "45_LADA_ICML2025.pdf" "https://arxiv.org/pdf/2505.23271"
+ download "46_ComponentialPromptKA_ICML2025.pdf" "https://arxiv.org/pdf/2505.04575"
+ download "47_ProxyFDA_ICML2025.pdf" "https://arxiv.org/pdf/2505.24088"
+ download "48_AngleMatters_ICML2025.pdf" "https://openreview.net/pdf?id=6UIer20oYA"
+
+ # ICLR 2025
+ download "49_LOIRE_ICLR2025.pdf" "https://openreview.net/pdf?id=F5PlYMC5ik"
+ download "50_LLMUnlearning_ICLR2025.pdf" "https://openreview.net/pdf?id=Essg9kb4yx"
+ download "51_SDLoRA_ICLR2025.pdf" "https://openreview.net/pdf?id=5U1rlpX68A"
+ download "52_SpuriousForgetting_ICLR2025.pdf" "https://openreview.net/pdf?id=ScI7IlKGdI"
+ download "53_FunctionVectors_ICLR2025.pdf" "https://openreview.net/pdf?id=gc8QAQfXv6"
+ download "54_CCLIP_ICLR2025.pdf" "https://openreview.net/pdf?id=sb7qHFYwBc"
+ download "55_AdaptInf_ICLR2025.pdf" "https://openreview.net/pdf?id=EwFJaXVePU"
+ download "56_VisionLanguageSynergy_ICLR2025.pdf" "https://openreview.net/pdf?id=9aZ2ixiYGd"
+
+ # CVPR 2025
+ download "57_LanguageGuidedCBM_CVPR2025.pdf" "https://openaccess.thecvf.com//content/CVPR2025/papers/Yu_Language_Guided_Concept_Bottleneck_Models_for_Interpretable_Continual_Learning_CVPR_2025_paper.pdf"
+ download "58_AdaDARE_CVPR2025.pdf" "https://openaccess.thecvf.com//content/CVPR2025/papers/Xie_AdaDARE-gamma_Balancing_Stability_and_Plasticity_in_Multi-modal_LLMs_through_Efficient_CVPR_2025_paper.pdf"
+ download "59_SyntheticGIFT_CVPR2025.pdf" "https://arxiv.org/pdf/2503.04229"
+ download "60_CLLoRA_CVPR2025.pdf" "https://arxiv.org/pdf/2505.24816"
+ download "61_LoRASubtraction_CVPR2025.pdf" "https://openaccess.thecvf.com//content/CVPR2025/papers/Liu_LoRA_Subtraction_for_Drift_Resistant_Space_in_Exemplar_Free_Continual_Learning_CVPR_2025_paper.pdf"
+
+ # NeurIPS 2024
+ download "62_StabilizingZeroShot_NeurIPS2024.pdf" "https://papers.neurips.cc/paper_files/paper/2024/file/e7feb9dbd9a94b6c552fc403fcebf2ef-Paper-Conference.pdf"
+ download "63_AdvancingCrossDomain_NeurIPS2024.pdf" "https://openreview.net/pdf/f13992ea7e554b8fcfa2b120be55eeb89c25643f.pdf"
+ download "64_GlobalAlignment_NeurIPS2024.pdf" "https://openreview.net/pdf/0b2a82c75f549856c3b133f08c9abe7349c018d7.pdf"
+ download "65_CLAP4CLIP_NeurIPS2024.pdf" "https://openreview.net/pdf/649fc2bc1d6ab7ff1bb07d921e2180c36c2ccf3b.pdf"
+ download "66_TrainAttention_NeurIPS2024.pdf" "https://openreview.net/pdf/2d2fc4beb4ba2418dd2a4c680959b5708e85b13e.pdf"
+ download "67_ViLCoBench_NeurIPS2024.pdf" "https://arxiv.org/pdf/2406.13123"
+ download "68_VPTNullSpace_NeurIPS2024.pdf" "https://arxiv.org/pdf/2406.05658"
+
+ # ECCV 2024
+ download "69_CLIPAdaptiveRepr_ECCV2024.pdf" "https://arxiv.org/pdf/2407.14143"
+ download "70_MindInterference_ECCV2024.pdf" "https://arxiv.org/pdf/2407.05342"
+ download "71_SelectDistill_ECCV2024.pdf" "https://arxiv.org/pdf/2403.09296"
+ download "72_PILoRA_ECCV2024.pdf" "https://arxiv.org/pdf/2401.02094"
+ download "73_PromptCCD_ECCV2024.pdf" "https://arxiv.org/pdf/2407.19001"
+ download "74_AnytimeCL_ECCV2024.pdf" "https://arxiv.org/pdf/2409.08518"
+ download "75_CLIFF_ECCV2024.pdf" "https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/07221.pdf"
+ download "76_CLEO_ECCV2024.pdf" "https://arxiv.org/pdf/2407.08411"
+ download "77_LearnableDriftComp_ECCV2024.pdf" "https://arxiv.org/pdf/2407.08536"
+ download "78_AdaptWithoutForgetting_ECCV2024.pdf" "https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/07052.pdf"
+
+ # ICML 2024
+ download "79_COPAL_ICML2024.pdf" "https://openreview.net/pdf?id=Lt8Lk7IQ5b"
+ download "80_STELLA_ICML2024.pdf" "https://arxiv.org/pdf/2310.08204"
+
+ # CVPR 2024
+ download "81_InfLoRA_CVPR2024.pdf" "https://arxiv.org/pdf/2404.00228"
+ download "82_MoEAdapters_CVPR2024.pdf" "https://arxiv.org/pdf/2403.11549"
+ download "83_VLFewShotIL_CVPR2024.pdf" "https://arxiv.org/pdf/2404.02117"
+ download "84_LanguageGuidedSupervision_CVPR2024.pdf" "https://arxiv.org/pdf/2403.16124"
+ download "85_TextEnhancedFedCIL_CVPR2024.pdf" "https://arxiv.org/pdf/2403.14101"
+ download "86_GenMultiModalCIL_CVPR2024.pdf" "https://arxiv.org/pdf/2403.18383"
+ download "87_ECLIPSE_CVPR2024.pdf" "https://arxiv.org/pdf/2403.20126"
+
+ # ICLR 2024
+ download "88_ScalableLM_ICLR2024.pdf" "https://openreview.net/pdf?id=mz8owj4DXu"
+ download "89_AdaptLLMReadComp_ICLR2024.pdf" "https://openreview.net/pdf?id=y886UXPEZ0"
+ download "90_DissectingForgetting_ICLR2024.pdf" "https://openreview.net/pdf?id=tmsqb6WpLz"
+ download "91_TiCCLIP_ICLR2024.pdf" "https://openreview.net/pdf?id=TLADT8Wrhn"
+ download "92_CPPO_ICLR2024.pdf" "https://openreview.net/pdf?id=86zAUE80pP"
+
+ # AAAI 2024
+ download "93_TaskAwareLangImg_AAAI2024.pdf" "https://ojs.aaai.org/index.php/AAAI/article/view/28537/29047"
+ download "94_MaintainingFairness_AAAI2024.pdf" "https://ojs.aaai.org/index.php/AAAI/article/view/33842/36057"
+
+ # 2023 papers
+ download "95_SoftMaskingMixedTasks_EMNLP2023.pdf" "https://arxiv.org/pdf/2310.09436"
+ download "96_FeCAM_NeurIPS2023.pdf" "https://arxiv.org/pdf/2309.14062"
+ download "97_ParameterLevelSoftMasking_ICML2023.pdf" "https://arxiv.org/pdf/2306.14775"
+ download "98_OffDiagonalVL_ICML2023.pdf" "https://arxiv.org/pdf/2305.07437"
+ download "99_CTP_ICCV2023.pdf" "https://arxiv.org/pdf/2308.07146"
+ download "100_LanguageGuidedPrompt_ICCV2023.pdf" "https://arxiv.org/pdf/2308.15827"
+ download "101_PreventZeroShotDeg_ICCV2023.pdf" "https://arxiv.org/pdf/2303.06628"
+ download "102_MRN_ICCV2023.pdf" "https://arxiv.org/pdf/2305.14758"
+ download "103_CIGN_ICCV2023.pdf" "https://arxiv.org/pdf/2309.05281"
148
+ download "104_ContinualLearningLM_ICLR2023.pdf" "https://openreview.net/pdf?id=m_GDIItaI3o"
149
+ download "105_ProgressivePrompts_ICLR2023.pdf" "https://openreview.net/pdf?id=UJTgQBc91_"
150
+ download "106_LabelGeneration_ACL2023.pdf" "https://arxiv.org/pdf/2306.12619"
151
+ download "107_CrossLingualTransfer_ACL2023.pdf" "https://arxiv.org/pdf/2305.11449"
152
+ download "108_ExploringDataGeometry_CVPR2023.pdf" "https://arxiv.org/pdf/2304.03931"
153
+ download "109_CODAPrompt_CVPR2023.pdf" "https://arxiv.org/pdf/2211.13218"
154
+
155
+ echo ""
156
+ echo "=== Download Summary ==="
157
+ total=$(find . -maxdepth 1 -name "*.pdf" | wc -l)
158
+ ok=$(find . -maxdepth 1 -name "*.pdf" -size +50k | wc -l)
159
+ echo "Total files: $total"
160
+ echo "Valid PDFs (>50KB): $ok"
161
+ echo "Failed/small: $((total - ok))"
nlp_paper_survey/paper_links.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5fa911c1c33292096d0aea5996b4fbfb17df3f2da51e49d70f10445e2377a279
3
+ size 17168
nlp_paper_survey/papers ADDED
@@ -0,0 +1 @@
1
+ Subproject commit 61c634bc8ff095c46b876fbcbf31f8bba91f0b8f
nlp_paper_survey/summaries/01_feature_distributions_ICML2025.md ADDED
@@ -0,0 +1,29 @@
1
+ # Paper 01: Exploiting Presentative Feature Distributions for PE-CL of LLMs
2
+ - **Venue**: ICML 2025
3
+ - **Authors**: Xin Cheng, Jiabo Ye, Haiyang Xu, Ming Yan, Ji Zhang, Feng Liu, Fei Huang, Lei Feng
4
+ - **Link**: https://openreview.net/forum?id=6udKBHc0Mr
5
+
6
+ ## Summary
7
+ The paper addresses "information leakage" (IL) in CL for LLMs, where task-related information from earlier tasks is accessed again. Method:
8
+ 1. Each PEFT (LoRA) block is characterized by a "presentative feature distribution": the mean of the features produced by the pre-trained LLM
9
+ 2. At inference, the similarity between the input instance and the presentative distributions is used to select the matching LoRA block
10
+ 3. The selection step requires no new trainable parameters
11
+
12
+ ## Technical characteristics
13
+ - **Architecture**: Separate LoRA blocks per task + frozen pre-trained LLM
14
+ - **Feature distribution**: Mean vector of pre-trained features per task per layer
15
+ - **Selection**: Dot product / L2 / cosine similarity between input features and stored distributions
16
+ - **Multi-module**: YES; uses multiple LoRA blocks with dynamic selection → MULTI-MODULE
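The selection step listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names (`mean_vector`, `select_block`), the toy 2-D features, and the use of negative L2 distance as a similarity score are all hypothetical choices made for this example.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def negative_l2(a: Vector, b: Vector) -> float:
    """Negated Euclidean distance, so that larger still means more similar."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Assumed mapping from metric name to scoring function.
SIMILARITIES: Dict[str, Callable[[Vector, Vector], float]] = {
    "dot": dot,
    "l2": negative_l2,
    "cosine": cosine_similarity,
}


def mean_vector(features: List[Vector]) -> List[float]:
    """Mean of the frozen-model features for one task (the stored 'distribution')."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def select_block(x: Vector, task_means: Dict[str, Vector], metric: str = "cosine") -> str:
    """Return the task whose stored mean is most similar to the input feature x."""
    score = SIMILARITIES[metric]
    return max(task_means, key=lambda task: score(x, task_means[task]))


# Toy features standing in for pre-trained LLM features of two tasks.
task_means = {
    "task_a": mean_vector([[1.0, 0.0], [0.8, 0.2]]),
    "task_b": mean_vector([[0.0, 1.0], [0.2, 0.8]]),
}
chosen = select_block([0.7, 0.1], task_means)
```

No parameters are trained here: the only stored state is one mean vector per task, which matches the "no new trainable parameters" point in the summary. Storing only means is also why the note scores the paper low on geometry-aware modeling.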
17
+
18
+ ## Assessment as motivation for Simple Idea
19
+ ### ❌ NOT SUITABLE (Score: 2/10)
20
+
21
+ **Reasons for rejection:**
22
+ 1. **Violates single-model**: Uses multiple LoRA blocks (1 per task) + dynamic routing = multi-module architecture
23
+ 2. **Feature distribution is used only for routing**: The paper uses the distribution to SELECT a LoRA block, not for ANTI-FORGETTING. Simple Idea uses the distribution to preserve old cluster geometry
24
+ 3. **No geometry-aware modeling**: The distributions are only mean vectors and do not model shape/anisotropy
25
+ 4. **Different paradigm**: Task isolation (parameter isolation) vs. single-model continual learning
26
+
27
+ **Similarities (minor):**
28
+ - Same idea of using a feature distribution, but for a different purpose (routing vs. anti-forgetting)
29
+ - Same recognition of the importance of feature representations
nlp_paper_survey/summaries/02_papers_13_34_abstracts.md ADDED
@@ -0,0 +1,181 @@
1
+ # Paper Abstracts (Papers 13-34)
2
+
3
+ ---
4
+
5
+ ## Paper 13: Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning
6
+ **Source:** arxiv:2507.09118 | ICCV 2025
7
+
8
+ **Abstract:**
9
+ Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data.
10
+
11
+ ---
12
+
13
+ ## Paper 14: SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
14
+ **Source:** arxiv:2411.13949
15
+
16
+ **Abstract:**
17
+ Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules—one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a new CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions.
18
+
19
+ ---
20
+
21
+ ## Paper 15: DMNSP: Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning
22
+ **Source:** ICCV 2025 (CVF Open Access)
23
+
24
+ **Abstract:**
25
+ Vision-Language Models (VLM) have emerged as a highly promising approach for Continual Learning (CL) due to their powerful generalized features. While adapter-based VLM can exploit both task-specific and task-agnostic features, current CL methods have largely overlooked the distinct and evolving parameter distributions in visual and language modalities, which are found crucial for effectively mitigating catastrophic forgetting. In this study, we find that the visual modality experiences a broader parameter distribution and propose DMNSP (Dynamic Multi-Layer Null Space Projection), which dynamically projects updates into the null space of previous tasks across multiple layers, separately handling the visual and language modalities to preserve learned knowledge while accommodating new tasks.
26
+
27
+ *(Note: Full abstract obtained from Google Scholar snippet + CVF proceedings. Paper is ICCV 2025 open-access only, not on arxiv.)*
28
+
29
+ ---
30
+
31
+ ## Paper 16: Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
32
+ **Source:** arxiv:2502.04469 | ICCV 2025
33
+
34
+ **Abstract:**
35
+ Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead, but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior task questions to counteract overfitting to the answer space of the current task, addressing the problem out of answer set. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA.
36
+
37
+ ---
38
+
39
+ ## Paper 17: TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
40
+ **Source:** arxiv:2410.10491
41
+
42
+ **Abstract:**
43
+ Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose TWIST, a twin-expert stepwise tuning module that modifies the decoder of the language model using one frozen module pre-trained on image understanding tasks and another learnable one for visual grounding tasks. This allows the MLLM to retain previously learned knowledge and skills, while acquiring what is missing. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT, which mimics human reasoning in visual grounding. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process, thereby simplifying the task of visual grounding. We evaluate our approach on several standard benchmark datasets, encompassing grounded image captioning, zero-shot localization, and visual grounding tasks. Our method consistently delivers strong performance across all tasks, while retaining the pre-trained image understanding capabilities.
44
+
45
+ ---
46
+
47
+ ## Paper 18: Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models
48
+ **Source:** arxiv:2508.00260 | ICCV 2025
49
+
50
+ **Abstract:**
51
+ Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.
52
+
53
+ ---
54
+
55
+ ## Paper 19: External Knowledge Injection for CLIP-Based Class-Incremental Learning
56
+ **Source:** arxiv:2503.08510 | ICCV 2025
57
+
58
+ **Abstract:**
59
+ Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of "cat" can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE.
60
+
61
+ ---
62
+
63
+ ## Paper 20: Overcoming Dual Drift for Continual Long-Tailed Visual Question Answering
64
+ **Source:** ICCV 2025 (CVF Open Access)
65
+
66
+ **Abstract:**
67
+ Visual Question Answering (VQA) is a widely explored multimodal task aimed at answering questions based on images. Recently, a few studies have started to investigate continual learning in VQA to cope with evolving multimodal data streams. However, these studies fall short of tackling another critical issue in real-world VQA applications: the long-tailed distribution of data. In this paper, we introduce Continual Long-Tailed Visual Question Answering (CLT-VQA) and identify two critical challenges: inner-task prototype drift, where class prototypes shift due to long-tailed imbalance within each task, and inter-task prototype drift, where prototypes of old tasks shift when learning new tasks. To overcome these dual drifts, the authors propose methods to stabilize prototypes and maintain balanced representations across the evolving task sequence.
68
+
69
+ *(Note: Full abstract obtained from Google Scholar snippet + CVF proceedings. Paper is ICCV 2025 open-access only, not on arxiv.)*
70
+
71
+ ---
72
+
73
+ ## Paper 21: PLAN: Proactive Low-Rank Allocation for Continual Learning
74
+ **Source:** arxiv:2510.21188 | ICCV 2025
75
+
76
+ **Abstract:**
77
+ Continual learning (CL) requires models to continuously adapt to new tasks without forgetting past knowledge. In this work, we propose Proactive Low-rank AllocatioN (PLAN), a framework that extends Low-Rank Adaptation (LoRA) to enable efficient and interference-aware fine-tuning of large pre-trained models in CL settings. PLAN proactively manages the allocation of task-specific subspaces by introducing orthogonal basis vectors for each task and optimizing them through a perturbation-based strategy that minimizes conflicts with previously learned parameters. Furthermore, PLAN incorporates a novel selection mechanism that identifies and assigns basis vectors with minimal sensitivity to interference, reducing the risk of degrading past knowledge while maintaining efficient adaptation to new tasks. Empirical results on standard CL benchmarks demonstrate that PLAN consistently outperforms existing methods, establishing a new state-of-the-art for continual learning with foundation models.
78
+
79
+ ---
80
+
81
+ ## Paper 22: Knowledge Decoupling via Orthogonal Projection for Lifelong Editing of Large Language Models
82
+ **Source:** ACL 2025 (aclanthology.org/2025.acl-long.646/)
83
+
84
+ **Abstract:**
85
+ As large language models (LLMs) require continuous knowledge updates and the mitigation of hallucination issues in generated content, lifelong model editing has become a prominent research area. A mainstream knowledge editing method usually freezes LLM's original parameters and adds extra trainable modules for new knowledge management, reducing interference with old knowledge. Although these approaches have achieved some success, our experiments show that, after extensive editing, the model's knowledge understanding and memory capacity significantly degrade, particularly concerning early edited knowledge. The root cause is that subsequent edits interfere with the previously edited knowledge, and we refer to this phenomenon as knowledge coupling. To address this issue, we propose the Knowledge Decoupling Editing (KDE) method. Specifically, KDE stores the basis vectors of the representation space of past edits in a knowledge cache. It projects the gradient of the current edit onto a space orthogonal to previous knowledge for updating. This method effectively alleviates the coupling between different pieces of knowledge. We also propose a two-stage training strategy to better balance the model's ability to edit new knowledge and distinguish whether a query is related to previous edits. This strategy gradually reduces the interference between new knowledge editing and query distinction, maintaining stable performance during long-term editing. We compared KDE with nine cutting-edge editing methods across multiple mainstream LLMs. The results demonstrate that, regarding question-answering ability and hallucination mitigation, KDE achieves average improvements of 14% and 61%.
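The projection step described in this abstract (updating along a direction orthogonal to the space of past edits) can be illustrated with a small sketch. It shows generic orthogonal gradient projection, assuming the cached basis is first orthonormalized with Gram-Schmidt; the names `orthonormalize` and `project_out` and the 3-D toy vectors are hypothetical and are not taken from the KDE paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors: List[Vector], eps: float = 1e-10) -> List[List[float]]:
    """Gram-Schmidt: turn cached representation vectors into an orthonormal basis."""
    basis: List[List[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad: Vector, basis: List[List[float]]) -> List[float]:
    """Remove from grad every component lying in span(basis): g - U U^T g."""
    g = list(grad)
    for b in basis:
        coeff = dot(g, b)
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


# Toy "knowledge cache": two past-edit directions spanning the x-y plane.
cache = orthonormalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
projected = project_out([3.0, 4.0, 5.0], cache)
```

After projection the update has zero inner product with every cached direction, which is the sense in which the new edit avoids interfering with earlier ones in this simplified picture.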
86
+
87
+ ---
88
+
89
+ ## Paper 23: Serial Lifelong Editing via Mixture of Knowledge Experts
90
+ **Source:** ACL 2025 (aclanthology.org/2025.acl-long.1492/)
91
+
92
+ **Abstract:**
93
+ It is challenging to update Large language models (LLMs) since real-world knowledge evolves. While existing Lifelong Knowledge Editing (LKE) methods efficiently update sequentially incoming edits, they often struggle to precisely overwrite the outdated knowledge with the latest one, resulting in conflicts that hinder LLMs from determining the correct answer. To address this Serial Lifelong Knowledge Editing (sLKE) problem, we propose a novel Mixture-of-Knowledge-Experts scheme with an Activation-guided Routing Mechanism (ARM), which assigns specialized experts to store domain-specific knowledge and ensures that each update completely overwrites old information with the latest data. Furthermore, we introduce a novel sLKE benchmark where answers to the same concept are updated repeatedly, to assess the ability of editing methods to refresh knowledge accurately. Experimental results on both LKE and sLKE benchmarks show that our ARM performs favorably against SOTA knowledge editing methods.
94
+
95
+ ---
96
+
97
+ ## Paper 24: Efficient Domain Continual Pretraining by Mitigating the Stability Gap
98
+ **Source:** ACL 2025 (aclanthology.org/2025.acl-long.1578/)
99
+
100
+ **Abstract:**
101
+ Continual pretraining enables Large Language Models (LLMs) to adapt to specialized domains like medicine and law. However, we observe a consistent phenomenon across different model sizes and domains: a temporary performance drop at the start of the continual pretraining process, followed by a performance recovery phase. To gain a deeper understanding of this issue, we use the stability gap—a concept adapted from the visual domain—which explains this initial drop arises from instability in the model's general abilities. We validate this hypothesis through a series of experiments. To address this initial instability and enhance LLM performance within a fixed compute budget, we propose a training strategy that mitigates instability by increasing the number of epochs, alongside two data sampling strategies targeting data domain relevance and corpus distribution. We conduct experiments on Llama-family models to validate the effectiveness of our strategies for continual pretraining and instruction tuning in medical and legal domains. Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% using only 40% of the original training budget, while also enhancing general task performance without causing forgetting. Furthermore, we apply our strategies to continually pre-train and instruction-tune the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among open-source models on several benchmarks and rivals GPT-4 on specific tasks.
102
+
103
+ ---
104
+
105
+ ## Paper 25: NSE: Neuron-Level Sequential Editing for Large Language Models
106
+ **Source:** arxiv:2410.04045 | ACL 2025
107
+
108
+ **Abstract:**
109
+ This work explores sequential model editing in large language models (LLMs), a critical task that involves modifying internal knowledge within LLMs continuously through multi-round editing, each incorporating updates or corrections to adjust the model outputs without the need for costly retraining. Existing model editing methods, especially those that alter model parameters, typically focus on single-round editing and often face significant challenges in sequential model editing—most notably issues of model forgetting and failure. To address these challenges, we introduce a new model editing method, namely Neuron-level Sequential Editing (NSE), tailored for supporting sequential model editing. Specifically, we optimize the target layer's hidden states using the model's original weights to prevent model failure. Furthermore, we iteratively select neurons in multiple layers for editing based on their activation values to mitigate model forgetting. Our empirical experiments demonstrate that NSE significantly outperforms current modifying parameters model editing methods, marking a substantial advancement in the field of sequential model editing.
110
+
111
+ ---
112
+
113
+ ## Paper 26: CLoRA: Controlled Low-Rank Adaptation with Subspace Regularization
114
+ **Source:** arxiv:2410.16801 | ACL 2025
115
+
116
+ **Abstract:**
117
+ Large language models (LLMs) exhibit remarkable capabilities in natural language processing but face catastrophic forgetting when learning new tasks, where adaptation to a new domain leads to a substantial decline in performance on previous tasks. In this paper, we propose Controlled LoRA (CLoRA), a sub-space regularization method on LoRA structure. Aiming to reduce the scale of output change while introducing minimal constraint on model capacity, CLoRA imposes constraint on the direction of updating matrix's null space. Experimental results on one-stage LLM finetuning tasks and continual learning settings highlight the superiority of CLoRA as an effective parameter efficient finetuning method with catastrophic forgetting mitigating. Further investigation for model parameters indicates that CLoRA effectively balances the trade-off between model capacity and degree of forgetting.
118
+
119
+ ---
120
+
121
+ ## Paper 27: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
122
+ **Source:** arxiv:2503.12941 | ACL 2025 (Main)
123
+
124
+ **Abstract:**
125
+ Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods.
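For reference, the CKA similarity mentioned in this abstract can be computed as in the sketch below. It assumes the linear-kernel variant, CKA(X, Y) = ||Yᵀ X||_F² / (||Xᵀ X||_F · ||Yᵀ Y||_F) on column-centered features; the abstract does not say which kernel the paper uses, and the function names and toy matrices are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are examples, columns are feature dimensions


def center_columns(m: Matrix) -> Matrix:
    """Subtract the per-column mean so every feature dimension has zero mean."""
    n = len(m)
    means = [sum(column) / n for column in zip(*m)]
    return [[value - mu for value, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b, accumulated column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(x * y for x, y in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA between two feature matrices computed on the same examples."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(yc, xc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator if denominator > 0 else 0.0


# Toy activations of one layer for four inputs.
layer_a: Matrix = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# The same activations rotated by 90 degrees and scaled by 3.
layer_b: Matrix = [[-3.0 * r[1], 3.0 * r[0]] for r in layer_a]
similarity = linear_cka(layer_a, layer_b)
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling of the features, so the rotated and scaled copy above scores as identical to the original. This invariance is what makes it usable for comparing the same layer across models trained on different datasets.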
126
+
127
+ ---
128
+
129
+ ## Paper 28: Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
130
+ **Source:** arxiv:2505.17110 | ACL 2025
131
+
132
+ **Abstract:**
133
+ Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs' multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs' fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, proving that MMER effectively expands LLMs' multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.
134
+
135
+ ---
136
+
137
+ ## Paper 29: GORP: Continual Gradient Low-Rank Projection Fine-Tuning for LLMs
138
+ **Source:** arxiv:2507.02503 | ACL 2025 (Main)
139
+
140
+ **Abstract:**
141
+ Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model's ability to learn new tasks and transfer knowledge due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel training strategy that overcomes these limitations by synergistically combining full and low-rank parameters and jointly updating within a unified low-rank gradient subspace. GORP expands the optimization space while preserving efficiency and mitigating catastrophic forgetting. Extensive experiments on continual learning benchmarks demonstrate GORP's superior performance compared to existing state-of-the-art approaches.
142
+
143
+ ---
144
+
145
+ ## Paper 30: DGAR: A Generative Adaptive Replay Continual Learning Model for Temporal Knowledge Graph Reasoning
146
+ **Source:** arxiv:2506.04083 | ACL 2025
147
+
148
+ **Abstract:**
149
+ Recent Continual Learning (CL)-based Temporal Knowledge Graph Reasoning (TKGR) methods focus on significantly reducing computational cost and mitigating catastrophic forgetting caused by fine-tuning models with new data. However, existing CL-based TKGR methods still face two key limitations: (1) They usually one-sidedly reorganize individual historical facts, while overlooking the historical context essential for accurately understanding the historical semantics of these facts; (2) They preserve historical knowledge by simply replaying historical facts, while ignoring the potential conflicts between historical and emerging facts. In this paper, we propose a Deep Generative Adaptive Replay (DGAR) method, which can generate and adaptively replay historical entity distribution representations from the whole historical context. To address the first challenge, historical context prompts as sampling units are built to preserve the whole historical context information. To overcome the second challenge, a pre-trained diffusion model is adopted to generate the historical distribution. During the generation process, the common features between the historical and current distributions are enhanced under the guidance of the TKGR model. In addition, a layer-by-layer adaptive replay mechanism is designed to effectively integrate historical and current distributions. Experimental results demonstrate that DGAR significantly outperforms baselines in reasoning and mitigating forgetting.
150
+
151
+ ---
152
+
153
+ ## Paper 31: Learn to Memorize: Scalable Continual Learning in Semiparametric Models with Mixture-of-Neighbors Induction Memory
154
+ **Source:** ACL 2025 (aclanthology.org/2025.acl-long.1385/)
155
+
156
+ **Abstract:**
157
+ Semiparametric language models (LMs) have shown promise in various Natural Language Processing (NLP) tasks. However, they utilize non-parametric memory as static storage, which lacks learning capability and remains disconnected from the internal information flow of the parametric models, limiting scalability and efficiency. Based on recent interpretability theories of LMs, we reconceptualize the non-parametric memory represented by kNN-LM as a learnable Mixture-of-Neighbors Induction Memory (MoNIM), which synergizes the induction capabilities of attention heads with the memorization strength of feed-forward networks (FFN). By integrating into the model's information flow, MoNIM functions as an FFN-like bypass layer within the Transformer architecture, enabling effective learning of new knowledge. Extensive experiments demonstrate that MoNIM is a retentive and scalable continual learner in both data- and model-wise, enhancing the scalability and continual learning performance of semiparametric LMs.
158
+
159
+ ---
160
+
161
+ ## Paper 32: Don't Half-listen: Capturing Key-part Information in Continual Instruction Tuning
162
+ **Source:** arxiv:2403.10056 | ACL 2025
163
+
164
+ **Abstract:**
165
+ Instruction tuning for large language models (LLMs) can drive them to produce results consistent with human goals in specific downstream tasks. However, the process of continual instruction tuning (CIT) for LLMs may bring about the catastrophic forgetting (CF) problem, where previously learned abilities are degraded. Recent methods try to alleviate the CF problem by modifying models or replaying data, which may only remember the surface-level pattern of instructions and get confused on held-out tasks. In this paper, we propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG). Our method computes the information gain on masked parts to dynamically replay data and refine the training objective, which enables LLMs to capture task-aware information relevant to the correct response and alleviate overfitting to general descriptions in instructions. In addition, we propose two metrics, P-score and V-score, to measure the generalization and instruction-following abilities of LLMs. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
166
+
167
+ ---
168
+
169
+ ## Paper 33: Recurrent Knowledge Identification and Fusion for Language Model Continual Learning
170
+ **Source:** arxiv:2502.17510 | ACL 2025 (Main)
171
+
172
+ **Abstract:**
173
+ Continual learning (CL) is crucial for deploying large language models (LLMs) in dynamic real-world environments without costly retraining. While recent model ensemble and model merging methods guided by parameter importance have gained popularity, they often struggle to balance knowledge transfer and forgetting, mainly due to the reliance on static importance estimates during sequential training. In this paper, we present Recurrent-KIF, a novel CL framework for Recurrent Knowledge Identification and Fusion, which enables dynamic estimation of parameter importance distributions to enhance knowledge transfer. Inspired by human continual learning, Recurrent-KIF employs an inner loop that rapidly adapts to new tasks while identifying important parameters, coupled with an outer loop that globally manages the fusion of new and historical knowledge through redundant knowledge pruning and key knowledge merging. These inner-outer loops iteratively perform multiple rounds of fusion, allowing Recurrent-KIF to leverage intermediate training information and adaptively adjust fusion strategies based on evolving importance distributions. Extensive experiments on two CL benchmarks with various model sizes (from 770M to 13B) demonstrate that Recurrent-KIF effectively mitigates catastrophic forgetting and enhances knowledge transfer.
174
+
175
+ ---
176
+
177
+ ## Paper 34: TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
178
+ **Source:** arxiv:2504.02107 | ACL 2025
179
+
180
+ **Abstract:**
181
+ Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) — orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
nlp_paper_survey/summaries/03_papers_35_61_abstracts.md ADDED
@@ -0,0 +1,281 @@
1
+ # Paper Abstracts: Papers 35–61
2
+
3
+ ## ICML 2025 (Papers 35–48)
4
+
5
+ ---
6
+
7
+ ### Paper 35
8
+ **Title:** Exploiting Presentative Feature Distributions for Parameter-Efficient Continual Learning of Large Language Models
9
+ **Venue:** ICML 2025 | **Category:** [LLM-CL]
10
+ **Source:** OpenReview 6udKBHc0Mr
11
+
12
+ **Abstract:**
13
+ Endowing large language models (LLMs) with continual learning (CL) capacities has attracted increasing attention. However, LLMs typically entail substantial computational costs, and their deployment in CL scenarios exacerbates the problem. To address this, Parameter-Efficient Fine-Tuning (PEFT) methods have been widely adopted. Nevertheless, existing PEFT-based CL approaches often rely on expanding architectures or identifying task-specific components, which either increase model complexity or demand additional task information. In this work, we propose a novel CL method that characterizes each PEFT block by its presentative feature distribution—a compact statistical representation capturing the knowledge encoded in the block. When confronted with new data, our method dynamically selects the most appropriate PEFT block based on distribution similarity and determines whether to reuse an existing block or create a new one. This strategy enables efficient knowledge sharing and reduces redundant parameterization across tasks. Extensive experiments on multiple CL benchmarks demonstrate that our approach achieves competitive or superior performance compared to state-of-the-art methods while maintaining parameter efficiency.
14
+
15
+ ---
16
+
17
+ ### Paper 36
18
+ **Title:** Reinforced Lifelong Editing for Language Models
19
+ **Venue:** ICML 2025 | **Category:** [KE]
20
+ **Source:** arxiv:2502.05759
21
+
22
+ **Abstract:**
23
+ Knowledge editing enables efficient modification of language models' behaviors without full retraining. However, existing methods degrade as edits accumulate over time, failing in lifelong editing scenarios. We propose RLEdit, a reinforcement learning-based editing method that treats editing losses as rewards and learns an adaptive editing policy. RLEdit introduces a lightweight hypernetwork trained via policy gradient methods to generate context-aware parameter updates. Our approach naturally handles sequential edits by learning from the history of previous modifications, maintaining a balance between accommodating new edits and preserving existing knowledge. Experiments across multiple benchmarks show RLEdit achieves 59.24% improvement in lifelong editing performance while requiring only 2.11% of the time compared to leading baselines.
24
+
25
+ ---
26
+
27
+ ### Paper 37
28
+ **Title:** WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs
29
+ **Venue:** ICML 2025 | **Category:** [KE]
30
+ **Source:** arxiv:2503.05683
31
+
32
+ **Abstract:**
33
+ Knowledge editing methods promise to update the knowledge of large language models efficiently without expensive retraining. However, their effectiveness in realistic lifelong editing settings remains poorly understood. We introduce WikiBigEdit, a large-scale benchmark derived from real-world Wikidata edits containing over 500K question-answer pairs. Using this benchmark, we conduct a comprehensive evaluation of state-of-the-art knowledge editing methods in truly lifelong settings with thousands of sequential edits. Our findings reveal significant limitations: all methods show substantial performance degradation as edits accumulate, with reliability dropping and unintended side effects increasing. We analyze the root causes of these failures and identify key factors including edit interference, parameter saturation, and knowledge propagation failures. Our work provides critical insights for the development of more robust lifelong editing approaches.
34
+
35
+ ---
36
+
37
+ ### Paper 38
38
+ **Title:** Knowledge Swapping via Learning and Unlearning
39
+ **Venue:** ICML 2025 | **Category:** [KE]
40
+ **Source:** arxiv:2502.08075
41
+
42
+ **Abstract:**
43
+ We introduce the task of Knowledge Swapping, which aims to simultaneously inject new knowledge and remove outdated or undesired knowledge from language models. Unlike traditional knowledge editing that only adds or modifies knowledge, or machine unlearning that only removes knowledge, Knowledge Swapping requires both operations to be performed coherently. We propose a "Learning Before Forgetting" strategy that first injects the new replacement knowledge before unlearning the outdated knowledge, which we find is more effective than the reverse order. Our approach uses a two-stage pipeline with constrained optimization to ensure the new knowledge is robustly acquired while the old knowledge is thoroughly removed. Experiments demonstrate that our method achieves effective knowledge swapping across multiple domains while maintaining model utility on unrelated tasks.
44
+
45
+ ---
46
+
47
+ ### Paper 39
48
+ **Title:** Learning Dynamics in Continual Pre-Training for Large Language Models
49
+ **Venue:** ICML 2025 (Oral) | **Category:** [Analysis]
50
+ **Source:** arxiv:2505.07796
51
+
52
+ **Abstract:**
53
+ Continual pre-training (CPT) adapts pre-trained language models to new domains or corpora through additional pre-training. Despite its practical importance, the learning dynamics underlying CPT remain poorly understood. In this work, we present a comprehensive empirical and theoretical analysis of CPT dynamics. We identify two critical phenomena: the "distribution shift effect," where differences between pre-training and new data distributions lead to initial performance degradation, and the "learning rate annealing effect," where the reduced learning rate schedule in CPT limits adaptation speed. Building on these insights, we derive a CPT scaling law that unifies both effects, enabling practitioners to predict CPT outcomes based on data distribution divergence and training hyperparameters. Our scaling law accurately predicts CPT performance across diverse settings and provides practical guidelines for optimizing CPT efficiency.
54
+
55
+ ---
56
+
57
+ ### Paper 40
58
+ **Title:** Large Continual Instruction Assistant
59
+ **Venue:** ICML 2025 | **Category:** [IT]
60
+ **Source:** arxiv:2410.10868
61
+
62
+ **Abstract:**
63
+ We propose a general Continual Instruction Tuning (CIT) framework for large language models that enables continuous learning across diverse instruction-following tasks. Our approach introduces an Exponential Moving Average (EMA) update mechanism with a novel plasticity-stability balanced coefficient that dynamically adjusts the trade-off between learning new tasks and preserving previously acquired capabilities. The framework maintains a small set of representative exemplars from previous tasks and uses them in conjunction with the EMA update to prevent catastrophic forgetting. We demonstrate that our method effectively handles continuous streams of instruction tuning data across varied task types including generation, summarization, question answering, and reasoning, achieving strong performance on both new and previously learned tasks.
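As a rough illustration of the EMA update mentioned in this abstract, the sketch below blends a running copy of the parameters with freshly trained ones. The function name `ema_update`, the fixed coefficient `beta`, and the toy values are assumptions made for illustration only; the paper's plasticity-stability balanced coefficient is described as dynamic, and that part is not reproduced here.

```python
def ema_update(ema_params, new_params, beta):
    """Blend parameters: beta * ema + (1 - beta) * new, element by element.

    beta close to 1 favours stability (keeps the old weights),
    beta close to 0 favours plasticity (follows the new weights).
    A fixed beta is an assumption of this sketch.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * e + (1.0 - beta) * n for e, n in zip(ema_params, new_params)]


# Hypothetical toy parameters, not taken from the paper.
ema = [1.0, 2.0]
trained = [3.0, 0.0]
stable = ema_update(ema, trained, beta=0.5)
```

With `beta=1.0` the running copy is left unchanged, and with `beta=0.0` it is overwritten by the newly trained values, which marks the two ends of the stability-plasticity trade-off.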
64
+
65
+ ---
66
+
67
+ ### Paper 41
68
+ **Title:** TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by Hierarchical Gradient-Similarity Tree
69
+ **Venue:** ICML 2025 | **Category:** [LoRA]
70
+ **Source:** arxiv:2506.10355
71
+
72
+ **Abstract:**
73
+ Continual learning with Low-Rank Adaptation (LoRA) has shown promise for efficiently adapting large pre-trained models across sequential tasks. However, existing approaches typically apply uniform adaptation across all model layers, ignoring the fact that different layers contribute differently to different tasks. We propose TreeLoRA, a method that assigns layer-wise LoRA adapters based on a hierarchical gradient similarity tree. Our approach first computes gradient-based task similarity at each layer, then organizes this information into a hierarchical tree structure that guides adapter sharing and allocation decisions. Using multi-armed bandit techniques, TreeLoRA dynamically decides at each layer whether to reuse an existing adapter, share adapters across tasks, or create a new one. This fine-grained, layer-level control significantly reduces parameter overhead while improving knowledge transfer between related tasks. Extensive experiments across multiple continual learning benchmarks demonstrate that TreeLoRA achieves superior performance with fewer parameters compared to existing LoRA-based continual learning methods.
74
+
75
+ ---
76
+
77
+ ### Paper 42
78
+ **Title:** ALKN: Adaptive Localization of Knowledge Negation for Continual LLM Unlearning
79
+ **Venue:** ICML 2025 | **Category:** [KE]
80
+ **Source:** OpenReview tcK4PV3VN4
81
+
82
+ **Abstract:**
83
+ Machine unlearning for large language models (LLMs) aims to remove specific undesired knowledge while maintaining model utility on other tasks. In continual unlearning scenarios where multiple unlearning requests arrive sequentially, existing methods suffer from accumulated utility degradation. We propose ALKN (Adaptive Localization of Knowledge Negation), which introduces a dynamic masking mechanism to sparsify training gradients during unlearning, focusing modifications on the most relevant model parameters. ALKN adaptively adjusts the unlearning intensity based on the relationship between the target knowledge and the model's existing knowledge structure. By localizing the gradient updates to a small subset of critical parameters, our method minimizes collateral damage to retained knowledge. Experiments on sequential unlearning benchmarks show that ALKN effectively removes target knowledge while preserving significantly more model utility compared to existing approaches, especially when facing a long sequence of continual unlearning requests.
84
+
85
+ ---
86
+
87
+ ### Paper 43
88
+ **Title:** From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
89
+ **Venue:** ICML 2025 | **Category:** [LLM-CL]
90
+ **Source:** arxiv:2502.14802
91
+
92
+ **Abstract:**
93
+ Retrieval-Augmented Generation (RAG) enables language models to access external knowledge without parameter updates, but current RAG systems lack the ability to integrate, consolidate, and update retrieved information over time—capabilities essential for true continual learning. We present HippoRAG 2, a framework that bridges the gap between RAG and human-like long-term memory by introducing non-parametric continual learning mechanisms. Our approach augments RAG with a knowledge graph-based memory that can automatically integrate new information, resolve conflicts with existing knowledge, and form associative connections between related concepts. Experiments demonstrate a 7% improvement on associative memory tasks and strong performance on continual knowledge integration benchmarks, showing that non-parametric approaches offer a promising path toward continual learning for LLMs without the risks of catastrophic forgetting inherent in parametric updates.
94
+
95
+ ---
96
+
97
+ ### Paper 44
98
+ **Title:** SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning
99
+ **Venue:** ICML 2025 | **Category:** [MM]
100
+ **Source:** arxiv:2505.02486
101
+
102
+ **Abstract:**
103
+ Multimodal large language models (MLLMs) face catastrophic forgetting during continual instruction tuning across diverse multimodal tasks. We identify and categorize forgetting in this setting into two distinct types: superficial forgetting, where the model's output format deviates from expected patterns while retaining underlying knowledge, and essential forgetting, where the model genuinely loses previously acquired knowledge. To address these complementary challenges, we propose SEFE (Superficial and Essential Forgetting Eliminator). For superficial forgetting, we introduce Answer Style Diversification, a data augmentation strategy that exposes the model to varied answer formats during training, making it more robust to format shifts. For essential forgetting, we propose RegLoRA, a regularized Low-Rank Adaptation method that constrains parameter updates to preserve critical knowledge while maintaining plasticity for learning new tasks. Experiments on multiple multimodal continual learning benchmarks demonstrate that SEFE effectively mitigates both types of forgetting, achieving state-of-the-art performance.
104
+
105
+ ---
106
+
107
+ ### Paper 45
108
+ **Title:** LADA: Scalable Label-Specific CLIP Adapter for Continual Learning
109
+ **Venue:** ICML 2025 | **Category:** [VL]
110
+ **Source:** arxiv:2505.23271
111
+
112
+ **Abstract:**
113
+ Continual learning with pre-trained vision-language models like CLIP faces the challenge of adapting to sequential tasks without forgetting. Existing adapter-based methods often ignore class-level information, leading to interference between classes across different tasks. We propose LADA (Label-specific Adapter), a scalable continual learning framework that appends lightweight label-specific memory units to a frozen CLIP backbone. Each memory unit captures class-specific features and is updated only when learning its corresponding class, naturally preventing interference with other classes. We further introduce a feature distillation mechanism that aligns the adapted features with the original CLIP feature space, preserving the model's zero-shot generalization capabilities. LADA requires minimal additional parameters per class and supports efficient inference without task identity. Experiments on standard continual learning benchmarks demonstrate that LADA achieves state-of-the-art performance while maintaining the scalability and generalization benefits of CLIP.
114
+
115
+ ---
116
+
117
+ ### Paper 46
118
+ **Title:** Componential Prompt-Knowledge Alignment for Domain Incremental Learning
119
+ **Venue:** ICML 2025 | **Category:** [VL]
120
+ **Source:** arxiv:2505.04575
121
+
122
+ **Abstract:**
123
+ Prompt-based methods have shown promise for continual learning with pre-trained vision-language models, but they often suffer from misalignment between domain-specific prompts and the model's internal knowledge representations. We identify a fundamental issue we term component-wise misalignment: different components of learned prompts (e.g., those responsible for feature extraction vs. classification) may be specialized to different previous domains rather than the current one. We propose KA-Prompt (Knowledge-Aligned Prompt), a framework that explicitly addresses this misalignment by decomposing prompts into functional components and ensuring each component is properly aligned with domain-specific knowledge. Our approach introduces a componential alignment loss that encourages consistent domain specialization across prompt components, along with a knowledge-guided prompt selection mechanism for inference. Experiments on domain incremental learning benchmarks demonstrate that KA-Prompt significantly outperforms existing prompt-based methods by resolving the component-wise misalignment problem.
124
+
125
+ ---
126
+
127
+ ### Paper 47
128
+ **Title:** Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
129
+ **Venue:** ICML 2025 | **Category:** [VL]
130
+ **Source:** arxiv:2505.24088
131
+
132
+ **Abstract:**
133
+ Fine-tuning vision foundation models (VFMs) on downstream tasks often leads to catastrophic forgetting of pre-trained knowledge. Existing methods attempt to preserve feature distributions but rely on storing exemplars or computing expensive statistics over entire datasets. We propose Proxy-FDA (Proxy-based Feature Distribution Alignment), which uses nearest neighbor graphs to construct informative proxies that compactly represent the feature distribution of pre-trained models. During fine-tuning, our method aligns the evolving feature distribution with these proxies, effectively preserving the structure of the original feature space while allowing adaptation to new tasks. The proxy-based approach is memory-efficient and computationally lightweight, requiring no stored exemplars from previous tasks. Experiments across multiple continual learning and domain adaptation benchmarks demonstrate that Proxy-FDA effectively prevents forgetting while achieving competitive fine-tuning performance on downstream tasks.
134
+
135
+ ---
136
+
137
+ ### Paper 48
138
+ **Title:** Understanding the Forgetting of Replay-based Continual Learning via Feature Learning: Angle Matters
139
+ **Venue:** ICML 2025 | **Category:** [Analysis]
140
+ **Source:** OpenReview 6UIer20oYA
141
+
142
+ **Abstract:**
143
+ Replay-based methods are among the most effective approaches for continual learning, yet a theoretical understanding of how and why they mitigate forgetting remains limited. We develop a unified theoretical framework for analyzing replay-based continual learning through the lens of feature learning. Our key insight is that the angle between task signal vectors plays a crucial role in determining the degree of forgetting. When task signals are more aligned (smaller angle), replay is more effective at preventing forgetting; when they are more orthogonal, forgetting becomes harder to mitigate. We formalize this insight through a feature learning theory that characterizes how replay influences the learned representations across sequential tasks. Our analysis reveals that replay effectiveness depends not just on the number of stored exemplars but fundamentally on the geometric relationship between task-specific features in the representation space. Experiments validate our theoretical predictions across multiple continual learning settings and architectures.
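The geometric quantity this abstract refers to, the angle between two task signal vectors, can be computed as in the minimal sketch below. The helper name `angle_between` and the example vectors are hypothetical; how the paper defines and estimates task signal vectors is not specified here.

```python
import math


def angle_between(u, v):
    """Return the angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# Hypothetical task signal vectors.
aligned = angle_between([1.0, 0.0], [2.0, 0.0])     # same direction
orthogonal = angle_between([1.0, 0.0], [0.0, 1.0])  # perpendicular
```

Under the abstract's claim, the `aligned` pair corresponds to the regime where replay is more effective, and the `orthogonal` pair to the regime where forgetting is harder to mitigate.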
144
+
145
+ ---
146
+
147
+ ## ICLR 2025 (Papers 49–56)
148
+
149
+ ---
150
+
151
+ ### Paper 49
152
+ **Title:** LOIRE: LifelOng learning on Incremental data via pre-trained LM gRowth Efficiently
153
+ **Venue:** ICLR 2025 | **Category:** [LLM-CL]
154
+ **Source:** OpenReview F5PlYMC5ik
155
+
156
+ **Abstract:**
157
+ Pre-trained language models (PLMs) often require continual learning to stay current with evolving knowledge and domains. However, existing approaches either fine-tune the entire model (risking forgetting) or freeze it (limiting adaptation). We propose LOIRE, a framework where PLMs grow their capacity using incremental data through a novel plug-in layer growth operator. Instead of modifying existing parameters, LOIRE adds new lightweight layers that are trained on incoming data while preserving the original model's knowledge. The growth operator determines when and where to add capacity based on the novelty of incoming data. LOIRE reduces computational expenses by 29.22% compared to full fine-tuning while achieving competitive or superior performance on continual learning benchmarks across multiple domains and languages.
158
+
159
+ ---
160
+
161
+ ### Paper 50
162
+ **Title:** On Large Language Model Continual Unlearning
163
+ **Venue:** ICLR 2025 | **Category:** [KE]
164
+ **Source:** arxiv:2407.10223
165
+
166
+ **Abstract:**
167
+ While large language models have demonstrated impressive performance across various domains and tasks, their security issues have become increasingly severe. Machine unlearning has emerged as a representative approach for model safety and security by removing the influence of undesired data on the target model. However, these methods do not sufficiently consider that unlearning requests in real-world scenarios are continuously emerging, especially in the context of LLMs, which may lead to accumulated model utility loss that eventually becomes unacceptable. Moreover, existing LLM unlearning methods often ignore previous data access limitations due to privacy concerns and copyright protection. Without previous data, the utility preservation during unlearning is much harder. To overcome these challenges, we propose the OOO framework that includes an Orthogonal low-rank adapter (LoRA) for continually unlearning requested data and an Out-Of-Distribution (OOD) detector to measure the similarity between input and unlearning data. The orthogonal LoRA achieves parameter disentanglement among continual unlearning requests. The OOD detector is trained with a novel contrastive entropy loss and utilizes a glocal-aware scoring mechanism. During inference, our OOO framework can decide whether and to what extent to load the unlearning LoRA based on the OOD detector's predicted similarity between the input and the unlearned knowledge. Notably, OOO's effectiveness does not rely on any retained data. We conducted extensive experiments on OOO and state-of-the-art LLM unlearning methods across three tasks and seven datasets. The results indicate that OOO consistently achieves the best unlearning effectiveness and utility preservation, especially when facing continuous unlearning requests.
168
+
169
+ ---
170
+
171
+ ### Paper 51
172
+ **Title:** SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning
173
+ **Venue:** ICLR 2025 (Oral) | **Category:** [LoRA]
174
+ **Source:** arxiv:2501.13198
175
+
176
+ **Abstract:**
177
+ Continual Learning (CL) with foundation models has recently emerged as a promising paradigm to exploit abundant knowledge acquired during pre-training for tackling sequential tasks. However, existing prompt-based and Low-Rank Adaptation-based (LoRA-based) methods often require expanding a prompt/LoRA pool or retaining samples of previous tasks, which poses significant scalability challenges as the number of tasks grows. To address these limitations, we propose Scalable Decoupled LoRA (SD-LoRA) for class incremental learning, which continually separates the learning of the magnitude and direction of LoRA components without rehearsal. Our empirical and theoretical analysis reveals that SD-LoRA tends to follow a low-loss trajectory and converges to an overlapping low-loss region for all learned tasks, resulting in an excellent stability-plasticity trade-off. Building upon these insights, we introduce two variants of SD-LoRA with further improved parameter efficiency. All parameters of SD-LoRAs can be end-to-end optimized for CL objectives. Meanwhile, they support efficient inference by allowing direct evaluation with the finally trained model, obviating the need for component selection. Extensive experiments across multiple CL benchmarks and foundation models consistently validate the effectiveness of SD-LoRA.
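To make the phrase "separates the learning of the magnitude and direction" concrete, the sketch below splits a single weight update into a scalar magnitude (its Frobenius norm) and a unit-norm direction. This only illustrates the decomposition itself; it is not the SD-LoRA training procedure, and the function names and toy matrix are assumptions of this example.

```python
import math


def split_magnitude_direction(matrix):
    """Split a matrix into its Frobenius norm and a unit-norm direction."""
    magnitude = math.sqrt(sum(x * x for row in matrix for x in row))
    if magnitude == 0.0:
        raise ValueError("a zero matrix has no direction")
    direction = [[x / magnitude for x in row] for row in matrix]
    return magnitude, direction


def recompose(magnitude, direction):
    """Rebuild a matrix from a scalar magnitude and a direction."""
    return [[magnitude * x for x in row] for row in direction]


# Hypothetical low-rank update (the product B @ A in LoRA notation).
update = [[3.0, 0.0], [0.0, 4.0]]
magnitude, direction = split_magnitude_direction(update)

# Rescaling the magnitude leaves the direction untouched.
rescaled = recompose(2.0 * magnitude, direction)
```

Once the two factors are separate, one can be adjusted while the other is held fixed, which is the kind of decoupling the abstract describes.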
178
+
179
+ ---
180
+
181
+ ### Paper 52
182
+ **Title:** Spurious Forgetting in Continual Learning of Language Models
183
+ **Venue:** ICLR 2025 | **Category:** [Analysis]
184
+ **Source:** arxiv:2501.13453
185
+
186
+ **Abstract:**
187
+ Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning: despite extensive training, models experience significant performance declines, raising questions about task alignment and underlying knowledge retention. This study first explores the concept of 'spurious forgetting', proposing that such performance drops often reflect a decline in task alignment rather than true knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fixes the bottom layers of the model, leading to substantial improvements in four continual learning scenarios. Our findings underscore the critical distinction between task alignment and knowledge retention, paving the way for more effective strategies in continual learning.
188
+
189
+ ---
190
+
191
+ ### Paper 53
192
+ **Title:** Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning
193
+ **Venue:** ICLR 2025 (Oral) | **Category:** [IT]
194
+ **Source:** arxiv:2502.11019
195
+
196
+ **Abstract:**
197
+ Catastrophic forgetting (CF) poses a significant challenge in machine learning, where a model forgets previously learned information upon learning new tasks. Despite the advanced capabilities of Large Language Models (LLMs), they continue to face challenges with CF during continual learning. The majority of existing research focuses on analyzing forgetting patterns through a singular training sequence, thereby overlooking the intricate effects that diverse tasks have on model behavior. Our study explores CF across various settings, discovering that model forgetting is influenced by both the specific training tasks and the models themselves. To this end, we interpret forgetting by examining the function vector (FV), a compact representation of functions in LLMs, offering a model-dependent indicator for the occurrence of CF. Through theoretical and empirical analyses, we demonstrate that CF in LLMs primarily stems from biases in function activation rather than the overwriting of task processing functions. Leveraging these insights, we propose a novel function vector guided training methodology, incorporating a regularization technique to stabilize the FV and mitigate forgetting. Empirical tests on four benchmarks confirm the effectiveness of our proposed training method, substantiating our theoretical framework concerning CF and model function dynamics.
198
+
199
+ ---
200
+
201
+ ### Paper 54
202
+ **Title:** C-CLIP: Multimodal Continual Learning for Vision-Language Model
203
+ **Venue:** ICLR 2025 | **Category:** [VL]
204
+ **Source:** OpenReview sb7qHFYwBc
205
+
206
+ **Abstract:**
207
+ Multimodal pre-trained models like CLIP need large image-text pairs for training but often struggle with domain-specific tasks. Since retraining with specialized and historical data incurs significant memory and time costs, it is important to continually learn new domains in the open world while preserving original performance. However, current continual learning research mainly focuses on single-modal scenarios, and the evaluation criteria are insufficient without considering image-text matching performance and the forgetting of zero-shot performance. This work introduces image-caption datasets from various domains and establishes a multimodal vision-language continual learning benchmark. Then, a novel framework named C-CLIP is proposed, which not only prevents forgetting but also enhances new task learning impressively. Comprehensive experiments demonstrate that our method has strong continual learning ability across different domain image-text datasets, and has little forgetting of the original capabilities of zero-shot prediction, significantly outperforming existing methods.
+
+ ---
+
+ ### Paper 55
+ **Title:** Adapt-∞: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection
+ **Venue:** ICLR 2025 | **Category:** [MM]
+ **Source:** OpenReview EwFJaXVePU
+
+ **Abstract:**
+ Visual instruction datasets from various distributors are released at different times and often contain a significant number of semantically redundant text-image pairs, depending on their task compositions (i.e., skills) or reference sources. This redundancy greatly limits the efficient deployment of continually adaptable multimodal large language models, hindering their ability to refine existing skills and acquire new competencies over time. To address this, we reframe the problem of lifelong Instruction Tuning (LiIT) via data selection, where the model automatically selects beneficial samples to learn from earlier and new datasets based on the current state of acquired knowledge in the model. Based on empirical analyses that show that selecting the best data subset using a static importance measure is often ineffective for multi-task datasets with evolving distributions, we propose a dynamic data selection framework that adapts its selection criteria as the model learns. Our approach achieves strong forward transfer across the continuum using only a fraction of the original datasets.
+
+ ---
+
+ ### Paper 56
+ **Title:** Vision and Language Synergy for Rehearsal Free Continual Learning
+ **Venue:** ICLR 2025 | **Category:** [VL]
+ **Source:** OpenReview 9aZ2ixiYGd
+
+ **Abstract:**
+ The prompt-based approach has demonstrated its success for continual learning problems. However, it still suffers from catastrophic forgetting due to inter-task vector similarity and unfitted new components of previously learned tasks. On the other hand, the language-guided approach falls short of its full potential due to minimal use of knowledge and minimal participation in the prompt tuning process. To correct this problem, we propose a novel prompt-based structure and algorithm that incorporate four key concepts: (1) language as input for prompt generation, (2) task-wise generators, (3) limiting the matching descriptors search space via soft task-id prediction, and (4) generated prompts as auxiliary data. Our experimental analysis shows the superiority of our method over existing SOTAs on the CIFAR100, ImageNet-R, and CUB datasets by significant margins, i.e., up to 30% final average accuracy, 24% cumulative average accuracy, 8% final forgetting measure, and 7% cumulative forgetting measure.
+
+ ---
+
+ ## CVPR 2025 (Papers 57–61)
+
+ ---
+
+ ### Paper 57
+ **Title:** Language Guided Concept Bottleneck Models for Interpretable Continual Learning
+ **Venue:** CVPR 2025 | **Category:** [VL]
+ **Source:** arxiv:2503.23283
+
+ **Abstract:**
+ Continual learning (CL) aims to enable learning systems to acquire new knowledge constantly without forgetting previously learned information. CL faces the challenge of mitigating catastrophic forgetting while maintaining interpretability across tasks. Most existing CL methods focus primarily on preserving learned knowledge to improve model performance. However, as new information is introduced, the interpretability of the learning process becomes crucial for understanding the evolving decision-making process, yet it is rarely explored. In this paper, we introduce a novel framework that integrates language-guided Concept Bottleneck Models (CBMs) to address both challenges. Our approach leverages the Concept Bottleneck Layer, aligning semantic consistency with CLIP models to learn human-understandable concepts that can generalize across tasks. By focusing on interpretable concepts, our method not only enhances the model's ability to retain knowledge over time but also provides transparent decision-making insights. We demonstrate the effectiveness of our approach by achieving superior performance on several datasets, outperforming state-of-the-art methods with an improvement of up to 3.06% in final average accuracy on ImageNet-subset. Additionally, we offer concept visualizations for model predictions, further advancing the understanding of interpretable continual learning.
+
+ ---
+
+ ### Paper 58
+ **Title:** AdaDARE-γ: Balancing Stability and Plasticity in Multi-modal LLMs through Efficient Adaptation
+ **Venue:** CVPR 2025 | **Category:** [MM]
+ **Source:** DOI:10.1109/cvpr52734.2025.01840
+
+ **Abstract:**
+ Adapting Multi-modal Large Language Models (MLLMs) to target tasks often suffers from catastrophic forgetting, where acquiring new task-specific knowledge compromises performance on pre-trained tasks. In this paper, we introduce AdaDARE-γ, an efficient approach that alleviates catastrophic forgetting by controllably injecting new task-specific knowledge through adaptive parameter selection from fine-tuned models without requiring retraining procedures. This approach consists of two key innovations: (1) an adaptive parameter selection mechanism that identifies and retains the most task-relevant parameters from fine-tuned models, and (2) a controlled task-specific information injection strategy that precisely balances the preservation of pre-trained knowledge with the acquisition of new capabilities. Theoretical analysis proves the optimality of our parameter selection strategy and establishes bounds for the task-specific information factor. Extensive experiments on InstructBLIP and LLaVA-1.5 across image captioning and visual question answering tasks demonstrate that AdaDARE-γ establishes state-of-the-art results in balancing model performance. Specifically, it maintains 98.2% of pre-training effectiveness on original tasks while achieving 98.7% of standard fine-tuning performance on target tasks.
+
+ ---
+
+ ### Paper 59
+ **Title:** Synthetic Data is an Elegant GIFT for Continual Vision-Language Models
+ **Venue:** CVPR 2025 | **Category:** [VL]
+ **Source:** arxiv:2503.04229
+
+ **Abstract:**
+ Pre-trained Vision-Language Models (VLMs) require Continual Learning (CL) to efficiently update their knowledge and adapt to various downstream tasks without retraining from scratch. However, for VLMs, in addition to the loss of knowledge previously learned from downstream tasks, pre-training knowledge is also corrupted during continual fine-tuning. This issue is exacerbated by the unavailability of original pre-training data, which degrades the VLM's generalization ability. In this paper, we propose GIFT, a novel continual fine-tuning approach that utilizes synthetic data to overcome catastrophic forgetting in VLMs. Taking advantage of recent advances in text-to-image synthesis, we employ a pre-trained diffusion model to recreate both pre-training and learned downstream task data. In this way, the VLM can revisit previous knowledge through distillation on matching diffusion-generated images and corresponding text prompts. Leveraging the broad distribution and high alignment between synthetic image-text pairs in the VLM's feature space, we propose a contrastive distillation loss along with an image-text alignment constraint. To further combat in-distribution overfitting and enhance distillation performance with a limited amount of generated data, we incorporate adaptive weight consolidation, utilizing Fisher information from these synthetic image-text pairs and achieving a better stability-plasticity balance. Extensive experiments demonstrate that our method consistently outperforms previous state-of-the-art approaches across various settings.
+
+ ---
+
+ ### Paper 60
+ **Title:** CL-LoRA: Continual Low-Rank Adaptation for Rehearsal-Free Class-Incremental Learning
+ **Venue:** CVPR 2025 | **Category:** [LoRA]
+ **Source:** arxiv:2505.24816
+
+ **Abstract:**
+ Class-Incremental Learning (CIL) aims to learn new classes sequentially while retaining the knowledge of previously learned classes. Recently, pre-trained models (PTMs) combined with parameter-efficient fine-tuning (PEFT) have shown remarkable performance in rehearsal-free CIL without requiring exemplars from previous tasks. However, existing adapter-based methods, which incorporate lightweight learnable modules into PTMs for CIL, create new adapters for each new task, leading to both parameter redundancy and failure to leverage shared knowledge across tasks. In this work, we propose ContinuaL Low-Rank Adaptation (CL-LoRA), which introduces a novel dual-adapter architecture combining task-shared adapters to learn cross-task knowledge and task-specific adapters to capture unique features of each new task. Specifically, the shared adapters utilize random orthogonal matrices and leverage knowledge distillation with gradient reassignment to preserve essential shared knowledge. In addition, we introduce learnable block-wise weights for task-specific adapters, which mitigate inter-task interference while maintaining the model's plasticity. We demonstrate CL-LoRA consistently achieves promising performance under multiple benchmarks with reduced training and inference computation, establishing a more efficient and scalable paradigm for continual learning with pre-trained models.
+
+ ---
+
+ ### Paper 61
+ **Title:** LoRA Subtraction for Drift-Resistant Space in Exemplar-Free Continual Learning
+ **Venue:** CVPR 2025 | **Category:** [LoRA]
+ **Source:** arxiv:2503.18985
+
+ **Abstract:**
+ In continual learning (CL), catastrophic forgetting often arises due to feature drift. This challenge is particularly prominent in the exemplar-free continual learning (EFCL) setting, where samples from previous tasks cannot be retained, making it difficult to preserve prior knowledge. To address this issue, some EFCL methods aim to identify feature spaces that minimize the impact on previous tasks while accommodating new ones. However, they rely on static features or outdated statistics stored from old tasks, which prevents them from capturing the dynamic evolution of the feature space in CL, leading to performance degradation over time. In this paper, we introduce the Drift-Resistant Space (DRS), which effectively handles feature drifts without requiring explicit feature modeling or the storage of previous tasks. A novel parameter-efficient fine-tuning approach called Low-Rank Adaptation Subtraction (LoRA-) is proposed to develop the DRS. This method subtracts the LoRA weights of old tasks from the initial pre-trained weight before processing new task data to establish the DRS for model training. Therefore, LoRA- enhances stability, improves efficiency, and simplifies implementation. Furthermore, stabilizing feature drifts allows for better plasticity by learning with a triplet loss. Our method consistently achieves state-of-the-art results, especially for long task sequences, across multiple datasets.
simple_idea.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc880fdc09869b80054a5ed4209abd437ada0a86d41d84a7fd8d1a8c1d8ab0a8
+ size 1304